Saturday, August 24, 2013

Ethiopian telecom needs to wake up

It is time the Ethiopian government started to free up the telecom industry. Communication is at the core of the 21st century economy, and Ethiopia is lagging badly. 25% mobile penetration and less than 4% Internet penetration (and terrible service at that) prevent high-potential economic advances from taking root in the country. Technology-based investment and innovation are suffering from the lack of affordable, reliable, high speed communication in Ethiopia. The government is being short-sighted in holding telecommunications in a state monopoly, whether to crack down on free speech or to derive state revenue. Ethiopia's telecom is Africa's *only* state monopoly, and in the bottom five in cellular coverage. The state of Internet connectivity is even graver, and is *actively*, *negatively* holding the economy back.

In my field alone, I have met many bright, high-potential students and professionals who I know could accomplish amazing things if they had access to affordable, high speed communication. My company would find it much easier to open an office in Ethiopia if there were access to affordable, high speed communication. In a few short years, Ethiopia could significantly increase the wages earned by technology workers and create high quality jobs if there were access to affordable, high speed communication. The government *can't do this* by itself, and needs to open up the industry to the private sector. Now is our opportunity to seize the technology boom in Africa and build the foundation for a global working class based in Ethiopia, which is only possible through affordable, high speed communication. Let's not sleep through this opportunity.

Thursday, January 10, 2013

Heroku always downloading git repositories?

At Meritful, we use Heroku a lot.

One small annoyance has been adding requirements from git repositories. The problem was that Heroku kept re-downloading the repository every time we pushed. We initially had our requirements with the edit (-e) flag, which grabs the source, and added the egg name at the end, but to no avail.

In the end, we specified the exact commit and removed the edit flag, pinning the requirement to a specific commit of the repository.
And now the build process no longer re-downloads the repository on every push (with the edit flag, it kept re-fetching because it could not find the source locally).
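For illustration, a pinned requirement in requirements.txt looks something like this; the repository URL, commit SHA, and egg name below are placeholders, not our actual dependency:

```
git+https://github.com/example/somelib.git@1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b#egg=somelib
```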

Wednesday, September 5, 2012

Asynchronous Tasks: A Celery with Django Tutorial

Some tasks in web development are meant to be asynchronous. For example, when a user posts an item, we might need to broadcast it on Facebook and Twitter, and send mail to followers.

One way to do this is to call the appropriate post_facebook and post_twitter functions in the Django view for posting the item. But calls that involve HTTP requests to other sites can be slow, and during that time, the page rendered from the view will just sit loading.

This is where asynchronous tasks come into play. These tasks do not need to be done while the user waits, and it would be nice to run them asynchronously. That is what Celery allows you to do.

Celery is a Python library that works nicely with Django to enable async task processing. The general concept to make everything happen is:
  1. You need tasks, which are the units of execution that need to be performed
  2. You need a notion of a 'queue', which is where your tasks are stored while they wait to be performed
  3. You need to add tasks to the queue, which lets workers know what tasks to run
  4. You need workers, which take tasks from the queue and execute them
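The pieces above can be sketched with plain Python's queue module. This is a toy illustration of the task/worker/queue model, not how Celery itself is implemented:

```python
import queue
import threading

task_queue = queue.Queue()  # where tasks wait to be performed

def worker():
    # a worker takes tasks off the queue and executes them
    while True:
        func, args = task_queue.get()
        func(*args)
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

results = []
task_queue.put((results.append, (42,)))  # enqueue a task
task_queue.join()                        # wait until the worker has run it
print(results)  # [42]
```

Celery's job is to give you this same model, but with the queue in a separate broker process and the workers on any machine that can reach it.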

Getting your message queue

Before we start working with celery, we need to have this queue in place. Celery can use a number of different queues, including rabbitmq, redis, or even the database backend. After all, the underlying function of a queue is to store tasks and hand them to workers. That said, you should not use the database-backed queue in production, because workers poll it frequently and the extra load adds up. There are several protocols for the queue to communicate with workers, the most popular being AMQP; basically, your queue and your workers need to speak the same protocol to understand each other.

RabbitMQ is the industry standard. It is performant and highly configurable, and it is written in Erlang. Your initial step is to get rabbitmq installed. It is available in the Ubuntu repositories as rabbitmq-server. You can find more installation help here.

You can run rabbitmq-server using invoke-rc.d, as described on the installation page. Once it's running, there is a command line tool called rabbitmqctl which helps you interact with the server.

RabbitMQ has a notion of access control, which describes who is allowed to read from your queue. For that, it needs a user and a password. It also has namespaces for queues, known as vhosts, with the default being '/'. A fresh install of RabbitMQ has a user named guest with password guest. You might want to change it and add your own, using the rabbitmqctl command (always run it with sudo).
You can do the following to create a new user and give it permissions on the default vhost:

$ sudo rabbitmqctl add_user myuser mypassword
$ sudo rabbitmqctl set_permissions -p / myuser ".*" ".*" ".*"   (NOTE: / is the vhost name) 
(you can also create a new vhost, and then assign permissions to that vhost instead of to / )

At this point, you should have rabbitmq-server running, with a new user. You can check that it is running using 'sudo rabbitmqctl status'.
If you want to learn more about other options for brokers, check out the Celery brokers documentation.

Celery time

With a message queue in place, the next step is to get celery going.
To install celery for Django, you pip install a package known as django-celery, which depends on the underlying celery library but adds additional Django-specific tools. The key steps are:

1. pip install django-celery
2. add the following to settings.py:

import djcelery
djcelery.setup_loader()

3. add djcelery to INSTALLED_APPS

4. Create the db tables using python manage.py syncdb (or South)

You can find more details in the django-celery documentation. You then need to tell celery where your broker (queue) is located, and that is done using the BROKER_URL setting, which goes in settings.py. In our case above, the BROKER_URL will be:

BROKER_URL = 'amqp://myuser:mypassword@localhost:5672//' 
Note that here, we are running rabbit on the same machine as celery, and using localhost to find it. Also notice the double slash at the end: the second slash is the vhost ('/'). You could also break this setting into BROKER_USER, BROKER_PASSWORD, etc., but the above is shorter and sweeter.

Defining tasks

With the infrastructure in place, the next step is to define some tasks. By convention, tasks for an app are defined in app/tasks.py. Then, when you start a worker, it goes through all the installed apps and imports the tasks defined in the tasks.py of any app that has one.

A task is a normal Python function, except it is decorated as follows:

import time
from celery import task

@task
def celery_add_task(x, y):
    time.sleep(5)  # simulate a slow operation
    return x + y
There are actually a lot of configuration options for tasks. For example, ignore_result (as in @task(ignore_result=True)) tells celery that you don't care about the result, so it does not need to keep state. Normally, you do this because the effect of your task is reflected in state elsewhere. A full list of task options can be found in the Celery documentation.

Calling your tasks

Now we have a trivial task defined that knows how to take some inputs and add them. Whenever you need to add two things, and this does not need to happen immediately, you can assign this task to a worker. The time.sleep here is to show you that this function can take a long time to run, and you don't want to be waiting on it in your view. We now need to add this task to the queue as needed, with the appropriate arguments.

Let's say you are in the views.py of that app. To add a task to the queue, you do the following:

from app.tasks import celery_add_task

def some_func(request):
    # you do your normal view work here

    # then you add the relevant task to the queue:
    celery_add_task.apply_async((2, 3))

apply_async is one way to invoke the task. The nice thing about it is that you can pass a number of other parameters that describe how the task is added. For example, if you have a task T:

    T.apply_async((arg, ), {'kwarg': value})

    T.apply_async(countdown=10)  #executes 10 seconds from now.

    T.apply_async(eta=now + timedelta(seconds=10)) #executes 10 seconds from now, specified using eta

    T.apply_async(countdown=60, expires=120)  #executes in one minute from now, but expires after 2 minutes.

    T.apply_async(expires=now + timedelta(days=2)) #expires in 2 days, set using datetime. 
There is also a shortcut function, which calls the simplest form of apply_async more elegantly:

    T.delay(arg, kwarg=value)   #always a shortcut to .apply_async. 
What this does is give you the same semantics as calling the task directly. You can learn more about calling tasks in the Celery documentation.

Time for fireworks

We are all set now. We just need to sit back and watch the fireworks (or fix a couple of things if needed). First, you will need to get the celery workers going, which is easy because of the djcelery app in your installed apps. It gives you a management command for it:

python manage.py celery worker --loglevel=info
This command can take a number of switches. For example, -c 2 changes the number of concurrent worker processes to 2 (the default is the number of CPU cores).

Often, you will want to run the workers in the background as daemons; see the Celery daemonization documentation. You can use the following command to find out more about available commands:

python manage.py celery help 
When you start the worker, if you see connection errors, first check that rabbitmq-server is running. Then check that the user and password in the BROKER_URL match what you configured in rabbitmq, and finally make sure the vhost is correct. The vhost must be exactly as you defined it: for example, if you didn't add a trailing slash when you created it, don't include one in the URL.

Starting the worker should discover all the tasks defined in your installed apps, show you a list of them, and start listening.

At this point, you can navigate to the page that triggers the view which adds the task (some_func as we defined it above). This should enqueue the task, and you will see the worker getting the task from rabbitmq and running it. There you have it: you can now add a bunch of asynchronous tasks to your apps.

Periodic (non-user-triggered) tasks

Some tasks are not user triggered, and need to happen on a schedule. For example, you might recompute a leaderboard of some sort every 15 minutes. For those, you need periodic tasks. In that case, you set a variable named CELERYBEAT_SCHEDULE in settings.py, which schedules the tasks as you need them. You can use intervals or more cron-like schedules; see the Celery periodic tasks documentation for details.

Here is an example CELERYBEAT_SCHEDULE that points to a task using a timedelta schedule:

from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'update_search_index_every_15min': {
        'task': 'msearch.tasks.update_index_task',
        'schedule': timedelta(seconds=900),
    },
}
You then need something to read this configuration and submit the jobs to the queue as the schedule comes up. The scheduler (known as celery beat) can be started by itself, as 'celery beat', but it can also be started embedded in the workers themselves:

python manage.py celery worker -B
Here, the -B switch starts the beat, and your periodic tasks will be added to the queue as needed.


This post looked at the key steps in getting Celery for Django to work. These include getting a message queue in place, installing and configuring celery, defining tasks, calling tasks, and starting workers with user-triggered and periodic tasks. Now go make your apps snappier.

Sunday, July 1, 2012

It always seems impossible until it's done

Earlier today, I got reminded of one of my favorite quotes, by Nelson Mandela:

"it always seems impossible until it's done." 

I have liked it for a while, and always thought it was exactly what folks need when things are tough.

However, considering the context of the quote, things are a little different. Let's do a bit of history. Mandela was no ordinary man. He was a lawyer who set up shop to provide free or low-cost counsel to many of his fellow South Africans who lacked attorney representation. When he was arrested in 1962, he had already been fighting apartheid for nearly 15 years. And he would spend 27 years in prison to this end, 18 of them on a small island called Robben Island.

Of course he would say everything is possible.

He was a man completely committed to his cause. He laid it out in his statement from the dock at his trial: 

"During my lifetime I have dedicated myself to the struggle of the African people. I have fought against white domination, and I have fought against black domination. I have cherished the ideal of a democratic and free society in which all persons live together in harmony and with equal opportunities. It is an ideal which I hope to live for and to achieve. But if needs be, it is an ideal for which I am prepared to die." 

He was sentenced to life in prison.

Yes, maybe it always seems impossible until it is done. But that seems to depend on how committed you are to the cause.

Wednesday, September 7, 2011

1and1 customer service asks for your password over the phone

As if their technical glitches were not enough, as if their terrible tech support, with representatives that hardly say anything that makes sense, were not enough, the customer service people at 1and1 hosting actually ask you for your account password over the phone.

Seriously, should that even be legal?

Monday, August 15, 2011

Personalization in web access: developing regions

Cellular services now cover 90% of the world population. This does not mean the penetration will get there sometime soon, but the access exists. Along with the boom in wireless technology in developing regions, data services are also finding their way to first time internet users. Most of the next billion internet citizens will come from developing regions, perhaps accessing the internet for the first time on their mobile phone.

A related area of research I have been following (and working in) is web acceleration. Even as data access increases, the capacity in many of these regions is often over-subscribed, and the experience can be quite frustrating. I have spent a few months in Ethiopia over the past few years, and checking your email in an internet kiosk was often an affair that took a good portion of your afternoon.

Web acceleration attempts to help with this problem using a number of mechanisms: prefetching, caching, time-shifting access, etc. I have long believed that these mechanisms can benefit from a deeper understanding of individual data access patterns. One of our previous papers reports on a system we built that allows individuals to use their mobile phones to give hints to internet kiosk systems (à la 'I will be there in half an hour'), which allowed the system to do targeted and efficient prefetching ahead of the user's arrival. We deployed the system in Ethiopia, and it saved an order of magnitude in browsing time. Scaling this approach, however, requires it to be automated. We need to understand how people access data, and build web acceleration mechanisms that cater to individual web access patterns.

When I got to Microsoft Research in Bangalore in the winter, one of the ideas I wanted to explore was understanding just how personalized web usage is in emerging market scenarios. Far too many web acceleration mechanisms are built on the assumption that people access similar things, and generic caching and prefetching can usually help. However, intuition suggests otherwise.

Web usage is becoming largely personal. The most popular internet destinations increasingly seem to be those that provide individualized experiences to their users. On the other hand, even traditionally 'static' content such as news is increasingly localized and personalized. As the amount of information available on the web grows exponentially, personalization is a welcome trend that allows users to focus on what is relevant to them. Just as important, the sheer volume of content available online enables users to choose from a diverse set of services that cater to their personal preferences and interests.

With local partners in Bangalore, we set out to collect the first large scale, personalized web usage dataset in developing countries. While several datasets with aggregate information about web usage patterns in developing countries exist, none of them provides an individualized look into personal web usage. Our data set was collected at two sites over a period of one month, and contains web usage information for about 470 users segmented across several sessions. This data has been anonymized to remove personally identifying information, and is available for researchers to access.

Our analysis suggested many interesting patterns in data access, and points to the need for a personalized approach to web acceleration mechanisms. For example, the overwhelming majority of data accessed at the shared access sites tends to be personal or only personally interesting; the latter refers to content that only a very small portion of users ever request. For example, requests made by at least 10% of the users accounted for less than 2% of the total requests.

On the other hand, data access is very self-similar: if you could attach identity to an access session, you could predict the majority of content that will be accessed in the session. The figure shows a complementary CDF of Jaccard indexes for two parameters, one weighted by the amount of data each domain generated, and the other by the amount of time users spent on each domain. These metrics roughly correlate with each other, and 40% of our users had a Jaccard index of at least 0.5. This is significant because it indicates the potential for a personalized web acceleration system to model and understand its users, improving its performance.
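For reference, the Jaccard index between two sets is the size of their intersection over the size of their union. A minimal sketch in Python, comparing the domains a user visited across two sessions (the domain names are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# a hypothetical user's visited domains across two sessions
session_1 = {"news.example.com", "mail.example.com", "social.example.com"}
session_2 = {"mail.example.com", "social.example.com", "video.example.com"}

print(jaccard(session_1, session_2))  # 2 shared / 4 total = 0.5
```

A user with an index of 0.5 revisits half of their combined domain footprint from session to session, which is exactly the self-similarity a personalized prefetcher can exploit.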

You can read our analysis in its entirety in our paper published at the ACM Networked Systems for Developing Regions (NSDR 2011) workshop. If you would like to further analyze the dataset, you can read about how to access it here.

Tuesday, August 9, 2011

When teachers don't show up

Attendance is a sticky problem, especially when the teacher is the one missing classes. During my elementary school days in Ethiopia, our principal had pretty much solved the problem. By widely accepted convention, teachers were allowed to whip their students when necessary. If you told this to your parents, the best you could expect was 'I am sure you deserved it.' The worst? They might ask what you did, and give you some more whipping for it. So, in those days, being late meant a painful morning, and almost everybody was on time unless there was a serious problem.

Fast forward two decades to Rajasthan, India. A non-profit organization called Seva Mandir was having an interesting problem with attendance. Seva Mandir runs small 'schools' in over 100 small villages in northern India. The goal is to educate kids in grades 1-4 near their village until they are old enough to walk to the government school, perhaps in the next village. Since the number of kids is usually less than 30 or so, they often have only one or two teachers per 'school'. And their attendance problem was not with the kids; rather, it was the teachers who were not showing up. Quite often.

They had a clever solution for this: equip every school with a camera, and require teachers to take a picture of themselves with the students, one in the morning and one in the afternoon. This goes on every day of the week, and at the end of the month, someone from the school ferries the memory card of the camera to the headquarters in Udaipur and grabs a fresh one. At the headquarters, four people went through every picture taken in every one of their schools, and used the timestamps to make sure the teachers had a shot for every point. Their task was slightly simplified by help from MIT, who built them software that classified pictures by date. Still, employees had to go through nearly 20,000 pictures every pay period, and this was taking over 100 man-hours. In addition, for an even slightly motivated teacher, beating the system meant simply altering the EXIF data on the pictures.

When I got to Microsoft Research India in the winter of 2011, I heard about this organization through Saurabh Panjwani, who had done a field visit up north in the fall, and it struck me as an interesting problem. So, one of my projects while in Bangalore was to design a simpler system that would streamline attendance tracking. An attendance system has to do three things: verify identity, verify location, and verify time.

Enter Hyke: an attendance tracking system designed for these environments. As you might have heard, mobile phones in India have taken off, and cellular coverage was available to over 80% of the Seva Mandir schools. Hyke uses mobile phones as the platform of authentication, and builds on open-source voice biometrics technology in combination with off-the-shelf location tagging tools. At a very basic level, when the teacher initiates the system for attendance recording, we first obtain a fresh location reading (either through GPS or a nearby RFID tag). This is followed by the generation of a one-time passcode given to the teacher. The teacher then calls an attendance hotline and reads out the passcode. Using speaker recognition to identify the user and speech recognition to verify passcode freshness, the attendance is recorded.
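The one-time-passcode step above can be sketched as follows. This is a toy illustration, not Hyke's implementation; the digit length and freshness window are assumptions made for the example:

```python
import time
import secrets

def generate_passcode(n_digits=6):
    """Generate a fresh numeric passcode and record when it was issued."""
    code = "".join(str(secrets.randbelow(10)) for _ in range(n_digits))
    return code, time.time()

def passcode_is_fresh(issued_at, max_age_seconds=300):
    """A passcode read back over the hotline is only accepted while fresh."""
    return (time.time() - issued_at) <= max_age_seconds

# server issues a code tied to a verified location reading...
code, issued = generate_passcode()
# ...and later, when the teacher reads it back, checks it hasn't expired
print(len(code), passcode_is_fresh(issued))
```

Tying the code's issuance to a fresh location reading, and its acceptance to a short expiry window, is what lets the server vouch for location and time while the voice biometric vouches for identity.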

Hyke has several advantages over prior systems, and particularly targets environments such as Seva Mandir's. Besides reducing cost of operation, it offers the possibility of doing attendance tracking without the presence of a trusted administrative staff on-site---both location and timestamp information for attendance records are generated automatically. Another advantage lies in its utilization of voice as a user biometric---voice is generally regarded as a less invasive and privacy-sensitive form of biometric than fingerprints or pictures. Hyke uses widely available cellular networks with voice and SMS channels for communication.

Most of my time was spent working with Mistral, a voice biometrics stack from Avignon University in France. Mistral is a set of tools, mostly written in C++, that lets you do feature vector comparisons on biometric data. It was rather hard to use, with strong assumptions that its users are experts in the area. So I decided to build a Python wrapper around Mistral that would make it easy for mere mortals like me to simply drop it into their projects. I also incorporated SPro, a tool for converting voice recordings into feature vectors that can be processed by libraries like Mistral. The Python wrapper is open-sourced and is available here.

Evaluating the biometric stack was an interesting problem in its own right. We needed a range of voices collected over the telephone, preferably from the target population. So I built an IVR system for collecting voice data, allowing users to call a local number from their phone and record voices. We then posted tasks on Amazon's Mechanical Turk, limiting participation to workers located in India. We provided workers with the local phone number and a set of lines to read out to the IVR. The lines to be read consisted of randomly generated digits of various lengths. Since Hyke needs to verify identity based on text-independent, limited-vocabulary (digit) passcodes, our data collection also focused on this segment. Our experiments with Indian speakers, using audio collected over the telephone, showed error rates of less than 5%, providing sufficient accuracy for most applications.

Another interesting part of designing the system was thinking through the security implications of tracking attendance in the absence of on-site administrative staff, or principals intent on whipping latecomers. Some threats to the model include conference calling, pre-recorded voice samples, and replacing the location tagging component. There are several mechanisms built into Hyke to prevent these attack vectors. For example, passcodes have to be freshly generated by the server and delivered to the user over SMS, and location readings from the designated attendance phone are verified when generating these codes. A paper describing this work was published in the 5th annual Networked Systems for Developing Regions workshop (NSDR 12), and you can read it in its entirety here.