I have worked on big data in my work with QuBit in London. In my research I increasingly find the tools I learnt there to be extremely useful. The keywords are smart data management for big data, such as hadoop and hive for querying just the right set of data to work with. I am currently working on Mobile Phone usage data provided by Orange Senegal as part of their data for development challenge. The scope for using big data for development is immense. Many African countries have seen extremely rapid urbanization, essentially moving from an agrarian economy straight into a service sector economy. Mobile technology is being developed in tech hubs across Africa, be it in Dar Es Salaam, Nairobi or in Dakar. Mobile money is a revolution in itself.
Development planning in Africa can not rely on tools used in developed countries: there are for example, no sophisticated traffic monitoring systems in place to help planners keep track of traffic flows, dispatch police units to congested spots. Mobile phone usage data may turn out to be an incredibly useful tool for development planners as it theoretically allows an analysis of human mobility along the mobile phone mast network. This allows planners to get a sense of population density and how population moves within an urban setting. This may help in development planning, but can turn out to be useful on many other dimensions. For example, epidemiological models of disease may make find in mobile phone data a useful input to model spread of diseases.
In any case, there is little economics literature that has worked with mobile phone use data so far. The two constraints are, first, getting the data in the first place and second, how to work with true “big data”. I have submitted a proposal for the Orange Data for Development Challenge Senegal and obtained the mobile phone Call Detail Records data. This is data collected by mobile phone companies for billing purposes and basically records, for every user, when and where (at which mobile phone tower) they have used their phone, either by receiving or making a call or by receiving or sending a text message. The result is three datasets made available to researchers. The datasets are based on Call Detail Records (CDR) of phone calls and text exchanges between more than 9 million of Orange’s customers in Senegal between January 1, 2013 to December 31, 2013. The datasets are: (1) antenna-to-antenna traffic for 1666 antennas on an hourly basis, (2) fine-grained mobility data on a rolling 2-week basis for a year with bandicoot behavioral indicators at individual level for about 300,000 randomly sampled users, (3) one year of coarse-grained mobility data at arrondissement level with bandicoot behavioral indicators at individual level for about 150,000 randomly sampled users.
I want to fine tune into the Dakar region. The capital is capturing most of mobile phone users and its most interesting to study mobility there. This constraints the focus on the fine-grained mobility data for rolling 2 week time windows. Just to get a sense of the type of data we are talking about. Suppose that every of the 300,000 users received was active in every hour of the time window, so that it is a truely balanced hourly panel, the resulting dataset would have 100.8 million rows. It is not possible to hold such amounts of data in the working memory on a conventional computer running Stata. R is a lot more capable to work with big data.
First, R provides the necessary architecture to run certain routines for efficient data processing directly using C code. This ensures that big R data objects are not moved around in the working memory, but that operations are done using the “physical storage” address of the data using C pointers. Most prominently the R package “dplyr” developed by Hadley Wickham is extremely useful as it allows running operations on these very huge data frames in just a few seconds.
I just wanted highlight here the approach I have taken so far in converting the unwieldy data objects into something useable. The idea is first, to zoom in to the Dakar region. This part is only a small share when it comes to comparing with the size of the country, but it absorbs most of the mobile phone users that are sampled. Out of 300,000 sampled users, on average 120,000 or 40% are located in the Dakar region.
The accumulation of users in the Dakar region implies that there the mobile phone network is quite dense. In total there are 1666 antennas reported, but 489, or 29.3% of these antennas are clustered in the Dakar region.
The idea will be to construct measures of mobility at a spatial resolution. A user will only be registered in the antenna that he or she is locked in. Some antennas are extremely close to one another. In order to get a continous measure across space I compute a grid and associate grid cells with mobile phone masts. The above picture is of a 0.01 degree grid (roughly 1 km since we are not far from the equator). Leaving out the Bambilor arrondisement in Rufisque department (the big and sparse chunk on the right), almost every grid cell contains a mobile phone mast.
The first approach I have taken is to get a sense of population density: where are people usually located in the morning, evening and afternoon hours. The idea is that, especially, the location in the evening hours is the place in which individuals live (spend the night so to say). The location during the day may reflect the location where individuals are working. To highlight how this looks, consider the picture below. This plots out the average evening location of individuals in the central capital region. This information is displayed in two ways. First, the grid cell coloring is related to the number of people inside that grid cell, on average, for the evening hours in a two week time window. Second, the red dots indicate individual locations. If people were only logged in to one mobile phone mast every evening, then this would show up as a cluster of points around the mobile phone mast. In this case, the information at the grid level is useful. The fact that we see many individual red dots suggests that also in the evening, there is some mobility even if it is only within a grid cell.
We can construct crude average morning, afternoon and evening locations. This then allows a construction of a measure of average mobility during the day time in a two week time window. The distances travelled can be visually plotted.
The figure below is a first attempt. This is obviously ongoing research…