# Predicting the Locations of ‘Emergency’ Ushahidi Reports in Port-au-Prince, and Implications for Crowdsourcing

February 2, 2010

(This article was first published on Zero Intelligence Agents » R, and kindly contributed to R-bloggers)

Recently, Patrick Meier, PhD candidate at Tufts University and member of the Ushahidi Advisory Board, provided me with a dataset containing the first 72 hours of reports registered with Ushahidi in Port-au-Prince after the January 12th earthquake. First, a huge thank you to Patrick for providing me with this data and the opportunity to analyze it. If you are unfamiliar with Ushahidi, check out their site dedicated to the Haiti deployment; I believe this technology has great potential for social science research.

The data are quite interesting, with each report including precise longitude and latitude information, date, time, type of report, and a description of the incident. While all of this information is fascinating, the report type acts as an excellent categorical variable and is thus a natural starting point for analysis. Ushahidi defines six general report categories: (1) emergency, (2) threats, (3) vital lines, (4) response, (5) other, and (6) persons news. For every longitude and latitude pair there are often multiple reports of each type observed; therefore, it would be useful to know the count of each type of report in each locale. Once this information is aggregated, it may be possible to generate predictions for the probability of observing various report types in a given location in Port-au-Prince.

First, using the magical data munging power of R, I created a new dataset containing these counts in the form below:

       long      lat cat1 cat2 cat3 cat4 cat5 cat6 total
1 -72.75553 18.18552    1    0    0    0    0    0     1
2 -73.74829 18.19036    0    0    0    2    1    2     5
3 -73.74985 18.19413    1    0    0    0    0    0     1
4 -73.75627 18.19957    0    0    0    1    0    0     1
5 -73.75000 18.20000    0    0    0    0    1    0     1
6 -72.53472 18.23417    4    0    0    0    0    1     5
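A table of this shape could be produced in several ways. The following is a minimal base R sketch, not the code used for the analysis; it assumes a raw data frame with one row per report, and the column names (`long`, `lat`, `category`) and toy values are made up for illustration.

```r
# Toy stand-in for the raw reports: one row per report.
# Column names and values are hypothetical, not taken from the Ushahidi data.
reports <- data.frame(
  long     = c(-72.53472, -72.53472, -72.53472, -73.74829, -73.74829),
  lat      = c( 18.23417,  18.23417,  18.23417,  18.19036,  18.19036),
  category = c(1, 1, 6, 4, 5)
)

# One 0/1 indicator column per category, so categories that never
# occur at a location still show up as zero counts.
indicators <- as.data.frame(
  sapply(1:6, function(k) as.integer(reports$category == k))
)
names(indicators) <- paste0("cat", 1:6)

# Sum the indicators within each unique longitude/latitude pair.
counts <- aggregate(
  indicators,
  by  = list(long = reports$long, lat = reports$lat),
  FUN = sum
)

# Total number of reports at each location.
counts$total <- rowSums(counts[, paste0("cat", 1:6)])

# Order by latitude, then longitude, for a stable display.
counts <- counts[order(counts$lat, counts$long), ]
print(counts)
```

Building explicit indicator columns before aggregating keeps the longitude and latitude as numeric columns, which is convenient if the result is later handed to a spatial modelling function.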

Next, to generate predictions we will need to specify a spatial model that assumes a distribution appropriate for the event count data above. Unlike the autoregressive spatial models previously discussed, the generalized linear spatial model proposed by Christensen and Ribeiro, implemented in R in the geoRglm package, can treat the counts as Poisson distributed.
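For readers who want a picture of the model's structure, a Poisson generalized linear spatial model is commonly written along the following lines. The notation is mine, and the log link and exponential correlation function are assumptions made for illustration, not necessarily the choices used in this analysis.

$$
Y_i \mid S \sim \mathrm{Poisson}(\mu_i), \qquad \log \mu_i = \beta + S(x_i), \qquad \mathrm{Cov}\big(S(x_i), S(x_j)\big) = \sigma^2 \exp\!\left(-\frac{\lVert x_i - x_j \rVert}{\phi}\right)
$$

Here $Y_i$ is the count of reports at location $x_i$, $S$ is a latent Gaussian process with mean zero, $\sigma^2$ is its variance, and $\phi$ controls how quickly spatial correlation decays with distance.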

The process for estimating the probabilities begins by laying an imaginary grid over Port-au-Prince. Then, via Markov chain Monte Carlo simulation, we use the observed occurrences of Category 1 Ushahidi reports to predict the probability of observing these reports for every cell in the grid. For brevity, I will not discuss the mathematical assumptions of this model (for an introduction read the geoRglm vignette), but below I include the final few lines of R code used to generate these predictions.
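To make the idea of the grid concrete, here is a small base R sketch of how a regular set of prediction locations might be built. It is an illustration only: the bounding box and the grid resolution are made-up values, not those used in the analysis.

```r
# Hypothetical bounding box (illustrative values only).
long_range <- c(-72.40, -72.25)
lat_range  <- c(18.50, 18.60)

# Number of grid points along each axis (an arbitrary choice).
n_points <- 20

# Every combination of longitude and latitude: one row per grid point.
grid <- expand.grid(
  long = seq(long_range[1], long_range[2], length.out = n_points),
  lat  = seq(lat_range[1],  lat_range[2],  length.out = n_points)
)

dim(grid)
head(grid)
```

A two-column set of coordinates like this is the kind of object that would then be supplied to the model as the locations at which predictions are wanted. A finer grid gives a smoother map at the cost of more computation.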

UPDATE: I have added an R file containing the entire data cleaning and analysis, along with the Ushahidi dataset, to the ZIA Code Repository for anyone interested in replicating the analysis.

Note that this process requires some tuning as well as several user-supplied values, such as the covariance pairs and the initial $\phi$ value for the Bayesian prior. Once these values have been supplied, however, the resulting analysis is quite interesting.

The above figure depicts the predicted probabilities for each grid cell generated by the GLM-MCMC process as a choropleth, where darker regions indicate a higher probability of an emergency report. Also included in this figure are all of the places (longitude and latitude pairs) where Ushahidi reports were observed, depicted as dark points over the choropleth. This analysis may indicate some interesting aspects of the dynamics of crisis areas, as well as the effect of Ushahidi itself.

The most striking observation is that locations with the highest concentration of Ushahidi reports also have the lowest predicted probability of emergency reports. Instead, places where emergency reports are most likely are sparsely dispersed throughout the grid. This is counterintuitive, as we might expect that a high concentration of reports in one place would be precipitated by several emergency reports, i.e., an emergency occurs, which leads to follow-on reports, etc.

There are at least two possible explanations for this. The first is that emergency responders (within the first 72 hours) were poorly allocated, since most emergency reports occur in isolated areas. This, however, seems unlikely given the sheer number of responders present in Port-au-Prince during this period. Alternatively, it may indicate a weakness in the crowd-sourced reporting in this instance, since from these data we would conclude that in the worst areas, where there are the most reports, no emergencies are reported. The question, then, is: how accurately do the Ushahidi reports reflect the reality of the crisis?

This is a fundamental issue with using this data for rigorous analysis, as there are clearly several dynamics contributing to the data generating process. I look forward to exploring this data further in the future, and incorporating more of the covariates provided in the dataset.
