How Statistics lifts the fog of war in Syria

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

In June 2013, the conflict between opposition and government forces around the Syrian city of Aleppo intensified. Rockets struck residential districts, and car bombs exploded near key facilities.

Many people died. But as is common in conflict areas, the reports of the number of dead varied by the source of the information. While some agencies reported a surge in casualties in the Aleppo area around June 2013, others did not.

Figure: Reported number of deaths in Aleppo, Syria, in mid-2013 from four different agencies. Note the spike in deaths reported by two of the four agencies in June 2013. Chart by Megan Price, HRDAG.


The true number of casualties in conflicts like the Syrian war seems unknowable, but the mission of the Human Rights Data Analysis Group (HRDAG) is to make sense of such information, clouded as it is by the fog of war. They do this not by nominating one source of information as the “best”, but instead with statistical modeling of the differences between sources.

In a fascinating talk at Strata Santa Clara in February, HRDAG's Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in how names, ages, etc. were captured), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that some victims were reported by no agency at all. By looking at how often known victims appear on some agencies' lists but are missing from others, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)
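To make the idea of "estimating who was reported by nobody" concrete, here is a minimal sketch of the simplest capture-recapture calculation, the two-list Lincoln-Petersen estimator. It is not HRDAG's code or model: their analysis works with four lists and more sophisticated methods. The victim IDs below are invented, the function name `estimate_total` is hypothetical, and the estimator assumes the two lists are independent and that records have already been linked correctly.

```python
def estimate_total(list_a, list_b):
    """Two-list Lincoln-Petersen estimate of the total population size.

    Assumes (for illustration only) that the two lists are independent
    samples and that records have already been linked to shared IDs.
    """
    a, b = set(list_a), set(list_b)
    both = len(a & b)  # victims reported by both agencies
    if both == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    observed = len(a | b)  # victims reported by at least one agency
    total = len(a) * len(b) / both  # Lincoln-Petersen estimate
    return {
        "observed": observed,
        "estimated_total": total,
        "estimated_unreported": total - observed,
    }


# Hypothetical, already-linked victim IDs (invented for illustration).
agency_a = {f"v{i:02d}" for i in range(1, 9)}   # v01..v08: 8 records
agency_b = {f"v{i:02d}" for i in range(5, 15)}  # v05..v14: 10 records

result = estimate_total(agency_a, agency_b)
print(result)
```

In this toy data the two lists together name 14 distinct victims, but only 4 of agency B's 10 records also appear on agency A's list. If agency A captured 4 in 10 of the victims B knows about, the estimator takes that as A's capture rate overall, which puts the total at 8 × 10 / 4 = 20, with about 6 victims reported by nobody.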


HRDAG is doing a noble and difficult job of understanding the facts of war from incomplete data. “If we base our conclusions about what's happening in Syria on the observed data, on the reporting rates, we get those questions wrong”, said Megan in her Strata talk. “When we estimate what is missing, we have a much more accurate estimate of reality.”

Strata: Record Linkage and Other Statistical Models for Quantifying Conflict Casualties in Syria
