Mapping the GDELT data (and some Russian protests, too)

[This article was first published on Quantifying Memory, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations. Because of this, if you here for the shiny visualisations, you'll have to scroll way down

The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650mb zip file (4.6gb uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.

Meanwhile I headed over to David Masad's writeup on getting started with GDELT in python

25 minutes later the data was down, and unpacked. It came equipped with a few python and R scripts, which I quickly set about modifying. Although hampered by my lack of python knowledge I got the extract script running easily enough, and soon was pulling out all events associated with Russia. It was still a bit unclear what these 'events' were.

According to David Masad, there are four types of event: 1. Material Cooperation 2. Verbal Cooperation 3. Verbal Conflict 4. Material Conflict

I was still not too sure what these exactly represented, but by now the extract script was ticking over into 2007, so I was more interested in preparing the visualisation scripts in R. On the subject of the extract script - the events recorded are very obviously are skewed towards the present, so any time-series conclusions should be taken with a pinch of salt, or at the very least adjusted somehow. Thankfully the data is not quite up to date, and the output data finally clocked in at 226mb.

I was a bit worried about how R would handle a file that size, so I tried reading in a few lines at a time. The first line looked like this. Clearly the data is tab delimited. The first column is the date, the second and third actor codes, then three columns I later discovered were 'event code, quad category, and Godlstein Scale'. Much excitement to be had figuring out what this all means. The last six columns are longitude and latitude, representing the location of the actors and the event.

By the way, for anyone preferring to do this in Python, which frankly makes a lot of sense, here is a great tutorial for finding exactly the data you need.

[1] "19790101\tATH\tRUS\t120\t3\t-4.0\t52.1256\t140.381\t52.1256\t140.381\t52.1256\t140.381"

Now for loading the data in. I was hopeful this could work, but not much beyond that. Windows task manager suggested I was running out of memory fast. But then it was all ok, and the data beautifully formatted. This was also when I realised just how massive this data set is: more than 3 million events featuring Russia. Phew.

First I renamed the columns, fixed the dates, and saved the file in R's native Rdata format (this got the data down to a positively miniscule 37mb) (I realised yet again that Lubridate is amazing. I was worried this would take ages, but, thankfully, no)

library(lubridate)
t <- read.table("GDELTproject/data/select.outfile.txt", sep = "\t")
colnames(t) <- c("Day", "Actor1Code", "Actor2Code", "EventCode", "QuadCategory", 
    "GoldsteinScale", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor2Geo_Lat", "Actor2Geo_Long", 
    "ActionGeo_Lat", "ActionGeo_Long")
t$Day <- ymd(t$Day)
save(t, file = "gdeltRus.Rdata")

So now for some exploratory analysis - who is involved, and when did these events take place?

print(head(t))

         Day Actor1Code Actor2Code EventCode QuadCategory GoldsteinScale
1 1979-01-01        ATH        RUS       120            3           -4.0
2 1979-01-01        MOS        RUS        40            2            1.0
3 1979-01-01        MOS        RUS        43            2            2.8
4 1979-01-01        MOS        RUS        51            2            3.4
5 1979-01-01        RUS     GUYGOV       841            1            7.0
6 1979-01-01        RUS        IRN       100            3           -5.0
  Actor1Geo_Lat Actor1Geo_Long Actor2Geo_Lat Actor2Geo_Long ActionGeo_Lat
1         52.13         140.38         52.13         140.38         52.13
2         55.75          37.62         55.75          37.62         55.75
3         52.13         140.38         52.13         140.38         52.13
4         55.75          37.62         55.75          37.62         55.75
5          6.80         -58.15         38.00         -97.00          6.80
6         55.75          37.62         55.75          37.62         55.75
  ActionGeo_Long
1         140.38
2          37.62
3         140.38
4          37.62
5         -58.15
6          37.62

tail(sort(table(t$Actor1Code)), 20)


    IRN     CVL     DEU     JPN     CHE  USAGOV     GBR     BUS     LEG 
  21413   21801   22956   24430   25539   25544   25775   28223   31419 
    COP     CHN     UKR  RUSMED  RUSMIL     MIL     MED     USA     GOV 
  31773   35332   35857   35904   36555   40957   50677   89275  128319 
 RUSGOV     RUS 
 283648 1422931 

The list of actors looks promising: Russia, Russian government, a generic 'GOV' (what's that about?), USA, MED and MIL which I imagine are medical and military respectively, followed by the Russian military etc etc.

The actor2 columns looked quite similar, but featured some intriguing entries, such as IGOWSTNAT. Ideas, anyone? [Turns out this IGO refers to international or regional inter-governmental organisations ]

As for date distribution, it is reasonably sparse until the mid 90s, since which there has been a general rise, with sharp increases in events during the crises of the late 90s and late 2000s. This is promising!

library(lubridate)
library(ggplot2)
x <- data.frame(table(year(t$Day)))
ggplot(x, aes(x = Var1, y = Freq)) + geom_bar(stat = "identity") + theme_bw() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now, what about plotting some on a map? For this initial plot i ignored actor and date data, and just used the final two columns of event location, grouping them by count. This left 30000 distinct geo locations in which events with Russian involvement have been recorded.

On to the plotting:

library(RgoogleMaps)
library(ggmap)
library(mapproj)
library(plyr)

# Reshape the data so we get the counts of events for each location
t$count <- 1
df <- ddply(t, .(ActionGeo_Lat, ActionGeo_Long), summarize, count = sum(count))

# lets define the scope of our map
lat <- c(40, 70)
lon <- c(20, 100)
center = c(lat = mean(lat), lon = mean(lon))
zoom <- min(MaxZoom(range(lat), range(lon)))

# And download a map file!
map <- get_map(location = c(lon = mean(lon), lat = mean(lat)), zoom = 3, maptype = "terrain", 
    source = "google")
print(ggmap(map) + geom_point(data = df, aes(x = ActionGeo_Long, y = ActionGeo_Lat, 
    size = count), alpha = 0.3))

Clearly there is a lot going on here. Maybe focusing on one type of activity would be more productive. This is where it really gets interesting. The data set uses the CAMEO coding scheme , which includes quite detailed information about event type. Let's isolate examples of civil disobedience: protests, rallies, hunger strikes, strikes, boycotts, obstructions, violent protest, etc.

This reduces the data to 25 000 events. Still a lot.

But the CAMEO specifications include information about actors, e.g. the NGOs involved. Hence the mysterious REB, SPY, JUD, etc.

By keeping only entries where at least one agent is a representative of so-called civil society, we are left with 6443 entries.

t$count <- 1
# selecting the events we are interested in
t2 <- t[t$EventCode > 139 & t$EventCode < 146, ]

# Keep only rows with agents that interest us. Combined these
t3a <- t2[grep("OPP|EDU|REL|CVL|ELI|LAB|IGO|NGO|NGM", t2$Actor1Code), ]
t3b <- t2[grep("OPP|EDU|REL|CVL|ELI|LAB|IGO|NGO|NGM", t2$Actor2Code), ]
t3 <- rbind(t3a, t3b)
# agents: OPP, EDU, REL, CVL, ELI, LAB, IGO, NGO, NGM

df2 <- ddply(t3, .(ActionGeo_Lat, ActionGeo_Long), summarize, count = sum(count))
map <- get_map(location = c(lon = mean(lon), lat = mean(lat)), zoom = 3, maptype = "terrain", 
    source = "google", color = "bw")
print(ggmap(map) + geom_point(data = df2, aes(x = ActionGeo_Long, y = ActionGeo_Lat, 
    size = count), alpha = 0.8, pch = 21, colour = "black", fill = "red2") + 
    scale_size(range = c(2, 7)))

Let's zoom in on Moscow during the recent Russian protests and see what detail we can see.

t3$recent <- 0
t3$Day <- as.Date(t3$Day)
t3$recent[t3$Day > ("2011-07-01")] <- 1
tryCatch(detach("package:Hmisc"), error = function(e) NULL)

df2 <- ddply(t3, .(ActionGeo_Lat, ActionGeo_Long, recent), summarize, count = sum(count))
mos <- get_map(location = "moscow", zoom = 10, maptype = "terrain", source = "google")
print(ggmap(mos) + geom_point(data = df2, aes(x = ActionGeo_Long, y = ActionGeo_Lat, 
    size = count, fill = recent), alpha = 0.8, pch = 21, colour = "black") + 
    scale_size(range = c(4, 20)) + facet_wrap(~recent))

On the left is all civil society protest activity from 1979 until July 2011, and on the right in the following 12 months. I've not analysed the plot in great detail, but to the naked eye it looks like there were half the number of (reports of) protests and acts of civil disobedience in Moscow in 2011-12 as for the preceding 30 years, which you may or may not find surprising.


On the subject of protest movements, Alex Hanna has already looked at what GDELT has to say about the Arab spring.

Back to Russia: a more interesting plot can be achieved by drawing lines between the geo location of actors. For the plot below I have only included events between July 2011 and mid 2012, when the dataset ends.

(most of the code for the visualisation is plundered from here)

#Limit the data to the last year recorded
t3$recent <- 0
 t3$Day <- as.Date(t3$Day)
t3$recent[t3$Day>("2011-07-01")] <- 1
t3 <- t3[t3$recent==1,]

#to avoid conflicts between plyr and Hmisc. If anyone knows a better way of doing this, please let me know!
tryCatch(detach("package:Hmisc"),error=function(e) NULL)


df2 <- ddply(t3,.(Actor1Geo_Lat,Actor1Geo_Long,Actor2Geo_Lat,Actor2Geo_Long,ActionGeo_Lat,ActionGeo_Long),summarize, count=sum(count))
df2 <- df2[complete.cases(df2),]


#remove links with America and southern hemisphere
df2 <- df2[df2$Actor1Geo_Lat>0&df2$Actor2Geo_Lat>0&df2$Actor1Geo_Long>0&df2$Actor2Geo_Long>0,]
#remove Generic Russia
df2 <- df2[df2$Actor2Geo_Lat!=60&df2$Actor2Geo_Long!=100,]
df2 <- df2[df2$Actor1Geo_Lat!=60&df2$Actor1Geo_Long!=100,]


#place points and edges in separate data frames
pointsDf <- df2[,5:7]
colnames(pointsDf)[3] <- "count2"
edgesDf <- df2[,c(1:4,7)]

a <- paste0(edgesDf[,1],edgesDf[,2])#remove points were start and end are the same
b <- paste0(edgesDf[,3],edgesDf[,4])
edgesDf <- edgesDf[!a==b,]


library(sna)
library(Hmisc)

edgeMaker <- function(whichRow, len = 1, curved = TRUE){
  fromC <- c(edgesDf[whichRow,2],edgesDf[whichRow,1])  # Origin
  toC <- c(edgesDf[whichRow,4],edgesDf[whichRow,3]) # Terminus
  weight <- edgesDf[whichRow, 5]  # Terminus

  # Add curve:
  graphCenter <- c(50,50)#colMeans(edgesDf[,1:2])  # Center of the overall graph
  bezierMid <- c(fromC[1], toC[2])  # A midpoint, for bended edges
  distance1 <- sum((graphCenter - bezierMid)^2)
  if(distance1 < sum((graphCenter - c(toC[1], fromC[2]))^2)){
    bezierMid <- c(toC[1], fromC[2])
    }  # To select the best Bezier midpoint
  bezierMid <- (fromC + toC + bezierMid) / 3  # Moderate the Bezier midpoint
  if(curved == FALSE){bezierMid <- (fromC + toC) / 2}  # Remove the curve

  edge <- data.frame(bezier(c(fromC[1], bezierMid[1], toC[1]),  # Generate
                            c(fromC[2], bezierMid[2], toC[2]),  # X & y
                            evaluation = len))  # Bezier path coordinates
  edge$Sequence <- 1:len  # For size and colour weighting in plot
  edge$weight <- weight
  edge$Group <- whichRow
  return(edge)
}
allEdges <- lapply(1:nrow(edgesDf), edgeMaker, len = 100, curved = TRUE)
allEdges <- do.call(rbind, allEdges)  # a fine-grained path ^, with bend ^

#     col <- rep(gray(8/10),length(labels))
#     col2 <- rep(gray(1/10),length(labels))
#     col[labels==term] <- "red3"
#     col2[labels==term] <- "red4"

new_theme_empty <- theme_bw()
new_theme_empty$line <- element_blank()
new_theme_empty$rect <- element_blank()
new_theme_empty$strip.text <- element_blank()
new_theme_empty$axis.text <- element_blank()
new_theme_empty$axis.title <- element_blank()
new_theme_empty$legend <- element_blank()
new_theme_empty$plot.margin <- structure(c(0, 0, 0, -1), unit = "lines",
                                         valid.unit = 3L, class = "unit")

map <-get_map(location=c(lon=mean(lon), lat=mean(lat)),zoom=3,maptype="terrain",color="bw")

p <- ggmap(map)
#p <- ggplot(allEdges,environment=environment())  
p <- p + geom_path(data=allEdges, aes(x = x, y = y,group = Group,  # Edges with gradient
                         size=(weight-1),colour=Sequence),alpha=.6)  # and taper
p <- p + geom_point(data = data.frame(pointsDf),  # Add points
                        aes(x = ActionGeo_Long, y = ActionGeo_Lat,size=log(count2)), pch = 21,
                        colour = "black", fill = "red2")  # Customize gradient 
p <- p + scale_colour_gradient(low = "red3", high = "white", guide = "none")
p <- p + scale_size(range = c(.6, 5), guide = "none")  # Customize taper
p <- p + new_theme_empty  # Clean up plot
p <- p + xlim(min(allEdges$x-1),max(allEdges$x+1))
p <- p + ylim(20,65)
print(p)
ggsave("latestImage.png", p, h = 4, w = 8, type = "cairo-png")

In the plot above the circles represent the number of events occurring in a given location, while the shaded lines represent events involving two actors in different locations. The red end of the edge is the origin, the white the destination. I removed all links to the USA or the southern hemisphere, as these obstruct the map. I was interested to note how few events link Russia and the former East European Satellite states (Poland, Hungary, etc.), while noting that there seems to be an extraordinary amount of activity linking Russia and Israel, and possibly also Syria. Finally, it seems that events taking place in the Russian regions, especially the Caucasus, very rarely elicit an international response:

To leave a comment for the author, please follow the link and comment on their blog: Quantifying Memory.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)