In homage to International Talk Like a Pirate Day…
I recently stumbled across a series of blog posts from the folks at IDV that visualised the archive of pirate attacks collected by the US National Geospatial-Intelligence Agency. The dataset covers 6000+ pirate attacks recorded over the last 30 or so years.
This first map shows where the attacks have been recorded, with four clear areas standing out when the data is aggregated into hexagon bins:
Zooming in on the area around Yemen, there’s a clear ramp-up in the number of attacks from 2008, which saw a 570% increase compared to the previous year. As noted by the IDV analysis, most attacks take place on Wednesdays and during the spring and autumn.
The reaction to the massive increase in attacks in 2008 seems to have been ships not travelling as close to the shore in 2009, leading to more attacks happening further out to sea. This can clearly be seen by looking only at the attacks in 2008 and 2009:
As well as the location of each attack, the dataset contains descriptions of the attacks. These lend themselves well to some text analysis to understand where the nature of the attacks has changed, above and beyond their distance from shore.
Some analysis of the descriptions reveals that the nature of the attacks also changed in 2010, with more featuring terms such as security and speedboats (full details of how these topic groups were created are below). The analysis was used to identify five different types of attacks.
From the chart below, Topics 4 and 5 came to prominence in 2008, with Topic 5 maintaining its share in 2010 before Topic 2 increased in number in 2011 and 2012. This is just scratching the surface of what can be done with topic analysis, and given that all the documents are related to pirate attacks, there’s less variation than you would see in, say, news articles about many subjects. There’s a good walk-through of using the topicmodels package here.
And what are these topics? The table below shows the top 10 terms for each of the 5 topics – they’re not as clear cut as you’d hope (mainly because there are a fair few verbs and numbers in there at the moment), but give an idea of some differences – skiffs versus speedboats, topic 2 featuring “security”, the numbers involved and months of the year all hint at different aspects of the attacks which have been picked out.
      Topic 1     Topic 2     Topic 3     Topic 4      Topic 5
 1    attempted   were        hijacked    boat         boats
 2    six         fire        are         white        four
 3    took        security    attacked    port         men
 4    increased   skiffs      miles       about        persons
 5    board       team        vessels     small        three
 6    skiffs      when        this        sep          may
 7    alarm,      jan         advised     apr          speedboats
 8    for         seven       merchant    general      five
 9    chemical    had         boarded     reportedly   reports
10    guns        which       exercise    speedboat    each
And the reason why pirates are called pirates? Because they Argghhhhhhhhhh.
NB. I haven’t had a chance to check on the copyright, etc. for hosting the pirate dataset, so please download it from here.
Reading the data into R.
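library(maps)
library(sp)
library(maptools)
library(ggplot2)
library(spatstat)
gpclibPermit()
library(topicmodels)
# Also needed by the code further down
library(plyr)      # ddply()
library(quantmod)  # Delt()
library(tm)        # Corpus(), DocumentTermMatrix()
library(slam)      # row_sums(), col_sums()

# Read the ASAM shapefile of attack locations and flatten it to a data frame
pirates.data <- readShapePoints("C:\\ASAM 05 SEP 12")
pirates.data.2 <- as.data.frame(pirates.data)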
How far to the shore?
The next step is to turn the attack data into a planar point pattern, so that the nearest coastal point can be calculated for each attack. The same technique is then used to create a similar point pattern for the coastline. The nncross function finds the distance from each attack to the nearest point on the coast (and identifies which point that is). Because the coordinates are longitude and latitude, the distances come out in degrees rather than kilometres.
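# Bounding box for the area around Aden: xmin, xmax, ymin, ymax (degrees)
bb <- c(40, 56, 7, 17)

# Attacks as a planar point pattern (columns 13:14 hold the coordinates)
pirates.ppp <- as.ppp(pirates.data.2[, 13:14], bb)

# Coastline points as a second point pattern
worldmap <- map_data("world")
land.ppp <- as.ppp(worldmap[, 1:2], bb)
land.df <- as.data.frame(cbind(land.ppp$x, land.ppp$y))
reg <- as.data.frame(map("world", xlim = c(40, 56), ylim = c(7, 17), plot = FALSE)[c("x", "y")])

# Distance from each attack to its nearest coastline point
nearest.land <- nncross(pirates.ppp, land.ppp)
pirates.nearest.land <- as.data.frame(cbind(as.numeric(pirates.ppp$x),
                                            as.numeric(pirates.ppp$y),
                                            as.numeric(nearest.land$dist)))

# Join the distances (V3) back onto the attack attributes by coordinate
pirates.data.aden <- merge(pirates.nearest.land, pirates.data.2,
                           by.x = c("V1", "V2"), by.y = c("coords.x1", "coords.x2"))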
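To make it clearer what the nearest-neighbour step is doing, here is a minimal base R sketch of the same idea on a handful of made-up points. The coordinates and the helper name nearest_point are hypothetical and only for illustration. It brute-forces every attack-to-coast distance, which is fine for a toy example but is not how spatstat implements nncross.

```r
# Toy coordinates (longitude, latitude); hypothetical, not from the ASAM data
attacks <- data.frame(x = c(45.0, 50.0, 52.5), y = c(12.5, 13.0, 10.0))
coast   <- data.frame(x = c(45.0, 48.0, 51.0), y = c(12.8, 14.0, 11.9))

# For every point in pts, find the closest point in ref and the planar
# distance to it (in coordinate units, i.e. degrees here, not km)
nearest_point <- function(pts, ref) {
  dx <- outer(pts$x, ref$x, "-")   # rows = attacks, columns = coast points
  dy <- outer(pts$y, ref$y, "-")
  d  <- sqrt(dx^2 + dy^2)
  idx <- apply(d, 1, which.min)    # index of the nearest coast point per attack
  data.frame(dist = d[cbind(seq_len(nrow(d)), idx)], which = idx)
}

nn <- nearest_point(attacks, coast)
print(nn)
```

The result has the same two pieces of information used above: a distance column and the index of the matching coastline point.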
Calculate extra columns, such as the year and month of each attack, and the number of attacks by year
# ddply() comes from plyr and Delt() from quantmod
dates <- as.POSIXlt(pirates.data.aden$DateOfOcc)
pirates.data.aden$year <- 1900 + dates$year   # POSIXlt years count from 1900
pirates.data.aden$month <- dates$mon + 1      # POSIXlt months run 0-11
year.stats <- ddply(pirates.data.aden, .(year), summarise, attacks = length(year))
year.stats$Delt <- Delt(year.stats$attacks)   # change on the previous year
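The year-on-year change behind figures like the 570% increase is simple arithmetic, and it can be sketched in base R without quantmod. The counts below are invented purely to show the calculation and are not the ASAM figures. As far as I know, Delt returns the same quantity as a fraction rather than a percentage.

```r
# Hypothetical yearly counts, for illustration only (not the ASAM figures)
year.stats <- data.frame(year = 2006:2009, attacks = c(10, 20, 50, 40))

# Percentage change on the previous year: (this year - last year) / last year
prev <- c(NA, head(year.stats$attacks, -1))
year.stats$pct_change <- 100 * (year.stats$attacks - prev) / prev

print(year.stats)
```

The first year has no predecessor, so its change is NA.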
Plot the world map showing attacks binned into hexagons
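# Coastline outline for the whole world (assumed: the original code does not define `world`)
world <- as.data.frame(map("world", plot = FALSE)[c("x", "y")])

# mb.theme is the author's own ggplot2 theme and is not defined in this post
ggplot() +
  stat_summary_hex(fun = "length", data = pirates.data.2,
                   aes(x = coords.x1, y = coords.x2, z = coords.x2)) +
  scale_fill_gradient(name = "Pirate Attacks recorded", low = "white", high = "red") +
  geom_path(aes(x = x, y = y), data = world) +
  mb.theme +
  labs(x = "", y = "") +
  theme(panel.background = element_rect(fill = "white"),
        axis.ticks = element_line(colour = "white"),
        axis.text = element_text(colour = "white"),
        axis.line = element_line(colour = "white"),
        panel.grid = element_line(colour = NA)) +
  scale_x_continuous(breaks = NULL) +  # NULL rather than NA suppresses the breaks
  scale_y_continuous(breaks = NULL)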
Number of attacks by year near Aden
ggplot(data = pirates.data.aden, aes(x = year)) +
  geom_histogram(binwidth = 1, colour = "white", fill = "dark blue") +
  mb.theme +
  labs(x = "Year", y = "Number of attacks recorded")
Attacks by distance from Shore as a histogram
# V3 is the distance to the nearest coastline point from nncross()
ggplot(data = subset(pirates.data.aden, year %in% c(2008, 2009)), aes(x = V3)) +
  geom_histogram(fill = "dark blue", colour = "white") +
  mb.theme +
  facet_wrap(~year, ncol = 1) +
  labs(x = "Distance from shore", y = "Number of attacks")
Topic Models analysis of attack descriptions near Aden
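# Corpus() and DocumentTermMatrix() come from tm; row_sums() and col_sums() from slam
corpus <- Corpus(VectorSource(pirates.data.aden$Desc1))
dtm <- DocumentTermMatrix(corpus)

# Drop terms with a low mean tf-idf, then drop documents left with no terms
term_tfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) *
  log2(nDocs(dtm)/col_sums(dtm > 0))
dtm <- dtm[, term_tfidf >= 0.1]
keep <- row_sums(dtm) > 0
dtm <- dtm[keep, ]

k <- 5
SEED <- 2012

# Fit four model variants on the same document-term matrix
TM <- list(
  VEM = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs = LDA(dtm, k = k, method = "Gibbs",
              control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
  CTM = CTM(dtm, k = k,
            control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
)

# Most likely topic per retained attack, from the Gibbs model; attacks whose
# descriptions were dropped above stay NA so the lengths still match
pirates.data.aden$Topic <- NA
pirates.data.aden$Topic[keep] <- topics(TM[["Gibbs"]], 1)

ggplot(data = pirates.data.aden,
       aes(x = year, fill = as.factor(Topic), group = as.factor(Topic))) +
  geom_histogram(binwidth = 1, colour = "white") +
  scale_fill_brewer(palette = "Set3", "Topic Group") +
  mb.theme +
  labs(x = "", y = "")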
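The tf-idf filter is the least obvious line in the code above, so here is a small base R sketch of the same calculation on an ordinary (dense) matrix. The three documents and their counts are made up for illustration. It assumes, as the sparse-matrix version does, that the mean term frequency is taken only over the documents that actually contain the term.

```r
# Tiny hypothetical document-term count matrix (rows = documents)
counts <- matrix(c(2, 1, 0,
                   1, 0, 3,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("the", "skiff", "security")))

tf  <- counts / rowSums(counts)   # term frequency within each document
df  <- colSums(counts > 0)        # number of documents containing each term
idf <- log2(nrow(counts) / df)    # inverse document frequency

# Mean tf over the documents that contain the term, multiplied by idf
mean_tf    <- colSums(tf) / df
term_tfidf <- mean_tf * idf

# Same threshold as above: terms scoring below 0.1 are dropped
keep <- term_tfidf >= 0.1
print(round(term_tfidf, 3))
print(keep)
```

A term that appears in every document, like "the" here, gets an idf of zero and is filtered out however often it occurs, which is the point of the filter.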