Why are pirates called pirates?

[This article was first published on Drunks&Lampposts » R, and kindly contributed to R-bloggers.]

In homage to International Talk Like a Pirate Day…

I recently stumbled across a series of blog posts from the folks at IDV that visualised the archive of recorded pirate attacks which has been collected by the US National Geospatial-Intelligence Agency. It’s a dataset of 6000+ pirate attacks which have been recorded over the last 30 or so years.

This first map shows where the attacks have been recorded, with four clear areas standing out when the data is aggregated into hexagon bins:

Map showing areas where pirate attacks have been recorded

Zooming in on the area around Yemen, there’s a clear ramp-up in the number of attacks from 2008, which saw a 570% increase on the previous year. As noted in the IDV analysis, most attacks take place on Wednesdays and during the spring and autumn.

Number of attacks recorded in the Aden region by year

The reaction to the massive increase in attacks in 2008 seems to have been ships not travelling as close to the shore in 2009, leading to more attacks happening further out to sea. This can clearly be seen by looking only at the attacks in 2008 and 2009:

Number of pirate attacks in the Aden area in 2008 and 2009 (distance is in degrees)

As well as the location of each attack, the dataset contains a description of it, which lends itself well to some text analysis to understand how the nature of the attacks has changed above and beyond their distance from shore.

Some analysis of the descriptions of the attacks reveals that the nature of the attacks also changed in 2010, with more featuring terms such as security and speedboats (full details of how these topic groups were created are below). The analysis was used to identify five different types of attack.

From the chart below, Topics 4 and 5 came to prominence in 2008, with Topic 5 maintaining its share in 2010 before Topic 2 then increased in number in 2011 and 2012. This is just scratching the surface of what can be done with topic analysis, and given that all the documents are about pirate attacks, there’s less variation than you would see in, say, news articles covering many subjects. There’s a good walk through of using the topicmodels package here.

Attacks in the Yemen region classified into one of five topics based on description

And what are these topics? The table below shows the top 10 terms for each of the 5 topics – they’re not as clear cut as you’d hope (mainly because there are a fair few verbs and numbers in there at the moment), but they give an idea of some differences – skiffs versus speedboats, topic 2 featuring “security”, the numbers involved and the months of the year all hint at different aspects of the attacks which have been picked out.


     Topic 1  Topic 2  Topic 3    Topic 4    Topic 5
1  attempted     were hijacked       boat      boats
2        six     fire      are      white       four
3       took security attacked       port        men
4  increased   skiffs    miles      about    persons
5      board     team  vessels      small      three
6     skiffs     when     this        sep        may
7     alarm,      jan  advised        apr speedboats
8        for    seven merchant    general       five
9   chemical      had  boarded reportedly    reports
10      guns    which exercise  speedboat       each

And the reason why pirates are called pirates? Because they Argghhhhhhhhhh.

NB. I haven’t had a chance to check on the copyright, etc. for hosting the pirate dataset, so please download it from here.

Reading the data into R.

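library(maps)
library(sp)
library(maptools)    # readShapePoints
library(ggplot2)
library(spatstat)    # as.ppp, nncross
gpclibPermit()
library(plyr)        # ddply, used for the yearly counts
library(quantmod)    # Delt, used for the year-on-year change
library(tm)          # Corpus, DocumentTermMatrix
library(slam)        # row_sums, col_sums
library(topicmodels)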

pirates.data <- readShapePoints("C:\\ASAM 05 SEP 12")
pirates.data.2 <- as.data.frame(pirates.data)

How far to the shore?

The next step is to turn the data into a planar point pattern so that the nearest coastal point can be calculated for each attack. The same technique is then used to create a similar point pattern for the coastline. The nncross function then finds the distance from each attack to the nearest point on the coast (and identifies which point that is).


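# Window around the Gulf of Aden: xmin, xmax, ymin, ymax (in degrees)
bb <- c(40, 56, 7, 17)
# Columns 13:14 hold the coordinates; points outside the window are dropped
pirates.ppp <- as.ppp(pirates.data.2[,13:14], bb)

# Coastline points (long, lat) restricted to the same window
worldmap <- map_data("world")
land.ppp <- as.ppp(worldmap[, 1:2], bb)
land.df <- as.data.frame(cbind(land.ppp$x, land.ppp$y))

reg <- as.data.frame(map("world", xlim = c(40, 56), ylim = c(7, 17), plot = FALSE)[c("x", "y")])

# Distance from each attack to its nearest coastline point (V3 below)
nearest.land <- nncross(pirates.ppp, land.ppp)
pirates.nearest.land <- as.data.frame(cbind(as.numeric(pirates.ppp$x), as.numeric(pirates.ppp$y), as.numeric(nearest.land$dist)))

# Join the distances back onto the attack records by coordinate
pirates.data.aden <- merge(pirates.nearest.land, pirates.data.2, by.x=c("V1", "V2"), by.y = c("coords.x1", "coords.x2"))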
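
To make it clearer what the nearest-point step computes, here is a minimal base R sketch of the same idea on a handful of invented coordinates. It is a brute-force stand-in rather than spatstat's implementation, and the names attacks, coast and nearest_point are made up for the illustration.

```r
# Toy coordinates, invented for illustration (longitude, latitude in degrees).
attacks <- data.frame(x = c(45.0, 50.0, 48.5), y = c(12.0, 13.0, 11.0))
coast   <- data.frame(x = c(45.0, 49.0, 51.0), y = c(12.8, 11.2, 13.0))

# For one attack, measure the Euclidean distance to every coastline point and
# keep the smallest one, plus the index of the point that produced it.
nearest_point <- function(px, py, qx, qy) {
  d <- sqrt((qx - px)^2 + (qy - py)^2)
  c(dist = min(d), which = which.min(d))
}

# Apply it to every attack; the result has one row per attack.
nearest <- as.data.frame(t(mapply(nearest_point, attacks$x, attacks$y,
                                  MoreArgs = list(qx = coast$x, qy = coast$y))))
print(nearest)
```

The distances come out in the same units as the coordinates, which is why the histogram of distance from shore is in degrees rather than kilometres.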

Calculate extra columns such as the year and month of each attack, and the number of attacks by year

pirates.data.aden$year <- 1900 + as.POSIXlt(pirates.data.aden$DateOfOcc)$year
# POSIXlt months run from 0 to 11, so add 1 to get the calendar month
pirates.data.aden$month <- 1 + as.POSIXlt(pirates.data.aden$DateOfOcc)$mon

# ddply comes from plyr; Delt (from quantmod) gives the change on the previous year as a fraction
year.stats <- ddply(pirates.data.aden, .(year), summarise, attacks= length(year))
year.stats$Delt <- Delt(year.stats$attacks)
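
For anyone without quantmod to hand, the year-on-year change can be sketched in base R. The counts below are made up purely to show the arithmetic (they are not the ASAM figures), and year.stats.demo is a hypothetical name.

```r
# Made-up yearly counts, for illustration only (not the ASAM figures).
year.stats.demo <- data.frame(year = 2006:2009, attacks = c(10, 12, 60, 90))

# Percentage change relative to the previous year; the first year has no
# predecessor, so its change is NA.
prev <- c(NA, head(year.stats.demo$attacks, -1))
year.stats.demo$pct_change <- 100 * (year.stats.demo$attacks - prev) / prev
print(year.stats.demo)
```

On this reading, a 570% increase means the count is 6.7 times the previous year's, which Delt would report as 5.7.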

Plot the world map showing attacks binned into hexagons

# mb.theme is not defined in this post (presumably the author's own ggplot2
# theme); drop it or substitute a built-in theme such as theme_bw().
# The coastline is drawn from worldmap, created above with map_data("world").
ggplot()+
	stat_summary_hex(fun="length", data=pirates.data.2, aes(x=coords.x1, y=coords.x2, z=coords.x2)) +
	scale_fill_gradient(low="white", high="red", "Pirate Attacks recorded") +
	geom_path(aes(x=long, y=lat, group=group), data = worldmap) +
	mb.theme +
	labs(x="", y="") +
	theme(panel.background = element_rect(fill="white"),
		axis.ticks = element_line(colour="white"),
		axis.text = element_text(colour="white"),
		axis.line = element_line(colour="white"),
		panel.grid = element_line(colour=NA)) +
	scale_x_continuous(breaks=NULL)+
	scale_y_continuous(breaks=NULL)

Number of attacks by year near Aden

ggplot(data=pirates.data.aden, aes(x=year))+
		geom_histogram(binwidth=1, colour="white", fill="dark blue")+
		mb.theme +
		labs(x="Year", y="Number of attacks recorded")

Attacks by distance from Shore as a histogram

ggplot(data=subset(pirates.data.aden, year%in%c(2008, 2009)), aes(x=V3))+geom_histogram(fill="dark blue", colour="white")+
		mb.theme+
		facet_wrap(~year, ncol=1) +
		labs(x="Distance from shore", y="Number of attacks")
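
Because the coordinates are longitude and latitude, the distances above are in degrees. If kilometres are wanted, one option (not used in this analysis) is the haversine great-circle formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km, and haversine_km is a hypothetical helper name.

```r
# Great-circle distance in kilometres between points given in decimal degrees.
# 6371 km is a commonly used mean Earth radius (an assumption of this sketch).
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# One degree of longitude at the equator versus at 12 degrees north
# (roughly the latitude of the Gulf of Aden).
equator <- haversine_km(0, 0, 1, 0)
aden    <- haversine_km(45, 12, 46, 12)
print(c(equator = equator, aden = aden))
```

A degree of longitude is slightly shorter at Aden's latitude than at the equator, so distances in degrees are only a rough proxy for distance on the ground.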

Topic Models analysis of attack descriptions near Aden

corpus <- Corpus(VectorSource(pirates.data.aden$Desc1))

dtm <- DocumentTermMatrix(corpus)

term_tfidf <- tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm > 0))
dtm <- dtm[, term_tfidf >= 0.1]
dtm <- dtm[row_sums(dtm) > 0,]
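
The tf-idf line is fairly dense, so here is the same calculation spelled out on a tiny dense matrix in base R. The counts and terms are invented for the illustration; the real code works on the sparse triplet form of dtm instead.

```r
# Toy document-term counts, invented for illustration: 3 documents, 3 terms.
m <- matrix(c(2, 1, 0,
              1, 0, 3,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("1", "2", "3"), c("the", "skiff", "fired")))

tf <- m / rowSums(m)                   # term frequency within each document
tf[m == 0] <- NA                       # average only over documents containing the term
mean_tf <- colMeans(tf, na.rm = TRUE)
idf <- log2(nrow(m) / colSums(m > 0))  # rarer terms get a larger weight
term_tfidf <- mean_tf * idf
print(round(term_tfidf, 3))
```

A term that appears in every document (here "the") gets an inverse document frequency of zero, so it falls below the 0.1 threshold and is dropped before the topic models are fitted.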

k=5
SEED=2012

TM <- list(VEM = LDA(dtm, k = k, control = list(seed = SEED)),
		VEM_fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
		Gibbs = LDA(dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
		CTM = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))))

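# Documents left with no terms were dropped from dtm above, so match the fitted
# topics back to the attacks by document ID rather than by position
# (this assumes tm's default IDs, which number the documents 1..n).
pirates.data.aden$Topic <- NA
pirates.data.aden$Topic[as.integer(dtm$dimnames$Docs)] <- topics(TM[["Gibbs"]], 1)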

ggplot(data=pirates.data.aden, aes(x=year, fill=as.factor(Topic), group=as.factor(Topic)))+
		geom_histogram(binwidth=1, colour="white")+
		scale_fill_brewer(palette="Set3", "Topic Group") +
		mb.theme +
		labs(x="", y="")
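
The stacked histogram shows counts, but the share each topic holds within a year can also be tabulated directly. A minimal sketch with made-up topic assignments (demo is a hypothetical data frame, not the fitted results):

```r
# Made-up topic assignments, for illustration only.
demo <- data.frame(year  = c(2008, 2008, 2008, 2009, 2009, 2010, 2010, 2010, 2010),
                   Topic = c(4, 5, 5, 5, 2, 2, 2, 5, 4))

counts <- table(demo$year, demo$Topic)     # attacks per year and topic
shares <- prop.table(counts, margin = 1)   # each row (year) sums to 1
print(round(shares, 2))
```

Swapping demo for pirates.data.aden would give the equivalent table for the real data, assuming the Topic column has been filled in as above.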
