Text Mining Gun Deaths Data
In this post I will explore public data being collected by Slate. I previously released code using a much earlier version of this data, demonstrating how to turn it into an animated map.
Collection of this data began as a public response to the horrifying shooting at Sandy Hook Elementary in December 2012, in an attempt to build a database of all deaths resulting from guns in the US. Since its creation, the database has grown to over 12,000 deaths in a little over a year.
In this post I will explore a little of that database, as well as scrape the web for additional data from the articles it lists.
# Let's load the raw data:
gun.deaths <- read.csv(
  "http://slate-interactives-prod.elasticbeanstalk.com/gun-deaths/getCSV.php")
tail(gun.deaths)
summary(gun.deaths)
# We can see that the vast majority of gun deaths are among men (10,153)
# versus 1,850 among women. The mean age is 33.34 while the median is a
# little lower at 30. Interestingly, the maximum age is 107. Information
# that is not provided, though it would be interesting, is the cause of
# death: murder, suicide, accident, etc.
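# A quick sketch of where those figures come from (assuming the CSV columns
# are named 'gender' and 'age'; check names(gun.deaths) to confirm):
table(gun.deaths$gender)
summary(gun.deaths$age)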
library(XML)
# Read and parse HTML file
gun.deaths.text <- list()
html.last <- ""
doc.html <- NULL
# The following loop grabs all 12k+ article texts. I have a somewhat slow
# internet connection, so this takes a while. Because the loop starts at
# length(gun.deaths.text)+1, you can gather some of the text and come back
# to it at a future point (see the resume sketch after the loop) if you
# decide not to wait for it to ping all 12k websites.
for (i in (length(gun.deaths.text)+1):nrow(gun.deaths)) {
print(paste("reading html #", i))
# The following code I borrowed from a post on Quantum Forest. It grabs
# the text between paragraph tags from HTML documents.
# www.quantumforest.com/2011/10/reading-html-pages-in-r-for-text-processing/
try(doc.html <- htmlTreeParse(gun.deaths$url[i],
useInternal = TRUE))
# I have added a few 'try' error-handling calls so that the web-scraping
# loop does not stop when a URL is missing or a download fails.
if (is.null(doc.html)) doc.html <- html.last
doc.text = unlist(try(xpathApply(doc.html, '//p', xmlValue)))
doc.text = gsub('\\n', ' ', doc.text)
doc.text = paste(doc.text, collapse = ' ')
if (identical(html.last, doc.html)) doc.text <- "ERROR Source DROPPED"
gun.deaths.text[i] <- list(doc.text)
# Store this iteration's html as html.last so that a failed download on the
# next iteration (which leaves doc.html unchanged) is detected and dropped
# rather than counted twice.
html.last <- doc.html
}
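# A minimal resume sketch: reload any partial results before rerunning the
# scraping loop above. Because the loop starts at length(gun.deaths.text)+1,
# it will pick up where the last run left off.
if (file.exists("gun.deaths.text.Rdata")) load("gun.deaths.text.Rdata")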
length(gun.deaths.text)
# For the following results I only collected the first ~3,000 articles.
# I suggest saving the data after you have downloaded it all:
save(gun.deaths.text, file="gun.deaths.text.Rdata")
load("gun.deaths.text.Rdata")
# We will use the text mining library
library(tm)
# We first turn our list of articles into a corpus
gun.deaths.corpus <- Corpus(VectorSource(gun.deaths.text))
# Then we lowercase the words in that list. (Newer versions of tm require
# wrapping base functions in content_transformer.)
gun.deaths.corpus <- tm_map(gun.deaths.corpus, content_transformer(tolower))
# This will create a matrix that lists all of the words
# and how frequently they appear in each article.
# It can be very long.
dtm <- DocumentTermMatrix(gun.deaths.corpus)
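# Rather than printing the whole matrix, peek at its dimensions and a small
# corner of it:
dim(dtm)
inspect(dtm[1:5, 1:8])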
freqTerms <- findFreqTerms(dtm, 550)
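# The raw frequent-term list is dominated by stopwords ("the", "and", etc.),
# so as a rough sketch it helps to strip those before hand-picking a
# dictionary of substantive words:
freqTerms.clean <- setdiff(freqTerms, stopwords("english"))
head(freqTerms.clean, 30)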
# as.matrix() (rather than inspect(), which prints the whole matrix)
# converts the restricted document-term matrix for further use.
dtmDic <- as.data.frame(as.matrix(DocumentTermMatrix(gun.deaths.corpus,
            list(dictionary = sort(c("suspect", "suspects", "gunman",
                                     "fatally", "slaying", "witnesses",
                                     "victim", "victims", "homicide",
                                     "drug", "crime", "accidentally",
                                     "multiple", "suicide",
                                     "accidental", "killed", "children",
                                     "student", "teacher", "charged",
                                     "arrested", "self-defense", "defend"))))))
ndict <- ncol(dtmDic)
nobs <- nrow(dtmDic)
# Let's drop the information about how often each word is used and just
# record whether it was used at all.
bimydf <- as.data.frame(1*(dtmDic > 0))
# Let's see some word frequencies plotted
# First we want to count probability of word use for each article
perc <- apply(bimydf,2,mean)
# I will now create my first bead plot to be saved to the hard drive
png("2013-03-13GunDeaths1.png", width = 650, height = 400)
# Adjust the margins
par(mar=c(5,2,3,1))
# Plot the beads
plot(perc, xaxt = "n", xlab="", ylab="%", cex=1.5,
main="Percent of Articles Using Word",pch=19)
# Add guide lines
for (i in 1:ndict) abline(v=i, col=gray(.9-.1*(i %% 4)), lwd=1)
# Add text to identify each bead
text(cex=1, x=1:length(perc)+.5, y=-.015, names(perc),
xpd=TRUE, srt=60, pos=2)
dev.off()
# This is interesting. Homicide and crime are common while accidentally
# and suicide are quite low.
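# As a quick check on this claim, the word-use rates sorted from most to
# least common:
round(sort(perc, decreasing = TRUE), 3)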
# Let's create a variable counting how many violence-related words, such
# as homicide, gunman, victim, etc., appear in each article.
violent <- bimydf$homicide + bimydf$gunman + bimydf$victim +
  bimydf$victims + bimydf$crime + bimydf$suspect +
  bimydf$suspects + bimydf$slaying
# The average number of references to any of the above terms per article is
# about 2. Note that doc.text only holds the last article scraped, so the
# filter for dropped sources must be built from the full list:
valid <- unlist(gun.deaths.text) != "ERROR Source DROPPED"
summary(violent[valid])
violent.bi <- as.numeric(violent > 0)
summary(violent.bi[valid])
# 58% of the articles seem to have some reference to violence or crime
with(bimydf,
cor(data.frame(violent.bi,
suicide,
accidental,
accidentally,
multiple,
children,
drug)))
# Looking at the correlation matrix we find further patterns. Our violent
# crime variable is negatively correlated with suicide, accidental, and
# accidentally, while positively correlated with multiple, children, and
# drug.
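# As a rough sketch of the same pattern, a simple cross-tabulation of the
# violence indicator against suicide mentions:
table(violent = violent.bi, suicide = bimydf$suicide)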
# Next I will plot the data over time, with each death mapped to a row.
library("RColorBrewer")
# I use color brewer to mix a brew of colors with blues representing the
# youngest aged victims and darkred representing the oldest victims.
collist <- colorRampPalette(c("blue", "darkred"))(50)
# Prepare to save as png
png("2013-03-13GunDeaths2.png", width = 650, height = 1200)
# I adjust the margins.
par(mar=c(5,1,3,1))
# Open a plot window
plot(0,0, xlim=c(.75,ndict-.5), ylim=c(.03,.97),
type="n", xaxt = "n", xlab="",
yaxt = "n", ylab="Account",
main="Wordcount Beadplot-US Gun Deaths Articles")
# Name the columns
text(cex=1, x=1:ndict, y=-.015, colnames(dtmDic),
xpd=TRUE, srt=60, pos=2)
# Insert column guides
for (i in 1:ndict) abline(v=i-.25, col=gray(.9-.1*(i %% 4)), lwd=4)
# Insert a small horizontal line for each word associated with
# each article
# (Missing ages or ages below 1 are clamped to the first color.)
for (i in 1:ndict) for (ii in 1:nobs)
  if (dtmDic[ii, i] > 0) lines(c(i - .75, i + .15), c(ii/nobs, ii/nobs),
    col = collist[min(max(gun.deaths$age[ii], 1, na.rm = TRUE), 50)],
    lwd = .5)
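# As an optional sketch, a small color key mapping bead color to victim age:
legend("topright", legend = c("age 1", "age 25", "age 50+"),
       col = collist[c(1, 25, 50)], lwd = 2, bg = "white", cex = .8)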
dev.off()
# Blue indicates a young victim while dark red indicates a victim who is
# 50 years or older.