[This article was first published on Wiekvoet, and kindly contributed to R-bloggers].
In September last year I wrote a post using the shootingtracker data. Shootingtracker attempts to register all shootings with at least four victims, whether wounded or dead. The data start on January 1st, 2013, which means that by now the amount of data has almost doubled. This surely is a dataset where I hope the makers find less and less to add. The analysis shows that Sundays in summer have the highest number of shootings per day: three to four shootings on a Sunday in July and August.
The data sit in two pages on shootingtracker, 2013 and 2014. In preparation for this post I copied and pasted those data into Notepad and removed headers and footers. In the 2013 data I kept the column names. After reading the data, the first steps remove some of the (for me) extraneous information, such as the references to where the data came from. Subsequently the state and town are separated, and a few records which do not have the correct state abbreviation are corrected. Finally, the year 13 is reformatted to 2013 and a date variable is created. The last record used is from July 9th, 2014.

    r13 <- readLines('raw13.txt')
    r14 <- readLines('raw14.txt')
    r1 <- c(r13,r14)
    head(r1)
    # "Number\t Date\t Alleged Shooter\t Killed\t Wounded\t Location\t References"
    # "1\t1/1/13\tCarlito Montoya\t4\t0\tSacramento, CA\t"
    # " [Expand] "
    # "2\t1/1/13\tUnknown\t1\t3\tHawthorne, CA\t"
    # " [Expand] "
    # "3\t1/1/13\tJulian Sims\t0\t4\tMcKeesport, PA\t"
    tail(r1)
    # "141\t7/8/2014\tUnknown\t1\t4\tSan Bernardino, CA\t"
    # " [Expand] "
    # "142\t7/8/2014\tUnknown\t0\t5\tProvidence, RI\t"
    # " [Expand] "
    # "143\t7/9/2014\tRonald Lee Haskell\t6\t1\tHouston, TX\t"
    # " [Expand] "

    # drop bracketed residue such as [Expand], then the lines left empty by that
    r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
    r3 <- gsub('^ *$','',r2)
    r4 <- r3[r3!='']
    # drop the trailing tab and the References column name
    r5 <- gsub('\\t$','',r4)
    r6 <- gsub('\\t References$','',r5)
    r7 <- read.table(textConnection(r6),
        sep='\t',
        header=TRUE,
        stringsAsFactors=FALSE)

    # split "Town, ST" into town and state, then repair non-standard state entries
    r7$Location[r7$Location=='Washington DC'] <- 'Washington, DC'
    r8 <- read.table(textConnection(as.character(r7$Location)),
        sep=',',
        col.names=c('Location','State'),
        stringsAsFactors=FALSE)
    r8$State <- gsub(' ','',r8$State)
    r8$State[r8$State=='Tennessee'] <- 'TN'
    r8$State[r8$State=='Ohio'] <- 'OH'
    r8$State[r8$State=='Kansas'] <- 'KS'
    r8$State[r8$State=='Louisiana'] <- 'LA'
    r8$State[r8$State=='Illinois'] <- 'IL'
    r8$State <- toupper(r8$State)
    r7$State <- r8$State
    r7$Location <- r8$Location
    # records from Puerto Rico are dropped
    r7 <- r7[r7$State != 'PUERTORICO',]

    # two-digit years (2013 data) become four-digit, then parse the date
    Sys.setlocale(category = "LC_TIME", locale = "C")
    r7$Date <- gsub('/13$','/2013',r7$Date)
    r7$date <- as.Date(r7$Date,format="%m/%d/%Y")
Effect of day and month
The effect of day of the week is pretty easy to plot: just add the weekday and run qplot(weekday). In this case two complexities arise. First, I want the days in a specific order, Monday to Sunday. Second, not all weekdays occur equally often in the period covered by the data. This is not enough to invalidate the plot, but since I had to correct for the occurrence of months in a similar plot, I decided to reuse that code for weekdays. The data.frame alldays is used to count how often each weekday occurs in the data set. I am not going to over-analyze this; Sundays stick out in a negative way.
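As a rough sketch of the weekday correction (not the exact code behind the plot), the idea can be shown in base R. The dates below are made-up stand-ins for r7$date, the one-month period is chosen only to keep the example small, and the names shootings, daynames and rate are introduced here for illustration.

```r
# Made-up dates standing in for r7$date; NOT shootingtracker records.
invisible(Sys.setlocale(category = "LC_TIME", locale = "C"))  # English day names
shootings <- as.Date(c("2013-01-06", "2013-01-06", "2013-01-07",
                       "2013-01-13", "2013-01-19"))

# Every calendar day in the (toy) observation period
alldays <- data.frame(date = seq(as.Date("2013-01-01"),
                                 as.Date("2013-01-31"), by = "day"))

# Fixing the factor levels keeps tables and plots in Monday-to-Sunday order
daynames <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
alldays$weekday <- factor(weekdays(alldays$date), levels = daynames)
shot_weekday    <- factor(weekdays(shootings), levels = daynames)

n_days  <- as.vector(table(alldays$weekday))  # occurrences of each weekday
n_shots <- as.vector(table(shot_weekday))     # shootings on each weekday

# Shootings per calendar day, by weekday
rate <- n_shots / n_days
names(rate) <- daynames
rate
```

Dividing by n_days is what corrects for weekdays that occur more often in the period; a call such as barplot(rate) would then give a plot comparable in spirit to the one described.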
It is not obvious, given the unequal distribution of weekdays over months, how significant a month effect is. To examine this, I have reorganized the data into shootings per day. Data frame alldays is used again, now to ensure that days with no shootings are correctly represented. The modeling shows a clear effect of month, and an interaction between weekday and month that is on the brink of significance.
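A minimal sketch of this reorganization and of one possible model follows. The event dates are simulated, not the shootingtracker records, and the Poisson GLM with a likelihood-ratio comparison is an assumption of this sketch: it is a common choice for daily counts, but the model family is not stated in the text above.

```r
set.seed(1)  # simulated events only; NOT the shootingtracker records
invisible(Sys.setlocale(category = "LC_TIME", locale = "C"))

# One row per calendar day in the period covered by the post
alldays <- data.frame(date = seq(as.Date("2013-01-01"),
                                 as.Date("2014-07-09"), by = "day"))
events <- sample(alldays$date, size = 1000, replace = TRUE)  # stand-in for r7$date

# Count events per day; using all days as factor levels keeps zero-count days
counts <- table(factor(as.character(events),
                       levels = as.character(alldays$date)))
alldays$n <- as.vector(counts)

daynames <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
alldays$weekday <- factor(weekdays(alldays$date), levels = daynames)
alldays$month   <- factor(months(alldays$date), levels = month.name)

# Assumed model: Poisson regression, without and with the interaction
m_main <- glm(n ~ weekday + month, family = poisson, data = alldays)
m_int  <- glm(n ~ weekday * month, family = poisson, data = alldays)

# Likelihood-ratio test for the weekday-by-month interaction
cmp <- anova(m_main, m_int, test = "Chisq")
cmp
```

Because the events here are drawn uniformly at random, this comparison should usually show no interaction; the point is only the shape of the per-day data and the model comparison.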
To understand the interaction, a plot is made of the expected values by weekday and month. Looking at this plot, the Sunday effect is most pronounced in summer. In addition, I would not be surprised if next Sunday has four shootings again.
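Expected values by weekday and month can be obtained by predicting from the fitted model on a grid of all combinations. The sketch below assumes a Poisson GLM with interaction and uses simulated daily counts, so the numbers it produces say nothing about the real data; perday, fit, grid and expected are names chosen for this example.

```r
set.seed(2)  # simulated counts only; NOT the shootingtracker records
invisible(Sys.setlocale(category = "LC_TIME", locale = "C"))

daynames <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
perday <- data.frame(date = seq(as.Date("2013-01-01"),
                                as.Date("2014-07-09"), by = "day"))
perday$weekday <- factor(weekdays(perday$date), levels = daynames)
perday$month   <- factor(months(perday$date), levels = month.name)
perday$n       <- rpois(nrow(perday), lambda = 2)  # hypothetical daily counts

# Assumed model: Poisson regression with weekday-by-month interaction
fit <- glm(n ~ weekday * month, family = poisson, data = perday)

# One row per weekday-month combination, with the model's expected count
grid <- expand.grid(weekday = factor(daynames, levels = daynames),
                    month   = factor(month.name, levels = month.name))
grid$expected <- predict(fit, newdata = grid, type = "response")

# Months as rows, weekdays as columns
expected <- xtabs(expected ~ month + weekday, data = grid)
round(expected, 2)
```

With the full interaction in the model, each expected value is simply the mean count of that weekday in that month; a model without the interaction would smooth these cells.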
One might ask whether shootings which get much media attention give rise to copycats. This is not easy to analyze, given that time effects from day and month clearly exist. Besides, which shootings get a lot of media attention? Yet we can look at the number of shootings over time and at least mark the shootings with the most victims in the plot. The number of victims to be marked has arbitrarily been chosen as 18 or more. In this plot I cannot see a connection.
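A small sketch of how such a plot can be prepared: count shootings per week and pick out the events with 18 or more victims. The records below are made up and only mimic the date, Killed and Wounded columns of r7; aggregating by week rather than by day is an assumption of this sketch.

```r
# Made-up records with the same columns as r7; NOT real shootingtracker rows
toy <- data.frame(
  date    = as.Date(c("2013-01-01", "2013-01-03", "2013-01-09",
                      "2013-01-10", "2013-01-20")),
  Killed  = c(4, 1, 0, 12, 2),
  Wounded = c(0, 3, 4, 8, 3)
)
toy$victims <- toy$Killed + toy$Wounded

# Label each record with the Monday that starts its week
toy$week <- as.Date(as.character(cut(toy$date, breaks = "week")))

# Shootings per week; weeks without any record do not appear in this table
perweek <- aggregate(victims ~ week, data = toy, FUN = length)
names(perweek)[2] <- "shootings"

# Events at or above the arbitrary threshold of 18 victims
big <- toy[toy$victims >= 18, ]

perweek
big
```

From here, plot(perweek$week, perweek$shootings, type = "h") followed by abline(v = big$date, lty = 2) draws the weekly counts with the largest events marked as dashed vertical lines.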