Exploring London Crime with R heat maps

[This article was first published on r-tastic, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently, I had a real pleasure to work with various types of data pulled from public APIs, one of them being data.police.uk API. Oh, those hours of pure intellectual exploration it’s given me! I have a soft spot for crime data and I explored it using heat maps in the past. Apart from checking and visualising stats for the new area in London we just moved to, it made me think more about good and better ways of presenting complex and multidimensional information. I’m dedicating this post to my favourite heat maps, so expect some lovely colours (side by side with insightful findings on London crime)!

Project description

I’m going to scrape Wikipedia to get coordinates of all Tube Stations in London. Then I’ll pick a random sample of 20 of them and use their latitude and longitude to pull all crime information for those locations between Jan 2016 and June 2018. Then, I’ll explore crime frequency and crime type per location over time using faceted heatmaps. Finally, I play a bit with leaflet package to explore best way of visualising this data on geographical heat maps.

Packages

Here’s a sweet collection of packages required to run this analysis:

library(httr)
library(rvest)
library(purrr)
library(dplyr)
library(ggplot2)
library(jsonlite)
library(lubridate)
library(stringr)
library(viridis)
library(leaflet)
library(leaflet.extras)
library(htmltools)

Scraping Wiki

Let’s first start with scraping a table from Wikipedia website that holds coordinates for all London Tube stations. I used Google Selector tool for it, you can learn more about it here:

### scraping with rvest() ####

# Wikipedia linke
wiki_link <- "https://wiki.openstreetmap.org/wiki/List_of_London_Underground_stations"

# scraping information from the table in the above URL 
wiki_tbl <- wiki_link %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]') %>% 
  html_table(fill = TRUE) %>% 
  as.data.frame()

head(wiki_tbl)
##            Name    Latitude    Longitude          Platform...Entrance
## 1    Acton Town   51.502500    -0.278126                     Platform
## 2 Acton Central 51.50883531 -0.263033174                     Entrance
## 3 Acton Central 51.50856013 -0.262879534                     Platform
## 4       Aldgate    51.51394     -0.07537 Aldgate High Street entrance
## 5  Aldgate East    51.51514     -0.07178                     Entrance
## 6      Alperton    51.54097     -0.30061                     Platform
##        Collected.By Collected.On                         Line Step.free
## 1    User:Gagravarr     24/11/06         District, Piccadilly      <NA>
## 2    User:Firefishy   08/05/2007            London Overground      <NA>
## 3    User:Firefishy   08/05/2007            London Overground      <NA>
## 4       User:Morwen    28/4/2007                 Metropolitan        No
## 5 User:Parsingphase       (2006) District, Hammersmith & City      <NA>
## 6 User:Mattwestcott   28/01/2007                   Piccadilly      <NA>

First thing done. Now, it would take a stupid amount of data to pull all the crime information for extended periods of time for ALL of the Tube stations, so let’s pick a sample of 20 places instead and take it from there:

# pick only coordinates and a station name from Wikipedia table
wiki_selected <- 
  wiki_tbl %>% 
  select(name = Name,
         lat = Latitude,
         long = Longitude)

# pick a random sample of 20 Tube stations
set.seed(999)
k <- sample(as.numeric(rownames(wiki_selected)), 20)
wiki_sample <- wiki_selected[k, ]
wiki_sample
##                         name         lat         long
## 118 Heathrow Terminals 1-2-3  51.4712899   -0.4527438
## 176              North Acton    51.52370     -0.26019
## 29              Boston Manor   51.495371    -0.325573
## 255         Tooting Broadway 51.42780008 -0.167910811
## 235           South Woodford    51.59088     +0.02726
## 36          Brondesbury Park    51.54060     -0.21093
## 180            North Wembley 51.56258091 -0.304072648
## 24             Bethnal Green    51.52718     -0.05504
## 115      Harrow & Wealdstone 51.59205973 -0.334725352
## 182                 Northolt   51.548054    -0.368257
## 84                    Epping    51.69365     +0.11495
## 161              Manor House    51.57038     -0.09601
## 9                    Archway    51.56536     -0.13474
## 286           West Hampstead 51.54755518 -0.190999685
## 30              Bounds Green    51.60700     -0.12418
## 226            Sloane Square 51.49258474 -0.156090904
## 48        Chalfont & Latimer    51.66822     -0.56053
## 298                Wimbledon    51.42200     -0.20544
## 38                 Burnt Oak    51.60300     -0.26427
## 20                 Bayswater    51.51224    -0.187569

Pulling data from data.police.uk API

Now that we have coordinates, we can feed them into the UK Police API that requires the following parameters: latitude, longitude and year-month which we want to pull crime data for. To make everyone’s life easier, I define a function get_crime() which will require all three parameters but also a name of the station that coordinates correspond to:

get_crime <- function(lat, long, date, name) {
  # pull specified data from the API 
  base <- "https://data.police.uk/api/crimes-street/all-crime"
  police_url <- paste0(base, "?", "lat=", lat, "&lng=", long, "&date=", date)  
  get_police <- GET(police_url)
  
  # parse JSON file into a clean data frame
  police_data_json <- content(get_police, "text")
  police_data <- fromJSON(police_data_json, flatten = TRUE) %>% 
    mutate(location = name)
  
  return(police_data)
  
}

So far so good. However, our brand new get_crime() function can pull data only for a single location and a single month. Let’s use it in the for loop together with amazing pmap() function from purrr package to get all the data we need. WARNING: the code below needs good 15 minutes to run, so think twice before you execute it!

# create a series of months that we want data for
iter_months <- str_sub(seq(ymd('2016-01-01'), ymd('2018-06-20'),
                           by = 'month'),
                       start = 1, end = 7)

# pull data from API for all the locations over all the months
final_df<-data.frame()
for(i in 1:length(iter_months)){
  # result will be a list
  pre_final_list <- pmap(list(lat = wiki_sample$lat,
                              long = wiki_sample$long,
                              name = wiki_sample$name,
                              date = iter_months[i]),
                         get_crime)
  # turn a list of locations in one month into a data.frame
  pre_final_df <- bind_rows(pre_final_list) 
  # put all the clean data.frames together
  final_df <- bind_rows(final_df, pre_final_df)
  
}

Phew! That was hard work, but it was clearly worth it! Let’s have a glimpse()!

## Observations: 421,496
## Variables: 14
## $ category                <chr> "anti-social-behaviour", "anti-social-...
## $ location_type           <chr> "Force", "BTP", "Force", "Force", "For...
## $ context                 <chr> "", "", "", "", "", "", "", "", "", ""...
## $ persistent_id           <chr> "", "", "", "", "", "", "", "", "", ""...
## $ id                      <int> 46149941, 64792634, 46150855, 46150854...
## $ location_subtype        <chr> "", "LU STATION", "", "", "", "", "", ...
## $ month                   <chr> "2016-01", "2016-01", "2016-01", "2016...
## $ location.latitude       <chr> "51.469708", "51.471600", "51.479579",...
## $ location.longitude      <chr> "-0.451842", "-0.452836", "-0.463993",...
## $ location.street.id      <int> 944231, 1488721, 944266, 944296, 94426...
## $ location.street.name    <chr> "On or near Airport/airfield", "Heathr...
## $ outcome_status.category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ outcome_status.date     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ location                <chr> "Heathrow Terminals 1-2-3", "Heathrow ...

Looks great, but did we really pull data for all the locations and all months?

table(final_df$location, final_df$month)
##                           
##                            2016-01 2016-02 2016-03 2016-04 2016-05 2016-06
##   Archway                      831     832     814     809     862     914
##   Bayswater                   1377    1198    1365    1495    1524    1617
##   Bethnal Green               1803    1734    1827    1891    2144    2251
##   Boston Manor                 291     239     278     323     324     294
##   Bounds Green                 523     512     581     509     587     590
##   Brondesbury Park             870     777     805     847    1025     987
##   Burnt Oak                    557     478     600     544     643     661
##   Chalfont & Latimer            17      25      20      26      18      14
##   Epping                        68      77      93     107      92     115
##   Harrow & Wealdstone          484     423     498     486     530     541
##   Heathrow Terminals 1-2-3     166     143     152     137     212     209
##   Manor House                  821     819     944     926    1045    1057
##   North Acton                  349     331     346     356     391     400
##   North Wembley                472     453     506     477     574     590
##   Northolt                     307     257     275     326     349     387
##   Sloane Square               1248    1040    1084    1260    1414    1312
##   South Woodford               225     169     261     251     236     217
##   Tooting Broadway             554     561     621     651     743     662
##   West Hampstead               790     784     790     807     973     909
##   Wimbledon                    378     346     422     297     348     359
##                           
##                            2016-07 2016-08 2016-09 2016-10 2016-11 2016-12
##   Archway                     1020     918     876     983     816     871
##   Bayswater                   1757    2119    1494    1596    1505    1608
##   Bethnal Green               2600    2459    2296    2278    2078    2036
##   Boston Manor                 351     358     313     327     277     279
##   Bounds Green                 642     550     572     622     567     610
##   Brondesbury Park            1104    1179    1043    1055     891     977
##   Burnt Oak                    648     639     628     628     644     563
##   Chalfont & Latimer            18      23      26      22      20      21
##   Epping                        81      92      87     108     108     103
##   Harrow & Wealdstone          608     573     521     526     471     536
##   Heathrow Terminals 1-2-3     208     221     210     211     207     262
##   Manor House                 1218    1096    1045    1163    1020     987
##   North Acton                  394     461     459     441     405     427
##   North Wembley                669     568     539     597     503     594
##   Northolt                     462     393     381     315     313     347
##   Sloane Square               1465    1526    1327    1343    1249    1342
##   South Woodford               247     256     229     255     227     218
##   Tooting Broadway             836     751     771     746     711     653
##   West Hampstead               973     894     888     925     829     956
##   Wimbledon                    454     424     394     390     416     362
##                           
##                            2017-01 2017-02 2017-03 2017-04 2017-05 2017-06
##   Archway                      867     851    1004    1021    1186    1187
##   Bayswater                   1339    1227    1744    1579    1537    1507
##   Bethnal Green               1867    1991    2090    2126    2217    2089
##   Boston Manor                 305     281     337     302     334     367
##   Bounds Green                 535     552     601     618     657     582
##   Brondesbury Park             907     869    1050     924    1100    1082
##   Burnt Oak                    532     480     597     600     654     581
##   Chalfont & Latimer            29      23      25      24      34      30
##   Epping                        83      83     122      86     130     112
##   Harrow & Wealdstone          485     494     543     494     543     529
##   Heathrow Terminals 1-2-3     226     198     231     209     260     231
##   Manor House                  959    1004    1180    1109    1178    1253
##   North Acton                  324     370     449     429     432     385
##   North Wembley                448     460     527     501     582     576
##   Northolt                     281     271     347     373     342     392
##   Sloane Square               1166    1094    1273    1199    1470    1269
##   South Woodford               254     233     290     231     302     319
##   Tooting Broadway             572     604     710     642     707     733
##   West Hampstead               853     848    1026     956    1078    1026
##   Wimbledon                    359     356     414     384     432     378
##                           
##                            2017-07 2017-08 2017-09 2017-10 2017-11 2017-12
##   Archway                     1291    1305    1244    1205    1089     958
##   Bayswater                   1772    2134    1526    1592    1573    1531
##   Bethnal Green               2526    2103    2070    2239    2112    1742
##   Boston Manor                 351     372     316     365     354     246
##   Bounds Green                 635     719     658     729     791     606
##   Brondesbury Park            1164    1181    1140    1182    1085    1007
##   Burnt Oak                    705     629     605     638     594     497
##   Chalfont & Latimer            37      29      26      26      39      26
##   Epping                       114     118     132     140     116     146
##   Harrow & Wealdstone          643     575     545     568     562     485
##   Heathrow Terminals 1-2-3     211     248     247     239     206     250
##   Manor House                 1410    1188    1183    1240    1097     993
##   North Acton                  528     453     407     460     402     396
##   North Wembley                550     534     501     593     574     533
##   Northolt                     351     341     312     328     306     288
##   Sloane Square               1533    1484    1322    1305    1351    1424
##   South Woodford               315     323     355     356     330     282
##   Tooting Broadway             711     675     699     801     775     689
##   West Hampstead              1157    1027    1070    1181    1050     946
##   Wimbledon                    424     436     392     405     408     371
##                           
##                            2018-01 2018-02 2018-03 2018-04 2018-05 2018-06
##   Archway                     1086     932     956    1038    1008    1160
##   Bayswater                   1457    1392    1470    1663    1550    1597
##   Bethnal Green               2033    1668    1902    2160    2254    2315
##   Boston Manor                 289     282     303     345     348     338
##   Bounds Green                 597     600     612     614     645     666
##   Brondesbury Park            1137     866     992    1068    1106    1136
##   Burnt Oak                    515     518     602     634     693     645
##   Chalfont & Latimer            25      22      24      31      31      35
##   Epping                       145     100     113      88     123      98
##   Harrow & Wealdstone          604     478     536     533     618     557
##   Heathrow Terminals 1-2-3     201     204     238     210     208     233
##   Manor House                 1032     899    1012    1136    1131    1243
##   North Acton                  377     345     390     397     397     405
##   North Wembley                492     515     566     529     547     573
##   Northolt                     325     274     332     324     377     328
##   Sloane Square               1304    1331    1385    1260    1393    1414
##   South Woodford               306     258     278     288     248     287
##   Tooting Broadway             723     568     608     687     795     786
##   West Hampstead              1058     823    1016     941     923    1000
##   Wimbledon                    366     355     411     376     444     397

Oh, yes we did!

Let’s finish off with some minor data cleaning: API returns coordinates for crimes, but not for the original location of interest (Tube station), let’s fix it:

final_df <- final_df %>% 
    left_join(wiki_sample, by = c("location" = "name")) %>%
    rename(date = month,
         search_lat = lat,
         search_long = long)

Finally, for some heat maps, all time data will be too much to handle. In those cases I’ll focus on one month only, July 2017, and I pick only random 10% of it, to make interactive maps more responsive:

july_data <- final_df %>%
  filter(date == "2017-07") %>% 
  sample_frac(.1)
## Warning: package 'bindrcpp' was built under R version 3.4.4

Heat-mapping

FACETED HEAT MAPS

Here come the heat maps! Let’s start summarising the frequency of all crimes for each location and month:

# summarise available data and save it in a data.frame
police_grid <- final_df %>%
  unique() %>% 
  group_by(location, date) %>% 
  summarise(n_crimes = n())

### faceted heat map of all the crimes per location and month!
ggplot(police_grid,aes(x=date,y=location, fill=n_crimes))+
  #add border white colour of line thickness 0.25
  geom_tile(colour="white",size=0.25)+
  labs(x="",y="")+
  #remove extra space
  scale_y_discrete(expand=c(0,0))+
  #define new breaks on x-axis
  scale_x_discrete(expand=c(0,0), 
                   breaks=c("2016-01","2017-01","2018-01"))+
  scale_fill_viridis(option = "B", name = "Number of crimes") +
  ggtitle("Number of crimes at London Tube stations") +
  coord_fixed()+
  #set a base size for all fonts
  theme_grey(base_size=8)+
  #theme options
  theme(
    # vertical labels on x axis
    axis.text.x = element_text(),
    #bold font for both axis text
    axis.text=element_text(face="bold"),
    #set thickness of axis ticks
    axis.ticks=element_line(size=0.4),
    #remove plot background
    plot.background=element_blank(),
    #remove plot border
    panel.border=element_blank()
  ) 

So pretty! I’m a big-big fan of viridis colour pallette and I must say you see it at its best in heat maps. And how informative this heat map is! From the first glimpse you can see straight away which areas tend to have more crime (e.g. Bethnal Green, Bayswater) and when (summer months). We know what and when, what about the type of crime they experiene most? This can be sorted with another faceted heatmap:

# summarise data by location and crime type
crime_grid <- final_df %>% 
  group_by(location, category) %>% 
  summarise(n_crimes = n()) 

# plot above data in faceted heat map
ggplot(crime_grid,aes(x=category,y=location, fill=n_crimes))+
  geom_tile(colour="white",size=0.25)+
  labs(x="",y="")+
  scale_y_discrete(expand=c(0,0))+
  scale_fill_viridis(option = "B", name = "Number of crimes") +
  coord_fixed()+
  theme_grey(base_size=8)+
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text=element_text(face="bold"),
    axis.ticks=element_line(size=0.4),
    plot.background=element_blank(),
    panel.border=element_blank()
  ) #+

  #guides(fill=guide_legend(title="Number of crimes"),
  #       colour = guide_legend(reverse=T))

Another informative beauty! You can see that 3 types of crime dominate: anit-social behaviour, violent crime and vehicle theft. Again, avoid Bethnal Green if you can and make sure you hang out in Epping, instead!

GEOGRAPHIC HEAT MAPS

Facets are brilliant when it comes to quantitative comparison, but where are these areas? are they close/far away from each other? These questions can be only answered by geo heat maps. Lets plot a very basic one using leaflet package:

july_data %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addHeatmap(lng=~as.numeric(location.longitude),
             lat=~as.numeric(location.latitude),
             radius = 8)

Great, it looks like the farther away you are from the city centre, the fewer crimes you see (doh!). Now, we can add information about number of plotted crime-points using clusterOptions functionality, have a look:

# adding color schemes to crime types
color_scheme <- viridis::cividis(n_distinct(july_data$category))
pal = colorFactor(color_scheme, july_data$category)

july_data %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~as.numeric(location.longitude),
                   ~as.numeric(location.latitude),
                   fillColor = ~pal(category),
                   stroke = FALSE, fillOpacity = 0.8,
                   clusterOptions = markerClusterOptions(), # adds summary circles
                   popup = ~as.character(category)
  ) %>% 
  addHeatmap(lng=~as.numeric(location.longitude),
             lat=~as.numeric(location.latitude),
             radius = 8)

To leave a comment for the author, please follow the link and comment on their blog: r-tastic.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)