stats19: a package for road safety research

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Introduction

stats19 is a new R package enabling access to and working with
Great Britain’s official road traffic casualty database,
STATS19.

We started the package in late 2018 following three main motivations:

  1. The release of the 2017 road crash statistics, which showed
    worsening road safety in some areas, increasing the importance of
    making the data more accessible.
  2. The realisation that many researchers were writing ad hoc code to
    clean the data, with a huge amount of duplicated (wasted) effort and
    potential for mistakes to lead to errors in the labelling of the
    data (more on that below).
  3. An understanding of the concept of ‘modularity’ in software design,
    following the Unix
    philosophy
    that
    programs should ‘do one thing and do it well’. This realisation has
    led to code inside the rOpenSci-hosted package
    stplanr being split out
    into two separate packages:
    cyclestreets and
    stats19.

We have a wider motivation: we want the roads to be safer. By making
data on the nature of road crashes more publicly accessible to inform policy,
we hope this package saves lives.

stats19 has been tested and peer reviewed thanks to rOpenSci, and is now published in the
Journal of Open Source Software1, making this an ideal time to present the package to the world 🚀.
For installation and the code, see its home in rOpenSci:
https://github.com/ropensci/stats19, and the package documentation at
https://itsleeds.github.io/stats19/.

In this post, we’ll provide a bit of context, show how the package works, and provide ideas for future work building on the experience. Version 0.2.0 has just been released on CRAN.

Robin Lovelace (package maintainer) has written code (robinlovelace/bikeR) for an academic paper on cycle safety in Yorkshire based on STATS19 data2. Malcolm Morgan (one of the package authors) has written code to process different STATS19 data for the Cycling Infrastructure Prioritisation Toolkit (see cyipt/stats19).

The large and complex STATS19 data from the UK’s Department for Transport, which is open access but difficult to use, represented a perfect opportunity for us to get stuck into a chunky data-processing challenge.

What is STATS19?

The name comes from a UK police form called STATS19 (note the capital letters). Another document, STATS20, is the guidance for officers filling in a STATS19 form. The Department for Transport (DfT) also names the dataset STATS19 on the main web page that links to the open access road crash data.

One of the reviewer comments from the rOpenSci review process (which has 77 comments – lots of knowledge shared) alluded to the package’s esoteric name:

I confess I wish the package name was more expressive–stats19 sounds like an introductory statistics class.

We agree! However, the priority with the package is to remain faithful to the data, and alternative names, such as stats19data, roadcrashesUK and roadSafetyData, were not popular. Furthermore, the term ‘stats19’ is strongly associated with road crash data. On Wikipedia, the URL
https://en.wikipedia.org/wiki/STATS19 resolves to
https://en.wikipedia.org/wiki/Reported_Road_Casualties_Great_Britain.

An important point is that the dataset omits crashes in which
nobody was hurt, as emphasised by another government
document.

The importance of road safety, and of informed decision making based on crash data, cannot be overstated. Setting aside the matter of life and death, two numbers from a strategy document published by the UK government (2011) give a sense of the scale:

The economic welfare costs [of road collisions] are estimated at
around £16 billion a year while insurance payouts for motoring claims
alone are now over £12 billion a year.

Even more shocking are the global statistics, as summarised by an open access and reproducible paper submitted for publication, which uses data from the package to explore car-pedestrian crashes:

While so many people die on the roads each year in the UK (1,793
people in 2017, 3 deaths per 100,000) and worldwide (1,250,000 people
in 2015, 17 deaths per 100,000) and ‘vision zero’ remains a Swedish
dream3, we urge people researching STATS19 and other
road safety datasets to focus on a more urgent question: how to stop
this carnage?

The road crash data in stats19

There are three main types of CSV file released by the DfT:
accidents, vehicles and casualties tables. There is a schema
covering these tables but a good amount of work is needed to understand
it, let alone be able to process the data contained within the files and convert
the integers they contain into meaningful data.

There are separate files for each of the above tables for certain years (e.g. 2016, 2017), but not for every year back to 1979. The largest chunk is the 1979 – 2004 data, which is made available in a huge ZIP file (link). Unzipped, this contains the following 3 files, which occupy almost 2 GB on your hard drive:

721M Apr  3  2013 Accidents7904.csv
344M Apr  3  2013 Casualty7904.csv
688M Apr  3  2013 Vehicles7904.csv
# total 1.753 GB data

How stats19 works

With those introductions out of the way, let’s see how the package
works and how to access STATS19 data, back to 1979. First
install the package in the usual way:

# release version - currently 0.2.0
install.packages("stats19") 
# dev version
# remotes::install_github("ropensci/stats19") 

Attach the package as follows:

library(stats19)

The easiest way to get STATS19 data is with get_stats19(). This
function takes two main arguments, year and type. The year can be any
year between 1979 and 2017.

crashes_2017 = get_stats19(year = 2017, type = "Accidents", ask = FALSE)
nrow(crashes_2017)
## [1] 129982

What just happened? We just downloaded, cleaned and read in data on all
road crashes recorded by the police in 2017 across Great Britain. We can
explore the crashes_2017 object a little more:

column_names = names(crashes_2017)
length(column_names)
## [1] 32
head(column_names)
## [1] "accident_index"         "location_easting_osgr" 
## [3] "location_northing_osgr" "longitude"             
## [5] "latitude"               "police_force"
class(crashes_2017)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
kableExtra::kable(head(crashes_2017[, c(1, 4, 5, 7, 10)]))

accident_index   longitude   latitude   accident_severity   date
2017010001708    -0.080107   51.65006   Fatal               2017-08-05
2017010009342    -0.173845   51.52242   Slight              2017-01-01
2017010009344    -0.052969   51.51410   Slight              2017-01-01
2017010009348    -0.060658   51.62483   Slight              2017-01-01
2017010009350    -0.072372   51.57341   Serious             2017-01-01
2017010009351    -0.353876   51.43876   Slight              2017-01-01
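The same call pattern extends to multiple years. A minimal sketch (assuming the columns are consistent across years; each call downloads, or reuses, that year's files):

```r
# Sketch: combine two years of crash data into one table.
# Each get_stats19() call downloads (or reuses) one year's data.
library(purrr)
library(stats19)
crashes_recent = map_dfr(
  2016:2017,
  function(y) get_stats19(year = y, type = "Accidents", ask = FALSE)
)
nrow(crashes_recent)  # crashes for 2016 and 2017 combined
```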

The package contains the names of all “zip” files released by the DfT
and hosted on Amazon servers. These file names are stored in the
file_names object. For example:

stats19::file_names$dftRoadSafetyData_Vehicles_2017.zip
## [1] "dftRoadSafetyData_Vehicles_2017.zip"
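Because file_names is a plain named list, base R tools can be used to find the files for a given table or year. A small sketch:

```r
# Sketch: filter the packaged file names with base R.
all_files = names(stats19::file_names)
# Zip files containing the vehicles tables:
grep("Vehicles", all_files, value = TRUE)
# Zip files relating to 2017:
grep("2017", all_files, value = TRUE)
```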

You can also get the raw data (if you really want!) to see how much
more useful the data is after it has been cleaned and labelled by the
stats19 package, compared with the data provided by
government:

crashes_2017_raw = get_stats19(year = 2017, type = "Accidents", ask = FALSE, format = FALSE)
## Files identified: dftRoadSafetyData_Accidents_2017.zip

##    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip

## Data already exists in data_dir, not downloading

## Data saved at /tmp/RtmpYh76XA/dftRoadSafetyData_Accidents_2017/Acc.csv

## Reading in:

## /tmp/RtmpYh76XA/dftRoadSafetyData_Accidents_2017/Acc.csv
kableExtra::kable(head(crashes_2017_raw[, c(1, 4, 5, 7, 10)]))

The first two columns below come from the raw read; the next two have been formatted by the stats19 package:

kableExtra::kable(cbind(
  head(crashes_2017_raw[1:2, c(7, 10)]),
  head(crashes_2017[1:2, c(7, 10)])
  ))

Accident_Severity   Date         accident_severity   date
1                   05/08/2017   Fatal               2017-08-05
3                   01/01/2017   Slight              2017-01-01

Note: the severity type is not labelled (this problem affects dozens of
columns), the column names are inconsistent, and the dates have not
been cleaned and converted into a user-friendly date (POSIXct) class.
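Conceptually, the labelling step is a join between the raw integer codes and a lookup table. A minimal illustration with a hand-made lookup (for illustration only; the real code-to-label mapping ships inside the stats19 package):

```r
library(dplyr)
# Hand-made lookup for illustration; severity codes 1/2/3
# correspond to Fatal/Serious/Slight in the tables above.
severity_lookup = tibble::tibble(
  code  = c(1, 2, 3),
  label = c("Fatal", "Serious", "Slight")
)
raw = tibble::tibble(Accident_Severity = c(1, 3))
raw %>%
  left_join(severity_lookup, by = c("Accident_Severity" = "code")) %>%
  rename(accident_severity = label)
```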

Creating geographic crash data

An important feature of STATS19 data is that the “accidents” table
contains geographic coordinates. These are provided at ~10m resolution
in the UK’s official coordinate reference system (the Ordnance Survey
National Grid, EPSG code 27700). stats19 converts the non-geographic
tables created by format_accidents() into the geographic data form of
the sf package with the
function format_sf() as
follows:

crashes_sf = format_sf(crashes_2017)
## 19 rows removed with no coordinates
# crashes_sf = format_sf(crashes_2017, lonlat = TRUE) # provides the data in lon/lat format
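Under the hood this is a standard sf conversion. A rough manual equivalent, assuming the easting/northing column names shown earlier (format_sf() also handles missing coordinates and other details for you):

```r
library(sf)
library(dplyr)
# Sketch: convert the accidents table to an sf object by hand.
crashes_sf_manual = crashes_2017 %>%
  filter(!is.na(location_easting_osgr)) %>%
  st_as_sf(
    coords = c("location_easting_osgr", "location_northing_osgr"),
    crs = 27700  # Ordnance Survey National Grid
  )
```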

An example of an administrative zone dataset of relevance to STATS19
data is the boundaries of police forces in England, which is provided in
the packaged dataset police_boundaries. The following code chunk
demonstrates the kind of spatial operations that can be performed on
geographic STATS19 data, by counting and plotting the number of
fatalities per police force:

library(sf)
library(dplyr)
crashes_sf %>% 
  filter(accident_severity == "Fatal") %>% 
  select(n_fatalities = accident_index) %>% 
  aggregate(by = police_boundaries, FUN = length) %>% 
  plot()

# as Layik and Robin are based in West Yorkshire
west_yorkshire =
  police_boundaries[police_boundaries$pfa16nm == "West Yorkshire", ]
crashes_wy = crashes_sf[west_yorkshire, ]
nrow(crashes_wy) # which is 3.36%
## [1] 4371
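The percentage in the comment above is West Yorkshire’s share of all geolocated crashes in 2017, which can be computed directly:

```r
# West Yorkshire's share of all geolocated crashes in 2017:
round(100 * nrow(crashes_wy) / nrow(crashes_sf), 2)
## [1] 3.36
```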

The big picture: road safety

We can combine the three sets of tables (accidents, vehicles and casualties) to analyse the data further.
Let’s read in the datasets first:

#crashes_2017 = get_stats19(year = 2017, type = "Accidents", ask = FALSE)
casualties_2017 = get_stats19(year = 2017, type = "Casualties", ask = FALSE)
nrow(casualties_2017)
## [1] 170993
vehicles_2017 = get_stats19(year = 2017, type = "Vehicles", ask = FALSE)
nrow(vehicles_2017)
## [1] 238926

Let’s now read in the casualties that took place in West Yorkshire (using the
crashes_wy object created above), and count the number of casualties of each
type for each crash:

library(tidyr)
library(dplyr)
sel = casualties_2017$accident_index %in% crashes_wy$accident_index
casualties_wy = casualties_2017[sel, ]
cas_types = casualties_wy %>% 
  select(accident_index, casualty_type) %>% 
  group_by(accident_index) %>% 
  summarise(
    Total = n(),
    walking = sum(casualty_type == "Pedestrian"),
    cycling = sum(casualty_type == "Cyclist"),
    passenger = sum(casualty_type == "Car occupant")
    ) 
cj = left_join(crashes_wy, cas_types)

What just happened?

We found the subset of casualties that took place in West Yorkshire with
reference to the accident_index variable in the accidents table.
Then we used the dplyr function summarise() to find the number of
people who were walking, cycling or in a car when they were injured.
This new casualty dataset is joined onto the crashes_wy dataset. The
result is a spatial (sf) data frame of crashes in West Yorkshire, with
columns counting how many road users of different types were hurt. The
joined data has additional variables:

base::setdiff(names(cj), names(crashes_wy))
## [1] "Total"     "walking"   "cycling"   "passenger"

In addition to the Total number of people hurt/killed, cj contains a
column for each type of casualty (cyclist, car occupant, etc.), and a
number corresponding to the number of each type hurt in each crash. It
also contains the geometry column from crashes_sf. In other words,
joins allow the casualties and vehicles tables to be geo-referenced. We
can then explore the spatial distribution of different casualty types.
The following figure, for example, shows the spatial distribution of
pedestrians and car passengers hurt in car crashes across West Yorkshire
in 2017:

library(ggplot2)
crashes_types = cj %>% 
  filter(accident_severity != "Slight") %>% 
  mutate(type = case_when(
    walking > 0 ~ "Walking",
    cycling > 0 ~ "Cycling",
    passenger > 0 ~ "Passenger",
    TRUE ~ "Other"
  ))
ggplot(crashes_types, aes(size = Total, colour = speed_limit)) +
  geom_sf(show.legend = "point", alpha = 0.3) +
  facet_grid(vars(type), vars(accident_severity)) +
  scale_size(
    breaks = c(1:3, 12),
    labels = c(1:2, "3+", 12)
    ) +
  scale_color_gradientn(colours = c("blue", "yellow", "red")) +
  theme(axis.text = element_blank(), axis.ticks = element_blank())

Spatial distribution of serious and fatal collisions in which people who were walking on the road network ('pedestrians') were hit by a car or other vehicle.


To show what is possible when the data are in this form, and to allude to
next steps, let’s create an interactive map. We will plot crashes in
which pedestrians were hurt, from crashes_types, using the leaflet package:

library(leaflet)
crashes_pedestrians = crashes_types %>% 
  filter(walking > 0)
# convert to lon lat CRS
crashes_pedestrians_lonlat = st_transform(crashes_pedestrians, crs = 4326)
pal = colorFactor(palette = "Reds", domain = crashes_pedestrians_lonlat$accident_severity, reverse = TRUE)
map = leaflet(data = crashes_pedestrians_lonlat, height = "280px") %>%
  addProviderTiles(provider = providers$OpenStreetMap.BlackAndWhite) %>%
  addCircleMarkers(radius = 0.5, color = ~pal(accident_severity)) %>% 
  addLegend(pal = pal, values = ~accident_severity) %>% 
  leaflet::addMiniMap(toggleDisplay = TRUE)
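Leaflet maps are htmlwidgets, so one simple way to share the result is to save it as a standalone HTML file (a sketch using htmlwidgets::saveWidget(); the file name here is arbitrary):

```r
library(htmlwidgets)
# Sketch: save the interactive map as a self-contained HTML file,
# which can be opened in any browser or embedded elsewhere.
saveWidget(map, file = "crashes_pedestrians.html", selfcontained = TRUE)
```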

The R leaflet package is not easy to embed in an .md file for the rOpenSci blog entry. Therefore, we also wrote a “custom” LeafletJS chunk below and use htmltools to embed the interactive map here. Feel free to use the chunk in your own work by replacing the geojson content generated from the preceding code.

library(geojsonsf)
library(htmltools)
geojson = sf_geojson(
  crashes_pedestrians_lonlat[, c("accident_severity")],
  factors_as_string = FALSE
)
template = paste0('

')
path = file.path(tempdir(), "temp.html")
write(template, path)
includeHTML(path)

The result is:

Conclusion

stats19 provides access to a
reliable and official road safety dataset. As covered in this post, it
helps with data discovery, download, cleaning and formatting, allowing
you to focus on the real work of analysis (see the package’s
introductory
vignette for
more on this). In our experience, 80% of the time spent using STATS19
data went on data cleaning. Hopefully, now that the data is freely
available in a more useful form, 100% of the time can be spent on
analysis! We think it could help many people, including campaigners,
academics and local authority planners aiming to make the roads of the
UK and the rest of the world safer for all of us.

There are many possible next steps, including:

  • Comparing these datasets with interventions such as 20 mph zones4 and links with street morphology5.
  • The creation of more general software for accessing and working with
    road crash data worldwide.
  • Making the data even more accessible by providing it as part of
    an interactive web application, a technique successfully used in the
    Propensity to Cycle Tool project (this would likely require
    time and resources beyond what we can provide in our spare
    time!).

Finally, thanks to all those who have contributed to the package, especially Mark Padgham, and to contributors not mentioned in this blog post. We also want to take the opportunity to celebrate the release of stats19 🎉, thank rOpenSci for a great review process 🙏 and let you know: the package and data are now out there, and are ready to be used 🚀.

Affiliations

Some of the authors are based at the Institute for Transport Studies and the Leeds Institute for Data Analytics.

References


  1. Lovelace, Robin, Malcolm Morgan, Layik Hama, and Mark Padgham. 2019. “Stats19: A Package for Working with Open Road Crash Data.” Journal of Open Source Software 4 (33): 1181. https://doi.org/10.21105/joss.01181.
  2. Lovelace, Robin, Hannah Roberts, and Ian Kellar. 2016. “Who, Where, When: The Demographic and Geographic Distribution of Bicycle Crashes in West Yorkshire.” Transportation Research Part F: Traffic Psychology and Behaviour, Bicycling and bicycle safety, 41, Part B. https://doi.org/10.1016/j.trf.2015.02.010.
  3. Johansson, R. 2009. Vision Zero – Implementing a policy for traffic safety. Safety Science, 47(6): 826–831.
  4. Grundy, Chris, Rebecca Steinbach, Phil Edwards, Judith Green, Ben Armstrong, and Paul Wilkinson. 2009. “Effect of 20 Mph Traffic Speed Zones on Road Injuries in London, 1986-2006: Controlled Interrupted Time Series Analysis.” BMJ 339 (December): b4469. https://doi.org/10.1136/bmj.b4469.
  5. Sarkar, Chinmoy, Chris Webster, and Sarika Kumari. 2018. “Street Morphology and Severity of Road Casualties: A 5-Year Study of Greater London.” International Journal of Sustainable Transportation 12 (7): 510–25. https://doi.org/10.1080/15568318.2017.1402972.
