By Srini Kumar, Director of Data Science at Microsoft
Who doesn't hate being stopped and given a traffic ticket? Invariably, we feel it is unfair that we got one while everyone else did not. I am no different, and living in the SF Bay Area, I have often wondered whether I could get data about traffic tickets, particularly since there may be some unusual patterns in it.
While I could not find any such data for the SF Bay Area, I supposed I could get some vicarious pleasure out of analyzing it for some place that does publish it. As luck would have it, I did find something: Montgomery County in Maryland puts all of its traffic violation information online. Of course, they remove personally identifiable information, but they publish a lot of other detail. If any official from that county is reading this, thank you for putting it online! Hopefully, other counties follow suit.
By the standards of the data sizes I have worked with, this is rather small; it is a little over 800K records, though it has about 35 columns, including latitude and longitude. I find it convenient to park the data in a relational database, analyze it using SQL first, and then get slices of it into R.
In this post, I'll be using Microsoft SQL Server to import and prepare the traffic violations data, and then use R to visualize the data.
The SQL language
If you are unfamiliar with SQL, this SQL tutorial has been around for a long time, and it is where I got started with SQL. Another thing I find convenient is an ETL tool. ETL, if you are not already familiar with it, stands for Extract-Transform-Load. It is as pervasive a utility in corporate IT infrastructures as plumbing and electrical wiring are in homes, and often just as invisible, noticed only when unavailable. If you are determined to stay open source, you can use PostgreSQL for the database (I prefer PostgreSQL over MySQL, but choose whatever you are comfortable with) and, say, Talend for the ETL. For this exercise, I had access to the Microsoft SQL Server 2016 preview, and ended up using it. Conveniently, it already includes an ETL tool, SQL Server Integration Services, along with a wizard that makes it easy to do it all in one place.
Now, why do I spend so much time suggesting all these tools, when you could very easily just download a file, read it into R, and do whatever you want with it? In a nutshell: because of all the goodies you get with them. For example, you can detect problems with the data as part of the load, and you can run SQL against it. You can handle data sizes that don't fit in memory. And with MS SQL it gets better: an easy wizard lets you load the data and spot problems with it even before the load, and you can also run R from within the database, on a large scale.
Here are the fields in the table. The query is quite simple, and accesses the metadata that is available in every relational database:
select column_name, data_type from INFORMATION_SCHEMA.columns where table_name = 'Montgomery_County_MD_Traffic_Violations'
order by ORDINAL_POSITION asc
column_name                data_type
-----------                ---------
Date Of Stop               date
Time Of Stop               time
Personal Injury            varchar
Property Damage            varchar
Commercial License         varchar
Commercial Vehicle         varchar
Work Zone                  varchar
Violation Type             varchar
Contributed To Accident    varchar
Driver City                varchar
Driver State               varchar
DL State                   varchar
Arrest Type                varchar
The last five columns were created by me. It is trivial in SQL Server (or any other relational database) to create new columns and have them computed each time a new row is added.
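As a sketch of how such computed columns can be defined in SQL Server (the column name Hour_Of_Stop is my own illustrative choice, not from the original table):

```sql
-- Hypothetical example: add a column that is computed automatically
-- from [Time Of Stop] whenever a row is inserted or updated.
ALTER TABLE Montgomery_County_MD_Traffic_Violations
ADD Hour_Of_Stop AS DATEPART(hour, [Time Of Stop]);
```

Because the column is computed, it needs no maintenance: every new row gets its value derived from the stop time automatically.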
With RODBC or RJDBC, it is easy to get data from an RDBMS into R and do what we want with it. Here is an interesting plot of violations by race, and whether they concentrate in certain areas.
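A minimal sketch of pulling the data into R via RODBC; the DSN name "TrafficDB" and the column names in the query are assumptions, so substitute your own data source and schema:

```r
# Connect to the database through an ODBC data source (DSN name is hypothetical)
library(RODBC)
ch <- odbcConnect("TrafficDB")

# Pull just the columns needed for the map; column names are assumed
violations <- sqlQuery(ch,
  "select Latitude, Longitude, Race
     from Montgomery_County_MD_Traffic_Violations
    where Latitude is not null and Longitude is not null")

odbcClose(ch)
```

Pushing the filtering into the SQL query, rather than doing it in R, means only the rows and columns you actually need cross the wire.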
The code to create the above chart uses, as you may have guessed, the ggmap library. Installing ggmap automatically brings in the ggplot2 library as well. To create the plot, we first get the geocode for Montgomery County, which is easy to do, and then use the get_map function to retrieve the map. The get_map function also has a zoom level that we can play with to zoom in or out instead of using the default.
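The steps above can be sketched as follows (the violations data frame and its Longitude, Latitude, and Race columns are assumptions; note also that recent versions of ggmap require a Google Maps API key for geocoding):

```r
library(ggmap)   # installing ggmap also brings in ggplot2

# First, get the geocode for the location...
mc <- geocode("Montgomery County, MD")

# ...then fetch the map; the zoom argument can be adjusted to taste
map <- get_map(location = c(mc$lon, mc$lat), zoom = 10)

# Overlay the violation locations, colored by race (columns assumed)
ggmap(map) +
  geom_point(data = violations,
             aes(x = Longitude, y = Latitude, color = Race),
             alpha = 0.3, size = 0.7)
```

A low alpha value helps here: with over 800K points, semi-transparency makes the dense areas stand out instead of turning into a solid blob.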
Here is another one on violation concentrations related to the time of day (the scale is hours from midnight: 0 is midnight and 12 is noon):
The code to do this is as follows:
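The original code was not preserved in this copy of the post; here is a sketch under the assumption that an hour-of-day column (0 through 23, here called Hour_Of_Stop) was computed from the stop time:

```r
library(ggplot2)

# Density of violations over the hours of the day
# (0 = midnight, 12 = noon; column name is an assumption)
ggplot(violations, aes(x = Hour_Of_Stop)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  scale_x_continuous(breaks = seq(0, 24, by = 4)) +
  labs(x = "Hours from midnight", y = "Density of violations")
```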
It looks like there are very few violations in the wee hours of the morning. How about violations by month, by vehicle type?
Looks like it is mostly cars, but there are some light-duty trucks as well. But how did we get the numerical month of the year? Again, SQL makes it easy:
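The original query was not preserved here; a sketch of one way to do it in T-SQL, using DATEPART on the stop date (the output column names are my own):

```sql
-- Count violations per numerical month of the year
select datepart(month, [Date Of Stop]) as Month_Of_Stop,
       count(*)                        as Violations
  from Montgomery_County_MD_Traffic_Violations
 group by datepart(month, [Date Of Stop])
 order by Month_Of_Stop;
```

The same DATEPART expression can also be used as a computed column, so the month is materialized once in the table rather than recomputed in every query.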