The data set on German fuel prices contains the fuel prices, but not the sales, from more than 14000 fuel stations in Germany since June 2014. It is made available by the webservice Tankerkoenig as a Postgres dump (from June 2014 onwards) under CC4.0.
This is a particularly interesting data set
from the general public point of view as
- it deals with a topic under strong public scrutiny and is already intensively studied by economists
- the data is also constantly updated and allows for continuous monitoring for changes in trends or new phenomena
from the data science point of view as
- this is a real life data set with all the little quirks and bugs whose correction forms large part of an analysts working hours
- the data is small enough (~ 3GB in the original data dump) to be effectively prepared and analysed on a single computer, but big enough to allow to test different techniques for scaling and speeding up the analysis in a cluster
- it allows for the integration of many other (open) data sources (socio-economic data, weather, traffic, holidays). See the graph below.
- interesting results can already be obtained using purely descriptive methods, but the data can also be analysed and modelled under spatial–, time series–, panel– and many other aspects.
Overview of the available data and structure
The original data is available at Tankerkoenig, the current status of the code for the data preparation, some of the additional external data, and some simple first models are available as a “workshop” at my GitHub-account.
Structure of the GitHub project
A) Collection of regional data tables from destatis
Reading, cleaning and consolidating multiple socio-demographic data files
B) Preparation and cleaning of the fuel station master data
Tidying, creation of brand and highway markers using regular expressions, parsing of json-information on opening hours
C) Geo-operations on the station master
Identification of NUTS 3-region per station, station distance matrices to competitors, highways, traffic-counters etc.
D) Preparation of price data
Reading from Postgres, cleaning strange prices, imputing and aggregating the price data (see also the blog entry about efficient missing line imputation). Calculation of competitor prices
E) Creation of models
Moving to AWS, test of different (Linear, Panel, Spatial) models, collection of results
F) Analyses, preparation and visualisation of results
G) Presentation creation (not yet on GitHub)
You can find the print-version of my UseR!2017-talk on this project here. There is also a video, courtesy of Channel 9, that you can watch here: