The German fuel prices data set

Posted on July 27, 2017 by Boris Vaillant in R bloggers | 0 Comments

[This article was first published on Quantitative Consulting - R analytics blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The data set on German fuel prices contains the fuel prices, but not the sales, from more than 14000 fuel stations in Germany since June 2014. It is made available by the webservice Tankerkoenig as a Postgres dump (from June 2014 onwards) under CC4.0.

This is a particularly interesting data set

from the general public point of view as
- it deals with a topic under strong public scrutiny and is already intensively studied by economists
- the data is also constantly updated and allows for continuous monitoring for changes in trends or new phenomena
from the data science point of view as
- this is a real life data set with all the little quirks and bugs whose correction forms large part of an analysts working hours
- the data is small enough (~ 3GB in the original data dump) to be effectively prepared and analysed on a single computer, but big enough to allow to test different techniques for scaling and speeding up the analysis in a cluster
- it allows for the integration of many other (open) data sources (socio-economic data, weather, traffic, holidays). See the graph below.
- interesting results can already be obtained using purely descriptive methods, but the data can also be analysed and modelled under spatial–, time series–, panel– and many other aspects.

Overview of the available data and structure

The original data is available at Tankerkoenig, the current status of the code for the data preparation, some of the additional external data, and some simple first models are available as a “workshop” at my GitHub-account.

Structure of the GitHub project

A) Collection of regional data tables from destatis

Reading, cleaning and consolidating multiple socio-demographic data files
B) Preparation and cleaning of the fuel station master data

Tidying, creation of brand and highway markers using regular expressions, parsing of json-information on opening hours
C) Geo-operations on the station master

Identification of NUTS 3-region per station, station distance matrices to competitors, highways, traffic-counters etc.
D) Preparation of price data

Reading from Postgres, cleaning strange prices, imputing and aggregating the price data (see also the blog entry about efficient missing line imputation). Calculation of competitor prices
E) Creation of models

Moving to AWS, test of different (Linear, Panel, Spatial) models, collection of results
F) Analyses, preparation and visualisation of results
G) Presentation creation (not yet on GitHub)

You can find the print-version of my UseR!2017-talk on this project here. There is also a video, courtesy of Channel 9, that you can watch here:

To leave a comment for the author, please follow the link and comment on their blog: Quantitative Consulting - R analytics blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

The German fuel prices data set

Overview of the available data and structure

Structure of the GitHub project

A) Collection of regional data tables from destatis

B) Preparation and cleaning of the fuel station master data

C) Geo-operations on the station master

D) Preparation of price data

E) Creation of models

F) Analyses, preparation and visualisation of results

G) Presentation creation (not yet on GitHub)

Related

Overview of the available data and structure

Structure of the GitHub project

A) Collection of regional data tables from destatis

B) Preparation and cleaning of the fuel station master data

C) Geo-operations on the station master

D) Preparation of price data

E) Creation of models

F) Analyses, preparation and visualisation of results

G) Presentation creation (not yet on GitHub)

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)