The German fuel prices data set

July 27, 2017
(This article was first published on Quantitative Consulting - R analytics blog, and kindly contributed to R-bloggers)

The German fuel prices data set contains the prices (but not the sales volumes) of more than 14,000 fuel stations in Germany since June 2014. It is made available by the web service Tankerkoenig as a Postgres dump (updated continuously from June 2014 onwards) under a CC 4.0 licence.

This is a particularly interesting data set

  • from the general public's point of view, as

    • it deals with a topic under strong public scrutiny that is already intensively studied by economists
    • the data is constantly updated, which allows continuous monitoring for changing trends or new phenomena
  • from the data science point of view, as

    • it is a real-life data set with all the little quirks and bugs whose correction forms a large part of an analyst's working hours
    • the data is small enough (~3 GB in the original data dump) to be prepared and analysed effectively on a single computer, yet big enough to test different techniques for scaling and speeding up the analysis in a cluster
    • it allows the integration of many other (open) data sources (socio-economic data, weather, traffic, holidays). See the graph below.
    • interesting results can already be obtained using purely descriptive methods, but the data can also be analysed and modelled under spatial, time-series, panel and many other aspects.

Overview of the available data and structure

The original data is available at Tankerkoenig. The current status of the data-preparation code, some of the additional external data, and some simple first models are available as a “workshop” on my GitHub account.

Structure of the GitHub project

  A) Collection of regional data tables from destatis

    Reading, cleaning and consolidating multiple socio-demographic data files (sketch below)

  B) Preparation and cleaning of the fuel station master data

    Tidying, creation of brand and highway markers using regular expressions, parsing of the JSON information on opening hours (sketch below)

  C) Geo-operations on the station master data

    Identification of the NUTS-3 region per station, distance matrices from stations to competitors, highways, traffic counters etc. (sketch below)

  D) Preparation of the price data

    Reading from Postgres, cleaning implausible prices, imputing and aggregating the price data (see also the blog entry about efficient missing line imputation), and calculating competitor prices (sketch below)

  E) Creation of models

    Moving to AWS, testing different (linear, panel, spatial) models, collecting the results (sketch below)

  F) Analyses, preparation and visualisation of results

  G) Presentation creation (not yet on GitHub)
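What follows are minimal R sketches of steps A through E. Everything in them (file names, column names, regex patterns, bounds) is an illustrative assumption, not taken from the project itself.

For step A, the sketch reads several destatis extracts and consolidates them into one regional table; the file names, the semicolon/latin1 format and the join key `nuts3_id` are assumptions, since destatis exports vary in layout.

```r
# Step A (sketch): read several destatis extracts and consolidate them
# into one regional table. File names, encoding and the join key
# "nuts3_id" are assumptions -- real destatis exports differ in layout.
library(readr)
library(dplyr)
library(purrr)

files <- c("population.csv", "income.csv", "vehicles.csv")  # hypothetical file names

regional <- files %>%
  map(~ read_csv2(.x, locale = locale(encoding = "latin1"))) %>%  # destatis CSVs are often ;-separated, latin1
  reduce(full_join, by = "nuts3_id")                              # one row per NUTS-3 region
```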
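Step B derives brand and highway flags from the free-text station names and parses the JSON-encoded opening hours. The column names (`name`, `openingtimes_json`) and the patterns below are illustrative, not the exact schema of the dump.

```r
# Step B (sketch): brand and highway markers via regular expressions,
# plus parsing of the JSON opening hours. Column names and patterns
# are illustrative only.
library(dplyr)
library(stringr)
library(jsonlite)

stations <- tibble::tibble(
  name = c("ARAL Tankstelle Berlin", "Autohof A7 Nord", "Freie Tankstelle"),
  openingtimes_json = c('{"mo": "06:00-22:00"}', '{"always_open": true}', NA)
)

stations <- stations %>%
  mutate(
    brand_aral = str_detect(name, regex("\\baral\\b", ignore_case = TRUE)),
    on_highway = str_detect(name, regex("autohof|\\bA\\d{1,3}\\b|BAB",
                                        ignore_case = TRUE))
  )

# One parsed opening-hours object per station (NULL where missing)
opening_hours <- lapply(stations$openingtimes_json,
                        function(x) if (is.na(x)) NULL else fromJSON(x))
```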
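For step C, a sketch with the sf package: a point-in-polygon join assigns each station its NUTS-3 region, and st_distance yields the pairwise station distance matrix. The boundary file name and column names are placeholders for whatever boundary data set is actually used.

```r
# Step C (sketch): geo-operations with the sf package.
# "nuts3_boundaries.gpkg" and the column names are placeholders.
library(sf)

stations <- data.frame(id  = 1:3,
                       lng = c(13.40, 11.58, 9.99),
                       lat = c(52.52, 48.14, 53.55))
stations_sf <- st_as_sf(stations, coords = c("lng", "lat"), crs = 4326)

# Point-in-polygon join: every station inherits the NUTS-3 id of the
# region it falls into
nuts3 <- st_read("nuts3_boundaries.gpkg")        # hypothetical boundary file
stations_sf <- st_join(stations_sf, nuts3["nuts3_id"])

# Pairwise great-circle distances between stations, in metres
dist_matrix <- st_distance(stations_sf)
```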
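Step D pulls the price history from the Postgres dump and drops implausible records before imputation and aggregation. The connection details, the table and column names, and the plausibility filter are assumptions about the concrete setup, not the project's actual code.

```r
# Step D (sketch): read price records from the Postgres dump and drop
# implausible values. Connection details, table/column names and the
# filter bounds are assumptions.
library(DBI)
library(RPostgres)
library(dplyr)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "fuel", host = "localhost", user = "analyst")

prices <- dbGetQuery(con, "
  SELECT stid, date, e5, e10, diesel
  FROM   gas_station_information_history
  WHERE  date >= '2014-06-01'
")
dbDisconnect(con)

# Drop zero/negative sentinel values; tighter bounds would depend on
# the unit convention used in the dump
prices_clean <- prices %>%
  filter(e5 > 0, e10 > 0, diesel > 0)
```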
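Finally, for step E, a sketch contrasting pooled OLS with a fixed-effects panel model from the plm package on simulated station-day data. The variable names are illustrative, not the project's feature set, and the spatial variants are omitted here.

```r
# Step E (sketch): a pooled OLS model and a fixed-effects panel model.
# The simulated data and variable names are illustrative only.
library(plm)

set.seed(42)
panel_df <- data.frame(
  station_id    = rep(1:20, each = 30),
  day           = rep(1:30, times = 20),
  n_competitors = sample(0:8, 600, replace = TRUE),
  on_highway    = rep(sample(c(TRUE, FALSE), 20, replace = TRUE), each = 30)
)
panel_df$avg_e5 <- 1.35 + 0.04 * panel_df$on_highway -
  0.005 * panel_df$n_competitors + rnorm(600, sd = 0.02)

# Pooled OLS across all station-days
fit_ols <- lm(avg_e5 ~ n_competitors + on_highway, data = panel_df)

# "Within" (fixed effects) model: station-level effects are absorbed,
# so time-invariant regressors like on_highway drop out
fit_fe <- plm(avg_e5 ~ n_competitors, data = panel_df,
              index = c("station_id", "day"), model = "within")
summary(fit_fe)
```

The fixed-effects specification is the natural baseline here because unobserved, time-invariant station characteristics (location quality, brand policy) would otherwise bias the competitor coefficient.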

You can find the print version of my UseR!2017 talk on this project here. There is also a video, courtesy of Channel 9, that you can watch here.
