French government provides from time to time some data sets that are worth the analysis. A friend, gave me this real estate data set, for analysis.
After the download of this file, that weights less than 200kb, many things appear at first glance.
From a content point of view, data are provided for 3 countries, France, England and USA. They are organized as data series on a timeframe starting in 1800, up to 2015.
From a data quality point of view, a good job have been accomplished. I mean, there are missing data when no statistics were available, and provided data come under a well organized serie, without any hole.
From a style point of view, this appears to be a good Excel nightmare. As mentioned, we have data for 3 countries, but not the same data. For France, we got 27 series, 11 for England, and 12 for USA. But weirdness does not stop at those easy to notice issues. Some more subtles are lurking. First, some series have the same topics addressed by different authorities e.g. England home prices or scope e.g. France vs Paris home prices. Second, each data series is provided under the country currency, thus requiring a conversion to a pivot currency. That tasks is indeed eased as the conversion rate from the currency to euro based on year 2000, is provided for GBP and for USD. And third, data is presented in a well-formed and human readable format, but clearly not matching data analysis requirements. So, some data wrangling is required.
First I saved the file under a .xlsx format instead of the provided .xls format.
Second, I added some columns to ease data handling
|country||distinguish easily the country, source of the data|
|category||categorize each data serie, in order to compare what is comparable|
|unit||allows to manage conversion to a pivot currency|
|coefficient||the multiplier of the units, to ensure comparisons using sames dimensions|
|serie number||to allow serie selection when more than one is provided per country|
I simply used R package XLConnect to turn excel sheet into a R data.table.
Following conversions were applied to the data
- countries are turned to a factor
- currencies are turned to EURO using respective conversion rate over the years, as provided by the data
I used R package tidyr to manage data wrangling. Mainly, I turned data series from 1800 to 2015, given in columns into two column, named year and value, keeping all the other columns, except the one named ‘Séries’ as I want English language output to be produced, and not French.
Here is the main part of the code sc is simple the index of the starting column name .
dtx <- as.data.table(tidyr::gather(dtc, 'year', 'value', sc:ncol(dtc))) dtx[, `:=`(year = as.integer(substring(year, 2)))]
I relied on R package ggplot2 to do the job. Just need here to set up a diagram model, to be reused. I juste enforced the following
- caption is desired to be at the top, using same color scheme for all diagrams, to ease reading
- diagram title will be taken from the English label of the French serie
- diagram abscisses are the years, generally full scaled from 1800 to 2015, sometimes restricted when it makes sense
- diagram ordinates are values
- theme is kept very simple and near the default provided natively by ggplot2.
Here is the main part of the code da is the data.table variable name .
mc <- c('en' = 'red', fr = 'blue', us = 'green') ggplot(da, aes(x = year, y = value, color = country)) + geom_line() + scale_color_manual(values = mc[as.character(unique(da$country))]) + ggtitle(da$Serie) + theme(legend.position = 'top')
And the diagrams produced
Home price and disposable income
Business analysis can now be executed. Let’s wait for my friend to provide its results. Will share them in a second blog post.