R is my language of choice for data science but a good data scientist should have some knowledge of all of the great tools available to them. Recently, I have been gleefully using Python for machine learning problems (specifically pandas and the wonderful scikit-learn). However, for all its greatness, I couldn’t help but feel it lacks a bit in the data visualisation department. Don’t get me wrong, matplotlib can be used to produce some very nice visualisations but I think the code is a bit messy and quite unintuitive when compared to Hadley Wickham’s ggplot2.
I’m a huge fan of the ggplot2 package and was delighted to discover that there has been an attempt to replicate its style in Python via the ggplot package. I wanted to compare the two packages and see just how well ggplot matches up to ggplot2. Both packages contain built-in datasets and I will use the mtcars data to build a series of plots to see how they compare, both visually and syntactically.
Here we go...
Scatterplots are great for bivariate profiling and revealing relationships between variables. A simple scatterplot using ggplot2 in R:
ggplot(mtcars , aes(x = hp , y = mpg)) + geom_point()
The same scatterplot using ggplot in Python:
from ggplot import *  # also provides the built-in mtcars data
ggplot(mtcars , aes(x = 'hp' , y = 'mpg')) +\
    geom_point()
Not much of a difference there. The syntax is also very similar, with two things to note in Python’s ggplot. The \ after each + is Python’s line-continuation character; it is only needed when the next layer is written on a new line. Also, when mapping variables to the x and y coordinates, the column names must be passed as quoted strings.
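To see the two ways of splitting a + chain across lines in Python, here is a minimal sketch. It uses a hypothetical stand-in class (Layers, which is not part of ggplot) so that it runs without the package installed; the only assumption is that plot objects combine through the + operator, as the snippets in this post do.

```python
# Hypothetical stand-in for a plot object; this is NOT the ggplot API.
class Layers:
    def __init__(self, names=()):
        self.names = tuple(names)

    def __add__(self, other):
        # "+" returns a new object holding the layers of both operands
        return Layers(self.names + other.names)


base = Layers(["ggplot"])
points = Layers(["geom_point"])
facets = Layers(["facet_wrap"])

# Style 1: a trailing backslash continues the statement on the next line
plot_a = base + \
    points + \
    facets

# Style 2: inside parentheses, line breaks need no backslash
plot_b = (
    base
    + points
    + facets
)

print(plot_a.names == plot_b.names)  # True: both styles build the same chain
```

Either style should work for a layered plot; the parenthesised form is often easier to edit because no line depends on a trailing character.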
Boxplots are very nice for comparing the distribution of a continuous variable across the levels of a discrete one. In R’s ggplot2, I discretise the cyl variable with the factor() function to create a boxplot showing the distribution of mpg for each number of cylinders (4, 6 and 8).
ggplot(mtcars , aes(x = factor(cyl) , y = mpg)) + geom_boxplot()
In Python, we need to discretise the cyl variable with the pandas.factorize() function before plotting with ggplot. The ordering is different in the Python plot output, although reordering may be possible as it is in R. Also, note that factorize() replaces the cylinder counts with integer codes assigned in order of first appearance, so 0 = 6 cylinders, 1 = 4 cylinders, and 2 = 8 cylinders.
import pandas as pd
# factorize() returns a (codes, uniques) tuple; keep only the integer codes
mtcars['cyl'] = pd.factorize(mtcars.cyl)[0]
ggplot(mtcars , aes(x = 'cyl' , y = 'mpg')) +\
    geom_boxplot()
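For anyone curious about what pandas.factorize() actually hands back, here is a minimal sketch. It uses the cyl values of the first six rows of mtcars, typed in by hand so the snippet runs without the ggplot package; the variable names are my own.

```python
import pandas as pd

# cyl values of the first six mtcars rows, typed in by hand for illustration
cyl = pd.Series([6, 6, 4, 6, 8, 6], name="cyl")

# factorize() returns a tuple: integer codes plus the unique values they stand for
codes, uniques = pd.factorize(cyl)
print(list(codes))    # codes follow the order of first appearance
print(list(uniques))

# sort=True assigns the codes in sorted order of the unique values instead
sorted_codes, sorted_uniques = pd.factorize(cyl, sort=True)
print(list(sorted_codes))
print(list(sorted_uniques))

# The original labels can be recovered by indexing the uniques with the codes
recovered = uniques[codes]
print(list(recovered) == list(cyl))
```

With sort=True the codes follow the numeric order of the cylinder counts (4, 6, 8), which may be one way to match the ordering of the R plot. I have not checked how ggplot orders the boxes in that case, so treat that as an assumption.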
A fundamental tool for univariate profiling, histograms show the frequency distribution of a variable. In R’s ggplot2, I plot the distribution of mpg across the mtcars data and add a few more components: black bar outlines, a red fill, ten bins, and x axis tick labels every 5 units from 0 to 40.
ggplot(mtcars , aes(x = mpg)) + geom_histogram(colour = "black" , fill = "red" , bins = 10) + scale_x_continuous(breaks = seq(0 , 40, 5))
With Python’s ggplot, the histogram is not as tidy. I couldn’t find a way to colour the bar outlines black, although there may be a way around this. The shape of the distribution also looks a little different despite bins being set to ten in both cases, but this is just down to how the binning is carried out in each package (where the bin edges fall); the information within the plots is the same.
ggplot(mtcars , aes(x = 'mpg')) +\
    geom_histogram(fill = 'red' , bins = 10)
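To show how two ten-bin histograms of the same data can end up with different shapes, here is a small sketch using numpy.histogram. numpy is my choice for the illustration and is not necessarily what either plotting package uses internally; the second range of (10, 35) is an arbitrary, hand-picked assumption. The mpg values are the mtcars column typed in by hand.

```python
import numpy as np

# mpg column of mtcars (32 cars), typed in by hand so the snippet is self-contained
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4,
       22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3,
       19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

# Ten equal-width bins spanning exactly min(mpg)..max(mpg) (numpy's default range)
counts_a, edges_a = np.histogram(mpg, bins=10)

# Ten equal-width bins over a rounder, hand-picked range (an arbitrary choice)
counts_b, edges_b = np.histogram(mpg, bins=10, range=(10, 35))

print(counts_a.tolist())
print(counts_b.tolist())
```

Both calls count the same 32 cars, but because the bin edges fall in different places the bar heights differ, which is the kind of difference visible between the R and Python plots.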
The facet wrapping function in ggplot2 can create fantastic visualisations when using larger datasets. A simple example is given in both implementations.
In R’s ggplot2, quarter mile time (qsec) is plotted against horsepower (hp) for each number of cylinders category. The facet wrapping splits the data by the specified discrete variable, in this case cyl, and plots the qsec/hp relationship for each level.
ggplot(mtcars , aes(x = hp , y = qsec)) + geom_point() + facet_wrap(~factor(cyl))
And in Python’s ggplot (note the same integer codes for cyl are used), a similar output is seen. The slight difference is the absence of the grey border along the top of each plot in ggplot.
ggplot(mtcars , aes(x = 'hp' , y = 'qsec')) +\
    geom_point() +\
    facet_wrap('cyl')
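Conceptually, facet wrapping is a split of the data by one discrete column, with one panel drawn per subset. A rough sketch of that split using pandas groupby (my own illustration, not how either package implements it), with the first six rows of mtcars typed in by hand:

```python
import pandas as pd

# First six rows of mtcars, typed in by hand for illustration
cars = pd.DataFrame({
    "hp":   [110, 110, 93, 110, 175, 105],
    "qsec": [16.46, 17.02, 18.61, 19.44, 17.02, 20.22],
    "cyl":  [6, 6, 4, 6, 8, 6],
})

# One subset per cyl value; a faceted plot draws one panel from each subset
panels = {cyl: subset[["hp", "qsec"]] for cyl, subset in cars.groupby("cyl")}

for cyl, subset in panels.items():
    print(cyl, len(subset))
```

Each entry in panels holds the hp and qsec values that would appear in one facet.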
Making Things a Little Fancier
The previous examples are very simple, but fundamental, plots in data science. With ggplot2 in R, one can be highly creative with their data visualisations by representing categories by colour, facet wrapping, etc. to create plots which hold a lot of information. The following plots are quick examples of how one can be more creative using both packages.
ggplot(mtcars , aes(x = hp , y = mpg , colour = factor(cyl))) + geom_point()
ggplot(mtcars , aes(x = 'hp' , y = 'mpg' , color = 'name')) +\
    geom_point()
ggplot(diamonds , aes(x = price , fill = color)) + geom_histogram(colour = "black") + facet_wrap(~cut)
ggplot(diamonds , aes(x = 'price' , fill = 'color')) +\
    geom_histogram(colour = 'black') +\
    facet_wrap('cut')
That concludes this brief comparison of ggplot2 and ggplot. It is by no means exhaustive and I’m sure there are many ways of modifying Python plots in ggplot which I am unaware of for now. However, I am very grateful to the ggplot package creator Greg Lamp for allowing R fans to create ggplot2 style plots in Python and look forward to using the package in my Python endeavours.