In a previous post on my personal blog about creating Pivot Tables in R with melt and cast we covered a simple way to generate sales reports and summary tables from a data set consisting of orders. It is often said that a picture is worth 1000 words, so in this series of posts we will focus on how to create visual representations and summaries of the same data.
Our graphical library of choice for the job will be ggplot2 (what else?), even though we are mostly going to use it in its simplest format, which is through qplot. I have written other posts on ggplot2 which you may want to also read.
1. Getting started
If you haven’t done it yet, please complete steps 1, 2 and 3 in my previous post Pivot Tables in R with melt and cast. The file with the data can be obtained from the link at the bottom of that post. Once completed, you should have your data set loaded in R and ready for the next steps.
2. Checking the data
Before starting to plot any data frame with ggplot2, it is a good idea to check the data structure and make sure all variables have the correct type. As a matter of fact ggplot2 is a very smart library and will attempt to plot your data even if they are not in the expected format. While this may or may not produce a warning message, the results may end up being far from what we expect. Better to check in advance and save us the pain of a long troubleshooting afterwards.
It has been pointed out that str is one of the most useful functions in R and this is surely true! Let’s take a look at the structure of our data set.
> str(data) 'data.frame': 799 obs. of 5 variables: $ Country : Factor w/ 2 levels "UK","USA": 1 2 2 1 2 1 2 2 2 1 ... $ Salesperson : Factor w/ 9 levels "Buchanan","Callahan",..: 9 8 8 4 7 1 7 7 8 1 ... $ Order.Date : Factor w/ 384 levels "01/01/2004","01/01/2005",..: 118 131 147 183 183 197 197 209 271 281 ... $ OrderID : int 10249 10252 10250 10255 10251 10248 10253 10256 10257 10254 ... $ Order.Amount: num 1863 3598 1553 2490 654 ...
The use of str highlights indeed a problem with our data set. Order.Date is currently regarded by R as a factor instead of a Date. If we are thinking of grouping our sales data by quarter for example, it would be useful to convert it to a Date class so we can use data manipulation functions such as quarter() to extract the quarter of the year. This is an easy fix.
data$Order.Date <- as.Date(data$Order.Date, "%d/%m/%Y")
Note that the format string using in as.Date has to match the format of the date in Order.Date. In this case %d represents the day in digits (1-31), %m the month in digits (1-12) and %Y (capital Y) the year in the 4-digits format (1900-2999).
After the conversion, our data set structure looks like this.
> str(data) 'data.frame': 799 obs. of 5 variables: $ Country : Factor w/ 2 levels "UK","USA": 1 2 2 1 2 1 2 2 2 1 ... $ Salesperson : Factor w/ 9 levels "Buchanan","Callahan",..: 9 8 8 4 7 1 7 7 8 1 ... $ Order.Date : Date, format: "2003-07-10" "2003-07-11" ... $ OrderID : int 10249 10252 10250 10255 10251 10248 10253 10256 10257 10254 ... $ Order.Amount: num 1863 3598 1553 2490 654 ...
We are now ready to create our sales dashboard.
3. A simple scatter plot of orders
Visualizing data in a simple and immediate format should always be the first step of a good visual data analysis. This allows to spot anomalies (for example outliers) and to get an overview of the content of the data set before aggregating and manipulating it further.
Let's start with a plot of all Order.Amount in a temporal sequence, which means by Order.Date.
library(ggplot2) qplot(x=Order.Date, y=Order.Amount, data=data)
Note few things here. First, we need to load the ggplot2 library before we can use qplot. This only needs to be done once in the same R session. Second, qplot is invoked with 3 arguments:
- x is the variable we want to plot on the horizontal axis
- y is the variable we want to plot on the vertical axis
- data is the name of the data set the variables belong to, which allows us to specify them just by variable name (such as Order.Date or Order.Amount) instead that in the full format (which would be data$Order.Date or data$Order.Amount)
Third, if we do not specify any further parameter, qplot uses its defaults for all the rest. Which default is used depends also on whether only y is specified or both x and y. When both x and y are specified, the default is to produce a scatter plot of y values versus x values. Another default is to use the variable names as labels for the axis, as well as apply the standard theme. Enough technicalities, let's get back to data visualization.
Let's say we are interested to show from which country the orders came from. Let's color code the points in the scatter plot according to the value of the Country variable in the data set, which is either USA or UK. With qplot this is as easy as adding an extra argument to the function call.
qplot(x=Order.Date, y=Order.Amount, color=Country, data=data)
Note that the color parameter can also be used with its British spelling of colour. Here is the resulting chart.
Once more, qplot has applied some defaults. First, a standard high-contrast color scheme to distinguish between the orders coming from the two different countries. Second, a legend on the left of the chart specifying how to read each color. The title of the legend is, by default, the name of the variable used to color code the points. Sweet!
Let's try to color code the points according to the sales person who took the order. Another easy one with qplot. Just change the color parameter to the use the Salesperson variable.
qplot(x=Order.Date, y=Order.Amount, color=Salesperson, data=data)
qplot has done a nice job to accommodate our request and color code the points by Salesperson, however there are too many colors and the chart is not really meaningful. Time to switch to a different view!
In Part 2 we will cover Bar Charts and how to make the best use of them. Till next time!
* This article originally appeared in Sales Dashboard in R with qplot and ggplot2 - Part 1