Part 3b: EDA with ggplot2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- Analyse the (continuous) dependent variable – at least, make an histogram of the distribution; when applicable, plot the variable against time to see the trend. Can the continuous variable be transformed to a binary one (i.e., did it rain or not on a particular day?), or to a multicategorical one (i.e., was the rain none, light, moderate, or heavy on a particular day?).
- Search for correlations between the dependent variable and the continuous independent variables – are there any strong correlations? Are they linear or non-linear?
- Do the same as above but try to control for other variables (faceting in ggplot2 is very useful to do this), in order to assess for confounding and effect modification. Does the association between two continuous variables hold for different levels of a third variable, or is modified by them? (e.g., if there is a strong positive correlation between the rain amount and the wind gust maximum speed, does that hold regardless of the season of the year, or does it happen only in the winter?)
- Search for associations between the dependent variable and the categorical independent variables (factors) – does the mean or median of the dependent variable change depending on the level of the factor? What about the outliers, are they evenly distributed across all levels, or seem to be present in only a few of them?
So, let’s now do some analyses, having the framework described above in mind.
Exploring the dependent variable – daily rain amount
# Time series of the daily rain amount, with smoother curve ggplot(weather, aes(date,rain)) + geom_point(aes(colour = rain)) + geom_smooth(colour = "blue", size = 1) + scale_colour_gradient2(low = "green", mid = "orange",high = "red", midpoint = 20) + scale_y_continuous(breaks = seq(0,80,20)) + xlab("Date") + ylab("Rain (mm)") + ggtitle("Daily rain amount")
# Histogram of the daily rain amount ggplot(weather,aes(rain)) + geom_histogram(binwidth = 1,colour = "blue", fill = "darkgrey") + scale_x_continuous(breaks = seq(0,80,5)) + scale_y_continuous(breaks = seq(0,225,25)) + xlab("Rain (mm)") + ylab ("Frequency (days)") + ggtitle("Daily rain amount distribution")
> # Heavily right-skewed distribution > summary(weather$rain) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 0.000 0.300 5.843 5.300 74.900 > # Right-skewness is still there after removing all the dry days > summary(subset(weather, rain > 0)$rain) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.30 1.15 5.10 11.17 16.10 74.90 > # Formal calculation of skewness (e1071 package) > library(e1071) > skewness(weather$rain) [1] 2.99 > skewness(subset(weather, rain >0)$rain) [1] 2.04
> # Create binary outcome (rained = {Yes, No}) > # Number of dry days > nrow(subset(weather, rain == 0)) [1] 174 > # Number of days that will also be considered dry days > nrow(subset(weather, rain <1 & rain >0)) [1] 45 > # The new binary variable is called "rained" > weather$rained <- ifelse(weather$rain >= 1, "Yes", "No") > # Dry and wet days (absolute) > table(rained = weather$rained) rained No Yes 219 146 > # Dry and wet days (relative) > prop.table(table(rained = weather$rained)) rained No Yes 0.6 0.4
Porto is one of the wettest cities in Europe for a reason. There was rain in exactly 40% of the days of the year, and this is considering those days with rain < 1mm as dry. Should we set the cutoff at 0 mm, and more than 50% of the days in 2014 would have been considered wet.
Looking at the association between rain and season of the year
The time series plot seems to indicate that season of the year might be a good predictor for the occurrence of rain. Let’s start by investigating this relation, with both the continuous rain variable and the binary one.
Rain amount (continuous) by season
# Jitter plot - Rain amount by season ggplot(weather, aes(season,rain)) + geom_jitter(aes(colour=rain), position = position_jitter(width = 0.2)) + scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = 30) + scale_y_continuous(breaks = seq(0,80,20)) + xlab("Season") + ylab ("Rain (mm)") + ggtitle("Daily rain amount by season")
> # Rain amount by season > # {tapply(x,y,z) applies function z to x, for each level of y} > tapply(weather$rain,weather$season,summary) $Spring Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 0.000 0.000 2.934 2.550 28.200 $Summer Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 0.000 0.000 2.936 1.350 68.300 $Autumn Min. 1st Qu. Median Mean 3rd Qu. Max. 0.000 0.000 0.300 7.164 6.450 74.900 $Winter Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 0.00 3.95 10.45 14.65 64.80
We can see that most of the extreme values (outliers) are in the winter and autumn. There are still, however, many dry days in both these seasons, and hence the means are too close from each other. Fitting a model to predict the actual amount of rain (not the same as probability of occurrence) based on the season alone might not give the greatest results.
# Bar plot - dry and wet days by season (relative) ggplot(weather,aes(season)) + geom_bar(aes(fill = rained), position = "fill") + geom_hline(aes(yintercept = prop.table(table(weather$rained))["No"]), colour = "blue",linetype = "dashed", size = 1) + annotate("text", x = 1, y = 0.65, label = "yr. w/o = 0.60", colour = "blue") + xlab("Season") + ylab ("Proportion") + ggtitle("Proportion of days without and with rain, by season")
> round(prop.table(table(season = weather$season, rained= weather$rained),1),2) rained season No Yes Spring 0.74 0.26 Summer 0.72 0.28 Autumn 0.57 0.43 Winter 0.37 0.63
Looking at the correlations between rain and all numeric variables
We are now going to calculate the linear (Pearson) correlations between the continuous outcome variable (daily rain amount) and all the numeric variables – both the actual numeric ones (such as the temperature and wind speed), and those that are actually factor variables, but still make sense if we want to see them as numeric (for instance, the hour of the day in which some event occurred).
Note that we are not modelling at this stage, just trying to gain some insight from the data, and therefore we are neither concerned about whether the correlations are significant (p-values and/or confidence intervals) nor if the the relation between any two variables is in fact linear (we will be able to check this later, anyway, when we create the scatter plots with the lowess curves on top).
Here is the code (the comments should make it easier to understand what we are trying to do).
> # Create a new data frame with only the variables than can be numeric > weather.num <- weather[c 1="style="color:" 2="#aa5500;">"rain","l.temp","h.temp","ave.temp","ave.wind","gust.wind",
+" 14=""l.temp.hour","h.temp.hour","gust.wind.hour")" language="( # Convert the following factor variables to numeric > weather.num$l.temp.hour <- as.numeric(weather.num$l.temp.hour) > weather.num$h.temp.hour <- as.numeric(weather.num$h.temp.hour) > weather.num$gust.wind.hour <- as.numeric(weather.num$gust.wind.hour) > # Pass the entire data frame to the cor() function to create a big correlation matrix > # Pick only the first row of the matrix [1,] ==> correlation between rain and all the other variables > round(cor(weather.num),2)[1,] rain l.temp h.temp ave.temp ave.wind 1.00 -0.14 -0.29 -0.21 0.48 gust.wind l.temp.hour h.temp.hour gust.wind.hour 0.61 0.19 -0.25 -0.16
> # Let's split the data frame in four parts, one for each season > weather.num.season <- split(weather.num,weather$season) > # The result is a list... > class(weather.num.season) [1] "list" > # ...with 4 elements, where... > length(weather.num.season) [1] 4 > # ...each element of the list is a data frame (the seasons), with nine variables > summary(weather.num.season) Length Class Mode Spring 9 data.frame list Summer 9 data.frame list Autumn 9 data.frame list Winter 9 data.frame list > # Here are the names of each of the data frames of the list > attributes(weather.num.season) $names [1] "Spring" "Summer" "Autumn" "Winter" > # *apply family of functions are arguably the most powerful in base R, but also the most difficult to master > # {sapply(x,z) applies function z to each element of x} > # First go over the elements of the list and calculate the correlation matrix (all against all) > # For each season, return only the correlation between "rain" and everything else > sapply(weather.num.season, function (x) round(cor(x)["rain",],2)) Spring Summer Autumn Winter rain 1.00 1.00 1.00 1.00 l.temp -0.33 0.06 0.13 0.05 h.temp -0.45 -0.26 -0.07 -0.26 ave.temp -0.43 -0.14 0.06 -0.09 ave.wind 0.34 0.50 0.38 0.66 gust.wind 0.37 0.45 0.66 0.71 l.temp.hour 0.24 0.16 0.01 0.33 h.temp.hour -0.13 -0.22 -0.16 -0.30 gust.wind.hour -0.26 -0.34 -0.18 0.06 > # Not going to run it, but here is an alternative that might be easier (or not) to understand > # It actually shows the correlations for each element of the list > # lapply(weather.num.season, function (x) round(cor(x)["rain",],2))
Looking deeper at the wind and rain correlation – faceting technique
The idea of faceting (also called Trellis plots) is a simple but important one. A graphic, of any type and mapping any number of variables, can be divided into several panels, depending on the value of a conditioning variable. This value is either each of the levels of a categorical variable, or a number of ranges of a numeric variable. This helps us check whether there are consistent patterns across all panels. In ggplot2 there are two functions to create facets – facet_wrap() and facet_grid() – that are used when we have one or two conditioning variables, respectively. As everything in ggplot2, this is just another layer in the plot, which means all geometries, mappings, and settings will be replicated across all panels.
Let’s then assess the linearity of the correlation of the amount of rain and the maximum wind gust speed, conditioning on the season of the year.
# Amount of rain vs. wind, by season ggplot(weather,aes(gust.wind,rain)) + geom_point(colour = "firebrick") + geom_smooth(size = 0.75, se = F) + facet_wrap(~season) + xlab("Maximum wind speed (km/h)") + ylab ("Rain (mm)") + ggtitle("Amount of rain vs. maximum wind speed, by season")
Occurrence of rain – more complex faceting to visualise variable association
> # Using the defaults of the quantiles function returns 4 intervals (quartiles) > quantile(weather$h.temp) 0% 25% 50% 75% 100% 9.8 14.4 19.1 23.3 31.5 > # All we need to do is to define the quartiles as the breaks of the cut function > # and label the intervals accordingly > weather$h.temp.quant <- cut(weather$h.temp, breaks = quantile(weather$h.temp), labels = c("Cool","Mild","Warm","Hot"),include.lowest = T) > # The result > table(weather$h.temp.quant) Cool Mild Warm Hot 92 91 94 88
# Occurrence of rain, by season and daily high temperature ggplot(weather,aes(rained,gust.wind)) + geom_boxplot(aes(colour=rained)) + facet_grid(h.temp.quant~season) + xlab("Occurrence of rain") + ylab ("Maximum wind speed (km/h)") + ggtitle("Occurrence of rain, by season and daily high temperature")
In the meanwhile, if you have done any kind of analysis with this data set, feel free to share your own findings, either by commenting on the post or by sending me a private message.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.