The Fundamental Principles of Analytical Design

[This article was first published on Analysis of AFL, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have just finished reading the book Beautiful Evidence by Edward Tufte and in it he talks about the fundamental principles of analytical design.

Henri Matisse

I do not paint things, I paint only the differences between things

Show Comparisons, Contrasts, Differences

When looking at a graph or reading a statistic a good question and perhaps even the first question you can ask yourself is what I am comparing this too.

Think about the recent about AFL being low scoring. Low scoring compared to what? Another ‘infographic’ from NRL. Is this high is it low? Is he above average/below average all these things are hard to tell.

If I were to just show you a boxplot of AFL scores from this year you might be thinking compared to what?

## -- Attaching packages -------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## For news about 'ggpmisc', please, see
## For on-line documentation see
  mutate(total=Home.Points + Away.Points)%>%
  filter(Season == 2018)%>%
  ggplot(aes(y=total, x=as.factor(Season)))+geom_boxplot()

That doesn’t really tell us much does it? You might say I don’t even know if that is low.

But if I plot the past few seasons, it suddenly becomes a lot more informative don’t believe me, look below.

  mutate(total=Home.Points + Away.Points)%>%
  filter(Season %in% c(2015,2016,2017,2018))%>%
  ggplot(aes(y=total, x=as.factor(Season)))+geom_boxplot()

Causality, Mechanism, Structure, Explanation

One possible explanation for this drop in scoring rate is due to increased pressure. Players are fitter than ever and the game isn’t supposed to open up as much as before. As a possible hypothesis we might think there is a negative relationship between the total points scored in games and tackles inside 50.

  group_by(Season, Round, Team)%>%
  filter(Season >2014)%>%
  ggplot(aes(y=tacklesin50total, x=as.factor(Season)))+geom_boxplot()

Multivariate Analysis

The analysis of cause and effect, intially bivariate, quickly becomes multivariate through such necessary elaborations as the conditions under which the causal relation holds

If there is a relationship between totals and tackles inside 50 we can look at this visually by looking at a scatterplot

We are going to cheat a bit here and instead of joining the match scores with fitzRoy::get_match_results() we are just going to sum up the player goals, behinds with the tackles.

  group_by(Match_id, Season)%>%
  summarise(total=6*sum(G)+ sum(B), 
  ggplot(aes(x=totaltacklesin50, y=total))+geom_point()+ geom_smooth(method='lm')

This would be a bivariate example just looking to see if there is anything that looks like a relationship between tackles and totals. It actually look likes there is a slightly weak negative relationship between tackes inside 50 and totals

We can make compare 3 variables at once by faceting

  group_by(Match_id, Season)%>%
  summarise(total=6*sum(G)+ sum(B), 
  ggplot(aes(x=totaltacklesin50, y=total))+
  geom_point() + 
  geom_smooth(method='lm') + 

Integration of evidence

Theres no reason to only show a plot or only show a table, you should be able to combine different modes of presentation to make your graphic as information rich as possible.

So here, our plot becomes a bit more informative simply by adding a regression line to our plot. Here we can see what the slope is and with it we can make inference as to our relationship between tackes inside 50 and total points.

Think about it this way if I just presented you with the plot, your first question might be ok, it looks like theres a negative relationship, how does increasing tackle count effect points totals?

By adding a regression line, we can help answer this question and our plot becomes more informative.

formula <- y ~ x
  group_by(Match_id, Season)%>%
  summarise(total=6*sum(G)+ sum(B), 

  ggplot(df,aes(x=totaltacklesin50, y=total))+
    geom_point() + 
    geom_smooth(method='lm',formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..), 
               formula = formula, parse = TRUE)


Its important to keep the code that makes the plot, this helps add credibility to your work. It also means that errors are easy to spot and fix.

Content that means something

When your making plots, figures what is the story you are trying to tell, what is the data you have your disposal. Then what is the best way to present it. If you don’t have great content then whats the point if its not interesting?

To leave a comment for the author, please follow the link and comment on their blog: Analysis of AFL. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)