After my recent post on Anscombe’s Quartet in which I demonstrated how to efficiently adjust any data set to match mean, variance, correlation (x,y), as well as regression coefficients. Philip Waggoner tuned me onto Justin Matejka and George Fitzmaurice’s Datasaurus R package/paper in which the authors demonstrate an alternative method of modifying existing data to fit a specified mean and variance for X and Y. Their method randomly applies small disturbances to individual observations to gradually move the data to match a target preference set.
Inspired by their use of point images which match a specific parameter set, I have done generated some of my own. For all of them X has a mean and variance of 1 and 11. While y has a mean and variance of 1 and 1. The correlation between X and Y is set to 0 which causes some distortion in the images. More on that in the post.
Figure 1: Shows a graph of 12 data sets each with 15 transitional data sets. The mean, variance, and correlations of X and Y are held constant throughout the sets and transitions.
I generated the data myself using Mobilefish’s upload photo and record clicks webapp. The source images are from images I found online.
The only slight trick to using the data generated by Mobilefish was that the y cooridates are typically tracked from the top of the page with software, yet most statistical graphing software plots with y starting from the bottom of the graph.
The raw data when plotted look like their source material..
New Images: Force Cor(X,Y)=0
When we force the correlation of X and Y to be zero certain point distributions become distorted.
For Bart and the Cat forcing cor(X,Y) has noticable distortions while for the flower minimal distortions seem to have been introduced.
New Images: Force Cor(X,Y)<>0
It gets even worse when we impose a constant correlation between X and Y. The following shows the distortions to the flower when we change b1, keeping Cor(X,Y) constant and fixing the Y plot limits.
Figure 8: Shows the effect on Var(Y) that changing b1, when all other factors are held constant.
Slight changes to the Anscombe-Generator Code
In order to generate graphs that had cor(X,Y)=0 I had to modify my previous code to allow variation in Y that was completely independent of X. The problem with my code was that if b1=1, my calculation used SSE= (b1^2*var(X))*n in order to infer how large the variation in u needed to be (varianceu= (SSE/corXY^2–SSE)/n). This backwards inference does not work if b1=0.
So, just for the special case of corXY=0 I have included an additional error term E which is helpful in the even that b1=0.
The thought of use points to make recognizable images had not occurred to me until I viewed Justin Matejka and George Fitzmaurice’s Datasaurus work. I hope that in making a slightly more efficient distribution manipulator I will allow new and better datasets to be generated which will help students understand the importance of graphical exploration of their data.