[This article was first published on novyden, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Skewed data prevail in real life. Unless you observe trivial or near constant processes data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.
Suppose we decided to visualize top 30 U.S trading partners usingbubble chart, which simply is a 2Dscatter plotwith the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports forxycoordinates and trade balance (abs(export – import)) for size:
China, Canada, and Mexico run far larger balances compared to the other 27 countries which causes most data points to collapse into crowded lower left corner. One way to “solve” this problem is to eliminate 3 mentioned outliers from the picture:
While this plot does look better it no longer serves its original purpose of displayingalltop trading partners. And undesirable effect of outliers though reduced still presents itself with new ones: Japan, Germany, and U.K. So let us bring all countries back into the mix by trying logarithmic scale.
Quick refresher from algebra.Log function(in this example log base 10 but the same applies to natural log or log base 2) is commonly used to transform positive real numbers. All because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbersA,B, andCsuch that
Logarithmic scale is simply a log transformation applied to all feature’s values before plotting them. In our example we used it on both trading partners’ features – imports and exports which gives bubble chart new look:
The same data displayed on logarithmic scale appear almost uniform but not to forget the farther away points from 0 the more orders of magnitude they are apart on actual scale (observe this by scrolling back to the original plot). The main advantage of using log scale in this plot is ability of observing relationships between all top 30 countries without loosing the whole picture and avoiding collapsing smaller points together.