R summary() got better!

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers.]

Here is a really nice feature found in the current 3.4.0 version of R: summary() has become a lot more reasonable.


summary(15555)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   15555   15555   15555   15555   15555   15555 

Please read on for some background.

In older versions of R (say R 3.3.1) the above code gave the following undesirable result:


summary(15555)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   15560   15560   15560   15560   15560   15560 

This was always very confusing and hard to explain to beginners. To justify it you had to explain that “R, by default, calculates the summary rounded to 4 significant digits, and is simultaneously configured to give absolutely no indication as to how many significant digits are in fact being displayed.” To add insult to injury, summary() picked a different number of sigfigs than the default numeric presentation: one could type “median(15555)” and get the expected presentation “15555”.
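The mismatch can be reproduced by hand in any version of R. This is a minimal sketch, assuming the old summary() effectively applied signif() with 4 digits to its values and that the session uses the default digits option; the variable names are illustrative only.

```r
# Illustrative sketch: mimic the 4-significant-digit rounding described above.
x <- 15555

# Rounding to 4 significant digits changes the value itself.
rounded <- signif(x, digits = 4)
print(rounded)        # [1] 15560

# The default numeric presentation shows the value unchanged.
print(median(x))      # [1] 15555

# The rounded value is a different number, not just a different display.
print(rounded == x)   # [1] FALSE
```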

Frankly, people do not expect 4 significant digits when viewing what appears to be an integer presented directly from software. They expect display significance to be either much lower, as in “Earth has about 7,500,000,000 people” (2 sigfig), or higher, as in “Daniel Burnham’s New York Flatiron Building has zip code 10010” (5 sigfig, and not the same as 10012). In my opinion it is a bit of a crime to aggressively round numbers in an analysis (not presentation) system prior to moving into scientific notation (which can, in principle, signal the number of significant figures through the use of trailing zeros).

I take “1.556e+4” as an acceptable textual approximation of 15555 and “15560” as unacceptable.

To make matters much worse, at the time R was storing rounded numbers in the summary! It wasn’t storing the presentation string “15560” but the floating point or numeric value 15560.0. This very much confused representation and presentation and made pulling the median off a summary needlessly different than calling median().

Thanks to Martin Maechler and the R core team, summary() now stores much more reasonable numbers and separates representation from presentation:

summary(1555555555)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
# 1.556e+09 1.556e+09 1.556e+09 1.556e+09 1.556e+09 1.556e+09 

format(summary(1555555555), digits=12)
#         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
# "1555555555" "1555555555" "1555555555" "1555555555" "1555555555" "1555555555" 
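One practical consequence is that a value pulled off a summary object should now agree with the corresponding direct call. This is a small check, assuming R 3.4.0 or later with default options; the variable names are illustrative only.

```r
# Sketch assuming R >= 3.4.0, where summary() keeps unrounded values.
x <- 1555555555
s <- summary(x)

# The stored median is the full-precision number, not a 4-digit rounding.
stored_median <- s[["Median"]]
print(stored_median == median(x))           # [1] TRUE

# Rounding is now a presentation choice made when formatting or printing.
print(format(stored_median, digits = 12))   # [1] "1555555555"
```

On an older R (such as 3.3.1) the comparison would be expected to print FALSE, since the summary held the rounded value.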

One of the motivations for the fix (which obviously will change some results) was [loc. cit.]:

The benefit for maintainers and old timers like me will be that we will not need to answer this (non-official) FAQ nor excuse a peculiar behavior in the future …..

The idea is: it is simpler to fix things than to forever explain/defend peculiar behavior. At some point software must adapt to its domain and users, and not always expect the users to retain an arbitrary number of distinctions and caveats.

