I’ve met Anthony before to discuss whether the Census Bureau could either…
- publish R-readable input statements for flat file public datasets (instead of only the SAS input statements we publish now); or…
- cite his R package SAScii, which automatically processes a SAS input file and reads data directly into R (no actual SAS installation required!).
Folks agree SAScii is an excellent tool and we’re working on the approvals to mention it on the relevant download pages.
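To make the idea concrete, here is a rough sketch of what "processing a SAS input statement" involves: pull the variable names and column positions out of the SAS code, then use them to slice each line of the flat file. This is written in Python purely as an illustration and is not how Anthony's package is implemented; the toy `INPUT` statement, the variable names, and the two data lines are all made up, and real SAS input files use many more features than this simple pattern handles.

```python
import re

# Hypothetical SAS INPUT statement; real ones for public-use files are far longer
# and use syntax (informats, pointer controls) that this sketch ignores.
SAS_INPUT = """
INPUT
    STATE  $ 1-2
    AGE      3-5
    INCOME   6-12
;
"""

def parse_sas_input(text):
    """Return a list of (name, start, end, is_character) tuples."""
    # Grab everything between the INPUT keyword and the closing semicolon.
    body = re.search(r"INPUT(.*?);", text, flags=re.IGNORECASE | re.DOTALL).group(1)
    # Each variable looks like: NAME [$] start-end  ($ marks a character column).
    pattern = re.compile(r"(\w+)\s*(\$?)\s*(\d+)\s*-\s*(\d+)")
    return [(name, int(start), int(end), dollar == "$")
            for name, dollar, start, end in pattern.findall(body)]

def read_fixed_width(lines, columns):
    """Slice each line using SAS-style 1-based, inclusive column positions."""
    records = []
    for line in lines:
        record = {}
        for name, start, end, is_char in columns:
            raw = line[start - 1:end].strip()
            record[name] = raw if is_char else float(raw)
        records.append(record)
    return records

columns = parse_sas_input(SAS_INPUT)
rows = read_fixed_width(["MD 34  52000", "VA 51 103500"], columns)
print(columns)
print(rows)
```

The point is only that the SAS file already contains everything needed to read the data, so a tool that can parse it removes the need for SAS itself.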
Meanwhile, Anthony’s not just waiting around. He’s put together an awesome blog, asdfree.com (“Analyze Survey Data for Free”), where he posts complete R instructions for finding, downloading, importing, and analyzing each of several publicly-available US government survey datasets. These include, in his words, “obsessively commented” R scripts that make it easy to follow his logic and understand the analysis examples. Of course, “My syntax does not excuse you from reading the technical documentation,” but the blog posts point you to the key features of the tech docs. For each dataset on the blog, he also makes sure to replicate a set of official estimates from that survey, so you can be confident that R is producing the same results that it should.
For the huge datasets that fit on his hard drive but not in memory, which R can’t handle natively, he recommends the open-source database MonetDB. With R and MonetDB, he was able to compute summary statistics on 67 million records in 8 seconds on his personal laptop; I don’t know the standards in this area myself, but the audience seemed to find it impressive. Anthony’s blog also gives clear step-by-step instructions for installing MonetDB, getting these huge survey datasets into it, and calling them from R.
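The general pattern here is to leave the records in the database and ask it for the summary, so that only a handful of result rows ever come back into memory. Below is a minimal sketch of that pattern, with two loud caveats: it uses Python's built-in SQLite rather than MonetDB, and the table, column names, and three toy records are invented for illustration. It shows the workflow only and says nothing about MonetDB's speed.

```python
import sqlite3

# Assumption: SQLite stands in for MonetDB here. ":memory:" keeps the demo
# self-contained; for data too big for RAM you would pass a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (state TEXT, income REAL, weight REAL)")

# Hypothetical records; a real survey file would be loaded in bulk.
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [("MD", 52000.0, 1.5), ("MD", 48000.0, 0.5), ("VA", 103500.0, 2.0)],
)

# The database does the aggregation; only one row per state is returned.
query = """
    SELECT state,
           COUNT(*)                            AS n,
           SUM(income * weight) / SUM(weight)  AS weighted_mean_income
    FROM survey
    GROUP BY state
    ORDER BY state
"""
summary = conn.execute(query).fetchall()
print(summary)
conn.close()
```

With millions of rows the query text would stay the same; what changes is that the raw records never have to be loaded into the analysis session.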
(He mentioned a few other ways to deal with big data and R, but my favorite was: “If your colleagues avoid R because the data won’t fit into RAM, tell them to take the $10,000 you spend on SAS licenses per year; spend $30 on RAM; and… spend the other $9,970 on pizza parties every day.”)
Anthony actually started the talk with a quick illustration of why R is so useful: unlike competitors such as SAS, SPSS, Stata, and SUDAAN, which are statistical packages that manipulate everything as data tables, R is a statistical programming language. It lets you subset objects, pass them to functions, feed the output directly into another function, and so on, without requiring that everything be a data table. He warned it’s a difficult language, and “You will not get instant gratification from R,” but it’s certainly worth learning.
Furthermore, he warned of the “vanishing right to privacy” of code used for public analyses based on public data. There’s a trend afoot to make such research openly reproducible by anyone, and that’s easier with an open language like R than with its proprietary competitors. Also, if you change jobs, it’s easier to keep working in an open tool like R than to justify your bosses paying for, say, a SAS license if your new workplace is a Stata shop (for example). Plus, the R community and package system are a huge bonus.
Finally, if you’re psyched to use R but need help not only with these survey datasets but with learning R itself, Anthony has a great series of two-minute R tutorials at twotorials.com (I wish I’d thought of the name first!). He does talk fast and there are over 90 of them by now, so it’s a lot of information. His flowchart handout lists other resources for learning R.
[Edit: this seems to have been fixed, but just FYI: Anthony's site, "Analyze Survey Data for Free" or asdfree.com, was being blocked by some anti-virus software. The site is safe, but apparently the domain name used to belong to spammers, and on some computers this was even blocking the R Meetup event page that links to asdfree.com. For that reason I won't hyperlink to his site yet either. But he's had the site re-reviewed by the anti-virus companies, and they fixed it in an update. So if your computer blocks it too, wait a few days and it should be back!]