I really enjoyed all four talks at today's online conference, Making Data Work. (Disclosure: Revolution sponsored this conference.) I thought the four speakers together gave a great overview of issues related to the processing, analysis, and visualization of big data.
Mike Driscoll started off with a useful categorization for data size. "Small Data" (<10GB) fits in the memory of one machine. "Medium Data" (10GB-1TB) fits on the disk of one machine. "Large Data" (1TB and above) requires distributed storage on multiple machines. Naturally, the tools and processes you use in each case vary, and Mike summarized his recommendations as follows:
1. Use the right tool for the job
2. Compress everything
3. Split up your data
4. Work with samples. Mike offers a handy idiom in Perl to take a 1% sample of a text file:
perl -ne "print if (rand()<0.01)" data.csv > sample.csv
5. Use Statistics (or, as Mike put it, "the grammar of data science"). Naturally, the R Project was featured heavily here.
6. Copy from others, i.e. use open-source tools like GitHub, R, and R packages from CRAN
7. Avoid "chart typology": instead of just dumping data into standard charts like pie charts and bar charts, think of charts as compositions, not containers. Mike gave a great example of a bespoke chart created using the R package ggplot2.
8. Use color wisely
9. Tell a story.
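For anyone who doesn't have Perl handy, the sampling idiom from item 4 can be sketched in Python. This is my own rough equivalent, not something from Mike's slides: the function name `sample_lines` and the optional `seed` argument are made up for illustration, and the file names are simply the ones from the Perl example.

```python
import random

def sample_lines(lines, fraction=0.01, seed=None):
    """Yield each line independently with probability `fraction`.

    Like the Perl one-liner, this makes a single pass and keeps each line
    on a coin flip, so the sample size is only approximately
    fraction * (number of lines).
    """
    rng = random.Random(seed)  # pass a seed for a reproducible sample
    for line in lines:
        if rng.random() < fraction:
            yield line

# Hypothetical usage, mirroring the Perl example:
# with open("data.csv") as src, open("sample.csv", "w") as dst:
#     dst.writelines(sample_lines(src, fraction=0.01))
```

One caveat that applies to both versions: a header row gets no special treatment, so it will usually be dropped from the sample along with 99% of the other lines.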
Mike rounded out his talk with a real-life example from the mobile phone business: using a Greenplum database of billions of phone calls from millions of callers, a social-networking model in R revealed that callers in a local "call network" (of subscribers who regularly call each other) were 700% more likely to switch providers ("churn") if their friends in the network had done so. Mike told this story dramatically by visualizing the social network and showing how the "infection" of churn spreads through the network over time. See Mike's slides for the details.
Next up was Joe Adler, author of the excellent R in a Nutshell reference manual. Joe began with a fascinating historical factoid: Herman Hollerith, a statistician working on the 1880 census, not only invented the punch card and founded the Tabulating Machine Company which would become one of the companies to launch IBM, but also was arguably the first implementor of map-reduce techniques (for which Hadoop is now famous). He could justifiably be called the first Data Scientist.
Joe mentioned some of the challenges of working with big data at LinkedIn. For example, the People You May Know feature is a recommendation engine for 70 million users - an O(n^2) problem, and therefore computationally intensive. Joe's suggestions aligned with Mike's: compressing data, eliminating data not used for analysis, etc.
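To get a feel for what O(n^2) means at that scale, here is a back-of-the-envelope count in Python. It assumes the most naive approach imaginable (scoring every possible pair of users), which is surely not how LinkedIn actually does it, and the rate of one million pairs per second is a number I picked purely for illustration.

```python
def candidate_pairs(n_users):
    """Number of distinct unordered pairs among n_users: n * (n - 1) / 2."""
    return n_users * (n_users - 1) // 2

n_users = 70_000_000            # user count quoted in the talk
pairs = candidate_pairs(n_users)
print(f"{pairs:,} pairs")       # 2,449,999,965,000,000 pairs

# Hypothetical throughput, chosen only to make the scale concrete.
pairs_per_second = 1_000_000
seconds_per_year = 365 * 24 * 60 * 60
years = pairs / pairs_per_second / seconds_per_year
print(f"about {years:.1f} years on one machine at that rate")
```

That's roughly 2.4 quadrillion pairs, which is why trimming and compressing the data before you start matters so much.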
Joe did make one interesting observation regarding P-values and sampling: when you're dealing with very large data sets, don't forget that P-values aren't necessarily useful, because statistical significance isn't the same thing as practical significance. With enough data, even a tiny difference will register as significant. He gave a great example: testing (with a Pearson Chi-Squared test in R) whether the ratio of boys to girls born in the United States in 2006 varies by day of week. Using a 10% sample of the births, the P-value (about 0.3) is not significant. But when using all of the data, you get a very significant P-value (<0.001). Does this mean that boys are significantly more likely to be born on a Monday than on a Wednesday? Probably not ... and even if it does, the difference isn't meaningful.
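Joe's example used R and the real births data, neither of which I have reproduced here. The Python sketch below only illustrates the mechanism, under some loud assumptions: the counts are invented, the "full" table is just the "sample" table multiplied by 10 (so the proportions are identical, which real data would not be), and the P-value helper uses a closed form that is valid only for an even number of degrees of freedom (here 6, from 7 days and 2 sexes).

```python
import math

def chi2_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_pvalue_even_df(stat, df):
    """Upper-tail chi-squared probability; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = stat / 2.0
    term = 1.0
    series = 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series

# Invented (boys, girls) counts for Monday..Sunday; NOT the 2006 births data.
sample = [(3112, 2888), (3032, 2968), (3102, 2898), (3042, 2958),
          (3117, 2883), (3027, 2973), (3072, 2928)]
# Assumption: the "full" data has exactly the same proportions, 10x the size.
full = [(boys * 10, girls * 10) for boys, girls in sample]

df = (len(sample) - 1) * (len(sample[0]) - 1)  # (7 - 1) * (2 - 1) = 6
stat_sample = chi2_statistic(sample)
stat_full = chi2_statistic(full)
p_sample = chi2_pvalue_even_df(stat_sample, df)
p_full = chi2_pvalue_even_df(stat_full, df)

print(f"sample: chi2 = {stat_sample:.2f}, p = {p_sample:.3g}")
print(f"full:   chi2 = {stat_full:.2f}, p = {p_full:.3g}")
```

Because the statistic scales with the number of observations, the same day-to-day differences that look like noise in the smaller table come out as highly significant in the larger one. The size of the effect hasn't changed at all; only the P-value has.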
Joe also included a great discussion on privacy issues related to predicting from combined data sets of both on-line and off-line behavior, recalling the outrage directed at DoubleClick in 2000.
Next up was Hilary Mason from Bit.ly, who gave a great presentation about practical approaches to machine learning. She included lots of useful tips, like using Amazon Mechanical Turk to bootstrap hand-categorizing a sample of data, and then using supervised learning techniques to infer the remaining labels. She also mentioned Yahoo Boss, which looks like a great way of defining your own search to extract semi-structured data from the unstructured Web. Hilary introduced a handy new (to me, at least) term: "streamlining", the process of data mining where the algorithm only sees each data point once. (Thanks to the folks on Twitter who let me know that this is widely referred to as on-line learning or streaming analytics in the machine-learning community.)
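As a small illustration of the see-each-point-once idea (my own example, not one from Hilary's talk), here is Welford's algorithm for a running mean and variance in Python. It is a streaming computation rather than a full learning algorithm, and the class name `RunningStats` is just something I made up, but it shows the pattern: each value updates a small summary and can then be thrown away.

```python
class RunningStats:
    """One-pass mean and sample variance using Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary; x is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN until two points are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for an unbounded stream
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the whole appeal when the data won't fit on one machine.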
Finally, Ben Fry, one of the creators of the Processing programming language, demonstrated some beautiful interactive visualizations of data. I've been impressed with Processing visualizations featured on FlowingData and elsewhere, but never as impressed as when I learned that Processing graphics have featured both on the cover of Nature and in the movie The Hulk. (When Nick Nolte is using your software, that's when you know you've made it to the big leagues.)
Kudos to O'Reilly for putting on such a great raft of speakers. As an on-line conference, it worked well -- if I'd had to travel I probably wouldn't have made it to a conference like this, and the price was very reasonable. On the other hand, you don't get to mingle with the other attendees (which is one of the biggest benefits of live conferences, IMO), but maybe they can fix that problem for the next conference. (You can sometimes get a similar experience from following a Twitter stream, but sadly Twitter was in permanent fail-whale mode today.) I'm looking forward to the followup.
O'Reilly Conferences: Making Data Work