by Joseph Rickert
The 2013 Mining Big Data Camp was held last Saturday at eBay’s Town Hall Conference Center in San Jose. The San Francisco chapter of the ACM has been sponsoring this data-mining-themed “un-conference” event since 2009. Attendance this year was lighter than I remember from past years; however, the event continues to be a viable way to find out what’s hot in the Silicon Valley Data Mining scene. The buzz this year was about Deep Learning, Data Science and R.
I stumbled into the hall just in time to watch the un-conference take shape. Greg Makowski and his team of ACM volunteers do a superb job of managing chaos. An un-conference self-organizes: people propose sessions, a show of hands decides which will fly, people volunteer or are gently prodded into leading the sessions, and a quick count decides which sessions get the larger rooms. I found myself leading two sessions: “An Overview and Introduction to R”, and a discussion on “How to become a Data Scientist”. The attitude of the participants in the R session was strictly business: “How is R organized?”, “What is the best way to learn R?”, “Show me some code.” The pragmatism and enthusiasm reflected exactly what the polls indicate: R skills have become essential to Data Mining and Data Science.
In addition to the “Data Scientist” session in which I participated, there was a parallel session led by eBay hiring managers on getting hired as a Data Scientist. I think the tremendous interest in this topic, at the un-conference and elsewhere, reflects how much momentum has built up towards establishing “Data Scientist” as a distinct job position, and also indicates how useful the title has become as a label for a fairly extensive set of interdisciplinary skills. My take is that a Data Scientist needs to be proficient in four areas:
- Statistical Inference: an understanding of sampling and experimental design at minimum
- Sufficient programming skills to acquire and manipulate large data sets and implement machine learning algorithms
- IT Skills: some knowledge of Linux and big data architectures, and of how to connect to databases, clusters, clouds and Hadoop
- Business Skills: How to take an insufficiently articulated business problem and shape it into a series of relevant technical questions.
While R and Data Science are in the realm of the here and now, the buzz around Deep Learning is that it might be the next really big thing. “Deep Learning” refers to using multi-layer neural nets, including Restricted Boltzmann Machines, to solve difficult tasks in machine vision, audio processing and Natural Language Processing. Apparently, the basic ideas have been around for quite some time, but recent advances in training these multilayer networks have made them practical for certain classes of problems. Python seems to be the language of choice for working in this area: for example, NuPIC (the Numenta Platform for Intelligent Computing, which recently became an open-source project) is a mix of Python and C++. The two very knowledgeable eBay engineers who led the un-conference session worked through an example based on code that I think relied on the Pylearn2 library.
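To make the phrase “multi-layer neural net” a little more concrete, here is a minimal sketch in plain Python of an input vector passing through a stack of layers. This is not the code shown at the session, and it does not use Pylearn2 or NuPIC. The layer sizes, the function names (`make_layer`, `forward`) and the random, untrained weights are all assumptions made purely for illustration.

```python
import math
import random


def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """Build one layer as (weights, biases).

    Weights are small random numbers; nothing is trained here.
    """
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, inputs):
    """Feed an input vector through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


rng = random.Random(0)   # fixed seed so the sketch is repeatable
sizes = [4, 3, 3, 2]     # hypothetical layer widths: 4 inputs, 2 outputs
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes, sizes[1:])]
output = forward(layers, [0.1, 0.9, 0.3, 0.5])
print(output)
```

The “deep” part is simply the loop in `forward`: each layer's output becomes the next layer's input. The hard part, and the subject of the recent advances mentioned above, is training the weights, which this sketch leaves out entirely.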
For me, the ACM un-conference brought some clarity to the complementary roles R and Python play in Data Mining, and provided concrete examples that illustrate why KDnuggets advises would-be Data Scientists to learn both languages (and SQL).