My parents and I made plans to visit San Jose and Saratoga on my grandmother’s birthday, March 19, since that is where she grew up. I randomly saw someone tweet about the ACM Data Mining Camp unconference that happened to be the next day, March 20, only a couple of miles from our hotel in Santa Clara. This was an opportunity I could not pass up.
Upon arriving at eBay/PayPal’s “Town Hall” building, I was greeted by some very hyper people! Surrounding me were a lot of people my age who shared my interests. I finally felt like I was in my element. The organizers also had a predetermined Twitter hashtag for the event, #DMCAMP, and set up a blog where people could add material and write comments about the sessions. I felt like a kid in a candy shop when I saw the proposed topics for the breakout sessions.
Some of the proposed topics I found really interesting:
- Anomaly Detection
- Natural Language Processing
- Collaborative Filtering and a Netflix Paper
- CPC Optimization for Events
- Data Mining Programming Tools
- Structured Tags
- Status of Mahout
- Machine Learning with Parallel Processors
- Sentiment Analysis
- Parallel R
About half of these actually made it onto the schedule. Unfortunately, I was only able to attend four sessions because of how they were scheduled, but that’s OK because by the end of the day I was exhausted anyway.
Session 1: Status of Mahout
This one was particularly exciting. Dr. Ted Dunning, a committer for the Mahout project, stated that the purpose of Mahout (muh-hoot) is to make machine learning and data mining algorithms scalable. The purpose is not to produce the most efficient or highest-performing algorithm. In particular, Mahout is built on top of Hadoop so that algorithms can take advantage of map-reduce. Mahout is not currently a top-level project, but a subproject of Lucene. It may become a top-level project, like Hadoop, sometime in the future.
Learning about Mahout was exciting because I learned what it can do for me and other researchers. Recently, I had a huge incidence matrix whose singular values and vectors I wanted to compute. Astonishingly, NumPy cannot do this yet despite its awesome sparse matrix support. Mahout, on the other hand, is working on it and seems to be pretty close. Mahout can already perform very fast SVD using a Hebbian method. Now they are working on distributed SVD for sparse matrices using stochastic decomposition. Unfortunately, there do not currently seem to be any plans to incorporate methods for non-negative matrix factorization.
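For readers who want to try this kind of computation on a single machine, here is a minimal sketch of a truncated SVD on a sparse matrix. It uses SciPy's `scipy.sparse.linalg.svds`, which is my own choice for illustration and was not part of the session; it is not Mahout's Hebbian or stochastic method. The random 0/1 matrix is a hypothetical stand-in for an incidence matrix, and `top_singular_triplets` is a helper name I made up.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

def top_singular_triplets(matrix, k):
    """Return the k largest singular values, in descending order, with their vectors."""
    u, s, vt = svds(matrix, k=k)      # k must be smaller than min(matrix.shape)
    order = np.argsort(s)[::-1]       # sort explicitly rather than rely on svds ordering
    return u[:, order], s[order], vt[order, :]

# Hypothetical stand-in for an incidence matrix: 200 x 100, about 5% nonzero entries.
incidence = sparse_random(200, 100, density=0.05, format="csr", random_state=0)
incidence.data[:] = 1.0               # incidence matrices hold 0/1 entries

u, s, vt = top_singular_triplets(incidence, k=5)
print(s)
```

Only the five leading singular triplets are computed, so the matrix never has to be converted to dense form.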
Session 2: Data Mining in the Cloud
The second session was roundtable style and showed how diverse the crowd was. Experience ranged from cloud expert to not understanding what the cloud is. There was also some discussion on the blurring of data mining and data processing. We also struggled a bit with the sound “sass” as it came up in different contexts: SAS the statistical package, SaaS (“software as a service”), and SAS hard disks (“serial attached SCSI”), which are sometimes used for big data.
An introduction to Amazon EC2 and the services it provides dominated the discussion. A common question was how much Amazon EC2 costs in comparison to a private hardware cluster. One case study mentioned was RazorFish’s experience with EC2. They spent about $13,000 per month using a large EC2 cluster, whereas without EC2 they would have spent upward of $500,000 for the necessary hardware, in addition to hiring another systems support employee.
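To put those two figures side by side, here is a rough back-of-the-envelope sketch. It uses only the $13,000 per month and $500,000 numbers quoted above. Everything else is my own simplifying assumption: it ignores the extra support employee, power, hosting, and hardware depreciation, all of which would push the break-even point further out. The function name is hypothetical.

```python
def breakeven_months(upfront_hardware_cost, monthly_cloud_cost):
    """Months of cloud spending needed to equal the up-front hardware cost."""
    return upfront_hardware_cost / monthly_cloud_cost

# Figures quoted in the session; all other costs are deliberately left out.
months = breakeven_months(500_000, 13_000)
print(f"Cloud spending matches the hardware cost after about {months:.1f} months")
```

Under these assumptions the cloud bill would take a little over three years to catch up with the hardware purchase alone.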
There were people interested in combining R with Hadoop in the cloud. I mentioned the packages RHIPE (ree-pay) and HadoopStreaming. Chris Wensel (@cwensel) mentioned that these packages may not be very useful performance-wise due to the way that serialization occurs in Hadoop. I may not remember his exact reasoning or wording, though.
Session 3: Data Mining in R
The Data Mining in R session was originally planned to be a Birds of a Feather session, but only a few of us in the room had used R for data mining. The room was packed, and it seemed that a lot more people had attended than had originally shown interest! Some packages mentioned included rattle, a GUI for data mining, and caret, a package to streamline the creation of predictive models.
Hadley Wickham’s ggplot2 was the focus of our discussion on visualization of data. For large data, J. Rickert from Revolution Computing gave an interesting bit of advice: “always expect that your size matrix will require four times as much space.” We also learned a bit about the bigmemory package.
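As a quick illustration of that rule of thumb, here is a small sketch that estimates working memory for a dense numeric matrix. The factor of four comes from the advice quoted above; the 8 bytes per value (a double-precision number), the example dimensions, and the function name are my own assumptions.

```python
def estimated_working_bytes(rows, cols, bytes_per_value=8, overhead_factor=4):
    """Raw size of a dense numeric matrix multiplied by a working-space factor."""
    raw_bytes = rows * cols * bytes_per_value
    return raw_bytes * overhead_factor

# Hypothetical example: one million rows by 100 columns of doubles.
needed = estimated_working_bytes(1_000_000, 100)
print(f"Plan for roughly {needed / 1e9:.1f} GB of memory")
```

The raw matrix in this example is 0.8 GB, so the rule suggests budgeting about 3.2 GB.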
At some other point during the day there was an R for Newbies session. I wish I had had my slides with me because I could have assisted. I was thrilled that so many people were interested in R. As I entered the room for the Data Mining in R session, there was a fellow who asked what the next session in the room was. When the presenter said “Data Mining in R”, I expected him to say something like “oh, no, that’s not for me” as I am so used to hearing. Instead, he said “Oh cool, I really need to learn R.”
For those who are interested in learning R, see our slides from the UCLA Statistical Consulting Center, where we teach workshops in R several times per quarter. Material from previous quarters is there as well.
Session 4: Hadoop
Chris Wensel is a Hadoop genius. I envy his attention to every technical detail about the system! At this point in the long day, I attended casually because I was pretty tired. Most of the time was spent in a question and answer forum. The momentum from the presenter and the audience was with Cascading, a workflow system for Hadoop jobs. Wensel’s advice was, “write a map-reduce application, throw it away, and then start using Cascading.” This gave me a lot of motivation to try it now. I had just assumed one must master the full art of Hadoop before moving on to Cascading and some of the other projects.
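For anyone who has never written the throwaway map-reduce application Wensel mentioned, here is a minimal sketch of the idea as a word count in plain Python. It does not use the Hadoop or Cascading APIs; the function names are my own, and everything runs in a single process purely to show the map, shuffle, and reduce steps.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data mining"])
print(counts)
```

In a real cluster the map and reduce steps run on many machines and the framework handles the shuffle, which is the plumbing that workflow tools sit on top of.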
One major thing I learned was that Amazon EC2 is not necessary to run Hadoop in the cloud. Amazon Elastic MapReduce accomplishes the same without the need for an AMI!
HBase was introduced as a parallel, column-based key-value store that adheres to much of the BigTable specification; performance is key. On the other hand, Hive was designed for ad-hoc analytics. Pig is a query language for processing large datasets. One participant asked if the user must worry about persistence and file locking. Chris mentioned that Zookeeper allows the user to control some aspects of locking in Hadoop jobs. We did not go into much detail about these subprojects, though.
In the few minutes remaining, the speaker entertained a question about the NoSQL movement and Hadoop. The discussion centered on the three properties in the CAP theorem: Consistency, Availability, and Partition tolerance. Wensel stated that with big data, we absolutely must have partition tolerance, but for NoSQL, one of the other two conditions, consistency or availability, must be relaxed. For example, the filesystem used for S3, which is kind of a database, is eventually consistent, a relaxation of consistency. An ls in an S3 bucket may yield different results while data is being processed. A NoSQL system could instead drop availability. This means that at a partition event, the affected services wait until the data is consistent, and the system is unavailable for use during the period of inconsistency. There was also some discussion on Lucene and the companion projects Solr, Katta and Nutch.
Rest of Conference
At the beginning of the conference, there was a great expert panel that took questions from the audience. There was also time for companies to announce that they are hiring. Despite the terrible US and California economies, there is a ton of momentum in data mining. The best part of the panel was the sound bites. Joseph Rickert had some real zingers, and I could not agree more. My comments are in parentheses.
Rickert: “The thing about statisticians is that they don’t write good code.” (some of them really think they do; it’s funny)
Rickert: “Ask a statistician about a hash table and they have no idea what you’re talking about.” (yup!)
Dr. Dunning: “For data mining, software engineering is not as important. Working with big data and experience with big data is key.” (I completely agree, but try telling Google that.)
There were some door prizes; I did not win anything. Nothing to cry about though: 4 Microsoft 2GB flash drives, certificates for free e-books, a portable hula-hoop, light up bouncing balls, data dictionaries, and some weird snake bracelet that had its tongue sticking out.
What was Missing
The ACM Data Mining Camp was pretty complete. In hindsight, there were some things that I was expecting to hear about but did not see a session on. One of these was the visualization system Processing. I also did not see very much about working with network data. I was also expecting to see something about Scala, but it did not seem to come up.
There is one non-data mining thing I learned this past week. After meeting a Twitter friend at the event, and receiving an email from another Twitter friend over the weekend, it seems that Data Mining and Machine Learning people all have similar feelings about the Statistics and Computer Science “empires.” There is little communication between the two fields. It’s a shame because we could conquer the world if we combined minds.
For the entire four years I’ve been in grad school, I’ve felt all alone and wondered why at UCLA there is such a gap between CS and Statistics with respect to Data Mining. It just so happens that at UCLA, Computer Science gives data mining a better treatment. At many other schools it may be the exact opposite. It is great to know that I am not alone in my frustration!