My first time at ACM Data Mining Camp was so awesome that I was thrilled to make the trip up to San Jose for the November 2010 version. In July, I gave a talk at the Emerging Technologies for Online Learning Symposium with a faculty member in the Department of Statistics, at the Fairmont. The place was amazing, and I told myself I would save up to stay there. This trip gave me an opportunity to check it out and pretend that I am posh for a weekend. The night I arrived I had the best dinner and drinks at this place called Gordon Biersch. I had the best garlic fries and BBQ burger I have ever had. I ate it with a Dragonfruit Strawberry Mojito, the Barbados Rum Runner, and finished off with a Long Island Iced Tea, so the drinks were awesome as well. Anyway, to the point of this post…
The next morning I made the short trek to the PayPal headquarters for a very long 9am-8pm day. Since I came up here for the camp, I wanted to make the most of it and paid the $30 for the morning session, even though I had not intended on going originally.
Overview of Data Mining Algorithms with Dean Abbott
The paid morning session from 9-11:30 was led by Dean Abbott (@deanabb), the president of Abbott Analytics. It was an excellent overview of the basic data mining algorithms, but obviously 2 hours is not enough time to cover the algorithms in detail. When I first scanned through the slides I was concerned that I would be bored, but I actually learned a few things that made it worth it.
One of the first concepts I learned about was CHAID (Chi-squared Automatic Interaction Detection), a decision tree algorithm that can build wide n-ary trees rather than just binary trees as in CART. CHAID can also output a p-value, making diagnostic analysis more practical. I also did not know that decision trees could be used as a pre-analysis step to find interactions among variables. The output from this step can be used to construct better regression models that include the proper interaction terms.
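To make the chi-squared idea concrete, here is a minimal sketch in Python. It is not from the slides, and it only computes the Pearson chi-squared statistic between one categorical predictor and the outcome; a real CHAID implementation also merges categories and converts the statistic into an adjusted p-value. The function name and the toy data are made up for illustration.

```python
from collections import Counter

def chi_squared(pairs):
    """Pearson chi-squared statistic for a list of (category, label) pairs."""
    n = len(pairs)
    row = Counter(c for c, _ in pairs)   # counts per predictor category
    col = Counter(y for _, y in pairs)   # counts per outcome label
    obs = Counter(pairs)                 # observed counts per cell
    stat = 0.0
    for c in row:
        for y in col:
            expected = row[c] * col[y] / n
            stat += (obs[(c, y)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: category "a" is mostly positive, "b" mostly negative.
data = [("a", 1)] * 8 + [("a", 0)] * 2 + [("b", 1)] * 2 + [("b", 0)] * 8
print(chi_squared(data))
```

A tree builder in this style would compute the statistic for every candidate predictor and split on the one most strongly associated with the outcome.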
We moved on to linear regression and logistic regression, which were obviously very basic. Next, we spent some time discussing neural networks. It is no secret that I detest neural networks. I don’t know what it is, but they annoy me to no end. It seems like there is very little science behind how to choose quantities such as the parameters, the number of neurons, or the number of hidden layers. Maybe it is just me, but neural networks feel like a hack. Besides, anything that can be done with a neural network can be done using plain old statistics.
At this point, we had to start rushing, which was too bad. We briefly discussed ensemble methods including bagging, boosting (AdaBoost), and Random Forests. We spent about 5 minutes on unsupervised methods as well, including k-means and Kohonen maps (self-organizing maps). I am not sure what happened to principal components analysis (PCA), multidimensional scaling (MDS), or independent components analysis (ICA). As I mentioned to a friend, unsupervised learning always gets the shaft. We had slides for association rules (Apriori algorithm), but we did not have time to discuss them. I was hoping semi-supervised learning, reinforcement learning, and recommendation systems would be mentioned, but there was not enough time even for what was on the agenda.
I wish we had more time. Unfortunately, there were way too many questions and a few individuals who wished to waste minutes debating and challenging the speaker.
Dean Abbott teaches a full, two-day course (not free) in data mining that may be of interest. Click here for more information. I usually would not post something like this, but he is an excellent and practical speaker.
What I found a bit surprising was that this session was at a Data Mining event. I would hope that most of the people in attendance had familiarity with a good amount of the material. The Netflix session at the previous ACM Data Mining Camp seemed to align better with the target audience of the day’s events. On the other hand, there were a ton of people in the session. Perhaps this was a good money-maker for the Bay Area ACM, since some people may have gotten their training on in the morning and then left after lunch.
The eBay sponsored lunch was phenomenal, just like last time. I got a smoked ham sandwich and my little box also contained a bag of potato chips, an apple, and an oatmeal-raisin muffin looking thing (it was supposed to be a cookie but the baker got carried away).
Next up was the main session, which mainly consisted of a Q&A session with some experts in the field and also some job announcements from companies that sponsored the event.
- Somebody from SIGKDD announced the SIGKDD 2011 conference to be held in San Diego, CA in August 2011.
- A research engineer from eBay discussed the fact that many equate data mining with text mining and search. He drove home the point that at eBay, researchers are interested in other things such as social network analysis and valuing links.
- The Bayesian networks analysis tool BayesiaLab from Bayesia was introduced and the developers gave a shout-out to Judea Pearl over at UCLA. Dr. Pearl said about Bayesia, “This is good stuff!”
- LinkedIn talked about some of its new projects including CareerExplorer, which takes the professional graph and shows what a college student’s future career could potentially be. LinkedIn’s product team has engineers who specialize in machine learning, statistics, and data mining. They also host an “InDay” each month, which is essentially its version of a hackday. They also mentioned that LinkedIn is investing very heavily in Hadoop, and they just tripled the size of their Hadoop cluster.
- Netflix is hiring “like crazy” and expanding internationally. Data mining engineers work on Cinematch technology and other projects.
- Joseph Rickert from Revolution Analytics introduced the crowd to its commercial version of R.
- Salford Systems talked a bit about its products including CART and Random Forests.
- SAS was also present and mentioned that it is looking for people who want to publish their books on data mining with them.
Large Data with R
Given that I gave a talk to the Los Angeles R Users’ Group on working with large datasets in R, I figured this would be an enlightening session. Unfortunately, the R skills that were covered were very basic, and it was little more than a commercial for Revolution Analytics’ version of R. The takeaway from the session was basically just that the Revolution version has optimized methods that read the data into memory in chunks and operate on each chunk (perhaps) independently. This is nothing that a nice integration with Hadoop could not provide. No mention was made of the free open-source solutions for large datasets in R: bigmemory and ff.
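The chunking idea itself is simple enough to sketch. The snippet below is a generic Python illustration, not Revolution's implementation and not how bigmemory or ff work internally: it reads values a block at a time and keeps only running totals, so the full dataset never has to sit in memory. The helper names and the chunk size are arbitrary choices.

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def chunked_mean(values, size=1000):
    """Mean computed one chunk at a time, keeping only running totals."""
    total, count = 0.0, 0
    for block in chunks(values, size):
        total += sum(block)
        count += len(block)
    return total / count if count else float("nan")

print(chunked_mean(range(1, 101), size=7))
```

This only works directly for statistics that can be built from per-chunk summaries (sums, counts, cross-products); anything that needs all of the data at once takes more care.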
If I had a time machine, I would have instead attended Rob Zinkov’s talk on Sentiment Analysis. Rob is a second-year Ph.D. student in Computer Science at the University of Southern California’s Information Sciences Institute and a member of the Los Angeles R Users’ Group.
Next up was Ted Dunning discussing Mahout. I was elated to see practically every hand in the room shoot up when we were asked to vote on which sessions we wanted to attend. Mahout is a Java framework that provides scalable machine learning and data mining algorithms. Mahout code interacts with Hadoop to provide map-reduce functionality for algorithms. The purpose of Mahout is to provide early production-quality scalable data mining. Some methods currently in Mahout include mixture modeling, Latent Dirichlet Allocation (LDA), logistic regression, naive Bayes, Complementary Naive Bayes, latent factor loglinear algorithms, stochastic gradient descent SVM, and random forests. Some of these methods are parallel, and some are sequential. Large scale SVD is currently being worked on, and still has some rough edges.
The biggest news in this talk was how well Mahout has been snapped up by industry. AOL uses Mahout’s methods for its product recommendation services. “A large scale electronics company” (name was secret) uses Mahout for music recommendations. Other uses of Mahout in industry include frequent itemset mining, and spam website detection.
Dunning mentioned that Mahout does seem to work well with sparse matrices, assuming that an unspecified element of the matrix is equal to 0. If I understood his statement correctly, this means that Mahout works well with most sparse matrices. Some more technical gems I learned are that Mahout can do stochastic gradient descent (although it is sequential), and that its implementation uses per-term annealing, which can then be used for supervised learning with logistic regression. These implementations optimize for high-dimensional sparse data, possibly with interactions. These methods are scalable and fast to train. Ted mentioned that for a particular test case, the optimization converged in fewer than 10,000 examples. For large datasets, it is possible that the method will converge before seeing all of the data. With that said, in the “best” case, an algorithm using stochastic gradient descent can be sublinear in the number of training examples.
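Here is a rough Python sketch of what stochastic gradient descent for logistic regression on sparse data can look like. It is not Mahout's code. Each example is a dict holding only its nonzero features (anything missing is treated as 0), and "per-term annealing" is interpreted here as giving every feature its own learning rate that shrinks with the number of times that feature has been updated; the `base_rate / sqrt(count)` schedule and the toy data are assumptions of mine.

```python
import math
from collections import defaultdict

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_sgd(examples, base_rate=0.5, epochs=1):
    """One example at a time; only the features present get touched."""
    weights = defaultdict(float)
    seen = defaultdict(int)          # per-feature update counts
    for _ in range(epochs):
        for features, label in examples:
            z = sum(weights[f] * v for f, v in features.items())
            error = label - sigmoid(z)
            for f, v in features.items():
                seen[f] += 1
                rate = base_rate / math.sqrt(seen[f])   # assumed schedule
                weights[f] += rate * error * v
    return weights

def predict(weights, features):
    return sigmoid(sum(weights.get(f, 0.0) * v for f, v in features.items()))

# Hypothetical toy data: two sparse examples repeated many times.
examples = [({"bias": 1.0, "spam_word": 1.0}, 1),
            ({"bias": 1.0, "ham_word": 1.0}, 0)] * 50
weights = train_sgd(examples)
print(predict(weights, {"bias": 1.0, "spam_word": 1.0}))
```

Because each update only touches the features that appear in the current example, the cost per example depends on the number of nonzero entries rather than the full dimension, which is why this style suits high-dimensional sparse data.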
Towards the end of the session, Ted answered some questions personally, and it gave me some insight into data mining methods. He is not a fan of “most common” itemset algorithms (Apriori, Eclat, etc.) because they are difficult to parallelize due to their quadratic nature. Instead, he prefers co-occurrence analysis methods. He also prefers R to Weka, and he loves Python. I also prefer R to Weka, and love Python.
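For a sense of what co-occurrence analysis starts from, here is a tiny Python sketch that just counts how often each pair of items appears in the same basket. This is only the raw counting step and my own illustration; Ted did not show code, and a practical method would go on to score the pairs with some statistical test rather than use raw counts. The basket data is made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and fixes the pair ordering
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["eggs", "milk"]]
counts = cooccurrence(baskets)
print(counts.most_common(2))
```

Counting is additive across baskets, so each machine can count its own share of the data and the partial counters can simply be summed.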
Large Scale Supervised Learning
The next talk I attended was rehearsed with slides etc. and was presented by Junling Hu, from eBay. Junling has a Ph.D. in Computer Science from the University of Michigan, Ann Arbor. Although the talk began with (another) quick review of data mining algorithms, the meat of the talk was on how to parallelize some of these algorithms. The challenge of parallelization is that we must maintain a global function, and messages must be passed to update this global function. One basic way to do this is how map-reduce does it: split the data into subsets, perform some function on each subset, and reduce the computations into one result. Each method has its own way it can be parallelized.
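The split / compute / reduce pattern, together with a "global" quantity that gets updated from the combined result, can be sketched in a few lines of Python. This runs in a single process and is not from the talk; the one-parameter model y = w * x, the learning rate, and the function names are all made up for illustration.

```python
from functools import reduce

def partition(rows, n_parts):
    """Split the data into n_parts subsets."""
    return [rows[i::n_parts] for i in range(n_parts)]

def partial_gradient(rows, w):
    """Map step: gradient of the squared error over one subset."""
    return sum(2.0 * (w * x - y) * x for x, y in rows)

def fit_slope(rows, n_parts=3, rate=0.01, steps=200):
    w = 0.0                                   # the global state
    parts = partition(rows, n_parts)
    for _ in range(steps):
        grads = [partial_gradient(p, w) for p in parts]    # map
        total = reduce(lambda a, b: a + b, grads, 0.0)     # reduce
        w -= rate * total / len(rows)                      # update global state
    return w

# Hypothetical toy data lying exactly on y = 3x.
rows = [(x, 3.0 * x) for x in range(1, 7)]
print(fit_slope(rows))
```

In a real cluster the list comprehension would be the part that runs on separate machines, and only the small partial gradients and the updated `w` would need to travel over the network.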
Decision Trees. One type of parallelization for decision trees is based on the tree nodes. One can write a map-reduce function to recursively split nodes, like the PLANET method proposed by Google. We have some sequence of nodes in a tree and we maintain a model. With the proposed framework, we start with an empty set of nodes. We maintain a map-reduce queue for all the nodes we are going to divide and we also maintain an in-memory queue of nodes. The goal is to find the best splits based on some measure, in PLANET’s case, variance. We run some controller that controls map-reduce jobs. Each map-reduce job sends back data and we update the global variables: the model, the map-reduce queue and the in-memory queue. Then, new map-reduce jobs are constructed and the process continues. Due to time constraints, I had a difficult time following all of what was going on, but more information about the algorithm discussed can be found here.
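As far as I could follow, the key trick is that the quality of a split can be computed from small summaries that are easy to add up across data blocks. The Python sketch below shows only that piece for a single node and a single numeric feature: each block reports (count, sum, sum of squares) of the target on each side of every candidate threshold, the summaries are merged, and the threshold with the smallest total squared error wins. It is a simplified illustration with made-up names and data, not the PLANET algorithm itself, which also manages the queues and the controller described above.

```python
def stats(values):
    """Mergeable summary of a list of targets: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sse(stat):
    """Sum of squared deviations from the mean, from (n, sum, sum_sq)."""
    n, s, ss = stat
    return ss - s * s / n if n else 0.0

def map_block(block, thresholds):
    """Map step: per-threshold summaries for one block of (x, y) rows."""
    out = {}
    for t in thresholds:
        left = [y for x, y in block if x <= t]
        right = [y for x, y in block if x > t]
        out[t] = (stats(left), stats(right))
    return out

def best_split(blocks, thresholds):
    """Reduce step: merge summaries, then pick the lowest-error threshold."""
    empty = (0, 0.0, 0.0)
    totals = {t: (empty, empty) for t in thresholds}
    for block in blocks:
        for t, (l, r) in map_block(block, thresholds).items():
            totals[t] = (merge(totals[t][0], l), merge(totals[t][1], r))
    return min(thresholds, key=lambda t: sse(totals[t][0]) + sse(totals[t][1]))

# Hypothetical toy data split across two blocks; y jumps when x exceeds 5.
blocks = [[(1, 1.0), (2, 1.2), (8, 5.0)], [(3, 0.9), (9, 5.1), (10, 4.9)]]
print(best_split(blocks, [2, 5, 9]))
```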
Support Vector Machines. Hu mentioned two types of SVMs: primal SVM and parallel SVM. The idea behind parallelizing SVMs is to use block-based optimization by dividing either the data, or the features, into blocks. Stochastic gradient descent can be used for block minimization for the primal SVM. Some other ways mentioned included randomly splitting the data (bootstrapping perhaps), or using data compression (dimension reduction, perhaps). One resource for parallel SVM is the psvm project on Google Code which provides distributed gradient computation for maximum entropy, and parallel logistic regression.
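To give a flavor of stochastic gradient descent on the primal SVM with the data divided into blocks, here is a small Python sketch. It is not what was presented and not the psvm code: it swaps in a Pegasos-style subgradient step (step size 1 / (lambda * t)) and simply walks through the blocks one after another in a single process. The regularization value, epoch count, and toy data are arbitrary.

```python
def train_primal_svm(blocks, dim, lam=0.01, epochs=50):
    """Linear SVM (no bias term) trained by stochastic subgradient steps."""
    w = [0.0] * dim
    step = 0
    for _ in range(epochs):
        for block in blocks:              # one block of data at a time
            for x, y in block:            # labels y are -1 or +1
                step += 1
                rate = 1.0 / (lam * step)
                margin = y * sum(wi * xi for wi, xi in zip(w, x))
                # shrink from the regularizer, then hinge-loss correction
                w = [wi * (1.0 - rate * lam) for wi in w]
                if margin < 1.0:
                    w = [wi + rate * y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# Hypothetical, linearly separable toy data in two blocks.
blocks = [[([2.0, 1.0], 1), ([-1.0, -2.0], -1)],
          [([1.0, 2.0], 1), ([-2.0, -1.0], -1)]]
w = train_primal_svm(blocks, dim=2)
print([classify(w, x) for block in blocks for x, _ in block])
```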
Junling listed a few resources:
- ML-MapReduce, routines for machine learning using map-reduce (currently only logistic regression).
- IBM Parallel Machine Learning Toolbox, which is blackbox and not open-source.
The final time slot was slim pickings for me, and the second time slot (when I attended Mahout) hosted 4 sessions I wanted to attend that all conflicted with each other. The discussion about the association between tweets and stock prices sounded interesting, except for the stock prices part. So, I attended the Monetizing Images session. This session was more of a discussion about data mining with images in general.
- Apparently, Facebook can identify brands in images and use these brands for advertising. I have not seen this but I do not really doubt that it is true.
- Microsoft Photosynth crawls freely available images on the web and uses these images to create an entire scene from them, essentially allowing someone to tour Rome using just images on the web.
- The United States Postal Service uses k-nearest neighbors for intelligent character recognition (ICR) used to read addresses on envelopes.
- Google Goggles
- ZunaVision allows advertisers to embed logos and ads into a video with more flexibility than with things like green screens used on football fields etc.
We also discussed forensic photography and the ability to detect if an image has been doctored. We also discussed some techniques for measuring image similarity. David Lowe from the University of British Columbia maintains a list of uses and companies regarding computer vision on his website.
At this point I was exhausted. I like meeting Twitter friends and followers, but people were very quiet! It was a pleasure to meet Scott Waterman (@tswaterman) and Tommy Chheng (@tommychheng). I also got to reconnect with my friend Shaun Ahmadian (@ssahmadian) from the UCLA Department of Computer Science as well as Rob Zinkov (@zaxtax), who also made the trek from Los Angeles to San Jose.