(This article was first published on **Byte Mining » R**, and kindly contributed to R-bloggers)

I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable and interesting aspects of the conference were the breakout sessions.

**Keynotes**

KDD 2011 featured several keynote speeches, spread across three days and scheduled at various times of day. This year’s conference had a few big names.

*Stephen Boyd, Convex Optimization: From Embedded Real-Time to Large-Scale Distributed.* The first keynote, by Stephen Boyd, discussed convex optimization. The goal of convex optimization is to minimize some objective function subject to constraints. The caveat is that the objective function and all of the constraints must be convex (“non-negative curvature” as Boyd said). Linear programming is the best-known special case, and the appeal of casting a problem in convex form is that it can then be solved about as reliably as a linear program. We should care about convex optimization because it comes with some beautiful and complete theory, such as duality and optimality conditions. I must say that whenever I am chastising statisticians, I often say that all they care about is “beautiful theory,” so his comment was humorous to me. Convex optimization is a very intuitive way to think about regression and techniques such as the lasso. It has tons of use cases, including parameter estimation (MLE, MAP, least squares, lasso, logistic regression, SVM and modern L1 optimization). Boyd also showed an example of convex optimization for disk head scheduling.
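
To make the lasso connection concrete, here is a minimal sketch in Python. It is my own toy example, not something from Boyd's slides. It minimizes the lasso objective (least squares plus an L1 penalty) by cyclic coordinate descent. The function names, the tiny dataset and the penalty value `lam=0.5` are all made up for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            # Mean squared value of feature j (the curvature along this coordinate).
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w


# Toy data: y = 2.0 * x1 + 0.1 * x2, with two orthogonal features.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]

weights = lasso_coordinate_descent(X, y, lam=0.5)
print(weights)  # approximately [1.5, 0.0]
```

The small second coefficient is driven exactly to zero while the large one is only shrunk, which is the sparsity that makes the lasso attractive.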

For more information about convex optimization, see the website for *Convex Optimization* by Boyd and Vandenberghe. The book is available for free, as are the lecture slides. Even better, the second author is from UCLA! I did not realize that.

*Peter Norvig, Internet Scale Data Analysis.* It is always great to hear from Peter Norvig. At the very least, you may have seen his name on your introductory Artificial Intelligence textbook, *Artificial Intelligence: A Modern Approach*. Norvig is also well known as the Director of Research at Google. He also spoke at SciPyCon 2009 and was wearing a similarly flashy shirt. Norvig discussed how to get around long latencies in a large-scale system. Interestingly, his talk began with a discussion about Google’s interest in its carbon footprint because, of course, all of Google’s massive systems require a lot of power. The carbon output of 2,500 queries is approximately equal to the carbon output of one beer. Norvig noted that most of Google’s most successful engineers are well versed in distributed systems, and this should come as no surprise. He then introduced MapReduce and showed an example of how Google uses MapReduce to process map tiles for Google Maps. Norvig concluded by mentioning a variety of large systems used by Google, including BigTable (a column-oriented store) and Pregel for graph processing. Pregel is vertex based, and thus programs “think like a vertex,” where each vertex responds to messages transmitted over edges.
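
To give a flavor of what “think like a vertex” means, below is a toy single-machine sketch of the vertex-centric style, in which every vertex ends up holding the largest value in the graph. This only illustrates the idea and is not Google's actual Pregel API. The function name `propagate_max` and the little four-node graph are hypothetical.

```python
def propagate_max(graph, values):
    """Vertex-centric max propagation.

    graph:  dict mapping each vertex to a list of its out-neighbours
    values: dict mapping each vertex to its starting number
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for u in neighbours:
            inbox[u].append(values[v])

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v in graph:
            # Per-vertex logic: look only at my own value and my own messages.
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])
                for u in graph[v]:
                    outbox[u].append(values[v])
            # Otherwise the vertex stays quiet (it "votes to halt").
        inbox = outbox
    return values


# A chain a - b - c - d, stored as directed edges in both directions.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}

result = propagate_max(graph, start)
print(result)  # every vertex ends with 6
```

No vertex ever sees the whole graph. Each one reacts only to the messages that arrived over its edges, which is what lets this style of program be spread over many machines.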

(There was a keynote by a fellow named David Haussler about cancer genomics. After an exhausting first two days, I skipped this talk as I needed to sleep…and I was not incredibly interested in the topic.)

*Judea Pearl, The Mathematics of Causal Inference.* Go Bruins! Judea Pearl is a professor at the UCLA Department of Computer Science and teaches a course on his field, Causality, each spring. His talk was essentially the same talk he gives at UCLA at the beginning of the quarter. I attempted to take his course in 2009, but quite frankly, I don’t get it and my mind cannot bend into that realm. I remember sitting in his class and wondering “what is wrong with me?” I love listening to Dr. Pearl speak only because of his sense of humor. Despite his age and the fact that he is slowing down, he had the crowd in hysterics as he struggled with the presentation technology and made intelligent jokes at every chance.

Pearl believes that humans do not communicate with probability, but with causality (I do not agree with this entirely). I appreciated that he mentioned that it takes work to overcome the difference in thinking between probability and causality. In statistics, we use some data and a joint distribution *P* to make inferences about some quantity or variable. In causality, there is an intentional intervention that changes the joint distribution *P* into another joint distribution *P’*. Causality requires new language and mathematics (I do not see it). In order to use causality, one must introduce some untestable hypothesis. Pearl mentioned that some non-standard mathematical methods include counterfactuals and structural equation modeling. I do not know how I feel about any of this. For more information about Pearl’s Causality, check out his book.
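
For what it is worth, here is a toy sketch of the *P* versus *P’* idea as I understand it. The three-variable model (a confounder `Z` that influences both a treatment `X` and an outcome `Y`) and all of the probabilities are made up for illustration and are not from Pearl's talk. The code builds the joint distribution twice, once as observed and once with `X` forced to 1, and compares the probability that `Y = 1`.

```python
from itertools import product

# Hypothetical model: Z -> X, Z -> Y and X -> Y, all variables binary.
P_Z = {0: 0.5, 1: 0.5}            # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(X = 1 | Z = z)


def p_y1_given_xz(x, z):
    """P(Y = 1 | X = x, Z = z) for the made-up model."""
    return 0.1 + 0.2 * x + 0.6 * z


def joint(do_x=None):
    """Return {(z, x, y): probability}. If do_x is set, X is forced to that value."""
    table = {}
    for z, x, y in product((0, 1), repeat=3):
        if do_x is None:
            # Observational world: X still listens to Z.
            p_x = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
        else:
            # Intervention: the arrow from Z into X is cut.
            p_x = 1.0 if x == do_x else 0.0
        p_y1 = p_y1_given_xz(x, z)
        p_y = p_y1 if y == 1 else 1 - p_y1
        table[(z, x, y)] = P_Z[z] * p_x * p_y
    return table


def prob_y1_given_x1(table):
    """P(Y = 1 | X = 1) computed from a joint table."""
    numerator = sum(p for (z, x, y), p in table.items() if x == 1 and y == 1)
    denominator = sum(p for (z, x, y), p in table.items() if x == 1)
    return numerator / denominator


observed = prob_y1_given_x1(joint())          # under P:  about 0.78
intervened = prob_y1_given_x1(joint(do_x=1))  # under P': about 0.60
print(observed, intervened)
```

The two numbers differ because in the observed data `X = 1` tends to come with `Z = 1`, which raises `Y` on its own. Forcing `X` removes that association.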

**Data Mining Competitions**

One interesting event during KDD 2011 was the panel *Lessons Learned from Contests in Data Mining.* This panel featured Jeremy Howard (Kaggle), Yehuda Koren (Yahoo!), Tie-Yan Liu (Microsoft), and Claudia Perlich (Media6Degrees). Both Kaggle and Yahoo *run* data mining competitions: Kaggle has its own series of competitions and Yahoo is a major sponsor of the KDD Cup competition. Perlich has participated in and won many data mining competitions. Liu provided a different insight into data mining competitions as an industry observer.

Jeremy Howard gave some insight into the history of data mining competitions. He credited KDD 97 with the formation of the first data mining competition. He announced to the crowd that companies spend 100 billion dollars every year on data mining products and services (not including in-house costs such as employment) and that there are approximately 2 million Data Scientists. The estimate of the number of Data Scientists was based on the number of times R has been downloaded, as reported in a blog post by David Smith (Revolution Analytics). I love R, and every Data Scientist should use it, but there are several problems with this estimate. First, not everyone who uses R is a Data Scientist; a large portion of R users are statisticians (“beautiful theory”), teachers, miscellaneous students etc. Second, not all Data Scientists use R. Some are even more creative and write their own tools or use little-adopted software packages. There are also a lot of Data Scientists who use Python instead of R. Howard also announced that over the next year, Kaggle will be starting 1000s of “invitation only” competitions. Personally, I do not care for this type of exclusion even though their intentions are good.

Yehuda Koren introduced the crowd to Yahoo’s involvement in data mining competitions. Yahoo is a major force behind the KDD Cup and the Heritage Foundation competition. Yahoo also won a progress award in the Netflix challenge. Koren then described how data mining competitions help the community. Competitions raise awareness and attract research to a field, usually involve the release of a cool dataset to the community, encourage contribution and education, and provide publicity for participants and winners. Contestants are attracted to competitions for various reasons including fun, competitiveness, fame, the desire to learn more, peer pressure and of course the monetary reward. As with every competition, data mining competitions have rules, and Koren stated that rules are very difficult to enforce. I believe that data mining is vague as it is, so competitions would be just as vague. The rules should restrict participation as little as possible while still maximizing fairness and innovation. Some such “rules” cover huge ensembles (discouraged, since they probably overfit anyway), submission frequency, team duplication and team size (the KDD Cup winning team had 25 members). Some obvious keys to success in data mining competitions are ensembles, hard work, team size, innovation vs. fancy models, quick coding and patience.

I felt that Tie-Yan Liu from Microsoft sort of served as the Simon Cowell of the panel, and I feel that his role was necessary. He offered industry insight that provided a bit of a reality check as to what data mining competitions accomplish and do not accomplish. Liu questioned whether the problems being solved in data mining competitions are really important problems. Part of the problem is that many datasets are censored so as to protect privacy. Additionally, the really interesting problems cannot be opened to the public because they involve trade secrets. I consider myself an inclusive guy; I do not like the concept of winners and losers. I was elated that Liu brought up this point: “what about the losers?” Is it bad publicity to “lose” several (or all) competitions? The answer to this question varies from person to person. I honestly believe that the goal of these competitions is of the open-source nature (fun, share, learn, solve) and not so much to cure cancer. They are great for college students and for people who are interested in data science but do not have access to great data. For the rest of us, learning on our own using interesting data is probably better.

Claudia Perlich (Media6Degrees) discussed her experience participating in data mining competitions. She has won several contests. She commented on the distinction between sterile/cleaned data and real data, as competitions can include either type. The concept of Occam’s Razor applies to data mining competitions; Perlich won most of her competitions using a linear model, but with more complex and creative features. Perlich emphasized that complex features are better than complex models.
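
As a toy illustration of “complex features beat complex models” (my own example, not one from Perlich's talk), the sketch below trains the same plain perceptron twice on the XOR problem. With only the raw inputs, no linear model can separate the classes. With one extra hand-built feature, the product of the two inputs, the same linear model gets everything right. The function names and the number of epochs are arbitrary choices.

```python
def predict(x, w, b):
    """Linear decision rule: 1 if w . x + b is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(rows, labels, epochs=50):
    """Plain perceptron with a bias term; returns (weights, bias)."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(rows, labels):
            error = target - predict(x, w, b)  # -1, 0 or +1
            w = [wi + error * xi for wi, xi in zip(w, x)]
            b += error
    return w, b


def accuracy(rows, labels, w, b):
    """Fraction of rows that the linear rule classifies correctly."""
    hits = sum(predict(x, w, b) == t for x, t in zip(rows, labels))
    return hits / len(rows)


xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]

# Same model, richer features: add the interaction term x1 * x2.
with_product = [(x1, x2, x1 * x2) for x1, x2 in xor_inputs]

raw_acc = accuracy(xor_inputs, xor_labels, *train_perceptron(xor_inputs, xor_labels))
feat_acc = accuracy(with_product, xor_labels, *train_perceptron(with_product, xor_labels))
print(raw_acc, feat_acc)  # the second value is 1.0; the first cannot reach 1.0
```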

Considering that the Netflix Prize has been one of the biggest data mining competitions, I was disappointed that Netflix was not represented on the panel, since there were several researchers from Netflix at the conference.

*Rather than write a few sentences for each topic, I will just bullet the goals of the research discussed in the sessions. Descriptions with a star (*) denote my favorite papers and are cited later.*

**Text Mining**

I attended two of the three text mining sessions. I must say that I am quite topic-modeled and LDAed out! Latent Dirichlet Allocation (LDA) and several variations were part of every talk I heard. That was very exciting and reaffirms that I am in a hot field. Still, nobody has taken my dissertation topic yet (which I have remained quiet about).

- Using explicit user feedback to improve LDA and display topics appropriately by combining topic labels, topic n-grams and capitalization/entity detection.* This talk was presented by David Andrzejewski (@davidandrzej). I finally got to meet him and I discussed my dissertation topic with him. I am always entertained by the fact that we all look much different than our Twitter avatars portray.
- Using external metadata and topics (LDA) to predict user ratings on items using localized factor models.
- Using preferences and relative emphasis of each factor (i.e. how important to you is free wireless Internet in a hotel room?) to predict rating scores.*
- Determining the network process that created a piece of text: who copied from whom?
- Using a topic model (LDA) with other features such as part-of-speech tag (noun, verb etc.), WordNet features, sentiment/polarity etc.*
- Modeling how topics and interests grow over time and understanding the correlations between terms over time.*

**Social Network Analysis and Graph Analysis**

The Social Networks session conflicted with one of the Text Mining sessions, but since I knew there would be two more, I decided to attend this one instead. I also combined the two Graph Analysis sessions into this section since they are so related. The goals of the research presented in these talks were as follows:

- To label venue types (restaurant, bar, park etc.) for venues such as those on Foursquare, based on several attributes of the user (the user’s friends and the user’s weekly and daily schedule), using label propagation.
- To determine the connections/edges in a social network that are the most critical for propagation of data (an idea, tweet, viral marketing etc.)*
- To use tagging (items on Amazon can be tagged with keywords by users) and reviews to predict the success of a new item.
- To find a better metric for ranking search engine results by starting with a relevant subgraph rather than a random surfer model. The metric also models the attention span of the user.*
- Classification of nodes, labeling of nodes and node link prediction using one unified algorithm (C3).*
- Ranking on large graphs using a priori information about good/bad nodes and edges.*
- The importance of bias in sampling from networks.*

**User Modeling**

I suspect this session was similar to the Web User Modeling session. It focused on recommendation engines and rating prediction.

- Using endorsements (retweets, likes, etc.) to measure user bias and perform real-time sentiment analysis.
- Estimating user reputation using thumbs-up vote rates on Yahoo News comments.
- Selecting a set of reviews that encapsulates the most information about a product with the most diverse viewpoints.

**Frequent Sets**

I did some work with itemset mining at my last job and I was not incredibly interested in the Online Data and Streams session at the time so I attended this talk.

- Using background knowledge about transactions to minimize redundancy.
- Studying the effects of order on itemset mining.
- Mining graphs as frequent itemsets from streams.

**Classification**

I got stuck in this session because the session I really wanted to attend, “Web User Modeling,” was full and there was nowhere to sit or stand. This session was more technical and theoretical. The only talk that I really enjoyed was about a classifier called CHIRP. I did not follow the details, but this is a paper that I am interested in reading. The authors used a classifier based on Composite Hypercubes on Iterated Random Projections to classify spaces that have complex topology (think of classifying items that appear in a bullseye/dartboard pattern).*

**Unsupervised Learning**

This talk was similar to the classification talk but more practical in my opinion.

- Using decision trees for density estimation classifiers.
- Clustering cell phone user behavior using “Earth Mover” distance.
- Clustering of multidimensional data using mixture modeling with components of different distributions and copulas.*
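
For readers who have not run into the “Earth Mover” distance mentioned in the list above, here is a minimal sketch of the one-dimensional case. It is not the method from the paper, just the textbook idea: treat two histograms over the same ordered bins as piles of dirt and measure how much dirt has to be carried, and how far, to turn one into the other. The function name and the example histograms (imagine call counts in morning, afternoon and evening) are hypothetical.

```python
def earth_mover_distance_1d(hist_a, hist_b):
    """Earth Mover distance between two histograms over the same ordered bins.

    Both histograms are normalized to total mass 1, and adjacent bins are
    taken to be one unit apart.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    carried = 0.0   # surplus mass carried from the bins seen so far
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a / total_a - b / total_b
        distance += abs(carried)  # cost of moving that surplus one bin over
    return distance


morning_caller = [1, 0, 0]
afternoon_caller = [0, 1, 0]
evening_caller = [0, 0, 1]

print(earth_mover_distance_1d(morning_caller, afternoon_caller))  # 1.0
print(earth_mover_distance_1d(morning_caller, evening_caller))    # 2.0
```

Unlike a bin-by-bin comparison, this distance knows that a morning caller is closer to an afternoon caller than to an evening caller, which is a useful property when clustering behavior over time.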

**Favorite Papers**

Below is a short bibliography of my favorite papers. There were also a few at the poster session (the first four) that I include here.

- *Ranking-Based Classification of Heterogeneous Information Networks*, Ming Ji, Jiawei Han, Marina Danilevsky.
- *Axiomatic Ranking of Network Role Similarity*, Ruoming Jin, Victor E. Lee, Hui Hong.
- *Approximate Kernel k-means: Solutions to Large Scale Kernel Clustering*
- *User-Level Sentiment Analysis Incorporating Social Networks*, Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, Ping Li.
- *Latent Topic Feedback for Information Retrieval*, David Andrzejewski, Lawrence Livermore National Laboratory; David Buttler, Lawrence Livermore National Laboratory
- *Latent Aspect Rating Analysis without Aspect Keyword Supervision*, Hongning Wang, UIUC; Yue Lu, University of Illinois; ChengXiang Zhai, UIUC
- *Conditional Topical Coding: an Efficient Topic Model Conditioned on Rich Features*, Jun Zhu, Carnegie Mellon University; Ni Lao, Carnegie Mellon University; Ning Chen, Tsinghua University; Eric Xing, CMU
- *Tracking Trends: Incorporating Term Volume into Temporal Topic Models*, Liangjie Hong, Lehigh University; Dawei Yin, Lehigh University; Jian Guo, University of Michigan; Brian Davison, Lehigh University
- *Diversity in ranking via resistive graph centers*, Kumar Dubey, IBM Research; Soumen Chakrabarti, Indian Institute of Technology, Bombay; Chiru Bhattacharya, IISc
- *Collective Graph Identification*, Galileo Namata, University of Maryland; Stanley Kok, University of Maryland; Lise Getoor, University of Maryland, College Park
- *Semi-Supervised Ranking on Very Large Graph with Rich Metadata*, Bin Gao, Microsoft Research Asia; Tie-Yan Liu, Microsoft Research Asia; Wei Wei; Taifeng Wang, Microsoft Research; Hang Li, Microsoft
- *Benefits of Bias: Towards Better Characterization of Network Sampling*, Arun Maiya, UIC; Tanya Berger-Wolf, University of Illinois at Chicago
- *CHIRP: A new classifier based on Composite Hypercubes on Iterated Random Projections*, Leland Wilkinson, Systat; Anushka Anand, UIC; Tuan Dang, UIC
- *Sparsification of Influence Networks*, Michael Mathioudakis, University of Toronto; Francesco Bonchi, Yahoo! Research; Carlos Castillo, Yahoo!; Aristides Gionis, Yahoo! Research Barcelona; Antti Ukkonen
- *Online heterogeneous mixture modeling with marginal and copula selection*, Ryohei Fujimaki, NEC Laboratories America; Yasuhiro Sogawa; Satosi Morinaga

**Wrapping Up**

I had an awesome time at KDD and wish I could go next year, but it will be held in Beijing. I got to meet a lot of different people in the field who have the same passion for data, and that was really cool. I also got to meet with recruiters from a few different companies and get some swag from Yahoo and Google.

It was awesome being around such greatness. I ran into Peter Norvig several times, ran into Judea Pearl in the restroom (I already know him), as well as Christos Faloutsos (I am a huge fan) and Ross Quinlan. I stopped at the Springer booth and found a cool book about link prediction with Faloutsos as one of the authors. I went to buy it, handed the lady my credit card, and learned that it was $206 (AFTER conference discount)! Interestingly… Amazon has the same book for $165. I will probably order it anyway.

Here’s hoping that KDD returns to California (or the US) real soon!

**Candid Shots**

Ross Quinlan enjoying a beer during the poster session. What a cool guy!

Christos Faloutsos talking with a student during the poster session.
