This week, we continue the parallel themes of deep learning and natural language processing. Last week I mentioned some papers that use deep learning for NLP. In deep learning, these tasks are modeled as a prediction problem, which is why such an extensive training set is required. I think it’s important to remember this amid the flurry of sensationalist headlines around deep learning. While I don’t think anyone believes that these systems are actually “threatening” humans or are “self-aware”, it troubles me that these headlines can feed the paranoia of high-profile people warning against superintelligent AIs. Besides, Isaac Asimov solved this decades ago with a bit of computer science mischief and a Zeroth Law of Robotics.
Spark + H2O
The preview release of Spark 1.6 was announced a few weeks back. It appears that the Databricks model is to give its cloud customers access before the general public. For the technically minded, this isn’t a huge issue for two reasons: 1) if you’re using Spark seriously, it’s better for you that someone else is beta testing and working the kinks out; 2) you can build the release yourself from the source code. The Databricks crew is already onto the first release candidate, so you can probably get a fairly stable build at this point.
Judging from the source code, R users ain’t getting no love this time around. It seems that the bindings for MLlib still only support generalized linear models (GLM). Hence, the biggest strength of Spark for R users is around data collection and processing (aka cleaning, munging, wrangling) prior to conducting an analysis. To model large datasets, it looks like you need to look elsewhere.
H2O / Sparkling Water
People anticipating the full release of Spark 1.6 can tickle their fancy with a different announcement on their blog: integration between Spark and H2O with the cutesy Sparkling Water integration library. This looks pretty interesting and should work in a local instance of Spark, since Sparkling Water is on Maven.
Don’t get too excited about using deep learning from within Spark, though. A close reading of the post shows that the example is using a bag-of-words approach and calculating the TF-IDF for each message. This in turn is being fed into a neural network to classify the results. In other words, it’s essentially a logistic regression over TF-IDF features, and there’s really no need to use a neural network or “deep learning” for this scenario. Serious deep learning applications of NLP use a different form of encoding, such as word2vec with its continuous bag-of-words (CBOW) and skip-gram models. The TensorFlow site has an excellent tutorial on these methods. You’ll notice that even word2vec uses a logistic regression (doh!) but the difference is in what’s being compared. Bag-of-words approaches such as TF-IDF are context-ignorant, whereas word2vec, et al. are locally context-aware. The recurrent architecture of the neural networks gives them longer-term context awareness as well.
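To make the context-ignorant versus locally context-aware distinction concrete, here is a minimal sketch in plain Python. It is not the code from the Sparkling Water post, and it does not train anything: the helper names (tf_idf, skip_gram_pairs), the toy sentences, and the simple tf * log(N / df) weighting are all assumptions chosen for illustration. It only shows what each encoding gets to see.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    This is a simplified textbook weighting (an assumption for this sketch),
    not the exact formula used by any particular library.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def skip_gram_pairs(text, window=1):
    """Return (target, context) pairs for words within `window` positions."""
    tokens = text.lower().split()
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Toy corpus (hypothetical): the first two sentences share the same words
# but mean opposite things.
docs = ["the dog bit the man", "the man bit the dog", "a cat chased a mouse"]
vectors = tf_idf(docs)

# Bag-of-words discards word order, so both sentences get the same vector.
same_tfidf = vectors[0] == vectors[1]

# Skip-gram training pairs record which words are neighbours, so they differ.
same_pairs = sorted(skip_gram_pairs(docs[0])) == sorted(skip_gram_pairs(docs[1]))

print(same_tfidf, same_pairs)  # prints: True False
```

The classifier sitting on top of TF-IDF cannot tell who bit whom, because both sentences arrive as the same vector. A skip-gram model at least sees that "dog" sits next to "bit" in one sentence and "man" sits next to "bit" in the other.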
What’s nice about H2O is that they have a much deeper integration with R, as demonstrated in the H2O tuning guide.
Natural Language Processing Buyer’s Guide (but not really)
Evolution has not quite landed in the NLP ecosystem. You could spend a year just learning about all the toolkits and frameworks out there. So what do you do if you just need something to work? First off, know what you need. Do you need a part-of-speech (POS) tagger, named entity recognition (NER), co-reference resolution, sentiment analysis, or language realization? Did any of that make sense to you? If not, you might consider using a high-level API like AlchemyAPI. Be aware that what you get in convenience, you lose in transparency. In other words, it’s difficult to reproduce results and know why you’re getting a particular result. Depending on your need, you may or may not care about this. As an exercise, have your friendly neighborhood data scientist compare the sentiment analysis of AlchemyAPI to an open source alternative.
Assuming you’re going down the rabbit hole, there are three first-order toolkits that I tend to consider: Stanford CoreNLP (Java), OpenNLP (Java) and NLTK (Python). If you want state of the art, CoreNLP and NLTK are good choices. I like NLTK, but I’ve noticed that complete grammars for English aren’t provided, so it’s harder to use out-of-the-box for parsing and realization applications. That’s not the case with OpenNLP, but the drawback here is the general verbosity of Java, i.e. it takes longer to get things set up. Worse, the R binding for OpenNLP has integration issues. The biggest problem is that sending a ^C (control C) passes the interrupt signal on to Java, which then terminates your R session! Of course, you can always look at it as an excuse for a second lunch.
Next week, I’ll follow up more on the different packages and why you might pick one over another.
As mentioned previously, the onslaught of self-contained distributed computing systems is largely possible due to functional programming. For those interested in learning more about using functional programming for data science, take a look at my forthcoming book, Modeling Data With Functional Programming In R. I just posted the latest draft that includes a chapter on using lists as a general data structure. I include an implementation of random forests using trees in about 30 lines of code.
My theory on why people like Bill Gates, Stephen Hawking, and Elon Musk are so afraid of AI is that they can’t beat it at chess. For us East Asians, AI has yet to be a credible threat to our game of choice, paduk (aka go, wei chi). But perhaps that’s changing, as some researchers have created a hybrid neural network that achieves a 1-dan ranking. This is like the first-degree black belt of go. It’s an interesting read that sheds light on both the dynamics of the game and the limitations of neural networks.