Last week brought a number of exciting announcements in the big data and machine learning space. They show that there are still plenty of problems to solve in 1) working with and deriving insights from big data, and 2) integrating those insights into business processes.
Probably the biggest (data) headline was that Google open sourced TensorFlow, their graph-based computing framework. Many stories refer to TensorFlow as Google’s AI engine, but it is actually a lot more. Indeed, like Spark and Hadoop, it encompasses a computing paradigm based on a directed, acyclic graph (DAG). DAGs have been around in the world of mathematics since the days of Euler, and have been used in computer science for decades. The past 10-15 years have seen DAGs become popular as a way to model systems, with noteworthy examples being SecDB/Slang from Goldman Sachs and its derivatives (Athena, Quartz, etc.).
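To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not TensorFlow's API: the `graph` dictionary, the node names, and the `evaluate` helper are all made up for illustration. Each node declares the nodes it depends on, and a node is computed only once its inputs are available.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical graph format: node name -> (function, names of input nodes).
Graph = Dict[str, Tuple[Callable[..., float], Tuple[str, ...]]]


def evaluate(graph: Graph, target: str,
             cache: Optional[Dict[str, float]] = None) -> float:
    """Evaluate `target` by first evaluating its dependencies, memoizing results."""
    if cache is None:
        cache = {}
    if target not in cache:
        fn, deps = graph[target]
        # Recursion terminates because the graph has no cycles.
        args = [evaluate(graph, dep, cache) for dep in deps]
        cache[target] = fn(*args)
    return cache[target]


graph: Graph = {
    "a": (lambda: 2.0, ()),
    "b": (lambda: 3.0, ()),
    "sum": (lambda a, b: a + b, ("a", "b")),
    "prod": (lambda a, b: a * b, ("a", "b")),
    "out": (lambda s, p: s * p, ("sum", "prod")),
}

result = evaluate(graph, "out")  # (2 + 3) * (2 * 3)
print(result)
```

Note that `sum` and `prod` depend only on `a` and `b`, not on each other. A scheduler that knows the graph can run such independent nodes in parallel, which is the property these frameworks exploit.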
Two things differentiate TensorFlow. First, it transparently scales across various hardware platforms, from smartphones to GPUs to clusters. Anyone who’s tried to do parallel computing in R knows how significant this seamless scaling can be. Second, TensorFlow has built-in primitives for modeling recurrent neural networks, which are used for Deep Learning. After Spark, TensorFlow drives the final nail into the coffin of Hadoop. I wouldn’t be surprised if in a few years the only thing remaining in the Hadoop ecosystem is HDFS.
A good place to get started with TensorFlow is the basic MNIST handwriting tutorial. Note that TensorFlow has bindings for Java, Python, C/C++. One of Google’s goals in open sourcing TensorFlow is to see more language bindings. One example is this simple R binding via RPython, although integrating with Rcpp is probably preferred. If anyone is interested in collaborating on proper R bindings, do reach out via the comments.
What’s in a name exactly? A tensor is a mathematical object that is commonly said to generalize vectors. For the most part the TensorFlow documentation refers to tensors as multidimensional arrays. Of course, there’s more to the story, and the mathematically inclined will see tensors described as functions, just like matrix operators. The mechanics of tensors are nicely described in Kolecki’s An Introduction To Tensors For Students Of Physics And Engineering, published by NASA, and in this (slightly terse) chapter on tensors from U Miami.
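Both views of a tensor can be sketched with NumPy (used here instead of TensorFlow to keep the example small). The array names and the `bilinear` helper are my own. The first half shows tensors as arrays of increasing rank; the second treats a rank-2 tensor as a function that takes two vectors and returns a number.

```python
import numpy as np

# Array view: the rank is the number of indices needed to address an element.
scalar = np.array(5.0)               # rank 0, shape ()
vector = np.array([1.0, 2.0, 3.0])   # rank 1, shape (3,)
matrix = np.eye(3)                   # rank 2, shape (3, 3)
cube = np.zeros((2, 3, 4))           # rank 3, shape (2, 3, 4)
ranks = [t.ndim for t in (scalar, vector, matrix, cube)]
print(ranks)

# Function view: a rank-2 tensor T acts as a bilinear map
# T(u, v) = sum over i, j of T[i, j] * u[i] * v[j].
T = np.array([[1.0, 2.0],
              [3.0, 4.0]])


def bilinear(T: np.ndarray, u: np.ndarray, v: np.ndarray) -> float:
    """Apply the rank-2 tensor T to the pair of vectors (u, v)."""
    return float(np.einsum("ij,i,j->", T, u, v))


u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
value = bilinear(T, u, v)  # basis vectors pick out the single component T[0, 1]
print(value)
```

Feeding in basis vectors recovers the individual components, which is why the array of numbers and the function carry the same information once a basis is fixed.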
Another notable computing platform is Ufora, founded by Braxton McKee. Braxton’s platform differs from TensorFlow and the others I mentioned in that it doesn’t impose a computing paradigm on you. All the magic is behind the scenes, where the platform acts as a dynamic code optimizer, figuring out how to parallelize operations as they happen.
What made the headlines is that Ufora decided to open source their kit as well. This is really great for everyone, as their technology will likely find its way into all sorts of places. A good place to start is the codebase on GitHub. Do note that you’ll need to roll up your sleeves for this one.
PCA and K-Means
Last week in my class, we discussed ways of visualizing multidimensional data. Part of the assignment was clustering data via k-means. One student suggested using PCA to reduce the dimensions into 3-space so it could be visualized. In Kuhn & Johnson, PCA is cited as a useful data preparation step to remove noise in extra dimensions. This suggests pre-processing with PCA and then applying k-means. Which is right?
It turns out that PCA and k-means are intimately connected. In K-means Clustering via Principal Component Analysis, Ding and He prove that the principal components are actually the continuous solution of the cluster membership indicators of k-means. Whoa, that was a mouthful. To add some color, clustering algorithms are typically discrete: an element is either in one cluster or another, but not both. In this paper, the authors show that if cluster membership is relaxed to be continuous (akin to probabilities), then the k-means solution is given by the principal components!
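A small experiment can illustrate the connection for the two-cluster case, where the relaxed indicator corresponds to the first principal component, so the sign of each point's first PC score should recover the k-means split. This is a sketch on synthetic, well-separated data of my own making, not a reproduction of the paper: the `two_means` helper and its farthest-point initialization are assumptions chosen to keep the run deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two blobs in 5 dimensions, separated along one direction.
direction = np.ones(5) / np.sqrt(5)
blob_a = rng.normal(0.0, 0.5, size=(50, 5)) - 3.0 * direction
blob_b = rng.normal(0.0, 0.5, size=(50, 5)) + 3.0 * direction
X = np.vstack([blob_a, blob_b])

# PCA via SVD of the centered data; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]
pca_labels = (pc1_scores > 0).astype(int)  # split on the sign of the PC1 score


def two_means(X: np.ndarray, n_iter: int = 50) -> np.ndarray:
    """Lloyd's algorithm for k=2 with a simple farthest-point initialization."""
    c0 = X[0]
    c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
    centroids = np.stack([c0, c1])
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.stack([X[labels == k].mean(axis=0) for k in range(2)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels


km_labels = two_means(X)

# Cluster ids are arbitrary, so compare up to swapping the two labels.
match = np.mean(pca_labels == km_labels)
agreement = max(match, 1.0 - match)
print(f"PC1 sign and 2-means agree on {agreement:.0%} of points")
```

On messier data the two partitions can differ, since the discrete k-means solution only approximates the relaxed one.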
Back to the original question: in practice both approaches are valid, and it really boils down to what you want to accomplish. If your goal is to remove noise, pre-processing with PCA is appropriate. If the dataset becomes easily visualized, that’s a nice side effect. On the other hand, if the original space is already optimal, then there’s no harm in clustering and reducing dimensions via PCA afterward for visualization purposes. If you take this approach, I think it’s wise to communicate to your audience that the visualization is an approximation of the relationships.
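One way to tell your audience how approximate the picture is: report the share of variance the plotted components retain. The sketch below uses a made-up dataset (200 observations, 10 features with decreasing scale) and projects it into 3-space with PCA computed via SVD. The variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 10 features whose spread shrinks from 3.0 down to 0.3.
scales = np.linspace(3.0, 0.3, 10)
X = rng.normal(size=(200, 10)) * scales

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each squared singular value is proportional to the variance along that component.
explained = singular_values**2 / np.sum(singular_values**2)

projected = Xc @ Vt[:3].T      # coordinates in 3-space, ready for plotting
retained = explained[:3].sum()
print(f"3 components retain {retained:.1%} of the variance")
```

If the retained share is low, points that look close in the 3-D plot may be far apart in the original space, which is worth stating next to the figure.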
What are your thoughts on using PCA for visualization? Add tips and ideas in the comments.
It’s true, the jet pack is finally a reality. Forty years after the first iteration of the RocketBelt, its inventors have succeeded in improving the flight time from 30 seconds to 10 minutes, with a top speed of around 100 km/h.