My last post used random forest proximity to visualize a set of diamond shapes (the random forest was trained to distinguish diamonds from non-diamonds).
This time I looked at the digits data set that Kaggle is using as the basis of a competition for “getting started”. The random forest is trained to classify the digits, and this is an embedding of 1000 digits into 2 dimensions preserving proximities from the random forest as closely as possible:
Here’s the same but just for the 7’s:
The random forest has done a reasonable job putting different types of 7’s in different areas, with the most “canonical” 7’s toward the middle.
You can see all of the other digits at http://www.learnfromdata.com/media/blog/digits/.
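For anyone unfamiliar with the term: random forest proximity is usually defined as the fraction of trees in which two samples end up in the same leaf. Here is a minimal Python sketch of that definition, not necessarily the exact code behind these plots. The function name `forest_proximity` and the toy `leaves` table are made up for illustration.

```python
from itertools import combinations


def forest_proximity(leaves):
    """Return the proximity matrix for a table of leaf assignments.

    leaves[i][t] is the id of the leaf that sample i reaches in tree t.
    Proximity of two samples = fraction of trees where they share a leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prox[i][i] = 1.0  # a sample always shares a leaf with itself
    for i, j in combinations(range(n), 2):
        shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf ids for 4 samples in a 3-tree forest (illustration only).
leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 5, 3],
    [2, 6, 7],
]
prox = forest_proximity(leaves)

# Turn proximities into dissimilarities for a 2-D embedding step.
dist = [[1.0 - p for p in row] for row in prox]
```

With a real forest the leaf ids would come from the trained model (scikit-learn's `RandomForestClassifier.apply` returns them, for example), and `dist` would then be handed to a multidimensional scaling routine to get the 2-D coordinates.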
Note that this random forest is different from the one in my last post: here it’s built to classify the digits, not to separate digits from non-digits. I wonder what the results would look like for a random forest trained to distinguish 7’s from non-7’s?
Code is on GitHub.