Manifold Visualization: Second Example

October 1, 2018
(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In last night’s post, I introduced prVis(), a new visualization tool that we have developed, available in our polyreg package. Recall that prVis() is intended as a simpler alternative to recent visualization tools like t-SNE and UMAP. Here I will present another example.

The dataset is prgeng, included in the package. It consists of wage income, age, gender, and so on, of Silicon Valley programmers and engineers, from the 2000 Census. We first load the data and then choose some of the variables (age, gender, education and occupation):

getPE()
pe1 <- pe[,c(1,2,6:7,12:16)]
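
Numeric column indices are compact but hard to audit at a glance. As a minimal sketch, the same kind of selection can be written by column name. The toy data frame below is a made-up stand-in, not the real prgeng data; the nine selected names are the ones that appear in the pe1 output later in this post, and the column called other is hypothetical.

```r
# Toy stand-in for pe; all values are invented for illustration only.
pe_toy <- data.frame(
  age   = c(32.3, 45.4, 28.1),
  sex   = c(1, 1, 2),
  other = c(10, 20, 30),      # hypothetical column we do not want to keep
  ms    = c(1, 1, 0),
  phd   = c(0, 0, 0),
  occ1  = c(0, 0, 1),
  occ2  = 0, occ3 = 0, occ4 = 0, occ5 = 0   # scalars are recycled to 3 rows
)

# Select by name instead of by position.
keep <- c("age", "sex", "ms", "phd", paste0("occ", 1:5))
pe1_toy <- pe_toy[, keep]
print(names(pe1_toy))
```

Selecting by name keeps working if the column order of the source data frame ever changes.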

So, let’s plot the graph:

The graph consists of streaks, about a dozen of them. What do they represent? To investigate that question, we call another polyreg function:

addRowNums(16,z)

This will write the row numbers of 16 random points from the dataset onto the graph that I just plotted, which now looks like this:

Due to overplotting, the numbers are difficult to read, but they are also printed to the R console:

[1] "highlighted rows:"
[1] 2847
[1] 5016
[1] 5569
[1] 6568
[1] 6915
[1] 8604
[1] 9967
[1] 10113
[1] 10666
[1] 10744
[1] 11383
[1] 11404
[1] 11725
[1] 13335
[1] 14521
[1] 15462

Rows 2847 and 10666 seem to be on the same streak, so they must have something in common. Let’s take a look.

> pe1[2847,]
         age sex ms phd occ1 occ2 occ3 occ4 occ5
2847 32.3253   1  1   0    0    0    0    0    0
> pe1[10666,]
          age sex ms phd occ1 occ2 occ3 occ4 occ5
10666 45.36755  1  1   0    0    0    0    0    0

Aha! Except for age, these two workers are identical in terms of gender (male), education (Master’s) and occupation (category 6, since the dummy variables occ1 through occ5 are all 0). Now those streaks make sense; each one represents a certain combination of the categorical variables.
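
One way to check this reading is to label every row with its combination of categorical values and see which rows share a label. The sketch below does that on a small made-up data frame with the same column names as pe1. The values and the helper name combo_label() are assumptions for illustration; they are not part of polyreg.

```r
# Toy data with the same columns as pe1; all values are invented.
toy <- data.frame(
  age  = c(32.3, 45.4, 51.0, 29.7, 38.2),
  sex  = c(1, 1, 2, 1, 2),
  ms   = c(1, 1, 0, 0, 0),
  phd  = c(0, 0, 0, 1, 0),
  occ1 = c(0, 0, 1, 0, 1),
  occ2 = 0, occ3 = 0, occ4 = 0, occ5 = 0   # scalars are recycled
)

# Hypothetical helper: paste every column except the continuous one
# into a single label per row, e.g. "1-1-0-0-0-0-0-0".
combo_label <- function(d, numeric_col = "age") {
  cats <- d[, setdiff(names(d), numeric_col), drop = FALSE]
  do.call(paste, c(cats, sep = "-"))
}

labels <- combo_label(toy)

# Row numbers grouped by label; rows in one group differ only in age.
groups <- split(seq_len(nrow(toy)), labels)
print(groups)
```

On the real data, a call along the lines of combo_label(pe1) should play the same role, and the number of distinct labels could then be compared with the number of streaks visible in the plot.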

Well, then, let’s see what UMAP does:

plot(umap(pe1))

The result is

The pattern here, if any, is not clear.

So in both last night’s example and tonight’s, prVis() was not only simpler but also much more visually interpretable than UMAP.

In fairness, I must point out:

  • I just used the default values of umap() in these examples. It would be interesting to explore other values. On the other hand, it may be that UMAP simply is not suitable for partially categorical data, as we have in this second example.
  • For most other datasets I’ve tried, prVis() and UMAP give similar results.
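
For readers who want to try non-default UMAP settings, here is a minimal sketch. It assumes umap() comes from the CRAN umap package, which this post does not actually state, and it uses that package's umap.defaults configuration object. The toy data and the particular values of n_neighbors and min_dist are arbitrary choices for illustration.

```r
# Assumption: the CRAN 'umap' package. Toy numeric data stands in for pe1.
set.seed(1)
toy <- data.frame(
  age = runif(60, 25, 60),
  sex = sample(1:2, 60, replace = TRUE),
  ms  = rbinom(60, 1, 0.3)
)

# Candidate settings to try; these values are arbitrary.
grid <- expand.grid(n_neighbors = c(5, 15), min_dist = c(0.1, 0.5))

# Only run the embedding if the package is installed.
if (requireNamespace("umap", quietly = TRUE)) {
  for (i in seq_len(nrow(grid))) {
    cfg <- umap::umap.defaults            # start from the defaults
    cfg$n_neighbors <- grid$n_neighbors[i]
    cfg$min_dist    <- grid$min_dist[i]
    fit <- umap::umap(as.matrix(toy), config = cfg)
    # fit$layout holds the 2-D coordinates of the embedding
    plot(fit$layout, xlab = "UMAP 1", ylab = "UMAP 2",
         main = sprintf("n_neighbors = %g, min_dist = %g",
                        grid$n_neighbors[i], grid$min_dist[i]))
  }
}
```

Each pass through the loop produces one plot, so the four settings can be compared side by side.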

Even so, these two points show the virtues of using prVis(). We are getting equal or better quality while not having to worry about settings for various hyperparameters.
