More on freqparcoord

April 6, 2014
By

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

In a previous post, I introduced my new package with Yingkang Xie, freqparcoord. Here I’ll illustrate some of the other uses to which the package can be applied.

The freqparcoord package visualizes multivariate data by plotting the most frequent cases in the data, as defined by multivariate density estimation. The example in the previous post illustrated typical height, weight and age differences by position among baseball players.

But the package allows plotting the least frequent cases–perfect for outlier hunting. Let’s apply this to the baseball data, say finding 3 outliers for each position:

> library(freqparcoord)
> data(mlb)
> freqparcoord(mlb,-3,4:6,7)

Here columns 4:6 are height, weight and age, while column 7 is position; under default settings, the data will be faceted vertically by position. Here is what is displayed:

All variables are centered and scaled. To get a better idea as to what the extreme values are, we can set the keepidxs argument, and then show the original data corresponding to the displayed lines:

> p <- freqparcoord(mlb,-3,4:6,7,keepidxs=4)
> p
> p$xdisp[,c(1,4:7)]
               Name Height Weight   Age PosCategory
237  Ivan_Rodriguez     69    218 35.25     Catcher
994      So_Taguchi     70    163 37.66  Outfielder
674    Julio_Franco     73    188 48.52   Infielder
964     Barry_Bonds     74    228 42.60  Outfielder
36        Toby_Hall     75    240 31.36     Catcher
35  A.J._Pierzynski     75    245 30.17     Catcher
891  Mike_Restovich     76    257 28.16  Outfielder
547      Tony_Clark     79    245 34.71   Infielder
155   C.C._Sabathia     79    290 26.61     Pitcher
275   Richie_Sexson     80    237 32.17   Infielder
559   Randy_Johnson     82    231 43.47     Pitcher
910       Jon_Rauch     83    260 28.42     Pitcher

Sexson was flagged due to his height and weight, while Franco emerged because he is 48 years old!

Our package can also be used for cluster hunting. Here we again look for the most frequent cases, but now locally most frequent, meaning that their estimated density values are local maxima.

Let’s try it on simulated data, with known clusters. We’ll generate from a mixture of 3 bivariate normals, with means at (0,0), (1,2) and (3,3). (The package includes a function rmixmvnorm() for such experiments.) Here are the results, both graphical and text:

 
> cv <- 0.5*diag(2)
> x<- rmixmvnorm(10000,2,3,list(c(0,0),c(1,2),c(3,3)),
+ list(cv,cv,cv))
> p <- freqparcoord(x,m=1,method="locmax",
+ keepidxs=1,k=50,klm=600)
[,1] [,2]
[1,] -0.3556997 0.2423760
[2,] -0.0993228 -0.2510209
[3,] 0.2507786 0.1883437
[4,] 1.1386850 2.2073467
[5,] 2.7581227 2.8935957

blog040814bb

Of course, the results depend on the arguments, with default values being used here. The reader may wish to experiment with other values of klm. But the data clearly fall into 3 clusters, the correct number, centered near (0,0), (1,2) and (3,3), the correct locations. Note that we did NOT specify the number of clusters.

And there’s more! Watch this space for future posts.

The package is on CRAN but the latest version is here.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)