Maximal Information Coefficient (Part II)

Posted on September 17, 2014 by Marc in the box in R bloggers

[This article was first published on me nugget, and kindly contributed to R-bloggers.]


A while back, I wrote a post announcing a paper that described a new statistic called the “Maximal Information Coefficient” (MIC), which measures the strength of association between paired variables whether the relationship is linear or nonlinear. This turned out to be quite a popular post, and it prompted a lively discussion about the merits of the work and the difficulties in using the software provided by the authors. Regarding the latter, I also had difficulties running the software in R and thus did not include an example. Checking back on this topic, I was pleased to see that an R package has since been developed: minerva: Maximal Information-Based Nonparametric Exploration R package for Variable Analysis (Albanese et al. 2013). Further documentation of the package can be found here: http://minepy.sourceforge.net/
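As a quick illustration of the "linear or nonlinear" point, here is a minimal sketch on simulated data (not from the paper). The noisy parabola, the seed, and the variable names are my own assumptions, chosen only because a symmetric U-shape is a textbook case where Pearson's correlation is close to zero even though y clearly depends on x. It uses mine() from the minerva package with its default settings:

```r
library(minerva)

# Simulated data, for illustration only: a noisy parabola on a symmetric range
set.seed(1)
x <- seq(-1, 1, length.out = 200)
y <- x^2 + rnorm(200, sd = 0.05)

# Pearson's r only captures the linear trend, which is essentially absent here
pearson <- cor(x, y)

# mine() returns a list of MINE statistics; MIC is one of its elements
mic <- mine(x, y)$MIC

print(c(Pearson = pearson, MIC = mic))
```

On data like this, MIC should come out much larger than the absolute Pearson correlation; the exact values depend on the noise level and on the alpha setting passed to mine().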

I tried out the package on the baseball data set used in the original paper by Reshef et al. (2011), where a suite of variables is correlated against a baseball player’s salary. The authors state in their paper:

“In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of ρ, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. By contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary.”

Here is a summary from the results computed with the function mine() of the minerva package (top 10 ranking MIC coefficients), which reproduces the same results as are shown in the Supplementary table S12 of the original paper:

For a visual representation of these results, the top figure plots MIC vs. the absolute Pearson correlation and MIC Rank vs. Pearson Rank. Thanks to minerva author and maintainer M. Filosi for his help in reproducing the example.

References:

Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., & Furlanello, C. (2013). minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3), 407-408.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., … & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518-1524.

Code to reproduce the example:

library(minerva)

# Load data
dat <- read.csv("MLB2008.csv") # from http://www.exploredata.net/Downloads/Baseball-Data-Set
x <- dat[, -c(1:3)] # drop the first three columns
excl <- which(diag(var(x)) < 1e-5) # variables with (near-)zero variance
# Guard the exclusion: x[, -integer(0)] would silently drop every column
if (length(excl) > 0) x <- x[, -excl, drop=FALSE]
y <- dat$SALARY

# Analysis
M <- mine(x, y=y, alpha=0.7) # MINE statistics of each column of x against salary
P <- cor(x, y=y) # Pearson correlation of each column of x against salary
res <- data.frame(MIC = c(M$MIC))
rownames(res) <- rownames(M$MIC)
res$MIC_Rank <- nrow(res) - rank(res$MIC, ties.method="first") + 1 # rank 1 = highest MIC
res$Pearson <- c(P) # flatten the one-column matrix to a plain vector
res$Pearson_Rank <- nrow(res) - rank(abs(res$Pearson), ties.method="first") + 1 # rank 1 = largest |r|
res <- res[order(res$MIC_Rank),]
head(res, n=10)

# Plot
png("MIC_vs_Pearson.png", width=7.5, height=3.5, res=400, units="in", type="cairo")
op <- par(mfrow=c(1,2), mar=c(4,4,1,1))
plot(MIC ~ abs(Pearson), res, pch=21, col=4, bg=5)
plot(MIC_Rank ~ Pearson_Rank, res, pch=21, col=4, bg=5)
par(op)
dev.off()
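
A note on the rank columns in the code above: rank() ranks in ascending order, so subtracting from the number of rows and adding 1 flips it so that rank 1 goes to the largest value. Here is a small sketch of that idiom using made-up scores (the names and values are hypothetical, for illustration only):

```r
# Hypothetical scores, for illustration only; "b" and "d" are tied
scores <- c(a = 0.2, b = 0.9, c = 0.5, d = 0.9)

# Same idiom as in the analysis code: flip the ascending rank
desc_rank <- length(scores) - rank(scores, ties.method = "first") + 1
print(desc_rank)
# a b c d
# 4 2 3 1

# A common alternative: rank the negated values directly
alt_rank <- rank(-scores, ties.method = "first")
print(alt_rank)
# a b c d
# 4 1 3 2
```

The two versions agree except on ties: with ties.method="first", flipping the ascending rank gives the better rank to the later of two tied entries, while ranking the negated values favors the earlier one. For the baseball example this only affects the ordering among variables with identical coefficients.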
