Don’t stare at your correlations in search of variable clusters when you can rearrange() them:

library(corrr)
mtcars %>% correlate() %>% rearrange() %>% fashion()
#> rowname am gear drat wt disp mpg cyl vs hp carb qsec
#> 1 am ...

The Problem

When clustering data using principal component analysis, it is often of interest to visually inspect how well the data points separate in 2-D space based on principal component scores. While this is fairly straightforward to visualize with a scatterplot, the plot can become cluttered quickly with annotations as shown in the following figure:
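As a minimal base-R sketch of the kind of plot the post describes (my own illustration, not the post's code; `iris` stands in for the post's data), the first two principal component scores from prcomp() can be plotted to inspect group separation, and the commented-out text() call shows how per-point labels cause the clutter mentioned above:

```r
# PCA on the four numeric iris measurements, scaled to unit variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Keep the first two principal component scores for plotting
scores <- as.data.frame(pca$x[, 1:2])
scores$Species <- iris$Species

plot(scores$PC1, scores$PC2,
     col = as.integer(scores$Species), pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Group separation on the first two PCs")

# Annotating every point is what quickly clutters the plot:
# text(scores$PC1, scores$PC2, labels = rownames(iris), pos = 3, cex = 0.6)
```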

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.26 has been released on CRAN. ‘vtreat’ is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. (from the package documentation) ‘vtreat’ is an R package that incorporates a number of transforms and simulated out of … Continue reading...

This post was first published on my Linkedin page and posted here as a contributed post. In the last few months, I have had several people contact me about their enthusiasm for venturing into the world of data science and using Machine Learning (ML) techniques to probe statistical regularities and build impeccable data-driven products. However, I’ve observed that some actually lack...

Hackathons are not alike

Recently, a number of this blog’s authors were at a data hackathon, the strangest one we’ve been to so far. It was more of a startup pitch gathering, complete with pitch training and whatnot. I was repeatedly asked by other participants “so, how do you want to monetise your idea?”. My answer was simple: I...

What is credibility?

For over one hundred years actuaries have been wrestling with the idea of “credibility”. This is the process whereby one may make a quantitative assessment of the predictive power of sample data. Where necessary, the researcher augments the sample with some exogenous information - usually more data - to arrive at a final conclusion. In...
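The classical (Bühlmann) form of this idea blends the sample mean with a prior, book-wide mean using a credibility factor Z = n / (n + k). A short sketch, with the credibility constant k and all figures invented purely for illustration:

```r
# Bühlmann-style credibility weighting: blend the observed sample mean
# with a prior mean, weighted by the credibility factor Z = n / (n + k).
credibility_estimate <- function(sample_mean, prior_mean, n, k) {
  Z <- n / (n + k)   # credibility factor, between 0 and 1
  Z * sample_mean + (1 - Z) * prior_mean
}

# 400 observed claims averaging 1200, a book-wide prior mean of 1000,
# and a (made-up) credibility constant k = 100:
credibility_estimate(1200, 1000, n = 400, k = 100)
# Z = 0.8, so the estimate is 0.8 * 1200 + 0.2 * 1000 = 1160
```

As n grows, Z approaches 1 and the estimate leans entirely on the sample; with little data, it shrinks toward the prior.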

Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results. From feedback I am not sure everybody noticed that in … Continue reading...
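A minimal base-R sketch of the idea as I read it (my own simplification, not Zumel's code): each predictor is centered and multiplied by the slope of a single-variable regression of y on that predictor, so every scaled column is expressed in units of y.

```r
# y-aware scaling: rescale each column of X by the slope of lm(y ~ x),
# so a unit change in the scaled predictor corresponds to a unit change
# in the predicted outcome.
y_aware_scale <- function(X, y) {
  vapply(as.data.frame(X), function(x) {
    b <- coef(lm(y ~ x))[2]   # per-variable slope with respect to y
    b * (x - mean(x))         # center, then scale into y units
  }, numeric(length(y)))
}

scaled <- y_aware_scale(mtcars[, c("wt", "hp", "disp")], mtcars$mpg)
colMeans(scaled)   # each scaled column is centered at zero
```

A consequence worth checking: regressing y on any one scaled column gives a slope of exactly 1, which is what makes the columns comparable in magnitude.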

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, i.e. multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with collinearity. Feature extraction approaches
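For illustration (not the post's code), the VIF for each predictor can be computed in base R from the R² of regressing that predictor on the remaining ones; values above 5 or 10 are common rules of thumb for problematic collinearity:

```r
# VIF for each column of X: regress it on all other columns and
# compute 1 / (1 - R^2). Large VIFs flag multicollinearity.
vif <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# disp, cyl, and wt are strongly correlated in mtcars, so their VIFs
# come out well above those of weakly related predictors
vif(mtcars[, c("disp", "hp", "wt", "cyl")])
```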

Short form: Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time. Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression). Part 2: the introduction of y-aware scaling to direct the … Continue reading...

Guest post by Khushbu Shah The most common question asked by prospective data scientists is – “What is the best programming language for Machine Learning?” The answer to this question always results in a debate over whether to choose R, Python or MATLAB for Machine Learning. Nobody can, in reality, answer the question as to whether Python or R is best...
