How do you know if your data has signal?

August 11, 2015
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Nina Zumel
Data Scientist Win-Vector LLC

NewImage Image by Liz Sullivan, Creative Commons. Source: Wikimedia

An all too common approach to modeling in data science is to throw all possible variables at a modeling procedure and "let the algorithm sort it out." This is tempting when you are not sure what are the true causes or predictors of the phenomenon you are interested in, but it presents dangers, too. Very wide data sets are computationally difficult for some modeling procedures; and more importantly, they can lead to overfit models that generalize poorly on new data. In extreme cases, wide data can fool modeling procedures into finding models that look good on training data, even when that data has no signal. We showed some examples of this previously in our "Bad Bayes" blog post.

In this latest "Statistics as it should be" article, we will look at a heuristic to help determine which of your input variables have signal.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)