Quickly Create Dummy Variables in a Data Frame

January 2, 2014

(This article was first published on randyzwitch.com » R, and kindly contributed to R-bloggers)

A question on Quora asked how to work around the randomForest package in R refusing to handle a categorical variable with more than 32 levels. Since I’ve seen the same question on the Kaggle forums, StackOverflow and elsewhere, here’s the answer: code your own dummy variables instead of relying on Factors!

Code snippet
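
A minimal sketch of the idea, not necessarily the snippet originally embedded here. It assumes a hypothetical data frame `example` with a character column `strcol`, and the `dummy_` prefix is purely illustrative. For each distinct value in the column, it adds one 1/0 column:

```r
# Hypothetical example data: one character column with a handful of levels
example <- data.frame(
  strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)

# For every distinct value, add a 1/0 indicator column named dummy_<value>
for (level in unique(example$strcol)) {
  example[[paste("dummy", level, sep = "_")]] <- as.integer(example$strcol == level)
}

# Inspect the result: the original column plus one indicator per distinct value
str(example)
```

Storing the indicators as integers rather than doubles is a choice made here to halve the memory each new column needs.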

As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:

  1. The problem you are trying to solve
  2. How much RAM you have available

While it may make sense to generate dummy variables for Customer State (~50 levels for the United States), using the code above on City Name would likely either exhaust your RAM or produce too many levels to be useful.
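
One rough way to gauge this before creating any columns is to count the distinct values and estimate the memory the new columns would need. The sketch below uses a hypothetical helper name, `estimate_dummy_cost`, and made-up `state` and `city` vectors. It relies on R storing integer vectors at 4 bytes per element and numeric (double) vectors at 8, and it ignores per-object overhead, so treat the result as a lower bound:

```r
# Hypothetical helper: estimate the cost of dummy-coding a column
# x: the categorical column (character or factor)
# bytes_per_value: 4 for integer 1/0 columns, 8 for numeric (double) ones
estimate_dummy_cost <- function(x, bytes_per_value = 4) {
  n_levels <- length(unique(x))
  n_rows <- length(x)
  list(
    n_levels = n_levels,
    approx_mb = n_levels * n_rows * bytes_per_value / 1024^2
  )
}

# Made-up data: 1 million rows cycling through 50 "states" and 20,000 "cities"
state <- rep_len(sprintf("state_%02d", 1:50), 1e6)
city  <- rep_len(sprintf("city_%05d", 1:20000), 1e6)

state_cost <- estimate_dummy_cost(state)  # 50 columns: hundreds of MB
city_cost  <- estimate_dummy_cost(city)   # 20,000 columns: tens of GB
print(state_cost)
print(city_cost)
```

Under these assumptions the state columns fit comfortably in memory on most machines, while the city columns would not.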

Of course, with a qualitative statement such as “too many levels to be useful”, often the only way to know for sure is to try it! Save your work before running this code in case you run out of RAM. Or use someone else’s computer for testing ;)

Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:

