Things I wish I’d known before I started using R

March 12, 2011

(This article was first published on Erehweb's Blog » r, and kindly contributed to R-bloggers)

I’ve been using R for a couple of years now.  This post is aimed at me a couple of years ago, or at you if you’re just starting to use R and are pressed for time.  Here are some things I wish I’d known in early 2009.

  1. Use a naming convention
  2. read.csv is a great function, but be careful
  3. doBy is not just a city in the Middle East
  4. attach is more trouble than it’s worth
  5. Many packages are poor
  6. There are a lot of useful blog posts out there

Use a naming convention.  You should probably have a naming convention whatever language you’re using, but you really need one with R, thanks to features like not needing to declare variables, and partial matching.  Google has one in its R style guide.  I think that putting dots in variable names is asking for trouble, but Google’s convention would be fine; the important thing is just to pick one.  (And you should probably figure that Google knows more about how to write good code than I do.)
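
To make those two hazards concrete, here is a minimal sketch.  The names (config, max_iterations, total_count) are made up for illustration and don't come from any style guide.

```r
# Hazard 1: `$` on a list partially matches element names.
config <- list(max_iterations = 100)
partial_hit <- config$max       # 100: "max" is completed to "max_iterations"
exact_only <- config[["max"]]   # NULL: [[ ]] requires an exact name by default

# Hazard 2: no declarations, so a typo silently creates a new variable.
total_count <- 10
total_cuont <- total_count + 1  # typo: total_count itself is still 10
```

A consistent convention makes it easier to spot the misspelled name, and using [[ ]] instead of $ avoids the partial match.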

read.csv is a really great function.  But it has some gotchas – e.g. default options may convert numbers to factors.  Dealing with data is a whole other post, but you can always convert back using as.numeric(as.character(f)), or go through the documentation carefully (see stringsAsFactors, colClasses), or perhaps best yet, use Python or some other scripting language to pre-process the data (it’s always a mess).
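
Here is a small sketch of the factor gotcha and two ways around it.  The inline CSV text and the "n/a" marker are invented for the example.  I set stringsAsFactors = TRUE explicitly because that default changed to FALSE in R 4.0.0, so newer versions won't show the problem unless you ask for it.

```r
# Hypothetical data: one stray "n/a" stops the column from being read as numeric.
csv_text <- "id,amount\n1,10\n2,n/a\n3,30"

# stringsAsFactors = TRUE reproduces the pre-4.0.0 default behaviour.
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
is_factor <- is.factor(raw$amount)   # TRUE: amount came in as a factor

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(raw$amount)

# Going through character first recovers the values; "n/a" becomes NA.
values <- suppressWarnings(as.numeric(as.character(raw$amount)))

# Alternative: declare the missing-value marker and column types up front.
typed <- read.csv(text = csv_text, na.strings = "n/a",
                  colClasses = c("integer", "numeric"))
```

Here values and typed$amount both hold 10, NA, 30, while codes holds only the positions of the factor levels.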

doBy is a great package, covering 95% of what you need in processing data by groups.  People swear by plyr too, but try doBy first.
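
As a rough sketch of what "processing data by groups" looks like, here is a grouped mean on the built-in mtcars data.  The choice of data set and columns is mine.  The doBy call is guarded so the snippet still runs if the package isn't installed, with base R's aggregate shown alongside for comparison.

```r
# Mean miles-per-gallon for each cylinder count, in base R.
by_cyl_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# The same summary with doBy, if it is installed.
if (requireNamespace("doBy", quietly = TRUE)) {
  by_cyl_doby <- doBy::summaryBy(mpg ~ cyl, data = mtcars, FUN = mean)
}
```

Both calls take a formula of the form response ~ grouping variable, so moving between them is mostly a matter of taste.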

attach – I’ll let Google’s R style guide take this one:

The possibilities for creating errors when using attach are numerous. Avoid it.
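
One of those errors, as a minimal sketch (the exam data frame is made up): assigning to an attached column name does not change the data frame, it quietly creates a separate variable.

```r
exam <- data.frame(score = c(10, 20, 30))

suppressMessages(attach(exam))
score <- score * 2    # creates a new variable; exam$score is untouched
detach(exam)

unchanged <- exam$score                 # still the original values
safer_mean <- with(exam, mean(score))   # with() looks inside exam first
```

with() (or just writing exam$score) keeps it obvious which score you mean.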

Many packages are poor.  My friend John Mount has a whole tutorial (the cranky guide to trying R packages) on this:

The summary is: expect errors, search out errors and don’t start with the built in examples or real data.

Why do I bring this up?  Well, it’s not just to criticize package designers who fail to do even minimal QA.  If you’re using a language and something doesn’t work, your first instinct is (hopefully) to think that you’ve got something wrong.  If you’re using a contributed package, give a bit more weight to the idea that they’ve got something wrong (and don’t rule it out even for something in core R).
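
One way to act on this, sketched under my own assumptions: before trusting a function on real data, feed it tiny synthetic data where you already know the answer, then poke at an edge case.  Here lm from core R stands in for whatever package function you are trying out.

```r
# Synthetic data with a known answer: y is exactly 2 + 3 * x.
x <- 1:10
y <- 2 + 3 * x
fit <- lm(y ~ x)
estimated <- unname(coef(fit))   # should be very close to c(2, 3)

# Probe an edge case (empty input) and capture any error instead of stopping.
empty <- data.frame(x = numeric(0), y = numeric(0))
edge_result <- tryCatch(lm(y ~ x, data = empty),
                        error = function(e) conditionMessage(e))
```

If the known answer doesn't come back, or the edge case fails in a surprising way, you have learned that cheaply.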

And finally, there are a lot of useful resources out there.  It’s my blog, so I’m going to point you to my “R in production systems” post, but John Mount and Nina Zumel’s posts on R are an excellent read, particularly Survive R.  I’ve also liked Quick-R for SAS / SPSS / Stata users, and you will probably have your own favorites.

What am I missing?  Add it in the comments.

