I’ve been using R for a couple of years now. This post is aimed at me a couple of years ago, or at you if you’re just starting to use R and are pressed for time. Here are some things I wish I’d known in early 2009.
- Use a naming convention
- read.csv is a great function, but be careful
- doBy is not just a city in the Middle East
- attach is more trouble than it’s worth
- Many packages are poor
- There are a lot of useful blog posts out there
Use a naming convention. You should probably have a naming convention whatever language you’re using, but you really need one with R: you don’t have to declare variables, and names can be partially matched, so a typo or an abbreviation can silently refer to the wrong thing. Google has one. I think that putting dots in variable names is asking for trouble, but Google’s convention would be fine; the important thing is just to pick one. (And you should probably figure that Google knows more about how to write good code than I do.)
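To make those two features concrete, here is a minimal sketch of how they can bite. The object names (settings, total_count) are made up for illustration.

```r
# Partial matching: `$` on a list accepts any unambiguous prefix of a name.
settings <- list(threshold = 0.5, iterations = 100)
partial <- settings$thresh          # silently matches "threshold"

# `[[` matches exactly by default, so the same abbreviation returns NULL.
exact <- settings[["thresh"]]

# No declarations: a misspelled name on the left of `<-` just creates
# a new variable, and the one you meant to update is left untouched.
total_count <- 10
total_cuont <- total_count + 1      # typo; total_count is still 10
```

A consistent convention won’t stop either of these, but it makes the odd one out much easier to spot when you read the code back.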
read.csv is a really great function, but it has some gotchas. For example, with the default options a column of numbers that contains a stray non-numeric entry can be read in as a factor. Dealing with data is a whole other post, but you can always convert back using as.numeric(as.character(f)) (calling as.numeric on the factor directly gives you the internal level codes, not the values), or go through the documentation carefully (see stringsAsFactors, colClasses), or, perhaps best of all, use Python or some other scripting language to pre-process the data (it’s always a mess).
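Here is a small sketch of that gotcha and the two R-side fixes. The data is invented, and stringsAsFactors = TRUE is set explicitly so the example behaves the same on R 4.0.0 and later, where the default changed to FALSE.

```r
# A numeric column with one stray text entry ("n/a").
csv_text <- "id,weight\n1,70.5\n2,n/a\n3,82.1"

# With strings converted to factors, the whole weight column becomes a factor.
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
weight_class <- class(raw$weight)

# as.numeric() on the factor returns level codes, not the original numbers.
level_codes <- as.numeric(raw$weight)

# Going through character recovers the numbers; "n/a" becomes NA
# (R warns about the coercion, suppressed here to keep the output quiet).
weight_fixed <- suppressWarnings(as.numeric(as.character(raw$weight)))

# Alternatively, tell read.csv what to expect up front.
clean <- read.csv(text = csv_text,
                  na.strings = "n/a",
                  colClasses = c("integer", "numeric"))
```

Declaring colClasses has the side benefit that read.csv fails loudly if a column contains something you didn’t anticipate, rather than quietly changing its type.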
doBy is a great package, covering 95% of what you need in processing data by groups. People swear by plyr too, but try doBy first.
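As a rough sketch of what processing by groups looks like, here is a group mean computed with base R’s aggregate and, if the package happens to be installed, the same summary with doBy’s summaryBy. The sales data frame is made up for illustration, and the requireNamespace guard is only there so the snippet runs either way.

```r
sales <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 5, 15, 40)
)

# Base R: one statistic per group.
base_means <- aggregate(amount ~ region, data = sales, FUN = mean)

# doBy: same formula interface, and several statistics in one call.
if (requireNamespace("doBy", quietly = TRUE)) {
  doby_stats <- doBy::summaryBy(amount ~ region, data = sales,
                                FUN = c(mean, sd))
  print(doby_stats)
}
```

The appeal of summaryBy is mostly convenience: one call gives one row per group with a column per statistic.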
attach – I’ll let Google’s R style guide take this one:
The possibilities for creating errors when using attach are numerous. Avoid it.
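A minimal sketch of two of those possibilities, using an invented patients data frame: attach works on a copy, so assignments go somewhere other than where you expect, and the attached copy goes stale when the data frame changes.

```r
patients <- data.frame(age = c(34, 51, 47), dose = c(1.0, 2.5, 2.0))

attach(patients)

# This does NOT update patients$age; it creates a separate variable
# called `age` that now masks the attached column.
age <- age + 1

# Updating the data frame does not update the attached copy either.
patients$dose <- patients$dose * 2
stale_dose <- dose                  # still the original, un-doubled values

detach(patients)
rm(age)

# with() gives the same short names without touching the search path.
mean_age <- with(patients, mean(age))
```

Using with(), or simply writing patients$age, keeps it obvious which object each name refers to.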
Many packages are poor. My friend John Mount has a whole tutorial (the cranky guide to trying R packages) on this:
The summary is: expect errors, search out errors and don’t start with the built in examples or real data.
Why do I bring this up? Well, it’s not just to criticize package designers who fail to do even minimal QA. If you’re using a language and something doesn’t work, your first instinct is (hopefully) to think that you’ve got something wrong. If you’re using a contributed package, give a bit more weight to the idea that they’ve got something wrong (and don’t rule it out even for something in core R).
And finally, there are a lot of useful resources out there. It’s my blog, so I’m going to point you to my “R in production systems” post, but John Mount and Nina Zumel’s posts on R are an excellent read, particularly Survive R. I’ve also liked Quick-R for SAS / SPSS / Stata users, and you will probably have your own favorites.
What am I missing? Add it in the comments.