Giving R the strengths of Stata

[This article was first published on Robert Grant's stats blog » R, and kindly contributed to R-bloggers.]


This is not a partisan post that extols the virtues of one software package over another. I love Stata and R and use them both all the time. They each have strengths and weaknesses and if I could only take one to the desert island, I’d find it hard to choose. For me, the greatest unique selling point in Stata is the flexibility of the macros. If I write something like this:

local now=1988

summarize gdp if year==`now'

I will get a summary of GDP in 1988, the same as if I had typed

summarize gdp if year==1988

And I could do the same thing in R (assuming I have installed and loaded the package pastecs):

now<-1988

stat.desc(gdp[year==now])

All very nice. But they are doing this in two different ways. R holds all sorts of objects in memory (data frames, vectors, scalars and so on) and accesses any of their contents when you name them. Stata can only have one data file open at a time and stores other stuff in temporary memory as matrices, scalars or these macros, which are set up with the commands local or global. When you give it a name like now, it will by default look for a variable in the data file with that name. So you place the macro name between the backward quote (`) and the apostrophe (') to tell Stata to fetch the contents of the macro, stick them into the command and then interpret the whole command together. That is a very flexible way of working because you can do stuff that most programming languages forbid, like shoving your macro straight into variable names:

summarize gdp`now'

// the same as summarize gdp1988

or into text:

display as result "Summary of world GDP in the year `now':"

or indeed into other macros’ names in a nested way:

local now=1988
local chunk "ow"
summarize gdp if year==`n`chunk''

or even into commands!

local dothis "summa"
`dothis'rize gdp if year==`now'

I believe that is also how Python works, which no doubt helps account for its popularity in heavy number crunching (so I hear – I’ve never gone near it).

Now, the difference between these approaches is not immediately obvious. Because R does not give different classes of object (scalars, matrices, estimation outputs) different kinds of names, you can do whatever you like with them, which is helpful. What you cannot do is jam their names into the middle of a command and expect R to replace the name with the contents. That is the strength of Stata’s two-stage interpretation. How can we give that strength to R?

A popular question among new useRs is “how do I manipulate the left-hand side of an assignment?”

Here’s the typical scenario: you have a series of analyses and want the results to be saved with names like result1, result2 and so on. Nine times out of ten, R will easily produce what you want as a list or array, but sometimes this collection of distinct objects really is what you need. The problem is, you can’t do things like:

mydata <- matrix(1:12,nrow=3)

paste("columntotal", 1:4, sep="") <- apply(mydata, 2, sum)

And hope it will produce the same as:

columntotal1 <- sum(mydata[, 1])
columntotal2 <- sum(mydata[, 2])
columntotal3 <- sum(mydata[, 3])
columntotal4 <- sum(mydata[, 4])

Instead you need assign()! It’s one of a series of handy R functions that can be crudely described as doing something basic in a flexible way, something which you would normally do with a simple operator such as <- but with more options.

for (i in 1:4) {

assign(paste("columntotal", i, sep=""), sum(mydata[,i]))

}

will do exactly what you wanted above.
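For comparison, here is a minimal sketch that puts the assign() loop next to the list-based alternative mentioned earlier. The object name totals is made up for illustration, and get() is simply the mirror image of assign(): it fetches an object when given its name as a string.

```r
mydata <- matrix(1:12, nrow = 3)

# Alternative 1: a single named list, one element per column total
totals <- as.list(colSums(mydata))
names(totals) <- paste("columntotal", 1:4, sep = "")

# Alternative 2: four separate objects created with assign()
for (i in 1:4) {
  assign(paste("columntotal", i, sep = ""), sum(mydata[, i]))
}

# Both routes hold the same numbers
get("columntotal2")        # fetch a separate object by its name
totals[["columntotal2"]]   # fetch the matching list element
```

The list keeps everything in one place, which is usually easier to loop over; the separate objects are there if you really need them.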

If you need to fetch a whole bunch of objects by name, mget() is a function that takes a vector of strings and searches for objects with those names. The contents of the objects get returned in a single list. Now you can easily work on all the disparate objects by lapply() and the like. Now, before you mget too carried away with all this fun, take time to read this excellent post, which details the way that R goes looking for objects. It could save you a lot of headaches.
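A small sketch of that mget() pattern, assuming three objects called result1 to result3 (names and values invented here purely for illustration):

```r
# Three separate objects that share a naming pattern
result1 <- c(2, 4, 6)
result2 <- c(1, 3, 5, 7)
result3 <- c(10, 20)

# Build the names as strings, then fetch all three objects in one list
wanted <- paste("result", 1:3, sep = "")
results <- mget(wanted)

# The list is named after the objects, so they can be processed together
result_means <- sapply(results, mean)
result_means
```

By default mget() looks in the environment it was called from, so inside a function you may need its envir argument to point it at the right place.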

All right, now we know how to mess around with object names. What about functions? do.call() is your friend here. The first argument do.call wants is a function, or a string which is the name of a function. The second argument is a list containing your function’s arguments, and it passes them along. You could do crazy stuff like this:

omgitsafunction <- paste("s","um",sep="")

do.call(omgitsafunction,list(mydata))

and it would be the same as:

sum(mydata)

…which raises the possibility of easily making bespoke tables of statistics by just feeding a vector of function names into do.call:

loadsafunctions <- c("sum","mean","sd")

for (i in seq_along(loadsafunctions)) {

print(do.call(loadsafunctions[i],list(mydata)))

}

or more succinctly:

listafunctions <- as.list(loadsafunctions)

lapply(listafunctions,FUN=do.call,list(mydata))
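If you want the results labelled, one possible variation is sapply(), which keeps the function names as the names of the output. This is only a sketch; the object name stats is invented for the example, and it relies on current versions of R treating a matrix passed to sd() as one long vector.

```r
mydata <- matrix(1:12, nrow = 3)
loadsafunctions <- c("sum", "mean", "sd")

# Call each function by name on the whole matrix;
# sapply() uses the strings as labels for the results
stats <- sapply(loadsafunctions, function(f) do.call(f, list(mydata)))
stats
```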

Another neat feature of Stata is that you can prefix any line of code with capture: and it will absorb error messages and let you proceed. In R you can do this with try(). This is never going to work:

geewhiz<-list(3,10,8,"abc",2,"xyz")

lapply(geewhiz,log)

But maybe you want it to run, skip the strings and give you the numeric results (of course, you could do this by indexing with is.numeric(), but I just want to illustrate a general point, and try() is even more flexible):

lapply(geewhiz,function(x) try(log(x),silent=TRUE))

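will work just fine: the numbers give you their logs and the strings give you an error object in their place, without stopping everything. Note the one-line function declaration inside lapply(), which is there because lapply wants a function, not the object returned by try(log()).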
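To actually skip the strings afterwards, you can pick out the failures by their class. A minimal sketch, with attempts and failed as made-up names: when try() catches an error it returns an object of class "try-error", which inherits() can detect.

```r
geewhiz <- list(3, 10, 8, "abc", 2, "xyz")

# silent = TRUE suppresses the printed error messages;
# each failure comes back as an object of class "try-error"
attempts <- lapply(geewhiz, function(x) try(log(x), silent = TRUE))

# Flag the failures, then keep only the successful results
failed <- sapply(attempts, inherits, what = "try-error")
unlist(attempts[!failed])
```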

attach() and (more likely) with() are useful functions if you need to work repetitively on a batch of different data frames. After all, what’s the first thing newbies notice is cool about R? You can open more than one data file at a time. So why not capitalise on that? That takes you into territory that Stata can only reach by repeatedly reading and writing from the hard drive (which will slow you down).
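As a rough illustration, here are two small invented data frames (europe and asia, with made-up numbers) held in memory together, with the same with() expression run on each:

```r
# Two made-up data frames open at the same time
europe <- data.frame(year = c(1987, 1988, 1989), gdp = c(10, 12, 13))
asia   <- data.frame(year = c(1987, 1988, 1989), gdp = c(20, 25, 27))

now <- 1988

# with() evaluates the expression using each data frame's own columns;
# now is not a column, so it is found in the workspace instead
regions <- list(europe = europe, asia = asia)
means <- sapply(regions, function(d) with(d, mean(gdp[year <= now])))
means
```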

subset() is another good one. Really it just does the same as the indexing operator [, but because it’s a function, you can link it up with mget() and/or do.call() and get it to work its way through all sorts of subsets of different objects under different conditions, running different analyses on them. Nice!
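One way this might look, again using two invented data frames: fetch them by name with mget(), take the same subset of each, then run an analysis on every subset. The names frames, recent and recent_means are hypothetical.

```r
# Two made-up data frames
europe <- data.frame(year = c(1987, 1988, 1989), gdp = c(10, 12, 13))
asia   <- data.frame(year = c(1987, 1988, 1989), gdp = c(20, 25, 27))

# Fetch the data frames by name, then take the same subset of each
frames <- mget(c("europe", "asia"))
recent <- lapply(frames, function(d) subset(d, year >= 1988, select = gdp))

# Run the same analysis on every subset
recent_means <- sapply(recent, function(d) mean(d$gdp))
recent_means
```

Wrapping subset() in a small function keeps its condition inside an ordinary call, which avoids surprises about where year is looked up.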

The final function I want to highlight is substitute(). This allows you to play around with text which needs to be evaluated by another function as if it were typed in by the user, and yet still have it work:

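mydata<-c(1,2,3)
xx<-"my"
# build the unevaluated call paste("my","data",sep="")
substitute(paste(xx,"data",sep=""),list(xx=xx))
# evaluate that call to get the string "mydata"
eval(substitute(paste(xx,"data",sep=""),list(xx=xx)))
# fetch the object with that name (returned inside a list)
mget(eval(substitute(paste(xx,"data",sep=""),list(xx=xx))))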

Pretty cool huh? I hope this opens up some new ideas for the more advanced Stata user who wants to use R functionality. On the other hand, if you use R all the time, perhaps this will encourage you to take Stata seriously too.

