How I learned to stop worrying and really love lists

[This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers].

One of the first weird things to get used to in R is unlearning some of the things that you think you know. As often happens, this reminds me of a quote I once read about Zen, which went something like this (I’m paraphrasing): “When I knew nothing of Zen, mountains were mountains, rivers were rivers and the sky was the sky. When I knew a little of Zen, the mountains were not mountains, the rivers were not rivers and the sky was not the sky. When I fully understood Zen, mountains were mountains, rivers were rivers and the sky was the sky.” When I knew a little bit of R, a list was not a list. Actually, I wasn’t sure what to make of it. Is it a structure? Is it a linked list? Is it an object array?

I’m slowly reaching the point where I begin to understand that a list is a list. I’m not fully Zen on lists yet, but I do know this: I think they might be awesome. For me, the first circle of enlightenment for R came when I realized how much more powerful and flexible it is than any of the other tools I’ve used (yes, even Matlab). The second circle of enlightenment comes with an appreciation of the apply functions, and that means understanding lists. Here’s a very simple construct that I’ve started applying (ha!) often:

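df = GetTriangleData()                        # one data frame holding every company
lCompanyDFs = split(df, df$GRCODE)            # list of data frames, one per GRCODE
lProjections = lapply(lCompanyDFs, SomeFunction)  # apply the function to each element
dfResults = do.call("rbind", lProjections)    # bind the results into one data frame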

Here’s how the process works in a nutshell:

1) Get a pile of data that contains at least one categorical variable. In the NFL data set, that’s a team; in the NAIC insurance data set (to be discussed in a forthcoming post), that’s an insurance company.
2) Split the data on that variable. This returns a list whose elements are all data frames (or at least in this case it does).
3) Apply some function to each element of the list with lapply.
4) Stitch the results back together with a call to rbind, made through do.call.

Lather. Rinse. Repeat.
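To make the four steps concrete, here is a minimal sketch that runs end to end. GetTriangleData and SomeFunction are not shown in this post, so the toy data frame and the SummariseCompany helper below are invented stand-ins. Only the GRCODE column name comes from the code above; the other column names are assumptions.

```r
# Toy stand-in for GetTriangleData(); every column except GRCODE is invented.
df <- data.frame(
  GRCODE = c(101, 101, 101, 202, 202, 202),
  AccidentYear = c(2001, 2002, 2003, 2001, 2002, 2003),
  PaidLoss = c(100, 150, 175, 40, 60, 90)
)

# Hypothetical stand-in for SomeFunction: returns one summary row per company.
SummariseCompany <- function(dfCompany) {
  data.frame(
    GRCODE = dfCompany$GRCODE[1],
    TotalPaid = sum(dfCompany$PaidLoss),
    LatestYear = max(dfCompany$AccidentYear)
  )
}

# Step 2: a named list of data frames, one element per GRCODE.
lCompanyDFs <- split(df, df$GRCODE)

# Step 3: a list of one-row data frames, one per company.
lProjections <- lapply(lCompanyDFs, SummariseCompany)

# Step 4: a single data frame again, one row per company.
dfResults <- do.call("rbind", lProjections)

print(dfResults)
```

Any function that takes one company's data frame and returns a data frame with consistent columns can be dropped in where SummariseCompany sits, since rbind needs the pieces to line up.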

Once you’re in the second circle of enlightenment, you’ll never again write a “for” loop. This has been a lifesaver for me when I’m trying to crunch through a giant set of data. I can pull data from our warehouse and carry out routine actions for each of our 500 accounts, for each of our lines of business, for each accident/policy year, etc. I split the data along a different axis and the rest of the analysis pretty much takes care of itself.
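As a rough illustration of splitting along a different axis, the sketch below wraps the split/lapply/rbind pattern in a helper that takes the splitting column as an argument. The data, the column names (Account, Line, Loss) and the TotalBy helper are all made up for this example; they are not from our warehouse or from any package.

```r
# Invented example data; the column names are assumptions for illustration.
dfLosses <- data.frame(
  Account = c("A", "A", "B", "B", "A", "B"),
  Line = c("Auto", "Property", "Auto", "Property", "Auto", "Auto"),
  Loss = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical helper: split on any column, total the losses in each piece,
# then stitch the per-group totals back into one data frame.
TotalBy <- function(df, column) {
  lPieces <- split(df, df[[column]])
  lTotals <- lapply(lPieces, function(piece) {
    data.frame(Group = piece[[column]][1], TotalLoss = sum(piece$Loss))
  })
  do.call("rbind", lTotals)
}

# Same analysis, two different axes: only the splitting column changes.
dfByAccount <- TotalBy(dfLosses, "Account")
dfByLine <- TotalBy(dfLosses, "Line")

print(dfByAccount)
print(dfByLine)
```

The per-group function stays the same in both calls, which is why changing the axis costs almost nothing.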

