R: The Dummies Package
R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnation. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day, a new package or a new comment on R-Help gives me pause for thought.
As much as I like R, on occasion I will find myself lost in some dark corner. Sometimes, I find light. Sometimes I am gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely someone else must have encountered this and provided a package. I went to the Internet and sure enough the R-wiki was there to save me. And looking even harder, I found some who had trodden this path before me in the R-Help archives. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:
x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)
Two problems. The first problem is that without an external source (Google), I would never have stumbled upon what I wanted. (Thanks, Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”
The second problem is the arcane syntax, wtf <- ~ xf - 1. I get it now, but it took me some time to figure out what was going on. I get it, but why not just dummy(var)? This is what I want to do.
The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t say:
model.matrix( ~ xf1 + xf2 + xf3 - 1)
It turns out you can only fully expand one variable at a time: with several factors in the formula, only the first keeps all of its levels. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forget how to do it. So with a couple of spare hours, I decided that the next guy wouldn’t have to think about it. He could just use my dummies package.
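The sapply-style workaround mentioned above might look like the following base-R sketch. The data frame df, its column names and the helper expand_one are made up for illustration; this is not how the dummies package is implemented, only one way to apply model.matrix to each factor column in turn. Note that model.matrix drops rows containing NA by default, so the sketch assumes complete data.

```r
# Hypothetical example data: two factor columns and one numeric column
df <- data.frame(
  colour = factor(c("red", "blue", "red")),
  size   = factor(c("S", "L", "M")),
  weight = c(1.5, 2.0, 0.7)
)

# Expand a single factor column into a full set of 0/1 indicator columns
expand_one <- function(name, data) {
  m <- model.matrix(~ f - 1, data = data.frame(f = data[[name]]))
  colnames(m) <- paste0(name, levels(data[[name]]))
  m
}

# Apply the helper to every factor column and bind the pieces together
is_fac  <- vapply(df, is.factor, logical(1))
dummies <- do.call(cbind, lapply(names(df)[is_fac], expand_one, data = df))

# Keep the non-factor columns next to the new indicator columns
result <- cbind(df[!is_fac], dummies)
print(result)
```

Because each factor is expanded on its own, every factor keeps all of its levels, which is the behaviour a single multi-factor formula does not give you.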
Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:
dummy( df$var )
dummy( "var", df )
Moreover, you can choose the style of the dummy names, whether to include unused factor levels, whether to have verbose output, etc.
But going beyond the R-wiki solution, dummy.data.frame does something similar for data frames. You can specify which columns to expand by name or class and whether to return non-expanded columns.
The package dummies-1.04 is available in CRAN. Comments and questions are always appreciated.
Measuring performance of functions in R

For instance:
> performance(complex.function(1,5,"goal"), samples=100)
Average time per run:
-----------------------------
User System Elapsed
0.338 0.014 0.352
Total time for all runs:
------------------------
User System Elapsed
33.805 1.369 35.212
I included 'performance' in my R basic functions (see also Customizing R: startup script).
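The definition of performance is not shown here. A minimal sketch of how such a helper could be written with system.time follows; the body is an assumption reconstructed from the output above, not the author's actual code, and the printed layout is only approximately the same.

```r
# Hypothetical reimplementation: time an expression over several runs
performance <- function(expr, samples = 1) {
  expr <- substitute(expr)   # capture the call without evaluating it
  env  <- parent.frame()     # evaluate it where the caller's objects live

  timing <- system.time(for (i in seq_len(samples)) eval(expr, env))
  total  <- c(User    = timing[["user.self"]],
              System  = timing[["sys.self"]],
              Elapsed = timing[["elapsed"]])

  cat("Average time per run:\n")
  print(round(total / samples, 3))
  cat("Total time for all runs:\n")
  print(round(total, 3))

  # Return the raw numbers invisibly so they can be reused
  invisible(list(average = total / samples, total = total))
}

res <- performance(sum(sqrt(1:10000)), samples = 100)
```

Capturing the expression with substitute is what lets the call be written as performance(f(x), samples = 100) rather than passing a function and its arguments separately.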
Example 7.14: A simple graphic of sales
RInside release 0.1.1, and a fresh example
However, today I committed a new example to SVN archive at R-Forge. It is based on this thread on r-devel. Abhijit Bera tries to do this in C, but to me his questions provide rather clear motivation for showing how much simpler things can be via C++ and the Rcpp classes along with RInside. Using a small example, the task was to pass a weight vector to a portfolio solver from the Rmetrics package fPortfolio and to then access the computed solution. The original poster struggled with access from C to the S4 classes used by fPortfolio and could not set the weights. But when using RInside, we simply pass a C++ vector of weights down to R, solve the problem and pass a solution vector back using the handy evaluation of R expressions:
// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4; tab-width: 8; -*-
//
// Another simple example inspired by an r-devel mail by Abhijit Bera
//
// Copyright (C) 2009 Dirk Eddelbuettel and GPL'ed

#include "RInside.h"                    // for the embedded R via RInside
#include "Rcpp.h"                       // for the R / Cpp interface used for transfer
#include <iomanip>

int main(int argc, char *argv[]) {

    try {
        RInside R(argc, argv);          // create an embedded R instance
        SEXP ans;

        std::string txt = "suppressMessages(library(fPortfolio))";
        if (R.parseEvalQ(txt))          // load library, no return value
            throw std::runtime_error("R cannot evaluate '" + txt + "'");

        txt = "lppData <- 100 * LPP2005.RET[, 1:6]; "
              "ewSpec <- portfolioSpec(); "
              "nAssets <- ncol(lppData); ";
        if (R.parseEval(txt, ans))      // prepare problem
            throw std::runtime_error("R cannot evaluate '" + txt + "'");

        const double dvec[6] = { 0.1, 0.1, 0.1, 0.1, 0.3, 0.3 }; // choose any weights you want
        const std::vector<double> w(dvec, &dvec[6]);

        R.assign( w, "weightsvec");     // assign STL vector to R's 'weightsvec' variable

        txt = "setWeights(ewSpec) <- weightsvec";
        if (R.parseEvalQ(txt))          // evaluate assignment
            throw std::runtime_error("R cannot evaluate '" + txt + "'");

        txt = "ewPortfolio <- feasiblePortfolio(data = lppData, spec = ewSpec, constraints = \"LongOnly\"); "
              "print(ewPortfolio); "
              "vec <- getCovRiskBudgets(ewPortfolio@portfolio)";
        if (R.parseEval(txt, ans))      // assign covRiskBudget weights to ans
            throw std::runtime_error("R cannot evaluate '" + txt + "'");
        RcppVector<double> V(ans);      // convert SEXP variable to an RcppVector

        R.parseEval("names(vec)", ans); // assign columns names to ans
        RcppStringVector names(ans);

        for (int i=0; i<names.size(); i++) {
            std::cout << std::setw(16) << names(i) << "\t"
                      << std::setw(11) << V(i) << "\n";
        }

    } catch(std::exception& ex) {
        std::cerr << "Exception caught: " << ex.what() << std::endl;
    } catch(...) {
        std::cerr << "Unknown exception caught" << std::endl;
    }

    exit(0);
}
WordPress Blogging with R in 3 Steps
A few people have emailed me and enquired about the use of tools mentioned at the end of this post to make blogposts with embedded R-commands. Below is a small step-by-step walkthrough of how to accomplish this.
- Write your blog post in a simple text file; you can include formatting using asciidoc syntax. Let’s call the file workflow.Rnw:
Letters
-------
First we will display all the letters.
<<>>=
letters
@
And then only the first five letters of the alphabet.
<<>>=
letters[1:5]
@
- Process workflow.Rnw with Sweave, using the driver provided by the ascii package, to create workflow.txt, a file in Asciidoc format.
> library(ascii)
> Sweave("workflow.Rnw", driver = RweaveAsciidoc,
+     syntax = "SweaveSyntaxNoweb")
Writing to file workflow.txt
Processing code chunks ...
 1 : echo term verbatim
 2 : echo term verbatim
You can now run asciidoc on 'workflow.txt'
The asciidoc file workflow.txt looks like this:
Letters
-------
First we will display all the letters.
----
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
----
And then only the first five letters of the alphabet.
----
> letters[1:5]
[1] "a" "b" "c" "d" "e"
----
- Use the Python script blogpost.py, written by Stuart Rackham, to upload the post to a WordPress host. The host and login details are contained in blogpost.py.conf.
Note
Prerequisites are Python >2.5 and Asciidoc.
python blogpost.py post --conf blogpost.py.conf workflow.txt
will render the following entry:
Letters
First we will display all the letters.
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
And then only the first five letters of the alphabet.
> letters[1:5]
[1] "a" "b" "c" "d" "e"

Blogs on R | Statistics
What makes a good blog on R?
Of course different readers look for different things, but for me a good blog on R:
- Provides a Gateway: The blog discovers and shares useful external material and explains why it's useful
- Provides a Forum: The blog raises important issues and topics. Better yet if the blog can initiate debate.
- Generates Original Content: In particular it is useful if it can teach the reader how to do something. Better yet if the thing taught is of general appeal and is taught clearly.
- Posts Frequently: E.g., 2+ posts per week.
UPDATE: R bloggers compiles many of the blogs on R into a single source and is probably the best place to find a list of Blogs on R. You can also subscribe to all of them through R bloggers.
These are some blogs on R to which I subscribe.
- REvolutions: A blog devoted exclusively to R. This is probably the best blog on R currently around. Among other things it has interesting applications, references to resources, and interesting assorted tidbits.
- Statistical Modeling, Causal Inference, and Social Science: This is probably the best example of a blog in the social science statistics area. It has a large readership and good discussions in the comments. Only a small percentage of posts are on R; many of the posts reflect the author's interests in political science. For R stuff, check out particularly the category Statistical Computing.
- One R Tip a day: This blog is devoted to R and has lots of interesting examples.
- Decision Science News: Sometimes has posts on R.
- Planet R: This is a meta-blog that pools subscriptions to a large number of feeds related to R.
- Andrew Redd
- Romain Francois
- Gregor Gorjanc
- Quantitative Ecology
- Blogistic Reflections
- Getting Genetics Done: R Tag
- Of course, there's my own blog, with the R tag.
Adjusting Correlations for Reliability | Attenuation Formula
Classical Test Theory states that an Observed variable is True Score plus Error. The true score variable is latent. In psychology theoretical interest typically relates more to the latent than the observed variable. How can you estimate the correlation between two latent variables?
The correction for attenuation formula:
- rxy / sqrt(rxx * ryy)
- Or in words: The disattenuated correlation is the raw correlation between x and y (rxy) divided by the square root of the product of the reliability of x (rxx) and the reliability of y (ryy).
- See Page 130 of Murphy, K. R. & Davidshofer, C. O. (1988). Psychological Testing: Principles and Applications.
- Here it is on Wikipedia
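As a worked example, the formula can be written as a one-line R function. The name disattenuate and the numbers used are purely illustrative assumptions, not taken from any package or data set.

```r
# Correction for attenuation: rxy / sqrt(rxx * ryy)
# (hypothetical helper name; reliabilities must lie in (0, 1])
disattenuate <- function(rxy, rxx, ryy) {
  stopifnot(rxx > 0, rxx <= 1, ryy > 0, ryy <= 1)
  rxy / sqrt(rxx * ryy)
}

# Made-up values: observed correlation .30, reliabilities .70 and .80
corrected <- disattenuate(rxy = 0.30, rxx = 0.70, ryy = 0.80)
print(round(corrected, 2))
```

With these made-up values the corrected correlation is about .40, noticeably larger than the observed .30. When both reliabilities equal 1 the formula returns the observed correlation unchanged.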
In R:
The psych package has the following function, which returns a matrix of corrected correlations; x is the raw correlation matrix and y is a vector of reliabilities. For the details see the help.
correct.cor(x, y)
"Raw correlations below the diagonal, reliabilities on the diagonal, disattenuated above the diagonal."
Structural Equation Modelling:
A major motivation for doing Structural Equation Modelling is to estimate parameters (e.g., correlations and regression coefficients) after adjusting for reliability of measurement. You can either specify the reliability of measurement explicitly or you can estimate the reliability based on the indicators used.
Output:
It can sometimes be nice to show a correlation matrix with reliability-adjusted correlations above the diagonal and unadjusted correlations below the diagonal. The correct.cor function in the psych package provides this output.
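For readers without the psych package, the same layout can be put together in base R. This is a sketch of the idea only, not psych's implementation, and the correlations and reliabilities below are invented for illustration.

```r
# Invented raw correlations among three measures
vars <- c("a", "b", "c")
r <- matrix(c(1.0, 0.3, 0.4,
              0.3, 1.0, 0.5,
              0.4, 0.5, 1.0),
            nrow = 3, dimnames = list(vars, vars))

# Invented reliabilities for the same three measures
rel <- c(a = 0.7, b = 0.8, c = 0.9)

# Disattenuate every pair at once: r_xy / sqrt(r_xx * r_yy)
corrected <- r / sqrt(outer(rel, rel))

# Raw below the diagonal, reliabilities on it, corrected above it
out <- r
out[upper.tri(out)] <- corrected[upper.tri(corrected)]
diag(out) <- rel
print(round(out, 2))
```

outer(rel, rel) builds the matrix of pairwise reliability products, so the whole correction is a single element-wise division.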
Comments on Assessing Variable Importance in Multiple Regression:
If you are trying to assess the relative importance of a set of predictors in a multiple regression, it is problematic if the predictors differ in their reliability. The more reliable predictors will appear more important than the others partly because of these differences in reliability.
In this situation, it is desirable to design a study where all measures are reliable and equally so. SEM provides a good option if the data is already collected and the measures differ in reliability.
Psychology Statistics 101 | R or SPSS
It is an interesting case study in how to integrate R into a psychology quantitative methods course at the undergraduate level. It's also a cool example of integrating web resources.
How to Import MS Excel Data into R
He should have added the last sentence if he were a Windows user in this age.
1. Avoid Using M$ Excel
A lot of R users often ask this question: “How to import MS Excel data into R?” Well, my suggestion is, avoid using M$ Excel if you are a statistician (or going to be a statistician) because you just cannot imagine how messy Excel data can be: some cells might be merged, some are colored, some texts are bold, several data tables can be put everywhere (e.g. cell(1,1) to (10,4), and (17,3) to (25,9)), stupid bar plots and pie charts are inserted in the sheets, silly statistical procedures that are wrong forever… If you don’t trust my words (yes, I’m a nobody), just read the examples here: Problems with Excel (collected by Prof Harrell).
I know there are reasons for you to continue using Excel. Your boss requires you to do so; you don’t have time to learn more about various data formats; everybody is using Excel, and you don’t want to be the cool one using R; or if you finish your tasks too quickly and accurately, your boss will doubt whether you have really spent time working, hence you will get paid less (this is a REAL story for me: though I didn’t get paid less, I was indeed doubted when I used R); …
2. Data as Pure Text
A quick solution to the problem is to save your Excel data in a pure text format, e.g. CSV (comma-separated value) or tab-delimited. If you have ever thumbed through Dr Murrell’s book “Introduction to Data Technologies”, you probably know that the CSV format is NOT an Excel-specific format, although Windows users always find the Excel icon is associated with the *.csv files. Pure text is a ridiculously simple data format, but it’s amazing that there are still many people who do not know anything about it. The basic idea is to separate data columns with a delimiter (e.g. “,” or “;”) and rows with a usual line-break symbol (e.g. carriage-return, which can be different in Windows and Linux). In this case, we can identify all data values as we do in the spreadsheet. Here is an example with data in a spreadsheet:
If we save this data as a CSV file, and open it with a pure text editor (e.g. Notepad), we will see:
"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5,3.2,1.2,0.2,"setosa"
"2",5.1,3.8,1.9,0.4,"setosa"
"3",5.1,3.3,1.7,0.5,"setosa"
"4",6.7,3.1,4.7,1.5,"versicolor"
"5",5.1,3.7,1.5,0.4,"setosa"
"6",5,3,1.6,0.2,"setosa"
"7",5.3,3.7,1.5,0.2,"setosa"
"8",5,3.4,1.6,0.4,"setosa"
"9",4.9,2.4,3.3,1,"versicolor"
"10",6.3,2.5,5,1.9,"virginica"
Or save as tab-delimited text:
""	"Sepal.Length"	"Sepal.Width"	"Petal.Length"	"Petal.Width"	"Species"
"1"	5	3.2	1.2	0.2	"setosa"
"2"	5.1	3.8	1.9	0.4	"setosa"
"3"	5.1	3.3	1.7	0.5	"setosa"
"4"	6.7	3.1	4.7	1.5	"versicolor"
"5"	5.1	3.7	1.5	0.4	"setosa"
"6"	5	3	1.6	0.2	"setosa"
"7"	5.3	3.7	1.5	0.2	"setosa"
"8"	5	3.4	1.6	0.4	"setosa"
"9"	4.9	2.4	3.3	1	"versicolor"
"10"	6.3	2.5	5	1.9	"virginica"
Then use read.table() or read.csv() in R to read these pure text files (as data.frames).
A hint for lazy users: you can also select all the data cells, copy them (to the clipboard) and use read.table("clipboard") to get the data into R. In this case, what exists in your clipboard is tab-delimited pure text.
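To make the two reading functions concrete, here is a small self-contained sketch. It uses three of the rows shown above, held in a string rather than in a real file, and a temporary file for the tab-delimited version; the object names are arbitrary.

```r
# A few rows of the CSV shown above, held in a string instead of a file
csv_text <- '"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5,3.2,1.2,0.2,"setosa"
"4",6.7,3.1,4.7,1.5,"versicolor"
"10",6.3,2.5,5,1.9,"virginica"'

# Comma-separated: the first (unnamed) column holds the row names
from_csv <- read.csv(text = csv_text, row.names = 1)

# Tab-delimited: write the same data to a temporary file and read it back
tab_file <- tempfile(fileext = ".txt")
write.table(from_csv, tab_file, sep = "\t", col.names = NA)
from_tab <- read.table(tab_file, header = TRUE, sep = "\t", row.names = 1)

# Both routes give a data.frame with the same columns
str(from_csv)
str(from_tab)
```

For a real file, replace the text argument with the file path, e.g. read.csv("mydata.csv"), where the file name is of course whatever you saved from the spreadsheet.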
3. What If I Insist on Using Excel
All right, if you can’t be bothered to save the Excel sheet as pure text and don’t even want to copy it to the clipboard, then you can treat Excel files as databases, although they are indeed bad databases. You must guarantee that the data is “clean” and well-formatted, i.e. observations in rows and variables in columns (no merged cells, preferably no graphs). We can use the RODBC package to establish a connection to the Excel file, and execute SQL commands over the connection to query the data. Functions related to this task are odbcConnectExcel() and odbcConnectExcel2007() (again, Excel is stupid: they always change the standard so that their products can be inconsistent). This is described in detail in the manual R-data (“R Data Import/Export”).
As *.xls (or *.xlsx) is a binary format, never try to read.table("*.xls"). Meanwhile, read.xls() in the gdata package might be what you want if you are looking for the read.*-style R functions. [Thanks, Doug!]
In most cases, pure text format suffices to work, although it is ridiculously simple. Take a look at the “source code” and you will know everything. By the way, the extension of a file name is not that important: *.csv does not have to be a comma-separated text file, and *.doc can be something other than a Word document. It’s just a matter of convention. Again, open it and see what on earth is inside.

