R: The Dummies Package

September 30, 2009 · Posted in R bloggers · Comments Off 

R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnant. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day that I see a new package or read a new comment on R-Help gives me pause to think.

As much as I like R, on occasion I will find myself lost in some dark corner. Sometimes, I find light. Sometimes I am gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely someone else must have encountered this and provided a package. I went to the Internet and sure enough the R-wiki was there to save me. And looking even harder, I found some who had trodden before me on the R-Help archives. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:


x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)

Two problems. The first problem is that without an external source (Google), I would never have stumbled upon what I wanted. (Thanks, Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”

The second problem is the arcane syntax, wtf <- ~ xf - 1. I get it now, but it took me some time to figure out what was going on. I get it, but why not just dummy(var)? This is what I want to do.

The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t say:

model.matrix( ~ xf1 + xf2 + xf3 - 1)

It turns out, you can only fully expand one variable at a time: with several factors in the formula, only the first keeps all of its levels, and each later one loses a level to the contrast coding. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forget how to do it. So with a couple of spare hours, I decided that the next guy wouldn’t have to think about it. He could just use my dummies package.
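
For comparison, the hand-rolled workaround alluded to above might look roughly like the following in base R. This is only a sketch under assumptions: the data frame `df`, its column names, and the helper `expand_one` are made up for illustration and are not part of the dummies package.

```r
# Hypothetical example data: two factor columns and one numeric column.
df <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = factor(c("S", "L", "M", "S")),
  price  = c(1.5, 2.0, 1.2, 3.1)
)

# Expand a single factor column into a full set of 0/1 columns,
# using the same "~ x - 1" formula trick as the R-wiki recipe.
expand_one <- function(name, data) {
  model.matrix(as.formula(paste("~", name, "- 1")), data = data)
}

# Find the factor columns, expand each one separately, then bind the results.
factor_cols <- names(df)[sapply(df, is.factor)]
dummies <- do.call(cbind, lapply(factor_cols, expand_one, data = df))

dummies
```

Because each factor is expanded on its own, every level gets a column and each row contains exactly one 1 per original factor.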

Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:


dummy( df$var )
dummy( "var", df )

Moreover, you can choose the style of the dummy names, whether to include unused factor level, to have verbose output, etc.

But going beyond the R-wiki solution, dummy.data.frame offers something similar for data frames. You can specify which columns to expand by name or class and whether to return non-expanded columns.

The package dummies-1.04 is available in CRAN. Comments and questions are always appreciated.

Measuring performance of functions in R

September 30, 2009 · Posted in R bloggers · Comments Off 
In R you can use the system.time(function) function to test the time it takes to execute a function. Because I had to do some extended performance testing in R, I wrote a new function based on system.time which gives you the possibility to run multiple samples of a function, to increase the reliability of the measurements. The function is called performance, and takes three arguments: the function you want to test, the number of samples (1 by default), and whether garbage collection should be performed before each function run (yes by default).

For instance:

> performance(complex.function(1,5,"goal"), samples=100)

Average time per run:
-----------------------------
User System Elapsed
0.338 0.014 0.352

Total time for all runs:
------------------------
User System Elapsed
33.805 1.369 35.212

I included 'performance' in my R basic functions (see also Customizing R: startup script).
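
The author's source is not shown here, so the following is only a guess at how such a wrapper around system.time could be written. The name `performance_sketch`, its argument names, and the returned list are assumptions for illustration, not the original function.

```r
# A minimal sketch of a repeated-timing helper built on system.time().
# 'expr' is the call to time, 'samples' the number of runs, and
# 'gcFirst' whether to garbage-collect before each run.
performance_sketch <- function(expr, samples = 1, gcFirst = TRUE) {
  expr <- substitute(expr)   # capture the call without evaluating it
  env  <- parent.frame()     # evaluate it where the caller's variables live

  times <- matrix(0, nrow = samples, ncol = 3,
                  dimnames = list(NULL, c("User", "System", "Elapsed")))

  for (i in seq_len(samples)) {
    t <- system.time(eval(expr, env), gcFirst = gcFirst)
    times[i, ] <- t[1:3]     # user, system and elapsed seconds
  }

  list(average = colMeans(times), total = colSums(times))
}

# Usage: time a cheap expression over 20 runs.
res <- performance_sketch(sum(sqrt(1:10000)), samples = 20)
res$average
res$total
```

Capturing the expression with substitute() is what allows the same call to be re-evaluated on every run; passing an already evaluated value would time nothing.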

Example 7.14: A simple graphic of sales

September 29, 2009 · Posted in R bloggers · Comments Off 
In this example, we show a simple plot of the sales rank data read in as shown in example 7.13.

SAS

In SAS, we use the symbol statement (section 5.3) to request small (with the h option) dots (with the v option), and that the dots not be connected (with the i option). (See sections 5.2.2, 5.3.9 for more details.) We request a scatter plot with the gplot procedure (section 5.1.1), and tell SAS how to

RInside release 0.1.1, and a fresh example

September 29, 2009 · Posted in R bloggers · Comments Off 
Last week's 0.1.0 release of RInside, the first to have been published on CRAN, still had some issues with builds and use on OS X. Thanks to testing and fixes by Jan de Leeuw, Jeff Horner and particularly Simon Urbanek, things are said to be better now with the new release 0.1.1 which went onto CRAN yesterday. So no new features, but fixes to the main Makefile as well as the Makefile for the examples directory, and some minor fixes and editing for the examples. I also added a file THANKS to show some appreciation for the various patches and fixes I have been receiving -- they are appreciated!

However, today I committed a new example to the SVN archive at R-Forge. It is based on this thread on r-devel. Abhijit Bera tries to do this in C, but to me his questions provide rather clear motivation for showing how much simpler things can be via C++ and the Rcpp classes along with RInside. Using a small example, the task was to pass a weight vector to a portfolio solver from the Rmetrics package fPortfolio and then to access the computed solution. The original poster struggled with access from C to the S4 classes used by fPortfolio and could not set the weights. But when using RInside, we simply pass a C++ vector of weights down to R, solve the problem and pass a solution vector back using the handy evaluation of R expressions:

// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4;  tab-width: 8; -*-
//
// Another simple example inspired by an r-devel mail by Abhijit Bera
//
// Copyright (C) 2009 Dirk Eddelbuettel and GPL'ed

#include "RInside.h"                    // for the embedded R via RInside
#include "Rcpp.h"                       // for the R / Cpp interface used for transfer
#include <iomanip>

int main(int argc, char *argv[]) {

    try {
        RInside R(argc, argv);          // create an embedded R instance
        SEXP ans;

        std::string txt = "suppressMessages(library(fPortfolio))";
        if (R.parseEvalQ(txt))          // load library, no return value
            throw std::runtime_error("R cannot evaluate '" + txt + "'");

        txt = "lppData <- 100 * LPP2005.RET[, 1:6]; "
	  "ewSpec <- portfolioSpec(); "
	  "nAssets <- ncol(lppData); ";
        if (R.parseEval(txt, ans))      // prepare problem
            throw std::runtime_error("R cannot evaluate '" + txt + "'");

	const double dvec[6] = { 0.1, 0.1, 0.1, 0.1, 0.3, 0.3 }; // choose any weights you want
	const std::vector<double> w(dvec, &dvec[6]);

	R.assign( w, "weightsvec");	// assign STL vector to R's 'weightsvec' variable

	txt = "setWeights(ewSpec) <- weightsvec";
        if (R.parseEvalQ(txt))		// evaluate assignment
            throw std::runtime_error("R cannot evaluate '" + txt + "'");

	txt = "ewPortfolio <- feasiblePortfolio(data = lppData, spec = ewSpec, constraints = \"LongOnly\"); "
	  "print(ewPortfolio); "
	  "vec <- getCovRiskBudgets(ewPortfolio@portfolio)";
        if (R.parseEval(txt, ans))      // assign covRiskBudget weights to ans
            throw std::runtime_error("R cannot evaluate '" + txt + "'");
	RcppVector<double> V(ans);      // convert SEXP variable to an RcppVector

	R.parseEval("names(vec)", ans);	// assign columns names to ans
	RcppStringVector names(ans);

	for (int i=0; i<names.size(); i++) {
	  std::cout << std::setw(16) << names(i) << "\t"
		    << std::setw(11) << V(i) << "\n";
        }

    } catch(std::exception& ex) {
        std::cerr << "Exception caught: " << ex.what() << std::endl;
    } catch(...) {
        std::cerr << "Unknown exception caught" << std::endl;
    }

    exit(0);
}

WordPress Blogging with R in 3 Steps

September 29, 2009 · Posted in R bloggers · Comments Off 

A few people have emailed me and enquired about the use of tools mentioned at the end of this post to make blogposts with embedded R-commands. Below is a small step-by-step walkthrough of how to accomplish this.

  1. Write your blog post in a simple text file; you can include formatting using asciidoc syntax. Let’s call the file workflow.Rnw:
    Letters
    -------
    
    First we will display all the letters.
    
    <<>>=
    letters
    @
    
    And then only the first five letters of the alphabet.
    
    <<>>=
    letters[1:5]
    @
  2. Process workflow.Rnw with Sweave, using the driver provided by the ascii package, to create workflow.txt, a file in Asciidoc format.
    > library(ascii)
    > Sweave("workflow.Rnw", driver = RweaveAsciidoc,
    +     syntax = "SweaveSyntaxNoweb")
    Writing to file workflow.txt
    Processing code chunks ...
     1 : echo term verbatim
     2 : echo term verbatim
    
    You can now run asciidoc on 'workflow.txt'

    The asciidoc file workflow.txt looks like this:

    Letters
    -------
    
    First we will display all the letters.
    
    ----
    > letters
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
    [14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
    ----
    
    And then only the first five letters of the alphabet.
    
    ----
    > letters[1:5]
    [1] "a" "b" "c" "d" "e"
    ----
  3. Use the Python script blogpost.py, written by Stuart Rackham, to upload the post to a WordPress host. The host and login details are contained in blogpost.py.conf.

    Note

    Prerequisites are Python >2.5 and Asciidoc.
    python blogpost.py post --conf blogpost.py.conf workflow.txt

    will render the following entry:


Letters

First we will display all the letters.

> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

And then only the first five letters of the alphabet.

> letters[1:5]
[1] "a" "b" "c" "d" "e"

Blogs on R | Statistics

September 28, 2009 · Posted in R bloggers · Comments Off 
What are the current blogs on R? What do they cover?
What makes a good blog on R?
Of course different readers look for different things, but for me a good blog on R:
  • Provides a Gateway: The blog discovers and shares useful external material and explains why it's useful
  • Provides a Forum: The blog raises important issues and topics. Better yet if the blog can initiate debate.
  • Generates Original Content: In particular it is useful if it can teach the reader how to do something. Better yet if the thing taught is of general appeal and is taught clearly.
  • Posts Frequently: E.g., 2+ posts per week.
Blogs on R:
UPDATE: R bloggers compiles many of the blogs on R into a single source and is probably the best place to find a list of Blogs on R. You can also subscribe to all of them through R bloggers.

These are some blogs on R to which I subscribe.
I've probably left some out and new blogs on R seem to be coming up all the time. Feel free to add any missing ones in the comments to this post.


Adjusting Correlations for Reliability | Attenuation Formula

September 28, 2009 · Posted in R bloggers · Comments Off 
This post discusses ways of adjusting correlations for reliability.


Classical Test Theory states that an Observed variable is True Score plus Error. The true score variable is latent. In psychology theoretical interest typically relates more to the latent than the observed variable. How can you estimate the correlation between two latent variables?

The correction for attenuation formula:

  • rxy / sqrt(rxx * ryy)
  • Or in words: The disattenuated correlation is the raw correlation between x and y (rxy) divided by the square root of the product of the reliability of x (rxx) and the reliability of y (ryy).
  • See Page 130 of Murphy, K. R. & Davidshofer, C. O. (1988). Psychological Testing: Principles and Applications.
  • Here it is on Wikipedia
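
As a quick illustration of the formula above, here is a small R sketch. The function name `disattenuate` and the input values are made up for the example.

```r
# Correction for attenuation: rxy / sqrt(rxx * ryy)
# rxy: observed correlation between x and y
# rxx, ryy: reliabilities of x and y
disattenuate <- function(rxy, rxx, ryy) {
  rxy / sqrt(rxx * ryy)
}

# Hypothetical values: observed r = .30, reliabilities .80 and .70
r_corrected <- disattenuate(rxy = 0.30, rxx = 0.80, ryy = 0.70)
r_corrected   # roughly 0.40
```

The corrected value is always at least as large in magnitude as the raw correlation, because reliabilities cannot exceed 1.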

In R:
The psych package has the following function which will return a correlation matrix of corrected correlations. For the details see the help.
correct.cor(x, y)
"Raw correlations below the diagonal, reliabilities on the diagonal, disattenuated above the diagonal."

Structural Equation Modelling:
A major motivation for doing Structural Equation Modelling is to estimate parameters (e.g., correlations and regression coefficients) after adjusting for reliability of measurement. You can either specify the reliability of measurement explicitly or you can estimate the reliability based on the indicators used.

Output:
It can sometimes be nice to show a correlation matrix with reliability-adjusted correlations above the diagonal and unadjusted correlations below the diagonal. The correct.cor function in the psych package provides this output.
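
To make that layout concrete, here is a base-R sketch that builds the same kind of matrix by hand. It is not the psych implementation; the correlation matrix `r` and the reliabilities `rel` are hypothetical values chosen for illustration.

```r
# Hypothetical raw correlations among three measures
vars <- c("a", "b", "c")
r <- matrix(c(1.0, 0.3, 0.4,
              0.3, 1.0, 0.5,
              0.4, 0.5, 1.0),
            nrow = 3, dimnames = list(vars, vars))

# Hypothetical reliabilities of the three measures
rel <- c(a = 0.8, b = 0.7, c = 0.9)

# Disattenuate every pair: r_ij / sqrt(rel_i * rel_j)
corrected <- r / sqrt(outer(rel, rel))

# Raw below the diagonal, reliabilities on it, corrected above it
out <- r
out[upper.tri(out)] <- corrected[upper.tri(corrected)]
diag(out) <- rel

round(out, 2)
```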


Comments on Assessing Variable Importance in Multiple Regression:
If you are trying to assess the relative importance of a set of predictors in a multiple regression, it is problematic if the predictors differ in their reliability. The predictors with larger reliability will appear better than other predictors partially because of differences in reliability.
In this situation, it is desirable to design a study where all measures are reliable and equally so. SEM provides a good option if the data is already collected and the measures differ in reliability.

Psychology Statistics 101 | R or SPSS

September 28, 2009 · Posted in R bloggers · Comments Off 
Dan Wright has placed his Quantitative Methods 1 course online. The course offers instructions both in R and SPSS (PASW).

It is an interesting case study in  how to integrate R into a psychology quantitative methods course at the undergraduate level. It's also a cool example of integrating web resources.

How to Import MS Excel Data into R

September 26, 2009 · Posted in R bloggers · Comments Off 
As Sir Francis Bacon said, “Histories make men wise; poets witty; the mathematics subtile; natural philosophy deep; moral grave; logic and rhetoric able to contend.” And Windows stupid.

He would have added that last sentence had he been a Windows user in this age.

1. Avoid Using M$ Excel

A lot of R users often ask this question: “How to import MS Excel data into R?” Well, my suggestion is, avoid using M$ Excel if you are a statistician (or going to be a statistician) because you just cannot imagine how messy Excel data can be: some cells might be merged, some are colored, some texts are bold, several data tables can be put everywhere (e.g. cell(1,1) to (10,4), and (17,3) to (25,9)), stupid bar plots and pie charts are inserted in the sheets, silly statistical procedures that are wrong forever… If you don’t trust my words (yes, I’m a nobody), just read the examples here: Problems with Excel (collected by Prof Harrell).

I know there are reasons for you to continue using Excel. Your boss required you to do so; you don’t have time to learn more about various data formats; everybody is using Excel, and you don’t want to be so cool to use R; or if you finish your tasks too quickly and accurately, your boss will doubt whether you have really spent time on working, hence you will get less money paid (this is a REAL story for me – though I didn’t get less payment, I was indeed doubted when I used R); …

2. Data as Pure Text

A quick solution to the problem is to save your Excel data in a pure text format, e.g. CSV (comma-separated values) or tab-delimited. If you have ever thumbed through Dr Murrell’s book “Introduction to Data Technologies”, you probably know that the CSV format is NOT an Excel-specific format, although Windows users always find the Excel icon associated with *.csv files. Pure text is a ridiculously simple data format, but it’s amazing that there are still many people who do not know anything about it. The basic idea is to separate data columns with a delimiter (e.g. “,” or “;”) and rows with the usual line-break symbol (e.g. a carriage return; the exact line-ending characters differ between Windows and Linux). In this case, we can identify all data values as we do in the spreadsheet. Here is an example with data in a spreadsheet:

[Figure: Data in the Spreadsheet]

If we save this data as a CSV file, and open it with a pure text editor (e.g. Notepad), we will see:

"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5,3.2,1.2,0.2,"setosa"
"2",5.1,3.8,1.9,0.4,"setosa"
"3",5.1,3.3,1.7,0.5,"setosa"
"4",6.7,3.1,4.7,1.5,"versicolor"
"5",5.1,3.7,1.5,0.4,"setosa"
"6",5,3,1.6,0.2,"setosa"
"7",5.3,3.7,1.5,0.2,"setosa"
"8",5,3.4,1.6,0.4,"setosa"
"9",4.9,2.4,3.3,1,"versicolor"
"10",6.3,2.5,5,1.9,"virginica"

Or save as tab-delimited text:

""	"Sepal.Length"	"Sepal.Width"	"Petal.Length"	"Petal.Width"	"Species"
"1"	5	3.2	1.2	0.2	"setosa"
"2"	5.1	3.8	1.9	0.4	"setosa"
"3"	5.1	3.3	1.7	0.5	"setosa"
"4"	6.7	3.1	4.7	1.5	"versicolor"
"5"	5.1	3.7	1.5	0.4	"setosa"
"6"	5	3	1.6	0.2	"setosa"
"7"	5.3	3.7	1.5	0.2	"setosa"
"8"	5	3.4	1.6	0.4	"setosa"
"9"	4.9	2.4	3.3	1	"versicolor"
"10"	6.3	2.5	5	1.9	"virginica"

Then use read.table() or read.csv() in R to read these pure text files (as data.frames).
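
A small self-contained sketch of both cases follows. It writes a few of the rows shown above to temporary files first, so the file names and the object names `iris_csv` and `iris_tab` are just placeholders for your own.

```r
# Write three of the CSV lines shown above to a temporary file.
csv_lines <- c(
  '"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"',
  '"1",5,3.2,1.2,0.2,"setosa"',
  '"2",5.1,3.8,1.9,0.4,"setosa"',
  '"4",6.7,3.1,4.7,1.5,"versicolor"'
)
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_file)

# Comma-separated: read.csv() assumes a header line and sep = ",".
# row.names = 1 uses the first (unnamed) column as row names.
iris_csv <- read.csv(csv_file, row.names = 1)

# Tab-delimited: same data, tabs instead of commas.
tab_file <- tempfile(fileext = ".txt")
writeLines(gsub(",", "\t", csv_lines), tab_file)
iris_tab <- read.table(tab_file, header = TRUE, sep = "\t", row.names = 1)

str(iris_csv)
```

Both calls return a data.frame with five columns; only the delimiter passed to the reader differs.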

A hint for lazy users: you can also select all the data cells, copy them (to the clipboard) and use read.table("clipboard") to get the data into R. In this case, what exists in your clipboard is tab-delimited pure text.

3. What If I Insist on Using Excel

All right, if you can’t be bothered to save the Excel sheet as pure text and don’t even want to copy it to the clipboard, then you can treat Excel files as databases, although they are indeed bad databases. You must guarantee that the data is “clean” and well-formatted, i.e. observations in each row and variables in each column (no merged cells, preferably no graphs). We can use the RODBC package to establish a connection to the Excel file, and execute SQL commands over the connection to query the data. Functions related to this task are odbcConnectExcel() and odbcConnectExcel2007() (again, Excel is stupid: they always change the standard in order that their products can be inconsistent). This is described in detail in the manual R-data (“R Data Import/Export”).

As *.xls (or *.xlsx) is a binary format, never try to read.table("*.xls"). Meanwhile, read.xls() in the gdata package might be what you want if you are looking for the read.*-style R functions. [Thanks, Doug!]

In most cases, pure text format suffices to work, although it is ridiculously simple. Take a look at the “source code” and you will know everything. By the way, the extension of a file name is not that important: *.csv does not have to be a comma-separated text file, and *.doc can be something other than a Word document. It’s just a matter of convention. Again, open it and see what on earth is inside.
