My first Reproducible Research Compendium

December 14, 2010
(This article was first published on Social data blog, and kindly contributed to R-bloggers)

I have just completed my first Reproducible Research Compendium
"Analysis of the combined survey datasets from the American Red Cross
Tsunami Recovery Program Psycho-Social Project (adult community
respondents)".
It is basically all the reports and data from the work I did evaluating
psychosocial projects for the American Red Cross, bundled up.
But one of the subfolders also contains the scripts and everything
necessary to generate the final pdf report from the original datasets
from scratch, in the spirit of transparency and "reproducible
research".



So there is no copying-and-pasting of graphics from one program into
another. It is easy to make small but significant changes to the
analysis - for instance, excluding one of the constituent surveys by
changing a line near the start of the script - then rerun the whole
thing and produce a corresponding new version of the report. No more
hunting about to find out how you produced some particular graphic or
table.

"An article about computational science in a scientific publication is
not the scholarship itself, it is merely advertising of the
scholarship. The actual scholarship is the complete software
development environment and the complete set of instructions which
generated the figures."
—D. Donoho


This approach has the following advantages:

• making it easier for me to return to the data and analyses in the
future and repeat or extend them

• making it easier for ARC to do the same without having to contact me

• enabling other researchers to repeat and verify these findings
themselves, even automatically if they desire

• ensuring complete transparency of the results

Concretely, this means that the original SPSS files as delivered by
the agencies are not changed at all. All recoding, data cleaning,
omission of cases etc. is carried out in syntax. In fact the report
document itself – the tables, graphics, and statistics mentioned
within the text – is produced entirely by the following procedure:
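To illustrate what "carried out in syntax" looks like in practice, here is a hypothetical data-preparation chunk as it might appear in a Sweave source file. The file name and variable names are invented for the example and are not taken from the actual compendium:

```latex
% Sketch of a data-preparation chunk in a Sweave source file.
% File name and variable names are hypothetical.
<<data-preparation, echo=TRUE>>=
library(foreign)                      # provides read.spss()
raw <- read.spss("survey_adults.sav", # original SPSS file, never edited
                 to.data.frame = TRUE)
dat <- subset(raw, !is.na(raw$age))   # omission of cases, in syntax
dat$age_group <- cut(dat$age,         # recoding, in syntax
                     breaks = c(0, 25, 45, 65, Inf),
                     labels = c("under 25", "25-44", "45-64", "65+"))
@
```

Because every cleaning step is a line of code rather than a menu click, the full provenance of every derived variable is recorded in the source file itself.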

A word-processing document ("source file") is prepared which is
essentially the final report, complete with introduction, chapter
headings, commentary etc., together with blocks of syntax wherever
statistical results are required - in particular tables, graphics,
and inline results.

A single syntax file is run which takes the source file and creates a
second document, the present report, which is identical to the source
file except that the blocks of syntax are replaced by their results
(tables, graphics, etc.). So there is no cutting-and-pasting or
editing of data in the data files, nor, for example, any manual
editing of tables or graphics.
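This two-step procedure is essentially what Sweave implements. A minimal, hypothetical source file might look like the following; the chunk contents and the inline \Sexpr{} expression are invented for illustration:

```latex
\documentclass{article}
\begin{document}
\section{Results}
% A code chunk: replaced in the output by the figure it draws.
<<satisfaction-plot, fig=TRUE, echo=FALSE>>=
hist(dat$satisfaction, main = "Respondent satisfaction")
@
% An inline result: replaced by the computed number itself.
The mean satisfaction score was
\Sexpr{round(mean(dat$satisfaction), 2)}.
\end{document}
```

Running Sweave("report.Rnw") in R produces a .tex file in which every chunk has been replaced by its results, and that file is then compiled to PDF in the usual way (e.g. with tools::texi2pdf).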

So at each point in this report at which data preparation is
discussed, the interested reader will find, at the corresponding point
in the source file, the syntax which actually carries out that
preparation. Likewise, at each point at which tables or graphics are
displayed, the source file contains the syntax which actually
constructs them.

So the source document and datasets are available to anyone
interested, who can then repeat these calculations, see exactly how
they are arrived at, and extend the analyses at will.

Unfortunately, to the best of my knowledge the statistics program most
familiar to social scientists, SPSS, does not fulfil all of these
requirements; in particular, it cannot produce a complete report
automatically. So the work is carried out using the package Sweave for
the open-source statistics program R. Intermediate datasets in SPSS
format, including all recoded and calculated variables, are also
provided, so that as much as possible of the above can be
accomplished with SPSS as well.

In detail, the original word-processing file is written using the free
program LyX (www.lyx.org), which is available for Windows, Mac and
Linux, and is transformed into the final PDF report using the R
statistics engine. If you open the source file in LyX you can see all
the R commands which are embedded in the text and which produce the
tables etc. in the PDF file.


