Example 9.9: Simplifying R using the mosaic package (part 1)

October 13, 2011

(This article was first published on SAS and R, and kindly contributed to R-bloggers)



While both SAS and R are powerful systems for statistical analysis, they can be frustrating to new users or those learning statistics for the first time.

R
The mosaic package is designed to help simplify the interface for such new users, while allowing them to undertake sophisticated analyses.

As an example of how the package simplifies life for the novice user, consider calculating summary statistics and displaying a density plot for the CESD (a measure of depressive symptoms) scores by substance abuse group in the HELP dataset. Doing this in R without the package would require either mastering a package such as plyr to replicate results by substance, or typing-intensive syntax to select the rows corresponding to each substance.
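To make the "typing-intensive" route concrete, here is a minimal sketch of per-group row selection in base R. The data frame `toy` and its values are invented for illustration (they are not the HELP data), and the variable names are our own.

```r
# Toy stand-in for the HELP data; these values are invented for illustration only
toy <- data.frame(
  substance = c("alcohol", "alcohol", "cocaine", "cocaine", "heroin", "heroin"),
  cesd      = c(30, 40, 20, 30, 35, 45)
)

# The row selection has to be written out once per group
mean_alcohol <- mean(toy$cesd[toy$substance == "alcohol"])
mean_cocaine <- mean(toy$cesd[toy$substance == "cocaine"])
mean_heroin  <- mean(toy$cesd[toy$substance == "heroin"])

# Collect the three results into one named vector
c(alcohol = mean_alcohol, cocaine = mean_cocaine, heroin = mean_heroin)
```

Every additional statistic (standard deviation, median, and so on) means repeating all three selection lines, which is the burden the formula interface is meant to remove.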


ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
options(digits=3)

After loading the data and the package, and setting the number of digits to a more reasonable default, we can call the mean() function to easily calculate this statistic (denoted by S in the result) for each of the three substance abuse groups: alcohol, cocaine, and heroin.

> mean(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 34.4 177       0
2   cocaine 29.4 152       0
3    heroin 34.9 124       0

Similar results are seen when we calculate the standard deviations per group:

> sd(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 12.1 177       0
2   cocaine 13.4 152       0
3    heroin 11.2 124       0

The summary() function, called with fun=favstats, calculates a raft of nicely formatted summary statistics for each group.

> summary(cesd ~ substance, data=ds, fun=favstats)
cesd N=453
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|         |       |N  |min |Q1 |median |Q3 |max |mean |sd  |n  |missing |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|substance|alcohol|177|4   |26 |36     |42 |58  |34.4 |12.1|177|0       |
|         |cocaine|152|1   |19 |30     |39 |60  |29.4 |13.4|152|0       |
|         |heroin |124|4   |28 |35     |43 |56  |34.9 |11.2|124|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|Overall  |       |453|1   |25 |34     |41 |60  |32.8 |12.5|453|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
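For readers curious what goes into such a table, the following base R sketch computes the same columns (min, quartiles, median, max, mean, sd, n, missing) by group. It is not mosaic's implementation. The helper `fav()` and the data frame `toy` are hypothetical names, and the toy values (including the deliberate NA) are invented for illustration.

```r
# Toy data with one missing score; values are invented for illustration only
toy <- data.frame(
  substance = c("alcohol", "alcohol", "alcohol", "cocaine", "cocaine", "cocaine"),
  cesd      = c(30, 40, NA, 20, 30, 40)
)

# Hypothetical helper: the summary columns shown in the table above
fav <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
    n = sum(!is.na(x)), missing = sum(is.na(x)))
}

# Split the scores by group, summarize each piece, and stack the results by row
stats <- do.call(rbind, lapply(split(toy$cesd, toy$substance), fav))
stats
```

The result is a numeric matrix with one row per group, so a nonzero entry in the missing column stands out immediately.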

These commands allow quick review of the data to ensure, for example, that assumptions of equal variance are justified, or that coding errors or missing values haven't crept in.
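As a rough illustration of the equal-variance check, the sketch below takes the group standard deviations reported above and computes the ratio of the largest to the smallest. The threshold of 2 is an informal rule of thumb that we are assuming here, not something taken from the mosaic package.

```r
# Group standard deviations copied from the sd() output above
sds <- c(alcohol = 12.1, cocaine = 13.4, heroin = 11.2)

# Ratio of largest to smallest SD; values well under 2 are often
# treated informally as compatible with an equal-variance assumption
ratio <- max(sds) / min(sds)
round(ratio, 2)
```

Here the ratio is about 1.2, so the three spreads look broadly similar.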

A graphical depiction using a set of densityplots (shown above) can be created using the command:

densityplot(~ cesd, groups=substance, data=ds, auto.key=TRUE)
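For anyone who wants to try the plotting call without downloading the data, here is a small sketch on simulated scores. As far as we know, densityplot() is provided by the lattice package (which ships with R), so the sketch loads lattice directly. The data frame `toy`, the simulated values, and the plot.points=FALSE option are our own additions for illustration.

```r
library(lattice)  # densityplot() is a lattice function

# Simulated scores; the group means and SDs only loosely mimic the output above
set.seed(1)
toy <- data.frame(
  substance = factor(rep(c("alcohol", "cocaine", "heroin"), each = 50)),
  cesd      = c(rnorm(50, 34, 12), rnorm(50, 29, 13), rnorm(50, 35, 11))
)

# One overlaid density curve per group, with a legend;
# plot.points = FALSE suppresses the rug of individual observations
p <- densityplot(~ cesd, groups = substance, data = toy,
                 auto.key = TRUE, plot.points = FALSE)
print(p)
```

Storing the plot in `p` and printing it explicitly also works inside functions and scripts, where lattice plots are not drawn automatically.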


SAS
We're unaware of any similar program that attempts to simplify SAS syntax for educational use. To replicate the above results, we would use the means and sgpanel procedures.

data ds;
set "C:\book\help.sas7bdat";
run;

options ls=80;
proc means data=ds fw=4
min q1 median q3 max mean std nmiss n;
class substance;
var cesd;
run;

Analysis Variable : CESD

             N           Lower             Upper               Std
SUBSTANCE  Obs   Min  Quartile  Median  Quartile   Max  Mean   Dev
------------------------------------------------------------------
alcohol    177  4.00      26.0    36.0      42.0  58.0  34.4  12.1
cocaine    152  1.00      19.0    30.0      39.0  60.0  29.4  13.4
heroin     124  4.00      28.0    35.0      43.0  56.0  34.9  11.2
------------------------------------------------------------------

             N      N
SUBSTANCE  Obs   Miss     N
---------------------------
alcohol    177      0   177
cocaine    152      0   152
heroin     124      0   124
---------------------------

After reading the data in, the means procedure can produce any of the desired statistics (plus many others) directly. To replicate the mosaic package in printing a single statistic, list only that statistic in the proc means statement. Note that the overall statistic shown in the R table is not included. To replicate that row, you would re-run the above code, omitting the class statement.

To the best of our knowledge, there still does not exist an easy way to plot multiple densities in a single SAS plot. In example 2.6.4 we show how it can be done using proc kde, saving the density estimates, and plotting them separately. (Code for this is included at the book web site.) But in the interest of simple code, we show a simpler method using proc sgpanel. The result, shown below, is less useful than the R plot from the mosaic package, but still gets the point across.

proc sgpanel data=ds;
panelby substance / columns=1;
density cesd / type=kernel;
run;
