Using R to ‘drive’ MS Excel – 3/27/2018

March 27, 2018
By

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

I have until recently made it a habit to draw a clear distinction about
using R for data analysis and Microsoft Excel for other office
productivity tasks. I know there are people who use Excel to process
data and even (gasp) to teach statistics with it. But I’m a bit
snobbish that way and to date all my efforts have been in getting data
out of Excel and into R, either by simple methods like read.csv or if
the task was more meaty by using Hadley Wickham’s marvelous readxl
package.

But this week I got a request that may sound familiar to some of you who
work in an environment where the MS Office products are ubiquitous. My
colleague wanted to be able to do a little analysis and graphing of a
modest sized dataset and they only really had MS Excel experience. To be
honest I could have just done the work in R and provided the results
but that’s not what they asked for and would have meant I was
responsible for future updates and questions. So I decided to just
provide the data in the most useful way I could and that led to this
post. To both document the process and also to potentially help others
who have this need in the future.

Along the way I also had some nice learning experiences around functions
and both standard and non standard evaluation NSE that I’ll document
in some future posts.

Background

My colleague wanted to be able to do some simple analysis around health
care using the Centers for Disease Control and Prevention
(https://www.cdc.gov), National Center for Health Statistics
(https://www.cdc.gov/nchs/index.htm), National Health Interview Survey
(https://www.cdc.gov/nchs/nhis/nhis_2016_data_release.htm). They
wanted a series of cross tabulated sets of summary data for variable
pairings (for example whether or not the respondent had a formal health
care provider by region of the country). They wanted one Excel
“workbook” with 12 worksheets each one of which was the summary of
counts for a pair of variables. From there they could use Excel’s native
plotting tools to make the graphs they needed.

A little sleuthing around CRAN helped me discover openxlsx which seems
to be quite active, well maintained, and have a variety of features I
would need. In my case that involved a download and install first (but
I’ll comment it out in this version for you). As long as I’m at it
I’ll load dplyr and ggplot2 (they’ll figure more prominently in my
next post).

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
theme_set(theme_bw()) # set theme to my personal preference
# install.packages("openxlsx")
require(openxlsx)
## Loading required package: openxlsx

Download and structure the data

The data my colleague wanted was from 2014 and she was kind enough to
provide the URL for a compressed zipfile
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2016/personsxcsv.zip.
Full documentation about the file and its contents and the methodology
is here https://www.cdc.gov/nchs/nhis/nhis_2016_data_release.htm.

After downloading and uncompressing it into a local project directory
(it’s about 77Mb of data, with over 100,000 rows and more than 600
columns). I could go to work on processing it to get what was really
needed for my colleague.

FullFile <- read.csv(file = "personsx.csv")
dim(FullFile)
## [1] 103789    606
# 606 variables too many whittle it down with wild cards

A little dplyr to make our life easier

While strictly speaking nothing in the next few steps requires dplyr
(they can all be done in base R) I will showcase a couple ways dplyr
can make your data analysis faster and easier.

First off, call me old-fashioned but, I abhor having lots of data in
working memory that I know for a fact I’ll never use. While my Mac has
plenty of space let’s showcase dplyr’s ability to help us rapidly
reduce down to a more manageable dataset. A quick look at the data
dictionary provided by the CDC shows we’ll never use any of the
variables in the dataset that start with “L” or with “INT” so let’s make
them go away.

FullFile <- select(FullFile, -starts_with("L"))
FullFile <- select(FullFile, -starts_with("INT"))
dim(FullFile)
## [1] 103789    289
# 289 is still big but ...

The other thing we notice is that the variables are all coded as integers when
we know they are truly factors. Since we know we’re going to analyze
them as factors let’s take a moment to recode which will also mean an
opportunity to provide more user friendly labels so we’re not constantly
referring to the code book to see what “1” really represents in the
data.

The function is recode_factor and allows us to map factor labels onto
the integers to provide a more useful dataset.

FullFile$REGION <- recode_factor(FullFile$REGION,
                            `1` = "Northeast",
                            `2` = "Midwest",
                            `3` = "South",
                            `4` = "West")
FullFile$SEX <- recode_factor(FullFile$SEX, `1` = "Male", `2` = "Female")
FullFile$RACERPI2 <- recode_factor(FullFile$RACERPI2,
                              `1` = "White only",
                              `2` = "Black/African American only",
                              `3` = "AIAN only", 
                              `4` = "Asian only",
                              `5` = "Race group not releasable",
                              `6` = "Multiple race")
FullFile$PDMED12M <- recode_factor(FullFile$PDMED12M, `1` = "Yes", `2` = "No")
## Warning: Unreplaced values treated as NA as .x is not compatible. Please
## specify replacements exhaustively or supply .default
summary(FullFile$PDMED12M)
##   Yes    No  NA's 
##  6744 96986    59

Uh oh, what are we being warned about? recode_factor has a nice
rational solution for cases where we don’t specify all the possible
choices for a variable (see ?recode_factor) it simply assigns them an
NA value. In our case that’s just what we want.

Finishing the rest (and suppressing the
warnings)…

FullFile$PNMED12 <- recode_factor(FullFile$PNMED12, `1` = "Yes", `2` = "No")
FullFile$PNMED12M <- recode_factor(FullFile$PNMED12M, `1` = "Yes", `2` = "No")
FullFile$NOTCOV <- recode_factor(FullFile$NOTCOV, `1` = "Not covered", `2` = "Covered")
FullFile$COVER <- recode_factor(FullFile$COVER,
                           `1` = "Private",
                           `2` = "Medicaid and other public",
                           `3` = "Other coverage",
                           `4` = "Uninsured",
                           `5` = "Do not know")
FullFile$PLNWRKS1 <- recode_factor(FullFile$PLNWRKS1,
                              `1` = "Through employer",
                              `2` = "Through union",
                              `3` = "Through workplace, but don't know if employer or union",
                              `4` = "Through workplace, self-employed or professional association",
                              `5` = "Purchased directly",
                              `6` = "Through Healthcare.gov or the Affordable Care Act",
                              `7` = "Through a state/local government or community program",
                              `8` = "Other",
                              `9` = "Through school",
                              `10` = "Through parents",
                              `11` = "Through relative other than parents")
FullFile$HCSPFYR <- recode_factor(FullFile$HCSPFYR,
                              `0` = "Zero",
                              `1` = "Less than $500",
                              `2` = "$500 - $1,999",
                              `3` = "$2,000 - $2,999",
                              `4` = "$3,000 - $4,999",
                              `5` = "$5,000 or more")
FullFile$MEDBILL <- recode_factor(FullFile$MEDBILL, `1` = "Yes", `2` = "No")
FullFile$MEDBPAY <- recode_factor(FullFile$MEDBPAY, `1` = "Yes", `2` = "No")
# I am thinking that earnings can be collapsed into three attributes:  low; medium; high
FullFile$EARNINGS <- recode_factor(FullFile$ERNYR,
                              `1` = "$01-$34,999",
                              `2` = "$01-$34,999",
                              `3` = "$01-$34,999",
                              `4` = "$01-$34,999",
                              `5` = "$01-$34,999",
                              `6` = "$01-$34,999",
                              `7` = "$35,000-$74,999",
                              `8` = "$35,000-$74,999",
                              `9` = "$35,000-$74,999",
                              `10` = "$35,000-$74,999",
                              `11` = "$75,000 and over")
# Education the same:  low; medium; high
FullFile$EDUCATION <- recode_factor(FullFile$EDUC1,
                              `0` = "HSchool Grad or less",
                              `1` = "HSchool Grad or less",
                              `2` = "HSchool Grad or less",
                              `3` = "HSchool Grad or less",
                              `4` = "HSchool Grad or less",
                              `5` = "HSchool Grad or less",
                              `6` = "HSchool Grad or less",
                              `7` = "HSchool Grad or less",
                              `8` = "HSchool Grad or less",
                              `9` = "HSchool Grad or less",
                              `10` = "HSchool Grad or less",
                              `11` = "HSchool Grad or less",
                              `12` = "HSchool Grad or less",
                              `13` = "HSchool Grad or less",
                              `14` = "HSchool Grad or less",
                              `15` = "Some college or AA degree",
                              `16` = "Some college or AA degree",
                              `17` = "Some college or AA degree",
                              `18` = "Bachelor's or higher",
                              `19` = "Bachelor's or higher",
                              `20` = "Bachelor's or higher",
                              `21` = "Bachelor's or higher")
# Age also collapsed: low; medium; high
FullFile$AGE <- cut(FullFile$AGE_P,
                    breaks = c(-Inf, 18, 61, Inf),
                    labels = c("Less than 18", "18 to 60", "More than 60"),
                    right = FALSE)
table(FullFile$EDUC1,FullFile$EDUCATION)
##     
##      HSchool Grad or less Some college or AA degree Bachelor's or higher
##   0                  3424                         0                    0
##   1                  1679                         0                    0
##   2                  1633                         0                    0
##   3                  1787                         0                    0
##   4                  1716                         0                    0
##   5                  1822                         0                    0
##   6                  2726                         0                    0
##   7                  1860                         0                    0
##   8                  2644                         0                    0
##   9                  2908                         0                    0
##   10                 2884                         0                    0
##   11                 3186                         0                    0
##   12                 1774                         0                    0
##   13                 2290                         0                    0
##   14                18413                         0                    0
##   15                    0                     14822                    0
##   16                    0                      5413                    0
##   17                    0                      2966                    0
##   18                    0                         0                13883
##   19                    0                         0                 5956
##   20                    0                         0                  955
##   21                    0                         0                 1018
##   96                    0                         0                    0
##   97                    0                         0                    0
##   98                    0                         0                    0
##   99                    0                         0                    0

One easy way to see whether your recoding has had the desired effect is
to make a simple table that maps the original values to the new values.
That’s what I’ve done here for EDUCATION which shows the collapse of
categories and the fact that values like 99 have been mapped to NA.

Finally let’s grab just the variables we really need including the newly
recoded versions and make them a new dataset and take the full version
out of
memory.

OfInterest <- select(FullFile, AGE, REGION, SEX, EDUCATION, EARNINGS, PDMED12M, PNMED12M, NOTCOV, MEDBILL)
str(OfInterest)
## 'data.frame':    103789 obs. of  9 variables:
##  $ AGE      : Factor w/ 3 levels "Less than 18",..: 2 2 2 1 1 2 2 2 3 2 ...
##  $ REGION   : Factor w/ 4 levels "Northeast","Midwest",..: 3 4 4 4 4 4 3 3 4 3 ...
##  $ SEX      : Factor w/ 2 levels "Male","Female": 1 2 1 1 2 2 2 1 1 1 ...
##  $ EDUCATION: Factor w/ 3 levels "HSchool Grad or less",..: 2 1 3 1 NA 3 1 1 3 2 ...
##  $ EARNINGS : Factor w/ 3 levels "$01-$34,999",..: 1 2 2 NA NA 3 1 NA 2 3 ...
##  $ PDMED12M : Factor w/ 2 levels "Yes","No": 2 1 2 2 2 2 2 2 2 2 ...
##  $ PNMED12M : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ NOTCOV   : Factor w/ 2 levels "Not covered",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ MEDBILL  : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 2 2 2 2 2 ...
summary(OfInterest)
##            AGE              REGION          SEX       
##  Less than 18:25680   Northeast:16883   Male  :50257  
##  18 to 60    :58025   Midwest  :20507   Female:53532  
##  More than 60:20084   South    :35826                 
##                       West     :30573                 
##                      EDUCATION                 EARNINGS     PDMED12M    
##  HSchool Grad or less     :50746   $01-$34,999     :20390   Yes : 6744  
##  Some college or AA degree:23201   $35,000-$74,999 :12927   No  :96986  
##  Bachelor's or higher     :21812   $75,000 and over: 6413   NA's:   59  
##  NA's                     : 8030   NA's            :64059               
##  PNMED12M             NOTCOV      MEDBILL     
##  Yes : 4978   Not covered:10506   Yes :16282  
##  No  :98741   Covered    :92181   No  :87096  
##  NA's:   70   NA's       : 1102   NA's:  411  
## 
rm(FullFile)

Driving Excel

Everything we have done so far has been in preparation for actually
providing my colleague the data she wanted in Excel so now on to the
main event. In this post I am deliberately going to NOT build
functions or use ggplot to make the graphs she wanted. Those will be
topics for future posts.

You may have noticed that a table is almost exactly what she wants. If
the variables of interest are AGE and whether or not the person has
health care coverage (NOTCOV) then the command would be
table(OfInterest$AGE,OfInterest$NOTCOV) and we have results we need.
The openxlsx package provides a write.xlsx function that will accept the
table command and produce a properly formatted workbook as output. The
code below will produce SimpleExcelExample.xlsx in your working
directory. The number of options is legion and ?write.xlsx will
display them for you.

table(OfInterest$AGE,OfInterest$NOTCOV)
##               
##                Not covered Covered
##   Less than 18        1312   24193
##   18 to 60            8730   48484
##   More than 60         464   19504
write.xlsx(table(OfInterest$AGE,OfInterest$NOTCOV), file = "SimpleExcelExample.xlsx")

Finishing up the manual way

The last thing I will describe in this post is how to make one workbook
with multiple sheets. Where each sheet represents one variable pairing
(e.g. AGE by NOTCOV). We’ll even label the tabs (worksheets) in a
thoughtful way. So my colleague can easily see which sheet corresponds
to which variable pairing.

The code below follows this pattern

  1. Create a new empty workbook object wb <- createWorkbook()
  2. Invent a name for the tab or worksheet inside the workbook
    NameofSheet
  3. Make a table for a pair of variables TheData <-
    table(OfInterest$EDUCATION,OfInterest$NOTCOV)
  4. Add a worksheet (tab) into the workbook addWorksheet
  5. Write the table we made onto the worksheet with writeData
  6. Repeat steps 2 through 5 12 times

### Manual and painful way
## Create a new workbook
wb <- createWorkbook()
# education by each of the other 4 variables
NameofSheet <- "CoverageByEducation"
TheData <- table(OfInterest$EDUCATION,OfInterest$NOTCOV)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "MedbillByEducation"
TheData <- table(OfInterest$EDUCATION,OfInterest$MEDBILL)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "PDMED12MByEducation"
TheData <- table(OfInterest$EDUCATION,OfInterest$PDMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "PNMED12MByEducation"
TheData <- table(OfInterest$EDUCATION,OfInterest$PNMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
# earnings  by each of the other 4 variables
NameofSheet <- "CoverageByEarnings"
TheData <- table(OfInterest$EARNINGS,OfInterest$NOTCOV)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "MedbillByEarnings"
TheData <- table(OfInterest$EARNINGS,OfInterest$MEDBILL)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "PDMED12MByEarnings"
TheData <- table(OfInterest$EARNINGS,OfInterest$PDMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "PNMED12MByEarnings"
TheData <- table(OfInterest$EARNINGS,OfInterest$PNMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
# age by each of the other 4 variables
NameofSheet <- "CoverageByAge"
TheData <- table(OfInterest$AGE,OfInterest$NOTCOV)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "MedbillByAge"
TheData <- table(OfInterest$AGE,OfInterest$MEDBILL)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "PDMED12MByAge"
TheData <- table(OfInterest$AGE,OfInterest$PDMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "PNMED12MByAge"
TheData <- table(OfInterest$AGE,OfInterest$PNMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
#
saveWorkbook(wb, "BetterExcelExample.xlsx", overwrite = TRUE) ## save to working directory

All done (not really)

Hopefully this post helps you understand how to use the openxlsx
package to have R drive Excel to help you with your data analysis. In
my next post I’ll build on this scaffolding to discuss how to make these
very same graphs in ggplot2 (which IMHO runs circles around Excel for
scientific plotting), as well as making this all more efficient through
the use of functions to take care of some of the more repetitive chores.

I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.

Chuck (ibecav at gmail dot
com)

License

Creative Commons License
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License
.

To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)