This article was first published on analyze stuff. It has been contributed to Anything but R-bitrary as the third article in its introductory series.
By Ben Ogorek
Alongside interstate highways, national defense, and social security, your tax dollars are used to collect data. Sometimes it’s high profile and relevant, like the census or NSA’s controversial PRISM surveillance program. Other times it’s just high profile, like the three-billion-dollar brain data set that nobody has figured out how to use. Then there are the lower profile data sets, studied by researchers, available to the public, but with enough barriers to discourage the data hobbyist.
In this article, we’ll work with the NLSY97, a National Longitudinal Survey (NLS) that follows approximately 9,000 youths, documenting “the transition from school to work and into adulthood.” The core areas of the survey are education and employment, but subject areas also include relationships with parents, marital and fertility histories, dating, life expectations, and criminal behavior. The NLS data sets are rich, complex, and full of insights on the “making” of a human being.
In this article I introduce NLSdata, an R package facilitating the analysis of National Longitudinal Survey data. I’ll discuss the NLS Investigator, a web utility offered by the BLS for downloading a data extract, and demonstrate how the NLSdata package helps to extract value from it. This culminates in an analysis of the belief, “I don’t need religion to have good values,” and how it has changed with age.
To provide National Longitudinal Survey data to the public, the Bureau of Labor Statistics offers an online tool, the NLS Investigator. This article will refer to files included in the Investigator subdirectory of the NLSdata package, but information on how to pull additional data from the NLS Investigator is provided in the Appendix.
While you don’t need to visit the NLS Investigator to run the following examples, the files it provides are the NLSdata package’s raison d’etre and deserve a brief discussion. There are two key files: a structured data file and a semi-structured metadata file called a “codebook.”
The data file is in a .csv format and can contain thousands of columns. While it is highly structured, it will probably be unintelligible (see the image below for an example).
The codebook is a text file with a .cdb extension. While it is intelligible, it is only semi-structured (below is an excerpt for one of the training variables).
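To get a feel for what “semi-structured” means here, consider a single codebook line of the form shown in the excerpt later in this article (the real .cdb layout may differ in its details); its pieces can be pulled apart with regular expressions:

```r
# One illustrative codebook line (format based on the excerpt shown
# later in this article; the real .cdb layout may vary)
line <- "S63168.00 [YSAQ-282A2] Survey Year: 2005"

# Extract the reference number, the variable name, and the survey year
r.id <- sub("^(\\S+)\\s.*", "\\1", line)
name <- sub(".*\\[([^]]+)\\].*", "\\1", line)
year <- sub(".*Survey Year: (\\d{4}).*", "\\1", line)
```

Parsing many such chunks, each with slightly different fields, is the kind of work the NLSdata package automates.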
The NLSdata package
The NLSdata package is currently on GitHub and in its earliest stages of development. The easiest way to install it is with the devtools package.
```r
library(devtools)
install_github("NLSdata", user = "baogorek")
```

(Hadley informed me that install_github("baogorek/NLSdata") also works in the newest version of devtools.)
After installation, load in the library and read in the codebook and data files in NLSdata’s Investigator folder.
```r
library(NLSdata)
codebook <- system.file("Investigator", "Religion.cdb", package = "NLSdata")
csv.extract <- system.file("Investigator", "Religion.csv", package = "NLSdata")
```

The first key step is to create an "NLSdata" object using the CreateNLSdata constructor function, supplying the file paths for the codebook and the data. For now, I'll note that there are missing-data issues beyond the scope of this introduction.
```r
nls.obj <- CreateNLSdata(codebook, csv.extract)

class(nls.obj)
# "NLSdata"

names(nls.obj)
# "metadata" "data"
```

The element "data" is a data frame with variable names that correspond to those in the codebook, but with the survey year appended. Factor labels are added for categorical variables, and logical variables have been properly converted. The unique key will always be the respondent identifier PUBID.1997.
```r
head(nls.obj$data[order(nls.obj$data$PUBID.1997), c(2, 8, 9, 11)])
#      PUBID.1997 YSAQ_282A2.2002 YSAQ_282A2.2005 YSAQ_282A2.2008
# 5203          1           FALSE            TRUE            TRUE
# 1918          2            TRUE            TRUE           FALSE
# 6687          3            TRUE              NA              NA
# 3682          4            TRUE            TRUE            TRUE
# 1730          5            TRUE           FALSE           FALSE
# 2838          6            TRUE           FALSE           FALSE
```

The metadata element is a list containing information about each variable in the data set, including the original codebook chunk.
```r
nls.obj$metadata[["YSAQ_282A2.2005"]]
# $name
# "YSAQ_282A2.2005"
#
# $summary
# "R DOES NOT NEED RELIGION FOR GOOD VALUES"
#
# $year
# "2005"
#
# $r.id
# "S6316800"
#
# $chunk
# "S63168.00 [YSAQ-282A2] Survey Year: 2005"
# " PRIMARY VARIABLE"
# "I don't need religion to have good values."
# "UNIVERSE: All"
```

(The element "chunk" has been modified for presentation.)
Attitudes about religion through time
The YSAQ_282A2 values record each respondent's answer to the statement, "I don't need religion to have good values," but the wide format makes analysis awkward. Thus the NLSdata package provides the function CreateTimeSeriesDf to coerce a portion of the data frame into a long format. Under the hood, it uses reshape2's melt function.
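The wide-to-long reshaping can be illustrated without any extra packages. Below is a toy sketch of what CreateTimeSeriesDf does conceptually, using base R's reshape (the package itself uses reshape2::melt; the data values here are made up):

```r
# Toy wide-format data resembling nls.obj$data
wide <- data.frame(PUBID.1997      = 1:3,
                   YSAQ_282A2.2002 = c(FALSE, TRUE, TRUE),
                   YSAQ_282A2.2005 = c(TRUE, TRUE, NA))

# Base-R equivalent of the melt: one row per respondent per year
long <- reshape(wide,
                varying   = c("YSAQ_282A2.2002", "YSAQ_282A2.2005"),
                v.names   = "YSAQ_282A2",
                timevar   = "year",
                times     = c(2002, 2005),
                direction = "long")
long <- long[, c("PUBID.1997", "YSAQ_282A2", "year")]
```

Each respondent now contributes one row per survey year, which is the shape the rest of the analysis assumes.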
```r
religion.df <- CreateTimeSeriesDf(nls.obj, "YSAQ_282A2")
head(religion.df[order(religion.df$PUBID.1997), ])
#       PUBID.1997 YSAQ_282A2 year
# 5203           1      FALSE 2002
# 14187          1       TRUE 2005
# 23171          1         NA 2006
# 32155          1       TRUE 2008
# 1918           2       TRUE 2002
# 10902          2       TRUE 2005
```

Looking at the cell counts, we can see something is strange with the year 2006.
```r
(cell.counts <- with(religion.df, table(YSAQ_282A2, year)))
#           year
# YSAQ_282A2 2002 2005 2006 2008
#      FALSE 4014 3306    0 3213
#      TRUE  3829 3679    2 3970
```

Since only two people answered the question that year, I’m choosing to exclude it from the analysis.
```r
religion.df <- religion.df[religion.df$year != 2006, ]
```

For the purposes of this article, I want both simplicity and protection from obvious confounding. Thus I'll enforce balance over respondents and years via the ThrowAwayDataForBalance function, which keeps records only for respondents who answered the question in every observed year.
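Conceptually, enforcing balance means keeping a respondent only if he or she appears in every year. Here is a toy base-R sketch of that idea (not the package's actual implementation; the data are made up):

```r
# Hypothetical long-format data: respondent 3 is missing the 2008 answer
df <- data.frame(PUBID.1997 = c(1, 1, 2, 2, 3),
                 YSAQ_282A2 = c(TRUE, FALSE, TRUE, TRUE, FALSE),
                 year       = c(2002, 2008, 2002, 2008, 2002))

# Keep only respondents observed in every year
n.years      <- length(unique(df$year))
counts       <- table(df$PUBID.1997)
balanced.ids <- as.numeric(names(counts)[counts == n.years])
df.balanced  <- df[df$PUBID.1997 %in% balanced.ids, ]
```

Respondent 3, observed in only one year, is dropped entirely rather than contributing a partial record.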
```r
religion.df <- ThrowAwayDataForBalance(religion.df, "YSAQ_282A2")

table(religion.df$year)
# 2002 2005 2008
# 6013 6013 6013

head(table(religion.df$PUBID.1997))
# 1 2 4 5 6 9
# 3 3 3 3 3 3
```

After these deletions, the final cell counts are shown below:
```r
(cell.counts <- with(religion.df, table(YSAQ_282A2, year)))
#           year
# YSAQ_282A2 2002 2005 2008
#      FALSE 3077 2882 2682
#      TRUE  2936 3131 3331
```

We can test whether there is evidence of a changing response distribution with a chi-squared test for association.
```r
chisq.test(cell.counts)
#         Pearson's Chi-squared test
#
# data:  cell.counts
# X-squared = 51.9902, df = 2, p-value = 5.134e-12
```

There is strong evidence that the likelihood of answering "yes" is not constant over time. Recall that these are the same people in all three years.
And here are the actual proportions.
```r
(proportions <- aggregate(religion.df$YSAQ_282A2,
                          by = list(year = religion.df$year),
                          FUN = mean, na.rm = TRUE))
#   year         x
# 1 2002 0.4882754
# 2 2005 0.5207051
# 3 2008 0.5539664
```

In the plot below, I chose a range of 0.40 to 0.65 for the agreement proportion. I believe this is a range which, if covered, would represent a meaningful shift in attitudes.
```r
plot(x ~ year, data = proportions, ylim = c(0.4, 0.65), type = "b",
     ylab = "Proportion agreeing with statement",
     main = 'Belief: "I don\'t need religion to have good values"')
```
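With roughly 6,000 balanced respondents per year, the sampling error around these proportions is small. As a quick check (not part of the original analysis), here is a normal-approximation 95% confidence interval for the 2008 proportion, using the cell counts reported above:

```r
# 3331 of 6013 respondents agreed in 2008 (from the cell counts above)
p.hat <- 3331 / 6013
se    <- sqrt(p.hat * (1 - p.hat) / 6013)
ci    <- p.hat + c(-1, 1) * 1.96 * se   # roughly (0.541, 0.567)
```

The interval's width of about 0.025 is small relative to the 2002-to-2008 change of about 0.066, consistent with the chi-squared result.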
The NLSdata package is still very much a work in progress, and I fully expect that certain untested variables available from the NLS Investigator will cause problems. Reports of such occurrences, and general feedback, are encouraged and appreciated. The NLS data set itself has many interesting variables. With minimal effort, we determined that beliefs about religion and values changed through time. Many more interesting relationships, involving education, training, employment, relationships, and even religion, wait to be uncovered. I hope that NLSdata makes the search for these relationships easier and more accessible.
- Thanks to Max Ghenis for convincing me to analyze real data and not the output of rnorm(100). He also provided multiple suggestions that improved this article.
- Thanks to Mindy Greenberg, who deserves a title like "Editor in Chief" for her many contributions to style and readability.
Appendix: pulling data from the NLS Investigator

Navigate your browser to the NLS Investigator homepage and sign up for an account. Once you've logged in, you'll be able to choose a survey; in this article, I'm working with the NLSY97. Whatever variables you select, your extract will also include:
- the required PubID identifier, which serves as a primary key
- demographic variables (gender, ethnicity)
- survey metadata (e.g., release version)
The large number of variables is due to a massive flattening of the data. The primary ways in which this flattening occurs are summarized below:
- Each survey round (year) gets a new variable name even with the same question text.
- Questions are tweaked for different subpopulations, called “universes,” and are repeated.
- Nested lists called “rosters” add multiple columns for collections like jobs (job1, job2, ..., job9).
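Roster columns can be stacked back into one row per respondent per slot with the same reshaping idea used earlier in the article. Below is a toy sketch with hypothetical column names (real roster variables in an extract will be named differently):

```r
# Hypothetical flattened roster: one column per job slot
roster <- data.frame(PUBID.1997 = 1:2,
                     JOB1 = c("clerk", "cook"),
                     JOB2 = c("tutor", NA))

# Stack into one row per (respondent, job slot) and drop empty slots
jobs <- reshape(roster, varying = c("JOB1", "JOB2"), v.names = "JOB",
                timevar = "slot", times = 1:2, direction = "long")
jobs <- jobs[!is.na(jobs$JOB), c("PUBID.1997", "slot", "JOB")]
```

Un-flattening rosters this way is often the first step in analyzing collections like job or school histories.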