A codebook is a technical document that provides an
overview of and information about the variables in a dataset. The
codebook ensures that the statistician has the complete background
information necessary to undertake the analysis, and it
documents the data to make sure that the data is well understood and
reusable in the future. Here we will show how to create codebooks in R.
The help pages for the datasets in R packages usually provide thorough
information although the level of detail may vary quite substantially
from dataset to dataset. As an example we will consider the iris
dataset. Its help page gives decent information, so we will just use it
to show how we would create a codebook.
Real datasets, however, are messy and not as polished as the datasets found in R packages. A substantial amount of data wrangling, tweaking, cleaning, and custom solutions is necessary to get the data into shape before it is ready for statistical analysis. Creating the finished dataset is not enough: it is also necessary to produce the corresponding data documentation.
We have previously shown how the dataMaid package can produce
automated reports to summarise datasets, to identify potential errors,
and to check the data quality. The makeDataReport() function from the
dataMaid package produces an R Markdown summary document with
information on each variable in the data frame, and the document can
be rendered to a report in HTML, PDF, or Word. The final report can be
given to scientific collaborators, since proper data validation often
requires a collaborative effort between an expert in the field and a
data scientist. It is easy to tweak the report generated by
makeDataReport() to obtain a document that can serve as a codebook for
the cleaned dataset.
The makeCodebook() function accepts a data frame and produces a
document that provides a summary of the data frame and its variables.
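As a minimal sketch, assuming the dataMaid package is installed (rendering additionally relies on rmarkdown and pandoc, and on LaTeX for PDF output), a codebook for iris could be generated like this:

```r
# Assumes dataMaid is installed, e.g. via install.packages("dataMaid").
library(dataMaid)

# Write an R Markdown codebook for the iris data frame and render it.
# All other settings are left at their defaults.
makeCodebook(iris)
```

With default settings the generated files should end up in the current working directory.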
The result is the 3-page document reproduced below in the two figures. The codebook consists of four parts. The first two parts are tables giving an overview of the data frame and the variables: the number of observations, the number of variables, the class of each variable, and the proportion of missing observations.
Part 3 lists each variable and provides class-dependent summary
statistics and a data visualisation. If a variable is a factor,
the unique factor levels are listed beneath the summary statistics.
Part 4 documents the report generation information (who made it, when, directory, the function call, the operating system platform).
While the information about each variable (pages 1 and 2) serves as a reference manual when doing subsequent analyses, the last page provides meta-information about the codebook to ensure documentable and reproducible research.
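The quantities shown in the two overview tables are simple to compute by hand, which can be a useful sanity check on a codebook. The following sketch uses base R only; the overview data frame is a hypothetical helper for illustration and not part of dataMaid.

```r
# Base R only: recompute the overview quantities for iris by hand.
overview <- data.frame(
  variable  = names(iris),
  # first class of each column, e.g. "numeric" or "factor"
  class     = vapply(iris, function(x) class(x)[1], character(1)),
  # proportion of missing observations per column
  missing   = vapply(iris, function(x) mean(is.na(x)), numeric(1)),
  row.names = NULL
)

nrow(iris)  # number of observations
ncol(iris)  # number of variables
overview
```

For iris this gives four numeric variables and one factor, none of which has missing values.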
The codebooks can be improved by adding additional information about
the variables. There are two ways to add extra information to the
codebook. The first uses the same approach as the
where the attribute
labels can be set for a variable in a data frame
to contain label information. These labels are set directly
attr(iris$Sepal.Length, "labels") <- "Sepal length in cm"
or it is possible to use the functions from the
The labels attribute is intended for condensed information,
and it is particularly useful if the variable names are not
meaningful. When variable names are not self-explanatory we can keep
the original variable names from the raw data but provide meaningful, explanatory
labels through the labels attribute.
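When several variables need labels, setting them one at a time gets repetitive. Here is a minimal sketch using base R only. The label texts and the names var_labels and iris_labelled are made up for illustration, and the attribute name follows the text above.

```r
# Hypothetical label texts, keyed by variable name.
var_labels <- c(
  Sepal.Length = "Sepal length in cm",
  Sepal.Width  = "Sepal width in cm",
  Petal.Length = "Petal length in cm",
  Petal.Width  = "Petal width in cm",
  Species      = "Iris species"
)

# Work on a copy so the original iris data stays untouched.
iris_labelled <- iris
for (v in names(var_labels)) {
  attr(iris_labelled[[v]], "labels") <- var_labels[[v]]
}

# Inspect the labels that were stored on each column.
vapply(iris_labelled, function(x) attr(x, "labels"), character(1))
```

Keeping the labels in one named vector also makes it easy to store them next to the raw data and reapply them whenever the dataset is rebuilt.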
Another type of label is the
shortDescription attribute. This is
intended to provide additional details that might come in handy. The
shortDescription attribute is set similarly to the labels attribute:
attr(iris$Sepal.Length, "shortDescription") <-
  "Measured using a line gauge produced by Acme factories."
attr(iris$Species, "shortDescription") <- paste0(
  "Two of the three species were collected in the Gaspé ",
  "Peninsula all from the same pasture, and picked on the ",
  "same day and measured at the same time by the same ",
  "person with the same apparatus"
)
When we run makeCodebook() again (with argument replace = TRUE to
overwrite the report we generated earlier), we can see the additional
information appear in the codebook produced.
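Put together, the annotate-and-regenerate step might look like the sketch below. It assumes dataMaid is installed and that replace is forwarded to makeDataReport(), where it controls whether existing output files are overwritten.

```r
library(dataMaid)

# Label and description as in the text; this modifies a workspace copy of iris.
attr(iris$Sepal.Length, "labels") <- "Sepal length in cm"
attr(iris$Sepal.Length, "shortDescription") <-
  "Measured using a line gauge produced by Acme factories."

# replace = TRUE overwrites the codebook files from the earlier run.
makeCodebook(iris, replace = TRUE)
```

The data frame is deliberately still called iris here, so that the new codebook is generated for the same data object as the earlier one.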
makeCodebook() works by tweaking the arguments for
makeDataReport() from the
dataMaid package. The makeDataReport()
function is very flexible, and
it is possible to change its arguments to modify the content of the
material that goes into the codebook.
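As an illustration, a customised call could look like the sketch below. It assumes that extra arguments given to makeCodebook() are forwarded to makeDataReport(), and that output, openResult, and replace are among its arguments; the help page ?makeDataReport for the installed version is the place to confirm this. The title and file name are hypothetical.

```r
library(dataMaid)

makeCodebook(
  iris,
  reportTitle = "Codebook for the iris data",  # hypothetical title
  file        = "iris_codebook.Rmd",           # hypothetical file name
  output      = "html",   # assumed makeDataReport() argument: output format
  openResult  = FALSE,    # assumed makeDataReport() argument: do not open the result
  replace     = TRUE      # overwrite files from an earlier run
)
```

Setting openResult = FALSE is mainly useful in scripts and scheduled jobs, where nobody is around to look at a viewer window.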
The makeCodebook() function in the dataMaid package
should make it easier to create and provide codebooks for small and
larger projects, and will encourage more people to provide
documentable and reproducible research. Comments and suggestions to
expand the codebook possibilities are very welcome.