A codebook is a technical document that gives an overview of, and information about, the variables in a dataset. The codebook ensures that the statistician has the complete background information necessary to undertake the analysis, and it documents the data to make sure that the data is well understood and reusable in the future. Here we will show how to create codebooks in R using the dataMaid package.
The help pages for the datasets in R packages usually provide thorough information, although the level of detail may vary quite substantially from dataset to dataset. As an example we will consider the iris dataset. The help page gives decent information, so we will just use it to show how we would create a codebook.
Real datasets, however, are messy and not as polished as the datasets found in R packages. A substantial amount of data wrangling, tweaking, cleaning, and custom solutions is necessary to get the data into shape before it is ready for statistical analysis. Creating the finished dataset is not enough, as it is also necessary to produce the corresponding data documentation.
We have previously shown how the dataMaid package can produce automated reports to summarise datasets, to identify potential errors, and to check the data quality and integrity. The dataMaid package produces an Rmarkdown summary document with information on each variable in the data frame, and the document can be rendered to a report in HTML, PDF, or Word. The final report can be given to scientific collaborators, since proper data validation often requires a collaborative effort between an expert in the field and a data scientist. It is easy to tweak the report generated by dataMaid to obtain a document that can serve as a codebook for the cleaned dataset.
makeCodebook() accepts a data frame and produces a document that provides a summary of the data frame and its variables.
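As a minimal sketch, the call on the iris data could look like the following. It assumes that the dataMaid package is installed in a version that provides makeCodebook(), and that the tools needed to render R Markdown documents are available on the machine.

```r
# Assumption: dataMaid is installed, e.g. via install.packages("dataMaid")
library(dataMaid)

data(iris)

# Writes the codebook source and the rendered document to the working directory
makeCodebook(iris)
```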
The result is the 3-page document reproduced below in the two figures. The codebook consists of 4 parts: the first two parts are tables giving an overview of the data frame and the variables. Here we see the number of observations, the number of variables, their class type, and the proportion of missing observations.
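To make the content of the two overview tables concrete, the same quantities can be computed by hand in base R. This is only an illustrative sketch, not how dataMaid builds its tables, and the object name overview is our own choice.

```r
data(iris)

# Size of the data frame
n_obs <- nrow(iris)   # number of observations
n_var <- ncol(iris)   # number of variables

# One row per variable: its class and the proportion of missing values
overview <- data.frame(
  variable     = names(iris),
  class        = vapply(iris, function(x) class(x)[1], character(1)),
  prop_missing = vapply(iris, function(x) mean(is.na(x)), numeric(1)),
  row.names    = NULL
)

overview
```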
Part 3 lists each variable and provides class-dependent summary statistics and a data visualisation. If a variable is a factor then the unique factor levels are listed beneath the summary statistics.
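The idea of class-dependent summaries can be sketched in base R. The helper describe_variable() below is hypothetical and only illustrates switching on the class of a variable; dataMaid chooses its own summaries and plots.

```r
# Hypothetical helper: return a different summary depending on the class
describe_variable <- function(x) {
  if (is.factor(x)) {
    # For factors, report the unique levels
    list(class = "factor", n_levels = nlevels(x), levels = levels(x))
  } else if (is.numeric(x)) {
    # For numeric variables, report location and spread
    list(class = class(x)[1],
         median = median(x, na.rm = TRUE),
         range = range(x, na.rm = TRUE))
  } else {
    # Fallback for any other class
    list(class = class(x)[1], n_unique = length(unique(x)))
  }
}

data(iris)
describe_variable(iris$Species)
describe_variable(iris$Sepal.Length)
```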
Part 4 documents the report generation information (who made it, when, directory, the function call, the operating system platform).
While the information about each variable (pages 1 and 2) serves as a reference manual when doing subsequent analyses, the last page provides meta-information about the codebook to ensure documentable and reproducible research.
The codebooks can be improved by adding extra information about the variables. There are two ways to do this. The first uses the same approach as the labelled package, where the attribute label can be set for a variable in a data frame to contain label information. These labels can be set directly
attr(iris$Sepal.Length, "label") <- "Sepal length in cm"
or it is possible to use the functions from the labelled package. The label attribute is intended for condensed information, and it is particularly useful if the variable names are not meaningful. When variable names are not self-explanatory we can keep the original variable names from the raw data but provide meaningful, explanatory labels through the label attribute.
Another type of label is the shortDescription attribute. This is intended to provide additional details that might come in handy later. The shortDescription attribute is set similarly to the label attribute:
attr(iris$Sepal.Length, "shortDescription") <- "Measured using a line gauge produced by Acme factories."
attr(iris$Species, "shortDescription") <- paste0(
  "Two of the three species were collected in the Gaspé ",
  "Peninsula all from the same pasture, and picked on the ",
  "same day and measured at the same time by the same ",
  "person with the same apparatus")
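Before generating the codebook it can be worth checking that the attributes are actually stored on the variables. The sketch below uses only base R; the object name first_rows is our own.

```r
data(iris)
attr(iris$Sepal.Length, "label") <- "Sepal length in cm"
attr(iris$Sepal.Length, "shortDescription") <-
  "Measured using a line gauge produced by Acme factories."

# Read a single attribute back
attr(iris$Sepal.Length, "label")

# List the names of every attribute stored on the variable
names(attributes(iris$Sepal.Length))

# Row subsetting rebuilds the numeric column, so the custom attributes
# are not carried over to the subset
first_rows <- iris[1:10, ]
is.null(attr(first_rows$Sepal.Length, "label"))
```

Because plain attributes can be lost in this way during data wrangling, it is probably safest to set them as one of the last steps, on the cleaned data frame that is passed to the codebook function.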
When we run makeCodebook() again (with argument replace=TRUE to overwrite the report we generated earlier) we can see the additional information appear in the codebook produced.
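Put together, the second run could look like this sketch. It again assumes that dataMaid is installed and that a codebook for iris was already written to the working directory by an earlier call.

```r
# Assumption: dataMaid is installed and an earlier codebook for iris exists
library(dataMaid)

data(iris)

# Add the extra documentation to the variables
attr(iris$Sepal.Length, "label") <- "Sepal length in cm"
attr(iris$Sepal.Length, "shortDescription") <-
  "Measured using a line gauge produced by Acme factories."

# replace = TRUE overwrites the files produced by the earlier call
makeCodebook(iris, replace = TRUE)
```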
makeCodebook() works by tweaking the arguments for makeDataReport() from the dataMaid package. The makeDataReport() function is very versatile, and it is possible to change the arguments to modify the content of the material that goes into the codebook.
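As a hedged illustration of such tweaks, the call below passes a few extra arguments along with the data. The arguments output and openResult are taken from makeDataReport(), and the sketch assumes that makeCodebook() forwards them unchanged; the report title is an arbitrary example.

```r
# Assumption: dataMaid is installed and extra arguments are forwarded
# from makeCodebook() to makeDataReport()
library(dataMaid)

data(iris)

makeCodebook(
  iris,
  reportTitle = "Codebook for the iris data",  # title shown in the document
  output      = "html",                        # render to HTML
  replace     = TRUE,                          # overwrite an existing codebook
  openResult  = FALSE                          # do not open the file afterwards
)
```

The help page for makeDataReport() lists the full set of arguments and is the place to check which of them make sense for a codebook.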
The makeCodebook() function in the dataMaid package should make it easier to create and provide codebooks for small and large projects, and will encourage more people to provide documentable and reproducible research. Comments and suggestions to expand the codebook possibilities are very welcome.