Generating codebooks in R

March 2, 2018

(This article was first published on, and kindly contributed to R-bloggers)

A codebook is a technical document that provides an
overview of and information about the variables in a dataset. The
codebook ensures that the statistician has the complete background
information necessary to undertake the analysis, and a codebook
documents the data to make sure that the data is well understood and
reusable in the future. Here we will show how to create codebooks in R
using the dataMaid package.

The help pages for the datasets in R packages usually provide thorough
information although the level of detail may vary quite substantially
from dataset to dataset. As an example we will consider the iris
dataset. The help page gives decent information so we will just use it
to show how we would create a codebook.
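Before building a codebook it can help to glance at the data itself. The following is a minimal sketch using only base R; it assumes nothing beyond the iris data frame that ships with R.

```r
# iris ships with base R (datasets package), so no extra packages are needed
data(iris)

# Compact overview of the structure: variable names, classes, first values
str(iris)

# Open the help page documenting the dataset (only in interactive sessions)
if (interactive()) help("iris")
```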


Real datasets, however, are messy and not as polished as the datasets
found in R packages. A substantial amount of data wrangling, tweaks,
cleaning, and custom solutions are necessary to transform the data
into shape before it is ready for statistical analysis. Creating the
finished dataset is not enough as it is also necessary to produce the
corresponding data documentation.

We have previously shown how the
dataMaid package can produce automated reports to summarise
datasets, to identify potential errors, and to check the data quality
and integrity.

The dataMaid package produces an Rmarkdown summary document with
information on each variable in the data frame, and the document can
be rendered to a report in HTML, PDF, or Word. The final report can be
given to scientific collaborators since proper data validation often
requires a collaborative effort between an expert in the field and a
data scientist. It is easy to tweak the report generated by dataMaid to
obtain a document that can serve as a codebook for the cleaned dataset.

The function makeCodebook() accepts a data frame and produces a
document that provides a summary of the data frame and its variables.
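A call along the following lines should produce the codebook for iris. This is a sketch that assumes the dataMaid package is installed and that the tools needed to render R Markdown (pandoc and, for PDF output, a LaTeX distribution) are available; the requireNamespace() guard is only there so the snippet does nothing if the package is missing.

```r
# Assumption: dataMaid is installed, e.g. via install.packages("dataMaid")
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library(dataMaid)

  # Writes an R Markdown codebook for iris to the working directory
  # and renders it with the default settings
  makeCodebook(iris)
}
```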


The result is the 3-page document reproduced below in the two
figures. The codebook consists of 4 parts: the first two parts are
tables giving an overview of the data frame and the variables. Here we
see the number of observations, the number of variables, their class
type, and the proportion of missing observations.
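To get a feel for what the overview tables contain, the same quantities can be approximated by hand. This is a base R sketch for illustration only, not the code dataMaid uses internally; the name `overview` is made up for this example.

```r
# Number of observations and number of variables
dim(iris)

# One row per variable: its (first) class and the proportion of missing values
overview <- data.frame(
  variable = names(iris),
  class    = vapply(iris, function(x) class(x)[1], character(1)),
  missing  = vapply(iris, function(x) mean(is.na(x)), numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
overview
```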

Part 3 lists each variable and provides class-dependent summary
statistics and a data-visualisation. If a variable is a factor then
the unique factor levels are listed beneath the summary statistics.
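The idea of class-dependent summaries can be illustrated in base R. The helper `describe_variable()` below is hypothetical and much cruder than what dataMaid reports; it only shows the principle of choosing the summary from the variable's class.

```r
# Hypothetical helper: list the levels for factors, numeric summary otherwise
describe_variable <- function(x) {
  if (is.factor(x)) {
    levels(x)      # unique factor levels
  } else {
    summary(x)     # min, quartiles, mean, max for numeric variables
  }
}

# Apply the helper to every variable in the data frame
lapply(iris, describe_variable)
```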

Part 4 documents the report generation information (who made it, when,
directory, the function call, the operating system platform).
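Similar meta-information can be collected manually with base R functions. The sketch below is an illustration under that assumption, not a reproduction of what dataMaid records; the list name `report_meta` and its element names are invented for this example.

```r
# Collect basic provenance information about the current R session
report_meta <- list(
  created_by = Sys.info()[["user"]],   # user name reported by the system
  created_at = Sys.time(),             # time stamp of report creation
  directory  = getwd(),                # current working directory
  platform   = R.version$platform,     # operating system platform
  r_version  = R.version.string        # version of R in use
)
str(report_meta)
```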

Figure: The full codebook generated with the default arguments to `makeCodebook()`. The codebook created is an Rmarkdown document that can be rendered to PDF, HTML, or Word.

While the information about each variable (pages 1 and 2) serves as a
reference manual when doing subsequent analyses, the last page
provides meta-information about the codebook to ensure documentable
and reproducible research.

The codebooks can be improved by adding additional information about
the variables. There are two ways to add extra information to the
codebook. The first uses the same approach as the labelled package
where the attribute labels can be set for a variable in a data frame
to contain label information. These labels are set directly

attr(iris$Sepal.Length, "labels") <- "Sepal length in cm"

or it is possible to use the functions from the labelled
package. The labels attribute is intended for condensed information
and it is particularly useful if the variable names are not
meaningful. When variable names are not self-explanatory we can keep
the original variable names from the raw data but provide meaningful, explanatory
labels through the labels attribute.

Another type of label is the shortDescription attribute. This is
intended to provide additional details that might come in handy
later. The shortDescription attribute is set similarly to the
labels attribute.

attr(iris$Sepal.Length, "shortDescription") <-
     "Measured using a line gauge produced by Acme factories."
attr(iris$Species, "shortDescription") <- paste0(
     "Two of the three species were collected in the Gaspé ",
     "Peninsula all from the same pasture, and picked on the ",
     "same day and measured at the same time by the same ",
     "person with the same apparatus")
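It can be useful to check which variables carry a description before regenerating the codebook. The sketch below uses only base R and works on a fresh copy of the data (the name `iris_copy` is made up here) so that it does not interfere with the attributes set above.

```r
# Work on a fresh copy so the iris object in the workspace is left untouched
iris_copy <- datasets::iris

attr(iris_copy$Sepal.Length, "shortDescription") <-
  "Measured using a line gauge produced by Acme factories."

# Read a single attribute back
attr(iris_copy$Sepal.Length, "shortDescription")

# Find the variables that have a shortDescription attribute
has_desc <- vapply(
  iris_copy,
  function(x) !is.null(attr(x, "shortDescription")),
  logical(1)
)
names(iris_copy)[has_desc]
```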

When we run makeCodebook() again (with argument replace=TRUE to
overwrite the report we generated earlier) we can see the additional
information appear in the codebook produced.

makeCodebook(iris, replace=TRUE)

Figure: The codebook regenerated with the added variable descriptions.

makeCodebook() works by tweaking the arguments for
makeDataReport() from the dataMaid package. The makeDataReport()
function is very flexible, and
it is possible to change the arguments to modify the content of the
material that goes into the codebook.
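As an illustration, extra arguments can be handed to makeCodebook(), which passes them on to makeDataReport(). The sketch below assumes dataMaid is installed and uses the argument names output, openResult, and reportTitle as I understand them from the package documentation; check ?makeDataReport for the version you have installed. The title text is a placeholder.

```r
# Assumption: dataMaid is installed; the guard skips the call otherwise
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library(dataMaid)

  makeCodebook(
    iris,
    replace     = TRUE,                 # overwrite an existing codebook file
    reportTitle = "Codebook for iris",  # placeholder title for the document
    output      = "html",               # render to HTML rather than the default
    openResult  = FALSE                 # do not open the rendered file afterwards
  )
}
```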

Hopefully, the makeCodebook() function in the dataMaid package
should make it easier to create and provide codebooks for small and
larger projects, and will encourage more people to provide
documentable and reproducible research. Comments and suggestions to
expand the codebook possibilities are very welcome.
