midfieldr v1.0.1

[This article was first published on Layton R blog, and kindly contributed to R-bloggers.]

A qualitative overview of the midfieldr package and its application following its initial CRAN release. midfieldr provides tools and recommended methods for working with individual undergraduate student-level records (registrar’s longitudinal data) in R.

midfieldr is designed to work with data from the MIDFIELD research database, a sample of which is available in the midfielddata data package. Tools in midfieldr include filters for US academic program codes, data sufficiency, and timely completion. Recommended methods, illustrated on the package website, include gathering blocs of records, computing quantitative metrics such as graduation rate, and creating charts to visualize comparisons.


Begun in 2004 as an extension of the SUCCEED Longitudinal Database, MIDFIELD contains student records for all undergraduate, degree-seeking students at partner institutions, currently 2.4M unique students at 21 US institutions from 1987 to 2022.

While originally intended for studying engineering programs, the database can be used to study any set of programs at the member institutions—the data are whole-population data, that is, student records for all undergraduates for the span of years provided by each institution.

An early version of the package was presented at useR! 2018 in Brisbane [slide deck]. In the five years since that talk, the package has undergone significant development, testing, and revision such that the current release (v1.0.1) is ready for dissemination via CRAN.

Data structure

The functions in midfieldr are designed to interact with the data structure implemented in the MIDFIELD database. MIDFIELD data are organized in four tables (student, term, course, degree) linked by an anonymized student ID with variables as outlined below. (The midfielddata website provides a data dictionary.)

Figure 1: MIDFIELD data structure.

Each table is in block-record form consistent with Codd’s 2nd rule for relational databases:

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.

In MIDFIELD, student, term, course, and degree are the tables; the anonymized student ID is the primary key; and the column names encode the variables outlined in Figure 1. Each row is an observation and each column a variable.
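As a minimal sketch of how tables linked by a student ID can be combined, the base R example below joins a toy student table to a toy degree table. The column names (mcid, sex, term_degree, cip6) and all values are assumptions made for illustration; consult the midfielddata data dictionary for the actual variables.

```r
# Toy stand-ins for two MIDFIELD-style tables.
# All column names and values here are illustrative assumptions.
student <- data.frame(
  mcid = c("S001", "S002", "S003"),   # anonymized student ID (linking key)
  sex  = c("Female", "Male", "Female")
)
degree <- data.frame(
  mcid        = c("S001", "S003"),    # only students who earned a degree
  term_degree = c("19943", "19961"),
  cip6        = c("140801", "141901")
)

# Left join on the student ID: keep every student and attach
# degree information where it exists (NA otherwise)
student_degree <- merge(student, degree, by = "mcid", all.x = TRUE)
student_degree
```

Student S002 has no row in the degree table, so the degree columns for that student are NA after the join.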

What midfieldr does

The purpose of midfieldr functions is to implement what we consider good practices for treating longitudinal student-level data.

For example, consider the concept of data sufficiency. The time span of MIDFIELD data varies by institution, each institution having its own lower and upper bounds. For some student records, being at or near these bounds creates unavoidable ambiguity when trying to assess degree completion. Such records must be identified and, in most cases, excluded to prevent false summary counts.

To illustrate, consider a student whose first term at an institution occurs two years before the upper limit of their institution’s data range. A researcher has no way of knowing the student’s completion status. The student may have: graduated in a timely manner (say, in 6 years or less); graduated, but in more than 6 years; or left the database without a degree. Failing to exclude such students leads to false counts when grouping and summarizing blocs of records, for example, when counting starters, graduates, ever-enrolled, etc.
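The reasoning above can be sketched in a few lines of base R. This is a simplified illustration, not the midfieldr implementation: it works in whole academic years, assumes a 6-year timely-completion span, and uses hypothetical column names (year_first, year_limit) with invented values.

```r
# Toy records: first academic year of enrollment and the last academic
# year covered by the institution's data (all values invented).
records <- data.frame(
  mcid       = c("S001", "S002", "S003"),
  year_first = c(2010, 2016, 2020),
  year_limit = c(2022, 2022, 2022)
)

span <- 6  # assumed timely-completion span, in years

# Latest year in which a timely completion could occur
records$year_timely <- records$year_first + span

# A record is sufficient only if its timely-completion window
# closes within the institution's data range
records$sufficient <- records$year_timely <= records$year_limit

# Keep only records whose completion status can be assessed
records[records$sufficient, ]
```

Student S003 starts two years before the upper limit, so the timely-completion window extends past the available data and the record is excluded.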

The articles posted to the midfieldr website describe how the package is used to treat fundamental issues (like data sufficiency) in the context of longitudinal student-level data. Thus, midfieldr provides tools and recommended methods designed specifically for treating student-level records.

Tools include filters for US academic program codes, data sufficiency, and timely completion.

Methods include:

  • Planning. Identify the groups of students, programs, and metrics with which we intend to work.
  • Initial processing. Filter for data sufficiency, degree-seeking, and academic programs.
  • Blocs. Identify and label records to be treated as a unit, for example, starters, students ever-enrolled, graduates, transfer students, etc.
  • Groupings. Add relevant grouping variables such as race/ethnicity, sex, and program label. Group and summarize.
  • Metrics. Compute measures of academic success such as graduation rates, stickiness, etc., disaggregated by grouping variables.
  • Displays. Display results of quantitative metrics in charts and tables.
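The grouping and metric steps listed above can be sketched with base R on a toy bloc of starters. The column names (program, sex, timely_grad) and all values are invented for illustration; the package articles use data.table and the midfielddata practice data instead.

```r
# Toy bloc of starters with two grouping variables and a
# timely-graduation indicator (all values invented).
bloc <- data.frame(
  mcid        = sprintf("S%03d", 1:8),
  program     = rep(c("Civil", "Mechanical"), each = 4),
  sex         = rep(c("Female", "Female", "Male", "Male"), times = 2),
  timely_grad = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Group and summarize: count starters and timely graduates per group
n_start <- aggregate(mcid ~ program + sex, data = bloc, FUN = length)
n_grad  <- aggregate(timely_grad ~ program + sex, data = bloc, FUN = sum)

# Combine the counts and compute the metric
result <- merge(n_start, n_grad, by = c("program", "sex"))
names(result) <- c("program", "sex", "starters", "graduates")
result$grad_rate <- round(100 * result$graduates / result$starters, 1)
result
```

Each row of the result is one program-by-sex group with its starter count, graduate count, and graduation rate in percent.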

Sample result

Figure 2 displays a count of engineering graduates grouped by race/ethnicity, sex, and program. It illustrates a typical set of grouping variables and a typical chart design (a Cleveland multiway chart). The script for this specific chart is given in the multiway article on the package website.

Figure 2: Count of engineering graduates from the practice data in the midfielddata package

Note that midfielddata is suitable for learning to work with student-level data but not for drawing inferences about program attributes or student experiences. midfielddata supplies practice data, not research data.

Notes on syntax

Throughout the work, we use the data.table package for data manipulation (Dowle & Srinivasan, 2021) and the ggplot2 package for charts (Wickham, 2016). Some users may prefer base R or dplyr for data manipulation (Wickham et al., 2023) or lattice for charts (Sarkar, 2008). Each system has its strengths, and users are welcome to translate our examples to their preferred syntax.

Note that midfieldr functions return data.table-type data frames and do not preserve tibble structures. A user who prefers tibbles in tidyverse-style scripts would probably want to apply as_tibble() to the result of most midfieldr functions.
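A minimal sketch of that conversion follows, assuming the data.table and tibble packages are installed. The object result is a toy stand-in for the output of a midfieldr function, with invented column names and values.

```r
# Assumes the data.table and tibble packages are installed
library(data.table)
library(tibble)

# Toy stand-in for a data.table returned by a midfieldr function
result <- data.table(
  mcid   = c("S001", "S002"),
  timely = c(TRUE, FALSE)
)

# Convert to a tibble for use in tidyverse-style scripts
result_tbl <- as_tibble(result)
class(result_tbl)
```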

For more information

MIDFIELD.   A database of anonymized student-level records for approximately 2.4M undergraduates at 21 US institutions from 1987–2022, of which midfielddata provides a sample. This research database is currently accessible to MIDFIELD partner institutions only.

midfielddata.   An R data package that supplies anonymized student-level records for 98,000 undergraduates at three US institutions from 1988–2018. A sample of the MIDFIELD database, midfielddata provides practice data for the tools and methods in the midfieldr package.

MIDFIELD Institute.   Materials from the 2023 workshop introducing the application of the midfieldr package.

Software credits

  • checkmate for internal function argument checks
  • data.table for data manipulation tools
  • ggplot2 for data graphics tools
  • wrapr for internal function authoring tools


Dowle, M., & Srinivasan, A. (2021). data.table: Extension of ‘data.frame’. https://CRAN.R-project.org/package=data.table
Sarkar, D. (2008). Lattice: Multivariate data visualization with R. Springer. http://lmdvr.r-forge.r-project.org
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr