Open-Source Authorship of Data Science in Education Using R

[This article was first published on R Views, and kindly contributed to R-bloggers.]

Joshua M. Rosenberg, Ph.D., is Assistant Professor of STEM Education at the
University of Tennessee, Knoxville.

Photo by Alex Ware

In earlier posts (Post 1, Post 2), we shared how we wrote Data Science in
Education Using R as an open book. In this post, we describe what we consider
to be the open-source authorship process we followed to write the book.

We think of open-source authorship as a broader—and perhaps better—term for
describing what authors of some open books undertake. In our characterization,
open-source authorship draws upon:

  • parts of open-source software (OSS) values and tools
  • parts of open science that establish the importance of scholarly work beyond
    original, discovery research
  • the values surrounding the creation of open educational resources (OER)

We believe that combining elements from OSS, open science, and OER is notable
because while OSS and open science emphasize the sharing of technical work
(including technology and code) and OER emphasizes the sharing of resources,
technical books have not been as much a focus of the conversation. Moreover, the
way in which the conversation about open books has taken place in different
communities and contexts means that some books that are open do not fully
receive the attention (for being openly available) that they merit from those
interested in OER. It might also mean that those involved with OSS development
and open science fail to recognize the creation of a book as a substantial
contribution.

In this way, we argue for open-source authorship as an important, new type of
work, one that we increasingly see undertaken by the authors of other books,
especially in the R community [1, 2, 3].

After describing how we wrote our book in an open way, we elaborate on these
ideas and draw connections to the process we undertook.

How We Wrote Data Science in Education Using R as an Open Book

Early in our process, we determined that we wanted to share the book in an open
way. Since we were using GitHub as a repository for the book, it was easy for
the contents of the book to be available for anyone to view, even before and as
the book was being written. Despite these benefits, GitHub can be difficult to
navigate for those who are unfamiliar with it, so sharing the book in a more
widely accessible way was also important. To do this, we used {bookdown} and
Netlify to share the book as a website. Additionally, we chose an
easy-to-remember URL to help others (and us!) access it easily.
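
For readers who have not used this combination before, here is a minimal sketch
of what building a {bookdown} book into a website can look like. It is not our
exact setup: the helper name build_book_site() is hypothetical, and the
index.Rmd input and _book output directory are assumptions based on common
{bookdown} conventions. A static host such as Netlify can then be pointed at
the output directory.

```r
# Hypothetical helper: render a bookdown project to a folder of static HTML.
# Assumes it is run from the book's root directory, where index.Rmd lives.
build_book_site <- function(input = "index.Rmd", output_dir = "_book") {
  if (!requireNamespace("bookdown", quietly = TRUE)) {
    stop("The {bookdown} package is required: install.packages('bookdown')")
  }
  if (!file.exists(input)) {
    stop("Could not find ", input, "; run this from the book's root directory.")
  }
  # render_book() knits the chapters and writes the HTML site to output_dir
  bookdown::render_book(
    input = input,
    output_format = "bookdown::gitbook",
    output_dir = output_dir
  )
  # Return the folder a static host (e.g., Netlify) would publish
  invisible(output_dir)
}
```

Calling build_book_site() from the book's root directory would write the site
to _book, which is the folder to publish.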

Being open to contributions from others was important. Because we used GitHub,
we were able to receive feedback at a very early stage on issues such as how we
referred to data. Other issues (raised by non-authors) asked whether certain
content was in scope, including content on a topic on which we included a
chapter. Apart from the five of us as authors, 15 individuals made
contributions, and another 144 individuals starred the repository. Moreover, we
received feedback through Twitter and through an email account we created for
the book, so that those unfamiliar with GitHub (or Twitter) could provide
feedback directly to us. In this way, making the book available for others to
contribute to made the book better, and it points to the importance of sharing
work at more than one stage of the writing process.

Lastly, we shared products that could be seen as tangential to the book, but
which were important given its focus on data science and R. Namely, we created
an R package, {dataedu}, to accompany the
book. This package includes code to install the packages necessary to reproduce
the book as well as all of the data sets used in it. By doing so, we invited
others to contribute to the book in ways not related to its prose. This also led
to (pleasantly) surprising contributions, including the creation of an IPython
Notebook with Python code to carry out steps comparable to those carried out in
a walkthrough chapter of our book.

Collectively, these practices—involving not only making the book open, but also
planning for others to contribute and creating other, shared (open) products—
comprise what we think of as the results of open-source authorship.

First, open-source software (OSS) and OSS development, originally a niche
effort, are (likely not to the surprise of R users!) now widespread. Efforts to
understand how OSS development proceeds offer some insights. For example, in
foundational work, Mockus et al. found that OSS is often characterized by a
core group of 10-15 individuals contributing around 80% of the code, that a
group around one order of magnitude larger than that core will repair specific
problems, and that a group another order of magnitude larger will report issues
[7]. These proportions are (generally) similar to those we found among those
who contributed to our book.
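
To make the comparison concrete, the short sketch below computes how much
larger each group around our book was than the one inside it, using the counts
reported earlier in this post. One assumption to flag: starring a repository is
used here as a rough stand-in for the outermost group, even though starring is
not the same as reporting an issue.

```r
# Counts reported in this post for the book
authors      <- 5    # core group (the book's authors)
contributors <- 15   # non-authors who made contributions
stargazers   <- 144  # additional individuals who starred (assumed proxy
                     # for the outermost group; not the same as issue reporters)

# How many times larger is each group than the next-smaller one?
growth <- c(
  contributors_per_author    = contributors / authors,
  stargazers_per_contributor = stargazers / contributors
)
print(growth)

# The same ratios on a log10 scale, where 1 means one full order of magnitude
print(round(log10(growth), 2))
```

The ratios come out to 3 and 9.6 (about 0.48 and 0.98 on the log10 scale), so
the outer step is close to a full order of magnitude while the inner step is
smaller, which fits the "(generally) similar" framing.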

Second, open science is both a perspective about how science should operate and
a set of practices that reflect that perspective [9]. Related to open science
are open scholarly practices. Others trace the origin of the idea of open
scholarly practices to a book by Boyer, who shared a broad description of
intellectual (especially academic) work. This description suggests that
scholarly work is not only original, discovery research; it also includes the
application of advances in one’s own discipline (or “translational research”)
and sharing the results of research with multiple stakeholders. Open science and
open scholarly practices point to the scientific or scholarly contributions of
open books; while different from original, scientific research, books such as
our own, which focused on providing a language for data science in education,
may serve as helpful examples (of open science) or forms of a broader view of
scholarship.
Last, OER are “teaching, learning, and research resources that reside in the
public domain or have been released under an intellectual property license that
permits their free use and re-purposing by others” [10]. These resources range
from courses and books to tests and technologies. By being open, they are
available to others not only to use, but also to reuse, redistribute (or share),
revise (adapt or change the work), and remix (combine existing resources to
create a new one).
OER can serve as an inspiration for authors of open books, especially those who
see their books as being used to teach and learn from. At the moment, OER and
traditional publishing modes are largely separate: For most books that are
published, the publisher retains the copyright, and authors are typically not
allowed to share their book in the open, though this may be changing. Many
authors of books about R have negotiated with their publisher to share their
books in the open (often only as a website, as we have) in addition to sharing
them through print and e-book formats. In addition, a number of platforms for
creating books that are OER are emerging; one example is EdTech. There are
increasing conversations related to making materials, resources, and even
education as an enterprise more open; OER may be an area in which authors of
books about R and other technical books can both learn from the work of OER
authors and advance the conversation.


This post was an effort to step back from what we did to write our book to
reflect on what we meant by open-source authorship and to attempt to situate what
we did (and what others have done) in broader conversations about OSS, open
science, and OER. In this open mode, we invite others to revise or remix these
ideas to advance other, new forms of authorship of books.

You can reach us on Twitter: Emily @ebovee09,
Jesse @kierisi, Joshua
@jrosenberg6432, Isabella
@ivelasq3, and me

See you in two weeks for our next post! Josh, with help from Ryan, Emily, Jesse,
and Isabella

  • Ryan A. Estrellado is a public education leader and data scientist helping
    administrators use practical data analysis to improve the student

  • Emily A. Bovee, Ph.D., is an educational data scientist working in dental

  • Jesse Mostipak, M.Ed., is a community advocate, Kaggle educator, and data

  • Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work
    with the aim of reducing racial and socioeconomic inequities.
