Open-Source Authorship of Data Science in Education Using R

[This article was first published on R Views, and kindly contributed to R-bloggers.]

Joshua M. Rosenberg, Ph.D., is Assistant Professor of STEM Education at the University of Tennessee, Knoxville.

Photo by Alex Ware

In earlier posts, we shared how we wrote Data Science in Education Using R as an open book (Post 1, Post 2). In this post, we describe what we consider to be the open-source authorship process we took to write the book.

We think of open-source authorship as a broader—and perhaps better—term for describing what authors of some open books undertake. In our characterization, open-source authorship draws upon:

  • parts of open-source software (OSS) values and tools
  • parts of open science that establish the importance of scholarly work beyond original, discovery research
  • the values surrounding the creation of open educational resources (OER)

We believe that combining elements from OSS, open science, and OER is notable because while OSS and open science emphasize the sharing of technical work (including technology and code) and OER emphasizes the sharing of resources, technical books have not been as much a focus of the conversation. Moreover, the way in which the conversation about open books has taken place in different communities and contexts means that some books that are open do not fully receive the attention (for being openly available) that they merit from those interested in OER. This also might mean that those involved with OSS development and open science may fail to recognize the creation of a book as a substantial contribution.

In this way, we argue for open-source authorship as an important, new type of work, one that we increasingly see undertaken by the authors of other books, especially in the R community [1, 2, 3, 4].

After describing how we wrote our book in an open way, we elaborate on these ideas and draw connections to the process we undertook.

How We Wrote Data Science in Education Using R as an Open Book

Early in our process, we determined that we wanted to share the book in an open way. Since we were using GitHub as a repository for the book, it was easy to make its contents available for anyone to view, even before and as the book was being written. Despite these benefits, GitHub can be difficult to navigate for those who are unfamiliar with it, so sharing the book in a more widely accessible way was also important. To do this, we used {bookdown} and Netlify to share the book as a website. Additionally, we chose an easy-to-remember URL to help others (and us!) access it easily.

Being open to contributions from others was important. Because we used GitHub, we were able to receive feedback at a very early stage on issues such as how we referred to data (as data or datum). Other issues (raised by non-authors) questioned whether certain content was in scope, such as content on gradebooks, on which we included a chapter. We found that apart from the five of us as authors, fifteen individuals made contributions, and another one hundred forty-four individuals starred the repository [5]. Moreover, we received feedback through Twitter and through an email account we created for the book, so that those unfamiliar with GitHub (or Twitter) could provide feedback directly to us. In this way, making the book available for others to contribute to made the book better, and points to the importance of sharing work at more than one stage of the writing process.

Lastly, we shared products that could be seen as tangential to the book, but which were important given its focus on data science and R. Namely, we created an R package, {dataedu}, to accompany the book. This package includes code to install the packages necessary to reproduce the book as well as all of the data sets used in it. By doing so, we invited others to contribute to the book in ways not related to its prose. This also led to (pleasantly) surprising contributions, including the creation of an IPython Notebook with Python code to carry out steps comparable to those carried out in a walkthrough chapter of our book.
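For readers who want to try the companion package, here is a minimal sketch of how a package shared on GitHub is typically installed. The repository path `data-edu/dataedu` is an assumption on our part (it is not stated in this post), so adjust it if the package lives elsewhere; `remotes::install_github()` and `data()` are standard R functions.

```r
# Install {remotes} from CRAN if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# ASSUMPTION: the package is hosted at this GitHub location; adjust if it differs
remotes::install_github("data-edu/dataedu")

# List the data sets bundled with the installed package (base R utility)
data(package = "dataedu")
```

Installing from GitHub rather than CRAN is a common choice for companion packages, because data sets and helper code can be updated alongside the book itself.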

Collectively, these practices, which involve not only making the book open but also planning for others to contribute and creating other, shared (open) products, comprise what we think of as the results of open-source authorship.

First, open-source software (OSS) and OSS development, originally a niche effort, are (likely not to the surprise of R users!) now widespread [6]. There are some insights that can be gained from efforts to understand how OSS development proceeds. For example, in foundational work, Mockus et al. found that OSS is often characterized by a core group of 10-15 individuals contributing around 80% of the code, but that a group around one order of magnitude larger than that core will repair specific problems, and a group another order of magnitude larger will report issues [7]; these proportions are (generally) similar to those we found for those who contributed to our book.
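To make the comparison concrete, here is a small sketch that expresses the tiers of participation in our book as ratios to the core group and as orders of magnitude. The counts are the ones reported earlier in this post; treating repository stars as a rough proxy for the outermost tier is our assumption, since stars are not the same thing as issue reports.

```r
# Counts reported in this post for the book's repository
tiers <- c(
  authors      = 5,    # core group (the five authors)
  contributors = 15,   # non-author individuals who made contributions
  stargazers   = 144   # individuals who starred the repository (proxy tier)
)

# Size of each tier relative to the core group
ratio_to_core <- tiers / tiers[["authors"]]

# Base-10 logarithm expresses each ratio in orders of magnitude
orders_of_magnitude <- log10(ratio_to_core)

print(round(ratio_to_core, 1))
print(round(orders_of_magnitude, 2))
```

By this rough measure, each tier is larger than the one inside it, as in the pattern Mockus et al. describe, although the gaps for our book are closer to half an order of magnitude than a full one.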

Second, open science is both a perspective about how science should operate and a set of practices that reflect that perspective [8, 9]. Related to open science are open scholarly practices. Others trace the origin of the idea of open scholarly practices to a book by Boyer, who shared a broad description of intellectual (especially academic) work. Boyer's description suggests that scholarly work is not only original, discovery research; it also includes the applications of advances in one’s own discipline (or “translational research”) and sharing the results of research with multiple stakeholders. Open science and open scholarly practices point to the scientific or scholarly contributions of open books; while different from original, scientific research, books such as our own, which focused on providing a language for data science in education, may serve as helpful examples (of open science) or forms of a broader view of scholarship.

Last, OER are “teaching, learning, and research resources that reside in the public domain or have been released under an intellectual property license that permits their free use and re-purposing by others” [10]. These resources range from courses and books to tests and technologies. By being open, they are available to others not only to use, but also to reuse, redistribute (or share), revise (adapt or change the work), and remix (combine existing resources to create a new one) [11]. OER can serve as an inspiration for authors of open books, especially those who see their books as being used to teach and learn from. At the moment, OER and traditional publishing modes are largely separate: for most books that are published, the publisher retains the copyright, and authors are typically not allowed to share their book in the open, though this may be changing. Many authors of books about R have negotiated with their publisher to share their books in the open (often only as a website, as we have) in addition to sharing them through print and e-book formats. In addition, a number of platforms for creating books that are OER are emerging; one example is EdTech Books. There are increasing conversations related to making materials, resources, and even education as an enterprise more open; OER may be an area in which authors of books about R and other technical books can both learn from the work of OER authors and advance the conversation.


This post was an effort to step back from what we did to write our book to reflect on what we meant by open-source authorship and to attempt to situate what we did (and what others have done) in broader conversations about OSS, open science, and OER. In this open mode, we invite others to revise or remix these ideas to advance other, new forms of authorship of books.

You can reach us on Twitter: Emily @ebovee09, Jesse @kierisi, Isabella @ivelasq3, Ryan @RyanEs, and me @jrosenberg6432.

See you in two weeks for our next post! Josh, with help from Ryan, Emily, Jesse, and Isabella

  • Ryan A. Estrellado is a public education leader and data scientist helping administrators use practical data analysis to improve the student experience.

  • Emily A. Bovee, Ph.D., is an educational data scientist working in dental education.

  • Jesse Mostipak, M.Ed., is a community advocate, Kaggle educator, and data scientist.

  • Isabella C. Velásquez, MS, is a data analyst committed to nonprofit work with the aim of reducing racial and socioeconomic inequities.
