Sharing the Big Book of R upgrade proposal
01 June 2023
Earlier this year I applied to the R Consortium for a grant to upgrade the Big Book of R. Unfortunately my proposal wasn’t accepted, but I am really proud of how the proposal turned out. I’ve never written a proposal before and wasn’t really sure of how to do it, but with a little bit of guidance from Andrew Collier (Data Wookie) I think it turned out pretty well.
I want to give a special thanks to everyone who submitted a statement of support – it was really inspiring to get a glimpse of how much you value the Big Book of R!
I’m sharing the proposal here so it can serve as inspiration for your next grant proposal.
I’d heard that grant-writing can take a long time, and it did: this one took me maybe 5 or 6 solid hours to put together. That’s a lot for the time-scarce among us, so I suggest you email the committee members early and ask whether they think your idea is a viable candidate before sinking all your hours into it.
Thanks and enjoy the read!
Big Book of R Upgrade proposal
In response to the 2023 ISC Grant Programme Call for Proposals
Submitted by: Oscar Baruffa
Date: 29 March 2023
Summary
Seeking ISC Grant Funding of [redacted] spread over 3 years for a website upgrade of the Big Book of R. The upgrade enhances the discoverability of R programming books, which will improve the Social Infrastructure of the R ecosystem.
The upgrade will be completed in 4 months. Development costs primarily cover labour and include 3 years of maintenance support and storage/compute fees.
“This will be great for knowledge sharing and improving R accessibility and usability. Please fund generously.” – Brendan Ansell
Signatories
The proposal has broad support in the wider R community. A call for statements of support was issued to the wider R community via Twitter, Reddit, Mastodon, LinkedIn and Oscar Baruffa’s newsletter subscribers.
At the time of writing, over 100 statements have been submitted and are overwhelmingly supportive of the proposed upgrades to the Big Book of R.
Project team
Oscar Baruffa – Project Lead
Oscar is the creator and maintainer of Big Book of R. He works as a Senior Analytics Manager, overseeing the development of a data and analytics pipeline including web portals. He has extensive project management experience for large and small projects.
Andrew Collier – Tech Lead
Andrew is the founder of and Lead Data Scientist at Fathom Data. Andrew has a wide range of experience in bringing data solutions to life directly and together with his team. He is the creator of a number of R packages and has hosted multiple SatRdays in South Africa.
Bianca Peterson – Data Scientist
Bianca is a data scientist with a strong background in the higher education industry. She is a trainer with The Carpentries and co-chair of the CODATA-RDA Schools of Research Data Science.
In addition to this team, Andrew is able to assign team members from his consultancy Fathom Data to support if required.
All project members have confirmed their availability to work on this project.
Consulted
Hadley Wickham (ISC Member) was contacted on 18th March 2023 via email for informal feedback on the proposal. The feedback was that the proposal was a likely candidate for consideration, along with some notes to clarify the need and cost breakdown. These have been addressed in this proposal.
Oscar Baruffa consulted with Andrew Collier of Fathom Data on the objectives, technical approach and project team.
The wider R community has been asked for their support and feedback with over 100 responses received, which can be viewed here.
“The Big Book of R is my go-to resource when people ask, “What are some resources for learning about R in [industry]?”. I would love to see it further develop and grow.” – Isabella Velásquez
The Problem
Background
The Big Book of R (https://www.bigbookofr.com/) aka “the site” or “BBoR” is a curated collection of over 350 R programming books. Almost all of these books are open-source and free to read.
The site is essentially a library that gathers these amazing resources and puts them in one place. The problem the site is trying to solve is that of discoverability of R programming books. Prior to Big Book of R, the only way to find these books was by googling them or by happening across a limited list of books contained in a git repo.
No other online resource contains the depth and breadth of the books available at BBoR and at a high quality, for free. Oscar launched the site in August 2020. The site stats are publicly available. Since launch, it has had some 240k visitors. The latest 12-month figures show about 95k visitors per year. Average time on the site is 2 minutes, with a click-through rate of 31%, i.e. about a third of visitors click through to a book’s website.
Feedback from the R community has been positive since its launch. Survey respondents have stated they refer others to it as a go-to resource for new and experienced R programmers.
New books are added at the rate of 5-10 books every 6-8 weeks. Blog posts that highlight new book additions get 1-2k views each.
“As a 26 year user of R I think the upgrade will make The Big Book of R an even more indispensable resource for everyone from beginners through to experts. I regularly recommend it to those I mentor.” – Dr Lyndon Walker
The issues
The size of the already-large and continually-growing collection presents new problems:
- Curating books is becoming difficult, particularly removing books that are old. Outdated books clutter the user experience and reduce the relevance of the site.
- Some subjects have many books, making it difficult for visitors to work out which one has the content they need.
- The backend of the site is a Google Sheet. This is becoming increasingly difficult to use as a data store: duplicate books are hard to spot, and author bio entries end up with variations and multiple versions.
In summary, the general discoverability of relevant books is decreasing as the collection grows.
“The proposed upgrades would be well worth it. … This is ever more important in the face of increasingly ineffective search results from eg google.”- Jim Gardner
The Proposal
Overview
This proposal is for an upgrade of the Big Book of R which will:
- Improve the quality of the collection
- Improve the durability of the system supporting it
- Greatly enhance the discovery of books in the collection
The proposal aims to balance improvements across a number of facets to get the most out of the time and funds available. These are:
- Improve the back-end by migrating from a Google Sheet to a database.
- Improve the data quality by cleaning up author bio links.
- Lower the time-cost of curation by using an app interface for data entry.
- Enrich the database with additional data for each entry i.e. date of last update and a table of contents.
- Improve discoverability by adding:
  - A “trending books” chapter showing the most popular books of the last week and month.
  - A “new additions” chapter highlighting the books most recently added to the collection.
- Position the book for future improvements by:
  - Modernising the back end (see above)
  - Improving data quality of existing and new books (see above)
  - Porting the site to Quarto.
- Annual maintenance support for 3 years
The R community will benefit from this upgrade. The annual 95k (and increasing) visitors will get a better user experience to find the information they are looking for. Authors of R books get wider distribution of their material and better return on investment for themselves and the R ecosystem.
“There’s an abundance of R literature and this book has provided a port for it. It helps new starters and seasoned users equally and it’s literally the only bookmark on R that anyone would need. Any upgrades on the delivery method of such an important source will go a long way and increase adoption of the R language!”- Vasileios Plessas
Project plan
Start-up phase
The startup phase will consist of creating the feature roadmap, familiarising the delivery team with the codebase, making technical decisions on architecture and setting up a review mechanism.
The team has already worked together professionally for ~3 years, and this project will mirror our already successful collaboration model.
The project already exists on GitHub and collaborators can be easily added.
The team already uses Asana to work in an agile manner, and we’ll create a dedicated project board for this work.
Inter-team communication already works well via Slack, and we’ll create a Slack group for this work also.
As this is an ongoing project, most of the “start up” work is only related to this upgrade specifically which means we can start almost immediately.
Technical delivery
Migrating to a database
Rationale: This is a key step to improve data quality and resilience compared with the current Google Sheet solution. It also opens up the ability to store the Table of Contents for each entry.
Activities:
- Decide on architecture/tech provider.
- Design and develop the data model
- Clean the existing dataset
- Port data from the Google Sheet to the model
- Document the model and tables
- Reconfigure the site to pull data from the database
- Refactor code that builds book entries.
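As a rough illustration of what the data model step could look like, here is a minimal sketch using Python's built-in sqlite3 module. The proposal does not fix a schema or a database provider, so every table name, column name and constraint below is an assumption chosen for the example. The idea it shows is that a separate authors table and a unique book URL address the duplicate-book and variant-author-bio problems described earlier.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the data model the project would actually adopt.
SCHEMA = """
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,    -- one row per author, so one bio link
    bio_url   TEXT
);
CREATE TABLE books (
    book_id      INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    url          TEXT NOT NULL UNIQUE, -- rejects exact duplicate entries
    category     TEXT NOT NULL,
    added_on     TEXT NOT NULL,        -- ISO 8601 date added to the collection
    last_updated TEXT                  -- optional, could be filled by a scraper
);
CREATE TABLE book_authors (
    book_id   INTEGER NOT NULL REFERENCES books(book_id),
    author_id INTEGER NOT NULL REFERENCES authors(author_id),
    PRIMARY KEY (book_id, author_id)
);
"""


def create_database(path=":memory:"):
    """Create the example schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_book(conn, title, url, category, added_on, author_names):
    """Insert a book and link it to its authors, reusing existing author rows."""
    cur = conn.execute(
        "INSERT INTO books (title, url, category, added_on) VALUES (?, ?, ?, ?)",
        (title, url, category, added_on),
    )
    book_id = cur.lastrowid
    for name in author_names:
        # INSERT OR IGNORE leaves an existing author row untouched.
        conn.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (name,))
        (author_id,) = conn.execute(
            "SELECT author_id FROM authors WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO book_authors (book_id, author_id) VALUES (?, ?)",
            (book_id, author_id),
        )
    conn.commit()
    return book_id


def newest_books(conn, limit=10):
    """Return (title, added_on) for the most recently added books."""
    return conn.execute(
        "SELECT title, added_on FROM books ORDER BY added_on DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

With a model along these lines, a "new additions" listing becomes a single query (`newest_books`), and adding the same URL twice raises `sqlite3.IntegrityError` instead of silently creating a duplicate.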
Develop backend app interface
Rationale: Data capture is a time-consuming exercise. A lightweight app interface will support basic maintenance activities, e.g. adding and updating data, checking for duplicates prior to entry, and selecting from existing categories and authors.
Activities:
- Develop minimal functionality specification
- Select tech solution
- Design and develop the interface.
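The "check for duplicates prior to entry" idea could be sketched roughly as follows. This is only one possible approach, not the project's chosen design: it assumes each book is a dictionary with hypothetical `title` and `url` keys, and it treats two entries as likely duplicates when their normalised titles or normalised URLs match.

```python
import re
from urllib.parse import urlsplit


def normalise_title(title):
    """Lowercase, replace punctuation with spaces and collapse whitespace.

    Only ASCII letters and digits are kept, which is a simplification.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(cleaned.split())


def normalise_url(url):
    """Reduce a URL to host plus path, ignoring scheme, 'www.' and trailing slashes."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def find_duplicates(candidate, existing):
    """Return existing entries whose normalised title or URL matches the candidate."""
    title_key = normalise_title(candidate["title"])
    url_key = normalise_url(candidate["url"])
    return [
        book
        for book in existing
        if normalise_title(book["title"]) == title_key
        or normalise_url(book["url"]) == url_key
    ]
```

An entry form could call `find_duplicates` before saving and show the curator any matches, leaving the final decision to a human rather than rejecting the entry automatically.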
Port to Quarto
Rationale: Improves the site’s styling and opens up future User Interface improvements as Quarto develops.
Activities:
- Port the book to Quarto
Develop “trending books” section
Rationale: Opportunities to highlight books are currently limited to new additions, a random selection or a curator’s selection. Creating a “trending books” section will highlight the books that have been accessed the most over the past 7 or 30 days.
Activities:
- Build an interface to the site analytics API (Plausible Analytics).
- Develop a chapter that lists the most popular books of the past period.
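The ranking behind a "trending books" chapter might look something like the sketch below. It deliberately leaves out the analytics API call, since the proposal does not describe it. Instead it assumes the click data has already been fetched into a list of `(click_date, book_title)` pairs, which is a hypothetical shape chosen for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def trending_books(clicks, today, window_days=7, top_n=10):
    """Rank books by clicks in the `window_days` days ending on `today`, inclusive.

    `clicks` is an iterable of (click_date, book_title) pairs. This shape is
    an assumption for the example, not the format of any analytics API.
    """
    earliest = today - timedelta(days=window_days - 1)
    counts = Counter(
        title for click_date, title in clicks if earliest <= click_date <= today
    )
    # Sort by count (descending), then title, so ties have a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

Calling the function once with `window_days=7` and once with `window_days=30` would give the weekly and monthly lists mentioned in the rationale.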
Develop “new books” section
Rationale: Posts announcing new book additions are very popular on Oscar Baruffa’s website, and it would be useful for visitors to be able to reference the site itself for new book additions.
Activities:
- Develop a chapter that displays the most recently added books.
Web scraping
Rationale: In order to retrieve the date books were last updated, and the table of contents of each book, web scraping of the books will need to be deployed. Web scraping will be limited to the two most common book formats to reduce development time and maintenance requirements. These will run infrequently to minimise compute costs.
Activities:
- Develop web scrapers
- Set up scheduling
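To give a feel for the table-of-contents part of the scraping work, here is a minimal sketch using only Python's standard-library HTML parser. The proposal does not name the two book formats or the page structure to target, so the rule used here (links inside `<li class="chapter">` items, modelled on bookdown-style sidebars) is an assumption, and the function works on HTML that has already been downloaded.

```python
from html.parser import HTMLParser


class ChapterLinkParser(HTMLParser):
    """Collect (title, href) for links inside <li class="chapter"> elements.

    The selector is an assumption modelled on bookdown-style sidebars;
    other book formats would need their own rule.
    """

    def __init__(self):
        super().__init__()
        self.chapters = []
        self._li_stack = []  # one bool per open <li>: is it a chapter item?
        self._href = None    # href of the chapter link being read, if any
        self._text = []      # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self._li_stack.append("chapter" in classes)
        elif tag == "a" and self._li_stack and self._li_stack[-1]:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and line breaks in the link text.
            title = " ".join("".join(self._text).split())
            self.chapters.append((title, self._href))
            self._href = None
        elif tag == "li" and self._li_stack:
            self._li_stack.pop()


def extract_toc(html):
    """Return the chapter links found in an HTML page, in document order."""
    parser = ChapterLinkParser()
    parser.feed(html)
    return parser.chapters
```

Keeping the parsing separate from the download step means the scraper can be tested against saved pages, and the scheduled job only needs to fetch each book's index page and pass the HTML to `extract_toc`.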
Table of Contents dropdowns
Rationale: Often the title and book description are not enough on their own for a visitor to know what the book contains. A quick scan of the Table of Contents is a much better method and being able to do so within BBoR will make searching through volumes for the correct one much easier.
Activities:
- Incorporate table of contents for books that have it
Annual maintenance support
Rationale: The upgrade will result in the site having more moving parts and monthly service costs, which will require occasional maintenance. Three years’ worth of support is sought, as this gives enough runway to assess how much annual support costs and for Oscar Baruffa to plan how to fund additional years of support (if support is required at all).
Activities:
- Monthly database and cloud computing costs.
- Occasional debug of web scrapers
- Minimum maintenance for bugs and security fixes
Note that Fathom Data is offering to provide this labour support at cost.
Timeline and milestones
Milestones are represented with an “X” below, assuming a 1st June grant acceptance.
After the upgrade is complete, there are six-monthly milestones for compute and maintenance costs.
Other aspects
Additional non-technical activities are not required to complete this work, however as new milestones are reached, they will be communicated via:
- Oscar Baruffa’s Twitter account (4k followers), email newsletter (1000 subscribers and syndication via R-bloggers and R-weekly newsletters), LinkedIn and Mastodon.
- Big Book of R’s Twitter account (5k followers)
- Fathom Data and Andrew Collier’s channels.
Requirements
People
The project requires people with skills in data modelling, web scraping and reporting.
The data science skills will be provided by the two data science professionals on the project, i.e. Andrew Collier and Bianca Peterson.
Project ownership, vision and direction will come from Oscar Baruffa as the creator and maintainer of Big Book of R.
“Oscar Baruffa has been consistent with providing and supporting learners across the world within the R learning ecosystem. I strongly support his passion and contribution to further improve the R ecosystem, and creating a better experience for the community. Yes, this upgrade is a positive contribution.” – Timipa Ikidi
Processes
This is a relatively small project for this team, and bi-weekly progress check-ins may be needed where asynchronous communication via slack and Asana is not sufficient.
Handover to the community happens continuously as new features are developed, as this is a live website.
Tools & Tech
Database hosting will be needed to store the data.
Cloud computing will be needed to host a small API interface for data capture and to periodically run web scrapers on books. The development of the backend app interface and web scrapers is part of this proposal.
The main deciding factors on which provider/s to use will be based on cost and familiarity with the provider’s solutions.
Funding
[redacted]
Summary
Funding is required to cover development and maintenance costs.
- Development is provided by Fathom Data at their competitively priced day rates.
- Maintenance support by Fathom Data will be provided at cost.
- Oscar Baruffa’s time is provided at zero cost.
The cost of the project is comprised of 94% labour costs and 6% storage/compute fees.
“The Big Book of R is my go-to resource whenever I need to learn something new or find specific techniques, and I love browsing its pages to stay up-to-date on the latest trends and innovations in the world of R.
What I appreciate most about it is how comprehensive it is. No matter what level you’re at, from beginner to advanced, there’s something in here for you.
Regular updates keep me excited about the new and exciting developments in the R community and help me stay on top of my game. I would love to see the Big Book of R upgraded to improve the functionality and accessibility of this essential R Community resource.”- Michael Underwood
Success
Definition of done
The New Books and Trending books sections are live on the website.
When these two components are in place, it will mean that all the underlying infrastructure work has been completed.
The book has been ported from R Markdown to Quarto, with the ability to see the Table of Contents of books. When this is done, the last component of front-end work will be complete.
Measuring success
Deliverables are tangible and objective, so we can use them to measure success.
- Dataset ported to database.
- App interface to manage data.
- “New Books” section available on the website.
- “Trending Books” section available on the website.
- Table of Contents viewable
“The Big Book of R is a fantastic resource for me as someone [who] teaches R and uses it for own research. An upgrade with added functionality would improve its value for me and my students.” – Hendrik Jürges
Future work
Note that the Big Book of R will continue to be open sourced, and the webscrapers built for the upgrade will also be open sourced.
Confirmed future development:
- Big Book of R will continue to be updated and curated.
Possible future development:
- Development of an R package/API to query collection content, e.g. Jon Harmon (Chief Community Manager of R4DS Online Learning) said this would be very useful for his plans to create courses based on R programming books.
- Integration of an LLM chatbot (e.g. ChatGPT) to create a “librarian” that can help visitors find the right book based on their needs.
Key risks
Data Scientists are unavailable
If one or both of the technical delivery team are unable to do the work, this would delay the progress. A mitigating factor is that Andrew Collier is able to draw from other members of his staff at Fathom Data.
Work is more labour intensive than estimated
It can happen that work is more labour intensive than initial estimates suggest. We have tried to balance overoptimism and conservatism.
To mitigate an underestimate (which could hamper completion), a 20% contingency has been included in development costs.
Grant funding does not extend to maintenance costs
If the project is not able to receive the grant funding for the maintenance support, this could stall or break the project. Some possible mitigation steps are:
- Oscar Baruffa could cover the storage/compute costs personally.
- Attempt to crowdfund the developer support costs annually
Grant funding payments are late or do not materialise
This will stop future development until each tranche of funding is received, as neither Oscar nor Fathom Data can carry the costs of this development work.
Mitigating factors are that the milestones are self-contained, delivered in short iterations and version controlled, so there is no risk of unfinished work threatening the site’s functioning.
Other risks are very low as we are extending and improving an already functioning website.
“.. the big book of R is one of the greatest resources in the entire R ecosystem and it could use some improvements that reflect that.” – Gordon Blasco
“…this is a very positive contribution to the R ecosystem and the environment itself for this tool is something that adds to the scientific knowledge of R. Having the Big Book of R as an available tools for both beginners and seasoned coders of R makes it easier to learn new things, as well as to find new areas of interest or existing literature on a very specific topic. No other resource like this one exists.” – Daniel Sanchez
/end
Keep up to date with new data posts and Big Book of R updates by signing up to my newsletter. Subscribers get a free copy of Project Management Fundamentals for Data Analysts worth $12.
Once you’ve subscribed, you’ll get a follow up email with a link to your free copy.
The post Sharing the Big Book of R upgrade proposal appeared first on Oscar Baruffa.