Great code review is one of the most underrated skills a data scientist can have. In this blog I will share the top 3 items on my code review checklist, specifically for R code. In our data science team we regularly review each other’s code to make sure it is up to the standard it needs to be. In my opinion, the 3 elements in this blog belong in any code review for R code.
One of the greatest pitfalls of code review is developer bias. The goal of code review is to create the best code, not the code you as a developer like best. I can’t claim this blog is wiped completely clean of developer bias; we all have it. For me it is the main reason I used to dislike code review so much: I wanted to write the code I liked best. My reasoning was that if it works, why does it even matter how it works? Over the years, as members of our team came and went, I have learned what code review can add. By using best practices we design our code to be better understood and easier to maintain. Anyway, here is my code review checklist top 3.
R Code Review Checklist: 1. Helper Functions
Helper functions are without a doubt number 1 on my code review checklist. It is no exaggeration to say that, in my view, every for loop ever written in R should be replaced with a helper function. When I review someone else’s code, this is the first scan I make. At its core R is a functional programming language, which says a lot about why we should focus on this aspect. There are other reasons too.
Helper functions have many benefits. For me, since I love a great function, the first is that they make code easier to read. But seriously, one of the main reasons is how they separate your code into concise chunks. Great production-ready code consists of a central script that loads different helper functions as needed. The actual data processing, modelling, etc. is performed in the helper functions. As a reviewer, this allows me to evaluate each chunk in the context of what that specific chunk is meant to do.
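To make this concrete, here is a minimal sketch of the kind of change I would suggest in a review. The `measurements` list and the `mean_without_missing()` helper are made up for illustration; they do not come from a real project.

```r
# Hypothetical input: one numeric vector per sensor.
measurements <- list(
  sensor_a = c(1, 2, 3),
  sensor_b = c(10, 20, NA),
  sensor_c = c(5, 5, 5)
)

# Loop version: the actual logic is mixed in with the bookkeeping.
loop_means <- numeric(length(measurements))
for (i in seq_along(measurements)) {
  loop_means[i] <- mean(measurements[[i]], na.rm = TRUE)
}
names(loop_means) <- names(measurements)

# Helper version: the logic lives in one named, testable function.
mean_without_missing <- function(x) {
  mean(x, na.rm = TRUE)
}

# vapply applies the helper to every element and checks that each
# result is a single number.
helper_means <- vapply(measurements, mean_without_missing, numeric(1))
```

Both versions give the same numbers, but in the second one I can review `mean_without_missing()` on its own and reuse it elsewhere.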
Transferability of Helper Functions
Once we build our code in functional chunks, it becomes easier to transfer, both between projects and between colleagues. A core benefit of code review in R is that it allows you and your team to derive a core set of functions for your company data. Ultimately these functions could be gathered and turned into a company package.
I started this segment by focusing on for loops, but in reality helper functions should be used more widely. They help prevent repetition not only in place of for loops but also across projects. So if I see a step written out in a colleague’s code that is likely to be repeated, I mark it as a potential helper function.
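A typical candidate is a small cleaning step that gets copied from project to project. The sketch below assumes a hypothetical `clean_column_names()` helper and two made-up data frames; the exact naming rules are just one possible convention.

```r
# Hypothetical helper: standardise column names the same way everywhere.
clean_column_names <- function(df) {
  cleaned <- tolower(names(df))
  cleaned <- gsub("[^a-z0-9]+", "_", cleaned)  # runs of other characters become "_"
  cleaned <- gsub("^_+|_+$", "", cleaned)      # trim leading and trailing "_"
  names(df) <- cleaned
  df
}

# Made-up example data with messy column names.
sales <- data.frame(
  `Order ID` = 1:2,
  `Total (EUR)` = c(9.5, 20),
  check.names = FALSE
)
customers <- data.frame(
  `Customer Name` = c("A", "B"),
  check.names = FALSE
)

# The same step, written once and reused.
sales <- clean_column_names(sales)
customers <- clean_column_names(customers)
```

Once a helper like this shows up in two or three projects, it is a good candidate for the company package.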
R Code Review Checklist: 2. Package Optimization
What I remember most from starting out with R is my sense of wonder at packages. I never had much other programming experience, so stick with me on this. What was this magic that let you load functions to do stuff to your data as needed? And how many packages are currently on CRAN? The answer is 17260 packages. That’s a whole lot of code ready to work for you in your projects. One of the great skills an experienced developer can have is to be library agnostic, and this is my focus when reviewing code.
Why We Don’t Hammer Down Screws
What exactly is library agnostic? It means that you are willing and able to switch packages when needed. I’m trying to put emphasis on using the right tool for the job, which should be a central element of code review. For a long time this was a problem for me with base R: I would nearly always write base R code, as this is what I had learned and was comfortable with. I had no immediate reason to use packages such as the tidyverse or data.table; my code was working fine. But it ended up holding me back. Because I was incapable of switching packages as needed, my code was inefficient in specific areas.
Identifying these areas and giving feedback on optimizing package usage is in my top 3 jobs of the reviewer. Look at which packages the code uses and which packages might fit the specific use case better. When providing feedback in this area I like to explain the different options and their potential benefits (speed, understandability, etc.). In my opinion this also helps the reviewer think more about the code, so it is an area that benefits both sides.
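As an illustration of the kind of feedback I mean, here is the same grouped average written in base R and with dplyr. The `orders` data is made up, and the dplyr part only runs if the package happens to be installed, so treat it as a sketch rather than a benchmark.

```r
# Made-up example data.
orders <- data.frame(
  region = c("north", "south", "north", "south", "north"),
  revenue = c(100, 80, 120, 60, 110)
)

# Option 1: base R, no dependencies.
base_result <- aggregate(revenue ~ region, data = orders, FUN = mean)

# Option 2: dplyr, skipped if the package is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::summarise(
    dplyr::group_by(orders, region),
    revenue = mean(revenue)
  )
  # Both options should agree on the numbers. What differs is how the
  # code reads and, on larger data, possibly how fast it runs.
  stopifnot(isTRUE(all.equal(base_result$revenue, dplyr_result$revenue)))
}
```

In a review I would not say one of these is wrong. I would point out the alternative and explain when it might pay off.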
R Code Review Checklist: 3. Code Placement
When I write an initial script for a project, it always starts out quite structured. It’s like a ritual: load data, select, transform, make visualizations, etc. But then one thing always happens. I notice something in my output, a weird pattern. Or perhaps it’s a colleague who wants a slight alteration of the insights. What happens to my code now? I end up placing a quick filter on row 241. A week goes by, maybe even a month, and I revisit the project. The scope has changed slightly and I start adding code again.
If you are lucky this is a project that ends at some point, and your code gets retired with it. But what if it becomes a success? We are going to need production-ready code that is shareable within the organization. Code review is all about this: we focus on reordering code and optimizing its placement. As a code reviewer, split the code in front of you into use cases and their appropriate chunks. If someone loads data outside the ‘load_data’ chunk, mark it. If someone filters their data within a data visualization, mark it.
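Here is a minimal sketch of what well-placed code could look like. The function names `load_data()`, `prepare_data()` and `plot_data()` are my own illustration, and the built-in `mtcars` dataset stands in for whatever file or database a real project would read.

```r
# load_data chunk: the only place where data enters the script.
# mtcars ships with R and stands in for a real file or database read.
load_data <- function() {
  datasets::mtcars
}

# prepare_data chunk: every filter and transformation lives here,
# including the "quick filter" that would otherwise end up on row 241.
prepare_data <- function(cars, min_mpg = 15) {
  cars <- cars[cars$mpg >= min_mpg, ]
  cars$weight_kg <- cars$wt * 1000 * 0.4536  # wt is recorded in 1000 lbs
  cars
}

# plot_data chunk: draws what it is given and filters nothing.
plot_data <- function(cars) {
  plot(cars$weight_kg, cars$mpg,
       xlab = "Weight (kg)", ylab = "Miles per gallon")
}

# Central script: reads top to bottom as load, prepare, visualise.
cars <- prepare_data(load_data())
if (interactive()) plot_data(cars)
```

When the scope changes, the new filter goes into `prepare_data()` and the plot code stays untouched, which is exactly what I look for as a reviewer.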
The value of great code placement is difficult to overestimate. It hits all the relevant boxes of code review. It improves readability of code and transferability between projects, and very often increases its functionality too.
Wrapping Up Code Review Checklist
These 3 points together make up the core of my code review, and they belong on any code review checklist. They also have one thing in common: they all focus on the outer appearance of code. I am a strong believer that this is the biggest benefit of code review. By optimizing readability and transferability we create much-needed synergies within data science teams and projects.
You will find online articles that see validation of the output of code as part of code review. I personally think that any validation is part of the code writing process itself. If you don’t have checks and balances in your code that validate your inputs and outputs, that’s a big problem. The role of the code reviewer is to check that these elements are present, but the reviewer is not responsible for writing or running them.
This blog was very focused on the ‘how’ of code review. If you are interested in an introduction to the what and why of code review, why it benefits your team or company, or a practical example of code review, I will release content on those subjects soon. Hopefully, taken together, some of it can help you get started with code review.
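As a rough idea of what such checks can look like, the sketch below validates both the input and the output of a hypothetical `add_revenue_share()` helper. The function and the `orders` data are made up, and the named messages inside `stopifnot()` assume R 4.0.0 or newer.

```r
# Hypothetical helper that validates its own input and output.
add_revenue_share <- function(orders) {
  # Input checks: fail early with a clear message.
  stopifnot(
    "orders must be a data frame" = is.data.frame(orders),
    "orders needs a numeric 'revenue' column" = is.numeric(orders$revenue),
    "revenue must not contain missing values" = !anyNA(orders$revenue),
    "total revenue must be positive" = sum(orders$revenue) > 0
  )

  orders$share <- orders$revenue / sum(orders$revenue)

  # Output check: the shares should add up to 1.
  stopifnot(
    "shares must sum to 1" = isTRUE(all.equal(sum(orders$share), 1))
  )
  orders
}

# Made-up example data.
orders <- data.frame(id = 1:3, revenue = c(50, 30, 20))
orders <- add_revenue_share(orders)
```

As a reviewer I only check that checks like these exist and make sense; writing them is the job of the person writing the code.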