A. What is Code Review?
Code reviews are traditionally done in the context of a software development team that is building out a new product or feature. The goal is to ensure that anything added to the common code base is free of bugs, follows established coding conventions, and is optimized. I first experienced code reviews after transitioning from working as a statistical analyst to a data scientist. One of the most important lessons I’ve learned over the past few years is that code reviews are critical for data science teams to ensure that good code and accurate analysis are being shipped. In this post, I will review the practices that I’ve found most useful in my work leading code reviews. The post is specific to the R language, as that is the primary language my team uses for analysis.
B. Why Conduct Code Reviews?
The primary stated benefit of code review in industry is improving software quality. Having small groups of colleagues review each other’s classes, functions, closures, and so forth on a regular basis helps ensure that the team writes elegant code, which in turn benefits the overall process or software that is being constructed. For data scientists and advanced analytics professionals, the rationale for conducting code reviews is similar. We want to write efficient code that contains sound logic and produces the appropriate output.
There are two other benefits to conducting code review that are worth mentioning.
- Consistent Design
Code review can help enforce a consistent coding style that makes the source code readable by every member of the team. If all members of the data science team follow a single coding style, different parts of the project can be passed from one team member to another with greater fluidity. Emphasising a single coding style during the code review process ensures consistent design and contributes to the maintainability and longevity of the code.
- Knowledge Sharing and Mentorship
Code reviews also allow colleagues to learn from one another and junior folks to learn from more experienced team members. When all team members review others’ code, employees at different experience levels learn by working to understand code they did not write. Employees can also share new technologies and techniques with each other during the review process.
C. What Code Should be Reviewed?
As data scientists, we often write processes using R, Python, or another language where certain inputs are taken, a series of analyses is executed, and the desired results are generated. This type of process should generally be ‘automated’ and scheduled to run at particular times.
Consider the following R project. Let’s say that only one person is working on this, but they are part of a team of three data scientists.
The directory with R code contains the following files.
- basic_eda.R
This file is just a place where the employee takes the data set and does some exploratory data analysis. The goal is to better understand the data through data visualization, simple regression models, and so forth. This file is really meant for running once or twice, and won’t be part of the eventual pipeline.
- dataset_builder.R
The dataset builder will eventually be part of a pipeline where a SQL query is used to pull the raw data and construct the processed input data. This file will use user-defined functions to undertake these actions, and the goal is for this part of the process to be as abstracted as possible.
- execute_analysis.R
This is the main execution file that runs the full analysis. It sources in the dataset builder and modeling functions, and conducts the desired process. This file will also need to contain the parameters that set the filtering criteria and otherwise dictate how the analysis is run.
- helper_functions.R and modeling_functions.R
The helper and modeling functions files contain user-defined functions that are used in other parts of the analysis. These functions need to be fairly abstract and reusable. The basic idea is that many tasks can be abstracted into a function or piece of code that can be reused regardless of the specific task.
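As a rough sketch of what “abstract and reusable” could look like in practice, the snippet below assumes a hypothetical helper, filter_by_threshold(), and a hypothetical parameter list at the top of the execution file. The function name, column names, and threshold are illustrative assumptions, not the contents of the project’s actual files.

```r
# Hypothetical contents of helper_functions.R: a small, reusable helper.
# Keeps rows where `column` is at least `min_value`; rows with NA are dropped.
filter_by_threshold <- function(data, column, min_value) {
  stopifnot(is.data.frame(data), column %in% names(data))
  keep <- !is.na(data[[column]]) & data[[column]] >= min_value
  data[keep, , drop = FALSE]
}

# Hypothetical top of execute_analysis.R: parameters live in one place.
# In a real project this file would first call source("helper_functions.R").
params <- list(
  filter_column = "revenue",
  min_value     = 100
)

# Toy input standing in for the output of dataset_builder.R.
raw_data <- data.frame(
  customer = c("a", "b", "c", "d"),
  revenue  = c(50, 150, NA, 300)
)

analysis_data <- filter_by_threshold(
  raw_data, params$filter_column, params$min_value
)
print(analysis_data)
```

Because the helper takes the column name and cutoff as arguments, the same function can be reused for a different filter by changing only the parameter list.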
So given the files that are available in this example project, what files should be evaluated during a code review process?
In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review lines of code that contain complex logic and may benefit from a glance from other team members.
Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. The files that should receive attention during code review are dataset_builder.R and execute_analysis.R. These are the files with the bulk of the complex logic and so it would help to see if any issues are present in that code.
D. How Frequently Should Code Review be Performed?
I lead the code review process on the data science team at my current employer. If I were the manager, I’d push for the team to hold a two-hour code review every week on Thursday or Friday, during which every member of the team would have their critical code reviewed. Currently, we don’t do that, and code reviews occur on an as-needed basis. This “works” to some extent, but the frequency of code reviews ends up being dictated by how much time the team spends on complex processes.
E. How to Conduct a Code Review?
During each session, here are the instructions that I set forth to guide the code review.
- Every member of the team will focus on reviewing code produced by the other members. So each person on the data science team will have to review code from two others.
- A copy of each R file that needs review should be made and shared with the other two members of the team. Ideally, this file should contain fewer than 500 lines of code.
- The reviewer should use the file shared by the original author.
- The reviewer should note any issues, suggestions, or recommendations using comments that are in upper case.
- Any suggestions made about specific code should reference the function, line number, or section.
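To illustrate the upper-case comment convention, here is a minimal sketch of how a reviewed file might look. The function compute_conversion_rate() and the “REVIEWER:” prefix are hypothetical and used only for illustration.

```r
# Author's code under review (hypothetical function in dataset_builder.R).
compute_conversion_rate <- function(conversions, visits) {
  # REVIEWER: compute_conversion_rate(), DIVISION BELOW. WHEN visits IS 0
  # THIS RETURNS NaN OR Inf. SUGGEST RETURNING NA FOR THOSE ROWS INSTEAD.
  conversions / visits
}

# One way the author might respond to the comment.
compute_conversion_rate_fixed <- function(conversions, visits) {
  ifelse(visits > 0, conversions / visits, NA_real_)
}

print(compute_conversion_rate(c(5, 0), c(10, 0)))
print(compute_conversion_rate_fixed(c(5, 0), c(10, 0)))
```

Upper-case comments stand out from the author’s own lower-case comments, so they are easy to find and remove once the feedback has been addressed.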
F. What Factors Should be Considered During a Code Review?
When the reviewer is looking at an R file for code review, here are the specific factors that they should evaluate.
- Does this code accomplish the author’s purpose?
- Are there any obvious logic errors in the code?
- Looking at the requirements, are all cases fully implemented?
- Does the code conform to existing style guidelines?
- Are there any areas where code could be improved? (made shorter, faster, etc.)
- Is this the best way to achieve the desired result?
- Does the code handle all edge cases?
- Do you see potential for useful abstractions?
- Were the unit tests appropriate?
- Is there adequate documentation and comments?
Whenever the reviewer suggests a change, I recommend that they provide a legitimate reason for it.
Furthermore, I provide the following guidance.
- Think like an adversary, but be nice about it. Try to “catch” authors taking shortcuts or missing cases by coming up with problematic configurations/input data that breaks their code.
- Compliment and reinforce good practices. One of the most important parts of the code review is to reward developers for growth and effort.
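As a small illustration of “thinking like an adversary,” the sketch below runs a hypothetical function, share_above(), against a few awkward inputs a reviewer might try. The function and the inputs are assumptions made up for this example.

```r
# Hypothetical function under review: share of values above a cutoff.
share_above <- function(values, cutoff) {
  mean(values > cutoff)
}

# Inputs a reviewer might try in order to "break" the function.
adversarial_inputs <- list(
  typical     = c(10, 20, 30),
  empty       = numeric(0),
  has_missing = c(10, NA, 30),
  all_missing = c(NA_real_, NA_real_)
)

results <- vapply(
  adversarial_inputs,
  function(x) share_above(x, cutoff = 15),
  numeric(1)
)
print(results)
```

Here the empty input yields NaN and any input containing a missing value yields NA. A reviewer could raise both cases and ask whether the author intends that behaviour or should handle it explicitly, for example by removing missing values before taking the mean.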
These are some of the best practices that I’ve found from leading code review sessions on a small data science team. There is no single right way to set up a code review process and it will likely be dictated by the size of the team and type of work.