Peer Reviews for Data Science
If it’s worth doing, it’s worth reviewing
Peer reviews increase the quality of data science output by diversifying our thinking and decreasing the probability of making an error.
Once you’ve felt the terrible sinking feeling in your stomach, you never forget it. I was in my first year as a data scientist in my first industry job, still getting used to the tools and cadence of the tech world. My assignment was to analyze a fairly simple A/B test, but one with company-level revenue implications, and I had just completed the read-out to the engineering team. I returned to my desk to answer some follow-up questions, and was scanning my analysis code from the top to figure out what parts needed to change when I noticed the inner join that should have been a left join. I had unintentionally excluded a bunch of zeros from my analysis. Everything I’d just spent an hour explaining to a dozen very interested engineers might be wrong.
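The join mistake above is easy to reproduce in a few lines. The tables and values here are made up purely for illustration, with pandas standing in for SQL:

```python
import pandas as pd

# Hypothetical data: every user in the experiment, and only the users
# who made a purchase. Users 3 and 4 had zero spend.
users = pd.DataFrame({"user_id": [1, 2, 3, 4]})
purchases = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, 30.0]})

# Inner join: users with no purchases disappear entirely.
inner = users.merge(purchases, on="user_id", how="inner")

# Left join keeps every user; missing purchases become NaN, which we
# explicitly treat as zero spend.
left = users.merge(purchases, on="user_id", how="left")
left["amount"] = left["amount"].fillna(0.0)

print(inner["amount"].mean())  # 20.0 -- average over purchasers only
print(left["amount"].mean())   # 10.0 -- average over all users
```

Both queries run without error and return plausible numbers, which is exactly why this class of bug is so easy to miss.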
At that point in my career, I wasn’t in a particularly unusual position — today’s data scientists, even relatively junior ones, have a broad scope and a large degree of autonomy. While these circumstances are great for engendering the creativity and ownership necessary to produce high-quality data science output, they can also result in excessive specialization to the detriment of the team, and (as I had just discovered) errors going unnoticed. I took a deep breath, fixed my query, and re-ran my script. As I waited, I spiraled, wondering how I could have missed the error.
Through nothing other than dumb luck, the error turned out not to affect my overall conclusions. I updated my document, added a note on the correction, and started following up on the questions. Nevertheless, that moment of terrible uncertainty left its mark, and when I started leading other data scientists I wanted them to learn from my mistake. In that case, I found my own error through repetition and luck, but at Square, we’ve found that peer reviews are more effective at catching issues than hoping we’ll all notice our own. They also help break down barriers between data scientists, forge a stronger team community based on shared knowledge, and ultimately increase the quality of data science output.
So what are data science peer reviews? The process we created at Square is inspired by two related traditions: code reviews in software engineering and peer reviews in scientific research (I spent 10 years as a neuroscientist in the days before widespread adoption of bioRxiv enabled broad dissemination of unreviewed work). Both traditions are well covered in the public literature, so we won’t revisit them here. Data science work requires tools from both endeavors, so we have attempted to define a review process appropriate for this newer discipline. The goal is to gain the advantages of having more than one person checking for errors, looking for missing pieces, and thinking about problems and solutions. A secondary benefit of peer reviews is that they spread information and context across the team, which reduces silos and creates more shared understanding and team cohesion.
At minimum, there are two people involved in data science peer reviews — the primary producer of the work and the reviewer — and we’ll discuss expectations for each.
The primary producer of any data science work is ultimately responsible for the quality of the output. Without vigilance, teams tend to leave the producer alone with that responsibility, without adequate support or structure for improving their output. With peer reviews, producers are also responsible for identifying one or more reviewers when they start work (roughly corresponding to when a ticket moves from to-do to in-progress, if you track your work in tickets). Producers should share their code, written documents, figures, etc. in a format that makes in-line commenting possible. Finally, the producer is responsible for letting the reviewer know when the work is ready to be reviewed, sharing links to relevant context and related work, and providing a few representative examples or test cases that they used to manually verify their approach.
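As a sketch of what those representative test cases might look like (all table and column names below are hypothetical), the producer can attach a small script that re-derives a few output rows directly from the raw data:

```python
import pandas as pd

# Pretend pipeline output: one row per merchant with total volume.
output = pd.DataFrame(
    {"merchant_id": ["a", "b", "c"], "total_volume": [150.0, 0.0, 75.0]}
)

# Raw events used to verify a few representative cases by hand,
# including a merchant with no events at all.
events = pd.DataFrame(
    {"merchant_id": ["a", "a", "c"], "amount": [100.0, 50.0, 75.0]}
)

for merchant in ["a", "b", "c"]:
    expected = events.loc[events["merchant_id"] == merchant, "amount"].sum()
    actual = output.loc[output["merchant_id"] == merchant, "total_volume"].iloc[0]
    assert actual == expected, f"mismatch for {merchant}: {actual} != {expected}"
print("spot checks passed")
```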
A peer review should consist of the following three parts:
- Direct inspection (and execution!) of the code
- Spot-checking several examples
- Checking against alternative data sources or methods and existing analyses
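The third component can be as simple as re-deriving a headline number with a different tool. A minimal sketch with synthetic data, using a pandas groupby as the computation under review and plain NumPy on the raw arrays as the independent check:

```python
import numpy as np
import pandas as pd

# Synthetic A/B-test data for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "group": rng.choice(["control", "treatment"], size=1000),
        "converted": rng.integers(0, 2, size=1000),
    }
)

# Computation under review: conversion rate per group via pandas.
rates_pandas = df.groupby("group")["converted"].mean()

# Independent re-derivation with plain NumPy.
for group in ["control", "treatment"]:
    mask = df["group"].to_numpy() == group
    rate_numpy = df["converted"].to_numpy()[mask].mean()
    assert np.isclose(rates_pandas[group], rate_numpy)
print("cross-check passed")
```

When the two paths disagree, at least one of them encodes a wrong assumption, and the reviewer has found something worth discussing.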
For very long analyses, review the logic and technical details for the key results first. This is one reason a review takes much less time than the original production: the reviewer has the benefit of hindsight, and only needs to check the assumptions and findings that turned out to matter. Given the ever-present need to prioritize our time, we recommend spending less time checking the less important parts of the work.
Should I develop a partnership with an individual reviewer so that there is a default person with context and domain knowledge?
One of the goals of this process is to distribute knowledge and context throughout the team, so tight pairs of reviewers are discouraged. However, early in rolling out a peer review process, it is reasonable to turn first to the person with the most relevant knowledge.
Is it OK to share results with stakeholders before getting a peer review?
In special circumstances, at the discretion of the producer, it is OK to share results before getting a peer review. We advise warning any viewers that the work has not yet been reviewed and may therefore be subject to change. The motivation for sharing unreviewed work is the observation that even completely non-technical or non-expert stakeholders can provide useful considerations, context, and follow-up questions that improve the analysis before completion. In this way, stakeholders often act as supplementary peer reviewers. The intention of sharing unreviewed work is to get feedback and help improve the work, NOT to give a sneak peek of results that may change and thereby cause confusion or, worse, distrust.
At what stage should I identify a reviewer? (When planning a project, starting a project, or completing a project)?
Generally you should identify and contact a reviewer when starting a project.
What if I’m working on something that does not need a high degree of precision or accuracy, i.e. if I’m just looking for rough estimates?
Unfortunately, it is difficult (or even impossible) to know ahead of time whether an error in even a simple query has a small or a large effect. One common cause of errors in data science is an incorrect assumption about the structure or meaning of data: for example, assuming rows are unique along a given column, or assuming that NULL values are uncommon. This type of error can have a huge impact on the results, even though it appears to be a negligible detail at first glance.
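Both of those assumptions are cheap to test explicitly during a review. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical table where both assumptions happen to fail.
orders = pd.DataFrame(
    {
        "order_id": [101, 102, 102, 103],     # note the duplicate
        "discount": [0.1, None, None, None],  # mostly NULL
    }
)

# Assumption 1: order_id uniquely identifies a row.
dupes = orders["order_id"].duplicated().sum()
print(f"duplicated order_ids: {dupes}")  # 1, so the assumption is wrong

# Assumption 2: NULL discounts are rare.
null_rate = orders["discount"].isna().mean()
print(f"null rate: {null_rate:.0%}")  # 75%, so this assumption is wrong too
```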
Should I focus narrowly on technical correctness, or should I also review logic and business rationale?
Taking a broad view is more valuable: technical correctness does not matter if the work answers the wrong question or ignores important context.
How do I review work when I am unfamiliar with the domain, context, data structures, etc?
For unfamiliar domains, we recommend setting up a live meeting to discuss the work. Ask for existing documentation. Check assumptions and look for errors using the three components of any peer review (above).
What level of detail and care is appropriate for peer reviews?
Remember, the producer is still ultimately responsible for the quality of the output. Given that time is always limited, it is reasonable to be judicious with yours: what parts of the work are most likely to contain an error, which are most critical to the key recommendations, where have you seen issues in the past? Focus on those critical areas, and spot-check the rest. Most peer reviews should take about an hour to complete.
Is it a good idea to discuss live, or can peer reviews be entirely asynchronous?
For unfamiliar topics, a live meeting is often helpful and actually decreases the total time spent reviewing. In areas where you already have some knowledge, asynchronous peer reviews are often preferable.
Are reviews worth the effort?
Do standardized sample means tend towards the standard normal distribution for independent and identically distributed random variables?!? (In other words: yes.)
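For readers who want the quip made concrete: it is the central limit theorem, and a quick simulation (ours, not part of the review process) shows standardized sample means of a heavily skewed distribution behaving like a standard normal:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 100, 20_000

# Exponential(1) is skewed, with mean 1 and standard deviation 1.
samples = rng.exponential(scale=1.0, size=(trials, n))

# Standardize each sample mean: subtract the true mean, divide by
# the standard error sigma / sqrt(n).
standardized = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

print(round(standardized.mean(), 2))  # close to 0
print(round(standardized.std(), 2))   # close to 1
# Roughly 95% of standardized means fall within +/- 1.96, as for N(0, 1).
print(round((np.abs(standardized) < 1.96).mean(), 2))
```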