Lessons Learned From Running Web Experiments
Unveiling key strategies & frameworks
I am a Product Data Scientist on the Ecosystem Discovery team at Square, which helps design, develop, and maintain several key pages of the public website squareup.com, including the Homepage, Pricing, Why Square, and Point of Sale pages. These pages receive a significant volume of visitors who want to discover and sign up for Square solutions. Because of that traffic volume, any modification can have a significant impact on downstream metrics and on the multiple cross-functional teams that depend on these pages for seller acquisition and lead generation. Experimentation is therefore one of the key focus areas for our team: it gives stakeholders visibility into the impact of changes, helps assess risk, and enables data-driven decision making. Here are some of the key learnings from the dozens of experiments our team has run on website traffic:
Note: At Square we avoid the term ‘user’ when talking about our customers or potential customers; we call them ‘sellers’ instead, to keep a clearer picture of who the audience is while building products. In this post I use the words sellers and visitors interchangeably depending on the context.
Square’s Ecosystem Discovery team leverages several metrics that measure different aspects of the website: metrics for lead generation, where potential sellers reach out to sales for help with a product or software purchase; metrics for brand and marketing, which focus on awareness; metrics for product teams, which focus on onboarding sellers to the right products; and many more.

Based on the goal of the experiment, have a clear understanding of which metrics are the key decision-making metrics (primary metrics, e.g. revenue), which provide additional insight (secondary metrics, e.g. bottom-of-funnel conversions), and which are do-no-harm monitoring metrics (guardrail metrics, e.g. page load speed). This clarifies which metrics decide the success of the test. Sometimes, however, primary and secondary metrics do not move in the same direction, which makes the rollout decision more complicated. In these cases, a trade-off matrix (see the example below) that clearly defines which metrics matter most, and the trade-offs to make when metrics diverge, can be very helpful. It gives stakeholders a good starting point to brainstorm a decision for every scenario and align on the rollout as a team.

The trade-off matrix below shows one way to make different decisions based on the varying importance of metrics. It does not cover every possible case; add or remove scenarios based on the experiment and team needs. Also note that the significance-level categories in the legend below are not universal rules, just one way to categorize results, and can vary depending on risk tolerance.
- Green and Red → (>= 95% confidence) Statistically significant
- Blue → (90%-95% confidence) Directional
- Grey → (<90% confidence) Failed to reject null hypothesis or neutral
| Scenario | Primary metric | Secondary metric 1 | Secondary metric 2 | Decision |
|---|---|---|---|---|
| A | positive | neutral (within a threshold) | neutral (within a threshold) | Roll out the variant as is. |
| B | positive | positive | neutral (within a threshold) | Roll out the variant as is. |
| C | directionally positive | neutral | positive | Perform a deep dive analysis, revise the variant and test again. |
| E | neutral | neutral | neutral | Perform a deep dive analysis, revise the variant and test again. |
| G | neutral | neutral | negative | Do not roll out the variant, re-evaluate strategy for the variant. |
| H | neutral | negative | any | Do not roll out the variant, re-evaluate strategy for the variant. |
| I | negative | any | any | Do not roll out the variant, re-evaluate strategy for the variant. |
An example trade-off matrix for experiment result evaluation
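The legend and a few rows of the example matrix above could be encoded as a small helper like this. This is a hypothetical sketch, not production tooling: the thresholds mirror the legend's example categories, the `decide` rules cover only some rows of the matrix, and all names are illustrative.

```python
# Sketch: bucket a metric's confidence level into the categories used in the
# trade-off matrix. Thresholds mirror the example legend, not universal rules.
def categorize(confidence: float, lift: float) -> str:
    if confidence >= 0.95:
        return "positive" if lift > 0 else "negative"
    if confidence >= 0.90:
        return "directionally positive" if lift > 0 else "directionally negative"
    return "neutral"

# A few rows of the matrix as (primary, secondary1, secondary2) -> decision.
# "any" matches every category; unmatched scenarios fall through to a deep dive.
DECISIONS = [
    (("positive", "neutral", "neutral"), "Roll out the variant as is"),
    (("neutral", "neutral", "negative"), "Do not roll out; re-evaluate strategy"),
    (("neutral", "negative", "any"), "Do not roll out; re-evaluate strategy"),
    (("negative", "any", "any"), "Do not roll out; re-evaluate strategy"),
]

def decide(primary: str, s1: str, s2: str) -> str:
    for pattern, decision in DECISIONS:
        if all(p in ("any", v) for p, v in zip(pattern, (primary, s1, s2))):
            return decision
    return "Perform a deep dive analysis, revise the variant and test again"
```

Encoding the matrix as data rather than prose also makes it easy to review with stakeholders and to extend with new scenarios as team needs evolve.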
On Square's public website, we see a combination of anonymous visitors (likely Prospects looking for potential solutions) and Existing merchants who want to log in to their Dashboard or check out new offerings. A Prospect becomes an Existing merchant after they sign up on the website. Before deciding to bucket an A/B test only for ‘Anonymous visitors’ or only for ‘Existing merchants’, we consider how a visitor's state might change during the test and how that impacts their experience. For instance, in an A/B test that targets only Anonymous visitors, a visitor who starts off ‘Anonymous’ but signs up during the same session falls out of the test experience (and possibly starts seeing a different variant). This not only creates noise in the test results but can also cause a negative visitor experience. Certain third-party tools offer built-in sticky bucketing, which ensures that the visitor sees the same variant throughout, even if their state changes, and can be helpful in these cases. If sticky bucketing is not an option, consider targeting all visitors and segmenting the results in a post-experiment deep dive to better understand their behavior. When segmenting results, remember to adjust for multiple hypothesis tests using correction techniques like the Bonferroni correction.
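A common way sticky bucketing is implemented is deterministic hashing on an identifier that stays stable across the state change. Here is a minimal sketch; the assumption is that `visitor_id` comes from something that survives signup (for example, a first-party cookie set before the visitor authenticates), and the function names are illustrative, not any specific tool's API.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Deterministically assign a variant from a stable visitor ID.

    Because the assignment is a pure function of (experiment, visitor_id),
    the same visitor always lands in the same bucket, even after their
    state changes from Anonymous to Existing merchant.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

The key design choice is keying the hash on both the experiment name and the visitor ID, so assignments are independent across experiments but sticky within one.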
The Ecosystem Discovery pages, like the Homepage and Pricing page, are high-visibility, high-traffic, front-facing pages impacting Marketing, Brand, and Product teams. Any new changes and test setups are heavily vetted, i.e. QA’d, before the A/B test starts. However, since these are front-facing pages, testing on 100% of the traffic (with a 50/50 split) can be risky: there can be uncaught bugs in the page or test implementation, page load speed can be impacted, links can break, metrics can drop suddenly, or there can be vulnerabilities around bots. To mitigate these risks, draft an experiment rollout plan with a decision-making framework to ramp up the experiment traffic in phases. This captures metric changes on a small percentage of traffic first, brings down the risk by ensuring that any big issues are caught early, avoids a negative seller experience, and gives stakeholders confidence about rolling out the experiment to a higher percentage of traffic. The matrix below shows one way an experiment can be ramped up for a single-variant experiment.
| | Phase 0 - Debug | Phase I - Expand | Phase II - Experiment | Phase III - Full Rollout |
|---|---|---|---|---|
| Objective | Confirm that the new redesign doesn’t hurt guardrail metrics. | Confirm that the new design doesn’t hurt key metrics (get directional results). | Determine the winning design for the page. | Continue optimization to improve metrics. |
| Test split | 5% Variant / 95% Control | 20% Variant / 80% Control | 50% Variant / 50% Control | 100% Winning Variant, and continue iterating |
| KPIs to check | Guardrail metrics | Primary metric | Primary metric | Monitor long-term curing metrics post launch. |
| Criteria to move to the next phase | No harm (+/- 5%) to guardrail metrics. No bugs detected. | Primary metric is directionally positive/flat (90% stat sig). | Primary metric is positive with 95% stat sig (or use the trade-off matrix). | Long-term metrics are directionally positive/flat. |
| Duration | 3 days | 2 weeks | 2 weeks | Curing period for long-term metrics |
An example experiment rollout plan for a page redesign with single variant
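The phased ramp above can be implemented with stable hash buckets, so that raising the exposure percentage from Phase 0 through Phase II only moves visitors from control into the variant; nobody who has already seen the variant gets sent back. A minimal sketch, with illustrative function names:

```python
import hashlib

def ramp_bucket(visitor_id: str, experiment: str) -> int:
    """Map a visitor to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_variant(visitor_id: str, experiment: str, ramp_pct: int) -> bool:
    """Expose the variant to ramp_pct% of traffic.

    Because buckets are stable, the variant population at 5% (Phase 0)
    is a strict subset of the population at 50% (Phase II), so ramping
    up never flips an exposed visitor back to control.
    """
    return ramp_bucket(visitor_id, experiment) < ramp_pct
```

Moving between phases is then just a config change to `ramp_pct`, which keeps the rollout plan and the implementation in sync.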
It takes a village to run a good A/B test. At Square, we have amazing folks working alongside the Data Scientist on various aspects of A/B testing: the UX Research team to gather hypotheses and identify seller problems, the Design team for explorations and building pages, the Engineering team for implementation, tracking, site speed, and platform support, Platform teams for test setup, tooling, and QA, and the Product Manager to provide strategy across the board and get alignment with stakeholders. We also have many stakeholders, including Marketing, Sales, Brand, and Lead Gen, who provide input and feedback on the design and A/B test strategy. With stakeholders, it always helps to err on the side of over-communication and frequent updates that are thorough and transparent. At Square, we typically have many sessions dedicated to presentations and feedback, where all the key stakeholders are invited to provide their input. We also rely heavily on good documentation to gather feedback offline. Understanding the best ways to collaborate with these partners, supporting them with data insights, and keeping them informed at every step can go a long way in designing and running a successful experiment.
Experiment analysis can be repetitive, time consuming, and error prone if Data Scientists manually run the same queries for every experiment. To avoid this, we started by building a Python automation framework for redundant tasks like calculating experiment duration, determining lift between the variants across multiple metrics, and determining statistical significance. We then generalized it to scale across teams, which sped up experiment calculations significantly. We recently invested in a third-party tool to scale up automation further; it automatically calculates experiment results and lets non-technical stakeholders see them easily. We also have a dedicated experimentation program team at Square, which has been valuable in building tools and processes that enable teams across Square to A/B test effectively. Investing in robust tooling for experimentation automation and having a dedicated experimentation program team can substantially improve experimentation velocity and, in turn, drive better results!
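As a sketch of the kind of calculation such a framework automates, here is a minimal lift and significance computation for a conversion-rate metric, using a pooled two-proportion z-test. This is an illustrative standard-library implementation, not Square's internal framework; it assumes a non-zero baseline conversion rate.

```python
from math import erf, sqrt

def lift_and_pvalue(conv_c: int, n_c: int, conv_v: int, n_v: int):
    """Relative lift and two-sided p-value for a conversion A/B test.

    conv_c / n_c: conversions and visitors in control;
    conv_v / n_v: conversions and visitors in the variant.
    Uses a pooled two-proportion z-test (normal approximation).
    """
    p_c, p_v = conv_c / n_c, conv_v / n_v
    lift = (p_v - p_c) / p_c  # relative lift over control
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return lift, p_value
```

Running this for every metric in a loop, instead of hand-written per-experiment queries, is exactly the kind of redundancy such a framework removes.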
At Square we have an open communication culture, which enables easy knowledge sharing and provides visibility. Thus documenting best practices and sharing learnings not only benefits the team, but also the entire company, which can be super valuable in building the knowledge base and increasing the impact of the A/B test results. It empowers new hires to learn from best practices and learnings from prior experiments and build on top of it, instead of reinventing the wheel. Sharing learnings can also start conversations that help uncover better experimentation practices or tools. Our team takes an additional step of having a retrospective meeting after every major experiment to understand gaps, assess solutions, and document learnings, which better equips our team for future experiments.
It’s natural to get excited when experiment results are positive, or to feel disappointed when the variant performs worse or the results are inconclusive. However, it’s best not to take the results personally. It is completely alright for an A/B test to not generate positive results or to perform worse; a lot of the time, it’s expected. According to an HBR report, the vast majority of A/B tests fail. Negative A/B test results can stop a rollout that would have harmed metrics, save the company money in the long run, and avoid a negative visitor experience on the website. Followed up with a deep dive, they can also provide valuable perspective on future changes. Inconclusive tests, on the other hand, can typically be avoided by ensuring that you have a clear hypothesis, a sufficient sample size, and an adequate test duration, and by prioritizing A/B tests that have the potential to yield substantial differences. The more you A/B test important changes, the more you learn from them and the better your chances of seeing positive results over time!
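On the "sufficient sample size" point, a quick pre-test power calculation can flag tests that are likely to come back inconclusive. Here is a sketch using the standard two-proportion normal-approximation formula; the defaults of alpha = 0.05 and 80% power are common conventions, not Square-specific values.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate visitors needed per arm to detect a relative lift.

    p_base: baseline conversion rate; mde_rel: minimum detectable
    relative lift (e.g. 0.10 for +10%), for a two-sided z-test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_var = p_base * (1 + mde_rel)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2)
```

Note how sharply the required sample size grows as the detectable effect shrinks; this is why prioritizing changes with the potential for substantial differences matters so much.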
Special thanks to Victor Umansky for his ideas on a framework-driven approach to experimentation, which greatly enriched this blog post.
We are hiring if you’d like to join us!
- 1. Create a metric hierarchy and trade-off matrix to simplify rollout decisions
- 2. Ensure that the bucketing is done correctly
- 3. Lower the risk by ramping up A/B test traffic in phases
- 4. Learn the best ways to collaborate with internal teams and stakeholders
- 5. Invest in experiment automation
- 6. Document best practices and share learnings
- 7. Don’t take A/B test results personally