Evolution of Developer Productivity at Square - Part Four
Investing in reliability and test engineering
Welcome to the fourth and final part of our blog series, where we focus on a crucial aspect of our journey at Square: investing in reliability and test engineering. Our goal is to deliver high-quality products that delight Square customers. However, as our codebase and Square’s customer base grew, maintaining product quality became increasingly demanding. We started seeing a growing number of customer cases tied to reliability. A deep dive showed that many of these stemmed from bugs that could’ve been caught during the development and testing phases. Recognizing this, we took a hard look at our development cycle and decided to refocus on improving reliability.
The first order of business was to cultivate a strong reliability culture. Every quarter, we gathered GMs and teams from engineering, product, design, QA, and customer success for a comprehensive Reliability Operational Review. It wasn’t just a catch-up, but an exercise in collecting insights, spotting recurring issues, and most importantly, charting out concrete next steps. Through these sessions, our mindset started to shift. Business units started committing to Service Level Objectives for their crucial services. Monthly reports kept up the momentum and ensured everyone was on the same page. Various product engineering teams even formed a special task force, which we lovingly called the 'tiger team', that went on to tackle and resolve the top performance issues across different business units.
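To make the idea of committing to a Service Level Objective concrete, here is a minimal sketch of how a monthly report might track one. The `SloWindow` class, the `payments-api` service name, and the request counts are all hypothetical and are not taken from Square's tooling; the sketch only illustrates the common availability and error-budget arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one service over one reporting window (hypothetical shape)."""
    service: str
    target: float          # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    @property
    def availability(self) -> float:
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing failed
        return 1 - self.failed_requests / self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent; negative when overspent."""
        allowed_failures = (1 - self.target) * self.total_requests
        if allowed_failures == 0:
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1 - self.failed_requests / allowed_failures


# Illustrative numbers only.
window = SloWindow("payments-api", target=0.999,
                   total_requests=1_000_000, failed_requests=250)
print(f"{window.service}: availability={window.availability:.5f}, "
      f"budget remaining={window.budget_remaining:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. A report built this way gives each business unit a single number to watch between quarterly reviews.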
As our tech landscape shifted, we realized that our approach to mobile quality needed an overhaul. We transitioned our mobile QA organization into release engineering and test automation teams. The newly minted Mobile Release Engineering team transformed the release process by automating it from start to finish, empowering product teams to release Beta versions of their apps independently. They even gave our internal release portal a much-needed speed boost. The new Mobile Performance and Reliability team came out swinging with an automated crash reporting tool that created tickets, assigned them to the right teams, and alerted the right on-call channels, slashing our crash resolution times. We also formed a mobile test engineering squad that took charge of testing frameworks and infrastructure. They not only addressed flaky tests but also became the go-to resource for test automation. Following this, we delved into code coverage and rolled out visibility tooling across our major repos.
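The crash-to-ticket flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual tool: the `OWNERS` map, the `com.example.*` package names, the team names, and the `CrashTriage` class are all hypothetical. The sketch groups identical crashes under one signature and routes each new signature to the team that owns the first matching stack frame.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical ownership map: package prefix -> owning team.
OWNERS = {
    "com.example.payments": "payments-team",
    "com.example.checkout": "checkout-team",
}
DEFAULT_OWNER = "mobile-reliability"  # fallback when no frame has a known owner


def crash_signature(exception_type: str, frames: list[str], depth: int = 3) -> str:
    """Stable ID built from the exception type and the top few stack frames."""
    key = exception_type + "|" + "|".join(frames[:depth])
    return hashlib.sha256(key.encode()).hexdigest()[:12]


def owning_team(frames: list[str]) -> str:
    """Return the owner of the first frame that matches a known prefix."""
    for frame in frames:
        matches = [prefix for prefix in OWNERS if frame.startswith(prefix)]
        if matches:
            return OWNERS[max(matches, key=len)]  # most specific prefix wins
    return DEFAULT_OWNER


@dataclass
class Ticket:
    signature: str
    team: str
    count: int = 1


class CrashTriage:
    """Keeps one ticket per crash signature and counts repeat occurrences."""

    def __init__(self) -> None:
        self.tickets: dict[str, Ticket] = {}

    def report(self, exception_type: str, frames: list[str]) -> Ticket:
        signature = crash_signature(exception_type, frames)
        ticket = self.tickets.get(signature)
        if ticket is None:
            ticket = Ticket(signature, owning_team(frames))
            self.tickets[signature] = ticket
        else:
            ticket.count += 1
        return ticket


triage = CrashTriage()
frames = ["android.os.Handler.dispatchMessage",
          "com.example.payments.CardReader.read"]
first = triage.report("NullPointerException", frames)
second = triage.report("NullPointerException", frames)
```

In a real system, creating a ticket would also call the issue tracker and paging service; those integrations are left out here because their details are not described in this post.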
Despite these reliability improvements, our work was far from done. We also made strides in enhancing our testing infrastructure. A pivotal move was our transition from Firebase Test Lab to a more efficient, in-house solution. This shift cut our testing expenses by 95% and made test runs 60% faster, translating to an annual savings of $2.5M. To further bolster our testing infrastructure, the Hardware team created an on-site fleet of Square hardware products, specifically for software and firmware validation.
Yet the journey didn't stop there, especially when we faced new challenges brought on by the pandemic. The pandemic put a halt to our "Lunch Testing" programs, in which we dogfooded Square products for transactions during lunchtime in our cafeteria, so we had to innovate. To fill this void, we pioneered a hardware robotic testing lab capable of testing all Square Readers and Registers. Using OCR technology, our testing system even analyzes actual paper receipts. We managed to automate 70% of our essential smoke tests, integrating with peripherals like printers, cash drawers, and scanners. Using real hardware also enabled us to pinpoint UI performance challenges. Moreover, we rolled out a mobile device lab stocked with physical devices. Any QA engineer can now remotely access these devices via a browser and test any version of our mobile apps, significantly boosting their productivity.
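As a rough illustration of the receipt check, the sketch below assumes an OCR step has already turned a photo of the printed receipt into plain text, then verifies that every expected line appears somewhere in that text. The function names, the 0.85 threshold, and the sample receipt are hypothetical. Fuzzy matching with Python's standard `difflib` is our choice for the sketch, made because OCR commonly misreads characters such as `0` and `O`; it is not a description of the lab's actual method.

```python
import re
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", line).strip().lower()


def best_match(expected: str, ocr_lines: list[str]) -> float:
    """Highest similarity ratio between the expected line and any OCR line."""
    target = normalize(expected)
    return max(
        (SequenceMatcher(None, target, normalize(line)).ratio() for line in ocr_lines),
        default=0.0,
    )


def missing_lines(expected_lines: list[str], ocr_text: str,
                  threshold: float = 0.85) -> list[str]:
    """Return the expected lines that have no sufficiently close match in the OCR text."""
    ocr_lines = [line for line in ocr_text.splitlines() if line.strip()]
    return [line for line in expected_lines
            if best_match(line, ocr_lines) < threshold]


# Hypothetical OCR output: note the letter "O" misread in place of a zero.
ocr_text = "COFFEE SHOP\nLatte   $4.5O\nTotal  $4.50\n"
expected = ["Latte $4.50", "Total $4.50", "Tip $1.00"]
missing = missing_lines(expected, ocr_text)
```

Here the misread latte line still counts as present, while the tip line, which never printed, is reported as missing. The threshold trades off tolerance for OCR noise against the risk of accepting a wrong amount, so a stricter comparison may be appropriate for numeric fields.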
Key Learnings
Everything we do is in the service of delivering exceptional products that delight our customers. As we grew, we saw that true customer delight means having reliability as the foundation of our development practices. This shift to a proactive, preventative stance has been instrumental. We didn’t just adapt; we transformed, creating a synergy across teams that put reliability at the forefront. Our move to an in-house testing framework and our quick innovation amid a global pandemic weren’t just about being more productive. They were about understanding that our customers’ joy is rooted in the trust they place in our products’ reliability.
Moreover, we recognized that delivering impactful, far-reaching solutions requires a collaborative effort, a 'village' of diverse teams across multiple business units, each contributing its distinct expertise. It's this blend of interdisciplinary insights and collective wisdom that equips us to tackle complex challenges holistically. Impactful changes arise from a culture of shared responsibility and collaboration between many teams working together towards a shared goal.
Wrapping Up
Thanks for reading about how we're working to make things better for our developers at Square. I want to give a big shoutout to our teams and engineers. They're the ones who really make all of this happen! From our humble beginnings to the present day, with all the complicated challenges we face, it's their hard work that keeps pushing us forward. Looking ahead, it’s not merely about tackling bigger challenges; it’s equally about the teamwork involved in getting there. This team-first approach is what will help us keep moving forward, no matter what comes next.