Evolution of Developer Productivity at Square - Part Two
Accelerating development velocity
Kicking off the second part of our blog series, I'd like to talk about our efforts to address one of our most pressing challenges at Square: improving development velocity. A recurring piece of feedback from engineers was that rolling out features felt slower than it should. Recognizing this, we focused on streamlining every step of the software development lifecycle, from inception to deployment.
The Client Foundations team introduced the 'Development apps' framework to expedite feature development. It lets us build features in isolation, dramatically reducing both build and installation times. To put it in perspective, building a development app locally now takes just 18 seconds, compared to the 96 seconds required for a standard build, a reduction of roughly 80%.
In our quest for further efficiency, we also implemented Remote Development Environments and Cloud IDEs specifically for Android. Using powerful EC2 instances with 32 cores and 96 GB of memory, we doubled the speed of local builds. Additionally, we rolled out configuration caching, making local Android builds up to 50% faster. This alone has saved over 5,000 developer hours annually.
In parallel with local development improvements, we made significant strides in Android Continuous Integration. Our novel shard skipping algorithm improved Android CI job speeds by as much as 90%. Concurrent optimization of Android emulators and iOS simulators led to a 23% reduction in CI durations.
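The post does not describe how the shard skipping algorithm works, so the following is only a minimal sketch of one common way to skip CI shards, not Square's implementation. It assumes each test shard covers a known set of modules and that a reverse dependency map is available; the function names, module names, and shard names are all hypothetical.

```python
from collections import deque


def affected_modules(changed, dependents):
    """Return the changed modules plus everything that transitively depends on them.

    `dependents` is a hypothetical reverse dependency map:
    module -> modules that depend on it directly.
    """
    seen = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for dependent in dependents.get(module, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def shards_to_run(shards, changed, dependents):
    """Keep only the shards that test at least one affected module."""
    affected = affected_modules(changed, dependents)
    return [name for name, modules in shards.items() if affected & set(modules)]


# Illustrative data only: a change to "payments" affects "checkout" as well.
dependents = {"core": ["payments", "ui"], "payments": ["checkout"]}
shards = {"shard-1": ["core"], "shard-2": ["checkout"], "shard-3": ["settings"]}
selected = shards_to_run(shards, ["payments"], dependents)
```

In this sketch a shard is skipped only when none of its modules can be reached from the change, so the saving depends on how cleanly shards line up with module boundaries.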
But that's not all. We've also been focusing on addressing Android's technical debt and improving build-related APIs. Through convention plugins and Gradle upgrades, we’ve managed to cut build times by over 30%. We haven't overlooked IDEs either; despite their occasional struggles with our large codebases, we've been actively involved in IDE releases. This includes continuous benchmarking of build and sync times, as well as troubleshooting, leading to further performance improvements.
As the number of Gradle subprojects in our large codebase grew, IDE sync times kept increasing, which led to frustration among our Android developers. After digging deeper, we identified that our modularization strategy was contributing to the slowdown and implemented optimizations that made build graph traversal 4x faster. We also discovered inefficiencies in the IntelliJ Kotlin plugin and were able to cut sync times by a further 3 minutes. In addition, we engaged deeply in the release cycles of new IDE versions. To assess performance systematically, we developed a tool that benchmarks different versions of key components in the Android tooling ecosystem, such as AGP, Kotlin, Android Studio, and Gradle. This data-driven approach helped us pinpoint performance bottlenecks and file targeted bug reports. Together, these efforts resulted in a 60% improvement in IDE sync times and saved an estimated 1,600 developer hours per year, significantly enhancing the local development experience.
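To make the benchmarking idea concrete, here is a small sketch of how timings collected for different toolchain configurations might be compared against a baseline. It is an illustration under assumptions, not the internal tool: the configuration labels and the timing numbers are made up, and the sketch only summarizes samples rather than running any builds or syncs itself.

```python
from statistics import median


def summarize(samples, baseline):
    """Compare the median duration of each configuration against a baseline.

    `samples` maps a configuration label to a list of measured durations
    in seconds. Labels and values here are hypothetical.
    """
    base = median(samples[baseline])
    report = {}
    for config, durations in samples.items():
        mid = median(durations)
        report[config] = {
            "median_s": mid,
            # Negative means faster than the baseline.
            "change_pct": round((mid - base) / base * 100, 1),
        }
    return report


# Illustrative measurements only.
samples = {
    "toolchain-A": [100, 110, 90],
    "toolchain-B": [60, 62, 58],
}
report = summarize(samples, baseline="toolchain-A")
```

Using the median rather than the mean keeps a single slow run, for example one with a cold cache, from skewing the comparison.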
We continued to innovate with more transformative strategies. Migrating our iOS and Java monorepos to Bazel had a profound impact, slashing build times by 80% and 30%, respectively. Branch stability was another problem: before we introduced merge queues, nearly a third of our build failures on main branches were due to merge conflicts. Merge queues, particularly in our iOS monorepo, significantly enhanced branch stability. Another persistent issue was the lag between making code changes and testing them in staging environments, a delay that hurt workflows and productivity. The Deploy team's introduction of "SKI (Square Kubernetes Infrastructure) Playpen" addressed this challenge head-on, enabling local changes to be deployed to staging environments within seconds. This has been a vital improvement for both our backend and frontend developers.
Key Learnings
This is not to say everything was smooth sailing. When it comes to Bazel, its support for Android has been somewhat... trailing. Back in October 2020, internal Android rules were introduced with the hope that the open-source community would take the lead, much like what happened for iOS. But four years in, Bazel is still missing some critical features, and its IDE plugin isn't quite in sync with the latest IntelliJ versions. Even with yearly announcements at major developer conferences, the core focus of the Android toolchain remains on Gradle. Reflecting on the rapid innovations within the IDE and the Android toolchain, we've concluded that, at least for now, Bazel might not be the best fit for our Android apps. We're confident that we can still achieve a great developer experience without transitioning to Bazel. If you're among the pioneers who've had more success, I'd love to learn from your experience and have a more in-depth discussion!
It may come as a surprise that the main bottleneck in delivering features is not build times or test speeds. Rather, it is the time spent reviewing pull requests. While it's easy to focus on the technical aspects, we've discovered that streamlining processes can significantly accelerate our entire development lifecycle. To address this, we've implemented tools that optimize reviewer assignments and aim to distribute review workloads evenly. However, we believe that cultivating a culture of dynamic collaboration is equally impactful. To this end, we are promoting strategies like establishing Service Level Objectives (SLOs) for code reviews. To further incentivize this, we're considering increasing visibility around this metric and offering public recognition to our top reviewers.
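As a rough illustration of the two ideas above, the sketch below picks the least-loaded eligible reviewer and measures how often reviews meet a response-time target. It is a simplified example under assumptions, not the tooling described in this post: the function names, reviewer names, the 24-hour target, and the sample data are all hypothetical.

```python
def pick_reviewer(candidates, open_reviews, author):
    """Choose the eligible reviewer with the fewest open reviews.

    Ties are broken alphabetically so the choice is deterministic.
    """
    eligible = [c for c in candidates if c != author]
    if not eligible:
        raise ValueError("no eligible reviewer")
    return min(eligible, key=lambda c: (open_reviews.get(c, 0), c))


def slo_compliance(first_response_hours, target_hours):
    """Fraction of reviews whose first response arrived within the target."""
    if not first_response_hours:
        return 1.0  # Nothing to review, so nothing missed the target.
    met = sum(1 for hours in first_response_hours if hours <= target_hours)
    return met / len(first_response_hours)


# Illustrative data only.
reviewer = pick_reviewer(["ana", "bo", "cy"], {"ana": 3, "bo": 1, "cy": 1}, author="bo")
compliance = slo_compliance([2, 30, 5, 23], target_hours=24)
```

A real assignment tool would likely also weigh code ownership and expertise; load balancing alone is just the simplest piece to show.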
Let’s switch gears and talk about what it takes to maintain these performance wins. In our domain, keeping performance steady (if not better) year after year is a real win, and we want to take advantage of every bit of performance that our machines can deliver. On that note, a while back we partnered with our security team to evaluate an anti-malware tool with notable advantages. To make sure its security benefits didn’t come at the cost of performance regressions in our systems, we worked closely with our security colleagues. Together, we explored a variety of solutions, experimented with different approaches, and worked with the vendor on potential optimizations. Ultimately, the collaboration showed that we can have a solution that raises the bar for both performance and security.
Wrapping Up
As we wrap up this second part of our blog series, we've taken a deep dive into Square's efforts to improve development velocity. From introducing frameworks that cut down build times to tackling Android's technical debt, we've shared a range of strategies that have made a substantial impact. While some initiatives, like our transition to Bazel, had mixed results, they provided valuable lessons in adaptability and the importance of aligning tools with our specific needs. Our journey underscores a key takeaway: enhancing development speed goes beyond just technical solutions; it's about continuously evolving and finding the right balance between innovation, efficiency, and collaboration. In the next installment, we'll focus on our platform-specific tools and strategies for reliability and testing. Stay tuned!