The data-driven case for best practices and against silver bullets
Just over a year ago, we published Herding Elephants, in which we explained how we—Square's Mobile Developer Experience (MDX) Android team—successfully modernized the build logic of Square's Point of Sale apps. In that post, we described the rationale for why we did this, and what we thought we gained from the effort (such as better separation of concerns, testability, and ultimately improved performance). That rationale was fairly abstract, however, with talk about treating the build domain as a legitimate software engineering domain and following the same software engineering best practices you would for your feature work. Those abstract principles, plus the fact that the best-practices migration had only recently been concluded at time of writing, might have led to some justifiable reader skepticism.
This post is a triumphant follow-up to last year's. We will show that the gains predicted were real, enduring, and cumulative. Our build is faster now than it was before we started our project, even after adding a million lines of code (a 33% increase). We firmly believe that our approach is generalizable, and that a focus on best practices and incremental improvements (which can unlock large wins, for example when we were ultimately able to enable configuration caching by default) should be the default choice, and that risky moonshots or silver bullet solutions should be treated with great skepticism. As always though, YMMV.
This is not a technical deep dive. Instead, it is a high-level overview of what we did, why we did it, what we gained from the effort, and why we believe our approach is generalizable.
But first, let's update the repo statistics from last time. One year ago, the Point of Sale repo contained about 3 million lines of code spread across 3,500 Gradle modules; and the many dozens of developers working on it ran nearly 2,000 local builds daily for a cumulative build cost of nearly 3 days per day. Our CI system ran more than 11,000 builds per day, for a cumulative build cost of over 48 days per day. Today, that same repo stands at more than 4 million lines of code (+33%) spread across nearly 4,400 Gradle modules (+25%). Developers run nearly 2,200 builds per day (+10%), for a cumulative cost of more than 3 days per day (±0%). Our CI system now runs nearly 12,000 builds per day (+9%), or more than 58 days per day (+21%).
By almost every measure, our repo has grown significantly year-on-year. Based on this growth, it would be reasonable to expect that local build time would also grow (that is, get worse). Has it?
As a matter of fact, no. Since we began modernizing our Gradle build about a year and a half ago, we have continued to wrestle with the long tail of pre-modern idioms, making incremental gains along the way (see below for some examples of modernization). Those incremental gains compounded such that, today, despite tremendous growth in code size, our local builds are actually faster than they were before we started. In fact, local build performance has improved so much that, even though we now employ more developers who run more builds, cumulative time spent in those builds hasn't grown significantly at all. We are shipping more features, faster.
"More code does not have to mean slower builds."
Back in 2019, our local dev experience was far from optimal. Local builds took "forever" and, because we had been unable to keep Buck working for us, we had just migrated off of it and back to Gradle. That crisis and its solution were articulated well by former Square engineer Ralf Wondratschek in his Droidcon talk Android at Scale @Square. In that talk, Ralf described our (then new) module structure and the concept of isolated or "demo" apps, which gave developers a way to iterate on their new features in a smaller sandbox than a full (and very expensive-to-build) app flavor. Unfortunately, a consequence of this structure was an explosion in the module count which, over time, led to build and IDE performance degradation due to unexpected inefficiencies in both the Gradle build system and the Android Gradle Plugin (AGP).
Gradle's greatest inefficiency in this context is its single-threaded configuration phase, which meant that each additional module imposed a constant overhead on the build; since we were adding modules at a constant rate, we were experiencing linear growth in configuration time, which had a major impact on build and IDE sync time. AGP's most interesting inefficiency is that each Android module has a large number of Configuration instances (dependency buckets), which leads to an explosion in heap (memory) usage; since we were adding Android modules at a constant rate, this led to near-linear growth in heap usage over time, putting increasing pressure on all of our infrastructure.
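As a rough back-of-the-envelope model of that reasoning (a simplification for illustration, not a measured law), assume a fixed configuration cost per module and a constant rate of module creation. All symbols below are hypothetical and do not come from our measurements.

```latex
% Hypothetical symbols:
%   c   = configuration cost per module
%   N_0 = module count at the start, r = modules added per unit time
%   h   = heap consumed per Android module, A(t) = Android module count
T_{\text{config}}(t) \approx c \, N(t) = c \, (N_0 + r\,t)
\qquad
H(t) \approx h \, A(t)
```

Under these assumptions both configuration time and heap usage grow linearly with time, which is why neither problem goes away on its own.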
In response to these performance and other issues, we embarked on what turned out to be a two-pronged approach. First, we invested in converting our build to Bazel, which was a risky strategy due to its all-or-nothing nature, but nevertheless held tremendous promise if we could pull it off. When that didn't pay off in the timescale we had hoped (and when other more pressing issues became apparent, such as IDE performance and CI issues), we next began deepening our understanding of the Gradle build tool as well as our relationship with Gradle, Inc. Let's examine these two strategies in turn.
While not the main focus of this post, it is worth calling out that one of the advantages Bazel provides out of the box is the powerful ability to query the build graph without needing to configure any build. This makes it fairly straightforward to build a CI partitioner that can, say, skip 80% of the shards (jobs) in any given PR. In our Java backend repo, we use this feature to save literally millions of dollars per year in costs associated with CI. In our Point of Sale Android repo, which uses Gradle, we had to build a custom query system to achieve the same results. It works very well, but is much slower because we can't yet skip Gradle's expensive configuration phase. This is an area of ongoing R&D for the MDX team.
Bazel—the one build tool to rule them all. Infinitely scalable, no configuration bottleneck, unifies backend and both mobile ecosystems under one build roof. And in truth, we've had great success with it in our Java backend and iOS repos. Nevertheless we must admit that we have not had great success in our Android Point of Sale repo. There are several reasons for this, which might be summarized as: Bazel is not the official build tool for Android. Our historic usage of Gradle (that is to say, the official build tool for Android) has enabled us to make more or less rapid uptake of new features in the Android build ecosystem. Some of those features, either not yet implemented in Bazel or only implemented after a very significant lag, include: app bundles (open ticket); D8 (open PR); R8 (open PR); view binding (open ticket).
In addition, Bazel, the Bazel Android Rules, and the Bazel IntelliJ plugin have lackluster support in other areas where we have greater confidence in the official toolchain (to wit: Gradle and AGP), including Jetpack Compose IDE preview integration (open issue), as well as a variety of other IDE integrations (list of open issues). In the end, we think we'd be responsible for addressing IDE integration ourselves, which would be a very significant investment with little guarantee of payoff when compared against the aggressive investments Google's dedicated team is clearly making in vanilla Android Studio.
As an example of how this can play out in practice (and how decisions in one domain—build systems—can have repercussions in other domains—IDE performance and the development loop), consider Android Studio Electric Eel. It is such a massive improvement over prior iterations of Android Studio that supporting it may turn out to be our entire developer retention strategy. And we got those improvements almost entirely "for free," as a result of not ignoring the official toolchain. Achieving the same results with Bazel… well, it would first require getting our entire build to work with Bazel.
"So far it's a massive difference. Code analysis takes just a second or two… about 1–2 seconds for it to analyze a whole file and enable syntax highlighting. Still a slight delay when actually typing… but before this, it took several minutes (if ever) for a file to be analyzed, and real-time typing didn't exist." (From feature engineer on what it's like to develop using Android Studio Electric Eel.)
Gradle—the build tool everyone loves to hate (if one is to judge from Twitter curmudgeons). With its well-known scaling issues, it's obvious Gradle is a non-starter for one of the Earth's largest Android repos. And yet: what if it wasn't? What if, instead of complaining, we took the build domain seriously as an area of endeavor, created a (small) dedicated team, and made a concerted effort? What if we embraced every best practice—not only Gradle-specific but generic to software engineering itself—what if we got rigorous? And what if we did it with the official toolchain, such that our small (but mighty) team could reap the rewards of collaboration with industry partners like Google, Gradle, and JetBrains? An investment in the official toolchain is a decision to accept free multipliers. It's also an investment in the community, since only the largest companies with the fattest platform budgets will be able to make the investments necessary to keep the lights on with Bazel. And as we've demonstrated quite clearly, it's not necessary to switch build tools. We have proven that Gradle can handle large Android code bases.
Though this post isn't meant as a deep dive, we do want to list some of the specific changes we made as we fixed our Gradle build. There's no magic here! Note that some of these changes are also discussed here, and some are even enforced with the Gradle Best Practices Plugin.
This change is discussed in some detail in Herding Elephants, but to recap, we strongly believe that large projects should avoid buildSrc and should use Gradle's composite build facility instead. Very large projects should strongly consider taking this a step further and publishing their Gradle plugins and consuming them as binaries in their main builds, as this avoids a configuration-time penalty that is easily observable when using composite builds.
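A minimal sketch of what replacing buildSrc with an included build can look like, using the Kotlin DSL. The directory name build-logic, the project name, and the :app module are hypothetical placeholders and are not taken from our repo.

```kotlin
// settings.gradle.kts of the main build (illustrative sketch)
pluginManagement {
    // Assumption: build logic lives in its own Gradle build under ./build-logic
    // (a hypothetical directory name) instead of ./buildSrc.
    includeBuild("build-logic")

    repositories {
        gradlePluginPortal()
        google()
        mavenCentral()
    }
}

rootProject.name = "example-app" // hypothetical
include(":app")                  // hypothetical module
```

Plugins defined in the included build can then be applied by ID in any project's plugins block. Consuming published binaries instead would mean dropping the includeBuild line and resolving the same plugin IDs, with a version, from a repository.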
As Gradle has evolved in recent years, it has increasingly moved away from what it calls "eager APIs" to "lazy APIs." Readers should take a look at the documentation on task configuration avoidance and lazy configuration for an in-depth discussion on how to defer work and make your builds as lazy as possible.
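To make the eager/lazy distinction concrete, here is a small sketch in the Kotlin DSL. The task name and output path are hypothetical, and this is an illustration of the documented APIs rather than code from our build.

```kotlin
// build.gradle.kts (illustrative; task and file names are hypothetical)

// Eager: tasks.create(...) would instantiate and configure the task on every
// build, whether or not it ends up in the task graph.
// tasks.create("generateSummary") { /* ... */ }

// Lazy: tasks.register(...) only records the task. The block below runs only
// if the task is actually required by the requested build.
tasks.register("generateSummary") {
    // A Provider<RegularFile>: the path is resolved when read, not here.
    val summaryFile = layout.buildDirectory.file("reports/summary.txt")
    outputs.file(summaryFile)
    doLast {
        val file = summaryFile.get().asFile
        file.parentFile.mkdirs()
        file.writeText("summary generated\n")
    }
}

// Lazy: configureEach touches only tasks that are actually realized, unlike
// passing the configuration block directly to withType.
tasks.withType<JavaCompile>().configureEach {
    options.encoding = "UTF-8"
}
```

The design point is the same in each case: describe the work up front, but let Gradle decide whether and when to do it.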
Script plugins are Gradle scripts that can be arbitrarily applied. Think: apply from: 'complicated_thing.gradle'. These are a poor man's Gradle plugin, and can bloat heap usage and make your builds harder to understand. Just write a plugin.
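As a sketch of what "just write a plugin" can mean, the class below stands in for whatever a script like complicated_thing.gradle did. The package, class, and task names are hypothetical, and the sketch assumes the class is compiled in a build that applies Gradle's kotlin-dsl plugin (which is what lets the configuration lambdas use the task as an implicit receiver).

```kotlin
// build-logic/src/main/kotlin/com/example/ComplicatedThingPlugin.kt
// (hypothetical location and names; assumes the `kotlin-dsl` plugin is applied)
package com.example

import org.gradle.api.Plugin
import org.gradle.api.Project

class ComplicatedThingPlugin : Plugin<Project> {
    override fun apply(target: Project) {
        // Capture plain values at configuration time so the task action does
        // not need to reach back into the Project when it executes.
        val projectPath = target.path

        target.tasks.register("complicatedThing") {
            group = "verification"
            description = "Hypothetical stand-in for the old script plugin's logic"
            doLast {
                logger.lifecycle("Running complicatedThing for $projectPath")
            }
        }
    }
}
```

Once the class is registered under a plugin ID (for example through the gradlePlugin block of the build that compiles it), projects apply it by ID in their plugins block, and the logic becomes something you can compile, test, and version like any other code.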
The most common examples of cross-project configuration are usages of allprojects blocks. These defeat configuration on demand, will prevent usage of the forthcoming project isolation feature, and dangerously couple the projects in your multi-project build to one another, among other bad consequences.
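One common alternative is a convention plugin that each project opts into explicitly. The sketch below uses a precompiled script plugin; the plugin ID example.test-conventions and the file locations are hypothetical, and it assumes the build that hosts the script applies the kotlin-dsl plugin.

```kotlin
// What to avoid, in the root build.gradle.kts:
// allprojects {
//     tasks.withType<Test>().configureEach { useJUnitPlatform() }
// }

// Instead: build-logic/src/main/kotlin/example.test-conventions.gradle.kts
// (a precompiled script plugin; the file name determines the plugin ID)
tasks.withType<Test>().configureEach {
    useJUnitPlatform()
}

// Each project that wants the convention then declares it in its own
// build.gradle.kts:
// plugins {
//     id("example.test-conventions")
// }
```

The configuration is the same, but every project now states its own dependencies on build logic, so no project reaches into another one.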
No, we don't mean the implicit integration test named "does my build still work?" If you actually test your plugins using Gradle TestKit, you can have high confidence that those plugins work and that your changes haven't broken anything. We have literally hundreds of such tests that run on CI (if you count their parameterization, that is); see for example the functional tests for the Dependency Analysis Gradle Plugin, whose style heavily influenced our own tests. Testing lets us iterate rapidly and confidently, and unlocks much of the rest of the improvements discussed here.
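For readers who have not used TestKit, a functional test can be as small as the following sketch. It assumes JUnit 5, the java-gradle-plugin (which makes withPluginClasspath work), and a hypothetical plugin with ID com.example.complicated-thing that registers a task named complicatedThing with at least one action. None of these names come from our code base.

```kotlin
import org.gradle.testkit.runner.GradleRunner
import org.gradle.testkit.runner.TaskOutcome
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import org.junit.jupiter.api.io.TempDir
import java.io.File

class ComplicatedThingPluginTest {

    // JUnit creates a fresh directory that serves as the fixture project.
    @TempDir
    lateinit var projectDir: File

    @Test
    fun `plugin registers a task that runs successfully`() {
        // An empty settings file marks the directory as a standalone build.
        File(projectDir, "settings.gradle.kts").writeText("")
        File(projectDir, "build.gradle.kts").writeText(
            """
            plugins {
                id("com.example.complicated-thing")
            }
            """.trimIndent()
        )

        // Run a real Gradle build against the fixture, with the plugin under
        // test on the classpath.
        val result = GradleRunner.create()
            .withProjectDir(projectDir)
            .withPluginClasspath()
            .withArguments("complicatedThing")
            .build()

        assertEquals(TaskOutcome.SUCCESS, result.task(":complicatedThing")?.outcome)
    }
}
```

Because the test drives a real build, it exercises the plugin the way a consuming project would, which is where most build-logic regressions show up.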
One of the most useful things we did (and which was enabled by all the other things!) was turn on the configuration cache by default. We estimate that doing so will save us about 5,400 hours of developer time annually, at minimum.
The above list is far from exhaustive, but should give the reader a sense for where to focus efforts.
We title this section "incremental improvements, compounding gains" because that's what we got, regularly, as we invested in our extant build system. We think the advantage here is clear—because we could demonstrate near-constant (and occasionally stepwise) progress, support for the project only grew with time, and it was always very easy to justify continued investment. Eventually we reached the point where complaints about the build system had practically evaporated.
Gradle isn't perfect, and we have big plans for 2023. One of Gradle's most critical gaps, we think, is the frankly exuberant amount of memory that the various Gradle processes will consume if you give them a chance. This problem has already gotten a lot better since we started using the configuration cache by default, because builds that reuse that cache use a lot less memory than those that don't. This doesn't help with builds that can't reuse the cache, though, and many builds still don't—including the build associated with the IDE sync process. Right now it takes Gradle up to 30 GiB of memory to sync our full repo, which is far too much and also unsustainable. We hope to work closely with Gradle on these and other memory issues in the coming year.
Another long-standing issue is the fact that the configuration phase of a Gradle build is single-threaded, which imposes a substantial bottleneck on large builds such as ours (with its 4,400 modules). This, along with related issues like fine-grained configuration, is something Gradle aims to make progress on in 2023 via its isolated projects initiative. We expect to work closely with them on dogfooding this feature.
That perennial question: do I fix what I have or do I rewrite it from scratch? Spicy takes aside, we all know that the only real answer to this question is "it depends." The companies that migrated their Android builds to Bazel may very well have made the best choice under their particular circumstances. (As an external observer, it is literally impossible to say with any certainty. Tech blogs are not peer-reviewed research papers, you know?) What we think we have shown in this post, however, is that it is possible for developer productivity teams to achieve their goals incrementally, by fixing their existing systems; it is not necessary to replace them wholesale.
Some of the factors at play here include:
- Timing. Gradle is arguably better now than it used to be. For example, Gradle is increasingly lazy.
- Resources. Company X may have the wherewithal to establish a large team focused on build tooling, while Company Y is more constrained.
- Expertise. Company X may have access to significant Bazel expertise while Company Y has access to a Gradle expert.
- Collaboration. Sometimes an improvement requires working with teams of people in multiple companies. That kind of collaborative relationship takes years to establish.
Let's recap. In this post, we have shown that even Very Large Android repos can successfully manage build performance with Gradle. Change can be achieved incrementally, and the benefits often compound. Change can also be achieved transparently, without requiring feature engineers to learn a new build system or to change their workflows in any way. Because the performance benefits accrue regularly, the work is typically self-justifying so long as you remember to benchmark constantly. None of this requires magic, but it does require dedication and a belief (which we hope to have instilled here!) that it is possible.
Silver bullets, on the other hand, while often exciting, carry much greater risk: they may never pay off. You can make a tremendous effort to migrate to an entirely new build system, over months or years, and at the end of the day it'll either pay off or it won't. Think of it like an app rewrite: who doesn't want a beautiful greenfield project with no "legacy" to deal with? Start over, do it right from the beginning! Yet, that "legacy" code holds a lot of implicit domain knowledge that isn't captured anywhere else. Your rewrite may never finish, or if it does, will almost certainly take longer than predicted, and in the meantime you're not rolling out new features and bug fixes to your existing users because your team is focused on a risky rewrite. You may finally deliver the new product only to find you now just have different bugs!
Experienced engineers know that the solution to a complex problem is always context-sensitive; that is to say, it depends. But now we know definitively that, in the context of the build, the solution space includes Gradle.
Special thanks to Pierre-Yves Ricau, Tim Mellor, and Roger Hu for reviewing early drafts, and to everyone else for reviewing the many, many PRs and design docs it took to get to this point.