Breaking up with your MonoRail

Planning an extraction from a monolithic Rails app.


Written by Zachary Anker.

At Square, I work on our Payments Scalability team. In 2015, our task was to modernize our payments infrastructure by extracting it from Square/Web, our monolithic Rails app backed by a single MySQL instance. Square/Web has been a core component in processing transactions since the company was founded in 2009.

Planning in earnest started at the end of 2014, when we already were starting to hit limits on what we were comfortable with managing operationally.

We were faced with three options:

Scale vertically: Buy more disks

This would have required spending money on hugely expensive enterprise flash cards. Scaling vertically would have been a short-term fix, and over time it would have cost more and more to keep up with growth. Operationally, it’s much harder to rebuild and back up big databases. We discarded this as a long-term option.

Scale horizontally: Shard Square/Web’s MySQL database

Instead of managing a single MySQL instance, we could shard it to scale horizontally. We evaluated a few gems, such as Octopus and db_charmer, and built a proof-of-concept sharded system. We decided retrofitting sharding would be more work and wouldn’t move us away from the monolith; in fact, it made things significantly more complex.

Break it up: Extract payments from Square/Web altogether

This ended up being our preferred approach. We decided to extract payments functionality and data from Square/Web into a separate service. This was also in line with Square’s overall architecture philosophy of avoiding monoliths.

Setting the stage

While Square/Web hasn’t handled charging payments for years, other parts of the Register app rely on it for transactional data. To reduce the chance of breaking anything, we needed to break the work into small chunks. Below, I’ll cover how we built an intermediary API on top of ActiveRecord, Rails’ ORM, then gradually built up the functionality to move away from direct SQL calls.

Before building the intermediary API, we needed to figure out how payments were being accessed. Unfortunately, Rails doesn’t make it easy to audit queries, and our monolith contained around 100,000 lines of Ruby code, excluding 200,000 lines of tests. So we created a new gem, active_record-sql_analyzer, which can de-dupe SQL queries and tag Ruby call sites to give us an idea of access patterns.

Our initial run gave us a list of 500 distinct SQL queries, which we used to come up with seven protobuf APIs. Those ended up being the basis for Spot, our new payments searching service. We chose to back it with MySQL, sharded by merchant key, as we knew the majority of our access patterns were scoped to a single merchant.
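Sharding by merchant key boils down to deterministic bucketing: every query scoped to a merchant can be routed to exactly one shard. Here is a minimal sketch of the idea; the shard count and CRC32 hashing scheme are illustrative assumptions, not Spot’s actual routing code:

```ruby
require 'zlib'

# Hypothetical shard router: deterministically maps a merchant key to
# one of N shards. Shard count and hashing are illustrative assumptions.
class ShardRouter
  SHARD_COUNT = 8

  def self.shard_for(merchant_key)
    Zlib.crc32(merchant_key) % SHARD_COUNT
  end
end
```

Because the mapping is deterministic, a merchant’s data never straddles shards, and merchant-scoped queries only ever touch a single database.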

Abstract the database

ActiveRecord’s flexibility is nice when writing SQL queries; not so much when converting them to a protobuf API. We also needed one place to funnel calls through so we could gradually roll out the new service.

We came up with a simple abstraction layer, like so:

# This is a dramatic simplification from our actual code, but it’s a demonstration of concept.
module PaymentApi
  class NotFoundError < StandardError; end

  def self.lookup(payment_token: nil, includes: nil)
    criteria = Payment.where(token: payment_token)
    criteria = criteria.includes(includes)
    # with_tag is provided by active_record-sql_analyzer
    criteria.with_tag(:api).first
  end

  def self.lookup!(payment_token: nil, includes: nil)
    payment = lookup(payment_token: payment_token, includes: includes)
    raise NotFoundError, "Could not find payment #{payment_token}" unless payment

    payment
  end
end

We could then start porting ActiveRecord calls to the PaymentApi class, and split up the work of migrating call sites among multiple engineers.
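Each call-site migration then becomes a mechanical rewrite onto the one choke point. The sketch below is self-contained, using an in-memory hash as a stand-in for the Payment model; the token and payment data are made up for illustration:

```ruby
# Self-contained sketch of the abstraction layer. PAYMENTS is an
# in-memory stand-in for the payments table; data is illustrative.
module PaymentApi
  class NotFoundError < StandardError; end

  PAYMENTS = {
    "tok_123" => { token: "tok_123", amount_cents: 1_000 }
  }.freeze

  def self.lookup(payment_token: nil)
    PAYMENTS[payment_token]
  end

  def self.lookup!(payment_token: nil)
    lookup(payment_token: payment_token) ||
      (raise NotFoundError, "Could not find payment #{payment_token}")
  end
end

# Before: call sites queried ActiveRecord directly, e.g.
#   Payment.where(token: token).first
# After: everything funnels through the abstraction layer:
payment = PaymentApi.lookup!(payment_token: "tok_123")
```

Because every lookup goes through one module, later steps (proto conversion, routing to Spot) only have to change one place.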

Abstracting away ActiveRecord

Spot returns protobufs through an RPC API, and we needed a way of switching between data backed by protobufs and data backed by MySQL, without having to rewrite every call site.

We settled on writing a converter that turns ActiveRecord models into the same protobuf that Spot returns. This wasn’t the most efficient approach, but it gave us consistency between Spot and ActiveRecord.

Expanding on our PaymentApi module, we added a PaymentApi::Converter and a PaymentApi::Wrapper:

# This is a dramatic simplification from our actual code, but it’s a demonstration of concept.
module PaymentApi
  class Converter
    def self.to_proto(payment)
      Proto::Spot::Payment.new(
        card: {
          auth: {
            amount: payment.card_authorization.amount,
            authed_at: payment.card_authorization.created_at
          },
          captures: [
            {
              success: true,
              amount: payment.card_capture.amount,
              captured_at: payment.card_capture.created_at
            }
          ]
        }
      )
    end
  end

  class Wrapper
    def self.wrap(proto)
      new(proto)
    end

    def initialize(proto)
      @proto = proto
    end

    def auth_created_at
      @proto.try(:card).try(:auth).try(:authed_at)
    end

    def auth_amount
      @proto.try(:card).try(:auth).try(:amount)
    end

    def successful_capture
      captures = @proto.try(:card).try(:captures)
      captures.find { |capture| capture[:success] } if captures
    end
  end
end

Now that we had a converter, we could expand our lookup call:

# This is a dramatic simplification from our actual code, but it’s a demonstration of concept.
module PaymentApi
  def self.lookup(payment_token: nil, includes: nil, proto: false)
    criteria = Payment.where(token: payment_token)
    criteria = criteria.includes(includes)
    payment = criteria.with_tag(:api).first

    if proto && payment
      Wrapper.wrap(Converter.to_proto(payment))
    else
      payment
    end
  end
end

While Spot wasn’t ready yet, we had finalized our proto API and could start migrating code without having to worry about constant changes to the underlying API. Finalizing the proto API also helped the rollout go more smoothly, as we could flip the flag off if we found a bug in production.

At this point, we also started double running our entire test suite with the proto wrapper flags on and off. This prevented any other engineering teams from shipping code that broke our extraction efforts and made sure we knew the old code paths still worked.
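One way to double-run a suite is to key the code path off an environment variable and run CI twice. This is a hypothetical sketch of the flag plumbing; the variable name and helpers are illustrative, not our actual test harness:

```ruby
# Hypothetical flag plumbing for double-running the test suite.
# In CI, the suite runs twice: PROTO_WRAPPER=0 and PROTO_WRAPPER=1.
module ProtoRollout
  def self.enabled?
    ENV.fetch("PROTO_WRAPPER", "0") == "1"
  end
end

# Call sites consult the flag rather than hard-coding a code path:
def lookup_path
  ProtoRollout.enabled? ? :proto_wrapper : :activerecord
end
```

With both runs green on every commit, other teams could keep shipping without silently breaking either the old or the new code path.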

Auditing Spot vs Square/Web

Square/Web’s DB schema dates back to 2010, while Spot’s is from 2013 and based on protobufs. Due to the significant divergence, we needed to confirm that our wrappers returned data in the manner Square/Web expected:

# This is a dramatic simplification from our actual code, but it’s a demonstration of concept.
COMPARE_FIELDS = %i(auth_created_at auth_amount capture_amount currency_code)
Payment.order("id DESC").limit(1000).each do |payment|
  wrapped_payment = PaymentApi::Wrapper.wrap(PaymentApi::Converter.to_proto(payment))
  spot_payment = PaymentApi::Wrapper.wrap(HTTP::Rpc::Spot.load_payment(payment.token))

  puts payment.token
  COMPARE_FIELDS.each do |field|
    spot_value = spot_payment.send(field)
    wrapped_value = wrapped_payment.send(field)

    unless spot_value == wrapped_value
      puts "- Mismatch, #{field}, found #{spot_value}, expected #{wrapped_value}"
    end
  end
end

We manually ran this until we had fixed all the discrepancies it found; it caught a lot of subtle bugs.

Calling Spot

Now, it was time to tie it all together! We had enough confidence in our wrapper that we could start sending requests through Spot. This just required another flag and a few additional lines of code.

# This is a dramatic simplification from our actual code, but it’s a demonstration of concept.
module PaymentApi
  def self.lookup(payment_token: nil, includes: nil, proto: false, via_spot: false)
    if via_spot
      return Wrapper.wrap(HTTP::Rpc::Spot.load_payment(payment_token))
    end

    criteria = Payment.where(token: payment_token)
    criteria = criteria.includes(includes)
    payment = criteria.with_tag(:api).first

    if proto && payment
      Wrapper.wrap(Converter.to_proto(payment))
    else
      payment
    end
  end
end

With that, we were done! We started rolling out traffic to Spot, re-running the SQL analyzer, and watching the number of direct queries drop.

For the sake of simplicity, I’ve omitted how we ported searches to Spot; we followed the same process outlined above. The main difference was that searches required a bit more validation on search arguments.

Things that could have gone better

Before migrating anything, we did an audit of payments related code to find anything we could delete. We ripped out approximately 56,000 lines of code total in the extraction process.

To provide the best experience to our merchants, Square avoids deprecating Register clients as long as possible. We have active versions going back two years on both Android and iOS. Any time we removed fields that were returned to the Register app, we ended up having to audit six different code bases (three versions on both iOS and Android).

We have some APIs that use the ActiveResource pattern, which makes auditing and migrating hard. Since it provides an HTTP -> SQL layer, it’s easy for new calls to be introduced, making migration a moving target.

Lack of solid API contracts between services made it hard to determine what data was used and what Square/Web specific idiosyncrasies they were relying on.

What would we have done differently?

Our existing APIs to other services in Square/Web primarily use JSON. Defining a clear contract with something like protobufs would have made it easier to understand exactly what columns and data were expected.

While we can’t do anything about existing Register clients, using protobufs from the start would have provided a clearer audit trail of what data was needed for which version. JSON fields can’t be audited when building a complicated hierarchy that branches based on various payment states.

Finally, we made a mistake that was obvious in hindsight: using randomized rollouts instead of rollouts that were stable per merchant. We ended up with a regression that only occurred when a six-month-old version and an up-to-date version of Register were used together. It wasn’t noticeable with randomized rollouts, because at lower percentages merchants are more likely to just refresh the page and then see it work.

Neither is a good user experience, but a stable rollout limits the Register app to consistently breaking for a smaller number of merchants, rather than inconsistently breaking for a larger subset of them.
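A deterministic per-merchant bucket is enough to get that stability. This is a minimal sketch, not our actual rollout system; the salt and CRC32 bucketing are assumptions:

```ruby
require 'zlib'

# Sketch of a merchant-stable percentage rollout. The same merchant
# always hashes to the same bucket, so the feature decision is
# consistent across requests instead of a per-request coin flip.
def spot_enabled?(merchant_key, percent)
  bucket = Zlib.crc32("spot-rollout:#{merchant_key}") % 100
  bucket < percent
end
```

A merchant enabled at 10% stays enabled as the rollout ramps to 25%, so any breakage is consistent and reproducible for that merchant rather than flickering randomly across the fleet.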

Closing thoughts

The final result was a successful extraction in January 2016, with all reads going to Spot for around six weeks prior to turning off local writes. If it had been necessary, we could have turned writes off sooner, but we saw no need to make a major infrastructure change right before holiday break.

I’d like to call out my fellow coworkers: Alyssa Pohahau, Andrew Lazarus, Gabriel Gilder, John Pongsajapan, Kathy Spradlin, and Manik Surtani, who all contributed to the planning and extraction effort. There are too many people to name who also contributed early analysis on the stability of the databases and other parts of Square/Web. This ended up being a company-wide effort directly involving most product engineering teams, as well as non-engineering teams.