Optimizing ruby-protobuf deserialization by 50%

How to use ruby-prof to find hotspots in code

Written by Zachary Anker.

I’m on a team colloquially called “WebScale” at Square, which deals with extracting legacy payments data from our monorail. One of our recent tasks has been the rollout of a new internal service for searching payments on Square. We mostly use protobuf as our data transport; a single payment proto is about 1,500–2,000 bytes when serialized, and a request can return up to 500 records.

Once rollout of the new service started, requests from a Ruby MRI 2.2.3 client jumped from 250ms to 1,600ms on a few endpoints. We already had metrics around time spent in RPC requests, so we took a guess that protobuf deserialization was the culprit.

To verify our guess, we used ruby-prof to profile code in Ruby (in this case, the protobuf gem). RubyProf is a helpful tool for profiling MRI code, and here’s how we used it to find hotspots.

Our test case:

require 'protobuf'
require 'ruby-prof'

# Just so we don't get spammed when testing
Protobuf.print_deprecation_warnings = false

# Construct a proto that mimics the size of our production responses
encoded = RecordResponse.new(
  payment: 500.times.map do |i|
    Record.new( ...test data... )
  end
).encode

result = RubyProf.profile do
  RecordResponse.decode(encoded)
end

printer = RubyProf::FlatPrinter.new(result)
printer.print(
  STDOUT,
  # Only want to focus on hotspots right now
  min_percent: 1
)

RubyProf supports various ways of printing a report. I picked FlatPrinter by using the scientific method of it appearing “first on the list”. I’m only showing code that took more than 1% of the time for the sake of brevity.

Measure Mode: wall_time
Thread ID: 70093065552380
Fiber ID: 70093084071700
Total: 4.599523
Sort by: self_time

 %self      total      self      wait     child     calls  name
 16.25      0.748     0.748     0.000     0.000  2031000   Protobuf::Enum#to_i
 15.35      1.454     0.706     0.000     0.748    30000   Array#select
  7.30      0.745     0.336     0.000     0.410   140000   <Class::Protobuf::Decoder>#read_varint
  3.44      0.176     0.158     0.000     0.018    94000   Kernel#respond_to?
  3.32      1.156     0.152     0.000     1.003    72000   <Class::Protobuf::Decoder>#read_field
  2.44      0.508     0.112     0.000     0.396    72000   <Class::Protobuf::Decoder>#read_key
  2.26      0.298     0.104     0.000     0.194    72000   Protobuf::Message::Fields::ClassMethods#get_field
  2.00      0.159     0.092     0.000     0.067   166500   IO::generic_readable#readbyte
  1.76      0.124     0.081     0.000     0.043   171500   Numeric#nonzero?
  1.66      0.076     0.076     0.000     0.000   410500   Fixnum#&
  1.58      0.073     0.073     0.000     0.000   144000   Protobuf::Field::BaseField#repeated?
  1.48      0.068     0.068     0.000     0.000    69000   Protobuf::Field::BaseField#setter
  1.46      0.067     0.067     0.000     0.000   166500   StringIO#getbyte
  1.19      0.055     0.055     0.000     0.000     1500   Kernel#caller
  1.17      0.054     0.054     0.000     0.000    72000   Protobuf::Message::Fields::ClassMethods#field_store
  1.05      0.087     0.048     0.000     0.039    72000   Protobuf::Field::BaseField#packed?
  1.00      0.137     0.046     0.000     0.091    36003   Class#new

Continuing with our scientific method of “first on the list”, we’ll dig into Protobuf::Enum. There are a few uses of to_i in that class, but the one that stands out is:

# Public: Get an array of Enum objects with the given tag.
#
# tag - An object that responds to to_i.
#
# Examples
#
#   class StateMachine < ::Protobuf::Enum
#     set_option :allow_alias
#     define :ON, 1
#     define :STARTED, 1
#     define :OFF, 2
#   end
#
#   StateMachine.enums_for_tag(1) #=> [ #<StateMachine::ON=1>, #<StateMachine::STARTED=1> ]
#   StateMachine.enums_for_tag(2) #=> [ #<StateMachine::OFF=2> ]
#
# Returns an array with zero or more Enum objects or nil.
#
def self.enums_for_tag(tag)
  enums.select do |enum|
    enum.to_i == tag.to_i
  end
end

The code has great documentation on how the method works, which makes things easy for us. We use a lot of enums in our protos, and looking around the Protobuf::Enum class it’s clear enums_for_tag is called frequently (in enum_for_tag and fetch). Looking at the all_tags and values methods, we know it’s safe to cache enum data. In this case, it’s a straightforward O(n) -> O(1) optimization. Because enums can be aliased, a single tag can map to multiple Enum objects, so the cache has to map int -> array of enums rather than int -> enum. That’s fine for our case though.

The full changes are at the end, but the gist is:

def self.mapped_enums
  @mapped_enums ||= enums.each_with_object({}) do |enum, hash|
    list = hash[enum.to_i] ||= []
    list << enum
  end
end

def self.enums_for_tag(tag)
  mapped_enums[tag.to_i] || []
end
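To make the payoff concrete, here’s what lookups look like with the cache in place, reusing the StateMachine example from the documentation above:

StateMachine.enums_for_tag(1) #=> [ #<StateMachine::ON=1>, #<StateMachine::STARTED=1> ]
StateMachine.enums_for_tag(2) #=> [ #<StateMachine::OFF=2> ]
StateMachine.enums_for_tag(3) #=> []

The first call builds the hash; every call after that is a single Hash#[] rather than the Array#select that drove those 2,031,000 Enum#to_i calls in the profile.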

Then we rerun it in RubyProf:

Measure Mode: wall_time
Thread ID: 70271059223040
Fiber ID: 70271070182440
Total: 3.308983
Sort by: self_time

 %self      total      self      wait     child     calls  name
 10.16      0.736     0.336     0.000     0.400   140000   <Class::Protobuf::Decoder>#read_varint
  4.92      0.182     0.163     0.000     0.019    94000   Kernel#respond_to?
  4.62      1.148     0.153     0.000     0.995    72000   <Class::Protobuf::Decoder>#read_field
  3.33      0.501     0.110     0.000     0.391    72000   <Class::Protobuf::Decoder>#read_key
  3.25      0.308     0.108     0.000     0.201    72000   Protobuf::Message::Fields::ClassMethods#get_field
  2.70      0.156     0.089     0.000     0.066   166500   IO::generic_readable#readbyte
  2.37      0.078     0.078     0.000     0.000   410500   Fixnum#&
  2.34      0.119     0.077     0.000     0.042   171500   Numeric#nonzero?

Much better! The percentage saved (~25%) reported by RubyProf is accurate, but the absolute times are inflated because RubyProf hooks into Ruby to profile everything. Without RubyProf, we saw requests from the Ruby MRI 2.2.3 client drop from 1,000ms to around 750ms.

Since 750ms is still too long, we’ll take a look at the next slowest callsite: Protobuf::Decoder’s read_varint method.

# Read varint integer value from +stream+.
def self.read_varint(stream)
  value = index = 0
  begin
    byte = stream.readbyte
    value |= (byte & 0x7f) << (7 * index)
    index += 1
  end while (byte & 0x80).nonzero?
  value
end
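To see what this loop is doing, it helps to decode one value by hand. Varints are little-endian, seven bits per byte, with the high bit of each byte flagging whether another byte follows. For example, 300 encodes as the two bytes 0xAC 0x02; here is the loop above unrolled over a StringIO stream:

require 'stringio'

# 300 = 0b100101100
#   low 7 bits:  0b0101100 (0x2C), continuation bit set   -> 0xAC
#   next 7 bits: 0b0000010 (0x02), continuation bit clear -> stop
stream = StringIO.new([0xAC, 0x02].pack('C*'))

byte  = stream.readbyte        # 0xAC
value = (byte & 0x7f) << 0     # 44
byte  = stream.readbyte        # 0x02
value |= (byte & 0x7f) << 7    # 44 | 256 = 300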

The code is simple enough that any optimizations will be marginal at best. Ruby is just too slow for the number of calls we’re making, which makes this a candidate for rewriting the method in C. It turns out the authors of the ruby-protocol-buffers gem (a different Ruby protobuf gem from the one we use) released a gem called varint, which does exactly what we need.

Normally, I don’t like replacing Ruby with C, especially from a library. It’s not guaranteed that the C and Ruby code will function the same way, and both debugging and maintenance are harder. But since it’s already used by another gem and the code is self-contained, I gave it a shot.

We add a simple monkeypatch to our test case:

module Protobuf
  class FastVarint
    extend ::Varint
  end

  class Decoder
    def self.read_varint(stream)
      FastVarint.decode(stream)
    end
  end
end
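Since the worry above is that C and Ruby might not behave identically, a quick sanity check against the hand-decoded value from earlier is cheap insurance:

require 'stringio'

stream = StringIO.new([0xAC, 0x02].pack('C*'))
Protobuf::Decoder.read_varint(stream) #=> 300, now routed through FastVarint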

Rerunning RubyProf:

Measure Mode: wall_time
Thread ID: 70126242488840
Fiber ID: 70126246561280
Total: 2.722766
Sort by: self_time

 %self      total      self      wait     child     calls  name
  5.65      0.676     0.154     0.000     0.522    72000   <Class::Protobuf::Decoder>#read_field
  5.64      0.172     0.154     0.000     0.018    94000   Kernel#respond_to?
  4.55      0.283     0.124     0.000     0.159    72000   <Class::Protobuf::Decoder>#read_key
  4.04      0.302     0.110     0.000     0.192    72000   Protobuf::Message::Fields::ClassMethods#get_field
  3.48      0.250     0.095     0.000     0.156   140000   <Class::Protobuf::Decoder>#read_varint
  3.13      0.156     0.085     0.000     0.070   140000   Varint#decode
  2.86      0.078     0.078     0.000     0.000   144000   Protobuf::Field::BaseField#repeated?
  2.58      0.070     0.070     0.000     0.000   166500   StringIO#getbyte
  2.49      0.068     0.068     0.000     0.000    69000   Protobuf::Field::BaseField#setter
  2.10      0.058     0.057     0.000     0.000     1500   Kernel#caller

We got an 18% improvement by blatantly cheating and having C do the work instead. Even with all these improvements, though, our decoding time is still around 600ms.

Moving on, we take another look at Protobuf::Decoder. Unfortunately, the read_field and read_key methods are too simple for us to get any real optimization out of them. Kernel#respond_to? is used in too many places for safety, such as returning nil when an argument doesn’t respond to to_i rather than raising an error.
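The pattern looks roughly like this (illustrative only, not the gem’s exact code):

# Guard the conversion up front instead of rescuing NoMethodError later.
def coerce_tag(tag)
  return nil unless tag.respond_to?(:to_i)
  tag.to_i
end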

That really just leaves us with Protobuf::Message::Fields::ClassMethods, unless we want to optimize 50 callsites for a few milliseconds each. Looking at get_field, we see:

def get_field(name_or_tag, allow_extension = false)
  name_or_tag = name_or_tag.to_sym if name_or_tag.respond_to?(:to_sym)
  field = field_store[name_or_tag]

  if field && (allow_extension || !field.extension?)
    field
  else
    nil
  end
end

Looking at the RubyProf trace above, we already know that respond_to? is called 94,000 times total. After adding some debugging to count calls, it turns out get_field accounts for 72,000 (75%) of the calls to respond_to?, and it only calls to_sym 2,000 times. Essentially, only about 2.7% of those respond_to? calls return true.

Instead of calling respond_to? and to_sym, we can cache a map of string field name -> field and make the whole thing two O(1) hash lookups. The get_field method then looks like:

def get_field(name_or_tag, allow_extension = false)
  field = field_store[name_or_tag] || str_field_store[name_or_tag]

  if field && (allow_extension || !field.extension?)
    field
  else
    nil
  end
end
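Here, str_field_store just mirrors the symbol keys of field_store as strings. A minimal sketch of how it could be memoized (the exact upstream implementation may differ):

def str_field_store
  @str_field_store ||= field_store.each_with_object({}) do |(name, field), hash|
    # field_store is keyed by both names (symbols) and tags (integers);
    # we only need the name entries, re-keyed as strings
    hash[name.to_s] = field if name.is_a?(Symbol)
  end
end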

One last rerun of RubyProf:

Measure Mode: wall_time
Thread ID: 70107917574660
Fiber ID: 70107924939720
Total: 2.428846
Sort by: self_time

 %self      total      self      wait     child     calls  name
  6.12      0.666     0.155     0.000     0.511    72000   <Class::Protobuf::Decoder>#read_field
  4.70      0.277     0.119     0.000     0.158    72000   <Class::Protobuf::Decoder>#read_key
  3.70      0.248     0.094     0.000     0.154   140000   <Class::Protobuf::Decoder>#read_varint
  3.39      0.132     0.086     0.000     0.046    72000   Protobuf::Message::Fields::ClassMethods#get_field
  3.39      0.154     0.086     0.000     0.068   140000   Varint#decode
  2.96      0.075     0.075     0.000     0.000   144000   Protobuf::Field::BaseField#repeated?
  2.70      0.068     0.068     0.000     0.000   166500   StringIO#getbyte
  2.60      0.066     0.066     0.000     0.000    69000   Protobuf::Field::BaseField#setter

This gives us another 12% improvement, and Kernel#respond_to? dropped to 22,000 calls (1.29% self time). Overall, we optimized deserialization by 50%, and it only took a few hours of investigating with RubyProf.

RubyProf is a great tool, and you can get even more detail using a GraphPrinter, or a CallTreePrinter whose output can be loaded into KCachegrind. For our case, FlatPrinter was simple, and the code was easy enough to trace through to narrow down what to optimize. Examples of the various printers can be found in the RubyProf repo.
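Swapping printers is a one-line change; for example, GraphPrinter takes the same arguments as the FlatPrinter from our test case:

printer = RubyProf::GraphPrinter.new(result)
printer.print(STDOUT, min_percent: 1)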

We shipped all of the changes upstream into the protobuf gem. You can see the exact changes in the varint, get_field/get_extension_field, and enum optimization PRs. If you’re using the protobuf gem, v3.5.5 and higher with the varint gem (MRI only) installed will give you the full set of optimizations.

In the future, we’re going to look at moving to protobuf 3. The Ruby gem is in alpha, but it’s built on Google’s first-party C++ library. In benchmarks against the optimizations above, Google’s implementation is about 7x faster than ruby-protobuf.