Reversing Bits in C
A small performance investigation into innocent-looking code.
Written by Charles Nicholson.
Thanks to readers rjs and meteorfox in the comments, and reader lnxturtle over on the Reddit thread (!) for correcting my poor choice of words. I incorrectly used the phrase Cache Coherence when the correct phrase was (Spatial / Temporal) Cache Locality. Your attention to detail is appreciated!
The Setting
Interesting lessons can come from unexpected places! I was pleasantly surprised at how something as “simple” as reversing bits in a byte could lead me on an unexpectedly deep exploration: operation vs instruction count, memory access patterns and cache behavior, and low-level CPU instructions. It’s often very easy to make assumptions about the performance of code that we write, and I hope that this article serves as a reminder that the map is never the territory, and that the only way to understand what’s happening inside your code is by observing and measuring it.
While doing some investigations into one of our core iOS libraries, some code jumped out at me:
unsigned char ReverseBitsInByte(unsigned char v)
{
    return (v * 0x0202020202ULL & 0x010884422010ULL) % 1023;
}
Since I’m not a cyborg wizard ninja with the ability to do 64-bit multiplication and bit-twiddling in my head, I decided to try and figure out what this code was doing. A simple search on one of the constant numbers led me immediately to the famous Stanford Bit Hacks page. This, in retrospect, should have been obvious.
The description of the algorithm is: “Reverse the bits in a byte with 3 operations (64-bit multiply and modulus division)”. This confused me a little more, since all current iOS devices have 32-bit ARM CPUs. Was this code actually efficient? It has the fewest “operations”, which is probably why it was originally chosen.
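If, like me, you find the constants impenetrable at a glance, a minimal sketch like the one below (the variable names are mine, purely for illustration) prints the intermediate values for a single byte: the multiply makes several spaced-out copies of v, the AND keeps exactly one source bit from each copy, and the modulo by 1023 folds those bits back together in reversed order.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t v = 0xB2;                               /* 1011 0010 */
    uint64_t copies   = v * 0x0202020202ULL;        /* spaced-out copies of v */
    uint64_t selected = copies & 0x010884422010ULL; /* keep one bit from each copy */
    uint64_t reversed = selected % 1023;            /* fold the bits into a reversed byte */
    printf("%02llx -> %02llx\n",
           (unsigned long long)v, (unsigned long long)reversed);
    /* prints: b2 -> 4d (1011 0010 reversed is 0100 1101) */
    return 0;
}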
Being curious, I decided to take a few minutes to look at all of the approaches and learn how they actually perform. This code, like all code, runs on an actual CPU, so how do the different approaches fare on real hardware?
The Landscape
Here are all of the algorithms listed on the Bit Hacks page, with their advertised “operation counts”:
The “obvious” way (O(N) where N is the index of the highest set bit):
unsigned char ReverseBitsObvious(unsigned char v)
{
    unsigned char r = v;              /* r accumulates the reversed bits */
    int s = sizeof(v) * CHAR_BIT - 1; /* extra shifts needed at the end */
    for (v >>= 1; v; v >>= 1) {
        r <<= 1; r |= v & 1; s--;
    }
    return r << s;                    /* account for v's high-order zero bits */
}
Lookup table (O(1)):
#define R2(n) n, n + 2*64, n + 1*64, n + 3*64
#define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)
#define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )
static const unsigned char BitReverseTable256[256] =
{
    R6(0), R6(2), R6(1), R6(3)
};
unsigned char ReverseBitsLookupTable(unsigned char v)
{
    return BitReverseTable256[v];
}
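The nested macros are dense, but all they do is build the 256-entry reversed-bit table at compile time. As a quick illustrative sanity check (reusing the table above; the main function here is just for demonstration), each entry should be the bit-reversed value of its index:

#include <assert.h>

int main(void)
{
    assert(BitReverseTable256[0x00] == 0x00);
    assert(BitReverseTable256[0x01] == 0x80); /* 0000 0001 -> 1000 0000 */
    assert(BitReverseTable256[0x0F] == 0xF0); /* 0000 1111 -> 1111 0000 */
    assert(BitReverseTable256[0xB2] == 0x4D); /* 1011 0010 -> 0100 1101 */
    return 0;
}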
3 operations (O(1)) (64-bit multiply and modulus division):
unsigned char ReverseBits3ops64bit(unsigned char v)
{
    return (v * 0x0202020202ULL & 0x010884422010ULL) % 1023;
}
4 operations (O(1)) (64-bit multiply, no division):
unsigned char ReverseBits4ops64bit(unsigned char v)
{
    return ((v * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}
7 operations (O(1)) (32-bit):
unsigned char ReverseBits7ops32bit(unsigned char v)
{
    return ((v * 0x0802LU & 0x22110LU) |
            (v * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;
}
Parallel bitwise approach (O(5log(N)) where N is the number of bits):
unsigned char ReverseBits5logNOps(unsigned char v)
{
    v = ((v >> 1) & 0x55) | ((v & 0x55) << 1); /* swap adjacent single bits */
    v = ((v >> 2) & 0x33) | ((v & 0x33) << 2); /* swap adjacent bit pairs */
    v = ((v >> 4) & 0x0F) | ((v & 0x0F) << 4); /* swap nibbles */
    return v;
}
“Cheating”
Let’s take a step back here. We’re running this code on a specific family of CPUs, the ARMv6 and greater (available in all iPhones). ARMv6 and greater CPUs have an instruction called RBIT. This single instruction does exactly what we need: it reverses the bits in a 32-bit register. Unfortunately, while a lot of CMSIS implementations provide an RBIT intrinsic, it doesn’t look like one is provided with Xcode. It’s easy enough to drop down into inline assembly, though, and call the instruction ourselves:
unsigned char ReverseBitsRBIT(unsigned char v)
{
    uint32_t input = v;
    uint32_t output;
    __asm__("rbit %0, %1\n" : "=r"(output) : "r"(input));
    return output >> 24;
}
Let’s add this approach to our collection and see how it performs.
Note: Intel x86/x64 processors don’t have this instruction, so this is definitely not a portable solution. (That’s why I call it cheating!)
The Shoot-out
Here are the timings for reversing the bits in a byte, performed 50 million times each.
Starting tests...
ReverseBitsLookupTable... 10.058261ns per function call
ReverseBitsRBIT... 10.123462ns per function call
ReverseBits7ops32bit... 17.453080ns per function call
ReverseBits5logNOps... 20.054218ns per function call
ReverseBits4ops64bit... 21.203815ns per function call
ReverseBitsObvious... 65.809257ns per function call
ReverseBits3ops64bit... 509.621657ns per function call
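The actual benchmark I used is in the repository linked at the end of this article. As a rough sketch of the shape such a measurement takes (assuming a POSIX clock_gettime is available; on older iOS you'd reach for mach_absolute_time instead), the loop looks something like this. Numbers gathered this way include loop and call overhead, so they're only meaningful relative to each other.

#include <stdio.h>
#include <time.h>

#define ITERATIONS 50000000

static volatile unsigned char g_sink; /* keeps the calls from being optimized away */

static double TimePerCallNs(unsigned char (*fn)(unsigned char))
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; ++i) {
        g_sink = fn((unsigned char)i);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec);
    return ns / ITERATIONS;
}

int main(void)
{
    /* reuses ReverseBitsLookupTable defined above; time the others the same way */
    printf("ReverseBitsLookupTable... %fns per function call\n",
           TimePerCallNs(ReverseBitsLookupTable));
    return 0;
}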
Observations of interest:
- ReverseBits3ops64bit was 50x slower than the two fastest algorithms, despite taking O(1) time!
- ReverseBitsObvious (O(N), remember) was only about 3–6.5x slower than the O(1) solutions.
- ReverseBits5logNOps (O(logN)) was slightly faster than the O(1) ReverseBits4ops64bit.
- ReverseBitsLookupTable was ever-so-slightly faster than ReverseBitsRBIT. Why?
The Lesson
So what’s the takeaway here? What have we learned from this experiment?
Choosing an algorithm based on the number of “operations” is a nonsensical approach. ReverseBits3ops64bit was the clear loser because not only did it have to do some large 64-bit math, it also needed to do a 64-bit modulo with a non-power-of-2 divisor. That’s one mathematical operation, but a large number of CPU instructions. CPU instructions are what matter here, though, as we’ll see, even they matter less than cache locality.
Asymptotic Big-O analysis doesn’t tell the whole story. It’s very easy to simplify and select algorithms based on their asymptotic behavior. Unfortunately, many algorithms aren’t actually dominated by their asymptotic behavior until N gets very large. In many cases, and certainly in the one we’re studying here, performance is dominated largely by instruction count and cache access patterns. ReverseBits5logNOps runs in O(logN) time, and yet is faster than the O(1) ReverseBits4ops64bit implementation, which doesn’t even do any expensive (non-power-of-2 modulo) math! Asymptotic analysis is crucial when dealing with huge data sets, but it is misleading here.
Cache locality is crucial. ReverseBitsLookupTable was tied for first place because the 256-byte lookup table fits entirely in D-cache. If we were to evict the table from the cache between each iteration, we would pay the time penalty for the cache miss and subsequent reload.
Take advantage of your specific CPU! RBIT will win in the general case here because it’s a single instruction with low latency and high throughput. It does no memory access, so we don’t have to worry about cache misses, and it requires no extra storage, unlike the ReverseBitsLookupTable solution. If you ever need to reverse a full 32-bit word, RBIT will take the same amount of time. In fact, our ReverseBitsRBIT function would be even faster when reversing a full register-sized 32-bit word, since the final right shift could be eliminated.
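For illustration, a full-word variant might look like the sketch below (the function name is mine); the only difference from ReverseBitsRBIT is that the final shift disappears:

#include <stdint.h>

uint32_t ReverseBits32RBIT(uint32_t v)
{
    uint32_t output;
    __asm__("rbit %0, %1" : "=r"(output) : "r"(v));
    return output; /* all 32 bits are meaningful, so no trailing shift */
}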
Unfortunately, I wasn’t able to figure out a way to get Apple’s LLVM compiler to optimize a C loop down into a single RBIT instruction. This isn’t surprising, as it’s a fairly complex pattern for an optimizing compiler to recognize and strength-reduce. Compilers are getting smarter every day, but in the meantime, it’s always good to know how to do this kind of manual work when necessary.
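In practice, one way to keep the portable and ARM-specific paths together is a small compile-time guard. This is just a sketch of the idea (the function name is mine), assuming __arm__ is defined for the ARM targets we care about and reusing the lookup table from earlier as the fallback:

#include <stdint.h>

unsigned char ReverseBitsInByteFast(unsigned char v)
{
#if defined(__arm__)
    /* ARM build: one RBIT instruction; the reversed byte lands in the top 8 bits. */
    uint32_t output;
    __asm__("rbit %0, %1" : "=r"(output) : "r"((uint32_t)v));
    return output >> 24;
#else
    /* Everything else: fall back to the 256-byte lookup table. */
    return BitReverseTable256[v];
#endif
}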
If you’re curious about this and want to dive in and run your own experiments, I’ve posted the test code I used to gather my results (as well as the cleaned-up assembly output emitted by the various functions), all on my public GitHub space. Fork away, and let me know what you discover!