An Unusual and Fast Bit Reverse Routine

A function to reverse the order of bits in a byte makes for a good interview question. The candidate has to think about loops, shifting and masking as well as the mundane but important issues of function and variable naming, data types, spacing and comments. As opposed to those "gotcha" questions apparently beloved by the tech giants where you have to know some trick, this can be worked through methodically. Once there is a solution on the table, there is an opportunity to talk about optimization for speed and/or space.

Simple Approach

There are lots of ways to approach this problem, but I suspect that many candidates will reach for this simple idea: test each bit of the input in turn and set the LSB of the output, shifting the output and the mask up one place on each pass of the loop.

//bit reverse, e.g. 0x80 returns 0x01
//simple and straightforward approach
uint8_t bitreverse_simple(uint8_t input)
{
    uint8_t i, output, mask;

    output = 0;
    mask = 0x01;
    for (i = 8;i > 0;i--)
    {
        output <<= 1;
        if (input & mask)
        {
            output |= 1;
        }
        mask <<= 1;
    }
    return output;
}

Testing The Simple Approach

First we need to check that this produces correct results by running it through unit tests. Using the previously described framework for running MS Test in Visual Studio on C files, it is easy to set up tests like these:

TEST_CLASS(BitReverserSimpleUnitTest)
    {
    public:
        
        TEST_METHOD(MSBOnly)
        {
            uint8_t in, exp;

            in = 0x80;
            exp = 0x01;
            Assert::AreEqual((uint32_t)exp, (uint32_t)bitreverse_simple(in));
        };

        TEST_METHOD(LSBOnly)
        {
            uint8_t in, exp;

            in = 0x01;
            exp = 0x80;
            Assert::AreEqual((uint32_t)exp, (uint32_t)bitreverse_simple(in));
        };

        TEST_METHOD(TwoMSBs)
        {
            uint8_t in, exp;

            in = 0xC0;
            exp = 0x03;
            Assert::AreEqual((uint32_t)exp, (uint32_t)bitreverse_simple(in));
        };

        TEST_METHOD(ThreeLSBs)
        {
            uint8_t in, exp;

            in = 0x07;
            exp = 0xE0;
            Assert::AreEqual((uint32_t)exp, (uint32_t)bitreverse_simple(in));
        };
    };

To convert these tests to minunit format, use my TestFormatter application. Once they are in minunit format they can just as easily be run on the embedded micro with the results visible over a serial bus or LED.

Thinking about optimization

It doesn't immediately look like there is much to gain in this routine: it is already very concise, and it has no code that could be precomputed and hoisted outside the loop, which is always a big win. But previous reading about the cost of branch misprediction suggested that eliminating the branch would be worthwhile.

No Branch Approach

//eliminates branch, should be slightly faster
uint8_t bitreverse_nobranch(uint8_t input)
{
    uint8_t i, output, mask;
    output = 0;
    mask = 0x01;
    for (i = 8;i > 0;i--)
    {
        output <<= 1;
        output |= (input & mask) > 0;
        mask <<= 1;
    }
    return output;
}

Testing the No Branch Approach

This is such a slight variation on the simple case that it hardly seems worth testing for correctness. But having set up the test frameworks, it is so easy to add tests for this version that it can be done quickly. And we can use the simple version as a reference to run an exhaustive test against all 256 values.

TEST_METHOD(Exhaustive)
        {
            uint8_t in, exp, i;

            i = 0xFF;
            do {
                in = i;
                exp = bitreverse_simple(i);
                Assert::AreEqual((uint32_t)exp, (uint32_t)bitreverse_nobranch(in));
            } while (i--);
        };

This is one of those instances where a do-while structure is preferable to a for loop: it covers all possible values without needing a wider loop variable or an extra if statement, because the condition is tested after the body has run.

Compiling with GCC for STM32 with the optimizer set to None (-O0) shows that we have actually made things 18% slower than the simple approach, which sounds like the wrong direction. But switching to Optimize for Speed (-Ofast) makes it take about half the time of the simple approach (0.1μs vs 0.193μs), which is a substantial improvement.

Shifting Masks Instead Of Output

I had the idea that instead of shifting the output, it might be better to shift input and output mask values. It is only a slightly different approach, but worth testing; at least we would then know whether it is a good idea. It can be difficult to tell in advance whether a small difference like this will make it easier for the optimizer to do its work.

uint8_t bitreverse_shiftmask(uint8_t input)
{
    uint8_t i, output, imask, omask;
    output = 0;
    imask = 0x01;
    omask = 0x80;
    for (i = 8;i > 0;i--)
    {
        if (input & imask)
            output |= omask;        
        imask <<= 1;
        omask >>= 1;
    }
    return output;
}

Testing the Shift Mask Approach

The unit tests can be copied from the No Branch approach. When compiled at -O0, it is 2% slower than the simple case but is 15% faster when compiled at -Ofast.

Unroll the Loop

Loops with a small number of iterations can hardly be worth setting up as a loop at all. The overhead of initialising, testing and jumping around the loop can be comparable to the body itself, so it can be much faster to just write out all the iterations by hand. A sufficiently clever optimizer might do this for you, but in its absence it is straightforward to write out such a function.

//eliminates loop, should be slightly faster at expense of code space
uint8_t bitreverse_noloop(uint8_t input)
{
    uint8_t output = 0;
    
    output |= (input & 0x01) > 0;
    output <<= 1;
    output |= (input & 0x02) > 0;
    output <<= 1;
    output |= (input & 0x04) > 0;
    output <<= 1;
    output |= (input & 0x08) > 0;
    output <<= 1;
    output |= (input & 0x10) > 0;
    output <<= 1;
    output |= (input & 0x20) > 0;
    output <<= 1;
    output |= (input & 0x40) > 0;
    output <<= 1;
    output |= (input & 0x80) > 0;

    return output;
}

Testing the No Loop Approach

After running the same unit tests as before to ensure correctness, the No Loop method was found to be 41% faster at -O0 and 46% faster at -Ofast than the simple method. Of course, this comes at the expense of code space: this method was 294 bytes at -O0 but only 60 bytes at -Ofast, compared to 82 and 90 bytes for the simple approach.

Has Anyone Else Done This Before?

I suspected that this sort of bit manipulation would appear in the Bit Twiddling Hacks collection, and of course it has a few implementations.

Let's look at what they call the Obvious Way:

//from https://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious
uint8_t bitreverse_minloop(uint8_t v)
{    
    uint8_t r = v; // r will be reversed bits of v; first get LSB of v
    int s = 7; // extra shift needed at end; sizeof(v)*8 - 1;

    for (v >>= 1; v; v >>= 1)
    {
        r <<= 1;
        r |= v & 1;
        s--;
    }
    r <<= s; // shift when v's highest bits are zero
    return r;
}

Testing the Obvious/Min Loop Approach

Even widely used published code has bugs, so again I exhaustively tested this against my previous implementations before running speed tests on it. It is 18% faster than my simple approach at -O0 but 61% slower at -Ofast, despite compiling to only 32 bytes of code space.

The Lookup Table Approach

Any bit manipulation can always be done with a lookup table. It is fast because the calculation has been moved to before runtime; the tradeoff is a large table, even though the function itself looks as small as possible.

What's clever about the BitHacks implementation is that the table is generated by the C preprocessor, so it is a compact (in a lines-of-code way) representation, although it still consumes 256 bytes of const/code space.

//from https://graphics.stanford.edu/~seander/bithacks.html#BitReverseTable
static const unsigned char BitReverseTable256[256] =
{
#   define R2(n)     n,     n + 2*64,     n + 1*64,     n + 3*64
#   define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)
#   define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )
    R6(0), R6(2), R6(1), R6(3)
};

uint8_t bitreverse_lookup(uint8_t input)
{
    return BitReverseTable256[input];
}

Testing the Lookup Table Approach

In contrast to most of the other approaches, it isn't necessary to verify the correctness of this function before running speed tests, because it runs at the same speed whatever the values in the lookup table. A conventional calculation function, by contrast, isn't worth timing if its calculations are wrong.

Of course, this approach is by far the fastest, but it takes up the most code space. At -O0 it is 39x faster than the simple approach, and 7x faster at -Ofast. But it takes up around 3x the space, which might just about be acceptable for 8 bits but becomes unworkable for bigger word sizes.

Although this doesn't apply to the simpler world of STM32s without caches, there is an interesting point made by a commenter on that page: lookup tables do artificially well on microbenchmarks because they fit in cache, which doesn't necessarily reflect real-world performance. The more advanced STM32s, such as the Cortex-M7 based parts, have both instruction and data caches, so this might apply there.

Novel Approaches

The conventional approaches above give you a set of space and time tradeoffs that allow you to choose one for almost any application. But I was interested in finding novel approaches to see if there was something else which would yield almost as fast a result as the lookup table without consuming such a large code space. Is there anything faster than the unrolled loop method, or between that and the BitHacks Obvious method?

The Crumb Approach

Based on the observation that many of the BitHacks algorithms make clever use of the bitwise XOR function, I realised that working on pairs of bits would help. If a pair is same-valued (00, 11), the output should contain the same pair; if it is different-valued (01, 10), the output should contain the reversed pair, which can be generated by XOR with 0x03. The loop then only runs 4 times, operating on one pair per iteration with a 2-bit shift. A pair of bits is called a dibit or a crumb.

//operates on pairs of bits (crumbs) because 01 xor 11 gives 10
uint8_t bitreverse_crumb(uint8_t input)
{
    uint8_t i, output, p;
    output = 0;

    for (i = 8;i > 0;i -= 2)
    {
        output <<= 2;
        p = input & 0x03;
        switch (p)
        {
        case 1:
        case 2:
            output |= p ^ 0x03; //flip least significant crumb only
            break;
        default:
            output |= p; //insert crumb
            break;
        }
        input >>= 2;
    }
    return output;
}

Testing the Crumb Approach

After testing for correctness, the speed tests found that the crumb method is 34% faster than the simple approach at -O0 but 20% slower at -Ofast and yet 41% faster at -Os.

The Hardware Approach

Just as the lookup table approach is an extreme example in that the tradeoff is all on the code space side to gain on speed, there is a hardware approach which makes the same tradeoff using parallel output ports. The function reduces to just two instructions: send the byte to the output port, then read the byte back from the input port. The reversing is done by wiring output bit 7 to input bit 0, output bit 6 to input bit 1 and so on.

There aren't many situations where you can afford to dedicate two whole 8-bit ports to this inflexible bit reversing function, or where the speed is necessary, but it does scale up to 16-bit situations without the 64K code space a (simple) lookup table would require. I said this routine was unusual; this is why.

The following code omits the setup required for the ports which can be done at startup or any time prior to calling the function. Another possibility is to reuse pins for this function and something else and to switch between those by reinitialising the ports when needed, at the expense of speed, as long as the usages are not incompatible.

//requires two complete GPIO ports to be dedicated to this task, cross wired so that output[7] connects to input[0] etc
//STM32 Example
#define BITREVERSER_HW_PARALLEL_OP GPIOB->ODR
#define BITREVERSER_HW_PARALLEL_IP GPIOE->IDR

//PIC 8 bit example
//#define BITREVERSER_HW_PARALLEL_OP LATA
//#define BITREVERSER_HW_PARALLEL_IP PORTB

//assumes one port is set up as all digital outputs and the other as all digital inputs prior to this
uint8_t bitreverse_hw(uint8_t input)
{
    BITREVERSER_HW_PARALLEL_OP = input;
    return (uint8_t)BITREVERSER_HW_PARALLEL_IP;
}

Testing the Hardware Approach

Like the lookup table approach, the correctness of this lies outside the function itself, so having proved to myself that the idea is feasible using a few bits (due to the limitations of a devkit board), I moved straight on to the speed tests. Unsurprisingly for such a small routine, this is very fast: 95% faster than the simple approach at -O0 and 75% faster at -Ofast. Relative to the lookup table, it is 1.87x slower at -O0 and 1.76x slower at -Ofast.

It is the smallest code space method by far, and is the second fastest method overall.

Non-Portable Solution

Whilst researching this post, I found that some ARM cores have a dedicated rbit instruction, but using it means dropping into assembly (or an intrinsic) from your high level C program, and the function would only work on those ARMs that support it. The ultimate optimization would be the compiler recognizing the bit reversing pattern and implementing it with rbit where available.
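
As a sketch of how that might be wrapped today (the inline assembly and the architecture guard are assumptions to check against your toolchain; CMSIS also exposes this as the __RBIT intrinsic, and the fallback is the well-known BitHacks parallel swap):

```c
#include <stdint.h>

//sketch only: on recent ARM cores, rbit reverses all 32 bits of a word,
//so the reversed byte lands in the top 8 bits and needs shifting down
uint8_t bitreverse_rbit(uint8_t input)
{
#if defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    uint32_t r;
    __asm ("rbit %0, %1" : "=r" (r) : "r" ((uint32_t)input));
    return (uint8_t)(r >> 24);
#else
    //portable fallback: BitHacks parallel swap (nibbles, pairs, single bits)
    uint8_t v = input;
    v = (uint8_t)(((v & 0xF0) >> 4) | ((v & 0x0F) << 4));
    v = (uint8_t)(((v & 0xCC) >> 2) | ((v & 0x33) << 2));
    v = (uint8_t)(((v & 0xAA) >> 1) | ((v & 0x55) << 1));
    return v;
#endif
}
```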

Results Table

These results were calculated by measuring the time taken to execute all 256 values in a loop, subtracting the overhead of the loop and function call, then dividing by the number of iterations. Code was run on an STM32F407 at 168MHz.

Method       -O0 μs   -O0 bytes   -Ofast μs   -Ofast bytes   -Os μs   -Os bytes
lookup       0.08     288         0.03        268            0.02     268
hw stm32     0.14     44          0.05        20             0.04     16
no loop      1.74     294         0.10        60             0.11     60
crumb        1.96     100         0.23        90             0.40     42
min loop     2.43     94          0.31        32             0.44     28
shift mask   3.04     86          0.16        80             0.63     38
simple       2.98     82          0.19        90             0.68     38
no branch    3.50     92          0.10        60             0.68     38

Thanks to Maurice Calvert for his Excel to HTML table format code.

Bit reverser results graph

Contrary to the expectation that a Windows laptop with a dual core Intel i5-5200U running at 2.2GHz (a clock 13 times faster than the STM32's) would run these tests faster, it actually runs them about 10 times slower due to Windows and Visual Studio overheads. For example, No Branch takes 985μs for 256 iterations including overheads on the STM32, compared to 10ms on Windows. Then again, the laptop is typically running over 200 processes and 4000 threads whilst doing this.

Visual Studio showing MSTest results with all tests passing