01 may 2021 @ justine's web page

The Byte Order Fiasco

One of the most challenging topics in the C and C++ programming languages is how to handle endianness properly. There's a surprising amount of depth here. I've been programming in C for a while, and I feel like I keep learning about this subject, even when I thought I'd seen it all.

Let's say we want to deserialize a 32-bit integer off the wire, using code that'll work on machines like IBM's s390x mainframes (one of the few big endian chips still in play). Here's the naive approach:

#include <stdio.h>
#include <stdint.h>
#include <byteswap.h>
#if (defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__) || \
    defined(__s390x__) || defined(__ppc__) || defined(__PPC__) ||          \
    defined(__powerpc__) || defined(__POWERPC__) ||                        \
    defined(__BIG_ENDIAN__) || defined(__ARMEB__) ||                       \
    defined(__THUMBEB__) || defined(__AARCH64EB__) ||                      \
    defined(_MIPSEB) || defined(__MIPSEB) || defined(__MIPSEB__)
#define READ32BE(p) (*(uint32_t *)(p))
#else
#define READ32BE(p) bswap_32(*(uint32_t *)(p))
#endif
char b[4] = {0x80,0x02,0x03,0x04};
int main(int argc, char *argv[]) { printf("%08x\n", READ32BE(b)); }

Aside from being ugly, the code above is wrong. Modern compiler policies don't even permit code that looks like that anymore. Your compiler might see that and emit assembly that formats your hard drive with btrfs. That's the thing about behaviors which are undefined according to the C standard.

The compiler benchmark wars have been very competitive ever since the GNU vs. Apple/Google schism of the past ten years, and Clang and GCC are reaching for any optimization they can get. Undefined Behavior may be hostile and dangerous, but it's legal, so don't let your code become a casualty. In fact, compilers won't even warn you when they rewrite your program in unexpected ways because it doesn't conform to the C standard. Someone will probably build an X-Ray for these silent rewrites one of these days, and there'll be a Dumb Ways to Die video about undefined behavior that finally causes things to change. But until that day, we have an excellent tool for avoiding these problems, and it's called UBSAN.

$ cc -g -Os -Wall \
     -fsanitize=undefined \
     -pedantic -fstrict-aliasing -Wstrict-aliasing=3 \
     -o /tmp/o endian.c
$ /tmp/o
endian.c:30:3: runtime error: load of misaligned address 0x5575a79fc05f for type 'uint32_t',
               which requires 4 byte alignment
0x5575a79fc05f: note: pointer points here
 5f 74 27 00 80  02 03 04 00
             ^
80020304

So how do we deserialize an int properly? Rob Pike has a good blog post talking about how easy it is. The solution he recommends is as follows:

char b[4] = {0x80,0x02,0x03,0x04};
#define READ32BE(p) p[3] | p[2]<<8 | p[1]<<16 | p[0]<<24
int main(int argc, char *argv[]) { printf("%08x\n", READ32BE(b)); }

Huge improvement! But it still has undefined behavior. Assuming char defaults to signed (i.e. -fsigned-char), we get:

$ cc -fsanitize=undefined -g -Os -o /tmp/o endian.c && /tmp/o
endian.c:19:53: runtime error: left shift of negative value -128
80020304

Let's say we change the data so that undefined behavior goes away. (In this context, 0x80 is the same as -128, a.k.a. CHAR_MIN.)

char b[4] = {0x02,0x03,0x04,0x80};
#define READ32BE(p) p[3] | p[2]<<8 | p[1]<<16 | p[0]<<24
int main(int argc, char *argv[]) { printf("%08x\n", READ32BE(b)); }

If we do this, then something even worse happens. Rather than a crash or a runtime warning, we get an incorrect result, because of sign extension.

$ cc -fsanitize=undefined -g -Os -o /tmp/o endian.c && /tmp/o
ffffff80

So the solution is simple, right? Let's just use unsigned char instead. Sadly, no, because unsigned char in C expressions gets promoted to the signed type int. So if we say 0x80<<24, the shift pushes a one bit into the sign bit, which is undefined behavior according to the C standard.

$ cc -fsanitize=undefined -g -Os -o /tmp/o endian.c && /tmp/o
endian.c:19:53: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'
80020304
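
If the promotion rule seems surprising, here's a tiny sketch (the variable name is just for illustration) that makes it visible: an unsigned char operand turns into a plain signed int the moment it appears in an arithmetic expression.

#include <stdio.h>

int main(void) {
  unsigned char byte = 0x80;
  // The operand is promoted to int before any arithmetic happens,
  // so the result of an expression like byte << 0 has int's width.
  printf("%zu\n", sizeof byte);         // 1
  printf("%zu\n", sizeof (byte << 0));  // sizeof (int), e.g. 4
  // byte << 24 would therefore shift a one bit into the sign bit
  // of a 32-bit int, which is the undefined behavior shown above.
  return 0;
}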

That part of the C standard is the legacy of companies like UNIVAC, who didn't agree with John von Neumann's design doc for EDVAC from the Second World War, which very clearly laid out that the plan was to have 32-bit little endian two's complement integers. Imagine that. Even Seymour Cray himself admitted that when he chose ones' complement, it was because he hadn't yet understood the full meaning of John von Neumann's design at the time.

So here's the secret to writing good generic C code:

Mask, and then shift.
Mask, and then shift.
Mask, and then shift.
Mask, and then shift.

Repeat it like a mantra. You mask first, to define away potential concerns about signedness. Then you cast, if needed. And finally, you can shift. Now we can create the foolproof version for machines with at least 32-bit ints:

#define READ32BE(p) \
  (uint32_t)(255 & p[0]) << 24 | (255 & p[1]) << 16 | (255 & p[2]) << 8 | (255 & p[3])
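
Plugging that back into our little test program, everything is now well defined. Compiled with -fsanitize=undefined as before, it should run without complaints and print 80020304:

#include <stdio.h>
#include <stdint.h>
#define READ32BE(p) \
  (uint32_t)(255 & p[0]) << 24 | (255 & p[1]) << 16 | (255 & p[2]) << 8 | (255 & p[3])
char b[4] = {0x80,0x02,0x03,0x04};
int main(int argc, char *argv[]) { printf("%08x\n", READ32BE(b)); }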

Another cool trick is we can make the code look more elegant using octal notation:

#define READ16LE(S) ((255 & (S)[1]) << 8 | (255 & (S)[0]))
#define READ16BE(S) ((255 & (S)[0]) << 8 | (255 & (S)[1]))
#define READ32LE(S)                                                    \
  ((uint32_t)(255 & (S)[3]) << 030 | (uint32_t)(255 & (S)[2]) << 020 | \
   (uint32_t)(255 & (S)[1]) << 010 | (uint32_t)(255 & (S)[0]) << 000)
#define READ32BE(S)                                                    \
  ((uint32_t)(255 & (S)[0]) << 030 | (uint32_t)(255 & (S)[1]) << 020 | \
   (uint32_t)(255 & (S)[2]) << 010 | (uint32_t)(255 & (S)[3]) << 000)
#define READ64LE(S)                                                    \
  ((uint64_t)(255 & (S)[7]) << 070 | (uint64_t)(255 & (S)[6]) << 060 | \
   (uint64_t)(255 & (S)[5]) << 050 | (uint64_t)(255 & (S)[4]) << 040 | \
   (uint64_t)(255 & (S)[3]) << 030 | (uint64_t)(255 & (S)[2]) << 020 | \
   (uint64_t)(255 & (S)[1]) << 010 | (uint64_t)(255 & (S)[0]) << 000)
#define READ64BE(S)                                                    \
  ((uint64_t)(255 & (S)[0]) << 070 | (uint64_t)(255 & (S)[1]) << 060 | \
   (uint64_t)(255 & (S)[2]) << 050 | (uint64_t)(255 & (S)[3]) << 040 | \
   (uint64_t)(255 & (S)[4]) << 030 | (uint64_t)(255 & (S)[5]) << 020 | \
   (uint64_t)(255 & (S)[6]) << 010 | (uint64_t)(255 & (S)[7]) << 000)

Now you might be looking at the code above and thinking: surely that's the slowest thing imaginable. Here's a jewel from the TensorFlow codebase expressing that sentiment:

// Fall back on a non-optimized implementation on other big-endian targets.
// This code swaps one byte at a time and is probably an order of magnitude
// slower.
#define BYTE_SWAP_64(x)                                                      \
  ((((x)&0x00000000000000ffUL) << 56) | (((x)&0x000000000000ff00UL) << 40) | \
   (((x)&0x0000000000ff0000UL) << 24) | (((x)&0x00000000ff000000UL) <<  8) | \
   (((x)&0x000000ff00000000UL) >>  8) | (((x)&0x0000ff0000000000UL) >> 24) | \
   (((x)&0x00ff000000000000UL) >> 40) | (((x)&0xff00000000000000UL) >> 56))

That might have been true with old compilers, but it's not true today. The more verbosely well-defined your code is, the smaller the code a good modern compiler will generate. Here's what we get if we run it through clang.godbolt.org:

read32le(char*):
	mov	(%rdi),%eax
	ret
byte_swap_64(unsigned long):
	mov	%rdi,%rax
	bswap	%rax
	ret
read32be(char*):
	mov	(%rdi),%eax
	bswap	%eax
	ret
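
If you want to reproduce that listing yourself, it's enough to wrap the macros in ordinary functions so there's a symbol for the compiler to emit. Here's a sketch of what to paste into the compiler explorer (the function signatures are assumptions chosen to match the labels above); build it with optimization enabled, e.g. -O2:

#include <stdint.h>

#define BYTE_SWAP_64(x)                                                      \
  ((((x)&0x00000000000000ffUL) << 56) | (((x)&0x000000000000ff00UL) << 40) | \
   (((x)&0x0000000000ff0000UL) << 24) | (((x)&0x00000000ff000000UL) <<  8) | \
   (((x)&0x000000ff00000000UL) >>  8) | (((x)&0x0000ff0000000000UL) >> 24) | \
   (((x)&0x00ff000000000000UL) >> 40) | (((x)&0xff00000000000000UL) >> 56))
#define READ32LE(S)                                                    \
  ((uint32_t)(255 & (S)[3]) << 030 | (uint32_t)(255 & (S)[2]) << 020 | \
   (uint32_t)(255 & (S)[1]) << 010 | (uint32_t)(255 & (S)[0]) << 000)
#define READ32BE(S)                                                    \
  ((uint32_t)(255 & (S)[0]) << 030 | (uint32_t)(255 & (S)[1]) << 020 | \
   (uint32_t)(255 & (S)[2]) << 010 | (uint32_t)(255 & (S)[3]) << 000)

uint32_t read32le(char *p) { return READ32LE(p); }
uint32_t read32be(char *p) { return READ32BE(p); }
unsigned long byte_swap_64(unsigned long x) { return BYTE_SWAP_64(x); }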

So we can finally delete the ifdef soup, and I'd call that progress.

What if we want to serialize integers to the wire? Using our new knowledge, that becomes easy too. Here's how it can be done with a mempcpy-style API:

#define WRITE16LE(P, V)                        \
  ((P)[0] = (0x00000000000000FF & (V)) >> 000, \
   (P)[1] = (0x000000000000FF00 & (V)) >> 010, (P) + 2)
#define WRITE16BE(P, V)                        \
  ((P)[0] = (0x000000000000FF00 & (V)) >> 010, \
   (P)[1] = (0x00000000000000FF & (V)) >> 000, (P) + 2)
#define WRITE32LE(P, V)                        \
  ((P)[0] = (0x00000000000000FF & (V)) >> 000, \
   (P)[1] = (0x000000000000FF00 & (V)) >> 010, \
   (P)[2] = (0x0000000000FF0000 & (V)) >> 020, \
   (P)[3] = (0x00000000FF000000 & (V)) >> 030, (P) + 4)
#define WRITE32BE(P, V)                        \
  ((P)[0] = (0x00000000FF000000 & (V)) >> 030, \
   (P)[1] = (0x0000000000FF0000 & (V)) >> 020, \
   (P)[2] = (0x000000000000FF00 & (V)) >> 010, \
   (P)[3] = (0x00000000000000FF & (V)) >> 000, (P) + 4)
#define WRITE64LE(P, V)                        \
  ((P)[0] = (0x00000000000000FF & (V)) >> 000, \
   (P)[1] = (0x000000000000FF00 & (V)) >> 010, \
   (P)[2] = (0x0000000000FF0000 & (V)) >> 020, \
   (P)[3] = (0x00000000FF000000 & (V)) >> 030, \
   (P)[4] = (0x000000FF00000000 & (V)) >> 040, \
   (P)[5] = (0x0000FF0000000000 & (V)) >> 050, \
   (P)[6] = (0x00FF000000000000 & (V)) >> 060, \
   (P)[7] = (0xFF00000000000000 & (V)) >> 070, (P) + 8)
#define WRITE64BE(P, V)                        \
  ((P)[0] = (0xFF00000000000000 & (V)) >> 070, \
   (P)[1] = (0x00FF000000000000 & (V)) >> 060, \
   (P)[2] = (0x0000FF0000000000 & (V)) >> 050, \
   (P)[3] = (0x000000FF00000000 & (V)) >> 040, \
   (P)[4] = (0x00000000FF000000 & (V)) >> 030, \
   (P)[5] = (0x0000000000FF0000 & (V)) >> 020, \
   (P)[6] = (0x000000000000FF00 & (V)) >> 010, \
   (P)[7] = (0x00000000000000FF & (V)) >> 000, (P) + 8)
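
Since each macro evaluates to the advanced pointer, writes can be chained the same way mempcpy calls can. Here's a quick sketch (the buffer and values are just for illustration) showing two big-endian writes back to back:

#include <stdio.h>

#define WRITE32BE(P, V)                        \
  ((P)[0] = (0x00000000FF000000 & (V)) >> 030, \
   (P)[1] = (0x0000000000FF0000 & (V)) >> 020, \
   (P)[2] = (0x000000000000FF00 & (V)) >> 010, \
   (P)[3] = (0x00000000000000FF & (V)) >> 000, (P) + 4)

int main(void) {
  unsigned char buf[8], *p = buf;
  p = WRITE32BE(p, 0x80020304);  // evaluates to buf + 4
  p = WRITE32BE(p, 0x05060708);  // so the next write picks up where the last left off
  for (int i = 0; i < 8; ++i) printf("%02x ", buf[i]);
  printf("\n");  // 80 02 03 04 05 06 07 08
}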

If you program in C long enough, stuff like this becomes second nature, and it starts to feel almost inappropriate to even have macros like the ones above, since the idiom might be more appropriately inlined into the specific code. There have simply been too many APIs introduced over the years for solving this problem. To name a few for 32-bit byte swapping alone: bswap_32, htobe32, htole32, be32toh, le32toh, ntohl, and htonl, all of which have pretty much the same meaning.

Now you don't need to use those APIs, because you know the secret. This blog post covers most of the dark corners of C, so if you've understood what you've read so far, you're already practically a master of the language, which is otherwise remarkably simple and beautiful.

see also

The byte order fallacy