Oct 8th, 2022

Cosmopolitan Threads

Cosmopolitan Libc makes C a build-once run-anywhere language. It lets you write idiomatic POSIX code on Linux and have your binaries just work when copied to other operating systems. I'm proud to announce that we've implemented POSIX threads support and used it to build a multithreaded web server that sysadmins worldwide have been battle testing for the last few weeks, so I feel confident in saying that our threads support is production worthy and ready for use.

Background

Cosmopolitan is the C library that powers redbean, the third most upvoted hobby project in the history of Hacker News, which The Register called "Unbelievably Clever". If you check out Awesome Cosmo, you can see we've built up a nice community around the project that's using Cosmopolitan to make software in general better. The biggest blocker to helping more folks benefit from Cosmopolitan has always been the lack of POSIX threads support, so we've spent a lot of time addressing that. More importantly, we eat our own dogfood too, since the best way to showcase our tools is to use them to build things ourselves.

IPv4 Games

TurfWar is a web app I rewrote for a friend over the last few weeks using Cosmopolitan's new POSIX threads library. It's written as a single .c file with its own homebrew HTTP/1.1 server, and it runs ipv4.games. It's currently behind a frontend, but you can access our server directly by visiting the subdomain turfwar.ipv4.games. There's also live monitoring data on the /statusz endpoint.

So far Cosmopolitan has held up well in prod. We've seen 100,000 daily active visitors. System administrators around the world have rallied to send us oodles of traffic. It runs on a cheap two core VM, because I live off GitHub sponsorships, so it wouldn't be possible for me to host a free game using tools like Kubernetes that cost outrageous amounts of money. Altavista ran an entire search engine on a single computer that serviced requests for the whole world. That was in the late 1990's, and hardware is so much faster today, so there's no reason why we can't do the same for a simple game that handles 90,000 requests per minute.

[United States of Lemuria - two dollar bill - all debts public and primate]

One of the nice things about having a frontend web server is that it lets you get away with a really simple backend that doesn't need to worry about the complex aspects of HTTP serving: things like SSL, HTTP/0.9, provisional responses, and expectations. Frontends also shape the traffic somewhat, so we can afford to ignore most system call corner cases too. So I whipped up a one-off web server from scratch just for this app, which uses SQLite as its database and is about a thousand lines of code. What I can't believe is just how fast it is.

For comparison, here's how fast NGINX goes on my local dev machine. The important metric to notice here is the 99th percentile latency, which is 7.42 milliseconds. That's for serving a simple static HTML page.

$ wrk --latency -t 12 -c 120 http://127.0.0.1/tool/net/demo/index.html
Running 10s test @ http://127.0.0.1/tool/net/demo/index.html
  12 threads and 120 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.85ms    1.56ms  22.76ms   89.25%
    Req/Sec    33.19k     6.37k   51.66k    68.75%
  Latency Distribution
     50%  172.00us
     75%    0.89ms
     90%    2.55ms
     99%    7.42ms
  1749435 requests in 4.42s, 1.86GB read
Requests/sec: 395758.94
Transfer/sec:    431.40MB

I've always bragged about how redbean outperforms NGINX. What makes it so impressive is that redbean here is serving compressed responses and using a heavyweight fork() based i/o model (rather than NGINX's popular evented i/o model), yet redbean still manages to have 5x better 99th percentile latency than NGINX:

$ wrk --latency -t 12 -c 120 -H 'Accept-Encoding: gzip' \
    http://127.0.0.1:8080/tool/net/demo/index.html
Running 10s test @ http://127.0.0.1:8080/tool/net/demo/index.html
  12 threads and 120 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   167.22us    2.91ms 200.78ms   99.68%
    Req/Sec    54.77k    25.77k  120.18k    60.29%
  Latency Distribution
     50%   49.00us
     75%   75.00us
     90%  118.00us
     99%    1.49ms
  3393355 requests in 5.21s, 2.64GB read
Requests/sec: 651185.38
Transfer/sec:    518.55MB

But those are reads. The truly expensive HTTP endpoints are the ones that do writes, and the vast majority of traffic sent to our IPv4 Games server has been directed to its /claim registration endpoint, which is responsible for ensuring that a user's name and IP address get inserted into the SQLite database, or returning an error if the server is too congested. Performance here is critical, since operating a public write endpoint is like begging to be DDOS'd, and Cloudflare's proxy isn't really able to do much to protect it. Here we see that on my local dev machine the TurfWar HTTP server is able to handle 284,316 writes per second, and most outstandingly, it does so with 219 microsecond 99th percentile latency. That's more than thirty times faster than NGINX!

$ git clone https://github.com/jart/cosmopolitan
$ cd cosmopolitan
$ make -j16 m=fastbuild o/fastbuild/net/turfwar/turfwar.com
$ mkdir /opt/turfwar
$ sqlite3 /opt/turfwar/db.sqlite3 <net/turfwar/schema.sql
$ o/fastbuild/net/turfwar/turfwar.com &
$ wrk --latency -t 12 -c 100 'http://127.0.0.1:8080/claim?name=justine'
Running 10s test @ http://127.0.0.1:8080/claim?name=justine
  12 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.69us    1.33ms 101.61ms   99.86%
    Req/Sec    46.33k    28.31k  119.92k    67.78%
  Latency Distribution
     50%   22.00us
     75%   28.00us
     90%   41.00us
     99%  219.00us
  1679333 requests in 5.91s, 779.95MB read
  Socket errors: connect 0, read 3, write 0, timeout 0
  Non-2xx or 3xx responses: 3
Requests/sec: 284316.40
Transfer/sec:    132.05MB

Before this project, I never could have imagined that a SQLite write endpoint could go faster than ~4,000 messages per second, because file locks aren't very fast. The creator of IPv4 Games was actually on the verge of shutting it down, because the service simply got too much traffic. So we were under a lot of pressure to find a more efficient solution. The brand new Cosmopolitan Libc threads support gave us an opportunity to do just that.

Practice

How did we do it? Earlier I mentioned the Altavista search engine, which decades ago was Google's strongest competitor and somehow managed to operate an entire search engine on a single computer. That's inconceivable compared to some of the bloated junk we're forced to live with today, so it clearly must have taken God-tier coding skills for Altavista to have squeezed that much value out of a single server. As it turns out, Mike Burrows, the engineer behind Altavista, more recently wrote a library called *NSYNC. It's probably the most important piece of software you've never heard of, since its GitHub project has only 222 stars.

Cosmopolitan Libc uses *NSYNC to implement the synchronization primitives required by the POSIX standard. If you use our pthread locking APIs then you're actually using *NSYNC. It's even better to use *NSYNC directly, since its APIs are just a little bit nicer than the ones defined by POSIX (particularly when it comes to cancellations, which Cosmopolitan hasn't implemented in pthreads yet). So we took TurfWar, moved all its expensive operations into dedicated background threads, and then used the locking primitives to synchronize them with the foreground threads that communicate with clients.
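
To make the pattern concrete, here's a minimal sketch (not the actual turfwar.c code) of what that division of labor looks like with plain pthreads: foreground threads enqueue work under a mutex and return immediately, while one dedicated background thread drains the queue in batches. The names EnqueueClaim and ClaimWorker are made up for illustration, and the real server does its SQLite work where the comment indicates.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_SIZE 4096

struct Claim { char name[64]; unsigned ip; };

static struct Claim queue[QUEUE_SIZE];
static int count;
static bool shutting_down;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

// called from the foreground threads that talk to clients; it's cheap
// because it only touches memory, and a full queue means "too congested"
bool EnqueueClaim(const char *name, unsigned ip) {
  bool ok = false;
  pthread_mutex_lock(&mu);
  if (count < QUEUE_SIZE) {
    snprintf(queue[count].name, sizeof(queue[count].name), "%s", name);
    queue[count].ip = ip;
    ++count;
    ok = true;
    pthread_cond_signal(&nonempty);
  }
  pthread_mutex_unlock(&mu);
  return ok;
}

// runs on a dedicated background thread; drains the whole queue at once
// so the expensive database work happens in batches, off the request path
void *ClaimWorker(void *arg) {
  static struct Claim batch[QUEUE_SIZE];
  (void)arg;
  for (;;) {
    pthread_mutex_lock(&mu);
    while (!count && !shutting_down) pthread_cond_wait(&nonempty, &mu);
    int n = count;
    memcpy(batch, queue, n * sizeof(*batch));
    count = 0;
    pthread_mutex_unlock(&mu);
    if (!n) break;  // only happens once shutting_down is set
    // ... the real server would INSERT the batch into SQLite here,
    //     inside a single transaction ...
    printf("flushed %d claims\n", n);
  }
  return 0;
}

int main(void) {
  pthread_t worker;
  pthread_create(&worker, 0, ClaimWorker, 0);
  EnqueueClaim("justine", 0x7f000001);
  pthread_mutex_lock(&mu);
  shutting_down = true;
  pthread_cond_signal(&nonempty);
  pthread_mutex_unlock(&mu);
  pthread_join(worker, 0);
}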

The TurfWar source code is available in //net/turfwar/turfwar.c and it's received a lot of positive feedback so far. For example, Valtteri Koskivuori reached out and told us, "I do have to say, this is the coolest C code I've ever read." A lot of that's due to how Cosmopolitan Libc makes C fresh and cool again by abstracting portability toil at runtime rather than at compile time. That means our code doesn't need the mountains of ugly #ifdef statements that other C projects do.

The big selling point of Cosmopolitan, of course, is that it does approximately 85 percent of the legwork needed to get the binaries you compile on Linux to run on six other operating systems too. Even though with TurfWar we only ever intended to run it in production on Linux, that's still a nice trick to have up our sleeves. For example, I did absolutely nothing to make this app run on other platforms. I just wrote normal C code using standard POSIX APIs, and thanks to Cosmopolitan, I was able to copy the turfwar executable over to MacOS, FreeBSD, OpenBSD, and NetBSD, and it just worked. It even comes pretty close to just working as a WIN32 application.

[TurfWar HTTP server running on WIN32]

It's amazing we've made it this far, but there's still so much more to do. The two biggest known issues with our Windows support both relate to synchronization, since that's really the hardest thing in programming. It's no wonder that when Google poached their fiercest rival (the Altavista *NSYNC guy) they had him implement their lock server. For example, Cosmopolitan has a polyfill for fcntl() POSIX advisory locks on Windows. We even wrote a torture test to show it works. However SQLite isn't happy with it, and we're still working to figure out why. Hence locks are currently disabled with SQLite on Windows. It can still be used, but it can't be shared between multiple threads or processes, and it certainly shouldn't be used for production purposes. The second known issue is that *NSYNC cancellations don't work on Windows and we're not sure why. However, that one's less of a showstopper, since it just means the TurfWar server shuts down abruptly rather than gracefully when I press Ctrl-C.
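
For reference, the advisory locking interface in question is just the standard POSIX fcntl() API, sketched below. This is an ordinary portable example of taking and releasing a whole-file lock, not the Windows polyfill itself.

#include <fcntl.h>
#include <unistd.h>

// take an exclusive advisory lock on the whole file; F_SETLKW blocks
// until the lock can be granted (SQLite locks specific byte ranges
// with the same call, just with nonzero l_start and l_len)
int LockWholeFile(int fd) {
  struct flock lk = {0};
  lk.l_type = F_WRLCK;
  lk.l_whence = SEEK_SET;
  lk.l_start = 0;
  lk.l_len = 0;  // zero length means "through the end of the file"
  return fcntl(fd, F_SETLKW, &lk);
}

// release it again
int UnlockWholeFile(int fd) {
  struct flock lk = {0};
  lk.l_type = F_UNLCK;
  lk.l_whence = SEEK_SET;
  return fcntl(fd, F_SETLK, &lk);
}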

Implementation Details

One of the nice things about writing apps with redbean is that it's a forking web server where synchronization is abstracted by the system. In fact I'd say redbean supporting fork() on platforms like Windows was its biggest achievement. The only other people who've managed to accomplish that were Cygnus Solutions, and they've spoken at length about how difficult it was. Windows doesn't support inheriting copy-on-write pages, so we need to track all our memory mappings in a table, along with a separate handle for each 64kb frame. When a cosmo process calls fork(), it copies its memory to a pipe that gets sent to a subprocess, which then has to reconstruct its own memory layout, while it's running, to be the same as the parent's.

[diagram of cosmo clone() polyfill]

Supporting threads was harder. Cygwin and MSYS2 never really attempted it themselves (although there is a separate GPL-licensed winpthreads library), so what we're doing here is breaking new ground. What was nice about fork() is that Windows was the oddball, since every other OS had a turn-key solution. The same was not true with threads, because every single operating system decided to go its own way. If you read the source code to our clone() polyfill, you'll notice that each OS has a very specific incantation for spawning threads, and there doesn't appear to be any common influence between them.

POSIX threads is also the only systems API specified by POSIX that has almost no resemblance to the underlying kernel interfaces. With threads, all you really need is an API for spawning them; everything else is just atomic memory operations, which are governed more by the C standard than by POSIX. It would have been nice if POSIX had just defined clone() rather than an expansive API with 102 separate functions. Even with *NSYNC doing most of the work, it was quite a marathon to implement, especially considering that number doesn't include the non-POSIX pthreads functions, e.g. pthread_getaffinity_np and pthread_setname_np, which Cosmopolitan also supports. However, the truly difficult part turned out to be thread-local storage. It was one of those projects where I needed to spend a year thinking about how to do it before finally sitting down and coding it in a few days.
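
To illustrate that point about atomics before moving on: once pthread_create() exists, coordination like the counter below needs nothing else from POSIX, just the C11 <stdatomic.h> memory model. This is a generic example, not code from Cosmopolitan.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int counter;

// each worker bumps the shared counter with a relaxed atomic add,
// which is pure C11 and involves no pthread locking at all
static void *Worker(void *arg) {
  (void)arg;
  for (int i = 0; i < 100000; ++i) {
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
  }
  return 0;
}

int main(void) {
  pthread_t th[4];
  for (int i = 0; i < 4; ++i) pthread_create(&th[i], 0, Worker, 0);
  for (int i = 0; i < 4; ++i) pthread_join(th[i], 0);
  printf("%d\n", atomic_load(&counter));  // always prints 400000
}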

Thread-Local Storage

              %FS           %GS
Linux         Unrestricted  Unrestricted
Mac OS X      Inaccessible  Unrestricted
Windows       Inaccessible  Restricted
FreeBSD       Unrestricted  Unrestricted
NetBSD        Unrestricted  Unrestricted
OpenBSD       Unrestricted  Inaccessible

On x86 there are special registers called %fs and %gs which basically give you a free addition operation each time a memory access occurs. In the old days, they were used by DOS to manage the memory of each program, as a sort of poor man's virtual memory. Their use was then restricted once Intel found better ways to offer virtual memory, which means the operating system gets to decide whether or not a program is allowed to change them. It wasn't until later on that folks realized how useful these registers could be for thread-local storage, but their restricted nature has mostly stuck, and the restrictions just so happen to land in the most pernicious possible way.
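
To ground this, here's the kind of ordinary C code all that machinery exists to support. It's a generic illustration: with the local-exec TLS model and GCC's -mno-tls-direct-seg-refs flag, each access to the variable starts with an instruction along the lines of mov %fs:0,%rax, which is the pattern the runtime hunts for further below.

#include <pthread.h>
#include <stdio.h>

_Thread_local int g_counter;  // every thread gets its own private copy

static void *Worker(void *arg) {
  g_counter = (int)(long)arg;  // no race: this writes the thread's own copy
  printf("thread %ld sees %d\n", (long)arg, g_counter);
  return 0;
}

int main(void) {
  pthread_t a, b;
  pthread_create(&a, 0, Worker, (void *)1);
  pthread_create(&b, 0, Worker, (void *)2);
  pthread_join(a, 0);
  pthread_join(b, 0);
  return g_counter;  // the main thread's copy is still zero
}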

Cosmopolitan binaries run on seven operating systems without needing to be recompiled. But at compile time, the compiler needs to choose one of those two registers when it generates TLS code. The open source world always uses %fs for this purpose, whereas Windows and MacOS use %gs. Cosmopolitan Libc is a greenfield quasi-platform, so we should have been able to pick one, but if we'd picked %fs then we wouldn't be able to run on Windows and Mac, because neither provides an API for changing the %fs register. Worst of all, we couldn't pick %gs either, since OpenBSD doesn't provide an API for changing it.

So what we did was go with %fs by default, and if your program is loaded on Windows or MacOS then the C runtime will sort of recompile your binary in memory, on the fly, to turn the %fs instructions into %gs ones. It turns out this is relatively easy for the runtime to do, since GCC provides a -mno-tls-direct-seg-refs flag that greatly reduces the number of possible TLS byte sequences it'll generate in the executable. It's just that we don't know where they exist in the binary, so we use the following code to hunt them down.

// iterate over modifiable code looking for 9 byte instruction
// this would take 30 ms using xed to enable tls on python.com
__morph_begin();
for (p = _ereal; p + 9 <= __privileged_start; p += n) {
  w = READ64LE(p) & READ64LE("\377\373\377\307\377\377\377\377");
  if ((w == READ64LE("\144\110\213\004\045\000\000\000") ||  // mov %fs:0,%R
       w == READ64LE("\144\110\003\004\045\000\000\000")) && // add %fs:0,%R
      !p[8]) {
    // now change the code
    p[0] = 0145;                       // change %fs to %gs
    p[5] = (dis & 0x000000ff) >> 000;  // displacement
    p[6] = (dis & 0x0000ff00) >> 010;  // displacement
    p[7] = (dis & 0x00ff0000) >> 020;  // displacement
    p[8] = (dis & 0xff000000) >> 030;  // displacement
    // advance to the next instruction
    n = 9;
  } else {
    n = 1;
  }
}
__morph_end();

Now you might be thinking: surely it must be slow to iterate over the entire executable's content each time it's loaded. That turned out not to be the case, because we were able to implement a wicked fast SSE optimization. We benchmarked this trick on a ten megabyte Python executable and found that it only imposes one millisecond of program startup latency. All we had to do was add this code to the top of the loop above.

// use sse to zoom zoom to fs register prefixes
// that way it'll take 1 ms to morph python.com
while (p + 9 + 16 <= __privileged_start) {
  if ((m = __builtin_ia32_pmovmskb128(
           *(xmm_t *)p == (xmm_t){0144, 0144, 0144, 0144, 0144, 0144,
                                  0144, 0144, 0144, 0144, 0144, 0144,
                                  0144, 0144, 0144, 0144}))) {
    m = __builtin_ctzll(m);
    p += m;
    break;
  } else {
    p += 16;
  }
}

This technique allows for other tricks too. For example, on Windows, you'd think that only MSVC users are able to benefit from fast TLS, whereas everyone else would have to go through WIN32 functions like TlsAlloc() and TlsGetValue(). Those APIs are needed because WIN32 manages the TLS memory and we need to ask it to assign us a word index in its layout. But it turned out we could encode the information given to us by TlsAlloc() into the displacement value when rewriting the binary, effectively enabling us to kill two birds with one stone.

// MSVC __declspec(thread) generates binary code for this
// %gs:0x1480 abi. So long as TlsAlloc() isn't called >64
// times we should be good.
dis = 0x1480 + __tls_index * 8;

The best part is that we managed to make all of this tiny. Keeping our TLS code from being bloated was an extremely important goal, because TLS is a mandatory runtime dependency. It can't be linked conditionally; it always has to be set up by the C runtime, even in single-threaded apps, just in case something uses the _Thread_local keyword. So thanks to the year of thought I put into this before doing it, our TLS solution was distilled down to a ~1kb footprint. That ensured our flagship life.com and hello.com example programs remained only 16kb in size when built in MODE=tiny.

Futexes

If thread-local storage was the most difficult part, then the most fun part would be futexes. When incorporating *NSYNC, we had to rewrite its system call code, because *NSYNC makes the traditional assumption that system integration happens at compile time. You can read the futex() module I wrote in the Cosmopolitan source tree.

The classic way to lock memory is to just spin in a loop, reading the same variable over and over again until it changes. But that uses a lot of CPU time. So the purpose of futexes is to ask the operating system to put a thread to sleep and wait for a notification instead. What's nice is you only need to give the kernel a memory address, and you don't have to juggle things like handles or objects, so it really makes things simple for the userspace program.
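
On Linux, for example, the raw interface boils down to two operations. This is a minimal sketch using the plain futex system call (there's no libc wrapper for it), not Cosmopolitan's portable layer:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// sleeps the calling thread, but only if *addr still equals `expect`,
// which is what makes the check-then-sleep sequence race free
static long FutexWait(int *addr, int expect) {
  return syscall(SYS_futex, addr, FUTEX_WAIT, expect, 0, 0, 0);
}

// wakes up to `count` threads that are sleeping on the same address
static long FutexWake(int *addr, int count) {
  return syscall(SYS_futex, addr, FUTEX_WAKE, count, 0, 0, 0);
}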

The idea of futexes is attractive enough as an API that Linux, FreeBSD, OpenBSD, and Windows all decided to support it natively. What's also nice is that we can do a reasonably good job polyfilling it in userspace when platform support isn't available. What Cosmopolitan does on MacOS and NetBSD is just a simple loop with exponential backoff that calls clock_nanosleep() or select() until either the memory address changes or the specified deadline expires. I've also gone to the trouble of verifying that the *NSYNC unit tests pass when using this polyfill instead of the official kernel ABI.
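
Here's a rough sketch of what such a userspace fallback can look like. It's my own simplified illustration of the backoff idea, not the code Cosmopolitan actually ships:

#include <errno.h>
#include <stdatomic.h>
#include <time.h>

// spin with ever-longer naps until the word changes or the deadline passes
int FallbackWait(atomic_int *addr, int expect, const struct timespec *deadline) {
  long nanos = 1000;  // start with a 1 microsecond nap
  for (;;) {
    if (atomic_load_explicit(addr, memory_order_acquire) != expect) return 0;
    if (deadline) {
      struct timespec now;
      clock_gettime(CLOCK_REALTIME, &now);
      if (now.tv_sec > deadline->tv_sec ||
          (now.tv_sec == deadline->tv_sec && now.tv_nsec >= deadline->tv_nsec)) {
        return ETIMEDOUT;
      }
    }
    struct timespec nap = {0, nanos};
    clock_nanosleep(CLOCK_REALTIME, 0, &nap, 0);
    if (nanos < 1000000) nanos *= 2;  // exponential backoff, capped near 1 ms
  }
}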

We love futexes so much that we even added them to redbean. Your Lua code can now use mmap(MAP_SHARED) to share memory between processes. It works by returning a memory blob object with methods that implement atomics and futexes. It's a powerful low-level tool that lets you do things like create your own synchronization primitives in pure Lua. For example, mutexes can be written as follows:

mem = unix.mapshared(8000 * 8)
LOCK = 0 -- pick an arbitrary word index for lock

-- From Futexes Are Tricky Version 1.1 § Mutex, Take 3;
-- Ulrich Drepper, Red Hat Incorporated, June 27, 2004.
function Lock()
    local ok, old = mem:cmpxchg(LOCK, 0, 1)
    if not ok then
        if old == 1 then
            old = mem:xchg(LOCK, 2)
        end
        while old > 0 do
            mem:wait(LOCK, 2)
            old = mem:xchg(LOCK, 2)
        end
    end
end

function Unlock()
    old = mem:fetch_add(LOCK, -1)
    if old == 2 then
        mem:store(LOCK, 0)
        mem:wake(LOCK, 1)
    end
end

This will let you do interprocess communication in the fastest possible way, which is useful for situations where SQLite doesn't go fast enough. For example, if you wanted to rate limit connections based on Class C IPv4 network addresses, then due to the way virtual memory works, you can use unix.mapshared() as a sparse 128mb array of hit counters.

hits = unix.mapshared((2^32>>8)*8)
function OnHttpRequest()
    hits:fetch_add(GetRemoteAddr() >> 8, 1)
end

You could then do something like create a dedicated background process using unix.fork() that manages the shared words as a token bucket. Or you could use the MaxMind module to obtain the country code and autonomous system number, and hash them to derive a token bucket index.

Getting Started

Whether you prefer to write your apps in plain C like the TurfWar server, or write them in Lua using redbean, there's a lot to love and play with now that Cosmopolitan has threads support.

  1. If you want to write POSIX threads code in C, then read Getting Started with Cosmopolitan Libc. It shows how you can clone our GitHub mono repo and start hacking without having to write any build configs, simply by dropping a file in the examples folder (see the sketch after this list). That's how you'd get started in a greenfield. You can also hack on the turfwar.c server using the build instructions above.
  2. If you want to write multiprocess shared memory unix code in Lua, then redbean is able to work as both a web application server and as a plain old interpreter (if you pass the -i flag). This is our flagship and best supported application. See the redbean website to learn more.
  3. If you want to get existing open source apps to build with Cosmopolitan Libc, then check out our new cosmocc and cosmoc++ toolchain scripts. We've come a long way in the past two years at improving our open source compatibility, and POSIX threads support has been a total game changer in this regard. We're also taking a more conventional approach to open source applications outside the mono repo. For example, we now use an 8mb stack size by default, just like everyone else, even though the mono repo continues to favor a 64kb stack size (because it lets us spawn over 9,000 threads cheaply!)
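
For instance, a hypothetical file along the lines of examples/hello-threads.c (the filename is my own invention) is all it takes to start playing with threads; any .c file dropped into examples/ should get picked up by the existing build rules without any config changes.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

// each thread just says hello and exits
static void *Greet(void *arg) {
  printf("hello from thread %ld\n", (long)arg);
  return 0;
}

int main(void) {
  pthread_t th[NUM_THREADS];
  for (long i = 0; i < NUM_THREADS; ++i) {
    pthread_create(&th[i], 0, Greet, (void *)i);
  }
  for (long i = 0; i < NUM_THREADS; ++i) {
    pthread_join(th[i], 0);
  }
}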

You're also invited to join our Discord community, where you can come hang out with us, ask questions, and tell us about the cool projects you're building. https://discord.gg/FwAVVu7eJ4

Funding

[United States of Lemuria - two dollar bill - all debts public and primate]

Funding for the development of redbean was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like redbean possible. Thank you.