Oct 8th, 2022 @ justine's web page
Cosmopolitan Libc makes C a build-once run-anywhere language. It lets you write idiomatic POSIX code on Linux and have your binaries just work when copied to other operating systems. I'm proud to announce that we've implemented POSIX threads support and used it to build a multithreaded web server, which sysadmins worldwide have been battle-testing for the last few weeks, so I feel confident in saying that our threads support is production worthy and ready for use.
Cosmopolitan is the C library that powers redbean, which is the third most upvoted hobby project in the history of Hacker News, and which The Register called "Unbelievably Clever". If you check out Awesome Cosmo you can see we've built up a nice community around the project that's using Cosmopolitan to make software in general better. The biggest blocker to helping more folks benefit from Cosmopolitan has always been the lack of POSIX threads support. So we've spent a lot of time focusing on addressing that. But more importantly, we eat our own dogfood too, since the best way to showcase our tools is to use them to build things ourselves.
TurfWar is a web app I rewrote for a friend over the last few weeks using Cosmopolitan's new POSIX threads library. It's written as a single .c file with its own homebrew HTTP/1.1 server, and it runs ipv4.games. It's currently behind a frontend, but you can access our server directly by visiting the subdomain turfwar.ipv4.games. There's also live monitoring data on the /statusz endpoint.
So far Cosmopolitan has held up well in prod. We've seen 100,000 daily active visitors. System administrators around the world have rallied to send us oodles of traffic. It runs on a cheap two-core VM, because I live off GitHub sponsorships and couldn't afford to host a free game on something like Kubernetes, which costs outrageous amounts of money. AltaVista ran an entire search engine on a single computer that serviced requests for the whole world. That was in the late 1990s, and hardware is so much faster today, so there's no reason why we can't do the same for a simple game that handles 90,000 requests per minute.
One of the nice things about having a frontend web server is that it lets you get away with a really simple backend server that doesn't need to worry about the complex aspects of HTTP serving: things like SSL, HTTP/0.9, provisional responses, and expectations. Frontends also shape the traffic somewhat, so we can afford to ignore most system call corner cases too. So I whipped up a one-off web server from scratch just for this app, which uses SQLite as its database and is about a thousand lines of code. What I can't believe is just how fast it is.
For comparison, here's how fast NGINX goes on my local dev machine. The important metric to notice here is the 99th percentile latency, which is 7.42 milliseconds. That's for serving a simple static HTML page.
$ wrk --latency -t 12 -c 120 http://127.0.0.1/tool/net/demo/index.html
Running 10s test @ http://127.0.0.1/tool/net/demo/index.html
  12 threads and 120 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.85ms    1.56ms   22.76ms   89.25%
    Req/Sec    33.19k     6.37k    51.66k    68.75%
  Latency Distribution
     50%  172.00us
     75%    0.89ms
     90%    2.55ms
     99%    7.42ms
  1749435 requests in 4.42s, 1.86GB read
Requests/sec: 395758.94
Transfer/sec:    431.40MB
I've always bragged about how redbean outperforms NGINX. What makes it so impressive is that redbean here is serving compressed responses and using a heavyweight fork() based i/o model (rather than NGINX's popular evented i/o model), yet redbean still manages to have 5x better 99th percentile latency than NGINX:
$ wrk --latency -t 12 -c 120 -H 'Accept-Encoding: gzip' \
      http://127.0.0.1:8080/tool/net/demo/index.html
Running 10s test @ http://127.0.0.1:8080/tool/net/demo/index.html
  12 threads and 120 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   167.22us    2.91ms  200.78ms   99.68%
    Req/Sec    54.77k    25.77k   120.18k    60.29%
  Latency Distribution
     50%   49.00us
     75%   75.00us
     90%  118.00us
     99%    1.49ms
  3393355 requests in 5.21s, 2.64GB read
Requests/sec: 651185.38
Transfer/sec:    518.55MB
But those are reads. The truly expensive HTTP endpoints are the ones that do writes, and the vast majority of traffic sent to our IPv4 Games server has been directed to its /claim registration endpoint, which is responsible for ensuring that a user's name and IP address get inserted into the SQLite database, or returning an error if the server is too congested. Performance here is critical, since operating a public write endpoint is like begging to be DDOS'd, and Cloudflare's proxy isn't really able to do much to protect it. Here we see that on my local dev machine the TurfWar HTTP server is able to handle 284,316 writes per second, but most outstandingly, it's able to do so with 219 microsecond 99th percentile latency. That's two orders of magnitude faster than NGINX!
$ git clone https://github.com/jart/cosmopolitan
$ cd cosmopolitan
$ make -j16 m=fastbuild o/fastbuild/net/turfwar/turfwar.com
$ mkdir /opt/turfwar
$ sqlite3 /opt/turfwar/db.sqlite3 <net/turfwar/schema.sql
$ o/fastbuild/net/turfwar/turfwar.com &
$ wrk --latency -t 12 -c 100 'http://127.0.0.1:8080/claim?name=justine'
Running 10s test @ http://127.0.0.1:8080/claim?name=justine
  12 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.69us    1.33ms  101.61ms   99.86%
    Req/Sec    46.33k    28.31k   119.92k    67.78%
  Latency Distribution
     50%   22.00us
     75%   28.00us
     90%   41.00us
     99%  219.00us
  1679333 requests in 5.91s, 779.95MB read
  Socket errors: connect 0, read 3, write 0, timeout 0
  Non-2xx or 3xx responses: 3
Requests/sec: 284316.40
Transfer/sec:    132.05MB
Before this project, I never could have imagined that a SQLite write endpoint could go faster than ~4,000 messages per second, because file locks aren't very fast. The creator of IPv4 Games was actually on the verge of shutting it down, because the service simply got too much traffic. So we were under a lot of pressure to find a more efficient solution. The brand new Cosmopolitan Libc threads support gave us an opportunity to do just that.
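For context on why ~4,000 per second used to feel like the ceiling: every standalone INSERT runs as its own transaction, so SQLite has to take the database lock and flush to disk for each one. Funneling all writes through a single thread that commits them in batches amortizes that cost, which is exactly what the worker breakdown below describes. Here's a hedged sketch of what such a batch flush can look like with the SQLite C API; the struct, table, and column names here are made up for illustration, and the real schema lives in net/turfwar/schema.sql.

// hedged sketch: flush a batch of claims inside one SQLite transaction,
// so the locking and disk flushing cost is paid per batch, not per claim
// (illustrative names only; not the actual turfwar.c code)
#include <sqlite3.h>
#include <stdint.h>

struct Claim { uint32_t ip; const char *name; };

int InsertClaims(sqlite3 *db, const struct Claim *claims, int count) {
  sqlite3_stmt *stmt;
  if (sqlite3_exec(db, "BEGIN IMMEDIATE", 0, 0, 0) != SQLITE_OK) return -1;
  if (sqlite3_prepare_v2(db,
                         "INSERT INTO claims (ip, nick) VALUES (?1, ?2)",
                         -1, &stmt, 0) != SQLITE_OK) {
    sqlite3_exec(db, "ROLLBACK", 0, 0, 0);
    return -1;
  }
  for (int i = 0; i < count; ++i) {
    sqlite3_bind_int64(stmt, 1, claims[i].ip);
    sqlite3_bind_text(stmt, 2, claims[i].name, -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) != SQLITE_DONE) {
      sqlite3_finalize(stmt);
      sqlite3_exec(db, "ROLLBACK", 0, 0, 0);
      return -1;
    }
    sqlite3_reset(stmt);
  }
  sqlite3_finalize(stmt);
  return sqlite3_exec(db, "COMMIT", 0, 0, 0) == SQLITE_OK ? 0 : -1;
}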
How did we do it? Earlier I mentioned the AltaVista search engine, which decades ago was Google's strongest competitor and somehow managed to operate on a single computer. That's just inconceivable compared to some of the bloated junk we're forced to live with today, so it clearly took God-tier coding skills for AltaVista to squeeze that much value out of a single server. As it turns out, Mike Burrows, one of the people who built AltaVista, more recently wrote a library called *NSYNC. It's probably the most underrated piece of software you've never heard of, since its GitHub project has only 222 stars.
Cosmopolitan Libc uses *NSYNC to implement the synchronization primitives required by the POSIX standard. If you use our pthread locking APIs then you're actually using *NSYNC. It's even better to use *NSYNC directly, since its APIs are just a little bit better than the ones defined by POSIX (particularly when it comes to cancellations, which Cosmopolitan hasn't implemented in pthreads yet). So we took TurfWar, moved all its expensive operations into dedicated background threads, and then used the locking primitives to synchronize them with the foreground threads that communicate with clients.
- ListenWorker calls accept() in a loop and then adds the returned client file descriptors to an nsync_cv queue.
- HttpWorker threads each read a client from the queue and then call read() and write() on the socket. This is the request handler, and it doesn't perform any kind of expensive operation, because if it did things like SQL queries, then an attacker would be able to control how much disk i/o and file locking the server is doing. When a claim comes in, the HTTP worker adds it to a blocking message queue with a deadline, and sends a 502 response if the deadline expires because the queue is full. (A rough sketch of this handoff appears after this list.)
- ClaimWorker reads up to 64 claims at a time as soon as they become available, and issues the INSERT queries within a single batched transaction. As soon as it's finished inserting new claims, it wakes up RecentWorker.
- RecentWorker generates the JSON endpoint that lets the user see the list of recent claims. This gets regenerated almost instantly after new claims are inserted. Locks are used so it can atomically replace the Asset in memory that the HTTP workers use to send the JSON data to clients.
- ScoreWorker periodically performs an expensive query to generate the big board of winners who have the most claim registrations. It also gzips the JSON once, so the request handler doesn't have to do expensive deflate operations.
- NowWorker wakes up once per second to generate the HTTP Date header, because gmtime_r() is a compute-heavy function that needs to consult the time zone database.
- Supervisor calls fstat() to see if static HTML assets have changed on disk, and if so, atomically reloads them. The supervisor also checks to see if the number of active clients is above 85% of the thread pool, and if so, sends SIGUSR1 signals to HTTP workers to cancel their read() operations and ask them to shut down.
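To make the HttpWorker to ClaimWorker handoff concrete, here's a minimal sketch of a bounded claim queue with a deadline, written against plain pthread primitives since those are what this post announces (turfwar.c itself uses the *NSYNC mutex and condition variable APIs directly). The struct, capacity, and function names are illustrative, not taken from the actual source.

// hedged sketch: bounded claim queue shared by HTTP workers (producers)
// and the ClaimWorker (consumer); illustrative names and sizes only
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define QUEUE_CAP 512

struct Claim { uint32_t ip; char name[64]; };

struct ClaimQueue {
  pthread_mutex_t mu;
  pthread_cond_t nonempty;
  pthread_cond_t nonfull;
  int count;
  struct Claim items[QUEUE_CAP];
};

// called by HTTP workers; gives up and returns false if the queue stays
// full past the absolute deadline, in which case the caller responds 502
bool AddClaim(struct ClaimQueue *q, const struct Claim *c,
              const struct timespec *deadline) {
  bool ok = true;
  pthread_mutex_lock(&q->mu);
  while (ok && q->count == QUEUE_CAP) {
    if (pthread_cond_timedwait(&q->nonfull, &q->mu, deadline) == ETIMEDOUT)
      ok = false;
  }
  if (ok) {
    q->items[q->count++] = *c;
    pthread_cond_signal(&q->nonempty);
  }
  pthread_mutex_unlock(&q->mu);
  return ok;
}

// called by the ClaimWorker; blocks until work is pending, then drains up
// to `max` claims (LIFO here for brevity) for one batched transaction
int GetClaims(struct ClaimQueue *q, struct Claim *out, int max) {
  int n = 0;
  pthread_mutex_lock(&q->mu);
  while (!q->count) pthread_cond_wait(&q->nonempty, &q->mu);
  while (q->count && n < max) out[n++] = q->items[--q->count];
  pthread_cond_broadcast(&q->nonfull);
  pthread_mutex_unlock(&q->mu);
  return n;
}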
The TurfWar source code is available in //net/turfwar/turfwar.c and it's received a lot of positive feedback so far. For example, Valtteri Koskivuori reached out and told us, "I do have to say, this is the coolest C code I've ever read." A lot of that's due to how Cosmopolitan Libc makes C fresh and cool again, by abstracting away portability toil at the runtime level rather than at compile time. That means our code doesn't need the mountains of ugly #ifdef statements that other C projects have.
The big selling point of Cosmopolitan, of course, is that it does approximately 85 percent of the legwork needed to get the binaries you compile on Linux to run on six other operating systems too. Even though we only ever intended to run TurfWar in production on Linux, that's still a nice trick to have up our sleeve. For example, I did absolutely nothing to make this app run on other platforms. I just wrote normal C code using standard POSIX APIs, and thanks to Cosmopolitan, I was able to copy the turfwar executable over to MacOS, FreeBSD, OpenBSD, and NetBSD, and it just worked. It even comes pretty close to just working as a WIN32 application.
It's amazing we've made it this far, but there's still so much more to do. The two biggest known issues with our Windows support both relate to synchronization, since that's really the hardest thing in programming. It's no wonder that when Google poached their fiercest rival (the AltaVista *NSYNC guy) they had him implement their lock server. For example, Cosmopolitan has a polyfill for fcntl() POSIX advisory locks on Windows. We even wrote a torture test to show it works. However, SQLite isn't happy with it and we're still working to figure out why, so locking is currently disabled for SQLite on Windows. SQLite can still be used there, but the database can't be shared between multiple threads or processes, and it certainly shouldn't be used for production purposes. The second known issue is that *NSYNC cancellations don't work on Windows and we're not sure why. However, that one's less of a showstopper, since it just means the TurfWar server shuts down abruptly rather than gracefully when I press Ctrl-C.
One of the nice things about writing apps with redbean is that it's a forking web server where synchronization is abstracted by the system. In fact, I'd say redbean supporting fork() on platforms like Windows was its biggest achievement. The only other people who've managed to accomplish that were Cygnus Solutions, and they've spoken at length about how difficult it was. Windows doesn't support inheriting copy-on-write pages, so we need to track all our memory mappings in a table, along with a separate handle for each 64kb frame. When a cosmo process calls fork(), it copies its memory to a pipe that gets sent to a subprocess, which then reconstructs its own memory layout, while it's running, to match the parent process.
Supporting threads was harder. Cygwin and MSYS2 never really attempted it themselves (although there is a separate GPL-licensed winpthreads library), so what we're doing here is breaking new ground. What was nice about fork() is that Windows was the oddball, since every other OS had a turn-key solution. The same was not true with threads, because every single operating system decided to go its own way. If you read the source code of our clone() polyfill, you'll notice that each operating system has its own very specific incantation for spawning threads, and there doesn't appear to be any common influence between them.
POSIX threads is also the only systems API specified by POSIX that has almost no resemblance to the underlying kernel interfaces. With threads, all you really need is an API for spawning them; everything else is just atomic memory operations, which are governed more by the C standard than by POSIX. It would have been nice if POSIX had just defined clone() rather than an expansive API with 102 separate functions. Even with *NSYNC doing most of the work, it was quite a marathon to implement, especially considering that number doesn't include the non-POSIX pthreads functions, e.g. pthread_getaffinity_np and pthread_setname_np, which Cosmopolitan also supports. However, the truly difficult part turned out to be thread-local storage. It was one of those projects where I needed to spend a year thinking about how to do it before finally sitting down and coding it in a few days.
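Before getting into how that works, here's a minimal example of the API surface all of this machinery exists to support: a few pthreads plus a _Thread_local variable. It's a hedged illustration, not code from the repository.

// minimal pthreads + thread-local storage example of the kind that now
// builds once with Cosmopolitan and runs everywhere; illustrative only
#include <pthread.h>
#include <stdio.h>

_Thread_local int tid;  // each thread gets its own copy

void *Worker(void *arg) {
  tid = (int)(long)arg;
  pthread_setname_np(pthread_self(), "worker");  // non-POSIX, also supported
  printf("hello from thread %d\n", tid);
  return 0;
}

int main(void) {
  pthread_t th[4];
  for (int i = 0; i < 4; ++i)
    pthread_create(&th[i], 0, Worker, (void *)(long)i);
  for (int i = 0; i < 4; ++i)
    pthread_join(th[i], 0);
}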
| OS       | %FS          | %GS          |
|----------|--------------|--------------|
| Linux    | Unrestricted | Unrestricted |
| Mac OS X | Inaccessible | Unrestricted |
| Windows  | Inaccessible | Restricted   |
| FreeBSD  | Unrestricted | Unrestricted |
| NetBSD   | Unrestricted | Unrestricted |
| OpenBSD  | Unrestricted | Inaccessible |
On x86 there are special registers called %fs and %gs which basically give you a free addition operation each time a memory access occurs. In the old days they were used by DOS to manage the memory of each program, as a sort of poor man's virtual memory. Their use was then restricted once Intel found better ways to offer virtual memory, which means the operating system gets to decide whether or not a program is allowed to change them. It wasn't until later on that folks realized how useful these registers could be for thread-local storage, but the restricted nature of these registers has mostly stuck, and they just so happen to be restricted in the most pernicious possible way.
Cosmopolitan binaries run on seven operating systems without needing to be recompiled. But at compile time, the compiler needs to choose one of those two registers when it generates TLS code. The open source world always uses %fs for this purpose, whereas Windows and MacOS use %gs. Cosmopolitan Libc is a greenfield quasi-platform, so we should have been able to pick one, but if we'd picked %fs then we wouldn't be able to run on Windows and Mac, because Windows and Mac don't provide an API for changing the %fs register. Worst of all, we couldn't pick %gs either, since OpenBSD doesn't provide an API for changing it.
So what we did was go with %fs by default, and if your program is loaded on Windows or MacOS then the C runtime will sort of recompile your binary in memory, on the fly, to turn the %fs instructions into %gs ones. It turned out this is relatively easy for the runtime to do, since GCC provides a -mno-tls-direct-seg-refs flag that greatly reduces the number of possible TLS byte sequences it'll generate in the executable. It's just that we don't know where they exist in the binary, so we use the following code to hunt them down.
// iterate over modifiable code looking for 9 byte instruction
// this would take 30 ms using xed to enable tls on python.com
__morph_begin();
for (p = _ereal; p + 9 <= __privileged_start; p += n) {
  w = READ64LE(p) & READ64LE("\377\373\377\307\377\377\377\377");
  if ((w == READ64LE("\144\110\213\004\045\000\000\000") ||  // mov %fs:0,%R
       w == READ64LE("\144\110\003\004\045\000\000\000")) && // add %fs:0,%R
      !p[8]) {
    // now change the code
    p[0] = 0145;                       // change %fs to %gs
    p[5] = (dis & 0x000000ff) >> 000;  // displacement
    p[6] = (dis & 0x0000ff00) >> 010;  // displacement
    p[7] = (dis & 0x00ff0000) >> 020;  // displacement
    p[8] = (dis & 0xff000000) >> 030;  // displacement
    // advance to the next instruction
    n = 9;
  } else {
    n = 1;
  }
}
__morph_end();
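To make those octal string literals a little less mysterious, here's roughly what the nine-byte pattern being matched decodes to. This is a hedged illustration; the exact code GCC emits varies with compiler version and register allocation.

// a hedged illustration of the pattern hunted above, not code from the repo
//
//   __thread int counter;
//   int Get(void) { return counter; }
//
// with -mno-tls-direct-seg-refs, GCC won't fold the segment into the
// variable access itself; it loads the thread pointer explicitly first:
//
//   64 48 8b 04 25 00 00 00 00   mov %fs:0,%rax        // load TLS base
//   8b 80 ?? ?? ?? ??            mov disp32(%rax),%eax // read the variable
//
// the morpher flips the 0x64 (%fs) prefix byte to 0x65 (%gs) and rewrites
// the zero displacement so the load fetches our TLS pointer from wherever
// the host OS keeps it, e.g. the TEB slot computed further below on Windows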
Now you might be thinking, surely it must be slow to be forced to iterate over the entire executable's content each time it's loaded. That turned out not to be the case, because we were able to implement a wicked-fast SSE optimization. We benchmarked this trick on a ten megabyte Python executable and found that it only imposes one millisecond of program startup latency. All we had to do was add this code to the top of the loop above.
// use sse to zoom zoom to fs register prefixes
// that way it'll take 1 ms to morph python.com
while (p + 9 + 16 <= __privileged_start) {
  if ((m = __builtin_ia32_pmovmskb128(
           *(xmm_t *)p == (xmm_t){0144, 0144, 0144, 0144, 0144, 0144,
                                  0144, 0144, 0144, 0144, 0144, 0144,
                                  0144, 0144, 0144, 0144}))) {
    m = __builtin_ctzll(m);
    p += m;
    break;
  } else {
    p += 16;
  }
}
This technique allows for other tricks too. For example, on Windows, you'd think that only MSVC users are able to benefit from fast TLS, whereas everyone else would have to go through WIN32 functions like TlsAlloc() and TlsGetValue(). Those APIs are needed because WIN32 manages the TLS memory and we need to ask it to assign us a word index in its layout. But it turned out we could encode the information given to us by TlsAlloc() into the displacement value when rewriting the binary, effectively enabling us to kill two birds with one stone.
// MSVC __declspec(thread) generates binary code for this
// %gs:0x1480 abi. So long as TlsAlloc() isn't called >64
// times we should be good.
dis = 0x1480 + __tls_index * 8;
The best part is, we managed to make all of this tiny. Not having our TLS code be bloated was an extremely important goal, because TLS is a mandatory runtime dependency. It can't be linked conditionally. It always has to be set up by the C runtime, even in single-threaded apps, just in case something uses the _Thread_local keyword. So thanks to the year of thought I put into this before doing it, our TLS solution was able to be distilled down to a ~1kb footprint. This ensured that our flagship life.com and hello.com example programs remained only 16kb in size when built in MODE=tiny.
If thread-local storage was the most difficult part, then the most fun part would be futexes. When incorporating *NSYNC, we had to rewrite its system call code, because *NSYNC makes the traditional assumption that system integration happens at compile time. You can read the futex() module I wrote in the Cosmopolitan codebase.
The classic way to lock memory is to just spin in a loop, reading the same variable over and over again until it changes. But that uses a lot of CPU time. So the purpose of futexes is to ask the operating system to put a thread to sleep and wait for a notification instead. What's nice is you only need to give the kernel a memory address, and you don't have to juggle things like handles or objects, so it really makes things simple for the userspace program.
The idea of futexes is attractive enough as an API that Linux, FreeBSD, OpenBSD, and Windows all decided to support it natively. What's also nice is that we can do a reasonably good job polyfilling it in userspace when platform support isn't available. What Cosmopolitan does on MacOS and NetBSD is just a simple loop with exponential backoff that calls clock_nanosleep() or select() until either the memory address changes or the specified deadline expires. I've also gone to the trouble of verifying that the *NSYNC unit tests pass when using this polyfill instead of the official kernel ABI.
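Here's a much-simplified sketch of what that userspace fallback boils down to. It's an illustration of the poll-with-backoff idea, not Cosmopolitan's actual implementation, and the function name and constants are made up.

// hedged sketch of a userspace futex wait: spin on the word, napping with
// exponential backoff, until it changes or the deadline passes
#include <errno.h>
#include <stdatomic.h>
#include <time.h>

int polyfill_futex_wait(atomic_int *addr, int expect,
                        const struct timespec *deadline) {
  long backoff_ns = 1000;  // start with a 1 microsecond nap
  for (;;) {
    if (atomic_load_explicit(addr, memory_order_acquire) != expect)
      return 0;  // the word changed; treat it as having been woken
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    if (deadline && (now.tv_sec > deadline->tv_sec ||
                     (now.tv_sec == deadline->tv_sec &&
                      now.tv_nsec >= deadline->tv_nsec)))
      return -ETIMEDOUT;
    struct timespec nap = {0, backoff_ns};
    clock_nanosleep(CLOCK_REALTIME, 0, &nap, 0);
    if (backoff_ns < 16 * 1000 * 1000)  // cap the backoff at ~16ms
      backoff_ns *= 2;
  }
}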
We love futexes so much that we even added them to redbean. Your Lua code can now use mmap(MAP_SHARED) to share memory between processes. It works by returning a memory blob object with methods that implement atomics and futexes. It's a powerful low-level tool that lets you do things like create your own synchronization primitives in pure Lua. For example, mutexes can be written as follows:
mem = unix.mapshared(8000 * 8)

LOCK = 0 -- pick an arbitrary word index for lock

-- From Futexes Are Tricky Version 1.1 § Mutex, Take 3;
-- Ulrich Drepper, Red Hat Incorporated, June 27, 2004.
function Lock()
   local ok, old = mem:cmpxchg(LOCK, 0, 1)
   if not ok then
      if old == 1 then
         old = mem:xchg(LOCK, 2)
      end
      while old > 0 do
         mem:wait(LOCK, 2)
         old = mem:xchg(LOCK, 2)
      end
   end
end

function Unlock()
   old = mem:fetch_add(LOCK, -1)
   if old == 2 then
      mem:store(LOCK, 0)
      mem:wake(LOCK, 1)
   end
end
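And here's a hedged usage sketch of that mutex, using only the word accessors shown above plus unix.fork(), unix.wait(), and unix.exit(): two processes each perform non-atomic read-modify-write increments on a shared counter, which would lose updates without the lock.

-- hedged usage sketch: two processes bump a shared counter under the lock
COUNT = 1  -- use word index 1 of the same mapping as plain shared data

local function Bump()
   Lock()
   local n = mem:fetch_add(COUNT, 0)  -- plain read (add zero)
   mem:store(COUNT, n + 1)
   Unlock()
end

if assert(unix.fork()) == 0 then
   for i = 1, 100000 do Bump() end
   unix.exit(0)
end
for i = 1, 100000 do Bump() end
unix.wait()
print(mem:fetch_add(COUNT, 0))  -- prints 200000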
This will let you do interprocess communication in the fastest possible way, which is useful for situations where SQLite doesn't go fast enough. For example, if you wanted to rate limit connections based on Class C IPv4 network addresses, then due to the way virtual memory works, you can use unix.mapshared() as a sparse 128mb array of hit counters.
hits = unix.mapshared((2^32>>8)*8)

function OnHttpRequest()
   hits:fetch_add(GetRemoteAddr() >> 8, 1)
end
You could then do something like create a dedicated background process using unix.fork() that manages the shared words as a token bucket. Or you could use the MaxMind module to obtain the country code and autonomous system number, and hash them to derive a token bucket index.
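As a hedged sketch of that idea, here's a fixed-window variant of a token bucket with a deliberately small table so the refill loop stays cheap. None of the constants or policy choices here come from redbean or the game; it just composes the same shared-memory methods shown above with unix.fork() and redbean's OnHttpRequest() hook.

-- hedged sketch: per-/24 rate limiting with a small shared bucket table
NBUCKETS = 65536
BURST = 64  -- requests allowed per /24 per second
buckets = unix.mapshared(NBUCKETS * 8)

-- dedicated background process that tops the buckets up once per second
if assert(unix.fork()) == 0 then
   while true do
      for i = 0, NBUCKETS - 1 do
         buckets:store(i, BURST)
      end
      Sleep(1)
   end
end

function OnHttpRequest()
   local bucket = (GetRemoteAddr() >> 8) % NBUCKETS
   if buckets:fetch_add(bucket, -1) <= 0 then
      ServeError(429)
      return
   end
   Route()
end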
Whether you prefer to write your apps in plain C like the TurfWar server, or write them in Lua using redbean, there's a lot to love and play with now that Cosmopolitan has threads support.
redbean is our flagship and best supported application. See the redbean website to learn more.
You're also invited to join our Discord community, where you can come hang out with us, ask questions, and tell us about the cool projects you're building. https://discord.gg/FwAVVu7eJ4
Funding for the development of redbean was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like redbean possible. Thank you.