Oct 2nd, 2024 @ justine's web page
Cosmopolitan Libc is well-known for its polyglot fat binary hack that lets your executables run on six OSes, on both AMD64 and ARM64. What may surprise you is that it could also be the best C library for your production workloads. To demonstrate this point, let's compare Cosmo's mutex library with other platforms.
We'll do this by writing a simple test that spawns 30 threads which all increment the same integer 100,000 times. This tests how well a mutex implementation performs in the heavily contended case. In essence, that means the following (the full source code appears at the bottom of the page).
```c
int g_chores;
pthread_mutex_t g_locker = PTHREAD_MUTEX_INITIALIZER;

// Each thread increments the shared counter ITERATIONS times,
// taking the same mutex for every single increment.
void *worker(void *arg) {
  for (int i = 0; i < ITERATIONS; ++i) {
    pthread_mutex_lock(&g_locker);
    ++g_chores;
    pthread_mutex_unlock(&g_locker);
  }
  return 0;
}
```
Now let's start with the exciting part: my benchmark results.
Times will be measured in microseconds. Wall time is how long the test program takes to run. That includes the overhead of spawning and joining threads. User time is how much CPU time was spent in userspace, and system time is how much CPU time was spent in the kernel. System and user time can exceed the actual wall time because multiple threads are running in parallel.
The first results I'll show are for Windows because Mark Waterman did an excellent mutex shootout three months ago, where he said, "in the highly contended scenario, Windows wins the day with its SRWLOCK". Contention is where mutex implementations show their inequality. Mark was so impressed by Microsoft's SRWLOCK that he went on to recommend Linux and FreeBSD users consider targeting Windows if mutex contention is an issue.
**Windows Mutex Implementations**

| wall time (µs) | user time (µs) | system time (µs) | implementation |
|---|---|---|---|
| 148,940 | 328,125 | 62,500 | Cosmopolitan pthread_mutex_t |
| 410,416 | 5,515,625 | 1,640,625 | Microsoft SRWLOCK |
| 949,187 | 7,937,500 | 5,078,125 | Microsoft CRITICAL_SECTION |
| 991,750 | 12,156,250 | 4,031,250 | MSVC 2022 std::mutex |
| 1,165,435 | 24,515,000 | 15,000 | spin lock |
| 9,780,803 | 1,937,000 | 6,156,000 | Cygwin pthread_mutex_t |
As we can see, Cosmopolitan Libc mutexes go 2.75x faster than Microsoft's SRWLOCK (which was previously believed to be the best of the best) while consuming 18x fewer CPU resources. Cosmopolitan mutexes were also 65x faster than Cygwin, which like Cosmopolitan provides a POSIX implementation on Windows. Cygwin's mutexes are so bad that they would have been better off for this use case just using a spin lock.
Now onto Linux, the lord of all operating systems.
**Linux Mutex Implementations**

| wall time (µs) | user time (µs) | system time (µs) | implementation |
|---|---|---|---|
| 36,905 | 44,511 | 23,492 | Cosmopolitan pthread_mutex_t |
| 101,353 | 150,706 | 2,724,851 | glibc pthread_mutex_t |
| 202,423 | 4,694,749 | 2,000 | spin lock |
| 411,013 | 2,167,898 | 9,926,850 | Musl libc pthread_mutex_t |
Here we see Cosmopolitan mutexes are:

- 2.7x faster than glibc's pthread_mutex_t, while using 42x less CPU time
- 11x faster than Musl libc's pthread_mutex_t, while using 178x less CPU time
- 5.5x faster than a naive spin lock
Here's how things actually work in practice. Imagine you have a workload where all your threads need to perform a serialized operation. With Cosmo, if you're looking at htop, it'll appear as though only one core is active, whereas glibc and musl libc will fill up your entire CPU meter. That's bad news if you're running a lot of jobs on the same server: if just one of them has a mutex bloodbath, then all your resources are gone, unless you're using Cosmo.

Cosmo is still a new C library and it's a little rough around the edges. But it's getting so good, so fast, that I'm starting to view not using it in production as an abandonment of professional responsibility. The C library is so deeply embedded in the software supply chain, and so depended upon, that you really don't want it to be a planet killer. If essential, unquestioned tools are this wasteful, it's no wonder Amazon Cloud makes such a fortune.
Last but not least, we have MacOS.
**MacOS Mutex Implementations**

| wall time (µs) | user time (µs) | system time (µs) | implementation |
|---|---|---|---|
| 52,263 | 43,202 | 911,009 | Apple Libc |
| 54,700 | 63,055 | 1,003,674 | Cosmopolitan pthread_mutex_t |
On MacOS with an M2 ARM64 microprocessor, Apple's Libc slightly outperforms Cosmopolitan's mutexes. For reasons I do not yet fully understand, Cosmopolitan's normal mutex implementation doesn't work well on this platform. It's possibly because the M2 and XNU are in league. So on MacOS ARM, Cosmopolitan uses a simpler algorithm based on Ulrich Drepper's "Futexes Are Tricky" paper that basically just farms out all the heavy lifting to XNU's ulock system calls. That's why performance is nearly identical to what Apple does.
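If you're curious what that kind of algorithm looks like, here's a minimal sketch of the three-state mutex from Drepper's paper. This is my own illustration, not Cosmo's or nsync's actual code, and since XNU's ulock interface is private, the hypothetical futex_wait()/futex_wake() wrappers below are bound to the Linux futex syscall instead.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

// Illustrative wrappers (my names, not a real library's API).
static void futex_wait(atomic_int *w, int expect) {
  // Sleeps in the kernel, but only if *w still equals `expect`.
  syscall(SYS_futex, w, FUTEX_WAIT, expect, NULL, NULL, 0);
}
static void futex_wake(atomic_int *w, int count) {
  syscall(SYS_futex, w, FUTEX_WAKE, count, NULL, NULL, 0);
}

// Drepper's three states: 0 = unlocked, 1 = locked, 2 = locked w/ waiters.
void mutex_lock(atomic_int *m) {
  int c = 0;
  if (atomic_compare_exchange_strong_explicit(
          m, &c, 1, memory_order_acquire, memory_order_relaxed))
    return;  // fast path: no contention, no syscall
  if (c != 2)
    c = atomic_exchange_explicit(m, 2, memory_order_acquire);
  while (c != 0) {
    futex_wait(m, 2);  // the kernel parks us until the word changes
    c = atomic_exchange_explicit(m, 2, memory_order_acquire);
  }
}

void mutex_unlock(atomic_int *m) {
  // Only make the syscall if someone might actually be sleeping.
  if (atomic_exchange_explicit(m, 0, memory_order_release) == 2)
    futex_wake(m, 1);
}
```

The appeal of this design is that the uncontended path is a single compare-and-swap, and the kernel only gets involved when threads genuinely have to sleep.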
So in summary, these benchmark results indicate that Cosmopolitan mutexes, in the best case, will be overwhelmingly better than alternatives at the contended + tiny critical section use case, and in the worst case, Cosmopolitan will be roughly as good.
The reason Cosmopolitan mutexes are so good is that I used a library called nsync. It only has 371 stars on GitHub, but it was written by a distinguished engineer at Google named Mike Burrows. If you don't know who he is, he's the guy who coded Google's fiercest competitor, Altavista. If you're not old enough to remember Altavista, it was the first search engine that was good, and it ran on a single computer.
I've had a lot of fun integrating nsync into Cosmopolitan. I've even had the opportunity to make upstream contributions. For example, I found and fixed a bug in his mutex unlock function that had gone undiscovered for years. I also managed to make contended nsync mutexes go 30% faster than nsync upstream on AARCH64 by porting it to use C11 atomics. I wrote new system integrations for things like futexes, which let it achieve portability at runtime. Lastly, I made it work seamlessly with POSIX thread cancelations.
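As for why the C11 atomics port helped on AARCH64, my guess (and it is only a guess) is memory ordering: legacy __sync-style builtins imply a full barrier, whereas C11 atomics let you request exactly the ordering a lock needs, which compiles to cheaper instructions on weakly ordered ISAs. For example, acquiring a lock only needs acquire semantics:

```c
#include <stdatomic.h>

// Hypothetical illustration, not nsync's actual code: an acquire-only CAS
// avoids the full memory barrier a __sync-style operation would imply.
static int try_lock(atomic_int *word) {
  int expect = 0;
  return atomic_compare_exchange_strong_explicit(
      word, &expect, 1,
      memory_order_acquire,    // on success: acquire the critical section
      memory_order_relaxed);   // on failure: no synchronization needed
}
```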
So how does nsync do it? What are the tricks that it uses? Here are some of my takes and analysis. The big one is futexes. Nearly every OS Cosmo supports exposes some flavor of them; on Windows, for example, they're provided by WaitOnAddress(). The only one that doesn't is NetBSD, which implements POSIX semaphores in kernelspace, and each semaphore sadly requires creating a new file descriptor. But the important thing about futexes and semaphores is that they allow the OS to put a thread to sleep. That's what lets nsync avoid consuming CPU time when there's nothing to do.
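To make the sleep part concrete, here's a tiny Linux-only demo of my own (not nsync code): one thread parks itself on a futex word, consuming no CPU at all, until the main thread flips the word and wakes it.

```c
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int g_ready;  // 0 = keep sleeping, 1 = go

static void *waiter(void *arg) {
  // FUTEX_WAIT only sleeps if g_ready still equals 0, and the kernel
  // rechecks that atomically, so there's no lost-wakeup race here.
  while (!atomic_load(&g_ready))
    syscall(SYS_futex, &g_ready, FUTEX_WAIT, 0, NULL, NULL, 0);
  puts("woke up without ever spinning");
  return 0;
}

int main() {
  pthread_t th;
  pthread_create(&th, 0, waiter, 0);
  sleep(1);                   // the waiter burns zero CPU during this second
  atomic_store(&g_ready, 1);  // publish the state change...
  syscall(SYS_futex, &g_ready, FUTEX_WAKE, 1, NULL, NULL, 0);  // ...then wake
  pthread_join(th, 0);
}
```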
To learn more of nsync's secrets, you can read the source code here: cosmopolitan/third_party/nsync/mu.c. See also cosmopolitan/libc/intrin/pthread_mutex_lock.c.
If you want to see a live demo of a piece of software built with Cosmo mutexes, then do your worst DDoS to the http://ipv4.games/ web server. This is truly a game for hackers, competing to dominate the Internet, and you're already playing it, because your IP was just claimed for jart. The service runs on a GCE VM with 2 cores, and so far it's managed to survive being DDoS'd by botnets as large as 49,131,669 IPs. A lot of that is thanks to nsync, which allowed me to move SQL queries to background threads that send messages to each other. There are still improvements to be made, but overall it's held up well. You can even monitor its /statusz health metrics.
```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

#define ITERATIONS 100000
#define THREADS 30

int g_chores;
pthread_mutex_t g_locker = PTHREAD_MUTEX_INITIALIZER;

// Each worker increments the shared counter under the mutex.
void *worker(void *arg) {
  for (int i = 0; i < ITERATIONS; ++i) {
    pthread_mutex_lock(&g_locker);
    ++g_chores;
    pthread_mutex_unlock(&g_locker);
  }
  return 0;
}

// Subtracts timeval `b` from `a`, borrowing from seconds as needed.
struct timeval tub(struct timeval a, struct timeval b) {
  a.tv_sec -= b.tv_sec;
  if (a.tv_usec < b.tv_usec) {
    a.tv_usec += 1000000;
    a.tv_sec--;
  }
  a.tv_usec -= b.tv_usec;
  return a;
}

long tomicros(struct timeval x) {
  return x.tv_sec * 1000000ul + x.tv_usec;
}

int main() {
  struct timeval start;
  gettimeofday(&start, 0);
  pthread_t th[THREADS];
  for (int i = 0; i < THREADS; ++i)
    pthread_create(&th[i], 0, worker, 0);
  for (int i = 0; i < THREADS; ++i)
    pthread_join(th[i], 0);
  assert(g_chores == THREADS * ITERATIONS);
  struct rusage ru;
  struct timeval end;
  gettimeofday(&end, 0);
  getrusage(RUSAGE_SELF, &ru);
  printf("%16ld us real\n"
         "%16ld us user\n"
         "%16ld us sys\n",
         tomicros(tub(end, start)),
         tomicros(ru.ru_utime),
         tomicros(ru.ru_stime));
}
```
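If you'd like to reproduce these numbers, the program above should build on the POSIX-flavored toolchains with something like `cc -O2 -o mutex_bench mutex_bench.c -lpthread` (the file name is my own choice; the Windows rows were of course built with the respective vendor toolchains).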
We care about the contended case because that's where mutex implementations show their inequality. Uncontended mutexes usually perform about the same across implementations, and in that regime you might even be better off with a spin lock, which only takes a few lines:
```c
#include <stdatomic.h>

void spin_lock(atomic_int *lock) {
  if (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
    for (;;) {
      // Spin on a plain load first, so contending cores share the cache
      // line instead of fighting over exclusive ownership of it.
      for (;;)
        if (!atomic_load_explicit(lock, memory_order_relaxed))
          break;
      if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
        break;
    }
  }
}

void spin_unlock(atomic_int *lock) {
  atomic_store_explicit(lock, 0, memory_order_release);
}
```
Please note that spin locks should really only be used when you have no other choice. They're useful in kernels, where extreme low-level constraints disallow anything fancy, and they're a useful implementation detail inside nsync locks. But overall they're bad. I imagine many developers believe they're good, and if so, it's probably because they've only ever benchmarked wall time. With locks, it's important to take CPU time into consideration too. That's why we use getrusage().
Funding for The Fastest Mutex was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers, the backing of Mozilla's MIECO program, and the generous contributions of our developer community on Discord. Your support is what makes projects like Cosmopolitan possible. Thank you!