May 19th, 2022 @ justine's web page

Logging C Functions

One of my favorite features of the Cosmopolitan Libc runtime is its --ftrace flag that logs C function calls. It's the simplest system for debugging programs I've ever used and it surprises me that I found no evidence of someone having invented it before.

Here's one of its most important use cases. Have you ever had you debugger stupified by a NULL function pointer before? Debuggers aren't very good at handling this situation. Because once the control flow jumps to the NULL page, you'd need to find a way to rewind execution history to figure out how it got there. But if you have function call logging, you can just look backwards in the log, see what was happening before, and get a pretty good idea.

For example, here's what happens if I run a Hello World! program:

$ o//examples/hello.com --ftrace
#     tid   nanoseconds   stack symbol
FUN 18900        86'755     160   &__cxa_atexit
FUN 18900        92'737      80 &atexit
FUN 18900        93'525     128   &__cxa_atexit
FUN 18900        94'418      48 &main
FUN 18900        95'034      96   &puts
FUN 18900        95'638     192     &fwrite_unlocked
FUN 18900        96'279     208       &memcpy
FUN 18900        97'002     208       &memrchr
FUN 18900        97'712     128     &fputc_unlocked
FUN 18900        98'311     224       &fwrite_unlocked
FUN 18900        98'932     240         &memcpy
FUN 18900        99'494     240         &memrchr
FUN 18900       100'096     320         &write
hello world
FUN 18900       101'117      64 &exit
FUN 18900       101'677     112   &__cxa_finalize
FUN 18900       102'399     144     &FreeFds
FUN 18900       103'064     144     &fflush_unlocked
FUN 18900       103'657     176       &fflush
FUN 18900       104'302     208         &fflush_unlocked
FUN 18900       104'900     240           &__fflush_impl
FUN 18900       105'508     272             &free_s
FUN 18900       106'100     288               &_isheap

In the above log, you'll see:

  1. the process pid (or tid if using threads)
  2. the number of nanoseconds elapsed since the program started (you'll notice it takes 86µs to load symbol table from the .com.dbg file, sort it so it can be binary searched, and then rewrite the binary in memory)
  3. the current stack usage in bytes
  4. the symbol name with depth (by backcounting rbp frames)

The nanosecond timestamps are particularly useful for pinpointing latency issues in programs. One of the biggest blindspots of open source tools has always been the lack of focus on latency. For example, the wonderful perf record command on Linux won't give you this kind of insight:

FUN 10808        91'065      80 &atexit
FUN 10808        91'242     144   &__cxa_atexit
FUN 10808        92'542      48 &main
FUN 10808        92'638      64   &creat
FUN 10808        92'771     240     &openat
FUN 10808        92'876     320       &sys_openat
FUN 10808       109'805     144   &write
FUN 10808       183'356      80   &fsync
FUN 10808   194'399'931      96   &close
FUN 10808   194'410'030     112     &__releasefd_unlocked
FUN 10808   194'410'962      64 &exit
FUN 10808   194'411'708     112   &__cxa_finalize
FUN 10808   194'412'713     160     &FreeFds

One of the things we've been focusing on lately has been making the Python unit tests go faster. Since we use Python's unit tests to test Cosmopolitan Libc. One of the biggest culprits has always been fsync() at least for those of us who don't use tmpfs. It would frequently cause individual tests to lag for as long as thirty seconds. Every other tool we used for measuring the performance of these tests, all led us astray suggesting that we focus on the core Python evaluator loop. But all that was really needed was adding a way to pacify pointless fsync() calls.

The way these logs work, is they apply to functions at the object code level. For example, if a function gets inlined, then it won't show up in the function call trace. Another thing the log does is fold repeated lines. For example, if you call a pure function in a loop (pure in the sense that it doesn't call any additional subroutines) then you'll only see one line for it in the log with the cumulative timestamp. For example, if you were to say:

dontinline void Pause(void) {
  asm("rep nop");
}

dontinline void DoStuff(void) {
  for (int i = 0; i < 10000; ++i) {
    Pause();
  }
}

int main() {
  DoStuff();
  DoStuff();
  DoStuff();
  DoStuff();
}

Then you would get:

FUN 11650       103'932      80 &atexit
FUN 11650       104'082     144   &__cxa_atexit
FUN 11650       104'402      48 &main
FUN 11650       104'493      72   &DoStuff
FUN 11650       104'571      88     &Pause
FUN 11650     1'084'906      72   &DoStuff
FUN 11650     1'085'073      88     &Pause
FUN 11650     2'054'060      72   &DoStuff
FUN 11650     2'054'174      88     &Pause
FUN 11650     3'154'500      72   &DoStuff
FUN 11650     3'154'599      88     &Pause
FUN 11650     4'172'376      64 &exit
FUN 11650     4'172'511     112   &__cxa_finalize
FUN 11650     4'172'651     160     &FreeFds

In terms of how the flag is interpreted, sometimes it's useful to be able to have --ftrace be in the arguments list, possibly because you're passing it along to a subprocess you're executing. In that case, if you add two additional dashes:

./foo.com ----ftrace

Then "--ftrace" will appear in the argv array passed to main. Additional slashes can be supplied too, for additional layers of nesting.

Getting Started

Run the following commands on Linux:

curl https://justine.lol/cosmopolitan/cosmopolitan.zip >cosmopolitan.zip
unzip cosmopolitan.zip
echo 'int main() { printf("hello world\n"); }' >hello.c
gcc -g -Os -static -nostdlib -nostdinc -no-pie -fno-pie  \
    -fno-omit-frame-pointer -pg -mnop-mcount             \
    -o hello.com.dbg hello.c -fuse-ld=bfd -Wl,-T,ape.lds \
    -include cosmopolitan.h crt.o ape.o cosmopolitan.a
objcopy -S -O binary hello.com.dbg hello.com
./hello.com --ftrace

How It Works

The Cosmopolitan Libc _start() function starts by intercepting the --ftrace flag. If it exists, then it opens and sorts of the symbol table from the elf binary. Then it changes the protection of memory so it's able to iterate over the program's memory to look for nop instructions it can mutate. Those NOPs were inserted by GCC. It's easy to self-modify them in memory, since they have the same byte length as the CALL instruction. Think of it like a mini linker. It just relinks the profiling nops. Once they've been rewritten, functions will start calling ftrace_hook() which is an assembly function that saves the CPU state to the stack. That means ftrace kind of acts like an operating system kernel. Once the assembly saved the CPU it can call the ftracer() C code that acquires a reentrant mutex and unwinds the RBP backtrace pointer (via __builtin_frame_address(0)) to determine the address of the function that called it. Once it has the address of the function, it passes it along to kprintf() which has a special %t syntax for turning numbers into symbols.

GCC has several flags for inserting profiling hooks at the beginning of functions. Namely -pg, -mfentry, and -mnop-mcount. The last one is important. Because the default behavior is to insert a function call to the mcount function.

Pause:	push	%rbp
	mov	%rsp,%rbp
1:	call	mcount
	rep nop
	pop	%rbp
	ret
	.endfn	Pause,globl

Cosmopolitan Libc's mcount function does nothing, and the cost of calling an empty function can be as high as 14 cycles of overhead per function call. However if you can get GCC to insert NOPs instead:

Pause:	push	%rbp
	mov	%rsp,%rbp
1:	nopl	0x00(%rax,%rax,1)
	rep nop
	pop	%rbp
	ret
	.endfn	Pause,globl

Then the overhead is negligible on the order of picoseconds. So it's worth considering just always having the -pg nops in your program. Because Cosmopolitan Libc will self-modify your executable at runtime to turn those NOPs into function calls once you've enabled it. In fact, there's even a runtime function that lets you insert an arbitrary hook into your program's functions:

privileged noinstrument void MyTracer(void) {
  kprintf("hello\n");
}
int main() {
  __hook(MyTracer, GetSymbolTable());
}

Since there's multiple profiling hooks that GCC uses depending on the various combinations of flags, here's the hooks that __hook will understand and be able to rewrite:

0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
67 0f 1f 04 00          nopl (%eax,%eax,1)
66 0F 1F 44 00 00       nopw 0x00(%rax,%rax,1)
67 e8 xx xx xx xx       addr32 call mcount

The reason why we need the privileged keyword above is because some operating systems don't permit read+write+exec memory. So when we rewrite the binary to change the NOPs into CALL, much of the program's code needs to be disabled. Using the privileged keyword will cause your function to be relocated to a special part of the executable that never gets rewritten.

Custom Code Morphing

Let's say you wanted to write your own function that rewrites your binary at runtime. Intel's disassembler Xed comes included in the Cosmopolitan Libc amalgamation distro. So here's how you would, for instance, iterate over all the instructions in your program:

privileged void RewriteBinary(void) {
  int n, err;
  unsigned char *p;
  struct XedDecodedInst xedd;
  __morph_begin();
  for (p = _ereal; p < __privileged_start; p += n) {
    xed_decoded_inst_zero_set_mode(&xedd, XED_MACHINE_MODE_LONG_64);
    err = xed_instruction_length_decode(&xedd, p, __privileged_start - p);
    if (err == XED_ERROR_NONE) {
      n = xedd.length;
    } else {
      n = 1;
    }
  }
  __morph_end();
}

Measuring Language Complexity

The nice thing about Cosmopolitan's function tracing is that it's shell scriptable and fast. Having this debugging information easily plugged into the shell, means we can do things like study how many function calls programming language interpreters need to print Hello World.

$ o//tool/plinko/plinko.com <<<'(QUOTE HELLO-WORLD)' --ftrace |& wc -l
819
$ o//third_party/sqlite3/sqlite3.com :memory: "select 'hello world'" --ftrace |& wc -l
6253
$ o//third_party/lua/lua.com -e 'print("hello")' --ftrace |& wc -l
9741
$ o//third_party/quickjs/qjs.com -e 'console.log("hello world")' --ftrace |& wc -l
42272
$ o//third_party/python/python.com -c 'print("hello world")' --ftrace |& wc -l
931354

Here's the function logs if you're interested in reading them.

-rw-r--r-- 1 501 jart  43K May 19 19:29 plinko.txt
-rw-r--r-- 1 501 jart 397K May 19 19:28 sqlite.txt
-rw-r--r-- 1 501 jart 751K May 19 19:27 lua.txt
-rw-r--r-- 1 501 jart 2.7M May 19 19:27 qjs.txt
-rw-r--r-- 1 501 jart 150M May 19 19:28 python.txt

We're also integrating ftrace into all the programming languages that are included in Cosmopolitan. For example, the cosmo.ftrace context manager has been quite useful when testing CPython internals, or writing extensions.

import cosmo
with cosmo.ftrace() as F:
    WeirdFunction()

This makes ftrace viable for Python since it's only enabled temporarily. Otherwise there'd simply be too many functions logged. If you want to try using ftrace with Actually Portable Python, then you can build it on Linux with the following commands:

git clone https://github.com/jart/cosmopolitan
cd cosmopolitan
make -j8 o//third_party/python/python.com
o//third_party/python/python.com
>>: import cosmo
>>: cosmo?

There's also a prebuilt binary here:

-rwxr-xr-x 1 501 jart 11M May 20 06:25 python.com

System Call Logging

While we're on the topic of tracing, it's also worth mentioning that Cosmopolitan Libc bakes a similar --strace flag into binaries. For example, if we were to trace a program that calls fsync():

int main() {
  creat("foo", 0644);
  write(3, _base, _end - _base);
  fsync(3);
  close(3);
}

It might show us the following:

SYS     0              5'590 bell system five system call support 97 magnums loaded on gnu/systemd
SYS   384              8'664 mmap(0x700000000000, 131'072, PROT_READ|PROT_WRITE, MAP_STACK|MAP_ANONYMOUS, -1, 0) → 0x700000000000
SYS   384             12'702 getenv("TERM") → "dumb"
SYS   384          1'926'421 openat(AT_FDCWD, "o/foo", O_WRONLY|O_CREAT|O_TRUNC, 0644) → 3
SYS   384          1'973'434 write(3, u"⌂ELF☻☺☺○        ☻ > ☺   ♂!@      ♂      "..., 131'072) → 131'072
SYS   384         29'834'148 fsync(3) → 0
SYS   384         29'839'878 close(3) → 0
SYS   384         29'840'030 exit(0)
SYS   384         29'840'324 __cxa_finalize(&407eef, 0)
SYS   384         29'840'445 FreeFds()
SYS   384         29'840'729 _Exit(0)

It's been an indispensable tool for building Cosmopolitan Libc, since GDB and LLDB aren't very reliable when developing across platforms. For instance, I haven't been able to figure out how to get a debugger or system call tracer to work on Mac OS X for the better part of ten years. So much of the XNU support needed to be written a priori until this tool came along, which has greatly helped in reducing bugs!

Troubleshooting

In the above example, you may have noticed that a number &407eef was printed in place of a symbol. This is one of the tradeoffs that needed to be made by Cosmopolitan Libc's kprintf() function, which is too mission critical to be able to call into all the non-privileged code needed to open() + mmap() + etc. the concomitant ELF .com.dbg binary and load the symbol table. The simple solution for this is to have your main function call GetSymbolTable():

int main() {
  GetSymbolTable();
  creat("foo", 0644);
  write(3, _base, _end - _base);
  fsync(3);
  close(3);
}

Now the symbols should show up:

SYS   401         25'825'348 __cxa_finalize(&FreeFds, 0)

Funding

[United States of Lemuria - two dollar bill - all debts public and primate]

Funding for this blog post was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like Cosmopolitan Libc possible. Thank you.

See Also