May 19th, 2022 @ justine's web page
One of my favorite features of the
Cosmopolitan Libc runtime is its
--ftrace
flag that logs C function calls. It's the simplest
system for debugging programs I've ever used and it surprises me that I
found no evidence of someone having invented it before.
Here's one of its most important use cases. Have you ever had you debugger stupified by a NULL function pointer before? Debuggers aren't very good at handling this situation. Because once the control flow jumps to the NULL page, you'd need to find a way to rewind execution history to figure out how it got there. But if you have function call logging, you can just look backwards in the log, see what was happening before, and get a pretty good idea.
For example, here's what happens if I run a Hello World! program:
$ o//examples/hello.com --ftrace
# tid nanoseconds stack symbol
FUN 18900 86'755 160 &__cxa_atexit
FUN 18900 92'737 80 &atexit
FUN 18900 93'525 128 &__cxa_atexit
FUN 18900 94'418 48 &main
FUN 18900 95'034 96 &puts
FUN 18900 95'638 192 &fwrite_unlocked
FUN 18900 96'279 208 &memcpy
FUN 18900 97'002 208 &memrchr
FUN 18900 97'712 128 &fputc_unlocked
FUN 18900 98'311 224 &fwrite_unlocked
FUN 18900 98'932 240 &memcpy
FUN 18900 99'494 240 &memrchr
FUN 18900 100'096 320 &write
hello world
FUN 18900 101'117 64 &exit
FUN 18900 101'677 112 &__cxa_finalize
FUN 18900 102'399 144 &FreeFds
FUN 18900 103'064 144 &fflush_unlocked
FUN 18900 103'657 176 &fflush
FUN 18900 104'302 208 &fflush_unlocked
FUN 18900 104'900 240 &__fflush_impl
FUN 18900 105'508 272 &free_s
FUN 18900 106'100 288 &_isheap
In the above log, you'll see:
The nanosecond timestamps are particularly useful for pinpointing
latency issues in programs. One of the biggest blindspots of open source
tools has always been the lack of focus on latency. For example, the
wonderful perf record
command on Linux won't give you this
kind of insight:
FUN 10808 91'065 80 &atexit FUN 10808 91'242 144 &__cxa_atexit FUN 10808 92'542 48 &main FUN 10808 92'638 64 &creat FUN 10808 92'771 240 &openat FUN 10808 92'876 320 &sys_openat FUN 10808 109'805 144 &write FUN 10808 183'356 80 &fsync FUN 10808 194'399'931 96 &close FUN 10808 194'410'030 112 &__releasefd_unlocked FUN 10808 194'410'962 64 &exit FUN 10808 194'411'708 112 &__cxa_finalize FUN 10808 194'412'713 160 &FreeFds
One of the things we've been focusing on lately has been making the Python unit tests go faster. Since we use Python's unit tests to test Cosmopolitan Libc. One of the biggest culprits has always been fsync() at least for those of us who don't use tmpfs. It would frequently cause individual tests to lag for as long as thirty seconds. Every other tool we used for measuring the performance of these tests, all led us astray suggesting that we focus on the core Python evaluator loop. But all that was really needed was adding a way to pacify pointless fsync() calls.
The way these logs work, is they apply to functions at the object code level. For example, if a function gets inlined, then it won't show up in the function call trace. Another thing the log does is fold repeated lines. For example, if you call a pure function in a loop (pure in the sense that it doesn't call any additional subroutines) then you'll only see one line for it in the log with the cumulative timestamp. For example, if you were to say:
dontinline void Pause(void) { asm("rep nop"); } dontinline void DoStuff(void) { for (int i = 0; i < 10000; ++i) { Pause(); } } int main() { DoStuff(); DoStuff(); DoStuff(); DoStuff(); }
Then you would get:
FUN 11650 103'932 80 &atexit FUN 11650 104'082 144 &__cxa_atexit FUN 11650 104'402 48 &main FUN 11650 104'493 72 &DoStuff FUN 11650 104'571 88 &Pause FUN 11650 1'084'906 72 &DoStuff FUN 11650 1'085'073 88 &Pause FUN 11650 2'054'060 72 &DoStuff FUN 11650 2'054'174 88 &Pause FUN 11650 3'154'500 72 &DoStuff FUN 11650 3'154'599 88 &Pause FUN 11650 4'172'376 64 &exit FUN 11650 4'172'511 112 &__cxa_finalize FUN 11650 4'172'651 160 &FreeFds
In terms of how the flag is interpreted, sometimes it's useful to be
able to have --ftrace
be in the arguments list, possibly
because you're passing it along to a subprocess you're executing. In
that case, if you add two additional dashes:
./foo.com ----ftrace
Then "--ftrace"
will appear in the argv
array
passed to main. Additional slashes can be supplied too, for additional
layers of nesting.
Run the following commands on Linux:
curl https://justine.lol/cosmopolitan/cosmopolitan.zip >cosmopolitan.zip unzip cosmopolitan.zip echo 'int main() { printf("hello world\n"); }' >hello.c gcc -g -Os -static -nostdlib -nostdinc -no-pie -fno-pie \ -fno-omit-frame-pointer -pg -mnop-mcount \ -o hello.com.dbg hello.c -fuse-ld=bfd -Wl,-T,ape.lds \ -include cosmopolitan.h crt.o ape.o cosmopolitan.a objcopy -S -O binary hello.com.dbg hello.com ./hello.com --ftrace
The Cosmopolitan Libc _start()
function starts by
intercepting
the --ftrace
flag. If it exists, then
it opens
and sorts of the symbol table from the elf binary. Then
it changes
the protection of memory so it's able to iterate over the program's
memory to look for nop instructions it can mutate.
Those NOPs were inserted by GCC. It's easy to self-modify them in
memory, since they have the same byte length as the CALL instruction.
Think of it like a mini linker. It just relinks the profiling nops. Once
they've been rewritten, functions will start calling
ftrace_hook()
which is an assembly function that saves the CPU state to the stack. That means ftrace kind of acts
like an operating system kernel. Once the assembly saved the CPU it can
call
the ftracer() C code that acquires a reentrant mutex and unwinds the
RBP backtrace pointer (via __builtin_frame_address(0)
) to
determine the address of the function that called it. Once it has the
address of the function, it passes it along to kprintf() which has
a special %t
syntax for turning numbers into symbols.
GCC has several flags for inserting profiling hooks at the beginning of
functions. Namely -pg
, -mfentry
,
and -mnop-mcount
. The last one is important. Because the
default behavior is to insert a function call to the mcount
function.
Pause: push %rbp mov %rsp,%rbp 1: call mcount rep nop pop %rbp ret .endfn Pause,globl
Cosmopolitan Libc's mcount function does nothing, and the cost of calling an empty function can be as high as 14 cycles of overhead per function call. However if you can get GCC to insert NOPs instead:
Pause: push %rbp mov %rsp,%rbp 1: nopl 0x00(%rax,%rax,1) rep nop pop %rbp ret .endfn Pause,globl
Then the overhead is negligible on the order of picoseconds. So it's
worth considering just always having the -pg
nops in your
program. Because Cosmopolitan Libc will self-modify your executable at
runtime to turn those NOPs into function calls once you've enabled it.
In fact, there's even a runtime function that lets you insert an
arbitrary hook into your program's functions:
privileged noinstrument void MyTracer(void) { kprintf("hello\n"); } int main() { __hook(MyTracer, GetSymbolTable()); }
Since there's multiple profiling hooks that GCC uses depending on the
various combinations of flags, here's the hooks that __hook
will understand and be able to rewrite:
0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 67 0f 1f 04 00 nopl (%eax,%eax,1) 66 0F 1F 44 00 00 nopw 0x00(%rax,%rax,1) 67 e8 xx xx xx xx addr32 call mcount
The reason why we need the privileged
keyword above is
because some operating systems don't permit read+write+exec memory. So
when we rewrite the binary to change the NOPs into CALL, much of the
program's code needs to be disabled. Using the privileged
keyword will cause your function to be relocated to a special part of
the executable that never gets rewritten.
Let's say you wanted to write your own function that rewrites your binary at runtime. Intel's disassembler Xed comes included in the Cosmopolitan Libc amalgamation distro. So here's how you would, for instance, iterate over all the instructions in your program:
privileged void RewriteBinary(void) { int n, err; unsigned char *p; struct XedDecodedInst xedd; __morph_begin(); for (p = _ereal; p < __privileged_start; p += n) { xed_decoded_inst_zero_set_mode(&xedd, XED_MACHINE_MODE_LONG_64); err = xed_instruction_length_decode(&xedd, p, __privileged_start - p); if (err == XED_ERROR_NONE) { n = xedd.length; } else { n = 1; } } __morph_end(); }
The nice thing about Cosmopolitan's function tracing is that it's shell scriptable and fast. Having this debugging information easily plugged into the shell, means we can do things like study how many function calls programming language interpreters need to print Hello World.
$ o//tool/plinko/plinko.com <<<'(QUOTE HELLO-WORLD)' --ftrace |& wc -l 819 $ o//third_party/sqlite3/sqlite3.com :memory: "select 'hello world'" --ftrace |& wc -l 6253 $ o//third_party/lua/lua.com -e 'print("hello")' --ftrace |& wc -l 9741 $ o//third_party/quickjs/qjs.com -e 'console.log("hello world")' --ftrace |& wc -l 42272 $ o//third_party/python/python.com -c 'print("hello world")' --ftrace |& wc -l 931354
Here's the function logs if you're interested in reading them.
-rw-r--r-- 1 501 jart 43K May 19 19:29 plinko.txt -rw-r--r-- 1 501 jart 397K May 19 19:28 sqlite.txt -rw-r--r-- 1 501 jart 751K May 19 19:27 lua.txt -rw-r--r-- 1 501 jart 2.7M May 19 19:27 qjs.txt -rw-r--r-- 1 501 jart 150M May 19 19:28 python.txt
We're also integrating ftrace into all the programming languages that
are included in Cosmopolitan. For example, the cosmo.ftrace
context manager has been quite useful when testing CPython internals, or
writing extensions.
import cosmo with cosmo.ftrace() as F: WeirdFunction()
This makes ftrace viable for Python since it's only enabled temporarily. Otherwise there'd simply be too many functions logged. If you want to try using ftrace with Actually Portable Python, then you can build it on Linux with the following commands:
git clone https://github.com/jart/cosmopolitan cd cosmopolitan make -j8 o//third_party/python/python.com o//third_party/python/python.com >>: import cosmo >>: cosmo?
There's also a prebuilt binary here:
-rwxr-xr-x 1 501 jart 11M May 20 06:25 python.com
While we're on the topic of tracing, it's also worth mentioning that
Cosmopolitan Libc bakes a similar --strace
flag into
binaries. For example, if we were to trace a program that calls fsync():
int main() { creat("foo", 0644); write(3, _base, _end - _base); fsync(3); close(3); }
It might show us the following:
SYS 0 5'590 bell system five system call support 97 magnums loaded on gnu/systemd SYS 384 8'664 mmap(0x700000000000, 131'072, PROT_READ|PROT_WRITE, MAP_STACK|MAP_ANONYMOUS, -1, 0) → 0x700000000000 SYS 384 12'702 getenv("TERM") → "dumb" SYS 384 1'926'421 openat(AT_FDCWD, "o/foo", O_WRONLY|O_CREAT|O_TRUNC, 0644) → 3 SYS 384 1'973'434 write(3, u"⌂ELF☻☺☺○ ☻ > ☺ ♂!@ ♂ "..., 131'072) → 131'072 SYS 384 29'834'148 fsync(3) → 0 SYS 384 29'839'878 close(3) → 0 SYS 384 29'840'030 exit(0) SYS 384 29'840'324 __cxa_finalize(&407eef, 0) SYS 384 29'840'445 FreeFds() SYS 384 29'840'729 _Exit(0)
It's been an indispensable tool for building Cosmopolitan Libc, since GDB and LLDB aren't very reliable when developing across platforms. For instance, I haven't been able to figure out how to get a debugger or system call tracer to work on Mac OS X for the better part of ten years. So much of the XNU support needed to be written a priori until this tool came along, which has greatly helped in reducing bugs!
In the above example, you may have noticed that a number
&407eef
was printed in place of a symbol. This is one
of the tradeoffs that needed to be made by Cosmopolitan Libc's kprintf()
function, which is too mission critical to be able to call into all the
non-privileged code needed to open() + mmap() + etc. the concomitant ELF
.com.dbg binary and load the symbol table. The simple solution for this
is to have your main function call GetSymbolTable()
:
int main() { GetSymbolTable(); creat("foo", 0644); write(3, _base, _end - _base); fsync(3); close(3); }
Now the symbols should show up:
SYS 401 25'825'348 __cxa_finalize(&FreeFds, 0)
Funding for this blog post was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like Cosmopolitan Libc possible. Thank you.