Jens Goldberg


Analysis of the overhead of a minimal Zig program

If you wanted to make a minimal x86-64 Linux program that did nothing, how would you write it? You'd probably whip out an assembler and type something like this:

mov    eax, 60 ; sys_exit
xor    edi, edi ; exit status 0
syscall

Letting LLD link it for us nets us a binary that's 600 bytes large. Aggressively stripping out all the unnecessary trash that the linker puts into it makes it 297 bytes — but we're not interested in linker overhead right now, so let's use 600 as a baseline.

If we write a minimal Zig program that does the same thing, will it be just as small? Probably not. Let's go through every assembly instruction of the Zig binary and see what's up!

First, let's write that program:

pub fn main() void {}

Building it with -O ReleaseSmall --strip -fsingle-threaded results in a 5.4KiB binary. The very first thing we notice is that not all of the debug symbols have been stripped, because the Zig strip flag isn't completely functional yet and is waiting for the stage 2 compiler. No matter, we just do it manually (with strip -s), shrinking it to 1.7KiB.

What does all that code do? When we objdump it and take a look, we find 208 lines of assembly consuming 715 bytes. In addition, it uses 128 bytes of read-only data and 12624 bytes of .bss (zero-initialized static data), which only takes up space in a running program and not in the binary itself.

Let's go through each line of assembly to see what's going on. First, we have this:

xor    rbp,rbp

I.e. rbp is cleared. If we take a look in std/start.zig we can see that this comes from inline assembly that Zig runs first thing in _start(). Why? Presumably because the x86-64 ABI mandates it:

The content of this register is unspecified at process initialization time, but the user code should mark the deepest stack frame by setting the frame pointer to zero

I'll allow it. ABI compliance is a very good reason for "wasting" 3 bytes of code, and should arguably be added to our original assembly program. Now, let's check the next line:

mov    QWORD PTR [rip+0x1e1e],rsp

What's this for? Turns out Zig always saves the initial value of rsp, since it starts out pointing at argc, which is followed by argv, the environment pointers and the auxiliary vector, all of which you need to walk to get at the program arguments. We're not using any of that though, so this is at first glance a completely unnecessary waste of 7 bytes.
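For reference, the words at the initial rsp are laid out as argc, the argv pointers, a null word, the environment pointers, another null word, and then the auxiliary vector. Here's a rough C sketch of walking that layout. The names parse_initial_stack and struct initial_stack are made up for illustration and aren't taken from start.zig:

```c
#include <stdint.h>

/* Hypothetical view of what the saved rsp value gives access to. */
struct initial_stack {
    uintptr_t argc;          /* the word at [rsp] */
    const uintptr_t *argv;   /* argc pointer-sized words, then a 0 word */
    const uintptr_t *envp;   /* environment pointers, then a 0 word */
    const uintptr_t *auxv;   /* (type, value) word pairs, ends with type 0 */
};

/* Walk the stack words starting at the saved initial stack pointer. */
static struct initial_stack parse_initial_stack(const uintptr_t *sp)
{
    struct initial_stack s;
    s.argc = sp[0];
    s.argv = sp + 1;
    /* envp starts right after the 0 word that terminates argv. */
    s.envp = s.argv + s.argc + 1;
    /* Skip the environment to reach the auxiliary vector. */
    const uintptr_t *p = s.envp;
    while (*p != 0)
        p++;
    s.auxv = p + 1;
    return s;
}
```

Nothing here can be recovered once rsp has moved, which is why the value has to be stashed before any other code touches the stack.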

Next up:

2011e2: call   0x2011e7
2011e7: push   rbp
2011e8: [...]
2011f4: and    rsp,0xfffffffffffffff0

So, we're instantly calling a function located directly on the next byte. Looking around the code, we find that this is the only place it's called from. Why? From reading start.zig we find the answer:

If LLVM inlines stack variables into _start, they will overwrite the command line argument data.

So, the function isn't inlined because it's called with never_inline; otherwise LLVM could put things that mess up rsp before the inline assembly that stashes rsp away. Makes sense, although it'd be nicer if there were a non-hacky way of solving it. In any case, we don't need rsp, so ideally we shouldn't have to pay for this at all.

What's up with the and rsp,0xfffffffffffffff0? That's because the function manually aligns the stack to the next 16-byte boundary. I'm not sure why the stdlib does this. The SystemV ABI (§2.3.1) guarantees an initial alignment of 16 already, both for x86-64 and i386, so it should be superfluous. From looking around a little, musl does the same alignment, as does glibc, but not dietlibc.
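The mask 0xfffffffffffffff0 is just ~0xf in 64 bits, so the instruction clears the low four bits of rsp and rounds it down to a multiple of 16. As a small C sketch (align_stack_down_16 is a made-up helper name):

```c
#include <stdint.h>

/* Round a stack pointer value down to a 16-byte boundary by clearing
   its low four bits, the same effect as `and rsp, ~0xf`. */
static uintptr_t align_stack_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)0xf;
}
```

An already aligned value passes through unchanged, which is why the instruction is a no-op if the kernel's initial alignment guarantee holds.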

Next up, the code locates and parses the auxiliary vector, which sits on the initial stack after argv and the environment. It holds the address of the program headers, which the program uses for PIE relocations (if applicable, which it isn't for us). The program headers also contain the stack size; if that isn't set to the default of 8MiB, Zig asks the kernel to resize the stack (it's not done automatically). This seems superfluous: if we compiled the program ourselves and used our own linker, we should be able to hardcode the stack resize at compile time if necessary, not store it in some roundabout program header. Since Zig is working on automatically calculating the maximum stack size required as well, this information could be directly available to the compiler in the future and used here.
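To make that concrete, here's a hedged C sketch of the two lookups involved: finding an entry in the auxiliary vector, and scanning the program headers for the PT_GNU_STACK entry whose memory size carries the requested stack size. The constant values are the ones from the Linux elf.h, redeclared under made-up names so the sketch stands alone; the struct and function names are hypothetical and this is not the code from start.zig:

```c
#include <stddef.h>
#include <stdint.h>

/* Values as in the Linux <elf.h> (AT_NULL, AT_PHDR, AT_PHNUM, PT_GNU_STACK). */
#define AUX_NULL  0u
#define AUX_PHDR  3u
#define AUX_PHNUM 5u
#define PHDR_TYPE_GNU_STACK 0x6474e551u

/* One auxiliary vector entry on a 64-bit target. */
struct aux_entry { uint64_t type; uint64_t value; };

/* Same field layout as Elf64_Phdr. */
struct phdr64 {
    uint32_t type, flags;
    uint64_t offset, vaddr, paddr, filesz, memsz, align;
};

/* Scan the vector up to its AUX_NULL terminator; returns 1 if found. */
static int aux_lookup(const struct aux_entry *auxv, uint64_t type, uint64_t *out)
{
    for (; auxv->type != AUX_NULL; auxv++) {
        if (auxv->type == type) {
            *out = auxv->value;
            return 1;
        }
    }
    return 0;
}

/* Return the stack size recorded in PT_GNU_STACK, or 0 if there is none. */
static uint64_t requested_stack_size(const struct phdr64 *phdrs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (phdrs[i].type == PHDR_TYPE_GNU_STACK)
            return phdrs[i].memsz;
    }
    return 0;
}
```

In a real startup routine the AUX_PHDR and AUX_PHNUM values would supply the phdrs pointer and count passed to the second function.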

Lastly, the data is also needed to initialize the static TLS memory. This is for static threadlocal variables that should have a unique copy for each thread, like errno. "But we are using -fsingle-threaded," you may object, "so why doesn't the compiler turn all the thread-local variables into normal static ones and strip out the TLS section?" The reason is that you could export a threadlocal symbol to another program that's actually threaded, so we can't just remove them willy-nilly.

Moreover, since the TLS initialization calls mmap if the size is large enough, it can fail, which calls abort(). abort() in turn calls raise(SIG.ABRT), and raise in turn masks out all the signals with sigprocmask. It's this call that uses the 128 bytes of read-only data we saw previously. It's fairly large as it needs to contain the entire set of possible signals.
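The 128 bytes add up if the set has one bit per signal and is sized for 1024 signals: 1024 / 8 = 128. Here's a minimal C sketch under that assumption (the sizing matches glibc's sigset_t, but the exact definition in the Zig stdlib isn't quoted here; full_sigset and all_signals_mask are made-up names):

```c
#include <stdint.h>
#include <string.h>

/* Assumption: room for 1024 signals, which is what 128 bytes implies. */
#define SIGSET_BITS 1024

typedef struct {
    uint32_t words[SIGSET_BITS / 32];   /* 32 words of 32 bits each */
} full_sigset;

/* Build the "block everything" mask: every bit set. */
static full_sigset all_signals_mask(void)
{
    full_sigset m;
    memset(&m, 0xff, sizeof m);
    return m;
}
```

A constant like this can't be folded away into an immediate, so it ends up as a block of 0xff bytes in the read-only data section.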

The TLS initialization also explains much of the wasted .bss data: it uses an 8448-byte static buffer when the TLS data is small enough to fit in it.
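Roughly, the logic looks like the following C sketch. It's an illustration under stated assumptions and not the stdlib code: the names are invented, the mmap fallback is left to the caller, and the fill step follows the general ELF convention that a TLS image consists of some initialized bytes followed by zeroes up to its in-memory size.

```c
#include <stddef.h>
#include <string.h>

#define STATIC_TLS_BUFFER_SIZE 8448   /* the buffer size quoted above */

/* Zero-initialized, so it lands in .bss and costs nothing in the file. */
static unsigned char static_tls_buffer[STATIC_TLS_BUFFER_SIZE];

/* Return the static buffer if the TLS area fits, or NULL to signal that
   the caller has to fall back to mmap (the path that can fail). */
static void *tls_static_storage(size_t size)
{
    if (size <= sizeof static_tls_buffer)
        return static_tls_buffer;
    return NULL;
}

/* Copy the initialized part of the image, then zero the remainder. */
static void tls_fill(void *dst, const void *image, size_t filesz, size_t memsz)
{
    memcpy(dst, image, filesz);
    memset((unsigned char *)dst + filesz, 0, memsz - filesz);
}
```

The buffer is reserved whether or not the program has any thread-local data, which is why it shows up in .bss even for an empty main.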

Tangentially we can see that avoiding TLS when it's not needed is an open issue: #2432, so it's something that's in the pipeline to be handled.

In any case, since we don't use TLS, PIE, argv, or environment variables, all of this is just a waste of space. Let's try commenting all of that out: in start.zig we remove everything that depends on argc, then everything that depends on those lines, and so on. After that's done we're more or less back at our initial ideal program size, with just the minor cruft I mentioned at the start:

xor    rbp,rbp
mov    QWORD PTR [rip+0x1016],rsp
call   0x201167
push   rbp ; @ 0x201167
mov    rbp,rsp
and    rsp,0xfffffffffffffff0
push   0x3c
pop    rax
xor    edi,edi
syscall

Now, what was the point of all this? I think there are several benefits to minimizing overhead for simple programs:

  • Having minimal overhead for tiny programs is actually relevant for system performance. Many scripts, for example, work by chaining together common Unix programs, so you're potentially having the same startup code running tens of thousands of times in a short duration. This can get fairly significant! Right now Linux ameliorates the performance hit from this by either writing built-in copies of the most common tools directly into the shell (like Bash does), or having a single fat binary that you stuff a ton of programs into (like BusyBox) so you don't have to store the same initialization code across hundreds of programs.

  • The very first thing anybody interested in Zig will attempt to do is compile a "Hello World!" program and look at it. Having it be an order of magnitude smaller than the equivalent C program would be really impressive, and first impressions count for a lot. I've watched friends try Go and immediately uninstall the compiler when they see that the resulting no-op demo program is larger than 2 MiB.

  • Overhead breeds complacency — if your program is already several megabytes in size, what's a few extra bytes wasted? Such thinking leads to atrocities like writing desktop text editors bundled on top of an entire web browser, and I think it would be nice to have a language that pushes people to be a bit more mindful of the amount of resources they're using.

Discussion (2)

Petros Moisiadis

The entire article is based on a program that does nothing. Compilers are meant to be used to compile useful programs that actually do things. Programs that use arguments, environmental variables, etc. So, any overhead to handle what useful programs are doing most of the time is often more than acceptable. For TLS initialization, there could be an option to disable it, but the overhead is still acceptable for many cases.

The first of the three points mentioned at the end is about a program's initialization performance. Most of the time this can be neglected as it is not significant in relation to the whole performance of a program. E.g. when programs run as services. Even if a program is re-spawned very frequently, overhead from system calls involved to create new processes, terminate old ones, etc., could be more significant. Besides, in that case, it makes more sense to look at the architecture of the whole stack than blame your compiler's ability to create smaller binaries.

The second point is about people being impressed when looking at the size of the produced binaries. Binaries produced by the Zig compiler are pretty small already and on par with C. With today's hardware, it is not a big deal to make them even smaller. There are many other features of Zig for people to be impressed with.

About the third point, I would agree that developers should think carefully about how their program is using system resources, but this is true with writing programs in any language and has nothing to do with how a compiler implementation optimizes the code for using less resources. For most programmers, compilers are mostly a black box. Any optimization in any compiler implementation is meant to be taken as a bonus. So, a compiler that produces the tiniest possible binaries does not teach the programmers how to think wisely about resource usage. The programmers just take advantage of this ability and their programs can use less resources at initialization. But this difference, with today's hardware, could be insignificant. On the other hand, there are cases in which it makes sense to have bigger, statically linked binaries like Go and Rust compilers produce by default (at least when linking with other libs written in Go or Rust accordingly).

Matheus C. França

Very good article!
However, I would like to suggest not only reporting the minimum size of the binary.
Would it be possible to expand the subject to help understand how Zig's startup process before the main function compares to that of other systems languages? Is there any overhead from Zig in bare-metal or kernel-level applications?