Zig NEWS

Guillaume Wenzek

Zig: great design for great optimizations

Today I'd like to draw your attention to some under-discussed design choices in the Zig programming language.

And for this we aren't going to look at the syntax, comptime, C interop or colorless async. Instead we will dive into the generated machine code and compare it to the instructions generated by Clang. Since Zig and Clang are both backed by the LLVM optimizer, you could expect the generated instructions to be pretty similar. And yet.

Integer Overflow

First let's look at a C/C++ function that compares two small data structures: three-letter strings. Since we want this piece of code to go fast, we aren't going to use stack-allocated strings, but interned strings in a shared buffer of characters, referring to each string by its offset into the shared buffer.

// Compare two interned 3-letter strings stored at offsets i1 and i2 in the
// shared buffer `data`. Returns 1 if the first string compares lower at the
// first differing position, 0 otherwise.
int memcmp_signed(char* data, int i1, int i2) {
    int i;
    for (i = 0; i < 2; ++i) {
        if (data[i1 + i] != data[i2 + i])
            return data[i1 + i] < data[i2 + i];
    }
    return data[i1 + i] < data[i2 + i];
}

This gives the following instructions (browse them on Godbolt):

memcmp_signed(char*, int, int): 
        movsxd  rsi, esi
        movsxd  rcx, edx
        mov     dl, byte ptr [rdi + rsi]
        mov     al, byte ptr [rdi + rcx]
        cmp     dl, al
        jne     .LBB1_2
        mov     dl, byte ptr [rsi + rdi + 1]
        mov     al, byte ptr [rcx + rdi + 1]
        cmp     dl, al
        jne     .LBB1_2
        mov     al, byte ptr [rsi + rdi + 2]
        cmp     al, byte ptr [rcx + rdi + 2]
        setl    al
        movzx   eax, al
        ret
.LBB1_2:
        cmp     dl, al
        setl    al
        movzx   eax, al
        ret

I'm not going to go into the details of what's going on, but we can see the assembly is pretty well optimized; in particular, the loop is unrolled and there is no add instruction. Indeed, reading at a given memory offset is such a common operation that x86_64 has dedicated addressing modes for it. The i1 + i has been folded into the next read: mov dl, byte ptr [rsi + rdi + 1].

The astute reader may already wonder why I'm using int to represent an offset into an array. Right, let's change that to unsigned int. This will allow us to address twice as much data for free, right?

Well, not exactly:

memcmp_unsigned(char*, unsigned int, unsigned int):            
        mov     eax, esi
        mov     al, byte ptr [rdi + rax]
        mov     ecx, edx
        mov     cl, byte ptr [rdi + rcx]
        cmp     al, cl
        jne     .LBB0_2
        lea     eax, [rsi + 1]
        mov     al, byte ptr [rdi + rax]
        lea     ecx, [rdx + 1]
        mov     cl, byte ptr [rdi + rcx]
        cmp     al, cl
        jne     .LBB0_2
        add     esi, 2
        mov     al, byte ptr [rdi + rsi]
        add     edx, 2
        cmp     al, byte ptr [rdi + rdx]
        setl    al
        movzx   eax, al
        ret
.LBB0_2:
        cmp     al, cl
        setl    al
        movzx   eax, al
        ret

We can see that more instructions are now generated when using unsigned integers than signed ones, notably the extra lea and add instructions. Incrementing an unsigned int seems to be more work than incrementing a signed int. This is because of the infamous Undefined Behavior (UB).

In C++, overflow of a signed integer is UB, which means Clang is allowed to assume that in our code i1 + i will never overflow, and thus will always produce a valid offset into the data memory. On the other hand, adding unsigned ints is defined behavior and is expected to wrap around to zero: i1 + i really means (i1 + i) % 2^32. This prevents LLVM from using one of the x86_64 offset addressing modes, so it needs to do a proper add before the mov.

This also shows that Undefined Behavior is not about shooting developers in the foot, but a contract between the compiler and the programmer: it lets the compiler understand the programmer's intent, and so enables more optimizations.

This problem isn't new, and Chandler Carruth talks about it in his CppCon'16 talk about Undefined Behavior: https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2357s. The previous paragraphs are mostly paraphrasing what he says in that talk.

Interestingly, Chandler's conclusion is that "stealing bits (by using unsigned integers) is a thing of the past". This struck me when I heard it. If I'm writing C++, it's because I want to "steal bits"; otherwise I'd be writing Python. So I checked what Zig does, and Zig has a much more consistent behavior: both unsigned and signed overflow are UB in ReleaseFast mode. Therefore the generated machine code is almost the same in the signed and unsigned versions of the function, and similar to what C++ produces for signed integers. You can also check this out in Godbolt, using the Zig tabs.
I think this is a reasonable choice, because it makes the machine code generation and downstream performance more predictable. And in Zig, if programmers want the wrapping behavior, they need to explicitly opt in by using the wrapping operators: +% -% *%.

So by carefully choosing its undefined behaviors, Zig enables more optimizations, and more predictable optimizations.

Value semantics

Another case where Clang generates sub-optimal code is when it fails to apply LLVM's "load hoisting" optimization. Let's look at this piece of code:

class A {
    bool cond;
    void loopInvariant();
};

// f is defined in another translation unit: the compiler cannot see its body.
extern void f();

void A::loopInvariant() {
    for (int i = 0; i < 5; i += 1) {
        if (this->cond)
            f();
    }
}


The question here is: how many times do we need to load cond from memory? You might expect that this->cond will be hoisted out of the loop and read only once. Writing code like this may seem naive, but with templates, or during optimization, more complex code can reduce to this naive form. Here is what Clang generates:

A::loopInvariant():
        push    rbx
        cmp     byte ptr [rdi], 0
        je      .LBB0_5
        mov     rbx, rdi
        call    f()
        cmp     byte ptr [rbx], 0
        je      .LBB0_5
        call    f()
        cmp     byte ptr [rbx], 0
        je      .LBB0_5
        call    f()
        cmp     byte ptr [rbx], 0
        je      .LBB0_5
        call    f()
        cmp     byte ptr [rbx], 0
        je      .LBB0_5
        pop     rbx
        jmp     f()
.LBB0_5:
        pop     rbx
        ret

But in fact, there are 5 cmp instructions. Clang isn't able to do the optimization because f could potentially reach this through a global variable and modify this->cond during the loop. This is aggravated by the fact that in C++ each .cpp file is compiled independently, so f could be any function implemented in another .cpp file, not only a function defined in an external library. So basically almost any non-trivial method will read its members several times, even if they aren't modified. Those extra reads also prevent further optimizations (hoisting the loop invariant in the example). Note that this is not a defect of Clang, but a consequence of C++ semantics. With LTO enabled, Clang might decide to inline f and then get a chance to optimize away the extra loads.

This is not a made-up problem, and some C++ experts are concerned about it; Dave Abrahams gave a full talk on this topic at CppCon'22: https://www.youtube.com/watch?v=QthAU-t3PQ4.

Do we have this problem in Zig? Let's check:

extern fn f() void;

const A = struct {
    cond: bool,

    fn loopInvariant(self: A) void {
        var i: u8 = 0;
        while (i < 5) : (i += 1) {
            if (self.cond)
                f();
        }
    }
};

Which compiles to:

example.A.loopInvariant:
        push    rax
        test    byte ptr [rdi], 1
        jne     .LBB0_2
        pop     rax
        ret
.LBB0_2:
        call    f@PLT
        call    f@PLT
        call    f@PLT
        call    f@PLT
        pop     rax
        jmp     f@PLT

There is only one test of cond, done before the loop, and the loop is completely unrolled. The Zig compiler can do that because in Zig methods are regular functions, and parameters are immutable. We can pass self "by value", even if Zig may use a pointer under the hood. This important tidbit is actually buried inside the Zig "one pager" documentation. By contrast, if we pass self: *A by pointer, then the pointer itself is immutable but not the underlying memory, so Zig will also assume that self.cond can be modified by f, and will produce instructions similar to Clang's.

Conclusion

Having a compiler I can trust to generate efficient machine code is really important when I choose to work with a low-level language. The UB on arithmetic overflow and the by-value parameters make a big difference in how your code is going to be optimized. Those great design decisions are under-advertised; I hope this post can shine some light on them.

Top comments (3)

Arya-Elfren

Great post!

I noticed that in your first Godbolt link you actually manually unroll the loop (and it's <= for the last comparison of one function and < for the other). I wanted to see what your original code does: godbolt.org/z/qjxM3cWjo

Basically it's as you said, unsigned in C is slower.

Just in case anyone else wants to see the asm diffs.

kfir

I saw that in the second example, changing the const a = ... to var a = ... and using self: *A doesn't make the Zig compiler generate code like Clang's.

If I make a global by declaring var a: A = undefined;, the compiler does generate the Clang-style code.

Wouldn't that be a problem?

godbolt.org/z/9ocf6oe39

tecanec

I'm of the opinion that compiler optimizations should not be seen as a substitute for manual optimizations, but as a tool for making efficient code without sacrificing clarity. Knowing what the compiler can and cannot do is therefore important when writing high-quality code.

Great post!