Today I'd like to draw out attention on some under discussed design choices in Zig programming language.
And for this we aren't going to look at the syntax, comptime
, C-interops or colorless async. Instead we will dive in the generated machine code, and compare to instructions generated by Clang. Zig and Clang being both backed by LLVM optimizer, you could expect the generated instructions to be pretty similar, and yet.
Integer Overflow
First let's look at a C/C++ function that compares two small data structures: three-letters strings. Since we want this piece of code to go fast we aren't going to use stack allocated strings, but interned strings from a shared buffer of characters, and refer to the strings by their offset into the shared buffer.
int memcmp_signed(char* data, int i1, int i2) {
int i;
for (i = 0; i < 2; ++i) {
if (data[i1 + i] != data[i2 + i])
return data[i1 + i] < data[i2 + i];
}
return data[i1 + i] < data[i2 + i];
}
This gives the following instructions (browse in Godbolt)
memcmp_signed(char*, int, int):
movsxd rsi, esi
movsxd rcx, edx
mov dl, byte ptr [rdi + rsi]
mov al, byte ptr [rdi + rcx]
cmp dl, al
jne .LBB1_2
mov dl, byte ptr [rsi + rdi + 1]
mov al, byte ptr [rcx + rdi + 1]
cmp dl, al
jne .LBB1_2
mov al, byte ptr [rsi + rdi + 2]
cmp al, byte ptr [rcx + rdi + 2]
setl al
movzx eax, al
ret
.LBB1_2:
cmp dl, al
setl al
movzx eax, al
ret
I'm not gonna go in details of what's going on, but we can see the assembly is pretty optimized, in particular the loop is unrolled and there is no add
. Indeed reading at a given memory offset is such a common operations that x86_64 has some special mov
instructions for that. The i1 + i
has been merged with the next read: mov dl, byte ptr [rcx + rdi + 1]
.
The astute reader may already wonder why I'm using int
to represent an offset into an array. Right, let's change that to unsigned int
. This will allow us to store twice as much data for free, right ?
Well not exactly:
memcmp_unsigned(char*, unsigned int, unsigned int):
mov eax, esi
mov al, byte ptr [rdi + rax]
mov ecx, edx
mov cl, byte ptr [rdi + rcx]
cmp al, cl
jne .LBB0_2
lea eax, [rsi + 1]
mov al, byte ptr [rdi + rax]
lea ecx, [rdx + 1]
mov cl, byte ptr [rdi + rcx]
cmp al, cl
jne .LBB0_2
add esi, 2
mov al, byte ptr [rdi + rsi]
add edx, 2
cmp al, byte ptr [rdi + rdx]
setl al
movzx eax, al
ret
.LBB0_2:
cmp al, cl
setl al
movzx eax, al
ret
We can see that there are now more instructions generated when using unsigned integer than signed integer. Notably the lea
and add
instructions. Incrementing unsigned int
seems to be more work than incrementing signed int
. This is because of the infamous Undefined Behavior (UB).
In C++ an overflow of signed integer is UB, which means Clang is allowed to assume than in our code i1 + i
will never overflow, and such will always provide a valid offset into the data
memory. OTOH adding unsigned int
is Defined Behavior and is expected to wrap over to zero. Now i1 + i
really means i1 = (i1 + i) % 2^32
. This prevents LLVM to use one of the offset instructions of x86_64, so it needs to do a proper add
before calling mov
.
This also shows that Undefined Behavior is not about shooting developers in the foot, but a contract between compiler and programmer, enabling the compiler to understand the programmer intent, and enable more optimizations.
This problem isn't new and Chandler Caruth is talking about it in it's CppCon'16 talk about Undefined Behavior: https://www.youtube.com/watch?v=yG1OZ69H_-o&t=2357s, most of previous paragraph is mostly paraphrasing what he says in that talk.
Interestingly, Chandler conclusion is that "stealing bits (by using unsigned integers) is a thing of the past". This struck me when I heard it. If I'm writing C++, it's because I want to "steal bits" otherwise I'd be writing Python. So I checked what Zig was doing, and Zig has a much consistent behavior: both unsigned and signed overflow is UB in release fast. Therefore the generated machine code is almost the same in the signed/unsigned version of the function, and similar to what C++ produces for signed integers. You can also check this out in Godbolt, using Zig tabs.
I think this is a reasonable choice, because it makes the machine code generation and downstream performance more predictable. And in Zig if programmers want the wrapping behavior they need to explicitly opt-in by using the wrapping operators: +% -% *% /%
.
So by carefully choosing the undefined behaviors Zig enables more optimizations, and also more predictable optimizations.
Value semantics
Another case where Clang generates sub-optimal code is when it fails to apply the LLVM "load hoisting" optimization. Let's look at this piece of code:
class A {
bool cond;
void loopInvariant();
};
extern void f();
void A::loopInvariant() {
for (int i = 0; i < 5; i += 1) {
if (this->cond)
f();
}
}
The question here is how many time do we need to load cond
from RAM ? You might expect that this->cond
will be moved outside of the while loop and only be read once. Note that writing code like this may seem naive, but when you use templates, or during optimization it's possible to have more complex code reducing to this naive code.
A::loopInvariant():
push rbx
cmp byte ptr [rdi], 0
je .LBB0_5
mov rbx, rdi
call f()
cmp byte ptr [rbx], 0
je .LBB0_5
call f()
cmp byte ptr [rbx], 0
je .LBB0_5
call f()
cmp byte ptr [rbx], 0
je .LBB0_5
call f()
cmp byte ptr [rbx], 0
je .LBB0_5
pop rbx
jmp f()
.LBB0_5:
pop rbx
ret
But in fact, there are 5 cmp
instructions. Clang isn't able to do the optimization because f
could potentially read this
through a global variable and modify this->cond
during the loop. This is aggravated by the fact that in C++ each .cpp
file is compiled independently so f
could be any function implemented in another .cpp
file and not only functions defined in external libraries. So basically almost any non trivial method code will read its members several times even if they aren't modified. Those extra reads also prevent further optimization (the loop invariant in the example). Note that this is not a defect of Clang, but a consequence of C++ semantics. With LTO enabled Clang might decide to inline f
and then will get a chance to optimize away the extra loads.
This is not a made up problem, and some C++ experts are concerned about this, Dave Abrahams gave a full talk on this topic at CppCon'22: https://www.youtube.com/watch?v=QthAU-t3PQ4.
Do we have this problem in Zig ? Let's check:
const A = struct {
cond: bool,
fn loopInvariant(self: A) void {
var i: u8 = 0;
while (i < 5) : (i += 1) {
if (self.cond)
f();
}
}
};
example.A.loopInvariant:
push rax
test byte ptr [rdi], 1
jne .LBB0_2
pop rax
ret
.LBB0_2:
call f@PLT
call f@PLT
call f@PLT
call f@PLT
pop rax
jmp f@PLT
There is only one cmp
done before the loop, and the loop is completely unrolled. Zig compiler can do that because in Zig methods are regular functions, and parameters are immutable. We can pass self
"by-value", even if Zig will use a pointer under the hood. This important tidbit is actually buried inside the Zig "one pager" documentation. By contrast if we pass self: *A
by pointer then the pointer itself is immutable but not the underlying memory and therefore Zig will also assume that self.cond
can be modified by f
, and will produce instructions similar to Clang.
Conclusion
Having a compiler I can trust to generate efficient machine code is really important when I chose to work with a low-level language. The UB arithmetic overflow and the by-value parameters make a big difference in how your code is going to be optimized. Those great design decisions are under advertised, I hope this post can shine some light on them.
Latest comments (4)
In the first example idiomatic way is to use size_t for indexing. Then there will be no problems with wrap-around arithmetic.
In the second example identical C++ code produces results identical to Zig
godbolt.org/z/EehT14xM7
This is simply because of local variable
a
and escape analysis on it and inlining of the methodI saw that on the second example when changing the
const a = ...
tovar a = ...
andusing self: *A
doesn't make the zig compiler use the code like Clang.If I make the a var global by declaring
var a: A = undefined;
The compiler do use the Clang style.
Whould't it be a problem?
godbolt.org/z/9ocf6oe39
I'm of the opinion that compiler optimizations should not be seen as a substitute for manual optimizations, but as a tool for making efficient code without sacrificing clarity. Knowing what the compiler can and can not do is therefore important when writing high-quality code.
Great post!
Great post!
I noticed that for your first godbolt link you actually manually unroll the loop (and it's
<=
for the last of one function and<
for the other). I wanted to see what your original code does: godbolt.org/z/qjxM3cWjoBasically it's as you said, unsigned in c is slower.
Just in case anyone else wants to see the asm diffs.