On the value of going deep, or How a broken keyboard led me to fix bugs in Zig.

All in the golden afternoon..

This afternoon I sat down in front of my keyboard with a cup of coffee, ready to work. As you can imagine, you don't end up using this kind of keyboard by accident, and I really liked this one. But, alas, the keyboard was acting up again, seemingly rebooting whenever I typed on the right-hand side. Oh no, I thought, I had already fixed this last week by reheating the microcontroller legs a bit... But today I had to make progress, so I sighed and pushed my keyboard to the side.

And I was back to typing on my laptop keyboard. I have always found this awkward, and after getting used to a vertically staggered layout it's hard to ignore the wrist pain. So I decided to dig up layerz, a project of mine from a year ago that emulates part of my Keyseebee layout on my laptop by giving extra powers to the "space" and "alt" keys.

Since I hadn't touched this code for some time, I had to update it from Zig 0.8.0 to Zig 0.10.0. There were a few breaking changes around function pointers and the build API, but nothing crazy, and I was quickly able to compile layerz. I felt relieved to have mitigated my issue so quickly and to be able to get back to work.

But as I pressed "Meta-tab", I saw a segfault printed in my terminal. This was somewhat confusing to me, because layerz is a very simple program: it reads key events from one file handle and writes them to another. I realized I had forgotten to run the test suite after the last code changes, but no, the tests were still passing.

“Oh dear! Oh dear! I shall be late!”

Hmm, why is the behavior different in test and in prod? The only thing I wasn't testing was my interaction with libevdev. libevdev is a wrapper library for evdev devices. It lets me "grab" a physical device and its inputs, create a new virtual device with its own file descriptor, and write events there.

It was unlikely that libevdev itself was the culprit, and it was also unlikely that I had a severe bug in my calling code, since it used to work on a previous laptop.

So what was in between my code and libevdev? The well-known C ABI. This is the binary interface that glues together most of the programming world.
Most compilers speak this language, and are able to correctly pass simple values and structs across this frontier to functions written in another language. Note that at this point in time, this simple definition was about all I could tell you about the C ABI.

The rabbit-hole went straight on like a tunnel for some way

How do we know that Zig is calling the ABI correctly? At first I couldn't even find the relevant test suite in Zig, so I looked for Clang's. I searched the web until I found a test suite for Clang in llvm-test-suite. It is mostly about creating C structs and verifying that Clang produces the right layout for them. I decided to see if Zig could pass those tests.
Zig is also a C compiler, so I could quickly run the tests with zig cc, but of course that's not really helpful: zig cc uses libclang, so it gives the same results as Clang.

So I tried the zig translate-c feature, which lets Zig convert C files to Zig. That wasn't really convincing: the test suite is very macro-heavy, something translate-c doesn't handle well, and a lot of its complexity comes from implementing a test runner in C. Zig has one built in, so the translated code wasn't very idiomatic even once I had worked around the macro issues.

At this point I rolled up my sleeves and wrote a small Python script to translate the very repetitive C code into Zig code, and I quickly got my first results. Out of the full test suite, there were actually 1808 valid C structs. The others contain either empty structs (which are invalid in C but not in C++) or bitfields, whose behavior across the C ABI is not specified by the C standard.
Clang apparently tests them to ensure it does the same thing as gcc.

For those 1808 structs, Zig passed all the layout tests. So I had to go deeper. Knowing the shape of a struct is necessary to use the C ABI, but it's not sufficient.
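To give an idea of what a layout test compares, here is a minimal sketch of the C side for one struct. It is not the code from llvm-test-suite or from my generator: the helper names are made up for illustration. The idea is simply that the Zig extern struct with the same fields must report the same size, alignment and field offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Two bytes followed by a double. */
struct C_C_D {
    int8_t v1;
    int8_t v2;
    double v3;
};

/* Layout facts as the C compiler sees them (hypothetical helper names).
   On x86_64 Linux these are: size 16, alignment 8, offsets 0, 1 and 8,
   because v3 must start on an 8-byte boundary. */
size_t c_sizeof_C_C_D(void)      { return sizeof(struct C_C_D); }
size_t c_alignof_C_C_D(void)     { return _Alignof(struct C_C_D); }
size_t c_offsetof_C_C_D_v1(void) { return offsetof(struct C_C_D, v1); }
size_t c_offsetof_C_C_D_v2(void) { return offsetof(struct C_C_D, v2); }
size_t c_offsetof_C_C_D_v3(void) { return offsetof(struct C_C_D, v3); }
```

On the Zig side the matching numbers would come from @sizeOf, @alignOf and @offsetOf, and a layout test only has to assert that both sides agree.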

There were doors all round the hall, but they were all locked;

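That's because C structs are typically split across registers before the C function is called. There are conventions about which registers are used for which function parameters.
For example, under the RISC-V hard-float calling convention, a {v1: f32, v2: i32} struct has v1 passed in a floating point register and v2 in an integer register, even though both would fit in a single 64-bit register. Not every ABI makes the same choice: on x86_64 System V those two fields share one 8-byte chunk and travel together in an integer register, and the split only shows up for larger structs such as {v1: i8, v2: i8, v3: f64}.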
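To make the register conventions a bit more concrete, here is a rough sketch of how x86_64 System V decides where a small struct goes. It is a deliberate simplification that I'm assuming is good enough for flat structs of plain integers, floats and doubles: it ignores long double, vectors, nested aggregates and unaligned fields. The type and function names are hypothetical and don't come from any compiler.

```c
#include <stddef.h>

enum arg_class { NO_CLASS, INTEGER, SSE, MEMORY };

/* One scalar field of a flat struct (hypothetical descriptor). */
struct field {
    size_t offset;  /* byte offset inside the struct */
    int is_float;   /* non-zero for float or double */
};

/* Simplified classification: the struct is cut into 8-byte chunks
   ("eightbytes"). A chunk holding only floats is SSE (xmm register),
   a chunk holding at least one integer is INTEGER (general register).
   Structs larger than 16 bytes are passed in memory.
   Returns the number of chunks used, or 0 for the MEMORY case. */
size_t classify(const struct field *fields, size_t n,
                size_t struct_size, enum arg_class out[2])
{
    out[0] = out[1] = NO_CLASS;
    if (struct_size == 0 || struct_size > 16) {
        out[0] = MEMORY;
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        size_t chunk = fields[i].offset / 8;
        if (chunk > 1)
            continue; /* malformed descriptor, ignored in this sketch */
        enum arg_class c = fields[i].is_float ? SSE : INTEGER;
        /* Merge rule: INTEGER wins over SSE within the same chunk. */
        if (out[chunk] == NO_CLASS || c == INTEGER)
            out[chunk] = c;
    }
    return (struct_size + 7) / 8;
}
```

Feeding it a struct made of two int8 fields at offsets 0 and 1 and a double at offset 8 (16 bytes in total) gives INTEGER for the first chunk and SSE for the second: the two bytes go in a general register and the double in an xmm register.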

And now it's time to talk about different architectures and OSes.
Since the C ABI talks about registers, you actually have one C ABI per architecture. And x86_64 even has two, because Windows uses its own calling convention and not System V like other OSes. I can't tell you much about the difference, because I didn't get a chance to run my test suite on Windows, so let's go deeper instead!

So how do we actually test the C ABI?
When I asked on the Zig Discord, Topolarity pointed me to the (modest) C ABI test suite in Zig. It passes structs through the ABI and asserts they have the expected values on the other side.

I wrote similar tests for my list of structs. For each struct I generate a special value and a C function that checks the content of each field and returns 0 for success, or i if the i-th field doesn't contain the expected value.

The generated code looks like this:

// zig side:
test "F_I: Zig passes to C" {
    try testing.expectOk(c.assert_F_I(.{ .v1 = -0.25, .v2 = 2673 }));
}

// C side:
int assert_F_I(struct F_I lv){
    int err = 0;
    if (lv.v1 != -0.25) err = 1;
    if (lv.v2 != 2673) err = 2;
    return err;
}

I actually generate tests in 4 directions:

Zig calls the C assertion function
C calls Zig assertion function
Zig asserts a struct returned by C
C asserts a struct returned by Zig
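The snippet above covers the first direction. For the two "returned by" directions, the C side could look roughly like the sketch below. This is an illustration and not the exact generated code: the function names are hypothetical, and passing the callee as a function pointer is my assumption here so that the example compiles without the Zig half.

```c
#include <stdint.h>

struct F_I {
    float v1;
    int32_t v2;
};

/* "Zig asserts a struct returned by C": C builds the special value
   and returns it by value, so it has to travel back through the ABI. */
struct F_I c_ret_F_I(void)
{
    struct F_I lv = { -0.25f, 2673 };
    return lv;
}

/* "C asserts a struct returned by Zig": C calls a function returning
   the struct and checks each field. In a real suite the callee would be
   the function exported by Zig; any function with this signature works. */
int c_assert_ret_F_I(struct F_I (*callee)(void))
{
    struct F_I lv = callee();
    int err = 0;
    if (lv.v1 != -0.25f) err = 1;
    if (lv.v2 != 2673) err = 2;
    return err;
}
```

Calling c_assert_ret_F_I(c_ret_F_I) stays entirely within C and returns 0; the interesting cases are the ones where the producer and the consumer were compiled by different compilers.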

Using qemu I was also able to run the above tests on different platforms. It's actually very easy to use qemu with zig test, by just passing -target {target} --test-cmd qemu-{arch} --test-cmd-bin.

I found a number of failing tests. At this time the aarch64 test suite was actually segfaulting, while the x86_64 one was reporting 340 failing tests (across as many structs).
Note that this is with the newly released self-hosted Zig compiler of 0.10.0, which isn't fully ironed out yet.

Target	passed	skipped	failed	crashes
i386-linux	3725	3768	1927	0
x86_64-linux	9080	0	340	0
aarch64-linux	5	0	0	7
riscv64-linux	8494	0	146	1

she opened it, and found in it a very small cake

I was getting at something.
I opened a PR adding three distinct structs to the existing Zig test suite.
I was secretly hoping this would be enough to convince someone else to finish the job for me.
And it probably would have, but some other part of my brain was on the hunt and wanted to squash that bug.
Most importantly, this PR had shown the core contributors that I wasn't joking and that I was worth spending some time on.

I hadn't thought about this beforehand, but it makes a lot of sense now.
Having spent some time on the Zig Discord, I've observed that it's quite common to see people raising issues with big stack traces, or asking questions about a particular thing they want to improve, but not following up on them.
Core contributors have a lot of knowledge to share, but they also have limited time, and they can't spread themselves too thin if they want to focus on the hard parts.
Offering something first helps when you're asking for help.

her foot slipped, and in another moment, splash!

I focused on one struct and looked at the generated machine code. (I was using godbolt at first, but as soon as I started recompiling Zig I had to roll my own local "godbolt". That isn't very complicated, since Zig makes it easy to output the LLVM IR or the assembly.)

const C_C_D = extern struct  { v1: i8, v2: i8, v3: f64 };

pub export fn zig_assert_C_C_D(lv: C_C_D) c_int {
    var err: c_int = 0;
    if (lv.v1 != 88) err = 1;
    if (lv.v2 != 39) err = 2;
    if (lv.v3 != -2.125) err = 3;
    return err;
}

zig_assert_C_C_D:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 24
    mov     qword ptr [rbp - 24], rdi
    mov     qword ptr [rbp - 16], rsi
    mov     dword ptr [rbp - 4], 0
    cmp     byte ptr [rbp - 24], 88
    je      .LBB0_2
    mov     dword ptr [rbp - 4], 1
    jmp     .LBB0_3
    ...

I knew that Zig was generating the wrong machine code, but I didn't know exactly what was wrong. If you're familiar with x86_64 you can probably find the issue relatively quickly. Admittedly it took me a bit longer: even when comparing with Clang's assembly, it took me some time to see the difference that mattered. There are many differences between the Clang and Zig assembly, but most of them are non-issues, since they lead to equivalent behavior. Reading Raymond Chen's blog posts about calling conventions helped me understand what the right assembly for this function was supposed to be. The issue is that we are reading the struct from rdi and rsi, two integer registers, while the C calling convention says v3 should be passed in the first floating point register, xmm0.

From there the fix actually came pretty fast. I got some help from another core contributor, Vexu, who pointed me to the part of the code that generates the LLVM IR for the C calling convention, and pointed out that the intermediary struct representing the function registers couldn't handle a mix of float and integer registers. I modified the struct a bit, adapted the code generating it and the code reading it, recompiled the compiler, reran the test suite, and all x86_64 tests were passing:

x86_64-linux: Test results: 9420 passed; 0 skipped; 0 failed.

A quick PR later, and voilà, you can check on godbolt that zig trunk generates good looking LLVM IR and assembly code.

the Caucus-Race, the Cat, the Queen, the Croquet, ...

There are even more adventures going on with aarch64, and others with miscompilations in release mode. Those stories haven't been resolved yet, so they will be for another time.

Here are my takeaways from all this:

Don't be afraid of looking at what's below you: maybe you'll find something interesting, and worst case you'll learn a lot. People often think of a "rabbit hole" as "wasting time" or "getting lost". I tried to describe my thought process to show that it's possible to go very deep while staying focused on what you're searching for. It means I had to make some compromises, and didn't try to understand everything I was seeing, only what I needed to fix my bug.
When interacting with an OSS project, do your homework before asking for help, and try to bring something to the project.
Zig 0.10.0 has some bugs, which is kind of expected given that it's the first release of the self-hosted compiler. 0.10.1 will be more usable.
The Zig codebase is great. As a new contributor, I was able to quickly read, understand and fix the relevant code. The Zig core team is also great and they know their stuff.
Test generation is a great tool to find bugs.

And for those who were still holding their breath about my wrist pain, I've switched to a Sesame "Elmo" keyboard, which features cheap and sturdy electronics on a classic layout named Alice.

Thanks for reading, I wish you a lot of rabbit holing!

Top comments (5)

Reuben Miller • Feb 4 '23

I really like the takeaways, especially "When interacting with an OSS project, do your homework before asking for help, and try to bring something to the project".

I think too often users just think an OSS project is just a "free" project that they can use, and when something does not work, they just create an "it does not work, fix it" ticket. OSS is about community and using it means trying to positively contribute to it, whether it be a detailed ticket, PR, testing or whatever.