Loris Cro

Posted on Sep 8, 2021

Extend a C/C++ Project with Zig

#c #cpp #cabi

Zig is not just a programming language but also a toolchain that can help you maintain and gradually modernize existing C/C++ projects, based on your needs. In this series we're using Redis, a popular in-memory key value store written in C, as an example of a real project that can be maintained with Zig. You can read more in "Maintain it with Zig".

A taste of Zig

In this series we started by using Zig as a C/C++ compiler and dove deeper as we worked to make cross-compilation possible. In the last post we ditched Make and all other build dependencies in favor of zig build.

This is a good place to be, and it could very well be the end of your journey. In my case, if I were to take ownership of a C codebase, I would definitely be interested in continuing its development using Zig rather than C, so let's take a final dive into the Redis codebase to learn how Zig and C interoperate.

If you need an introduction to Zig as a language, check out this talk by @andrewrk

One particularly relevant argument from that talk is how "Zig is better at using C libraries than C itself". Make sure you don't miss that passage.

Extending Redis

In this article we'll add a new command to Redis. This will be a great opportunity to showcase a realistic, non-trivial example of how to include Zig in an existing C code base.

Our new command will need to integrate with the existing Redis ecosystem to open keys, read their contents, and to reply to the client. This will allow us to examine Zig's interoperability story both from and to C (i.e., C calling into Zig code and Zig using C definitions).

Finally, I'll tell you upfront that this is not a special "best case scenario" that we're going to see; in fact we're going to face a current limitation of the compiler when it comes to reading C header files and we'll implement a simple workaround for it.

Look at me, I'm the captain maintainer now

The whole idea of this series is to use Redis as an example of a project we maintain, so it makes sense for us to perform this type of modification to "our" code base, but be aware that writing a Redis Module is the correct way of adding new commands to Redis as a user (which also is very easy to do using Zig, but that's a story for another time).

Since we have to operate on "our" codebase, I'll also introduce you to some of the details and quirks of how Redis is written because, while this addition is by no means invasive, we're going to perform a proper integration, which requires knowing a bit of Redis trivia.

Adding UTF8 support to Redis

The most basic key type in Redis is the string. Strings in Redis are just byte sequences, so they don't have to respect any particular encoding (thank god), but this means that occasionally some basic commands will not behave as you'd like them to. One simple example is the STRLEN command, which returns a byte count. That is usually not what you want when you're dealing with Unicode data.
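To make the byte/codepoint distinction concrete, here is a minimal C sketch. It is not Redis code: count_codepoints is a hypothetical helper, and it assumes the input is already valid UTF-8, so it only has to skip continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Redis. Assumes s[0..len) is valid UTF-8.
 * Continuation bytes look like 10xxxxxx, so every byte that is NOT a
 * continuation byte starts a new code point. */
size_t count_codepoints(const char *s, size_t len) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) n++;
    }
    return n;
}
```

For the bytes of "voilà" (the final character takes two bytes in UTF-8), strlen reports 6 while this helper reports 5.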

Well, no big deal, let's add a UTF8LEN command to Redis and have it return the number of codepoints. Conveniently for us, the Zig standard library already implements std.unicode.utf8CountCodepoints so it's just a matter of adding the glue necessary to interact with the Redis ecosystem.

The command table

A good place to start is to find where all the commands in Redis are registered. From there we can follow the breadcrumbs and hopefully find the implementation of an existing command to take inspiration from. An obviously good candidate for this process is STRLEN.

The Redis command table is defined in server.c and alongside the command-name to function-pointer mapping, it also features a few other details about the nature of the command that we can safely ignore for the purpose of this article.

{"strlen",strlenCommand,2,
 "read-only fast @string",
 0,NULL,1,1,1,0,0,0},

Now we know that the implementation of STRLEN (commands are case-insensitive in Redis btw) is in a function called strlenCommand and we can also use this opportunity to add a new entry right after it to register our upcoming UTF8LEN command.

{"utf8len",utf8lenCommand,2,
 "read-only fast @string",
 0,NULL,1,1,1,0,0,0},

Ok so now we have to declare utf8lenCommand in the C file (just the forward declaration, the actual implementation will be done in Zig), but we don't know the signature yet. Looking at the signature of strlenCommand will answer our questions but, for the sake of convenience, this is what you need to add at the top of server.c.

void utf8lenCommand(client *c);
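If the idea of a command table is new to you, the following is a heavily simplified C sketch of a name to function-pointer mapping with case-insensitive lookup. Every name in it (client, struct command, lookupCommandDemo, and so on) is a hypothetical stand-in, and the linear scan is only for illustration; Redis does its own lookup differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the Redis types. */
typedef struct client { long long reply; } client;
typedef void commandProc(client *c);

struct command {
    const char *name;
    commandProc *proc;
    int arity;
};

static void strlenDemo(client *c)  { c->reply = 1; }
static void utf8lenDemo(client *c) { c->reply = 2; }

static struct command commandTable[] = {
    {"strlen",  strlenDemo,  2},
    {"utf8len", utf8lenDemo, 2},
};

/* Returns 1 when both strings match, ignoring ASCII case. */
static int equalsIgnoreCase(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++; b++;
    }
    return *a == *b; /* both must have reached the terminator */
}

/* Case-insensitive lookup, returns NULL when the command is unknown. */
struct command *lookupCommandDemo(const char *name) {
    assert(name != NULL);
    size_t n = sizeof(commandTable) / sizeof(commandTable[0]);
    for (size_t i = 0; i < n; i++) {
        if (equalsIgnoreCase(commandTable[i].name, name)) return &commandTable[i];
    }
    return NULL;
}
```

Registering a new command then amounts to adding one row to the table and making sure the linker can find the function it points to, which is exactly what the forward declaration above is for.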

Looking at a Redis command implementation

Let's now take a look at the implementation of strlenCommand. If you were going in blind, you would have to either grep the entire codebase for that symbol or follow the include chain and guess where the implementation could reside.

Luckily for you, I'm your Virgilio and I can tell you that each key type in Redis has its own C file where all the related functions are implemented. To make them even easier to find, these files have names that start with t_, so the function that we're looking for can be found in src/t_string.c, at the very end of the file.

void strlenCommand(client *c) {
    robj *o;
    if ((o = lookupKeyReadOrReply(c,c->argv[1],shared.czero)) == NULL ||
        checkType(c,o,OBJ_STRING)) return;
    addReplyLongLong(c,stringObjectLen(o));
}

// from object.c
size_t stringObjectLen(robj *o) {
    serverAssertWithInfo(NULL,o,o->type == OBJ_STRING);
    if (sdsEncodedObject(o)) {
        return sdslen(o->ptr);
    } else {
        return sdigits10((long)o->ptr);
    }
}

Ok let's unpack strlenCommand. It's just a couple of lines of code but they are a bit cryptic.

The first complex line is the if statement. The gist of it is that lookupKeyReadOrReply will either be able to open the key or (as a side effect) reply with an error to the client, while the second part of the or expression will check the key type and, as a side effect, reply with an error to the client if the key is not a string. If either case is true, then strlenCommand will do an early return. This part seems a bit confusing because the first function returns NULL in the failure case, while checkType has an "error" code return logic, where anything other than zero is an error.

Anyway, if the checks pass (could access the key & the key is of the right type), then we reply to the client with the length in bytes.
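Here is a small C sketch of that pattern in isolation, with the two return conventions side by side. All the names (object, lookupDemo, checkTypeDemo, commandDemo) are hypothetical stand-ins and none of the Redis reply logic is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins that mimic the two conventions described above. */
typedef struct { int type; } object;
enum { TYPE_STRING = 0, TYPE_LIST = 1 };

/* Pointer convention: NULL means failure (here: the key does not exist). */
object *lookupDemo(object *maybe) { return maybe; }

/* Error-code convention: non-zero means failure (here: wrong type). */
int checkTypeDemo(const object *o, int wanted) {
    assert(o != NULL);
    return o->type != wanted;
}

/* Returns 1 when the command may proceed, 0 when it returns early. */
int commandDemo(object *key) {
    object *o;
    if ((o = lookupDemo(key)) == NULL ||   /* short-circuits on a missing key */
        checkTypeDemo(o, TYPE_STRING)) return 0;
    return 1;
}
```

Because || short-circuits, the type check never runs on a NULL object, which is what makes it safe to combine both checks in a single condition.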

This is where another quirk of Redis shows up because Redis doesn't have any built-in numeric type.

If you want to store a number in Redis, be it an int or a float, you must use a string key, and in fact there are commands that operate exclusively on string keys that contain numbers, like INCR. Does it mean that those commands will parse a number out of a string every time you need to operate on it? Not really, the string object struct has a flag that tells you whether its ptr field points to an array of bytes or if it's not really a pointer but rather the number itself. This is what stringObjectLen is doing when invoking sdsEncodedObject.

Keep this point in mind, because we'll have to account for numbers when writing our Zig code later.
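The following C sketch shows the general trick of reusing a pointer field to hold an integer. It is a simplification under stated assumptions: demoObj, ENC_RAW, ENC_INT and demoLen are hypothetical names, the layout is not the real robj, and plain C strings stand in for the string type Redis uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified object: not the real robj layout. */
enum { ENC_RAW = 0, ENC_INT = 1 };
typedef struct {
    unsigned encoding;
    void *ptr; /* address of the bytes when ENC_RAW, the integer itself when ENC_INT */
} demoObj;

demoObj makeRaw(const char *s) { demoObj o = { ENC_RAW, (void *)s }; return o; }
demoObj makeInt(long v) { demoObj o = { ENC_INT, (void *)(intptr_t)v }; return o; }

/* Length of the value as it would be printed. */
size_t demoLen(const demoObj *o) {
    assert(o != NULL);
    if (o->encoding == ENC_RAW) return strlen((const char *)o->ptr);
    /* Never dereference ptr here: it holds a number, not an address. */
    char buf[32];
    return (size_t)snprintf(buf, sizeof buf, "%ld", (long)(intptr_t)o->ptr);
}
```

The important part is the branch on the encoding flag: code that blindly treats ptr as an address would read from an arbitrary location whenever the value happens to be a number.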

Take off every Zig!

We learned the basics of how commands are implemented in Redis, we registered our new command, and we also left a forward declaration for it in server.c. It's finally time to write some Zig code!

To respect the conventions of the project I'll name this file t_string_utf8.zig. Before we start writing code, let's add it to the compilation process.

Add a Zig compilation unit

Zig can export functions and definitions compatible with the C ABI. This means that we can compile Zig as a separate compilation unit and then have the linker resolve all symbols as it normally happens in a C/C++ project.

To make things easy in our case we'll just compile our code as a static library and then add it to the main redis_server build step (refer to the previous article for more context).

const t_string_utf8 = b.addStaticLibrary("t_string_utf8", "src/t_string_utf8.zig");
t_string_utf8.setTarget(target);
t_string_utf8.setBuildMode(mode);
t_string_utf8.linkLibC();
t_string_utf8.addIncludeDir("src");
t_string_utf8.addIncludeDir("deps/hiredis");
t_string_utf8.addIncludeDir("deps/lua/src");

// Add where the `redis_server` step is being defined
redis_server.linkLibrary(t_string_utf8);

The Zig implementation

First we need to be able to access the definitions in server.h, since it exposes declarations for all the functions that we're going to need, like checkType.

const redis = @cImport({
    @cInclude("server.h");
});

Then, we need to re-implement the function and finally add our twist (count codepoints instead of bytes). Let's start by re-implementing the original function.

const std = @import("std");
const redis = @cImport({
    @cInclude("server.h");
});

export fn utf8lenCommand(c: *redis.client) void {
    var o: *redis.robj = redis.lookupKeyReadOrReply(c, c.argv[1], redis.shared.czero) orelse return;
    if (redis.checkType(c, o, redis.OBJ_STRING) != 0) return;

    // Get the strlen
    const len = redis.stringObjectLen(o);
    redis.addReplyLongLong(c, @intCast(i64, len));
}

This function doesn't do anything interesting yet, but it's a good checkpoint to compile and test that everything works.

Run zig build to compile everything, then launch the Redis server by running: ./zig-out/bin/redis-server.

In another tab you can launch ./zig-out/bin/redis-cli, which should allow us to send our new command to Redis:

> set foo "Hello World!"
OK
> strlen foo
12
> utf8len foo
12

Add UTF8 support

To add our new spin to the function we need to differentiate between two cases:

When the string key points to bytes
When the string key is a number, so there are no bytes to point to

This is important because we're going to crash the server if we try to dereference a pointer that encodes a number.

We already saw that o.ptr is the pointer to bytes (or number), and by inspecting stringObjectLen() a bit more closely you can see that o.encoding tells you in which of the two cases we are.

This means that the following code would work if not for a current limitation of the cImport function.

export fn utf8lenCommand(c: *redis.client) void {
    var o: *redis.robj = redis.lookupKeyReadOrReply(c, c.argv[1], redis.shared.czero) orelse return;
    if (redis.checkType(c, o, redis.OBJ_STRING) != 0) return;

    // Get the strlen
    const len = redis.stringObjectLen(o);

    // If the key encodes a number we're done.
    if (o.encoding == redis.OBJ_ENCODING_INT) {
        redis.addReplyLongLong(c, @intCast(i64, len));
        return;
    }

    // Not a number! Grab the bytes and count the codepoints.
    const str = @ptrCast([*]u8, o.ptr)[0..len];
    const cps = std.unicode.utf8CountCodepoints(str) catch {
        redis.addReplyError(c, "this aint utf8 chief");
        return;
    };

    redis.addReplyLongLong(c, @intCast(i64, cps));
}

If we try to compile now, this is the error we get:

./src/t_string_utf8.zig:15:10: error: no member named 'encoding' in opaque type '.cimport:3:15.struct_redisObject'
    if (o.encoding == redis.OBJ_ENCODING_INT) {

Let's see how to solve this final problem.

Problems related to C header files

When you cImport a header file, Zig will try to translate its contents into a Zig equivalent (which is a different process than linking to a C compilation unit btw). This same feature is also available from the command line with zig translate-c, which is also useful to diagnose problems with the cImport system, like we are encountering right now.

If we run translate-c on the header file we discover that unfortunately the definition of the robj (Redis Object) struct was translated to an opaque type because Zig couldn't parse the bitfield specifiers. At the moment translate-c has a short list of unsupported C features that are progressively getting tackled, but alas we'll need to find a workaround for now.

Here's the C definition of robj, taken from server.h:

typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    int refcount;
    void *ptr;
} robj;

The most general workaround is to do manually what translate-c couldn't do, which is to just write in Zig a struct definition compatible with the C one. The same can be also done with function declarations and in fact we could also make do without importing server.h at all, and just write down manually the extern definitions of all the needed symbols. That said, for this case we can do something less tedious and brittle than to have a second definition of the same struct: we can make a couple getter functions in server.c and use them from Zig.

Getting a hand from C

Since we're having trouble reaching into robj, let's just add a couple C functions that can do that for us.

In server.c add:

void* getPtrFromObj(robj* r) {return r->ptr;}
unsigned getEncodingFromObj(robj* r) {return r->encoding;}

Then in server.h add the corresponding forward declarations:

void* getPtrFromObj(robj*);
unsigned getEncodingFromObj(robj*);

Note for clarity: we can't define our functions directly in server.h because we are using cImport to translate server.h into Zig code, and a function body that reaches into the bitfields would still fail to translate. This way we only provide the forward declaration to Zig and let the function be resolved at link time.

Finally, in case you're worried about the performance implications of having getter functions, don't worry because LTO (Link-Time Optimization) works across language boundaries.
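As a standalone illustration of why getters sidestep the problem, here is a C sketch with a struct shaped like robj. The names demoObject, demoGetPtr and demoGetEncoding are hypothetical, and the value of DEMO_LRU_BITS is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_LRU_BITS 24 /* assumption for this sketch */

/* Same shape as the struct shown earlier, bitfields included. */
typedef struct demoObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:DEMO_LRU_BITS;
    int refcount;
    void *ptr;
} demoObject;

/* Accessors compiled as C. A caller in another language only needs the
 * two prototypes and can treat demoObject as an opaque type, so it never
 * has to understand the bitfield layout. */
void *demoGetPtr(demoObject *o) {
    assert(o != NULL);
    return o->ptr;
}

unsigned demoGetEncoding(demoObject *o) {
    assert(o != NULL);
    return o->encoding;
}
```

The C compiler stays the single source of truth for the layout, which is what makes this less brittle than maintaining a second, hand-written definition of the struct.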

The working code

After using our new getter functions we are finally able to achieve a functioning implementation written in Zig.

export fn utf8lenCommand(c: *redis.client) void {
    var o: *redis.robj = redis.lookupKeyReadOrReply(c, c.argv[1], redis.shared.czero) orelse return;
    if (redis.checkType(c, o, redis.OBJ_STRING) != 0) return;

    // Get the strlen
    const len = redis.stringObjectLen(o);

    // If the key encodes a number we're done.
    if (redis.getEncodingFromObj(o) == redis.OBJ_ENCODING_INT) {
        redis.addReplyLongLong(c, @intCast(i64, len));
        return;
    }

    // Not a number! Grab the bytes and count the codepoints.
    const str = @ptrCast([*]u8, redis.getPtrFromObj(o))[0..len];
    const cps = std.unicode.utf8CountCodepoints(str) catch {
        redis.addReplyError(c, "this aint utf8 chief");
        return;
    };

    redis.addReplyLongLong(c, @intCast(i64, cps));
}

Now, after rebuilding the project, you should be able to see the new behavior of UTF8LEN.

> set foo "voilà"
OK
> strlen foo
6
> utf8len foo
5

You can find the full listing on GitHub

In conclusion

Whew, this time the work was a bit more intense, but that's the case when it comes to real projects. I hope I was able to give you an interesting window into Redis without introducing unnecessary concepts.

As you can see, adding Zig to a C project doesn't automagically resolve all complexity, but it's mostly seamless and, given the way C/Zig interop works, you can easily find a workaround when you encounter roadblocks. On top of that, translate-c is being improved as usage grows, so I'm sure that soon enough the missing C syntax will be covered.

If you like where Zig is going, take a look at "The Road to Zig 1.0" by Andrew, check out Zig Learn, and join a Zig community!

Finally, if you want to help us reach 1.0 faster, consider donating to the Zig Software Foundation to allow us to hire more full-time contributors.

Extra credit

Want more? Here are a couple things to think about!

Proper errors!

When coding this live on stream, Andrew added better error reporting by leveraging the fact that std.unicode.utf8CountCodepoints has a precise set of possible errors.

const cps = std.unicode.utf8CountCodepoints(str) catch |err| return switch (err) {
    error.Utf8ExpectedContinuation => redis.addReplyError(c, "Expected UTF-8 Continuation"),
    error.Utf8OverlongEncoding => redis.addReplyError(c, "Overlong UTF-8 Encoding"),
    error.Utf8EncodesSurrogateHalf => redis.addReplyError(c, "UTF-8 Encodes Surrogate Half"),
    error.Utf8CodepointTooLarge => redis.addReplyError(c, "UTF-8 Codepoint too large"),
    error.TruncatedInput => redis.addReplyError(c, "UTF-8 Truncated Input"),
    error.Utf8InvalidStartByte => redis.addReplyError(c, "Invalid UTF-8 Start Byte"),
};
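To get a feel for what each of those error categories means at the byte level, here is a rough C sketch of a validator that reports the first problem it finds. It is not the Zig standard library's algorithm: u8Status and u8Check are hypothetical names, and the mapping between these categories and Zig's error names is my assumption, so the two may disagree on some inputs.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    U8_OK, U8_INVALID_START, U8_TRUNCATED, U8_EXPECTED_CONTINUATION,
    U8_OVERLONG, U8_SURROGATE_HALF, U8_TOO_LARGE
} u8Status;

/* Hypothetical validator: classifies the first problem found in s[0..len). */
u8Status u8Check(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    /* Smallest value that legitimately needs a sequence of n bytes. */
    static const unsigned long min[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;          /* sequence length announced by the first byte */
        unsigned long cp;  /* decoded value */
        if (b < 0x80)                { n = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }
        else return U8_INVALID_START;
        if (len - i < n) return U8_TRUNCATED;
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return U8_EXPECTED_CONTINUATION;
            cp = (cp << 6) | (s[i + k] & 0x3Fu);
        }
        if (cp < min[n]) return U8_OVERLONG;
        if (cp >= 0xD800 && cp <= 0xDFFF) return U8_SURROGATE_HALF;
        if (cp > 0x10FFFF) return U8_TOO_LARGE;
        i += n;
    }
    return U8_OK;
}
```

With this sketch, the two bytes 0xC3 0x28 are reported as a missing continuation byte, because 0xC3 announces a two-byte sequence and 0x28 is plain ASCII.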

Here's how you can trigger some of those errors:

> set foo "\xc3\x28"
OK
> utf8len foo
(error) ERR Expected UTF-8 Continuation

Here are a few other values for foo that trigger different errors:

"\xa0\xa1"
"\xc0\x80"
"\xf4\x90\x80\x80"
"\xed\xbf\xbf"

Codepoints? 🤮🤮🤮

Counting UTF8 codepoints is nowhere near enough if you're dealing with real-world text. Multiple codepoints can combine to create new symbols, like the astronaut emoji which is the combination of 3 codepoints (person, zero width joiner, rocket), just to name one problem.
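A quick C sketch makes the gap visible. The helper name codepoints is hypothetical and it assumes valid UTF-8 input; the byte string is the UTF-8 encoding of the three codepoints mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in valid UTF-8 by skipping
 * continuation bytes (10xxxxxx). */
size_t codepoints(const char *s, size_t len) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80) n++;
    return n;
}

/* U+1F9D1 (person), U+200D (zero width joiner), U+1F680 (rocket). */
static const char astronaut[] =
    "\xF0\x9F\xA7\x91" "\xE2\x80\x8D" "\xF0\x9F\x9A\x80";
```

The string is 11 bytes long and contains 3 codepoints, yet a user sees a single symbol. Neither a byte count nor a codepoint count matches what the reader perceives.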

Ziglyph is a solution to this problem and, once the transition to a self-hosted implementation of the Zig compiler is completed, Zig will also bundle a package manager, making Zig a complete solution for fetching dependencies, building, and extending C/C++ projects. It would be interesting at that point to hook Ziglyph (or any other Zig package) to Redis.

Reproducibility footnote

Zig 0.8.1, Redis commit be6ce8a.

Top comments (5)

sekoyo • Jan 24 '22 • Edited

Thank you, though it would be great if there was a paired down hello world example (with an example of linking still), and redis is C I don't see any example of C++ compilation in github.com/kristoff-it/redis/blob/... ?

Edit, all good, I didn't realize addCSourceFiles is also for CPP files and then the only other step is linkLibCpp:

const std = @import("std");

pub fn build(b: *std.build.Builder) void {
    const target = b.standardTargetOptions(.{});
    const mode = b.standardReleaseOptions();

    const my_app = b.addExecutable("my_app", null);
    my_app.setTarget(target);
    my_app.setBuildMode(mode);
    my_app.install();
    // Add your .addIncludeDir or .linkLibrary here
    my_app.linkLibCpp();
    my_app.addCSourceFiles(&.{
        "src/main.cc",
    }, &.{
        "-std=c++20",
    });
}

Loris Cro • Jan 25 '22

Yes, jemalloc is the depedency that makes Redis technically also a C++ project, but you're right, we didn't build it once we started doing cross-compilation (and instead defaulted to vanilla malloc), but we did compile it in the beginning when just replacing clang with zig cc.

spirobel • Jun 14 '22

is it possible to call cpp directly from zig or do I need to make a C wrapper for cpp code and then call that from zig?

Loris Cro • Jun 14 '22

you have to make a C wrapper