dude_the_builder

Posted on Sep 13, 2021

Unicode Basics in Zig

#unicode #string #character #text

Characters

The concept of a character is elusively complex. Let's go step by step.

const debug = @import("std").debug;

fn processAscii(byte: u8) void { _ = byte; }
fn processUnicode(code_point: u21) void { _ = code_point; }

// Output: comptime_int
debug.print("{}\n", .{@TypeOf('e')});

processAscii('e');
processUnicode('e');

Contrary to what one might expect based on other programming languages, in Zig, that 'e' character literal has type comptime_int and can be coerced into integer types such as u8 for ASCII-only text processing or u21 to handle Unicode code points.
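
To make the coercion a bit more concrete, here's a minimal sketch. The helper isAsciiCodePoint is a hypothetical name made up for this illustration, not something from the standard library, and the commented-out line is only meant to hint at what the compiler rejects.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std): true if the code point
/// falls within the 7-bit ASCII range.
fn isAsciiCodePoint(code_point: u21) bool {
    return code_point < 0x80;
}

pub fn main() void {
    // The same literal coerces to either integer type.
    const byte: u8 = 'e';
    const code_point: u21 = 'e';
    std.debug.print("{d} {d}\n", .{ byte, code_point });

    // A literal outside the u8 range still fits in a u21.
    const bolt: u21 = '⚡'; // U+26A1
    std.debug.print("0x{x} ascii? {}\n", .{ bolt, isAsciiCodePoint(bolt) });

    // const bad: u8 = '⚡'; // would not compile: the value doesn't fit in a u8
}
```

Since the literal is a comptime_int, the range check happens at compile time, so an out-of-range coercion never makes it into a running program.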

Note: I'm not going down the endless rabbit hole discussion of what characters are and what they aren't. Let's stick to the Zig-related topics and we should be fine. 😼

ASCII in Zig

To work with single-byte (u8) ASCII characters, Zig offers the std.ascii namespace in which you can find many useful functions.

const ascii = @import("std").ascii;

_ = ascii.isAlNum('A');
_ = ascii.isAlpha('A');
_ = ascii.isDigit('3');
// ...and many more.
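
As a rough idea of how these functions fit into byte-by-byte processing, here's a small sketch. countDigits is a hypothetical helper written for this example; only std.ascii.isDigit comes from the standard library.

```zig
const std = @import("std");

/// Hypothetical helper: counts the ASCII digits in a byte slice.
fn countDigits(bytes: []const u8) usize {
    var count: usize = 0;
    for (bytes) |byte| {
        if (std.ascii.isDigit(byte)) count += 1;
    }
    return count;
}

pub fn main() void {
    // Three digits: 0, 8 and 1.
    std.debug.print("{d}\n", .{countDigits("Zig 0.8.1")});
}
```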

Unicode Code Points

Unicode assigns a unique integer code to characters, marks, symbols, emoji, etc. This code is what's known as a code point. These code points are then encoded into what are called code units using a Unicode Encoding Form. The most widely used encoding form today is UTF-8, in which each code point is encoded into one to four code units of 8 bits each.

UTF-8 was designed in such a way that the first range of code points can be encoded with just 1 code unit (1 byte), and those code units map directly to the ASCII encoding. This means that UTF-8 encoded text consisting of only ASCII characters is practically the same as ASCII encoded text; a great idea allowing reuse of decades of existing code made to process ASCII. But remember, even if they're just 1 byte, they're still code points in terms of Unicode.
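
A quick way to see those ranges is to ask std.unicode how many bytes a given code point needs. This is just a sketch; the four sample code points are my own picks, and the function name is the one I know from std.unicode, so double-check it against your compiler version.

```zig
const std = @import("std");
const unicode = std.unicode;

pub fn main() !void {
    // One sample from each UTF-8 length class (arbitrary picks).
    const samples = [_]u21{ 'A', '\u{e9}', '⚡', '😼' };

    for (samples) |code_point| {
        const len = try unicode.utf8CodepointSequenceLength(code_point);
        std.debug.print("U+{X} needs {d} byte(s)\n", .{ code_point, len });
    }
}
```

Only the first sample, which is in the ASCII range, fits in a single byte.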

Code Points in Zig

All Zig source code is UTF-8 encoded text, and the standard library std.unicode namespace provides some useful code point processing functions.

const unicode = @import("std").unicode;

// A UTF-8 encoded code point can be from 1 to 4 bytes in length.
var code_point_bytes: [4]u8 = undefined;

// Returns the number of bytes written to the array.
const bytes_encoded = try unicode.utf8Encode('⚡', &code_point_bytes);

There's also a utf8Decode function and functions to convert from UTF-8 to UTF-16 and vice versa. std.unicode also has some utilities to iterate over code points in a string, but then again: what is a string in Zig?
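
Here's a minimal round-trip sketch pairing utf8Encode with utf8Decode. roundTrip is a hypothetical helper for illustration, and names in std.unicode have shifted between Zig releases, so treat this as a sketch rather than the one true way.

```zig
const std = @import("std");
const unicode = std.unicode;

/// Hypothetical helper: encodes a code point to UTF-8 and decodes it back.
fn roundTrip(code_point: u21) !u21 {
    var buf: [4]u8 = undefined;
    const len = try unicode.utf8Encode(code_point, &buf);
    // utf8Decode expects exactly the bytes of one encoded code point.
    return unicode.utf8Decode(buf[0..len]);
}

pub fn main() !void {
    var buf: [4]u8 = undefined;
    const len = try unicode.utf8Encode('⚡', &buf);

    // Show the individual code units (bytes) that were written.
    for (buf[0..len]) |byte| std.debug.print("{x} ", .{byte});

    std.debug.print("-> U+{X}\n", .{try roundTrip('⚡')});
}
```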

Strings in Zig

As @kristoff explains in "What's a String Literal in Zig?" (Loris Cro, Aug 8), string literals in Zig are just pointers to null-terminated arrays of bytes tucked away in the executable file of your program. The syntax

const name = "Jose";

is syntactic sugar, creating the array and a pointer to it behind the scenes, but in the end, it's just a sequence of bytes. Given that these bytes are in fact UTF-8 encoded code points, we can use Unicode processing functions to work with Zig strings. For example, let's iterate over the code points of a string:

const std = @import("std");
const unicode = std.unicode;

const name = "José";
var code_point_iterator = (try unicode.Utf8View.init(name)).iterator();

while (code_point_iterator.nextCodepoint()) |code_point| {
    std.debug.print("0x{x} is {u} \n", .{ code_point, code_point });
}

This can be very useful when dealing with functions that expect a Unicode code point to do their work. In Zig, functions that work with code points use the u21 type, since only 21 bits are required to represent the entire Unicode code space (versus the 32-bit types other languages use).
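
One thing worth seeing for yourself is that a string's byte length and its code point count are different numbers. A small sketch, assuming std.unicode.utf8CountCodepoints is available in your Zig version (the \u{e9} escape is used only so the example doesn't depend on how your editor stores "é"):

```zig
const std = @import("std");

pub fn main() !void {
    const name = "Jos\u{e9}"; // "José" with a precomposed é (U+00E9)

    // .len counts bytes; utf8CountCodepoints counts code points.
    const code_points = try std.unicode.utf8CountCodepoints(name);
    std.debug.print("{d} bytes, {d} code points\n", .{ name.len, code_points });
}
```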

Code Points Are Not Necessarily Characters

One problem that can arise when iterating strings one code point at a time is inadvertently decomposing a character that's made up of multiple code points. There has been a misconception, probably rooted in the ASCII way of doing things, that a code point is a character, and indeed that can be true but is not always the case.

const std = @import("std");
const unicode = std.unicode;

fn codePointInfo(str: []const u8) !void {
    std.debug.print("Code points for: {s} \n", .{str});

    var iter = (try unicode.Utf8View.init(str)).iterator();

    while (iter.nextCodepoint()) |cp| {
        std.debug.print("0x{x} is {u} \n", .{ cp, cp });
    }
}

try codePointInfo("é");
try codePointInfo("\u{65}\u{301}");

// Output:
// Code points for: é
// 0xe9 is é
// Code points for: é
// 0x65 is e
// 0x301 is

Note in the output that both string literals are displayed exactly the same, as the single character "é", but one string contains a single code point and the other contains two. The two-code point version of "é" is an example of a character composed of a base character, 0x65: the letter 'e', and a combining mark, 0x301: combining acute accent. There are many more multi-code point characters like this, such as Korean letters, country flags, and modified emoji, that can't be handled properly one code point at a time. To overcome this, Unicode provides many algorithms that can process code point sequences and produce higher-level abstractions such as grapheme clusters, words, and sentences. More on that a bit later on.
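
To underline how different the two spellings of "é" really are under the hood, here's a sketch that compares them by bytes and by code points. countCodePoints is a hypothetical wrapper added for readability; the escapes are the same two sequences discussed above.

```zig
const std = @import("std");

/// Hypothetical wrapper around std.unicode.utf8CountCodepoints.
fn countCodePoints(str: []const u8) !usize {
    return std.unicode.utf8CountCodepoints(str);
}

pub fn main() !void {
    const precomposed = "\u{e9}"; // single code point
    const decomposed = "\u{65}\u{301}"; // 'e' + combining acute accent

    std.debug.print("bytes: {d} vs {d}\n", .{ precomposed.len, decomposed.len });
    std.debug.print("code points: {d} vs {d}\n", .{
        try countCodePoints(precomposed),
        try countCodePoints(decomposed),
    });

    // Same character on screen, different bytes in memory.
    std.debug.print("byte-equal: {}\n", .{std.mem.eql(u8, precomposed, decomposed)});
}
```

A plain byte comparison treats these as different strings, which is exactly the kind of surprise the higher-level Unicode algorithms exist to deal with.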

But Bytes are Still Useful!

Interestingly, since at the lowest level we're still dealing with just sequences of bytes, if all we need is to match those bytes one-to-one, then we don't need the higher-level concepts. The std.mem namespace has many useful functions that can work on any type of sequence, but are particularly useful for bytes, and who says those bytes can't be the UTF-8 encoded bytes of a string?

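const mem = @import("std").mem;

_ = mem.eql(u8, "⚡ Zig!", "⚡ Zig!");
// The trim functions strip any byte found in the last argument (a set of
// bytes, not a substring), so "⚡" here means its three UTF-8 bytes.
_ = mem.trimLeft(u8, "⚡ Zig!", "⚡");
_ = mem.trimRight(u8, "⚡ Zig!", "!");
_ = mem.trim(u8, " ⚡ Zig! ", " ");
_ = mem.indexOf(u8, "⚡ Zig!", "Z");
_ = mem.split(u8, "⚡ Zig!", " ");
// ...and many more.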
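
One caveat when going the byte route: the indexes you get back are byte offsets, not code point positions. Here's a small sketch of that, using only std.mem.indexOf and std.mem.startsWith; the sample text is the same one used above.

```zig
const std = @import("std");
const mem = std.mem;

pub fn main() void {
    const text = "⚡ Zig!";

    // "⚡" takes 3 bytes in UTF-8, so "Zig" starts at byte 4, not at 2.
    if (mem.indexOf(u8, text, "Zig")) |byte_index| {
        std.debug.print("found at byte {d}: {s}\n", .{ byte_index, text[byte_index..] });
    }

    // Prefix matching works fine on the raw bytes.
    std.debug.print("starts with bolt: {}\n", .{mem.startsWith(u8, text, "⚡")});
}
```

Slicing at an offset returned by a search for valid UTF-8 is safe; slicing at an arbitrary byte index can land in the middle of a code point.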

Is this All I Can Do?

To recap, there are two important takeaways from this first post in the series:

A byte can be a character, but many characters require more than 1 byte.
A code point can be a character, but many characters require more than 1 code point.

We have seen some of the useful tools Zig provides to process UTF-8 encoded Unicode text. Beyond these tools, there are many more higher-level abstractions that can be worked with using third-party libraries. In the next post, we'll look at Ziglyph, the library I've been developing to deal with many of the requirements of the Unicode Standard and text processing in general.

Until then, try out some of the built-in tools presented in this post, and stay away from the ASCII Kool-Aid! 😹

Oldest comments (3)

Loris Cro • Sep 13 '21

Yay, thank you very much for this post, definitely needed and it's also giving me inspiration for one or two related posts :^)

Zooce • Sep 14 '21

Excellent! Straight to the point, no fluff, solid info. Thank you!

Max • Mar 23 '23

Great post, thank you!

Also found a typo:

- "Note in he output that both string literals are displayed"
           ^
+ "Note in the [...]"

Zig NEWS