Characters
The concept of a character is elusively complex. Let's go step by step.
const debug = @import("std").debug;

fn processAscii(byte: u8) void { _ = byte; }
fn processUnicode(code_point: u21) void { _ = code_point; }

pub fn main() void {
    // Prints: comptime_int
    debug.print("{}\n", .{@TypeOf('e')});
    processAscii('e');
    processUnicode('e');
}
Contrary to what one might expect based on other programming languages, in Zig, that 'e' character literal has type comptime_int and can be coerced into integer types such as u8 for ASCII-only text processing or u21 to handle Unicode code points.
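To make the coercion concrete, here is a minimal sketch. The variable names are made up for illustration; the only assumption is that the source file is saved as UTF-8, which Zig requires anyway.

```zig
const std = @import("std");

pub fn main() void {
    // The same kind of literal coerces to whichever integer type can hold it.
    const ascii_byte: u8 = 'e'; // U+0065 fits in 8 bits
    const code_point: u21 = '⚡'; // U+26A1 needs more than 8 bits

    // Uncommenting the next line should fail to compile, because
    // 0x26A1 cannot be represented in a u8:
    // const too_small: u8 = '⚡';

    // Prints: 101 9889
    std.debug.print("{d} {d}\n", .{ ascii_byte, code_point });
}
```

The compiler checks the coercion at compile time, so picking u8 for text that is not strictly ASCII is caught early rather than silently truncated.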
Note: I'm not going down the endless rabbit hole discussion of what characters are and what they aren't. Let's stick to the Zig-related topics and we should be fine. 😼
ASCII in Zig
To work with single-byte (u8) ASCII characters, Zig offers the std.ascii namespace in which you can find many useful functions.
const ascii = @import("std").ascii;
// Named isAlNum and isAlpha in older Zig releases.
_ = ascii.isAlphanumeric('A');
_ = ascii.isAlphabetic('A');
_ = ascii.isDigit('3');
// ...and many more.
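As a rough sketch of how these functions combine in practice, the snippet below upper-cases an ASCII string and counts its digits. The helper countDigits is hypothetical and written only for this example; std.ascii.isDigit and std.ascii.toUpper are the standard library calls it relies on.

```zig
const std = @import("std");

// Hypothetical helper: counts the ASCII digits in a byte slice.
fn countDigits(bytes: []const u8) usize {
    var count: usize = 0;
    for (bytes) |byte| {
        if (std.ascii.isDigit(byte)) count += 1;
    }
    return count;
}

pub fn main() void {
    const text = "Zig 2024!";

    // Upper-case the text one byte at a time into a fixed-size buffer.
    var upper: [text.len]u8 = undefined;
    var i: usize = 0;
    while (i < text.len) : (i += 1) {
        upper[i] = std.ascii.toUpper(text[i]);
    }

    // Prints: ZIG 2024! contains 4 digits
    std.debug.print("{s} contains {d} digits\n", .{ upper[0..], countDigits(text) });
}
```

Everything here works on one byte at a time, which is only safe because the input is known to be pure ASCII.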
Unicode Code Points
Unicode assigns a unique integer code to characters, marks, symbols, emoji, etc. This code is what's known as a code point. These code points are then encoded into what are called code units using a Unicode Encoding Form. The most widely used encoding form today is UTF-8, in which each code point is encoded into one to four code units of 8 bits each.
UTF-8 was designed in such a way that the first range of code points can be encoded with just 1 code unit (1 byte), and those code units map directly to the ASCII encoding. This means that UTF-8 encoded text consisting of only ASCII characters is byte-for-byte identical to ASCII encoded text; a great idea that allows reuse of decades of existing code made to process ASCII. But remember, even if they're just 1 byte, they're still code points in terms of Unicode.
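A small sketch can show how the encoded length grows with the code point value. It uses std.unicode.utf8CodepointSequenceLength from the standard library; the four sample code points are arbitrary picks, one from each UTF-8 length class.

```zig
const std = @import("std");

pub fn main() !void {
    // One sample from each length class: 'A', 'é', '⚡' and '😼'.
    const samples = [_]u21{ 'A', '\u{E9}', '\u{26A1}', '\u{1F63C}' };

    for (samples) |code_point| {
        // Number of bytes UTF-8 needs to encode this code point.
        const len = try std.unicode.utf8CodepointSequenceLength(code_point);
        std.debug.print("U+{X} -> {d} byte(s)\n", .{ code_point, len });
    }
}
```

The first sample is plain ASCII and takes a single byte, while the others take 2, 3, and 4 bytes respectively.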
Code Points in Zig
All Zig source code is UTF-8 encoded text, and the standard library's std.unicode namespace provides some useful code point processing functions.
const unicode = @import("std").unicode;

pub fn main() !void {
    // A UTF-8 encoded code point is 1 to 4 bytes long.
    var code_point_bytes: [4]u8 = undefined;
    // Returns the number of bytes written to the array.
    const bytes_encoded = try unicode.utf8Encode('⚡', &code_point_bytes);
    _ = bytes_encoded;
}
There's also a utf8Decode function and functions to convert from UTF-8 to UTF-16 and vice versa. std.unicode also has some utilities to iterate over code points in a string, but then again: What is a string in Zig?
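As an illustration of the encode and decode pair, here is a minimal round trip. It assumes utf8Decode is available in your Zig version with the signature that takes a byte slice and returns a u21; recent releases may flag it as deprecated.

```zig
const std = @import("std");

pub fn main() !void {
    var buffer: [4]u8 = undefined;

    // Encode the code point into the buffer; len is the byte count written.
    const len = try std.unicode.utf8Encode('⚡', &buffer);
    const encoded = buffer[0..len];

    // Decode exactly those bytes back into a code point.
    const decoded = try std.unicode.utf8Decode(encoded);

    // Prints: 3 bytes, decoded U+26A1
    std.debug.print("{d} bytes, decoded U+{X}\n", .{ encoded.len, decoded });
}
```

Note that the slice handed to the decoder has to contain exactly one encoded code point, which is why the buffer is trimmed to the returned length first.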
Strings in Zig
As @kristoff explains, string literals in Zig are just pointers to null-terminated arrays of bytes tucked away in the executable file of your program. The syntax
const name = "Jose";
is syntactic sugar, creating the array and a pointer to it behind the scenes, but in the end, it's just a sequence of bytes. Given that these bytes are in fact UTF-8 encoded code points, we can use Unicode processing functions to work with Zig strings. For example, let's iterate over the code points of a string:
const std = @import("std");
const unicode = std.unicode;

pub fn main() !void {
    const name = "José";
    var code_point_iterator = (try unicode.Utf8View.init(name)).iterator();
    while (code_point_iterator.nextCodepoint()) |code_point| {
        std.debug.print("0x{x} is {u} \n", .{ code_point, code_point });
    }
}
This can be very useful when dealing with functions that expect a Unicode code point to do their work. In Zig, functions that work with code points use the u21 type, since in reality only 21 bits are required to represent the entire Unicode code space (versus the 32-bit types other languages use).
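Two claims from this section can be checked directly: a string literal is a pointer to a null-terminated byte array, and 21 bits are enough for every code point. The following is a sketch only; the constant names are made up for the example, and the accented letter is written as an escape so the byte count does not depend on how an editor saves it.

```zig
const std = @import("std");

pub fn main() void {
    const name = "Jos\u{E9}"; // "José", with 'é' written as an escape

    // A string literal is a pointer to a null-terminated byte array.
    // Prints: *const [5:0]u8
    std.debug.print("{s}\n", .{@typeName(@TypeOf(name))});

    // len counts bytes, not code points: 'é' takes 2 bytes in UTF-8.
    // Prints: len = 5, sentinel = 0
    std.debug.print("len = {d}, sentinel = {d}\n", .{ name.len, name[name.len] });

    // The largest Unicode code point does not fit in 20 bits but fits in 21.
    const max_code_point = 0x10FFFF;
    const too_big_for_u20 = std.math.maxInt(u20) < max_code_point;
    const fits_in_u21 = std.math.maxInt(u21) >= max_code_point;

    // Prints: true true
    std.debug.print("{} {}\n", .{ too_big_for_u20, fits_in_u21 });
}
```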
Code Points Are Not Necessarily Characters
One problem that can arise when iterating strings one code point at a time is inadvertently decomposing a character that's made up of multiple code points. There has been a misconception, probably rooted in the ASCII way of doing things, that a code point is a character, and indeed that can be true but is not always the case.
const std = @import("std");
const unicode = std.unicode;

fn codePointInfo(str: []const u8) !void {
    std.debug.print("Code points for: {s} \n", .{str});
    var iter = (try unicode.Utf8View.init(str)).iterator();
    while (iter.nextCodepoint()) |cp| {
        std.debug.print("0x{x} is {u} \n", .{ cp, cp });
    }
}

pub fn main() !void {
    try codePointInfo("é");
    try codePointInfo("\u{65}\u{301}");
}

// Output:
// Code points for: é
// 0xe9 is é
// Code points for: é
// 0x65 is e
// 0x301 is
Note in the output that both string literals are displayed exactly the same as the single character "é", but one string contains a single code point and the other contains two. This second, two-code-point version of the character "é" is an example of a character composed of a base character, 0x65: the letter 'e', and a combining mark, 0x301: combining acute accent. There are many more multi-code-point characters like this, such as Korean letters, country flags, and modified emoji, that can't be properly handled one code point at a time. To overcome this, Unicode provides many algorithms that can process code point sequences and produce higher-level abstractions such as grapheme clusters, words, and sentences. More on that a bit later on.
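To put numbers on the difference between the two spellings, here is a small sketch that compares their byte and code point counts. The helper countCodePoints is hypothetical; it simply walks the same Utf8View iterator used earlier.

```zig
const std = @import("std");

// Hypothetical helper: counts code points by walking a Utf8View iterator.
fn countCodePoints(str: []const u8) !usize {
    var iter = (try std.unicode.Utf8View.init(str)).iterator();
    var count: usize = 0;
    while (iter.nextCodepoint()) |_| count += 1;
    return count;
}

pub fn main() !void {
    const precomposed = "\u{E9}"; // single code point
    const decomposed = "\u{65}\u{301}"; // 'e' plus combining acute accent

    // Prints: precomposed: 2 bytes, 1 code points
    std.debug.print("precomposed: {d} bytes, {d} code points\n", .{
        precomposed.len, try countCodePoints(precomposed),
    });
    // Prints: decomposed: 3 bytes, 2 code points
    std.debug.print("decomposed: {d} bytes, {d} code points\n", .{
        decomposed.len, try countCodePoints(decomposed),
    });
}
```

Neither count tells you that a reader sees one character in both cases; that takes the grapheme cluster rules mentioned above.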
But Bytes are Still Useful!
Interestingly, since at the lowest level we're still dealing with just sequences of bytes, if all we need is to match those bytes one-to-one, then we don't need the higher-level concepts. The std.mem namespace has many useful functions that can work on any type of sequence, but are particularly useful for bytes, and who says those bytes can't be the UTF-8 encoded bytes of a string?
const mem = @import("std").mem;
_ = mem.eql(u8, "⚡ Zig!", "⚡ Zig!");
_ = mem.trimLeft(u8, "⚡ Zig!", "⚡");
_ = mem.trimRight(u8, "⚡ Zig!", "!");
_ = mem.trim(u8, " ⚡ Zig! ", " ");
_ = mem.indexOf(u8, "⚡ Zig!", "Z");
_ = mem.splitSequence(u8, "⚡ Zig!", " "); // mem.split in older Zig releases
// ...and many more.
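One caveat is worth a quick sketch: these functions see bytes and nothing else. The example below assumes only std.mem.indexOf and std.mem.eql, and shows that offsets are byte offsets and that equality is byte-wise.

```zig
const std = @import("std");

pub fn main() void {
    const text = "\u{26A1} Zig!"; // "⚡ Zig!"

    // indexOf returns a byte offset: the lightning bolt takes 3 bytes
    // and the space 1, so 'Z' starts at byte 4, not at "character" 2.
    const index = std.mem.indexOf(u8, text, "Z").?;
    // Prints: 'Z' starts at byte 4
    std.debug.print("'Z' starts at byte {d}\n", .{index});

    // Byte-wise equality treats the two spellings of "é" as different.
    const same = std.mem.eql(u8, "\u{E9}", "\u{65}\u{301}");
    // Prints: equal: false
    std.debug.print("equal: {}\n", .{same});
}
```

For exact matching, searching, and splitting on known byte sequences this is perfectly adequate; it only falls short when two different byte sequences are meant to count as the same text.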
Is this All I Can Do?
To recap, there are two important takeaways from this first post in the series:
- A byte can be a character, but many characters require more than 1 byte.
- A code point can be a character but many characters require more than 1 code point.
We have seen some of the useful tools Zig provides to process UTF-8 encoded Unicode text. Beyond these tools, there are many more higher-level abstractions that can be worked with using third-party libraries. In the next post, we'll look at Ziglyph, the library I've been developing to deal with many of the requirements of the Unicode Standard and text processing in general.
Until then, try out some of the built-in tools presented in this post, and stay away from the ASCII Kool-Aid! 😹