Zig NEWS

dude_the_builder

Ziglyph Unicode Wrangling

Enter the Ziglyph

In the previous post, we covered some basic concepts in Unicode text processing, and some of the built-in tools that Zig offers to process ASCII and UTF-8 encoded text. In this post, we'll explore Ziglyph, a library for processing Unicode text in your Zig projects. The library has a lot of components, so let's get started right away!

Note: To learn how to integrate Ziglyph into your projects, refer to the README file at the GitHub repo.

Code Point Category and Property Detection

Unicode defines categories into which code points can be grouped. Letters, Numbers, and Symbols are examples of such categories. It also defines properties that code points can have, such as whether a code point is an alphabetic character. All of these characteristics can be detected with Ziglyph.

const zg = @import("ziglyph");

_ = zg.isLower('a');
_ = zg.isTitle('A');
_ = zg.isUpper('A');
_ = zg.isDigit('9');
_ = zg.isMark('\u{301}');
// ...and many more.

These functions all work on code points, so you can pass in a u21 or a character literal (a comptime_int), as in the example above. They are all defined in src/ziglyph.zig.

Letter Case Conversion

Not all languages of the world have the notion of letter case, but for those that do, Ziglyph has got you covered. The case conversion functions have variants for either individual code points or entire strings.

_ = zg.toLower('A');

// The string versions require an allocator since they
// allocate memory for the new converted string. Don't
// forget to free that memory later!
var allocator = std.testing.allocator;
const got = try zg.toLowerStr(allocator, "XANADÚ");
defer allocator.free(got);

const want = "xanadú";
try std.testing.expectEqualStrings(want, got);

There are also toUpper and toTitle for code points, and toUpperStr and toTitleStr for strings. All these functions can also be found in the src/ziglyph.zig source code mentioned above.

Case Folding

Unicode provides a mechanism for performing quick and reliable case-insensitive string matching using a method called Case Folding. Some writing systems have irregular case conversion rules, sometimes producing a round-trip conversion that results in a different code point than the original. To avoid complex algorithms to deal with such edge cases, case folding provides a stable conversion from any other letter case, allowing for easy comparison of the case folded strings.

var allocator = std.testing.allocator;
const input_a = "Exit, Stage Left; 1981";
const fold_a = try zg.toCaseFoldStr(allocator, input_a);
defer allocator.free(fold_a);

const input_b = "exIt, stAgE LeFt; 1981";
const fold_b = try zg.toCaseFoldStr(allocator, input_b);
defer allocator.free(fold_b);

try std.testing.expectEqualStrings(fold_a, fold_b);
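
To see why a Unicode-aware fold is needed, it can help to compare against an ASCII-only check. The sketch below is not part of Ziglyph; it only uses the standard library's std.ascii.eqlIgnoreCase, which compares byte by byte and only knows about the letters A to Z. The input strings are reused from the examples in this post.

```zig
const std = @import("std");

test "ASCII-only case-insensitive comparison" {
    // Pure ASCII input: the byte-wise comparison is enough.
    try std.testing.expect(std.ascii.eqlIgnoreCase(
        "Exit, Stage Left; 1981",
        "exIt, stAgE LeFt; 1981",
    ));

    // 'Ú' and 'ú' are two-byte UTF-8 sequences that differ in their
    // second byte, and std.ascii leaves non-ASCII bytes untouched,
    // so these strings compare as different.
    try std.testing.expect(!std.ascii.eqlIgnoreCase("XANADÚ", "xanadú"));
}
```

Once the text leaves the ASCII range, a byte-wise comparison like this stops being reliable, which is the gap that case folding is meant to fill.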

Beyond Code Points: Grapheme Clusters

Remember the example we saw in the first post regarding the character "é" and its two-code point composite version? At that point, we observed that iterating a string one code point at a time wasn't the best method when it comes to extracting these types of complex characters. Unicode provides a text segmentation algorithm that allows us to iterate a string by the elements that humans recognize as discrete characters, such as "é". These discrete elements are called Grapheme Clusters and when you want to iterate over them in a string, Ziglyph provides a GraphemeIterator to do just that.

const Grapheme = @import("ziglyph").Grapheme;
const GraphemeIterator = Grapheme.GraphemeIterator;

const input = "Jos\u{65}\u{301}"; // José
var iter = try GraphemeIterator.init(input);

const want = &[_][]const u8{
    "J",
    "o",
    "s",
    "\u{65}\u{301}",
};

var i: usize = 0;
while (iter.next()) |grapheme| : (i += 1) {
    try std.testing.expect(grapheme.eql(want[i]));
}

Note that the GraphemeIterator returns a Grapheme struct, which holds the slice of bytes that compose the grapheme cluster and a convenient eql method to compare it with a normal Zig string. Each grapheme cluster is in fact a string, not a byte nor a code point, given that grapheme clusters can contain many such components. The last element of the want array of strings is thus the combined "\u{65}\u{301}" (two code points), which together produce the single displayable character "é".
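
For comparison, here is a small sketch that measures the same string using only the standard library (no Ziglyph involved). It shows that neither the byte count nor the code point count matches the four grapheme clusters listed in the want array above.

```zig
const std = @import("std");

test "bytes versus code points" {
    const input = "Jos\u{65}\u{301}"; // José, with a combining acute accent

    // UTF-8 bytes: J, o, s, e take one byte each; U+0301 takes two.
    try std.testing.expectEqual(@as(usize, 6), input.len);

    // Code points: J, o, s, e, U+0301.
    const code_points = try std.unicode.utf8CountCodepoints(input);
    try std.testing.expectEqual(@as(usize, 5), code_points);
}
```

Six bytes, five code points, four grapheme clusters: which number is "the length" depends on what you are trying to do with the string.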

The character "é" is multi-code point, but is still rather simple when compared to the grapheme clusters that can be created abiding by the rules of Unicode. Recent additions to these rules include modifiers to existing emoji characters (e.g. skin tone), which are compositions of many code points joined with special joiner code points. Ziglyph's GraphemeIterator will properly handle any such complex grapheme clusters for you.

Words are Easy, Right?

Actually, no. There's another Unicode algorithm for segmenting text at what most humans would perceive as word boundaries. It may seem a trivial task, but when you consider that some writing systems don't use whitespace at all, you begin to grasp the complexity at hand. But don't worry, Ziglyph has an iterator for that too.

const Word = @import("ziglyph").Word;
const WordIterator = Word.WordIterator;

const input = "The (quick) fox. Fast! ";
var iter = try WordIterator.init(input);

const want = &[_][]const u8{
    "The",
    " ",
    "(",
    "quick",
    ")",
    " ",
    "fox",
    ".",
    " ",
    "Fast",
    "!",
    " ",
};

var i: usize = 0;
while (iter.next()) |word| : (i += 1) {
    try std.testing.expect(word.eql(want[i]));
}

Note how spaces and punctuation are split up and included in the iteration, which can be surprising when one is accustomed to splitting text into "words" using whitespace as a delimiter. But then again, in that approach you would get (quick), fox., and Fast! as words, which include punctuation and thus are technically wrong. As always in software development, tradeoffs and more tradeoffs! Analyze your requirements well and you'll know which approach fits best.
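
To make the contrast concrete, here is a minimal sketch of the whitespace-delimiter approach, written with a plain loop so it does not depend on any particular standard library tokenizer. The splitOnSpaces helper is hypothetical and written just for this illustration; it is not part of Ziglyph, and it only treats the ASCII space as a delimiter.

```zig
const std = @import("std");

/// Splits `text` on ASCII spaces, writing the tokens into `out`.
/// Returns the number of tokens stored; extra tokens are dropped
/// if `out` is too small. (Hypothetical helper for illustration.)
fn splitOnSpaces(text: []const u8, out: [][]const u8) usize {
    var count: usize = 0;
    var start: ?usize = null; // index where the current token began
    var i: usize = 0;
    while (i <= text.len) : (i += 1) {
        const at_end = i == text.len;
        if (at_end or text[i] == ' ') {
            // A space or the end of input closes the current token.
            if (start) |s| {
                if (count < out.len) {
                    out[count] = text[s..i];
                    count += 1;
                }
                start = null;
            }
        } else if (start == null) {
            start = i;
        }
    }
    return count;
}

test "naive split keeps punctuation attached" {
    var buf: [8][]const u8 = undefined;
    const n = splitOnSpaces("The (quick) fox. Fast! ", &buf);

    try std.testing.expectEqual(@as(usize, 4), n);
    try std.testing.expectEqualStrings("The", buf[0]);
    try std.testing.expectEqualStrings("(quick)", buf[1]);
    try std.testing.expectEqualStrings("fox.", buf[2]);
    try std.testing.expectEqualStrings("Fast!", buf[3]);
}
```

The same input yields four tokens here versus the twelve segments in the WordIterator example, and the punctuation stays glued to the words.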

This word-boundary segmentation also makes it possible to convert a string to title case, where the first letter of each word is uppercase and the rest are lowercase.

var allocator = std.testing.allocator;
const input = "tHe (anALog) 'kiD'";
const got = try zg.toTitleStr(allocator, input);
defer allocator.free(got);

const want = "The (Analog) 'Kid'";
try std.testing.expectEqualStrings(want, got);

Sentences Can't Be that Hard, Right?

Just split at punctuation, right? How hard can it be? Well, what do we do with "The U.S.A. has 50 states."? And what about "'Dr. Smith, how are you?' Alex said candidly."? These are just a couple of examples in English; imagine the rest of the languages and writing systems of the world! Once again, there's an iterator for that. But wait, this is Zig, so let's turn it up a notch and get our sentences at compile time.

const Sentence = @import("ziglyph").Sentence;
const ComptimeSentenceIterator = Sentence.ComptimeSentenceIterator;

// You may need to adjust this depending on your input.
@setEvalBranchQuota(2_000);

const input =
    \\("Zig.") ("He said.")
;

// Note the space after the closing right parenthesis 
// is included as part of the first sentence.
const s1 =
    \\("Zig.") 
;
const s2 =
    \\("He said.")
;
const want = &[_][]const u8{ s1, s2 };

comptime var iter = ComptimeSentenceIterator(input){};
var sentences: [iter.count()]Sentence = undefined;

comptime {
    var i: usize = 0;
    while (iter.next()) |sentence| : (i += 1) {
        sentences[i] = sentence;
    }
}

for (sentences) |sentence, i| {
    try std.testing.expect(sentence.eql(want[i]));
}

Of course, SentenceIterator is also available; I just wanted to show off some of Zig's comptime muscle. GraphemeIterator and WordIterator have no separate Comptime... versions, given that they don't require allocation and you can just wrap them in a comptime block to get the same effect. All of these text segmentation tools can be found in the src/segmenter subdirectory of the GitHub repo.
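
As a rough illustration of running an allocation-free iterator at compile time, the sketch below uses the standard library's Utf8View code point iterator as a stand-in. It is an assumption that Ziglyph's own iterators can be driven the same way; the countCodePoints helper is hypothetical and exists only for this example.

```zig
const std = @import("std");

/// Counts the code points in `text`. The loop itself is ordinary code;
/// `initComptime` validates the UTF-8 while compiling.
/// (Hypothetical helper for illustration.)
fn countCodePoints(comptime text: []const u8) usize {
    var iter = std.unicode.Utf8View.initComptime(text).iterator();
    var count: usize = 0;
    while (iter.nextCodepoint()) |_| count += 1;
    return count;
}

test "iterate at compile time" {
    // `comptime` forces the whole call to be evaluated by the compiler.
    const n = comptime countCodePoints("Jos\u{65}\u{301}");

    // The result is a compile-time constant, so it can size an array.
    var buf: [n]u21 = undefined;
    buf[0] = 'J';
    try std.testing.expectEqual(@as(usize, 5), buf.len);
}
```

For longer inputs the compiler may ask for a higher @setEvalBranchQuota, as in the sentence example above.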

Enough Iterations for Now

OK, if your head is spinning with so much iteration, you're not alone! Let's call it a day for this post, but in the next one we'll delve into the world of Unicode Normalization (for string comparisons), Collation (for sorting strings), and Display Width calculations. It's going to be interesting, I promise. See you then!

Top comments (8)

Loris Cro

This series is amazing, keep it up!

Roberto Amorim • Edited

Agreed. I'm learning quite a bit.

To @dude_the_builder now: I noticed you have "autogen" files in the project, which are generated from Unicode specs by some code... which I couldn't find in the project. How did you generate those files, if I may ask?

dude_the_builder

I originally had the autogen code within the Ziglyph repo, but removed it to keep the repo's size small. I was planning on making a separate repo for that, but haven't done it yet. The autogen is done with a mix of Zig code, shell, sed, grep, and awk scripts. It downloads the Unicode Character Database files from the Unicode website and a few other files for tests and the Collation algorithm. The plan is to do it all with Zig, but for now the shell, sed, and awk scripts make it really easy to slice and dice the data.

Roberto Amorim

Sounds good to me. Yeah, having it done in Zig (and generating the files as part of the build process for Ziglyph) would be something really interesting to do, but it feels like a lot of work compared to awk/sed/shell. Thanks for the answer, I appreciate it! Also thanks for this library, it's pretty awesome.

dude_the_builder

Thanks @kristoff ! It's interesting how writing about your own code and what it does helps you understand it better, and even find some bugs or missing pieces along the way. lol

Gonzalo Diethelm

Love your description of these (surprisingly?) subtle issues related to Unicode. Thanks!

Minor typo: s/REEADME file/README file/g.

Michael Camilleri

Great article!

One minor typo: the name change of src/Ziglyph.zig to src/ziglyph.zig broke the link at the end of the 'Code Point Category and Property Detection' section.

dude_the_builder

Thanks for the feedback and the typo catch!