dude_the_builder
Unicode String Operations

Normalization

Let's bring back our old friend from previous posts, the character "é". As we've seen, this character has two possible code point compositions:

  1. The single code point 0xE9, and
  2. The code points 0x65 + 0x301 ("e" plus the acute accent).
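Just to make the difference concrete, here's a quick check using only the standard library: both literals render as "é", but their UTF-8 encodings differ in length.

const std = @import("std");

// 0xE9 encodes to the two bytes 0xC3 0xA9, while 0x65 followed by 0x301
// encodes to the three bytes 0x65 0xCC 0x81.
test "same glyph, different bytes" {
    try std.testing.expectEqual(@as(usize, 2), "\u{E9}".len);
    try std.testing.expectEqual(@as(usize, 3), "\u{65}\u{301}".len);
}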

Why have different code point combinations to produce the same character? Isn't this just needless complexity, wasting memory and CPU cycles? Well, the reason behind this is to facilitate conversions to and from pre-existing encodings. Due to coding space limitations, many earlier encodings represented base characters with modifiers and combining marks (such as the acute accent in "é") as combinations of codes, and if you're converting such an encoding to Unicode, having a direct one-to-one mapping of each of these codes into code points is faster and easier to implement (think: avoiding lookahead buffers).
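To illustrate that one-to-one mapping, here's a small sketch using only the standard library. Every ISO-8859-1 (Latin-1) byte value corresponds directly to the Unicode code point with the same value, so converting to UTF-8 is just re-encoding one byte at a time:

const std = @import("std");

// Each Latin-1 byte maps directly to the code point with the same value,
// so conversion is a per-byte re-encoding with no lookahead needed.
test "Latin-1 to UTF-8" {
    const latin1 = [_]u8{ 'J', 'o', 's', 0xE9 }; // "José" in Latin-1
    var buf: [8]u8 = undefined;
    var i: usize = 0;
    for (latin1) |byte| {
        i += try std.unicode.utf8Encode(@as(u21, byte), buf[i..]);
    }
    try std.testing.expectEqualStrings("Jos\u{E9}", buf[0..i]);
}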

But now we have a problem. Let's say we have these two strings:

const std = @import("std");

const str_1 = "Jos\u{E9}";
const str_2 = "Jos\u{65}\u{301}";

try std.testing.expectEqualStrings(str_1, str_2); // BOOM!

The test doesn't pass because we're comparing the bytes that compose each string, and obviously they're not the same. To drive home the significance of this problem, let's look at the actual error produced:

====== expected this output: =========
José␃

======== instead found this: =========
José␃

======================================
First difference occurs on line 1:
expected:
José
   ^
found:
José
   ^

WAT 🦆? Here lies the real problem: these two strings look exactly the same when rendered visually as glyphs. As programmers, we know the bytes and the code points are different, but normal humans (😹) perceive these strings as equal, and so they should be treated as equivalent.

Unicode Character Equivalence

To deal with this problem, Unicode defines the relationships between characters and their code point compositions. The standard has two types of equivalence: Canonical and Compatibility. We won't go into the gory details here, but if you're interested in learning more, UAX #15 has all the info you'll ever need.

For the purpose of most day-to-day string comparison needs, all we need to know is that Unicode has precise rules for comparing strings correctly, taking into account the different composition and decomposition forms that can be present. These rules and the accompanying algorithms fall under the umbrella topic called Normalization, and Ziglyph has a Normalizer struct that handles all the details so you can compare strings easily.

const Normalizer = @import("ziglyph").Normalizer;
const testing = @import("std").testing;

var allocator = testing.allocator;

var norm = try Normalizer.init(allocator);
defer norm.deinit();

const str_1 = "Jos\u{E9}";
const str_2 = "Jos\u{65}\u{301}";
const str_3 = "JOSÉ";

// Normalized, case-sensitive comparison.
try testing.expect(try norm.eqlBy(str_1, str_2, .normalize));

// Normalized, case-insensitive comparison.
try testing.expect(try norm.eqlBy(str_2, str_3, .norm_ignore));

The Normalizer.eqlBy method takes the two strings to be compared and a comparison mode from the enum CmpMode, which includes .normalize for case-sensitive normalized comparisons and .norm_ignore for case-insensitive ones.

Ziglyph's Normalizer has support for all Unicode Normalization Forms and methods to compose and decompose code points and strings among them. Check out the src/normalizer/Normalizer.zig source code to see what's available.

Unicode Data Differential Compression

The Unicode Character Database is a collection of many text files with varying forms of internal structure. At the core of the database is the UnicodeData.txt file, which consolidates a lot of frequently-used code point data. In this file we can find the mappings between code points and their canonical and compatibility decompositions, which is essential data for the normalization algorithms.
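For example, here's the UnicodeData.txt entry for U+00E9 ("é"); the sixth semicolon-separated field (0065 0301) is its canonical decomposition:

00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9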

The decompositions data alone, once extracted and stripped down, results in a file that's still pretty big, taking up about 72KiB. When stored in an efficient, raw binary format, it slims down to 48KiB, a great improvement but still pretty big. Here's where the plot twists: @slimsag developed an amazing compression algorithm, specially tailored for this type of Unicode data. After compression, the file size shrank to an incredible 19KiB! Starting out from the original UnicodeData.txt at 1.8MiB and ending up at around 19KiB is amazing indeed. To learn more about @slimsag 's Unicode Data Differential Compression (UDDC), check out this blog post and the source code of the src/normalizer/DecompFile.zig file.

Sorting Strings

Given the complexities we've seen so far with compositions, decompositions, normalization forms, and character equivalence, you can imagine that sorting Unicode strings isn't a trivial task either. The Unicode Collation Algorithm (UCA) defines the process by which Unicode strings can be compared to determine their order. This algorithm incorporates normalization as a first step towards preparing a level playing field. Once the strings are normalized, a table of weights is used to derive a sequence of weights (a sort key) for each string from its constituent code points. With those sort keys at hand, ordering the strings becomes a matter of comparing one key against the other.
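To make that last step concrete, here's a minimal sketch of the comparison step only. The weight values below are made up for illustration (the real ones come from the Unicode default collation table); the point is that once each string has been reduced to a sort key, ordering is just a lexicographic comparison of weight sequences:

const std = @import("std");

// Hypothetical sort keys for "cab" and "dab"; the actual weights come
// from the Unicode default collation table (allkeys).
test "comparing sort keys" {
    const key_cab = [_]u16{ 0x1D10, 0x1CAD, 0x1C73 };
    const key_dab = [_]u16{ 0x1D26, 0x1CAD, 0x1C73 };
    try std.testing.expectEqual(std.math.Order.lt, std.mem.order(u16, &key_cab, &key_dab));
}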

The default weights table provided by Unicode is pretty huge, with the original file coming in at about 1.9MiB. The same UDDC algorithm is applied to this data (see the previous section), bringing the size down substantially to a modest 101KiB! But enough about the technical details, let's finally see how to sort strings with Ziglyph:

const Collator = @import("ziglyph").Collator;
const testing = @import("std").testing;

var allocator = testing.allocator;

var collator = try Collator.init(allocator);
defer collator.deinit();

// Collation weight levels overview:
//     .primary:   different characters.
//     .secondary: could be same base characters but 
//                 marks (like accents) differ.
//     .tertiary:  same base characters and marks but 
//                 case is different.
// So cab < dab at .primary, and cab < cáb at .secondary, 
// and cáb < Cáb at .tertiary level.

// These methods return true if the string arguments are
// in the specified order.
try testing.expect(collator.tertiaryAsc("abc", "def"));
try testing.expect(collator.tertiaryDesc("def", "abc"));

// At .primary level, "José" and "jose" are equal because 
// base characters are the same, only marks and case differ, 
// which are weighted at .secondary and .tertiary levels
// respectively. `eq` is a `std.math.Order`.
try testing.expect(try collator.orderFn(
    "José",
    "jose",
    .primary,
    .eq,
));

// Full Unicode in-place sort.
var strings: [3][]const u8 = .{ "xyz", "def", "abc" };
collator.sortAsc(&strings);
try testing.expectEqualStrings("abc", strings[0]);
try testing.expectEqualStrings("def", strings[1]);
try testing.expectEqualStrings("xyz", strings[2]);

// ASCII only sort. If you know the strings are ASCII only,
// this method is faster.
strings = .{ "xyz", "def", "abc" };
collator.sortAsciiAsc(&strings);
try testing.expectEqualStrings("abc", strings[0]);
try testing.expectEqualStrings("def", strings[1]);
try testing.expectEqualStrings("xyz", strings[2]);

When comparing strings using the Unicode weights, there are three levels providing increasing granularity when performing the comparison.

  1. .primary takes into consideration only the base characters in a string, ignoring case and combining marks, so "cáb" and "Cab" are equal at this level but "cab" and "dab" are not.
  2. .secondary adds combining marks to the .primary comparison, so "Cab" and "cab" are equal at this level but "Cab" and "Cáb" are not.
  3. .tertiary adds case to the comparison, so "cáb" and "cáb" are equal but "Cáb" and "cáb" are not. Pretty much an exact match is required here.

In addition to the usual init function, there's an initWithReader function that allows you to provide your own compressed binary weights table file (allkeys.bin). You can find instructions on how to do this, including the process for generating a new version of this file, in the main README file. Be sure to also check out the source code of src/collator/Collator.zig for more string sorting examples.
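As a rough sketch of what that might look like (the exact initWithReader parameters are an assumption here; verify against the README and src/collator/Collator.zig):

const std = @import("std");
const Collator = @import("ziglyph").Collator;

// Assumed signature: initWithReader(allocator, reader); check the Ziglyph
// source for the actual API before relying on this.
test "init Collator from a custom allkeys.bin" {
    var allocator = std.testing.allocator;

    var file = try std.fs.cwd().openFile("allkeys.bin", .{});
    defer file.close();

    var collator = try Collator.initWithReader(allocator, file.reader());
    defer collator.deinit();
}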

80 Columns, 80 Characters... Not So Fast, Buddy!

To top off this post, we have the issue of character display widths. It just so happens that some characters fit nicely into a single column (or cell) when rendered as a glyph in a fixed-width font, but other characters need more special consideration. Most control characters and combining marks have width 0. Some even have a negative width (-1), as is the case with Backspace and DEL. Yet other characters have a variable width between 1 and 2 (called half and full) depending on context. There's even the 3-em dash (⸻), which has a width of 3. Ziglyph has the display_width namespace with functions to calculate the widths of code points and strings.

const dw = @import("ziglyph").display_width;
const testing = @import("std").testing;

// The width methods take a second parameter of value 
// .half or .full to determine the width of "ambiguous"
// code points as per the Unicode standard. .half is the
// most common case.

// Note that codePointWidth returns an i3 because code 
// points like Backspace have width -1.
try testing.expectEqual(dw.codePointWidth('é', .half), 1);
try testing.expectEqual(dw.codePointWidth('😊', .half), 2);
try testing.expectEqual(dw.codePointWidth('统', .half), 2);

var allocator = testing.allocator;

// strWidth returns usize because it can never be negative, 
// regardless of the code points contained in the string.
try testing.expectEqual(try dw.strWidth(allocator, "Hello\r\n", .half), 5);
try testing.expectEqual(try dw.strWidth(allocator, "Héllo 🇵🇷", .half), 8);

Equipped with a means to calculate proper display width for Unicode strings, we can have some fun with them now:

// padLeft, center, padRight
const right_aligned = try dw.padLeft(allocator, "w😊w", 10, "-");
defer allocator.free(right_aligned);
try testing.expectEqualStrings("------w😊w", right_aligned);

const centered = try dw.center(allocator, "w😊w", 10, "-");
defer allocator.free(centered);
try testing.expectEqualStrings("---w😊w---", centered);

const left_aligned = try dw.padRight(allocator, "w😊w", 10, "-");
defer allocator.free(left_aligned);
try testing.expectEqualStrings("w😊w------", left_aligned);

// Line wrap.
var input = "The quick brown fox\r\njumped over the lazy dog!";
var got = try dw.wrap(
    allocator,
    input,
    10, // Desired line width
    3,  // wrap threshold
);
defer allocator.free(got);
var want = "The quick\n brown \nfox jumped\n over the\n lazy dog\n!";
try testing.expectEqualStrings(want, got);

Note that for line wrapping, the wrap function takes the desired number of columns per line and a threshold value that determines how close the last character of the last word can be to the wrapping point before deciding whether or not to wrap at that point. You can play with this value to avoid orphaned punctuation or small words on the last line. This function makes use of the WordIterator we saw in the previous post to avoid wrapping within a word. Perhaps an interesting future project would be hyphenation?

That's All Folks!

Well, this post has been pretty long and a bit dense, but I hope the Ziglyph functionality presented here can be useful in any project requiring Unicode text processing with Zig. Although quite functional, the library is still a work in progress, so feedback, recommendations, issues, and pull requests are welcome. This is the final post in this series, but a post on Zigstr is planned for the future, so until then, have fun, fellow Ziguanas 🦎!

Top comments (10)

Roberto Amorim

Interesting how there's so much about Unicode I never had to worry about much... I knew those things existed, of course, but since I never had to engage with them, they were not a "conscious" problem I had to solve or be concerned with.

Of course, with a library like this, you have to think of all the cases (including the "edge" cases) and I really appreciate your hard work on this. Zig is awesome and having a solid, comprehensive Unicode library only makes it better. Thanks for building this library and for sharing your knowledge with us in this series!

As for including the bin files or not, I think @slimsag 's suggestion is great. Having a config option in comptime with a sensible default is the best approach I can think of.

Stephen Gutekanst • Edited

Really great work on this series and Ziglyph, jecolon! It’s already been super useful for me as I work on Zorex and I’m sure many others will benefit from it and this series of articles well into the future :)

BTW, including the bin files we compressed gets a big +1 from me. I’ve been meaning to open an issue about this, but I think merely providing an option to include them via a comptime variable and having that be the default (so one could still fallback to loading via a file or perhaps server somewhere in the case of WebAssembly) would be a big improvement in the default usability of it out of the box.

Also, hopefully you don’t mind, I will feature this series a bit prominently in the zigmonthly September article as I think a lot of people outside the Zig community want to know more about Zig’s approach to Unicode and don’t yet know about Ziglyph.

P.S. small typo in last paragraph: “Ziglhph”

Awesome work!

dude_the_builder

Thanks @slimsag for the feedback, recommendation, and typo catch. Honored to be featured in zigmonthly!

discipulus

Great work! Your work is momentous for the Zig community. Every programmer should be exposed to that first example and understand this material.

Loris Cro

Amazing series & a good reminder that I have some janky text centering procedures in Bork that need to be streamlined :^)

dude_the_builder

lol Thanks! Those centering, padding, and wrapping functions may still need some fine tuning so let me know if they work as intended.

discipulus

Where does locale come into play? I fuzzily remember an article about how two different countries sort the same language differently.

dude_the_builder

Your memory is correct, and there's a whole other dimension to many of these algorithms that Unicode calls "tailoring". In the specific case of collation, the tailoring can be done by adjusting the weights table to match a specific locale. Aside from providing a different "allkeys.bin" file, Ziglyph doesn't have much in the way of tailoring yet, but it's definitely an interesting avenue to investigate further.

Loris Cro • Edited

Oh would you look at that, a new version of unicode just got announced, what a joyous moment :^)

unicode.org/versions/Unicode14.0.0/

Time to re-parse the spec files?

dude_the_builder

@kristoff would you believe that I adjusted the autogen scripts to download v14.0 last night and see what would explode... absolutely nothing. All tests are passing with the new data! Either my tests are wrong or this is indeed a joyous moment where I can proudly say that Ziglyph already supports Unicode version 14.0.0! lol I have to read that announcement very carefully to see if something needs adjusting.