Zig NEWS

LeRoyce Pearson

Posted on • Updated on

Thoughts on Parsing Part 2 - Read Cursors

While working on `djot.zig`, I've gravitated towards a pattern I'm taking to calling the `Cursor` pattern (if this pattern has a name already, let me know! Edit: Recursive Descent Parsing. I'm using `Cursor`s to implement Recursive Descent Parsing).

Posts in my Thoughts on Parsing series:

1. Part 1 - Parsed Representation
2. Part 2 - Read Cursors
3. Part 3 - Write Cursors

What is a Cursor

The simplest `Cursor` is just a pointer to an index. Let's make a function that parses an integer using this pattern.

``````fn parseInt(text: []const u8, parent: *usize) ?u64 {
const start = parent.*;
var index = start;

// Keep going until we find the first non-digit character
while (index < text.len and '0' <= text[index] and text[index] <= '9') : (index += 1) { }

// There weren't any digits; return null
if (index - start < 1) return null;

// Parse an integer, or return null.
const integer = std.fmt.parseInt(u64, text[start..index], 10) catch return null;

// "Move" the parent cursor to the end of the integer we just parsed
parent.* = index;
return integer;
}
``````

The key difference between this and a `Reader` comes right at the end of the function where we update the `parent` index. This `parseInt` function only "moves" the cursor if it succeeds. If `parseInt` fails we can parse the text using a different method. And if it succeeds, we can chain it with another call.

Here's an example of parsing a list of integers with this pattern:

``````fn parseIntList(allocator: std.mem.Allocator, text: []const u8, parent: *usize) !?[]u64 {
const start = parent.*;
var index = start;

// A list starts with `[`
_ = parseExpectCharacter(text, &index, '[') orelse return null;

var list = std.ArrayList(u64).init(allocator);
defer list.deinit();
while (true) {
try list.append(parseInt(text, &index) orelse break);
_ = parseExpectCharacter(text, &index, ',') orelse break;
}

_ = parseExpectCharacter(text, &index, ']') orelse return null;
parent.* = index;
return list.toOwnedSlice();
}
``````

What does a `Cursor` need?

A `Cursor` must fulfill a couple properties:

1. easy to copy: so we can try reading something without committing to it
2. easy to update: to make chaining calls together easy

The word Transactional neatly sums up those properties. A `Cursor` should be Transactional.

Where did this come from? Why?

While I'm parsing a `djot` file, I don't always know ahead of time if I'm parsing the correct thing.

Take, for example, two overlapping formatting spans:

``````*This is [strong*, not a link.](https://djot.net)
``````

Here the bold span is closed out before the link inside it is, so the text is not a link. The produced html should look like this:

``````<strong>This is [strong</strong>, not a link.](https://djot.net)
``````

When we get to the `[` character, we don't know if we need to produce a link, or if it should be interepted it as text. Using the `Cursor` pattern means I can try parsing the link, and if it doesn't work out because an earlier span closes it out, or because a it's missing the parenthesis, we can instead parse it as text.

Conclusion

While the `Cursor` makes it easy to roll back text that was just read in, it can also be applied to writing a list of events. I'll discuss this in Part 3 of the series.