Zig NEWS

LeRoyce Pearson
LeRoyce Pearson

Posted on • Updated on

Thoughts on Parsing Part 2 - Read Cursors

While working on djot.zig, I've gravitated towards a pattern I'm taking to calling the Cursor pattern (if this pattern has a name already, let me know! Edit: Recursive Descent Parsing. I'm using Cursors to implement Recursive Descent Parsing).

Posts in my Thoughts on Parsing series:

  1. Part 1 - Parsed Representation
  2. Part 2 - Read Cursors
  3. Part 3 - Write Cursors

What is a Cursor

The simplest Cursor is just a pointer to an index. Let's make a function that parses an integer using this pattern.

fn parseInt(text: []const u8, parent: *usize) ?u64 {
    const start = parent.*;
    var index = start;

    // Keep going until we find the first non-digit character
    while (index < text.len and '0' <= text[index] and text[index] <= '9') : (index += 1) { }

    // There weren't any digits; return null
    if (index - start < 1) return null;

    // Parse an integer, or return null.
    const integer = std.fmt.parseInt(u64, text[start..index], 10) catch return null;

    // "Move" the parent cursor to the end of the integer we just parsed
    parent.* = index;
    return integer;
}
Enter fullscreen mode Exit fullscreen mode

The key difference between this and a Reader comes right at the end of the function where we update the parent index. This parseInt function only "moves" the cursor if it succeeds. If parseInt fails we can parse the text using a different method. And if it succeeds, we can chain it with another call.

Here's an example of parsing a list of integers with this pattern:

fn parseIntList(allocator: std.mem.Allocator, text: []const u8, parent: *usize) !?[]u64 {
    const start = parent.*;
    var index = start;

    // A list starts with `[`
    _ = parseExpectCharacter(text, &index, '[') orelse return null;

    var list = std.ArrayList(u64).init(allocator);
    defer list.deinit();
    while (true) {
        try list.append(parseInt(text, &index) orelse break);
        _ = parseExpectCharacter(text, &index, ',') orelse break;
    }

    _ = parseExpectCharacter(text, &index, ']') orelse return null;
    parent.* = index;
    return list.toOwnedSlice();
}
Enter fullscreen mode Exit fullscreen mode

What does a Cursor need?

A Cursor must fulfill a couple properties:

  1. easy to copy: so we can try reading something without committing to it
  2. easy to update: to make chaining calls together easy

The word Transactional neatly sums up those properties. A Cursor should be Transactional.

Where did this come from? Why?

While I'm parsing a djot file, I don't always know ahead of time if I'm parsing the correct thing.

Take, for example, two overlapping formatting spans:

*This is [strong*, not a link.](https://djot.net)
Enter fullscreen mode Exit fullscreen mode

Here the bold span is closed out before the link inside it is, so the text is not a link. The produced html should look like this:

<strong>This is [strong</strong>, not a link.](https://djot.net)
Enter fullscreen mode Exit fullscreen mode

When we get to the [ character, we don't know if we need to produce a link, or if it should be interepted it as text. Using the Cursor pattern means I can try parsing the link, and if it doesn't work out because an earlier span closes it out, or because a it's missing the parenthesis, we can instead parse it as text.

Conclusion

While the Cursor makes it easy to roll back text that was just read in, it can also be applied to writing a list of events. I'll discuss this in Part 3 of the series.

Top comments (0)