Thoughts on Parsing Part 2 - Read Cursors

#parsing

While working on djot.zig, I've gravitated towards a pattern I'm taking to calling the Cursor pattern (if this pattern has a name already, let me know! Edit: Recursive Descent Parsing. I'm using Cursors to implement Recursive Descent Parsing).

Posts in my Thoughts on Parsing series:

What is a Cursor

The simplest Cursor is just a pointer to an index. Let's make a function that parses an integer using this pattern.

fn parseInt(text: []const u8, parent: *usize) ?u64 {
    const start = parent.*;
    var index = start;

    // Keep going until we find the first non-digit character
    while (index < text.len and '0' <= text[index] and text[index] <= '9') : (index += 1) { }

    // There weren't any digits; return null
    if (index - start < 1) return null;

    // Parse an integer, or return null.
    const integer = std.fmt.parseInt(u64, text[start..index], 10) catch return null;

    // "Move" the parent cursor to the end of the integer we just parsed
    parent.* = index;
    return integer;
}

The key difference between this and a Reader comes right at the end of the function where we update the parent index. This parseInt function only "moves" the cursor if it succeeds. If parseInt fails we can parse the text using a different method. And if it succeeds, we can chain it with another call.

Here's an example of parsing a list of integers with this pattern:

fn parseIntList(allocator: std.mem.Allocator, text: []const u8, parent: *usize) !?[]u64 {
    const start = parent.*;
    var index = start;

    // A list starts with `[`
    _ = parseExpectCharacter(text, &index, '[') orelse return null;

    var list = std.ArrayList(u64).init(allocator);
    defer list.deinit();
    while (true) {
        try list.append(parseInt(text, &index) orelse break);
        _ = parseExpectCharacter(text, &index, ',') orelse break;
    }

    _ = parseExpectCharacter(text, &index, ']') orelse return null;
    parent.* = index;
    return list.toOwnedSlice();
}

What does a `Cursor` need?

A Cursor must fulfill a couple properties:

easy to copy: so we can try reading something without committing to it
easy to update: to make chaining calls together easy

The word Transactional neatly sums up those properties. A Cursor should be Transactional.

Where did this come from? Why?

While I'm parsing a djot file, I don't always know ahead of time if I'm parsing the correct thing.

Take, for example, two overlapping formatting spans:

*This is [strong*, not a link.](https://djot.net)

Here the bold span is closed out before the link inside it is, so the text is not a link. The produced html should look like this:

<strong>This is [strong</strong>, not a link.](https://djot.net)

When we get to the [ character, we don't know if we need to produce a link, or if it should be interepted it as text. Using the Cursor pattern means I can try parsing the link, and if it doesn't work out because an earlier span closes it out, or because a it's missing the parenthesis, we can instead parse it as text.

Conclusion

While the Cursor makes it easy to roll back text that was just read in, it can also be applied to writing a list of events. I'll discuss this in Part 3 of the series.

Zig NEWS

Thoughts on Parsing Part 2 - Read Cursors

What is a Cursor

What does a `Cursor` need?

Where did this come from? Why?

Conclusion

Top comments (0)

Read next

Unpacking Zig's Standard Library Documentation

AIR, what is it good for?

Creating safer zig bindings for liburing

Yet another parser combinators library

What is a Cursor

What does a Cursor need?

Where did this come from? Why?

Conclusion

Read next

Unpacking Zig's Standard Library Documentation

AIR, what is it good for?

Creating safer zig bindings for liburing

Yet another parser combinators library

What does a `Cursor` need?