While working on djot.zig
, I've gravitated towards a pattern I'm taking to calling the Cursor
pattern (if this pattern has a name already, let me know! Edit: Recursive Descent Parsing. I'm using Cursor
s to implement Recursive Descent Parsing).
Posts in my Thoughts on Parsing series:
- Part 1 - Parsed Representation
- Part 2 - Read Cursors
- Part 3 - Write Cursors
What is a Cursor
The simplest Cursor
is just a pointer to an index. Let's make a function that parses an integer using this pattern.
fn parseInt(text: []const u8, parent: *usize) ?u64 {
const start = parent.*;
var index = start;
// Keep going until we find the first non-digit character
while (index < text.len and '0' <= text[index] and text[index] <= '9') : (index += 1) { }
// There weren't any digits; return null
if (index - start < 1) return null;
// Parse an integer, or return null.
const integer = std.fmt.parseInt(u64, text[start..index], 10) catch return null;
// "Move" the parent cursor to the end of the integer we just parsed
parent.* = index;
return integer;
}
The key difference between this and a Reader
comes right at the end of the function where we update the parent
index. This parseInt
function only "moves" the cursor if it succeeds. If parseInt
fails we can parse the text using a different method. And if it succeeds, we can chain it with another call.
Here's an example of parsing a list of integers with this pattern:
fn parseIntList(allocator: std.mem.Allocator, text: []const u8, parent: *usize) !?[]u64 {
const start = parent.*;
var index = start;
// A list starts with `[`
_ = parseExpectCharacter(text, &index, '[') orelse return null;
var list = std.ArrayList(u64).init(allocator);
defer list.deinit();
while (true) {
try list.append(parseInt(text, &index) orelse break);
_ = parseExpectCharacter(text, &index, ',') orelse break;
}
_ = parseExpectCharacter(text, &index, ']') orelse return null;
parent.* = index;
return list.toOwnedSlice();
}
What does a Cursor
need?
A Cursor
must fulfill a couple properties:
- easy to copy: so we can try reading something without committing to it
- easy to update: to make chaining calls together easy
The word Transactional neatly sums up those properties. A Cursor
should be Transactional.
Where did this come from? Why?
While I'm parsing a djot
file, I don't always know ahead of time if I'm parsing the correct thing.
Take, for example, two overlapping formatting spans:
*This is [strong*, not a link.](https://djot.net)
Here the bold span is closed out before the link inside it is, so the text is not a link. The produced html should look like this:
<strong>This is [strong</strong>, not a link.](https://djot.net)
When we get to the [
character, we don't know if we need to produce a link, or if it should be interepted it as text. Using the Cursor
pattern means I can try parsing the link, and if it doesn't work out because an earlier span closes it out, or because a it's missing the parenthesis, we can instead parse it as text.
Conclusion
While the Cursor
makes it easy to roll back text that was just read in, it can also be applied to writing a list of events. I'll discuss this in Part 3 of the series.
Oldest comments (0)