Zig NEWS

LeRoyce Pearson

Thoughts on Parsing Part 1 - Parsed Representation

I've been working on a parser for djot, a light markup language similar to CommonMark. The parser is written in Zig, so I've named it djot.zig. In this series of posts I'll share some of the thoughts I've had while writing it.

Note that djot.zig is not yet finished, and the code in this post is for illustration only.

Posts in my Thoughts on Parsing series:

  1. Part 1 - Parsed Representation
  2. Part 2 - Read Cursors
  3. Part 3 - Write Cursors

I'm designing djot.zig to have a small in-memory representation once parsed. My design looks something like this at the moment:

const Document = struct {
  source: []const u8,

  events: []Event,

  /// Where each event starts in source
  event_source: []u32,
};

const Event = enum(u8) {
  text,
  start_paragraph,
  close_paragraph,
  start_list,
  close_list,
  start_list_item,
  close_list_item,
};

This design is very much inspired by the Zig compiler's internals and data-oriented design.
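One reason to keep events and event_source as two parallel slices, rather than a single slice of structs, is padding. The sketch below compares the two layouts; EventWithOffset is a hypothetical type that is not part of djot.zig, and the sizes assume a typical target where u32 is 4-byte aligned.

```zig
const std = @import("std");

const Event = enum(u8) {
    text,
    start_paragraph,
    close_paragraph,
    start_list,
    close_list,
    start_list_item,
    close_list_item,
};

/// Hypothetical "array of structs" layout, shown only for comparison.
const EventWithOffset = struct {
    tag: Event,
    start: u32,
};

test "parallel slices avoid per-event padding" {
    // Stored separately, each event costs one tag byte plus one u32 offset.
    const split = @sizeOf(Event) + @sizeOf(u32);
    try std.testing.expectEqual(@as(usize, 5), split);

    // Stored together, the u32's alignment pads every element to 8 bytes.
    try std.testing.expectEqual(@as(usize, 8), @sizeOf(EventWithOffset));
}
```

Under those assumptions the split layout spends 5 bytes per event instead of 8, and a loop that only looks at tags never touches the offsets.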

The following markup would be parsed like so:

Hello world!

- a bullet
- another bullet
- more bullet
Each event is listed with its offset into source and the source text it covers:

idx  event             src  source text
0    .start_paragraph  0    ""
1    .text             0    "Hello world!"
2    .close_paragraph  12   "\n\n"
3    .start_list       14   ""
4    .start_list_item  14   "- "
5    .text             16   "a bullet\n"
6    .close_list_item  25   ""
7    .start_list_item  25   "- "
8    .text             27   "another bullet\n"
9    .close_list_item  42   ""
10   .start_list_item  42   "- "
11   .text             44   "more bullet"
12   .close_list_item  55   ""
13   .close_list       55   ""

That is 14 bytes for the events list (one byte per event), 56 bytes for the event_source list (four bytes per offset), and 55 bytes for source itself.

We can then turn this abstract representation into HTML by looping over the list of events:

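/// Writes the document as HTML, emitting one tag or one run of text per event.
pub fn toHtml(writer: anytype, doc: Document) !void {
  // Zig 0.11 and later capture the index with an explicit `0..` range.
  for (doc.events, 0..) |event, i| {
    switch (event) {
      // Text is the only event that copies bytes out of the source.
      .text => try writer.writeAll(doc.text(i)),
      .start_paragraph => try writer.writeAll("<p>"),
      .close_paragraph => try writer.writeAll("</p>"),
      .start_list => try writer.writeAll("<ul>"),
      .close_list => try writer.writeAll("</ul>"),
      .start_list_item => try writer.writeAll("<li>"),
      .close_list_item => try writer.writeAll("</li>"),
    }
  }
}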
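toHtml calls doc.text(i), which the Document struct shown earlier does not define. A minimal sketch of what such a helper might look like follows. It assumes that an event's text runs from its own offset in event_source up to the next event's offset, or to the end of source for the last event. That rule is my guess for illustration, not necessarily how djot.zig does it, and the Event enum is trimmed to three tags to keep the example short.

```zig
const std = @import("std");

const Event = enum(u8) { text, start_paragraph, close_paragraph };

const Document = struct {
    source: []const u8,
    events: []const Event,
    /// Where each event starts in source
    event_source: []const u32,

    /// Hypothetical helper: the slice of `source` covered by event `i`.
    /// Assumes the event ends where the next event starts, or at the end
    /// of `source` when `i` is the last event.
    pub fn text(doc: Document, i: usize) []const u8 {
        const start = doc.event_source[i];
        const end = if (i + 1 < doc.events.len)
            doc.event_source[i + 1]
        else
            doc.source.len;
        return doc.source[start..end];
    }
};

test "event text is recovered from neighbouring offsets" {
    const doc = Document{
        .source = "Hello world!\n\n",
        .events = &[_]Event{ .start_paragraph, .text, .close_paragraph },
        .event_source = &[_]u32{ 0, 0, 12 },
    };
    // A zero-width event covers no source text.
    try std.testing.expectEqualStrings("", doc.text(0));
    try std.testing.expectEqualStrings("Hello world!", doc.text(1));
    // The last event runs to the end of the source.
    try std.testing.expectEqualStrings("\n\n", doc.text(2));
}
```

With this rule no lengths need to be stored: zero-width events such as .start_paragraph simply share an offset with the event that follows them.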

In part 2, I'll describe a pattern I've been using while parsing, which I am calling the Cursor pattern.
