Zig NEWS

Vladimir Popov
Vladimir Popov

Posted on

Yet another parser combinators library

Hi! I'm excited to introduce Parcom, a parser combinators library for Zig. The main feature of Parcom is streamed input parsing: there's no need to have the entire input as a string— use std.io.AnyReader to parse the input byte-by-byte!

Here's a simple example from the documentation.

Three different types of parser implementations exist:

  • The base parser implementations contain the logic for parsing input and serve as the fundamental building blocks;
  • The ParserCombinator provides methods to combine parsers and create new ones;
  • The TaggedParser erases the type of the underlying parser and simplifies the parser's type declaration.

Every parser provides the type of the parsing result as a constant ResultType: type.

Let's create a parser, which will parse and execute a simple math expression with follow grammar:

# The `number` is a sequence of unsigned integer numbers
Number := [0-9]+
# The `value` is a `number` or an `expression` in brackets
Value  := Number / '(' Expr ')'
# The `sum` is an operation of adding or substraction of two or more values
Sum    := Value (('+' / '-') Value)*
# The `expression` is result of evaluation the combination of values and operations
Expr   := evaluate(Sum)
Enter fullscreen mode Exit fullscreen mode

Our parser will be capable of parsing and evaluating mathematical expressions that include addition and subtraction operations, unsigned integers, and nested expressions within brackets.

Base parser

The number from the grammar above is a sequence of symbols from the range ['0', '9']. Parcom has a constructor of the parser of bytes in a range, but we will create our own parser starting from the base parser AnyChar. AnyChar is a simplest parser consumed the input. It returns the next byte from the input, or null if the input is empty.

const AnyChar = struct {
    // Every parser has such constant with type of parsing result
    pub const ResultType = u8;

    // The main function to parse an input. 
    // It takes the instance of the parser and the pointer 
    // to the input:
    fn parse(_: @This(), input: *Input) anyerror!?u8 {
        return try input.read();
    }
};
Enter fullscreen mode Exit fullscreen mode

To parse only numeric symbols we should provide a classifier - function that receives the result of a parser and returns true only if it is an expected value:

const parcom = @import("parcom");

// ResultType: u8
const num_char = parcom.anyChar().suchThat({}, struct {
    fn condition(_: void, ch: u8) bool {
        return switch (ch) {
            '0' ... '9' => true,
            else => false,
        };
    }
}.condition);
Enter fullscreen mode Exit fullscreen mode

Every function required in combinators in Parcom library has a context parameter. That gives more flexibility for possible implementations of that functions.

Repeat parsers

Next, we should continue applying our parser until we encounter the first non-numeric symbol or reach the end of the input. To achieve this, we need to store the parsed results. The simplest solution is to use a sentinel array:

// ResultType: [10:0]u8
const number = num_char.repeatToSentinelArray(.{ .max_count = 10 });
Enter fullscreen mode Exit fullscreen mode

But that option is available only for parsers with scalar result types. For more general cases a regular array can be used. If you know exact count of elements in the parsed sequence, you can specified it to have an array with exact length as result:

// ResultType: [3]u8
const number = num_char.repeatToArray(3);
Enter fullscreen mode Exit fullscreen mode

However, this is a rare case. More often, the exact number of elements is unknown, but the maximum number can be estimated:

// ResultType: struct { [10]u8, usize }
const number = num_char.repeatToArray(.{ .max_count = 10 });
Enter fullscreen mode Exit fullscreen mode

In such cases, the result is a tuple consisting of the array and a count of the parsed items within it.

For cases, when impossible to predict the maximum count we can allocate a slice to store the parsed results:

// ResultType: []u8
const number = num_char.repeat(allocator, .{});

// Don't forget to free the memory, allocated for the slice!
Enter fullscreen mode Exit fullscreen mode

or use an arbitrary storage and a function to add an item to it:

var list = std.ArrayList(u8).init(allocator);
defer list.deinit();
// ResultType: *std.ArrayList(u8)
const p = anyChar().repeatTo(&list, .{}, std.ArrayList(u8).append);
Enter fullscreen mode Exit fullscreen mode

Notice, that no matter which combinator you use to collected repeated numbers for our example, you have to set the .min_count to 1, because of empty collection of chars is not a number!

// ResultType: []u8
const number = num_char.repeat(allocator, .{ .min_count = 1 });
Enter fullscreen mode Exit fullscreen mode

Try one or try another

We'll postpone the value parser for now, and instead of that will focus on creating a parsers for the '+' and '-' symbols.

// ResultType: i32
const value: ParserCombinator(???) = ???; 
Enter fullscreen mode Exit fullscreen mode

First of all, we should be able to parse every symbol separately. The char parser is the best candidate for it:

const plus = parcom.char('+');
const minus = parcom.char('-');
Enter fullscreen mode Exit fullscreen mode

Next, we have to choose one of them. To accomplish this, let's combine parsers to a new one, that first attempt one, and if it fails, it will try the other:

// ResultType: parcom.Either(u8, u8)
const plus_or_minus = plus.orElse(minus);
Enter fullscreen mode Exit fullscreen mode

The result type of the new parser is parcom.Either(L, R), an alias for union(enum) { left: L, right: R } type.

Combine results

We have a parser for operations and we assume that we have a parser for values as well. This is sufficient to build the Sum parser, which, as you may recall, follows this structure:

Sum := Value (('+' / '-') Value)*
Enter fullscreen mode Exit fullscreen mode

Let's start from the part in brackets. We have to combine the plus_or_minus parser with value parser and repeat result:

// ResultType: []struct{ parcom.Either(u8, u8), i32 }
plus_or_minus.andThen(value).repeat(allocator, .{});
Enter fullscreen mode Exit fullscreen mode

The andThen combinator runs the left parser and then the right. If both parsers were successful, it returns a tuple of results. Finally, we can combine the value with the new parser to have the version of the expression parser that follows the grammar:

// ResultType: struct{ i32, []struct{ parcom.Either(u8, u8), i32 } }
const sum = value.andThen(plus_or_minus.andThen(value).repeat(allocator, .{}));
Enter fullscreen mode Exit fullscreen mode

Transform the result

So far so good. We are ready to create a parser that will not only parse the input, but also sum of parsed values:

const expr = sum.transform(i32, {}, struct {
    fn evaluate(_: void, value: struct{ i32, []struct{ Either(u8, u8), i32 } }) !i32 {
        var result: i32 = value[0];
        for (value[1]) |op_and_arg| {
            switch(op_and_arg[0]) {
                .left => result += op_and_arg[1],
                .right => result -= op_and_arg[1],
            )
        }
        return result;
    }
}.evaluate);
Enter fullscreen mode Exit fullscreen mode

The combinator transform requires a context and a function for transformation. It runs the left parser and applies the function to the parsed result.

Tagged parser

Now the time to build the value parser:

Value  := Number / '(' Expr ')'
Enter fullscreen mode Exit fullscreen mode

This is a recursive parser that not only forms part of the expression parser, but also depends on it. How we can implement this? First of all, let's wrap the expression parser to the function:

const std = @import("std");
const parcom = @import("parcom");

fn expression(allocator: std.mem.Allocator) ??? {

    // ResultType: u8
    const num_char = parcom.anyChar().suchThat({}, struct {
        fn condition(_: void, ch: u8) bool {
            return switch (ch) {
                '0' ... '9' => true,
                else => false,
            };
        }
    }.condition);

    // ResultType: i32
    const number = num_char.repeat(allocator, .{ .min_count = 1 })
       .transform(i32, {}, struct {
        fn parseInt(_: void, value: []u8) !i32 {
            return try std.fmt.parseInt(i32, value, 10);
        }
    }.parseInt);

    // ResultType: i32
    const value = ???;

    // ResultType: parcom.Either(u8, u8)
    const plus_or_minus = parcom.char('+').orElse(parcom.char('-'));

    // ResultType: struct{ i32, []struct{ parcom.Either(u8, u8), i32 } }
    const sum = value.andThen(plus_or_minus.andThen(value).repeat(allocator, .{}));

    const expr = sum.transform(i32, {}, struct {
        fn evaluate(_: void, v: struct{ i32, []struct{ parcom.Either(u8, u8), i32 } }) !i32 {
            var result: i32 = v[0];
            for (v[1]) |op_and_arg| {
                switch(op_and_arg[0]) {
                    .left => result += op_and_arg[1],
                    .right => result -= op_and_arg[1],
                }
            }
            return result;
        }
    }.evaluate);

    return expr;
}
Enter fullscreen mode Exit fullscreen mode

The type of ParserCombinator in Parcom can be very cumbersome, and it is often impractical to manually declare it as a function's type. However, Zig requires this type to allocate enough memory for the parser instance. While most parsers in Parcom are simply namespaces, this is not true for all of them. What can we do is moving our parser to heap and replace particular type by the pointer to it. This is exactly how the TaggedParser works. It has a pointer to the original parser, and a pointer to a function responsible for
parsing the input. More over, the TaggedParser has explicit ResultType:

const std = @import("std");
const parcom = @import("parcom");

fn expression(allocator: std.mem.Allocator) parcom.TaggedParser(i32) {
    ...
    return expr.taggedAllocated(allocator);
}
Enter fullscreen mode Exit fullscreen mode

Deferred parser

Let's go ahead and finally build the value parser:

const value = number.orElse(
    parcom.char('(').rightThen(expression(allocator)).leftThen(parcom.char(')')
);
Enter fullscreen mode Exit fullscreen mode

Pay attention on rightThen and leftThen combinators. Unlike the andThencombinator, these two do not produce a tuple. Instead, they ignore one value and return another. The rightThen uses only result of the right parser, and leftThen of the left parser respectively. It means, that both brackets will be parsed, but ignored in the example above.

But this is not all. Unfortunately, such implementation of the value parser will lead to infinite loop of invocations the expression function. We can solve this by invoking the function only when we need to parse an expression within brackets. The Parcom has the deferred parser for such purposes. It receives the ResultType of TaggedParser which should be returned by the function, a context that should be passed to the function and pointer to the function:

const value = number.orElse(
    parcom.char('(').rightThen(parcom.deferred(i32, allocator, expression)).leftThen(parcom.char(')'))
);
Enter fullscreen mode Exit fullscreen mode

When the tagged parsed completes its deferred work, the deinit method will be invoked, and memory will be freed. But, do not forget to invoke deinit manually, when you create the TaggedParser outside the deferred parser!

Now, we can bring everything together and test it:

test "9-(5+2) == 2" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const parser = try expression(arena.allocator());
    try std.testing.expectEqual(2, try parser.parseString("9-(5+2)"));
}
Enter fullscreen mode Exit fullscreen mode

The whole final solution and more details you can find on the github page of the Parcom project: https://github.com/dokwork/parcom

Top comments (0)