Crafting an Interpreter in Zig - part 3

#learn

This is the third post on Crafting an Interpreter in Zig, the blog series where I read the book Crafting Interpreters, implement the III part using Zig and highlight some C feature used in the book and compare it to how I did it using Zig. You can find the code here, and if you haven't read the first two parts go read them!

In this chapter, chapter 15, we implement the scanner (or lexer) of our language. In few words, the scanner reads the raw source code of our Lox program and converts it to a sequence of tokens that carry more meaning to be used by the following steps of our interpreter. You should go and read a better explanation of it in the Crafting Interpreters book itself.

I have to confess that this is not my first time going through this book. The first time I tried implementing Lox using Rust and got stumped by it, in fact got stumped implementing the scanner. But why? Well, at that time my understanding of slices, allocation and memory management was very superficial (not that now is any better) and I couldn't wrap my head around the &str and String types in Rust, at least not well enough to implement a performant scanner. The &str in Rust is a string slice, in Zig we prefer to call those by its full name instead of the nickname []const u8. But in this chapter I feel that the concept of slices finally clicked to me. We'll see how it happened.

To implement a scanner we need to read the whole contents of a file into the memory of our program, of course we are not sure how big or small this file will be, as far as we know it could be a "simple" hello world, or a simulation of the universe, so we need to dynamically allocate the memory to store the contents of this file, for that we need an allocator.

Allocation in Zig is very explicit if something is allocated you will see it using an allocator. Different from C, Zig does not have a global allocator if a function allocates something it has to take an allocator as a parameter (You could define a global allocator in Zig if you really want to). For someone like me, used to high level languages with automatic memory management, this looks weird and some times feels like a friction point, but I have to say that I really appreciate, explicit allocation and not having a global allocator let's you reason about your program in different ways and at a different level of detail. In part thanks to this I finally feel that I understand slices.

Let's compare how we read the contents of a file in C and Zig.

static char* readFile(const char* path) {
  FILE* file = fopen(path, "rb");

  fseek(file, 0L, SEEK_END);
  size_t fileSize = ftell(file);
  rewind(file);

  char* buffer = (char*)malloc(fileSize + 1);
  size_t bytesRead = fread(buffer, sizeof(char), fileSize, file);
  buffer[bytesRead] = '\0';

  fclose(file);
  return buffer;
}

Yo can see the malloc call in line 6. The function does not tell us that it allocates other than by reading what it is doing. Compare that to the Zig version.

fn readFile(path: []const u8, allocator: *Allocator) []const u8 {
    const file = std.fs.cwd().openFile(
        path,
        .{ .read = true },
    ) catch |err| {
        std.log.err("Could not open file \"{s}\", error: {any}.\n", .{ path, err });
        process.exit(74);
    };
    defer file.close();

    return file.readToEndAlloc(allocator, 100_000_000) catch |err| {
        std.log.err("Could not read file \"{s}\", error: {any}.\n", .{ path, err });
        process.exit(74);
    };
}

This function takes an allocator as an argument, this tells the caller of the function to be prepared for an allocation and all the things it implies, like performance considerations, error handling, and probably having to free that memory, the latest would be better communicated using some form of documentation. In this specific case, the caller of this function is responsible to free the allocated buffer.

fn runFile(fileName: []const u8, vm: *Vm, allocator: *Allocator) !void {
    const source = readFile(fileName, allocator);
    defer allocator.free(source);

    try vm.interpret(source);
}

readFile returns the source code of our Lox program in the form of a []const u8 (slice of bytes) that we will use in the rest of our interpreter. We need to free this memory once we don't need it anymore for that we use the same allocator we passed to the readFile function. The defer keyword let us keep our allocations and frees close together and still executing the free at the end of the function.

So this is one way we can interpret a slice, a chunk of memory of our program. But this does not tell the whole story, as I said we want to convert this raw source code into Tokens that have more meaning for the next steps, this Tokens are represented by this struct

const Token = struct {
    ty: TokenType,
    lexeme: []const u8,
    line: u64,
};

It has a type, which is an enum with values like LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE, NUMBER, STRING, etc. A line which is the line in the source code where we find the token, and a lexeme which is the string representing this token in the case of LEFT_PAREN it will be (, for STRING it will be something like "Hello World". We could store this string by allocating memory and storing the string there, but that wouldn't be very efficient. We are already storing all of this strings in the slice that holds the source code of our Lox program, we could just refer to that memory, if only the language we are using supported something like that, well it does, and it's called a slice! So a slice is not only a chunk of memory of our program but also references to pieces of this chunks.

Let's see it in code to make it more clear, this is how we create this tokens.

fn makeToken(self: *Self, tokenType: TokenType) Token {
    return Token{
        .ty = tokenType,
        .lexeme = self.source[self.start..self.current],
        .line = self.line,
    };
}

Where self is the instance of the Scanner that holds the source code. For the lexeme we take the source and say we are just interested in it from start to current which are just numbers that we advance when we find new Tokens.

Zig provides powerful ways to manually manage the memory of our program. It provides us with slices which are more convenient and easier to reason about than raw pointers. Also explicit allocation and not having a global allocator are features that can improve the quality and performance characteristics of our program. I hope this post helped you understand slices a bit better in case you where confused by them as I was or at least I hope it is not more confusing than it was before. I am leaving a lot of details out of this post and I recommend that you go and read about slices in the official Zig documentation.

Hope you liked this post and see you in the next one.

Coverphoto by Graphy Co on Unsplash

Zig NEWS

Crafting an Interpreter in Zig - part 3

Top comments (0)

Read next

Do not rely on the pointer to content of returned struct

A "distinct index types" journey

Building Zig libraries with C dependencies

High Performance HashMap + Disk-based Allocator