How to use hash map contexts to save memory when doing a string table

#optimization #std #learn

Standard library hash maps have a flexible API that can be used for some really nice patterns. For this example, consider the use case of building a string table. A string table is a way to efficiently store many references to the same set of strings via an index into a linear array of bytes. In our example, each string will be null-terminated.

What data structure to use for this?

const Foo = struct {
    string_bytes: std.ArrayList(u8),
    string_table: std.StringHashMap(u32),
};

The first idea we might try is this: the string bytes are in one array, and a hash table maps strings to the index at which they can be found.

This works just fine, but, we can do better! Each key of string_table requires 16 bytes (on 64-bit platforms) for each key. This can really add up with a large number of strings. Here we will use the adapted context APIs in order to store only indexes in the table, and yet still be able to look up strings by slice in the table.

Here is a complete example that you copy-paste and play with:

const std = @import("std");
const mem = std.mem;

const Foo = struct {
    string_bytes: std.ArrayListUnmanaged(u8),
    /// Key is string_bytes index.
    string_table: std.HashMapUnmanaged(u32, void, IndexContext, std.hash_map.default_max_load_percentage),
};

const IndexContext = struct {
    string_bytes: *std.ArrayListUnmanaged(u8),

    pub fn eql(self: IndexContext, a: u32, b: u32) bool {
        _ = self;
        return a == b;
    }

    pub fn hash(self: IndexContext, x: u32) u64 {
        const x_slice = mem.spanZ(@ptrCast([*:0]const u8, self.string_bytes.items.ptr) + x);
        return std.hash_map.hashString(x_slice);
    }
};

const SliceAdapter = struct {
    string_bytes: *std.ArrayListUnmanaged(u8),

    pub fn eql(self: SliceAdapter, a_slice: []const u8, b: u32) bool {
        const b_slice = mem.spanZ(@ptrCast([*:0]const u8, self.string_bytes.items.ptr) + b);
        return mem.eql(u8, a_slice, b_slice);
    }

    pub fn hash(self: SliceAdapter, adapted_key: []const u8) u64 {
        _ = self;
        return std.hash_map.hashString(adapted_key);
    }
};

test "hash contexts" {
    const gpa = std.testing.allocator;

    var foo: Foo = .{
        .string_bytes = .{},
        .string_table = .{},
    };
    defer foo.string_bytes.deinit(gpa);
    defer foo.string_table.deinit(gpa);

    const index_context: IndexContext = .{ .string_bytes = &foo.string_bytes };

    const hello_index = @intCast(u32, foo.string_bytes.items.len);
    try foo.string_bytes.appendSlice(gpa, "hello\x00");
    try foo.string_table.putContext(gpa, hello_index, {}, index_context);

    const world_index = @intCast(u32, foo.string_bytes.items.len);
    try foo.string_bytes.appendSlice(gpa, "world\x00");
    try foo.string_table.putContext(gpa, world_index, {}, index_context);

    // now we want to check if a string exists based on a string literal
    const slice_context: SliceAdapter = .{ .string_bytes = &foo.string_bytes };
    const found_entry = foo.string_table.getEntryAdapted(@as([]const u8, "world"), slice_context).?;
    try std.testing.expectEqual(found_entry.key_ptr.*, world_index);
}

Here we provide an adapter which provides an alternate way to perform hash and equality checking. As long as the hash and eql functions are consistent across all the adapters used for a given hash map, everything works!

For a string table with 10 thousand entries, this would save 117 KB of memory, significantly reducing the CPU cache pressure, resulting in performance improvements.

Zig NEWS

How to use hash map contexts to save memory when doing a string table

Top comments (0)

Read next

build.zig.zon dependency hashes

To SIMD and beyond: Optimizing a simple comparison routine

Do not rely on the pointer to content of returned struct

A "distinct index types" journey