Cool Zig Patterns - Gotta alloc fast

Felix "xq" Queißner

So everyone knows that the GeneralPurposeAllocator is slow as heck, as it's primarily designed to be a debug allocator.

There's often the recommendation to use an ArenaAllocator for fast, short-lived allocations. But can we actually beat that?

Yes, we can!

Well, under one constraint: we only ever allocate a single kind of object. In other words, we implement a memory pool.

This often sounds intimidating, but we can easily implement something that outperforms an ArenaAllocator in basically all cases.

The interface

First, let's define the interface of our pool and implement it badly:

const std = @import("std");

// `Object` is the single type this pool hands out; any struct will do.

pub const Pool = struct { // 1: raw
  allocator: std.mem.Allocator,

  pub fn init(allocator: std.mem.Allocator) @This() {
    return .{ .allocator = allocator };
  }

  pub fn new(self: @This()) !*Object {
    return try self.allocator.create(Object);
  }

  pub fn delete(self: @This(), obj: *Object) void {
    self.allocator.destroy(obj);
  }

  pub fn deinit(self: *@This()) void {
    _ = self;
  }
};

So the interface is ptr = try pool.new() and pool.delete(ptr). As you can easily see, no pooling is performed here.
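
For reference, here is how the pool is meant to be used (a minimal sketch; the `Object` definition and the GPA as backing allocator are placeholder choices of mine):

const Object = struct { x: u32 = 0 };

pub fn main() !void {
  var gpa = std.heap.GeneralPurposeAllocator(.{}){};
  defer _ = gpa.deinit();

  var pool = Pool.init(gpa.allocator());
  defer pool.deinit();

  const ptr = try pool.new();
  ptr.x = 42;
  pool.delete(ptr);
}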

Going faster

Let's improve that a bit:

const Arena = struct { // 2: arena
  arena: std.heap.ArenaAllocator,

  pub fn init(allocator: std.mem.Allocator) @This() {
    return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
  }

  pub fn deinit(self: *@This()) void {
    self.arena.deinit();
  }

  pub fn new(self: *@This()) !*Object {
    return try self.arena.allocator().create(Object);
  }

  pub fn delete(self: *@This(), obj: *Object) void {
    // this is basically a no-op
    self.arena.allocator().destroy(obj);
  }
};

Well, this is faster (see benchmarks below), but now we effectively leak memory: the arena never reuses freed objects. This is not suitable for long-running applications, but at least allocations are fast.

Using an arena is a pretty good base in general as you get cache locality of allocated objects, and for smol objects, we can go pretty fast with that.

Going 🚀

Let us utilize the cache locality of objects even more! The problem with the previous implementation is that we never reuse freed objects and just leak them into nirvana.
Let's cache them instead, and serve new allocations from that cache before hitting the allocator!

const Pool = struct { // 3: pool
    const List = std.TailQueue(Object);

    arena: std.heap.ArenaAllocator,
    free: List = .{},

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
    }

    pub fn deinit(self: *@This()) void {
        self.arena.deinit();
    }

    pub fn new(self: *@This()) !*Object {
        // Reuse a node from the free list if possible,
        // otherwise grow the arena by one node.
        const obj = if (self.free.popFirst()) |item|
            item
        else
            try self.arena.allocator().create(List.Node);
        return &obj.data;
    }

    pub fn delete(self: *@This(), obj: *Object) void {
        // Recover the surrounding node from the object pointer
        // and stash it for reuse.
        const node = @fieldParentPtr(List.Node, "data", obj);
        self.free.append(node);
    }
};

Now we simply append each freed object to a list, and when we allocate, we check that list first. Easy to implement, easy to use. The code still seems fast, but no longer leaks. Nice!

Benchmarking

But are we really better? Let's benchmark the three implementations!

Implementation | Lines of Code | Small  | Medium | Big
1: raw         | 19            | 3.25us | 8.70us | 224.03us
2: arena       | 20            | 0.48us | 1.92us | 169.42us
3: pool        | 27            | 0.43us | 0.42us | 2.60us

I benchmarked three categories of objects:

  • Small objects are exactly one byte large. There were 1 248 859 allocations, 1 248 731 frees and a peak of 154 living objects at the same time. A total of 2 500 000 rounds were performed.
  • Medium objects have a size of 8k, with 499 709 allocs, 499 587 frees and a peak of 154 living objects. A total of 1 000 000 rounds were performed.
  • Big objects are 1 MiB large, and have 12 589 allocs, 12 464 frees and a peak of 146 living objects. A total of 25 000 rounds were performed.

The benchmarking algorithm was roughly this:

open = []
for rounds:
  fill_rate = open.len / 256
  if random() <= fill_rate:
    pool.delete(removeRandomItem(open))
  if random() >= fill_rate:
    open.append(pool.new())

The random generator uses a fixed seed, so the benchmarks are reproducible. This pattern should still generate fairly irregular sequences of alloc and free operations, so we don't simply benefit from allocating and freeing the same object over and over. Also, the relative frequency of allocs and frees changes with the number of live items, so the working set should bounce back and forth.
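
Translated into Zig, that loop might look roughly like this (a sketch, not the actual benchmark code; the seed value and the empty-list guard are my additions):

fn benchmark(pool: *Pool, allocator: std.mem.Allocator, rounds: usize) !void {
    var prng = std.rand.DefaultPrng.init(0x1234); // fixed seed => reproducible runs
    const random = prng.random();

    // all currently living objects
    var open = std.ArrayList(*Object).init(allocator);
    defer open.deinit();

    var i: usize = 0;
    while (i < rounds) : (i += 1) {
        const fill_rate = @intToFloat(f32, open.items.len) / 256.0;

        // the fuller `open` gets, the more likely we are to free something...
        if (open.items.len > 0 and random.float(f32) <= fill_rate) {
            const index = random.uintLessThan(usize, open.items.len);
            pool.delete(open.swapRemove(index));
        }
        // ...and the less likely we are to allocate something new.
        if (random.float(f32) >= fill_rate) {
            try open.append(try pool.new());
        }
    }
}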

The full code can be found here; try to reproduce the results!

Conclusion

So with 8 more lines of code, we got an allocation performance boost of "a good portion".

For tiny allocations, we perform in the same league as an ArenaAllocator, but can run forever. For medium-sized allocations, we beat the ArenaAllocator by a factor of 4 and the GPA even by a factor of 20! And when it comes to huge objects, we outperform both the ArenaAllocator (65× faster) and the GPA (86× faster) by a huge margin. That's two orders of magnitude!

Now go, and write faster software!

Top comments (7)

Silver

Since you don't need the object and the list node at the same time, you can reduce memory usage and improve cache locality by using a union of std.TailQueue(void).Node and Object, instead of a std.TailQueue(Object).Node which always stores redundant information.
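
That suggestion might look roughly like this (my sketch of the idea, not Silver's actual code; the casts rely on the fact that all fields of a bare union live at the same address):

const UnionPool = struct {
    const FreeList = std.TailQueue(void);

    // A slot is either a live Object or a free-list node, never both,
    // so a bare union lets them share the same memory.
    const Item = union {
        object: Object,
        node: FreeList.Node,
    };

    arena: std.heap.ArenaAllocator,
    free: FreeList = .{},

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
    }

    pub fn deinit(self: *@This()) void {
        self.arena.deinit();
    }

    pub fn new(self: *@This()) !*Object {
        const item = if (self.free.popFirst()) |node|
            @ptrCast(*Item, @alignCast(@alignOf(Item), node))
        else
            try self.arena.allocator().create(Item);
        item.* = .{ .object = undefined };
        return &item.object;
    }

    pub fn delete(self: *@This(), obj: *Object) void {
        const item = @ptrCast(*Item, @alignCast(@alignOf(Item), obj));
        item.* = .{ .node = .{ .data = {} } };
        self.free.append(&item.node);
    }
};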

Felix "xq" Queißner

I know, I've done this in #12586 to provide this functionality to the std lib.

LeRoyce Pearson

I wonder how an ArrayList of the items would perform against the LinkedList of results? Or, I guess it would have to be a segmented list. Maybe I'll try that out

LeRoyce Pearson

Update: It's not faster. Faster than the Arena and GPA for big items, but slower than the linked list. It might result in better runtime performance, but I don't have a benchmark for that

Felix "xq" Queißner

Yeah, that would've been my guess. The problem with an ArrayList is that it's not pointer-stable per se. How did you implement freeing in the segmented list?

I will also create a much improved version of this pool and commit it to the stdlib, so people can easily use this pattern.

LeRoyce Pearson

I used a separate list which I just ignored any allocation errors with. It wasn't great :P.

ikrima

if you have a fixed upper bound + 64-bit address space, another common approach is to vmem reserve the address space upfront + commit/decommit on-demand (see the sketch after this list):

  • the freelist can be made more cache friendly with array instead of linked list
  • depending on frequency of alloc vs. dealloc vs. size, can micro-optimize it to a bitvector
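
For illustration, a rough POSIX-flavored sketch of the reserve-then-commit idea (my addition, not ikrima's code; Windows would use VirtualAlloc with MEM_RESERVE/MEM_COMMIT instead, and the exact std.os names vary between Zig versions):

const std = @import("std");
const os = std.os;

const ReservedPool = struct {
    memory: []align(std.mem.page_size) u8, // reserved address space
    committed: usize = 0, // bytes currently usable

    pub fn init(max_bytes: usize) !@This() {
        // Reserve address space only; no physical pages are committed yet.
        const memory = try os.mmap(
            null,
            max_bytes,
            os.PROT.NONE,
            os.MAP.PRIVATE | os.MAP.ANONYMOUS,
            -1,
            0,
        );
        return .{ .memory = memory };
    }

    pub fn deinit(self: *@This()) void {
        os.munmap(self.memory);
    }

    // Grow the usable prefix of the reservation to at least `bytes`.
    pub fn commit(self: *@This(), bytes: usize) !void {
        const end = std.mem.alignForward(bytes, std.mem.page_size);
        try os.mprotect(self.memory[0..end], os.PROT.READ | os.PROT.WRITE);
        self.committed = end;
    }
};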

of course some subtle, confusing and non-deterministic complications (the best kind, right?):

  • (on Windows at least) even if you reserve + commit upfront, the OS always page-maps on demand at first touch of a page. A naive implementation may see inexplicable performance stalls (e.g. depending on the size of the process standby page pool, the zeroed page pool, the number of demand-zero page faults generated, etc.). Bruce Dawson has excellent write-ups as usual
  • if used liberally, you may also need to misalign or randomize your vmem base address so it's not page aligned; otherwise, you might cripple your cache. @danluu wrote a more detailed explanation
  • there are more implementation nuances around multi-threading/contention/segregating the freelist metadata into hot/cold; but at that point, you're writing a full-blown allocator, so you might as well just grab someone's slab allocator implementation :P