So everyone knows that the GeneralPurposeAllocator
is slow as heck, as it's primarily designed to be a debug allocator.
There's often the recommendation to use an ArenaAllocator
for fast, short-lived allocations. But can we actually beat that?
Yes, we can!
Well, under one constraint: we only need to allocate a single kind of object, so we can implement a memory pool.
This often sounds intimidating, but we can easily implement something that outperforms an ArenaAllocator
in basically all cases.
The interface
First, let's define the interface of our pool and implement it badly:
pub const Pool = struct { // 1: raw
    allocator: std.mem.Allocator,

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .allocator = allocator };
    }

    pub fn new(self: @This()) !*Object {
        return try self.allocator.create(Object);
    }

    pub fn delete(self: @This(), obj: *Object) void {
        self.allocator.destroy(obj);
    }

    pub fn deinit(self: *@This()) void {
        _ = self;
    }
};
So the interface is `ptr = try pool.new()` and `pool.delete(ptr)`. As you can easily see, no pooling is performed here.
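For illustration, usage might look roughly like this (a sketch that assumes the `Pool` from above is in scope; the `Object` struct here is a hypothetical stand-in for whatever the pool actually stores):

```zig
const std = @import("std");

// Hypothetical payload type; the snippets above assume some `Object` exists.
const Object = struct {
    value: u32 = 0,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var pool = Pool.init(gpa.allocator());
    defer pool.deinit();

    const ptr = try pool.new(); // just allocator.create under the hood
    ptr.value = 42;
    pool.delete(ptr); // just allocator.destroy
}
```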
Going faster
Let's improve that a bit:
const Arena = struct { // 2: arena
    arena: std.heap.ArenaAllocator,

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
    }

    pub fn deinit(self: *@This()) void {
        self.arena.deinit();
    }

    pub fn new(self: *@This()) !*Object {
        return try self.arena.allocator().create(Object);
    }

    pub fn delete(self: *@This(), obj: *Object) void {
        // this is basically a no-op
        self.arena.allocator().destroy(obj);
    }
};
Well, this seems to be faster (see the benchmarks below), but now we basically leak memory: `delete` is effectively a no-op, and memory is only reclaimed when the whole arena is torn down in `deinit`. This is not suitable for long-running applications, but at least allocations are fast.
Using an arena is a pretty good base in general as you get cache locality of allocated objects, and for smol objects, we can go pretty fast with that.
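To see that trade-off in code, here is a minimal (hypothetical) test against the `Arena` variant above. It assumes `const std = @import("std");` and the `Object` type are in scope; note that `std.testing.allocator` would report a leak if we forgot the final `deinit`.

```zig
test "arena variant only reclaims memory on deinit" {
    var pool = Arena.init(std.testing.allocator);
    // Only this call actually frees anything.
    defer pool.deinit();

    var i: usize = 0;
    while (i < 1000) : (i += 1) {
        const obj = try pool.new();
        pool.delete(obj); // effectively a no-op: the arena never reuses this memory
    }
}
```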
Going 🚀
Let us utilize the cache locality of objects even more! The problem with the previous implementation is that we don't reuse freed objects and just leak them into nirvana.
Let's just cache them and use the cache instead of a new allocation!
const Pool = struct { // 3: pool
    const List = std.TailQueue(Object);

    arena: std.heap.ArenaAllocator,
    free: List = .{},

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
    }

    pub fn deinit(self: *@This()) void {
        self.arena.deinit();
    }

    pub fn new(self: *@This()) !*Object {
        const obj = if (self.free.popFirst()) |item|
            item
        else
            try self.arena.allocator().create(List.Node);
        return &obj.data;
    }

    pub fn delete(self: *@This(), obj: *Object) void {
        const node = @fieldParentPtr(List.Node, "data", obj);
        self.free.append(node);
    }
};
Now we just append each freed object to a free list, and when we allocate, we check that list first. Easy to implement, easy to use. The code is still fast, but doesn't leak anymore. Nice!
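A quick (hypothetical) test makes the reuse visible: once an object has been deleted, the next `new` pops it off the free list and hands back the same address instead of growing the arena. As before, this assumes `std` and `Object` are in scope.

```zig
test "pool reuses freed objects" {
    var pool = Pool.init(std.testing.allocator);
    defer pool.deinit();

    const first = try pool.new();
    pool.delete(first);

    // `new` pops the node we just freed, so we get the same slot back.
    const second = try pool.new();
    try std.testing.expectEqual(first, second);
}
```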
Benchmarking
But are we really better? Let's benchmark the three implementations!
| Implementation | Lines Of Code | Small | Medium | Big |
|---|---|---|---|---|
| 1: raw | 19 | 3.25us | 8.70us | 224.03us |
| 2: arena | 20 | 0.48us | 1.92us | 169.42us |
| 3: pool | 27 | 0.43us | 0.42us | 2.60us |
I benchmarked three categories of objects:
- Small objects are exactly one byte large. There were 1 248 859 allocations, 1 248 731 frees and a peak of 154 living objects at the same time. A total of 2 500 000 rounds were performed.
- Medium objects have a size of 8k, with 499 709 allocs, 499 587 frees and a peak of 154 living objects. A total of 1 000 000 rounds were performed.
- Big objects are 1 MiB large, and have 12 589 allocs, 12 464 frees and a peak of 146 living objects. A total of 25 000 rounds were performed.
The benchmarking algorithm was roughly this:

open = []
for rounds:
    fill_rate = open.len / 256
    if random() <= fill_rate:
        pool.delete(removeRandomItem(open))
    if random() >= fill_rate:
        open.append(pool.new())
The random number generator uses a fixed seed, so the benchmarks are reproducible. This pattern should still generate pretty irregular sequences of alloc and free operations, so there is no benefit from allocating and freeing the same object over and over. Also, the frequency of allocs and frees changes with the number of live items, so there should be some back-and-forth bouncing.
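Translated to Zig, the inner loop might look roughly like this. This is a sketch, not the actual benchmark harness: the seed value, `rounds`, `allocator`, `pool` and `Object` are placeholders, and it uses the same std API vintage as the snippets above (`std.rand`, `@intToFloat`).

```zig
fn runBenchmark(pool: anytype, allocator: std.mem.Allocator, rounds: usize) !void {
    var prng = std.rand.DefaultPrng.init(0x1234); // any fixed seed gives reproducible runs
    const random = prng.random();

    var open = std.ArrayList(*Object).init(allocator);
    defer open.deinit();

    var round: usize = 0;
    while (round < rounds) : (round += 1) {
        const fill_rate = @intToFloat(f32, open.items.len) / 256.0;

        // The fuller the `open` list, the more likely we delete something...
        if (open.items.len > 0 and random.float(f32) <= fill_rate) {
            const index = random.uintLessThan(usize, open.items.len);
            pool.delete(open.swapRemove(index));
        }
        // ...and the less likely we allocate something new.
        if (random.float(f32) >= fill_rate) {
            try open.append(try pool.new());
        }
    }
}
```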
The full code can be found here; try to reproduce the results!
Conclusion
So with 8 more lines of code, we got an allocation performance boost of "a good portion".
For tiny allocations, we perform in the same category as an ArenaAllocator, but can run forever. For medium-sized allocations, we start beating the ArenaAllocator by a factor of 4, and the GPA even by a factor of 20! But when it comes to huge objects, we outperform both the ArenaAllocator (65× faster) and the GPA (86× faster) by a huge amount. That's two orders of magnitude!
Now go, and write faster software!
Comments
I wonder how an `ArrayList` of the items would perform against the `LinkedList` of results? Or, I guess it would have to be a segmented list. Maybe I'll try that out.

Update: It's not faster. Faster than the Arena and GPA for big items, but slower than the linked list. It might result in better runtime performance, but I don't have a benchmark for that.
Yeah that would've been my guess. The problem with array list is that it's not pointer-stable per se. How did you implement freeing in the segmented list?
I will also create a much improved version of this pool and commit it to the stdlib, so people can easily use this pattern.
I used a separate list and just ignored any allocation errors. It wasn't great :P.
If you have a fixed upper bound and a 64-bit address space, another common approach is to `vmem reserve` the address space upfront and `commit/decommit` on demand.

Of course there are some subtle, confusing and non-deterministic complications (the best kind, right?): even with `reserve + commit` upfront, the OS always page-maps on demand, on first touch of a page. A naive implementation may see inexplicable performance stalls (e.g. depending on `# of process standby page pool`, `# of zeroed page pool`, `# of demand-zero page faults generated`, etc.). Bruce Dawson has excellent write-ups as usual.

Since you don't need the object and the list node at the same time, you can reduce memory usage and improve cache locality by using a union of `std.TailQueue(void).Node` and `Object`, instead of a `std.TailQueue(Object).Node`, which always stores redundant information.

I know, I've done this in #12586 to provide this functionality to the std lib.
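To make that suggestion a bit more concrete, here is a rough sketch of the layout being described (my own illustration, not the actual #12586 patch); it assumes the article's `Object` type:

```zig
const std = @import("std");

const FreeList = std.TailQueue(void);

// A slot is either a live Object or, once freed, a link in the free list
// (never both at the same time), so a bare union lets the two share storage.
const Slot = union {
    object: Object,
    node: FreeList.Node,
};

// `new` would pop a Slot from the free list (or create a fresh one from the
// arena) and return `&slot.object`; `delete` would switch the slot back to its
// `node` field and push it onto the free list. Compared to
// std.TailQueue(Object).Node, the prev/next pointers no longer sit next to
// every live object.
```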