So everyone knows that the GeneralPurposeAllocator
is slow as heck, as it's primarily designed to be a debug allocator.
There's often the recommendation to use an ArenaAllocator
for fast, short-lived allocations. But can we actually beat that?
Yes, we can!
Well, under one constraint: we only need to allocate a single kind of object, i.e. we implement a memory pool.
This often sounds intimidating, but we can easily implement something that outperforms an ArenaAllocator
in basically all cases.
The interface
First, let's define the interface of our pool and implement it badly:
```zig
pub const Pool = struct { // 1: raw
    allocator: std.mem.Allocator,

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .allocator = allocator };
    }

    pub fn new(self: @This()) !*Object {
        return try self.allocator.create(Object);
    }

    pub fn delete(self: @This(), obj: *Object) void {
        self.allocator.destroy(obj);
    }

    pub fn deinit(self: *@This()) void {
        _ = self;
    }
};
```
So the interface is `ptr = try pool.new()` and `pool.delete(ptr)`. As you can easily see, no pooling is performed here.
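The article never shows the definition of `Object`; any plain struct will do. For a concrete picture, here is a minimal usage sketch with a placeholder `Object` (the type and the test are illustrative, not part of the original code):

```zig
const std = @import("std");

// Placeholder object type; the article pools some fixed struct like this.
const Object = struct {
    id: u32 = 0,
    payload: [60]u8 = undefined,
};

test "pool interface" {
    var pool = Pool.init(std.testing.allocator);
    defer pool.deinit();

    const obj = try pool.new(); // hands out a *Object
    obj.id = 42;
    pool.delete(obj); // gives it back to the pool
}
```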
Going faster
Let's improve that a bit:
```zig
const Arena = struct { // 2: arena
    arena: std.heap.ArenaAllocator,

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
    }

    pub fn deinit(self: *@This()) void {
        self.arena.deinit();
    }

    pub fn new(self: *@This()) !*Object {
        return try self.arena.allocator().create(Object);
    }

    pub fn delete(self: *@This(), obj: *Object) void {
        // this is basically a no-op
        self.arena.allocator().destroy(obj);
    }
};
```
Well, this seems to be faster (see benchmarks below), but now we basically leak memory. This is not suitable for long-running applications, but at least allocations are fast.
Using an arena is a pretty good base in general as you get cache locality of allocated objects, and for smol objects, we can go pretty fast with that.
Going 🚀
Let us utilize the cache locality of objects even more! The problem with the previous implementation is that we don't reuse the freed objects and just leak them into nirvana.
Let's just cache them and use the cache instead of a new allocation!
```zig
const Pool = struct { // 3: pool
    const List = std.TailQueue(Object);

    arena: std.heap.ArenaAllocator,
    free: List = .{},

    pub fn init(allocator: std.mem.Allocator) @This() {
        return .{ .arena = std.heap.ArenaAllocator.init(allocator) };
    }

    pub fn deinit(self: *@This()) void {
        self.arena.deinit();
    }

    pub fn new(self: *@This()) !*Object {
        // reuse a node from the free list if possible, otherwise allocate a new one
        const obj = if (self.free.popFirst()) |item|
            item
        else
            try self.arena.allocator().create(List.Node);
        return &obj.data;
    }

    pub fn delete(self: *@This(), obj: *Object) void {
        // recover the list node that wraps the object and put it on the free list
        const node = @fieldParentPtr(List.Node, "data", obj);
        self.free.append(node);
    }
};
```
Now we just append each freed object to a free list, and when we allocate, we check that list first. Easy to implement, easy to use. The code is still fast, but doesn't leak anymore. Nice!
Benchmarking
But are we really better? Let's benchmark the three implementations!
| Implementation | Lines Of Code | Small | Medium | Big |
|---|---|---|---|---|
| 1: raw | 19 | 3.25us | 8.70us | 224.03us |
| 2: arena | 20 | 0.48us | 1.92us | 169.42us |
| 3: pool | 27 | 0.43us | 0.42us | 2.60us |
I benchmarked three categories of objects:
- Small objects are exactly one byte large. There were 1 248 859 allocations, 1 248 731 frees and a peak of 154 living objects at the same time. A total of 2 500 000 rounds were performed.
- Medium objects have a size of 8k, with 499 709 allocs, 499 587 frees and a peak of 154 living objects. A total of 1 000 000 rounds were performed.
- Big objects are 1 MiB large, and have 12 589 allocs, 12 464 frees and a peak of 146 living objects. A total of 25 000 rounds were performed.
The benchmarking algorithm was roughly that:
```
open = []
for rounds:
    fill_rate = open.len / 256
    if random() <= fill_rate:
        pool.delete(removeRandomItem(open))
    if random() >= fill_rate:
        open.append(pool.new())
```
The random generator uses a fixed seed, so we have reproducible benchmarks. This pattern should still generate pretty irregular sequences of alloc and free operations, so there is no benefit from allocating and freeing the same object all the time. Also, the frequency of allocs and frees changes with the number of live items, so there should be some back-and-forth bouncing.
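Translated into Zig, the benchmark loop looks roughly like this. This is a sketch only: `Object`, `rounds`, and the seed are placeholders rather than the exact code behind the published numbers, and the builtin/`std.rand` names follow the Zig version of this article and may differ in newer releases.

```zig
const std = @import("std");

// Sketch of the benchmark loop; call it e.g. as `try bench(&pool, 2_500_000, gpa.allocator())`.
fn bench(pool: anytype, rounds: usize, allocator: std.mem.Allocator) !void {
    var open = std.ArrayList(*Object).init(allocator);
    defer open.deinit();

    var prng = std.rand.DefaultPrng.init(0x1234); // fixed seed => reproducible runs
    const random = prng.random();

    var round: usize = 0;
    while (round < rounds) : (round += 1) {
        const fill_rate = @intToFloat(f32, open.items.len) / 256.0;

        // the more live objects there are, the more likely we free one...
        if (open.items.len > 0 and random.float(f32) <= fill_rate) {
            const index = random.uintLessThan(usize, open.items.len);
            pool.delete(open.swapRemove(index));
        }
        // ...and the less likely we allocate a new one
        if (random.float(f32) >= fill_rate) {
            try open.append(try pool.new());
        }
    }
}
```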
The full code can be found here; try to reproduce the results!
Conclusion
So with 8 more lines of code, we got an allocation performance boost of "a good portion".
For tiny allocations, we perform in the same category as an `ArenaAllocator`, but can run forever. For medium-sized allocations, we start beating the `ArenaAllocator` by a factor of 4, and the GPA even by a factor of 20! But when it comes to huge objects, we outperform both the `ArenaAllocator` (65× faster) and the GPA (86× faster) by a huge amount. That's two orders of magnitude!
Now go, and write faster software!
Top comments (7)
Since you don't need the object and the list node at the same time, you can reduce memory usage and improve cache locality by using a union of `std.TailQueue(void).Node` and `Object`, instead of a `std.TailQueue(Object).Node`, which always stores redundant information.

I know, I've done this in #12586 to provide this functionality to the std lib.
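A rough sketch of the layout difference being suggested (not the actual patch; `Object` is a stand-in, and getting from the `*Object` back to the free-list node then needs a pointer cast or `@fieldParentPtr`-style trick, which is what the eventual std lib implementation does):

```zig
const std = @import("std");

// Placeholder for whatever the pool hands out.
const Object = struct { payload: [64]u8 };

// What the article's pool stores per item: list links *and* the object, always.
const NodeWithData = std.TailQueue(Object).Node;

// The suggested alternative: a freed item only needs its links and a live item
// only needs its data, so the two can share storage.
const NodeOrData = union {
    node: std.TailQueue(void).Node,
    object: Object,
};

comptime {
    // The overlay is strictly smaller than storing both side by side.
    std.debug.assert(@sizeOf(NodeOrData) < @sizeOf(NodeWithData));
}
```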
I wonder how an `ArrayList` of the items would perform against the `LinkedList`? Or, I guess it would have to be a segmented list. Maybe I'll try that out.

Update: It's not faster. Faster than the Arena and GPA for big items, but slower than the linked list. It might result in better runtime performance, but I don't have a benchmark for that.
Yeah that would've been my guess. The problem with array list is that it's not pointer-stable per se. How did you implement freeing in the segmented list?
I will also create a much improved version of this pool and commit it to the stdlib, so people can easily use this pattern.
I used a separate list and just ignored any allocation errors on it. It wasn't great :P.
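For reference: that improved pool did land in the standard library, and recent Zig versions ship it as `std.heap.MemoryPool`. A minimal usage sketch (the `Object` type is a placeholder, and the exact API may vary slightly between releases):

```zig
const std = @import("std");

// Placeholder type; MemoryPool works for any fixed struct.
const Object = struct { id: u32, payload: [60]u8 };

test "std.heap.MemoryPool" {
    var pool = std.heap.MemoryPool(Object).init(std.testing.allocator);
    defer pool.deinit();

    const obj = try pool.create(); // like `new()` above
    defer pool.destroy(obj); // like `delete()` above

    obj.* = .{ .id = 1, .payload = undefined };
}
```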
If you have a fixed upper bound + a 64-bit address space, another common approach is to `vmem reserve` the address space upfront + `commit/decommit` on demand.

Of course there are some subtle, confusing and non-deterministic complications (the best kind, right?): even with `reserve + commit` upfront, the OS always page-maps on demand on first page touch. A naive implementation may see inexplicable performance stalls (e.g. depending on the # of pages in the process standby page pool, the # in the zeroed page pool, the # of demand-zero page faults generated, etc). Bruce Dawson has excellent write-ups as usual.