Michal Ziulek

Posted on Jan 18, 2022 • Updated on May 6, 2022

Fast, multi-platform, SIMD math library in Zig

#math #simd #gamedev #library

GitHub homepage: zmath

Introduction

Zig programming language does not support operator overloading - instead, it provides builtin Vector types.

const F32x4 = @Vector(4, f32);

const v0 = F32x4{ 1.0, 2.0, 3.0, 4.0 };
const v1 = F32x4{ 1.0, 2.0, 3.0, 4.0 };

const v = v0 + v1;

When compiled to x86_64 with zig build -Dcpu=x86_64 -Drelease-fast=true above code will map to a single SIMD instruction:

addps xmm0, xmm1

This is very nice because we get fast SIMD code without using CPU specific intrinsics - when compiled for ARM architecture we would get single NEON instruction. Furthermore, when we extend our code to:

const F32x8 = @Vector(8, f32);

const v0 = F32x8{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };
const v1 = F32x8{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };

const v = v0 + v1;

and compile with zig build -Dcpu=x86_64 -Drelease-fast=true we get:

addps xmm0, xmm2
addps xmm1, xmm3

So, we have extended our vector width from 4 to 8 and our code is still very efficient. Target architecture does not natively support vector width 8 so compiler generated 2 x SIMDx4 code.

When we compile above code with zig build -Dcpu=x86_64+avx -Drelease-fast=true we get:

vaddps ymm0, ymm0, ymm1

AVX extension adds native support for vector width 8 and compiler takes advantage of this fact generating only single instruction. Of course Zig supports all fundamental operators, not only '+'.

Above concepts are very powerful because they allow us to write generic SIMD code that will compile to multiple vector widths and multiple CPU architectures.

Example 1 - zmath.sin()

Consider code below. It computes sin() over large array using zmath lib. When you work with vector width 8 - compiler generates SIMDx8 code. When you increase vector width to 16 - compiler generates 2 x SIMDx8 code.

const zm = @import("zmath.zig");

const SimdType = zm.F32x16;
// const SimdType = zm.F32x8;

fn processData(data: []f32, num_elements: u32) void {
    var i: u32 = 0;
    while (i < num_elements) : (i += zm.veclen(SimdType)) {
        var v = zm.load(data[i..], SimdType, 0);
        v = zm.sin(v);
        zm.store(data[i..], v, 0);
    }
}

In this case 2 x SIMDx8 code achieves 0.98 uOps/cycle. SIMDx8 code achieves only 0.32 uOps/cycle. This is because 2 x SIMDx8 code generates more independent uOps per iteration and HW resources are better utilized - more uOps are executed in parallel.

;
; SimdType = F32x8
; zmath.sin(v: SimdType)
; SIMDx8 code (-Dcpu=x86_64+avx+fma)
; 0.32 uOps/cycle according to llvm-mca
;
vmovap      ymm1, ymm0
vmulps      ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
vroundps    ymm0, ymm0, 0
vmulps      ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
vaddps      ymm0, ymm1, ymm0
vandps      ymm1, ymm0, ymmword ptr [rip + .LCPI0_2]
vorps       ymm1, ymm1, ymmword ptr [rip + .LCPI0_3]
vandps      ymm2, ymm0, ymmword ptr [rip + .LCPI0_4]
vsubps      ymm1, ymm1, ymm0
vcmpleps    ymm2, ymm2, ymmword ptr [rip + .LCPI0_5]
vblendvps   ymm0, ymm1, ymm0, ymm2
vmulps      ymm1, ymm0, ymm0
vmovaps     ymm2, ymmword ptr [rip + .LCPI0_6]
vfmadd213ps ymm2, ymm1, ymmword ptr [rip + .LCPI0_7]
vfmadd213ps ymm2, ymm1, ymmword ptr [rip + .LCPI0_8]
vfmadd213ps ymm2, ymm1, ymmword ptr [rip + .LCPI0_9]
vfmadd213ps ymm2, ymm1, ymmword ptr [rip + .LCPI0_10]
vfmadd213ps ymm2, ymm1, ymmword ptr [rip + .LCPI0_11]
vmulps      ymm0, ymm0, ymm2

;
; SimdType = F32x16
; zmath.sin(v: SimdType)
; 2 x SIMDx8 code (-Dcpu=x86_64+avx+fma)
; 0.98 uOps/cycle according to llvm-mca
;
vmovaps     ymm2, ymm1
vmovaps     ymm3, ymm0
vmovaps     ymm0, ymmword ptr [rip + .LCPI0_0]
vmulps      ymm1, ymm1, ymm0
vmulps      ymm0, ymm3, ymm0
vroundps    ymm0, ymm0, 0
vroundps    ymm1, ymm1, 0
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_1]
vmulps      ymm0, ymm0, ymm4
vmulps      ymm1, ymm1, ymm4
vaddps      ymm1, ymm2, ymm1
vaddps      ymm0, ymm3, ymm0
vmovaps     ymm2, ymmword ptr [rip + .LCPI0_2]
vandps      ymm3, ymm0, ymm2
vandps      ymm2, ymm1, ymm2
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_3]
vorps       ymm2, ymm2, ymm4
vorps       ymm3, ymm3, ymm4
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_4]
vandps      ymm5, ymm1, ymm4
vandps      ymm4, ymm0, ymm4
vsubps      ymm3, ymm3, ymm0
vsubps      ymm2, ymm2, ymm1
vmovaps     ymm6, ymmword ptr [rip + .LCPI0_5]
vcmpleps    ymm4, ymm4, ymm6
vcmpleps    ymm5, ymm5, ymm6
vblendvps   ymm1, ymm2, ymm1, ymm5
vblendvps   ymm0, ymm3, ymm0, ymm4
vmulps      ymm2, ymm0, ymm0
vmulps      ymm3, ymm1, ymm1
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_6]
vmovaps     ymm5, ymmword ptr [rip + .LCPI0_7]
vmovaps     ymm6, ymm5
vfmadd213ps ymm6, ymm3, ymm4
vfmadd213ps ymm5, ymm2, ymm4
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_8]
vfmadd213ps ymm5, ymm2, ymm4
vfmadd213ps ymm6, ymm3, ymm4
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_9]
vfmadd213ps ymm6, ymm3, ymm4
vfmadd213ps ymm5, ymm2, ymm4
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_10]
vfmadd213ps ymm5, ymm2, ymm4
vfmadd213ps ymm6, ymm3, ymm4
vmovaps     ymm4, ymmword ptr [rip + .LCPI0_11]
vfmadd213ps ymm6, ymm3, ymm4
vfmadd213ps ymm5, ymm2, ymm4
vmulps      ymm0, ymm0, ymm5
vmulps      ymm1, ymm1, ymm6

Notice how easy it is to change the vector width - you only need to change the SimdType constant - rest of the code stays unchanged. This is great because we can experiment with different vector widths for a given algorithm and choose the fastest one.

Example 2 - zmath.mul(Mat, Mat)

4x4 matrix multiplication is very common in 3D game development. zmath has very simple and efficient implementation:

var result: Mat = undefined;
comptime var row: u32 = 0;
inline while (row < 4) : (row += 1) {

var vx = @shuffle(f32, m0[row], undefined, [4]i32{ 0, 0, 0, 0 });
var vy = @shuffle(f32, m0[row], undefined, [4]i32{ 1, 1, 1, 1 });
var vz = @shuffle(f32, m0[row], undefined, [4]i32{ 2, 2, 2, 2 });
var vw = @shuffle(f32, m0[row], undefined, [4]i32{ 3, 3, 3, 3 });

vx = vx * m1[0];
vy = vy * m1[1];
vz = vz * m1[2];
vw = vw * m1[3];
vx = vx + vz;
vy = vy + vw;
vx = vx + vy;
result[row] = vx;
}

When compiled with: zig build -Dcpu=x86_64+avx+fma -Drelease-fast=true we get fast assembly shown below. Theoretical characteristic (llvm-mca) of this code on Zen2 is:

Dispatch Width: 4
uOps Per Cycle: 3.30
IPC: 3.30
Block RThroughput: 12.3

vmovups      xmm1, xmmword ptr [rdx]
vmovups      xmm2, xmmword ptr [rdx + 16]
vmovups      xmm3, xmmword ptr [rdx + 32]
vmovups      xmm0, xmmword ptr [rdx + 48]
vbroadcastss xmm4, dword ptr [rsi]
vbroadcastss xmm5, dword ptr [rsi + 4]
vbroadcastss xmm6, dword ptr [rsi + 8]
vbroadcastss xmm7, dword ptr [rsi + 12]
vmulps       xmm4, xmm4, xmm1
vmulps       xmm5, xmm5, xmm2
vmulps       xmm6, xmm6, xmm3
vaddps       xmm4, xmm4, xmm6
vmulps       xmm6, xmm7, xmm0
vaddps       xmm5, xmm5, xmm6
vaddps       xmm8, xmm4, xmm5
vbroadcastss xmm5, dword ptr [rsi + 16]
vbroadcastss xmm6, dword ptr [rsi + 20]
vbroadcastss xmm7, dword ptr [rsi + 24]
vbroadcastss xmm4, dword ptr [rsi + 28]
vmulps       xmm5, xmm5, xmm1
vmulps       xmm6, xmm6, xmm2
vmulps       xmm7, xmm7, xmm3
vaddps       xmm5, xmm5, xmm7
vmulps       xmm4, xmm4, xmm0
vaddps       xmm4, xmm6, xmm4
vaddps       xmm9, xmm5, xmm4
vbroadcastss xmm5, dword ptr [rsi + 32]
vbroadcastss xmm6, dword ptr [rsi + 36]
vbroadcastss xmm7, dword ptr [rsi + 40]
vbroadcastss xmm4, dword ptr [rsi + 44]
vmulps       xmm5, xmm5, xmm1
vmulps       xmm6, xmm6, xmm2
vmulps       xmm7, xmm7, xmm3
vaddps       xmm5, xmm5, xmm7
vmulps       xmm4, xmm4, xmm0
vaddps       xmm4, xmm6, xmm4
vaddps       xmm10, xmm5, xmm4
vbroadcastss xmm5, dword ptr [rsi + 48]
vbroadcastss xmm6, dword ptr [rsi + 52]
vbroadcastss xmm7, dword ptr [rsi + 56]
vbroadcastss xmm4, dword ptr [rsi + 60]
vmulps       xmm1, xmm5, xmm1
vmulps       xmm2, xmm6, xmm2
vmulps       xmm3, xmm7, xmm3
vaddps       xmm1, xmm1, xmm3
vmulps       xmm0, xmm4, xmm0
vaddps       xmm0, xmm2, xmm0
vaddps       xmm0, xmm1, xmm0

Example 3 - zmath.round()

round() is very useful operation - zmath.sin() uses it to perform input value reduction into [-pi; pi) range. zmath uses inline assembly on x86_64 to make it as fast as possible. As you can see below, x86_64 SSE and ARM NEON paths are implemented as generic Zig code ('else' block). x86_64 AVX+ paths are optimized and map to a single instruction.

pub fn round(v: anytype) @TypeOf(v) {
    const T = @TypeOf(v);
    if (cpu_arch == .x86_64 and has_avx) {
        if (T == F32x4) {
            return asm ("vroundps $0, %%xmm0, %%xmm0"
                : [ret] "={xmm0}" (-> T),
                : [v] "{xmm0}" (v),
            );
        } else if (T == F32x8) {
            return asm ("vroundps $0, %%ymm0, %%ymm0"
                : [ret] "={ymm0}" (-> T),
                : [v] "{ymm0}" (v),
            );
        } else if (T == F32x16 and has_avx512f) {
            return asm ("vrndscaleps $0, %%zmm0, %%zmm0"
                : [ret] "={zmm0}" (-> T),
                : [v] "{zmm0}" (v),
            );
        } else if (T == F32x16 and !has_avx512f) {
            const arr: [16]f32 = v;
            var ymm0 = @as(F32x8, arr[0..8].*);
            var ymm1 = @as(F32x8, arr[8..16].*);
            ymm0 = asm ("vroundps $0, %%ymm0, %%ymm0"
                : [ret] "={ymm0}" (-> F32x8),
                : [v] "{ymm0}" (ymm0),
            );
            ymm1 = asm ("vroundps $0, %%ymm1, %%ymm1"
                : [ret] "={ymm1}" (-> F32x8),
                : [v] "{ymm1}" (ymm1),
            );
            return @shuffle(
                f32, ymm0, ymm1,
                [16]i32{
                    0, 1, 2, 3, 4, 5, 6, 7,
                    -1, -2, -3, -4, -5, -6, -7, -8
                }
            );
        }
    } else {
        const sign = andInt(v, splatNegativeZero(T));
        const magic = orInt(splatNoFraction(T), sign);
        var r1 = v + magic;
        r1 = r1 - magic;
        const r2 = abs(v);
        const mask = r2 <= splatNoFraction(T);
        return select(mask, r1, v);
    }
}

Library overview

zmath library builds on top of Zig provided features and adds higher level functionality useful for game developers. Few fundamental types are provided F32x4, F32x8, F32x16, Boolx4, Boolx8 and Boolx16. List of all provided functions can be found here. I have investigated assembly code for each function and when compiler generated version wasn't efficient I have provided faster inline assembly for x86_64 architecture.

Functions are divided into several classes:

Initialization functions

const zm = @import("zmath.zig");

const v4 = zm.f32x4(1.0, 2.0, 3.0, 4.0);
const v8 = zm.f32x8(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);

// set all components to a single value
const v16 = zm.f32x16s(math.pi * 0.25);

const v = v4 * v4;
const b = v > v4; // b is a Boolx4 vector
if (zm.all(b, 0)) { // if all components are 'true'
  // do something
}

// load 3 components from memory, 4th component will be set to 0
var vv4 = zm.load(mem[0..], F32x4, 3);

// load 2 components from memory and put them into F32x8
var vv8 = zm.load(mem[32..], F32x8, 2);

// load 7 components from memory and put them into F32x16
var vv16 = zm.load(mem[64..], F32x16, 7); 

// process data...
vv4 = zm.sin(vv4);
vv8 = zm.saturate(vv8);
vv16 = zm.cos(vv16);

zm.store(mem[0..], vv4, 3); // store 3 components
zm.store(mem[32..], vv8, 8); // store 8 components

// store all vector components (16 in this case)
zm.store(mem[64..], vv16, 0);

Functions that work on all vector components

const zm = @import("zmath.zig");

const v4 = zm.sin(zm.f32x4(1.0, 2.0, 3.0, 4.0));
const v8 = zm.sin(zm.f32x8(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0));
const v16 = zm.sin(zm.f32x16s(math.pi * 0.25));

When compiled with zig build -Dcpu=x86_64 -Drelease-fast=true we get efficient SIMDx4 assembly for the first invocation, 2 x SIMDx4 assembly for the second invocation and 4 x SIMDx4 for the third one. When we compile for AVX capable CPUs we get SIMDx4 code for the first invocation, SIMDx8 for the second one and 2 x SIMDx8 for the last one.

2D, 3D, 4D vector functions

pub const Vec = F32x4;

Those functions treat F32x4 type as 2D, 3D or 4D vector. Number in the function name suffix tells us how many components are used. For example, dot2 performs dot product operation treating vectors as 2D, cross3 function uses x, y, z components to perform cross product operation and so on.

Vector arithmetic is expressed naturally, for example:

const vr0 = v0 + v1 - v2;
const scalar: f32 = 2.0;
const vr1 = v0 + zm.f32x4s(scalar) * zm.f32x4(1.0, 2.0, 3.0, 4.0);

// zm.dotN() returns scalar replicated on all vector components
const vr2 = zm.dot3(v0, v1) / v3;

Vector comparison is accomplished with builtin operators and zmath.all(), zmath.any() functions.

if (zm.all(v0 > v1, 3)) { // if all first 3 components are 'true'
    // do something
}

if (zm.any(v0 > v1, 3)) { // if any of the first 3 components is 'true'
    // do something
}

Matrix functions

pub const Mat = [4]F32x4;

All typical matrix operations are available. There are several overloads of zmath.mul() function:

const zm = @import("zmath.zig");
zm.mul(zm.Vec, zm.Mat) // Vec treated as a row vector
zm.mul(zm.Mat, zm.Vec) // Vec treated as a column vector
zm.mul(f32, zm.Mat)
zm.mul(zm.Mat, f32)
zm.mul(zm.Quat, zm.Quat)
zm.mul(zm.Vec, f32)
zm.mul(f32, zm.Vec)

Some examples:

const speed = zm.f32x4s(10.0);
const delta_time = zm.f32x4s(demo.frame_stats.delta_time);

const transform = zm.mul(
    zm.rotationX(demo.camera.pitch),
    zm.rotationY(demo.camera.yaw),
);
var forward = zm.normalize3(
    zm.mul(zm.f32x4(0.0, 0.0, 1.0, 0.0), transform),
);

zm.store(demo.camera.forward[0..], forward, 3);

const right = speed * delta_time * zm.normalize3(
    zm.cross3(
        zm.f32x4(0.0, 1.0, 0.0, 0.0),
        forward,
    )
);
forward = speed * delta_time * forward;

Quaternion functions

pub const Quat = F32x4;

All typical quaternion operations are available. Some examples:

const q0 = zm.quatFromMat(m0);
const q1 = zm.quatFromMat(m1);
const qp = zm.mul(q0, q1);

const q = zm.slerp(q0, q1, 0.5);

const m = zm.quatToMat(q);

Summary

Thanks for reading! If you like this article please visit zig-gamedev project - recently I have released several 'intro applications' which show step by step how to use zmath lib and graphics lib - you can find them here.

Top comments (6)

花大喵 • Feb 7 '22

Great to enrich the library of zig math

Austin Clements • Jul 23 '22

This is awesome!! For releasing software using SIMD would your approach be to release multiple builds with different processor extensions enabled? Alternatively is there a way to have one binary that determines SIMD capabilities at runtime?

Kan Liu • Sep 8 '22

In example 2, the code was compiled with fma, why didn't the assembly show vfmaddps?

Thomas Poff • Nov 13 '23

Hi. Nice stuff, really.
I can get a lot to work but on the zm.sin() calls I get...

~/zig/zmathproj % zig build-exe -Dcpu=x86_64 -Drelease-fast=true test.zig
test.zig:40:15: error: root struct of file 'libs.zmath.build' has no member named 'sin'
const zz = zm.sin(zm.f32x16(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0));

Yeah I'm on a mac using zig 0.11.0 have gotten the library included and can call into it.

I can switch to Linux if need be just was thinking this should work since many other profiles I've tried have been ok.

I have no problems, for instance with-
const v0 = F32x4{ 1.0, 2.0, 3.0, 4.0 };
const v1 = F32x4{ 1.0, 2.0, 3.0, 4.0 };
const v2 = v0 + v1;
....
std.debug.print("{} \n", .{v2});

Works great. Any ideas why the trig functions are not a go? Thank you.
Getting this to work would be a boon to my laser imaging system, which in the past has was mostly an analog, quadrature burr-brown oscillator chips. I also used a digital system with six channels of D/A + now would like to convert over to digital, finally, since everything can be done digitally "kind of" in real time now (with big canned matrices of values for various complex additive+multiplicative sin+cos values across the x-y axis). Basically using this technique I can get 3D-like animations that noone would ever think of, just by twisting knobs, etc. If digital I could also add real 3D space operations to the core vector presentations.

Thus the interest in using a real matrix manipulation stuff (for trigonometry where applicable).