
Layer-wise Acceleration of Clifford Neural Networks

Hand-optimised C99 + AVX2 implementation of geometric-algebra neural-network layers, benchmarked against a roofline model and validated bit-perfect against the reference PyTorch implementation of Clifford Neural Networks.

ETH Zürich, Advanced Computing Lab · Feb 2025 – Jun 2025


Stack

  • C
  • AVX2
  • Python
  • PyTorch

Role

Course project

Team

4 people

Repository

Overview

Clifford Neural Networks (Brandstetter et al., Ruhe et al.) represent data, weights, and transformations as multivectors instead of scalars. Each multivector is a linear combination of basis blades (scalars, vectors, bivectors, higher-grade primitives), which makes the model naturally good at PDE modelling and physical simulation. The cost is brutal. Each feature has 2ᵈ components, the geometric product creates dense cross-blade coupling, and the reference PyTorch implementation of these layers is far too slow for real-time or large-batch inference.
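
For concreteness, here is one illustrative layout for d = 3: eight blade coefficients per feature, one per basis blade. The struct and field names are hypothetical, not the project's API.

```c
/* Hypothetical multivector layout for d = 3: 2^3 = 8 blade
 * coefficients per feature, stored contiguously so channels of
 * multivectors can be processed with stride-8 vector loads. */
typedef struct {
    float s;                  /* scalar,    grade 0 */
    float e1, e2, e3;         /* vectors,   grade 1 */
    float e12, e13, e23;      /* bivectors, grade 2 */
    float e123;               /* trivector, grade 3 */
} multivector3;               /* 8 floats = 2^d components */
```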

We started from the open-source CliffordLayers reference and rewrote the three bottleneck layers (Clifford Linear, Multivector SLU, Clifford Group Normalisation) in hand-tuned C99 with AVX2 intrinsics. The kernels drop into the existing Python codebase via ctypes and are checked bit-perfect against the PyTorch reference on every commit.

What was built

1. Rewriting three bottleneck layers in C99

Each layer was translated from the PyTorch reference into hand-tuned C99 and exposed back to Python through ctypes, so the existing training and inference scripts run unchanged.

  • Linear projection: a geometric-product matrix multiply between learned multivector weights and input multivectors (a scalar sketch follows this list).
  • Multivector SLU: depth-wise convolution over blades followed by a learned sigmoid gate.
  • Group normalisation: per-group covariance, Cholesky decomposition, back substitution, and optional weight scaling.
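
To make the cross-blade coupling concrete, here is a minimal scalar sketch of the linear projection, assuming a precomputed Cayley table for the algebra: for each pair of basis blades (a, b), the table gives the output blade index and the metric-dependent sign. All names and the layout are illustrative, not the project's actual kernels.

```c
#include <stddef.h>

#define NBLADES 8   /* 2^d for d = 3 */

/* Dense geometric-product linear layer for one sample: every weight
 * blade couples with every input blade, routed through the Cayley
 * table (output blade index + sign per blade pair). */
void clifford_linear(const float *x,   /* [in_ch][NBLADES]         */
                     const float *w,   /* [out_ch][in_ch][NBLADES] */
                     float *y,         /* [out_ch][NBLADES]        */
                     size_t in_ch, size_t out_ch,
                     const int   cayley_blade[NBLADES][NBLADES],
                     const float cayley_sign[NBLADES][NBLADES])
{
    for (size_t o = 0; o < out_ch; o++) {
        float *yo = &y[o * NBLADES];
        for (int b = 0; b < NBLADES; b++) yo[b] = 0.0f;

        for (size_t i = 0; i < in_ch; i++) {
            const float *xi = &x[i * NBLADES];
            const float *wi = &w[(o * in_ch + i) * NBLADES];
            for (int a = 0; a < NBLADES; a++)
                for (int b = 0; b < NBLADES; b++)
                    yo[cayley_blade[a][b]] +=
                        cayley_sign[a][b] * wi[a] * xi[b];
        }
    }
}
```

The no-kernel path described below specialises exactly this loop nest: zero table entries are skipped and shared coefficients are factored out instead of materialising a kernel matrix.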

2. SIMD and cache optimisation

Progressive optimisations across all three layers, validated bit-perfect after every change.

  • Vectorised dot products with multiple accumulators for instruction-level parallelism (sketched after this list), and flattened weight layouts that stay friendly to the cache hierarchy.
  • Cache-resident tiling for the SLU and a sparse multiplication scheme with index maps to skip zero blades.
  • A small lookup table for the sigmoid that replaces a transcendental call with a couple of array loads, plus a permutation trick that runs many small Cholesky decompositions in parallel.
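
A minimal sketch of the multi-accumulator pattern, assuming AVX2 with FMA (-mavx2 -mfma) and a length that is a multiple of 32; the function name is illustrative. Four independent FMA chains hide the fused multiply-add latency so the vector ports stay busy.

```c
#include <immintrin.h>
#include <stddef.h>

float dot_avx2(const float *a, const float *b, size_t n)
{
    /* Four independent accumulators break the loop-carried FMA
     * dependency chain: instruction-level parallelism. */
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();

    for (size_t i = 0; i < n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                               _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                               _mm256_loadu_ps(b + i + 24), acc3);
    }

    /* Reduce the four accumulators to one scalar. */
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                             _mm256_add_ps(acc2, acc3));
    __m128 r = _mm_add_ps(_mm256_castps256_ps128(s),
                          _mm256_extractf128_ps(s, 1));
    r = _mm_add_ps(r, _mm_movehl_ps(r, r));
    r = _mm_add_ss(r, _mm_shuffle_ps(r, r, 1));
    return _mm_cvtss_f32(r);
}
```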

3. No-kernel reformulation and verification

A second path for the Linear layer that skips constructing the kernel matrix, plus a verification harness that gates every change.

  • The no-kernel path indexes weights directly against input blades, factors out shared coefficients, and exploits the symmetries of the geometric product.
  • A test harness runs thousands of random tensors per layer across grades, channel counts, and batch sizes, and the bit-perfect comparison against the reference is a hard gate in CI, not a sanity check (a minimal comparison sketch follows).
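
The gate itself is simple; a minimal sketch, assuming both paths emit float32 buffers (the helper name is hypothetical). Comparing bit patterns rather than an epsilon is what makes it a hard gate:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Returns 1 iff every float in opt is bit-identical to ref: no
 * tolerance, any differing bit fails the gate. */
int bit_identical(const float *ref, const float *opt, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t r, o;
        memcpy(&r, &ref[i], sizeof r);   /* well-defined type pun */
        memcpy(&o, &opt[i], sizeof o);
        if (r != o)
            return 0;
    }
    return 1;
}
```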

Technical details

Three layers, all rewritten in C with vector intrinsics, cycle-counted layer-by-layer, and validated bit-perfect against the reference.

  • Target: a commodity x86 laptop CPU, vectorised for 256-bit AVX2 registers
  • Linear and GroupNorm peak: ~7 FLOPS per cycle (compute-bound)
  • SLU peak: ~0.9 FLOPS per cycle (memory-bound on gather pressure)
  • Verification: bit-perfect parity against the reference across thousands of random tensors per layer

Key technical decisions

  • No-kernel path for the Linear layer: Skipping the explicit kernel matrix lets us factor out shared coefficients across input blades and reuse symmetry. Big win when kernel-construction time dominates.
  • Lookup table for the sigmoid: The transcendental call was the single dominant cost in the SLU. Replacing it with a small lookup table cut a divide and an exponential down to a couple of array loads, well within the accuracy budget (sketched after this list).
  • Permutation for batched Cholesky: A single Cholesky has serial dependencies that block vectorisation. Permuting the data so many decompositions run in parallel unlocks the vector unit at the cost of a little extra memory traffic.
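
A sketch of the lookup-table sigmoid, assuming inputs are clamped to a fixed range and linear interpolation between entries is within the accuracy budget; the table size and range are illustrative choices, not the project's tuned values.

```c
#include <math.h>

#define LUT_SIZE 1024
#define LUT_MIN  (-8.0f)
#define LUT_MAX  ( 8.0f)

static float sigmoid_lut[LUT_SIZE + 1];

/* expf and the divide are paid once here, never in the hot loop. */
void sigmoid_lut_init(void)
{
    for (int i = 0; i <= LUT_SIZE; i++) {
        float x = LUT_MIN + (LUT_MAX - LUT_MIN) * (float)i / LUT_SIZE;
        sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

/* Two array loads and one interpolation replace the transcendental
 * call; outside the table range the sigmoid is saturated anyway. */
static inline float sigmoid_fast(float x)
{
    if (x <= LUT_MIN) return 0.0f;
    if (x >= LUT_MAX) return 1.0f;
    float t = (x - LUT_MIN) * (LUT_SIZE / (LUT_MAX - LUT_MIN));
    int   i = (int)t;
    float f = t - (float)i;
    return sigmoid_lut[i] + f * (sigmoid_lut[i + 1] - sigmoid_lut[i]);
}
```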

Results

  • ~7 FLOPS/cycle: Linear & GroupNorm, compute-bound
  • 5.5×: AVX2 speedup over the Python baseline (Linear)
  • <1e-6: numerical parity vs the reference PyTorch CliffordLayers, gated bit-perfect in CI
  • ~0.9 FLOPS/cycle: SLU ceiling, gather-bound. Load ports are saturated, not compute.

Challenges & tradeoffs

  • Cross-blade coupling: The geometric product creates dense dependencies across 2ᵈ blades with sign and metric rules. A naive translation touches all 2ᵈ × 2ᵈ blade pairs for every pair of channels, so each layer needs custom kernel design.
  • Sparse-gather bottleneck on SLU: Even with the LUT, tiling, and inverse-index map, the sparse dot product still saturates the load ports with random gathers. Memory latency caps performance here, not compute.
  • Serial dependencies in GroupNorm: Cholesky and back-substitution can't be vectorised within a single decomposition. Permuting the data so 8 or 32 decompositions run at once unblocks AVX2 at the cost of extra memory traffic, but it's worth it (a minimal sketch follows).
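
A minimal sketch of the permutation idea, assuming a structure-of-arrays layout in which element (i, j) of eight matrices shares one 256-bit register, so a textbook in-place Cholesky runs on all eight at once. The dimension, layout, and name are illustrative; inputs are assumed symmetric positive-definite.

```c
#include <immintrin.h>

#define N 4   /* matrix dimension; each entry packs 8 matrices */

/* a[i][j] holds element (i, j) of 8 covariance matrices and is
 * overwritten in place with the lower-triangular Cholesky factors
 * of all 8 at once. Requires -mavx2 -mfma. */
void cholesky_x8(__m256 a[N][N])
{
    for (int j = 0; j < N; j++) {
        __m256 d = a[j][j];
        for (int k = 0; k < j; k++)
            d = _mm256_fnmadd_ps(a[j][k], a[j][k], d); /* d -= L[j][k]^2 */
        __m256 ljj = _mm256_sqrt_ps(d);                /* 8 sqrts, one op */
        a[j][j] = ljj;

        for (int i = j + 1; i < N; i++) {
            __m256 s = a[i][j];
            for (int k = 0; k < j; k++)
                s = _mm256_fnmadd_ps(a[i][k], a[j][k], s);
            a[i][j] = _mm256_div_ps(s, ljj);           /* 8 divides, one op */
        }
    }
}
```

The serial j/k dependencies are still there, but every instruction now does eight decompositions' worth of work, which is the whole point of the permutation.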

What I learned

  • Speedups should be measured against the roofline, not the slow baseline. A 3× speedup over a baseline that was already far from the ceiling is not the same as approaching the ceiling.
  • Eliminating overhead beat SIMD in raw impact. Skipping kernel construction, replacing expf with a lookup table, and removing scatter buffers each mattered more than the AVX2 work after them.
  • Vectorised kernels can still be memory-bound. The SLU saturated load ports on gather instructions, and adding accumulators didn't help.
  • Bit-perfect parity against the reference PyTorch CliffordLayers is cheap to gate in CI and catches numerical drift that approximate checks don't.

Gallery

  • 01/03: Clifford Linear Layer optimisations across batch sizes. Vectorising the dot product with multiple accumulators reaches ~7 FLOPS per cycle, about 5.5× over the Python baseline. Compute-bound regime confirmed by the roofline.
  • 02/03: Multivector SLU. The index map, the lookup-table sigmoid, and cache-resident tiling push throughput to ~0.9 FLOPS per cycle. Sparse gathers saturate the load ports, so the kernel is memory-bound rather than compute-bound, and adding accumulators doesn't help.
  • 03/03: Clifford Group Normalisation. Running many small Cholesky decompositions in parallel and fusing the surrounding loops brings throughput to ~7 FLOPS per cycle, near the architectural roof.
