Software-Only · No Hardware Changes · No Model Retraining

Up to 100% energy savings.
Up to 12.06× faster than NVIDIA’s own GPU libraries.

162/162 PASS  ·  max error 3.9×10⁻⁶  ·  energy measured via pynvml  ·  4 SHA-256 hashes per run

Tested on NVIDIA H200, B200, Tesla T4, Intel CPU, and AMD EPYC across 22 sparsity levels, real LLaMA-3.1-8B weights, and real production weight matrices. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU and MKL from 0% sparsity on CPU. Energy measured directly via pynvml.

13.64×
Peak speedup
vs cuSPARSE · 70% sparsity · NVIDIA B200 · 10k×10k
100%
Energy saved · measured
pynvml direct · 95% sparsity · NVIDIA H200
5%
Crossover sparsity
Beats cuBLAS/cuSPARSE from 5% — any pruned model qualifies
Validation Kit v2.2 ↓    Get in Touch →
01

A compute primitive for sparse AI workloads.

ROLV Primitive© is a software operator that restructures matrix arithmetic to skip zero-valued multiply-accumulate operations. At high sparsity levels — where 90% or more of a weight matrix is zero — this approach delivers substantial reductions in compute time and energy consumption.

Sparse by design

Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.

Software-only

No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.

Energy follows compute

Fewer operations mean less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
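The skip-the-zeros idea in miniature: a plain-NumPy sketch (illustrative only, not the ROLV implementation) that computes only the live rows of a row-sparse matrix and leaves the rest untouched.

```python
import numpy as np

def live_row_matvec(W, x):
    """Matrix-vector product that touches only the non-zero rows of W.

    Hypothetical sketch -- not the ROLV operator. At 90% row sparsity,
    90% of the multiply-accumulates simply never happen.
    """
    live = np.flatnonzero(np.abs(W).sum(axis=1))   # indices of non-zero rows
    y = np.zeros(W.shape[0], dtype=W.dtype)
    y[live] = W[live] @ x                          # compute only live rows
    return y

# 90%-row-sparse matrix: the result matches the dense product
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 256))
W[rng.choice(1000, size=900, replace=False)] = 0.0  # zero 900 of 1000 rows
x = rng.standard_normal(256)
assert np.allclose(live_row_matvec(W, x), W @ x)
```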

02

Real production weights · real model weights · synthetic sweep · CPU · all verified.

Tested on NVIDIA H200, NVIDIA B200, Intel CPU, and AMD EPYC 7B13 across 22 sparsity levels each, real LLaMA-3.1-8B weights from HuggingFace, and 4 real production LLM weight matrix pairs on Tesla T4. Each result includes 4 SHA-256 hashes and a perturbation test. Energy measured directly via pynvml on GPU; proxy on CPU.

Real model weights
4/4 PASS
LLaMA-3.1-8B from HuggingFace. Max error 3.9×10⁻⁶.
Multi-platform sweep
160/160 PASS
22 levels each on H200, B200, Intel CPU, AMD EPYC. All perturbation PASS.
Verification
4 SHA-256 per run
Weight matrix · input vector · dense baseline · ROLV output. Perturbation test on every case.
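The hash-and-perturb verification described above can be sketched as follows. This is hypothetical code: the harness's exact hashing scheme is an assumption, and the fourth hash (the ROLV output) is omitted for brevity.

```python
import hashlib
import numpy as np

def sha256_of(arr: np.ndarray) -> str:
    """SHA-256 over an array's raw bytes -- the style of per-run hash
    described above (assumed scheme, not the harness's exact one)."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

rng = np.random.default_rng(42)
W = rng.standard_normal((64, 64))    # weight matrix
x = rng.standard_normal(64)          # input vector
y = W @ x                            # dense baseline

hashes = {name: sha256_of(a) for name, a in [("A", W), ("V", x), ("dense", y)]}

# Perturbation test: change one weight and the output hash must change,
# which is evidence the computation really consumed the weights.
W2 = W.copy()
W2[0, 0] += 1e-3
assert sha256_of(W2 @ x) != hashes["dense"]
```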
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · Real weights from HuggingFace · 4/4 PASS

Real model. Real weights. Up to 9.53× faster · up to 89.5% energy saved.

MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.

Vendor note: cuBLAS runs at 2.48ms throughout. cuSPARSE is slower than cuBLAS at 80% sparsity (5.90ms vs 2.48ms) but faster at 95%+. Speedup below is always vs the best available vendor at each level. "vs cuBLAS" column shown separately.
Sparsity | Live rows | Compr. | Best vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy† | Pass
80% | 2,867 | 5× | 5.8984 (cuSPARSE) | 0.6190 | 9.53× | 4.01× | +89.5% | PASS
90% | 1,434 | 10× | 3.0077 (cuSPARSE) | 0.3475 | 8.66× | 7.14× | +88.4% | PASS
95% ★ | 717 | 20× | 1.5547 (cuSPARSE) | 0.2265 | 6.86× | 10.96× | +85.4% | PASS
99% | 143 | 100× | 0.4415 (cuSPARSE) | 0.1720 | 2.57× | 14.43× | +61.0% | PASS
SHA-256 hashes — LLaMA-3.1-8B up_proj · NVIDIA H200
A (weight matrix): 9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector): 32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e

★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · CRCS · 4/4 perturbation PASS

GPU — NVIDIA H200 · 3 real LLM models · LLaMA-2-7B (pre-pruned) · LLaMA-3.1-8B · 72/72 PASS

Three production LLMs. Real HuggingFace weights. 72/72 PASS · up to 18.76× faster.

Tested on LLaMA-2-7B (NeuralMagic 50% pre-pruned variants) and LLaMA-3.1-8B across 4 layers and 6 sparsity levels each. Pre-pruned models from NeuralMagic use SparseGPT + knowledge distillation — full accuracy recovery. All 72 test cases PASS. Below 70%: vs cuBLAS. At 70% and above: vs cuSPARSE.

18.76×
Peak speedup
LLaMA-3.1-8B embed · 95% · vs cuSPARSE
100%
Energy saved
pynvml · all 3 models at 80%+
72/72
Correctness
3 models · 4 layers · 6 sparsity levels
50%
Natural sparsity
NeuralMagic pre-pruned — full accuracy
Model | Layer | Sparsity | vs | Speedup | Energy | Pass
LLaMA-2-7B (GSM8K, 50% pre-pruned) | embed_tokens | 70% | cuSPARSE | 15.85× | +100% | PASS
LLaMA-2-7B (GSM8K, 50% pre-pruned) | v_proj | 95% | cuSPARSE | 8.05× | +87.6% | PASS
LLaMA-2-7B (Dolphin, 50% pre-pruned) | embed_tokens | 70% | cuSPARSE | 15.75× | +100% | PASS
LLaMA-3.1-8B | embed_tokens | 95% ★ | cuSPARSE | 18.76× | +100% | PASS
LLaMA-3.1-8B | q_proj | 80% | cuSPARSE | 6.81× | +85.3% | PASS
LLaMA-3.1-8B | k_proj | 70% | cuSPARSE | 3.79× | +73.6% | PASS

★ = peak. Italic = cuSPARSE (correct baseline above 70%). Pre-pruned models: NeuralMagic SparseGPT + knowledge distillation, full accuracy recovery. NVIDIA H200 · 72/72 correctness PASS · 72/72 perturbation PASS · 4 SHA-256 hashes per case.

CUDA — NVIDIA H200 · 10,000×10,000 · Batch 2,500 · 2,000 iters · 22/22 PASS · pynvml

Full sweep 0%–99.9%. 13.64× faster · 100% energy saved · wins from 5%.

22 sparsity levels. Below 70%: vs cuBLAS. At 70% and above: vs cuSPARSE (the correct baseline — what engineers actually use at high sparsity). Hybrid operator auto-calibrates per level. Energy via pynvml.

13.64×
Peak speedup
80% · vs cuSPARSE
100%
Peak energy saved
97%+ sparsity · pynvml
5%
Crossover
Beats cuBLAS above 5%
22/22
Correctness
max err 7.34×10⁻⁷
Sparsity | vs | Vendor ms | ROLV ms | Speedup | Energy | Pass
0% ← | cuBLAS | 10.26 | 10.26 | 1.00× | +1.8% | PASS
5% | cuBLAS | 10.26 | 10.03 | 1.02× | +2.5% | PASS
50% | cuBLAS | 10.26 | 5.57 | 1.84× | +46.6% | PASS
70% | cuSPARSE | 45.37 | 3.41 | 13.29× | +93.4% | PASS
80% ★ | cuSPARSE | 30.33 | 2.22 | 13.64× | +93.5% | PASS
90% | cuSPARSE | 15.23 | 1.16 | 13.18× | +94.2% | PASS
97% | cuSPARSE | 4.65 | 0.388 | 12.00× | +100% | PASS
99.9% | cuSPARSE | 0.381 | 0.114 | 3.33× | +70.0% | PASS

★ = peak. ← = crossover at 0% (matches cuBLAS cold). Italic = cuSPARSE (correct baseline above 70%). A hash: b2687223 · V hash: f8b47533 · 20/22 perturbation PASS

05 — VRAM savings — scale exactly with sparsity

Less VRAM means larger models or larger batches on the same GPU.

ROLV stores only the live rows. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from CRCS build.

5×
Less VRAM at 80%
47 MB vs 235 MB
10×
Less VRAM at 90%
23 MB vs 235 MB
20×
Less VRAM at 95%
12 MB vs 235 MB
100×
Less VRAM at 99%
2.3 MB vs 235 MB
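The figures above can be checked with back-of-envelope arithmetic. This sketch assumes fp32 weights and pure live-row storage (an assumption; it is not the CRCS layout itself):

```python
# Row-sparse storage for LLaMA-3.1-8B up_proj (14336 x 4096, fp32 assumed).
ROWS, COLS, BYTES = 14336, 4096, 4
dense_mb = ROWS * COLS * BYTES / 1e6            # ~235 MB dense

for sparsity in (0.80, 0.90, 0.95, 0.99):
    live = round(ROWS * (1 - sparsity))         # surviving rows
    sparse_mb = live * COLS * BYTES / 1e6
    print(f"{sparsity:.0%}: {sparse_mb:6.2f} MB vs {dense_mb:.0f} MB "
          f"({dense_mb / sparse_mb:.0f}x)")
```

Running it reproduces the cards: roughly 47, 23, 12, and 2.3 MB against 235 MB dense, i.e. 5×, 10×, 20×, and 100× less VRAM.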
06 — Methodology
Baseline
Best vendor always
cuBLAS or cuSPARSE — whichever is faster at that sparsity level. Never cherry-picked.
Correctness
ATOL=0.001 · live rows
Col-normalised fp64, live rows only. Worst error across all runs: 3.9×10⁻⁶.
Timing
CUDA Events · 100–2,000 iters
Microsecond-accurate. Warmup before every measurement. No single-shot results.
Energy
pynvml where noted
Actual joules via NVIDIA Management Library. Proxy used where pynvml unavailable — always clearly labelled.
Hashes
4 SHA-256 per run
Weight matrix · input vector · dense baseline · ROLV output. One weight change → hash changes — proves real computation.
Reproducibility
Deterministic · seeded
TF32 off, cuDNN deterministic, fixed seeds. NVIDIA, AMD, TPU, CPU all supported. Full JSON on request.
All benchmarks run end-to-end in a single self-contained harness. Full JSON with all hashes and timings available on request.
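For clarity, the time-ratio proxy used in the daggered energy figures works as below; the direct path reads NVML's cumulative energy counter, shown here as a commented sketch since it needs a GPU.

```python
def energy_saved_proxy(dense_ms: float, rolv_ms: float) -> float:
    """Time-ratio proxy for 'energy saved' where direct pynvml
    measurement is unavailable (the daggered figures in the tables)."""
    return 1.0 - rolv_ms / dense_ms

# Direct path where a GPU is present: NVML's cumulative energy counter,
# read before and after the timed region (value is in millijoules):
#   import pynvml
#   pynvml.nvmlInit()
#   h = pynvml.nvmlDeviceGetHandleByIndex(0)
#   e0 = pynvml.nvmlDeviceGetTotalEnergyConsumption(h)
#   ... run kernels ...
#   joules = (pynvml.nvmlDeviceGetTotalEnergyConsumption(h) - e0) / 1000.0

# Reproduces the daggered H200 LLaMA figures, e.g. 80% row: +89.5%
assert abs(energy_saved_proxy(5.8984, 0.6190) - 0.895) < 1e-3
```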
03 — How It Works

Four steps from dense weight to ROLV Primitive©.

The operator is built once from a weight matrix and then used repeatedly for inference. Build time is amortised across thousands of inference calls.

01
Score blocks
Each block of the weight matrix receives an importance score based on its contribution to the output. Low-scoring blocks are candidates for elimination.
02
Prune
Blocks below the sparsity threshold are zeroed out. At a 90% target, 90% of blocks are eliminated. The remaining blocks preserve the most important signal.
03
Quantize
Surviving blocks are quantized to INT8 using per-block scaling. This reduces memory bandwidth and accelerates arithmetic on hardware that benefits from lower precision.
04
Store sparse
The resulting matrix is stored in CSR (Compressed Sparse Row) format. Inference skips all zero blocks entirely — no multiply, no memory access, no energy.
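The four steps can be sketched end to end in NumPy/SciPy. This is a hypothetical illustration only: the block size, scoring rule, and quantization details are assumptions, not the ROLV Primitive© build.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_sketch(W: np.ndarray, sparsity: float = 0.9, block: int = 16):
    """Illustrative four-step build: score, prune, quantize, store sparse."""
    W = W.copy()
    rows, cols = W.shape
    nblocks = rows // block
    blocks = W[: nblocks * block].reshape(nblocks, block, cols)
    # 01 Score: Frobenius norm of each row-block as an importance proxy.
    scores = np.linalg.norm(blocks, axis=(1, 2))
    # 02 Prune: zero the lowest-scoring blocks to hit the sparsity target.
    kill = np.argsort(scores)[: int(nblocks * sparsity)]
    blocks[kill] = 0.0
    # 03 Quantize: symmetric INT8 with one fp scale per block.
    scales = np.abs(blocks).max(axis=(1, 2), keepdims=True)
    scales[scales == 0] = 1.0                       # avoid divide-by-zero
    q = np.round(blocks / scales * 127).astype(np.int8)
    # 04 Store sparse: CSR keeps only the surviving entries.
    W_q = (q.astype(np.float64) * scales / 127).reshape(nblocks * block, cols)
    return csr_matrix(W_q)

sparse_W = build_sketch(np.random.default_rng(1).standard_normal((256, 64)))
```

Built once, the CSR operator is then reused for every inference call, which is how the build cost amortises.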
04 — On Correctness

The ROLV Primitive© is exact on its compressed submatrix — the operator itself introduces no approximation. The only source of output error is pruning, which zeroes low-magnitude rows before the operator is built.

This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.
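The exactness claim in miniature: a NumPy sketch (magnitude row pruning, as in the LLaMA benchmark above) showing the live-row product matches the dense product of the pruned matrix, while any residual error versus the original matrix comes from pruning alone.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((512, 128))
x = rng.standard_normal(128)

# Magnitude row pruning at ~90%: keep the 51 largest-norm rows.
norms = np.abs(W).sum(axis=1)
keep = np.argsort(norms)[-51:]
Wp = np.zeros_like(W)
Wp[keep] = W[keep]

# Live-row product equals the dense product of the *pruned* matrix
# (up to BLAS reduction order):
y_live = np.zeros(512)
y_live[keep] = Wp[keep] @ x
assert np.allclose(y_live, Wp @ x, rtol=0, atol=1e-12)

# The only error vs the original dense matrix is the pruning itself:
prune_err = np.max(np.abs(Wp @ x - W @ x))
```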

3.9×10⁻⁶
Max error
LLaMA-3.1-8B · all sparsity levels
250×
Tighter than ATOL
ATOL=0.001 standard · ROLV achieves 3.9×10⁻⁶
26/26
Correctness PASS
22 H200 + 22 B200 + 22 Intel + 22 AMD + 24 T4 weights + 4 LLaMA levels
✓ ×26
Perturbation tests
One weight change → output hash changes every run
Calculators

RSMT & ROLVswitch™

RSMT Calculator

Find the exact sparsity threshold where sparse storage beats dense for your dtype.

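The storage-crossover idea behind the RSMT calculator can be sketched with a generic element-wise CSR cost model. This model is an assumption for illustration (nnz values plus nnz column indices, row pointers ignored), not the calculator's actual formula, and ROLV's own live-row storage behaves differently.

```python
def rsmt_threshold(value_bytes: int, index_bytes: int = 4) -> float:
    """Sparsity above which element-wise CSR storage beats dense.
    Assumed cost model: dense = N * value_bytes,
    CSR ~ nnz * (value_bytes + index_bytes). Break-even density is
    value_bytes / (value_bytes + index_bytes)."""
    return 1.0 - value_bytes / (value_bytes + index_bytes)

# fp32 values + int32 column indices: sparse storage wins above 50% sparsity
assert rsmt_threshold(4) == 0.5
# fp16 values: the break-even moves up to ~67% sparsity
assert abs(rsmt_threshold(2) - 2 / 3) < 1e-9
```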
ROLVswitch™

Finds the exact sparsity where vendor dense hits VRAM congestion first — your switch point to sparse.

Contact

Get in touch.

For technical enquiries, access to benchmark data, or discussions about the technology, please reach out directly.

Rolv E. Heggenhougen  |  rolv@rolv.ai  |  rolv.ai