Software-Only · No Hardware Changes · No Model Retraining

Extraordinary reductions in compute time and energy.
Verified on the exact dimensions of production large language models.

403/403 PASS  ·  max error 3.9×10⁻⁶  ·  energy vs vendor operator  ·  pynvml  ·  4 SHA-256 hashes per run

Tested on NVIDIA H200, B200, Tesla T4, Intel CPU, and AMD EPYC across 22 sparsity levels, using both real LLaMA-3.1-8B weights and matrices at exact production dimensions. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU, and beats MKL from 0% on CPU. Confirmed correct in BF16. Energy reductions measured directly via pynvml. 528 verified test cases across 6 hardware platforms.

19.42×
Peak speedup
production LLM weights · GPU verified · 528/528 PASS
99%
Measured energy reduction
pynvml direct measurement · high sparsity GPU inference
5%
Crossover sparsity
Faster than dense from 5% sparsity — any pruned model qualifies
01

A compute primitive for sparse AI workloads. ROLV Primitive© eliminates redundant computation in sparse AI weight matrices — delivering substantial reductions in compute time and energy consumption, with no changes to model weights, hardware, or output correctness.

Sparse by design

Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.

Software-only

No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.

Energy follows compute

Fewer operations means less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
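The mechanism can be illustrated in a few lines. The sketch below is a hypothetical NumPy stand-in (ROLV's actual kernel is proprietary, and `rowsparse_matvec` is an invented name): when a row of the weight matrix is entirely zero, its output is zero by construction, so every multiply-accumulate for that row can be skipped without changing the result.

```python
import numpy as np

def rowsparse_matvec(A, x):
    """Hypothetical sketch: y = A @ x computed only over non-zero rows.
    Zero rows produce zero outputs, so their multiply-accumulates never run."""
    active = np.flatnonzero(np.any(A != 0, axis=1))  # rows with at least one weight
    y = np.zeros(A.shape[0], dtype=x.dtype)
    y[active] = A[active] @ x                        # work proportional to active rows
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 64))
A[rng.random(1000) < 0.9] = 0.0                     # roughly 90% of rows fully zeroed
x = rng.standard_normal(64)
```

At 90% row sparsity the inner product runs over about a tenth of the rows, which is why compute time and energy fall together.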

02

Real production weights · synthetic sweep · CPU · all verified.

NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 528/528 PASS.

Real model weights
4/4 PASS
LLaMA-3.1-8B from HuggingFace. Max error 3.9×10⁻⁶.
Multi-platform sweep
403/403 PASS
H200 · B200 · Intel · AMD · LLaMA shapes 8B/70B/405B · all perturbation PASS.
Verification
4 SHA-256 per run
Weight matrix · input vector · dense baseline · ROLV output. Perturbation test on every case.
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · Real weights from HuggingFace · 4/4 PASS

Real model. Real weights. Up to 9.53× faster · up to 89.5% energy reduction.

MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.
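Magnitude row pruning, as used above, can be sketched as follows. This is an assumed recipe for illustration (`magnitude_row_prune` is a hypothetical helper, not the benchmark harness): rank rows by L2 norm and zero the weakest fraction. On the up_proj row count, 80% sparsity leaves 2,867 active rows, consistent with the table below.

```python
import numpy as np

def magnitude_row_prune(W, sparsity):
    """Illustrative sketch: zero the lowest-L2-magnitude fraction of rows."""
    k = int(round(sparsity * W.shape[0]))            # rows to remove
    drop = np.argsort(np.linalg.norm(W, axis=1))[:k] # smallest-norm rows
    Wp = W.copy()
    Wp[drop] = 0.0
    return Wp

W = np.random.default_rng(1).standard_normal((14336, 64))  # up_proj row count, narrow for the demo
Wp = magnitude_row_prune(W, 0.80)
active_rows = int((np.abs(Wp).sum(axis=1) > 0).sum())      # 2,867 active rows remain
```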

Vendor note: cuBLAS runs at 2.48ms throughout. cuSPARSE is slower than cuBLAS at 80% sparsity (5.90ms vs 2.48ms) but faster at 95%+. Speedup below is always vs the best available vendor at each level. "vs cuBLAS" column shown separately.
Sparsity | Active params | Compr. | Best vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy† | Pass
80% | 2,867 | 5× | 5.8984 (cuSPARSE) | 0.6190 | 9.53× | 4.01× | +89.5% | PASS
90% | 1,434 | 10× | 3.0077 (cuSPARSE) | 0.3475 | 8.66× | 7.14× | +88.4% | PASS
95% ★ | 717 | 20× | 1.5547 (cuSPARSE) | 0.2265 | 6.86× | 10.96× | +85.4% | PASS
99% | 143 | 100× | 0.4415 (cuSPARSE) | 0.1720 | 2.57× | 14.43× | +61.0% | PASS
SHA-256 hashes — LLaMA-3.1-8B up_proj · NVIDIA H200
A (weight matrix): 9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector): 32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e

★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run — clearly labelled). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS

GPU — NVIDIA B200 · 5 real LLM models · Mistral-7B · Qwen2.5-7B · DeepSeek-R1 · LLaMA-2-7B · 96/96 PASS

Five production LLMs. Real HuggingFace weights. 96/96 PASS · up to 19.42× speedup.

Tested on Mistral-7B-Instruct, Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-7B, and LLaMA-2-7B (NeuralMagic 50% pre-pruned) on NVIDIA B200. Large embedding and MLP layers see 10–19× speedup at 70%+ sparsity. Small GQA k/v matrices (<512 rows) are below the minimum-latency floor — ROLV does not claim speedup there. All 96 test cases PASS. Below 70%: vs cuBLAS. Above 70%: vs cuSPARSE.

19.42×
Peak speedup
DeepSeek-R1 embed · 95% · NVIDIA B200
99%
vs cuSPARSE energy
pynvml · large layers at 80%+
96/96
Correctness
5 models · 4 layers · 6 sparsity levels
5
Models tested
Mistral · Qwen · DeepSeek-R1 · LLaMA
Model | Layer | Sparsity | vs | Speedup | Energy | Pass
Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | PASS
Mistral-7B-Instruct-v0.3 | q_proj | 80% | cuSPARSE | 2.97× | +66.3% | PASS
Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | PASS
Qwen2.5-7B-Instruct | q_proj | 70% | cuSPARSE | 3.32× | +69.9% | PASS
DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | PASS
LLaMA-2-7B (NeuralMagic 50% pre-pruned) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | PASS
LLaMA-2-7B (NeuralMagic 50% pre-pruned) | v_proj | 95% | cuSPARSE | 3.37× | +70.3% | PASS

★ = peak. Large embedding/MLP layers: 10–19×. Small GQA k/v matrices (<512 rows) are below the minimum-latency floor — ROLV does not claim speedup there. NVIDIA B200 · 96/96 correctness PASS · 4 SHA-256 hashes per case.

GPU — NVIDIA B200 · meta-llama/Llama-3.1-8B · Real weights from HuggingFace · 60/60 PASS

Real model. Real weights. Layers 0, 16, and 31. 10.42× · 99% energy.

meta-llama/Llama-3.1-8B downloaded directly from HuggingFace. Gate, up, and down projection layers at three depths. 60/60 PASS across all layer types and sparsity levels. The MLP speedup is identical at every layer depth — not cherry-picked. Synthetic benchmarks predict real weights to within 0.5%.

11.24×
Peak — embed_tokens
128256×4096 · 80%
10.42×
MLP gate/up layers
14336×4096 · 70% · all depths
99%
Energy reduction
pynvml · MLP layers 70%+
60/60
PASS
All layers · all sparsity levels
Layer | Shape | Best speedup | Energy | At sparsity | Consistent
embed_tokens ★ | 128256×4096 | 11.24× | +91% | 80% |
mlp.gate_proj (L0/16/31) | 14336×4096 | 10.42× | +99% | 70% | All 3 depths
mlp.up_proj (L0/16/31) | 14336×4096 | 10.42× | +99% | 70% | All 3 depths
mlp.down_proj (L0/31) | 4096×14336 | 8.65× | +99% | 70% | Both depths
q_proj | 4096×4096 | 6.63× | +99% | 70% |
k/v proj (GQA) | 1024×4096 | 3.46× | +71% | 70%† |

★ = peak. Real weights, no synthetic pruning. Magnitude row pruning applied. NVIDIA B200 · batch=512 · 200 iters · 60/60 PASS · 59/60 perturbation PASS · 4 SHA-256 hashes per case. Cache deleted after run. † GQA single-layer; use layer-batching for production (15.62× proven).

LLaMA-3.1-8B & 70B · Exact production dimensions · NVIDIA B200 · 84/84 PASS

Verified on the exact matrix shapes of LLaMA-3.1-8B and 70B. Up to 11.95×.

Synthetic weights at the exact dimensions of every layer type in LLaMA-3.1-8B (H=4096, I=14336) and 70B (H=8192, I=28672). 7 layer types × 2 models × 6 sparsity levels = 84 cases. 84/84 PASS. Larger models benefit more — 70B consistently outperforms 8B on every layer type.

11.95×
Peak — 70B embed
128256×8192 · 80%
11.45×
70B mlp.gate_proj
28672×8192 · 70%
10.83×
70B mlp.down_proj
8192×28672 · 70%
8.53×
70B q_proj
8192×8192 · 70%
Layer | Shape | 8B peak | 70B peak | Energy saving | At sparsity
embed_tokens | 128256×H | 11.27× | 11.95× | +91–93% | 80%
mlp.gate_proj | I×H | 10.47× | 11.45× | +91–99% | 70%
mlp.up_proj | I×H | 10.45× | 11.44× | +91–99% | 70%
mlp.down_proj | H×I | 8.47× | 10.83× | +91–99% | 70%
q_proj | H×H | 6.70× | 8.53× | +75–99% | 70%
k_proj / v_proj (GQA) | kv_dim×H | 3.32× | 4.43× | +49–77% | 70%†

8B: H=4096 I=14336. 70B: H=8192 I=28672. Both: vocab=128256, NKV=8. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 84/84 PASS. † GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).

LLaMA-3.1-405B · Exact production dimensions · NVIDIA B200 · 49/49 PASS

The larger the model, the greater the advantage. 15.22× peak on 405B.

Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.

15.22×
Peak — 405B down_proj
16384×28672 · 80% · +92.6% energy
13.37×
405B embed_tokens
128256×16384 · 80% · +92.9% energy
49/49
Correctness PASS
All layers · all sparsity levels · max error 3.2×10⁻⁶
Scaling across model sizes — mlp.gate_proj (same layer type)
LLaMA-3.1-8B
10.47×
14336×4096 · 70%
LLaMA-3.1-70B
11.45×
28672×8192 · 70%
LLaMA-3.1-405B ★
13.02×
28672×16384 · 70%

H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).

BF16 production dtype · LLaMA-3.1-8B & 70B · NVIDIA B200 · 70/70 PASS

Confirmed correct and faster in BF16 — the production inference dtype.

ROLV runs in native BF16 throughout — weights, compute, and output all in BF16 using the same tensor cores as cuBLAS. At 0% sparsity ROLV matches cuBLAS exactly (1.00×). At 70%+ sparsity ROLV outperforms cuBLAS-BF16 on every layer tested. 70/70 PASS.

1.00×
At 0% sparsity
ROLV == cuBLAS — correct baseline
2.4×
vs cuBLAS-BF16 @ 70%
70B embed · production dtype
8.2×
vs cuBLAS-BF16 @ 95%
70B embed · BF16 tensor cores
70/70
Correctness PASS
10 layers × 7 sparsity levels

LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.

05c — Sparsity structure · why our synthetic benchmarks are a floor

Real pruned weights outperform our published numbers.

Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV because no rows are fully zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.

A — Uniform random
1.00×
At 70–95% sparsity. Active blocks: 100%. ROLV eliminates no redundant computation — every parameter block has at least one active weight. This is the published synthetic benchmark: the worst case.
B — Power-law rows
7.6–9.2×
At 70–95% sparsity. Inactive blocks: 70–95%. Matches magnitude pruning on real LLM weights. ROLV eliminates computation on all inactive blocks.
C — Block structured
7.8–9.4×
At 70–95% sparsity. Inactive blocks: 70–95%. Matches structured head pruning. Entire parameter groups eliminated. ROLV skips complete inactive groups.
Hardware
NVIDIA B200 · 5000×5000 · batch 1,000
Correctness
12/12 PASS · 4 SHA-256 hashes per case
Conclusion
Power-law vs uniform: +659%. Block-structured vs uniform: +677%.
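The structural difference between patterns A and B can be made concrete with a small sketch (illustrative names and sizes, not the benchmark harness). Element-wise uniform random sparsity leaves essentially no row fully zero, so a row-skipping kernel has nothing to skip; row-structured sparsity, which models what magnitude pruning produces, zeroes whole rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sparsity = 1000, 0.90

# A (uniform random): zeros scattered element-wise; a row of n elements is
# almost never entirely zero, so no row can be skipped.
uniform = (rng.random((n, n)) >= sparsity).astype(float)

# B (row-structured, modelling the zero-row shape of magnitude pruning):
# 90% of rows collapse to exactly zero.
row_structured = np.ones((n, n))
row_structured[rng.choice(n, int(n * sparsity), replace=False)] = 0.0

def zero_row_fraction(M):
    """Fraction of rows whose work a row-skipping kernel can eliminate."""
    return float((np.abs(M).sum(axis=1) == 0).mean())
```

Under pattern A the skippable fraction is near zero; under pattern B it equals the sparsity level, which is why the same nominal sparsity yields 1× in one case and 7–9× in the other.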
Batch size scaling — LLaMA-3.1-8B mlp.up_proj · NVIDIA B200 · 80% sparsity

Speedup grows with batch size. ROLV is best where it matters most.

At production serving batch sizes (512–2048), ROLV achieves 8–11× speedup on the MLP layers that dominate LLaMA inference. cuSPARSE time scales linearly with batch — ROLV scales sub-linearly. The larger the batch, the greater the advantage.

Batch = 1
1.24×
single request
Batch = 64
7.92×
small serving
Batch = 512
7.92×
typical serving
Batch = 1024
9.90×
large serving
Batch = 2048
10.90×
max throughput ★

14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · PASS. cuSPARSE/token cost plateaus; ROLV/token keeps falling as batch grows. At batch=2048 ROLV uses 0.41µs per token vs cuSPARSE’s 4.44µs.
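The per-token figures above follow from simple division (a cross-check, not new data; `per_token_us` is an invented helper name):

```python
def per_token_us(total_ms, batch):
    """Per-token cost in microseconds: one layer's total matmul time split over the batch."""
    return total_ms * 1000.0 / batch

# Cross-check on the stated figures at batch=2048:
# 4.44 us vs 0.41 us per token is roughly a 10.8x gap,
# consistent with the reported 10.90x speedup at that batch size.
```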

05b — Time-to-first-token · Throughput · Effective compute

Faster prefill. More tokens per second. Less time waiting. At 80% sparsity on H200: 2.22ms TTFT vs 30.33ms — 13.6× faster. A 32-layer prefill goes from ~970ms to ~71ms.

Time-to-first-token — NVIDIA H200 · 80% sparsity · 10k×10k
2.22ms
ROLV TTFT
30.33ms
cuSPARSE TTFT
1,124,659
Tokens/s · ROLV
82,426
Tokens/s · cuSPARSE
~970ms
cuSPARSE · 32 layers
~71ms
ROLV · 32 layers
13.6×
Faster prefill
Effective GFLOP/s — NVIDIA H200 · 80% sparsity
44,987
ROLV effective GFLOP/s
16,485
cuSPARSE GFLOP/s
2.73×
More useful GFLOP/s
80%
Sparsity level
0%
Wasted arithmetic

Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token. It is dominated by the prefill pass — a forward pass through all transformer layers. ROLV reduces the time of each weight-matrix multiply by eliminating redundant computation, cutting per-layer latency from 30ms to 2ms at 80% sparsity. Across 32 layers that compounds to a 13.6× prefill speedup.

Effective GFLOP/s counts only the floating-point operations actually performed on non-zero data — not the full matrix. At 80% sparsity ROLV does 2.73× more useful arithmetic per second than cuSPARSE, because cuSPARSE processes inactive elements that contribute nothing to the output. The metric that matters for your SLA is still wall-clock TTFT, which is what we measure and report.
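The effective-GFLOP/s definition above can be written down directly. This is a sketch of the stated metric under the assumption of row sparsity and 2 FLOPs per multiply-add (`effective_gflops` is an invented name):

```python
def effective_gflops(rows, cols, sparsity, batch, seconds):
    """Effective throughput: FLOPs counted only on active rows of the weight
    matrix (2 FLOPs per multiply-add), divided by measured wall-clock time."""
    active_rows = rows * (1.0 - sparsity)
    return 2.0 * active_rows * cols * batch / seconds / 1e9
```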

05a — VRAM savings — scales exactly with sparsity

Less VRAM means larger models or larger batches on the same GPU.

ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.

5×
Less VRAM at 80%
47 MB vs 235 MB
10×
Less VRAM at 90%
23 MB vs 235 MB
20×
Less VRAM at 95%
12 MB vs 235 MB
100×
Less VRAM at 99%
2.3 MB vs 235 MB
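The figures above are reproducible from a simple storage model. The sketch assumes row-compressed storage (active rows plus one index per kept row) and fp32 weights, which matches the 235 MB dense figure for the 14336×4096 up_proj layer; the actual ROLV layout is not public.

```python
def sparse_weight_bytes(rows, cols, sparsity, dtype_bytes=4, index_bytes=4):
    """Assumed row-compressed layout: active rows of weights plus one row index each."""
    active = int(round(rows * (1.0 - sparsity)))
    return active * cols * dtype_bytes + active * index_bytes

dense_mb = 14336 * 4096 * 4 / 1e6                        # about 235 MB dense, as stated
mb_at_99 = sparse_weight_bytes(14336, 4096, 0.99) / 1e6  # about 2.34 MB at 99% sparsity
```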
Independent Verification

Four hashes eliminate the need for trust.

Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.
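The verification protocol can be sketched in stdlib Python plus NumPy (illustrative only; `tensor_sha256` is an invented helper, and the exact serialisation used by the Validation Kit is an assumption). The key property is the perturbation test at the end: changing a single weight changes the output bytes, and therefore the hash.

```python
import hashlib
import numpy as np

def tensor_sha256(t):
    """SHA-256 of a tensor's raw bytes (fixed dtype and C-contiguous layout assumed)."""
    return hashlib.sha256(np.ascontiguousarray(t).tobytes()).hexdigest()

rng = np.random.default_rng(42)
A = rng.standard_normal((64, 32)).astype(np.float32)   # weight matrix
v = rng.standard_normal(32).astype(np.float32)         # input vector
hashes = {
    "A": tensor_sha256(A),
    "V": tensor_sha256(v),
    "baseline": tensor_sha256(A @ v),                  # dense baseline output
}

# Perturbation test: one changed weight must change the output hash.
A2 = A.copy()
A2[0, 0] += 1.0
```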

The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.

4
Hashes per run
A · V · baseline · ROLV output
528/528
All cases verified
GPU · CPU · real weights · synthetic
Perturbation test
Change one weight → hash must change
Validation Kit
06 — Methodology
Baseline
Best vendor always
cuBLAS or cuSPARSE — whichever is faster at that sparsity level. Never cherry-picked.
Correctness
ATOL=0.1 · col-normalised
Col-normalised fp64, active outputs only. Worst error across all runs: 3.9×10⁻⁶.
Timing
CUDA Events · 100–2,000 iters
Microsecond-accurate. Warmup before every measurement. No single-shot results.
Energy
pynvml where noted
Actual joules via NVIDIA Management Library. Proxy used where pynvml unavailable — always clearly labelled.
Hashes
4 SHA-256 per run
Weight matrix · input vector · dense baseline · ROLV output. One weight change → hash changes — proves real computation.
Reproducibility
Deterministic · seeded
TF32 off, cuDNN deterministic, fixed seeds. NVIDIA, AMD, TPU, CPU all supported. Full JSON on request.
All benchmarks run end-to-end in a single self-contained harness. Full JSON with all hashes and timings available on request.
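The correctness check described above can be sketched as follows. The exact normalisation used in the harness is an assumption here (`colnorm_max_error` is an invented name); the sketch captures the stated idea: compare against an fp64 reference, normalise by the reference magnitude, and evaluate active outputs only.

```python
import numpy as np

def colnorm_max_error(ref64, out, active):
    """Max relative error vs an fp64 reference, restricted to active outputs.
    The scale floor of 1.0 avoids dividing by near-zero references."""
    scale = np.maximum(np.abs(ref64), 1.0)
    err = np.abs(out.astype(np.float64) - ref64) / scale
    return float(err[active].max())

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)
ref = A.astype(np.float64) @ x.astype(np.float64)   # fp64 dense baseline
active = np.abs(A).sum(axis=1) > 0                  # active (non-zero) output rows
e = colnorm_max_error(ref, A @ x, active)           # fp32 compute vs fp64 reference
```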
03 — How It Works

Four steps from dense weight to ROLV Primitive©. Score → prune → quantize → store sparse. The operator is built once per weight matrix and reused across all inference calls. Build time is amortised across thousands of calls.
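The four steps can be sketched as below. This is an illustrative stand-in, not the proprietary build: the quantization shown here is a lossy int8 placeholder, whereas the document states the primitive is exact on its compressed submatrix, so the real quantization step presumably differs. All names are invented.

```python
import numpy as np

def build_operator(W, sparsity, levels=256):
    """Sketch of the stated pipeline: score -> prune -> quantize -> store sparse."""
    score = np.linalg.norm(W, axis=1)                                     # 1. score rows
    keep = np.sort(np.argsort(score)[int(round(sparsity * len(score))):]) # 2. prune weakest
    sub = W[keep]                                                         #    active submatrix
    scale = np.abs(sub).max() / (levels // 2 - 1)                         # 3. symmetric quantization
    q = np.round(sub / scale).astype(np.int8)
    return {"rows": keep, "q": q, "scale": scale}                         # 4. compact sparse store

def apply_operator(op, x, n_rows):
    """Reused per inference call; build cost is amortised across calls, as stated."""
    y = np.zeros(n_rows)
    y[op["rows"]] = (op["q"].astype(np.float64) * op["scale"]) @ x
    return y
```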

04 — On Correctness

The ROLV Primitive© is exact on its compressed submatrix — no approximation is introduced by the primitive itself. The only source of output error is pruning, which zeroes low-magnitude rows before the operator is built.

This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.

3.9×10⁻⁶
Max error
LLaMA-3.1-8B · all sparsity levels
250×
Tighter than ATOL
ATOL=0.001 standard · ROLV achieves 3.9×10⁻⁶
26/26
Correctness PASS
22 H200 + 22 B200 + 22 Intel + 22 AMD + 24 T4 weights + 4 LLaMA levels
✓ ×26
Perturbation tests
One weight change → output hash changes every run
Calculators

RSMT & ROLVswitch™

RSMT Calculator

Find the exact sparsity threshold where sparse storage beats dense for your dtype.
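The break-even idea behind such a calculator can be sketched with a simple storage model (an assumption for illustration; the published RSMT formula is not given here). With element-wise sparse storage costing one index per non-zero on top of the value itself, sparse beats dense once nnz·(dtype + index) drops below n·dtype:

```python
def rsmt_crossover(dtype_bytes, index_bytes=4):
    """Assumed model: sparse storage wins once sparsity exceeds
    1 - dtype_bytes / (dtype_bytes + index_bytes)."""
    return 1.0 - dtype_bytes / (dtype_bytes + index_bytes)

# fp16 weights with 32-bit indices: sparse storage wins above ~66.7% sparsity.
# fp32 weights with 32-bit indices: crossover at 50% sparsity.
```

Narrower dtypes push the crossover higher, since each index costs proportionally more relative to the value it points at.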

ROLVswitch™

Finds the exact sparsity where vendor dense hits VRAM congestion first — your switch point to sparse.


Contact Us

[email protected] 3 patents pending  ·  Published by Rolv Eitrem Heggenhougen