No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.
Energy follows compute
Fewer operations mean less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
02 — Benchmark Results
Real production weights and synthetic sweep · all verified.
NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 550/550 PASS.
Baseline selection: below 70% sparsity we compare ROLV™ to cuBLAS — the operator production inference engines use for dense or lightly sparse weights. At 70% and above we compare to cuSPARSE — the operator production inference engines deploy specifically for sparse weight matrices, regardless of whether cuBLAS is faster in raw timing at that level. Comparing against cuBLAS above 70% would mean measuring ROLV™ against an operator that computes wasted arithmetic on zero values: accurate in a lab, but not what any real inference engine does. Both vendor timings are recorded and published in every result.
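As a minimal sketch, the baseline-selection rule above reduces to a single threshold check (the 70% cut-off is the one stated here; the function name is hypothetical, not part of any shipped API):

```python
# Sketch of the baseline-selection rule described above (illustrative only).
def select_vendor_baseline(sparsity: float) -> str:
    """Pick the vendor operator a production inference engine would deploy."""
    if sparsity < 0.70:
        return "cuBLAS"    # dense / lightly sparse weights
    return "cuSPARSE"      # sparse weights, even where cuBLAS wins on raw timing

print(select_vendor_baseline(0.50), select_vendor_baseline(0.80))  # cuBLAS cuSPARSE
```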
550/550 PASS · all verified · 6 platforms · real LLaMA weights · max error 9.87×10⁻⁷
4 SHA-256 verification hashes per case — weight matrix · input vector · dense baseline · ROLV output — plus a perturbation test for every case
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · Real weights from HuggingFace · 4/4 PASS
Real model. Real weights. Up to 9.53× faster · up to 89.5% energy reduction.
MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.
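The pruning recipe described here can be sketched in a few lines of NumPy. This is an illustration of magnitude row pruning only, on a small stand-in matrix; the production pipeline may prune per-element or per-group instead:

```python
import numpy as np

def magnitude_row_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the rows of W with the smallest L2 norms until the requested
    fraction of rows is zero (a sketch, not the production recipe)."""
    n_prune = int(round(W.shape[0] * sparsity))
    prune_idx = np.argsort(np.linalg.norm(W, axis=1))[:n_prune]
    W_pruned = W.copy()
    W_pruned[prune_idx, :] = 0.0
    return W_pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 16))          # small stand-in for the 14336x4096 layer
zero_rows = int((np.abs(magnitude_row_prune(W, 0.80)).sum(axis=1) == 0).sum())
print(zero_rows)  # 80: exactly 80% of rows fully zeroed
```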
Vendor note: cuBLAS runs at 2.48ms throughout. cuSPARSE is slower than cuBLAS at 80% sparsity (5.90ms vs 2.48ms) but faster at 95%+. Speedup below is always vs the best available vendor at each level. "vs cuBLAS" column shown separately.
A (weight matrix)9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector)32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e
★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run — clearly labelled). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS
HuggingFace Models — NVIDIA B200 — 96/96 PASS
Real weights from 5 production LLMs. Up to 19.42× speedup · 99% energy saved.
99% energy saved · 19.42× peak speedup · 6+ platforms · 96/96 correctness · 44,987 GFLOP/s · 19.3M tok/s · 0.23ms TTFT · 4×SHA-256 verified
| Model | Layer | Sp% | vs | Speedup | Energy | Pass |
|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | ✓ |
| Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | ✓ |
| DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | ✓ |
| LLaMA-2-7B (NeuralMagic 50%) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | ✓ |
★ = peak. NVIDIA B200 · 96/96 correctness PASS · 4 SHA-256 hashes per case. Small GQA k/v (<512 rows) below minimum-latency floor — not claimed.
The larger the model, the greater the advantage. 15.22× peak on 405B.
Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.
15.22× — peak, 405B down_proj (16384×28672 · 80% · +92.6% energy)
13.37× — 405B embed_tokens (128256×16384 · 80% · +92.9% energy)
49/49 — correctness PASS (all layers · all sparsity levels · max error 3.2×10⁻⁶)
Scaling across model sizes — mlp.gate_proj (same layer type):
LLaMA-3.1-8B: 10.47× (14336×4096 · 70%)
LLaMA-3.1-70B: 11.45× (28672×8192 · 70%)
LLaMA-3.1-405B ★: 13.02× (28672×16384 · 70%)
H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).
BF16 production dtype · LLaMA-3.1-8B & 70B · NVIDIA B200 · 70/70 PASS. 1.00× at 0% · 2.4× vs cuBLAS-BF16 at 70%.
LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.
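The looser ATOL=0.05 reflects BF16's 7-bit mantissa. A minimal sketch of BF16 truncation (round-to-nearest omitted for brevity; real conversions round to nearest even) shows that rounding noise alone sits far above FP32-level tolerances:

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 to bfloat16 precision by dropping the low 16
    mantissa bits (a sketch; real BF16 conversion rounds to nearest even)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# BF16 keeps only 7 explicit mantissa bits:
assert float(to_bf16(np.float32(1 + 2**-7))) == 1 + 2**-7   # bit 7 survives
assert float(to_bf16(np.float32(1 + 2**-9))) == 1.0         # finer bits are lost

# Rounding noise in even a small matvec dwarfs FP32-level tolerances,
# which is why BF16 comparisons need a looser ATOL budget.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
v = rng.standard_normal(64).astype(np.float32)
max_err = float(np.max(np.abs(A @ v - to_bf16(A) @ to_bf16(v))))
print(f"max abs error from BF16 rounding alone: {max_err:.4f}")
```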
Sparsity structure · why our synthetic benchmarks are a floor
Real pruned weights outperform our published numbers.
Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV: non-zero values are scattered across every row so no row is entirely zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.
A — Uniform random: 1.00× at 70–95% sparsity. Every row retains at least one non-zero value, so no rows can be skipped and CRCS™ compression = 1.0×. This is our published synthetic and the absolute worst case for ROLV.
B — Power-law rows: 7.6–9.2× at 70–95% sparsity (70–95% inactive blocks). Matches magnitude pruning on real LLM weights; ROLV eliminates computation on all inactive blocks.
C — Block structured: 7.8–9.4× at 70–95% sparsity (70–95% inactive blocks). Matches structured head pruning: entire parameter groups are eliminated, and ROLV skips complete inactive groups.
Hardware: NVIDIA B200 · 5000×5000 · batch 1,000. Correctness: 12/12 PASS · 4 SHA-256 hashes per case. Conclusion: power-law vs uniform +659%; block-structured vs uniform +677%.
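The contrast between the uniform-random and row-structured cases can be sketched with NumPy (sizes and seed are illustrative; the second case uses simple whole-row zeroing as a crude stand-in for a true power-law draw):

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, sparsity = 1000, 64, 0.90

# Uniform random: zeros scattered element-wise, rows almost never fully zero.
uniform = rng.standard_normal((rows, cols))
uniform[rng.random((rows, cols)) < sparsity] = 0.0

# Row-collapsed: whole rows zeroed, as magnitude pruning tends to
# produce on real LLM weights.
row_pruned = rng.standard_normal((rows, cols))
row_pruned[rng.random(rows) < sparsity, :] = 0.0

def skippable_rows(W):
    """Rows a row-skipping operator can eliminate entirely."""
    return int((np.abs(W).sum(axis=1) == 0).sum())

# Same element-level sparsity, radically different skippable structure:
print(skippable_rows(uniform), "vs", skippable_rows(row_pruned), "of", rows)
```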
ROLV advantage compounds as workloads grow — in every dimension.
Vendor sparse operators scale linearly with work: double the batch, double the time; double the matrix, roughly double the time. ROLV does not. It operates only on the active subset of the weight matrix and skips zero rows entirely, so as batch size grows, as matrices get larger with bigger models, and as iteration counts increase, ROLV pulls further ahead. The advantage is structural, not incidental.
Batch size ↑ — cuSPARSE latency scales linearly with batch; ROLV scales sub-linearly, with fixed overhead amortised across more tokens. At batch=2,048 ROLV uses 0.41µs/token vs cuSPARSE’s 4.44µs/token. Speedup: 1.24× at batch 1 · 7.92× at batch 512 · 10.90× at batch 2,048.
Model size ↑ — larger models have larger weight matrices: ROLV’s skip fraction stays constant while the absolute number of rows skipped grows. Speedup increases consistently from 8B to 70B to 405B; the biggest models benefit most. Speedup: 10.5× on LLaMA 8B · 11.45× on LLaMA 70B · 12.2× on LLaMA 405B.
Iteration count ↑ — ROLV is built once from a weight matrix, then reused across every inference call. The build cost is fully amortised after the first few thousand iterations; at production scale — millions of daily requests — it never appears in the cost. ~0 build cost · 10.90× every call · ∞ at scale.
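The amortisation argument is simple arithmetic. A sketch with a hypothetical 50 ms build cost (illustrative, not a measured figure):

```python
def amortised_build_us(build_cost_us: float, calls: int) -> float:
    """Per-call share of the one-time operator build cost."""
    return build_cost_us / calls

# Hypothetical: a 50 ms (50,000 µs) build spread over one day of 1,000,000 requests.
per_call_us = amortised_build_us(50_000, 1_000_000)
print(per_call_us)  # 0.05 µs per call: negligible next to per-call latency
```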
Batch scaling: 14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · 9/9 PASS. Model scaling: LLaMA-3.1 exact dimensions · B200 · batch=512 · 84/84 PASS. The vendor advantage is always structural — ROLV skips work that vendors must perform.
A hash: 76252923 · V hash: 7f9f717a · Peak 83.77× at 85% vs rocSPARSE · vs rocBLAS dense: 8.5× peak
| Sparsity | Baseline | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | GFLOP/s Vendor | Energy |
|---|---|---|---|---|---|---|---|---|
| 0% | rocBLAS | 5.96ms | 5.86ms | 419,486 | 426,294 | 83,897 | 85,259 | ref |
| 5% | rocBLAS | 5.31ms | 5.93ms | 470,648 | 421,931 | 89,423 | 84,386 | +11% |
| 50% | rocBLAS | 3.00ms | 5.78ms | 832,076 | 432,525 | 83,208 | 86,505 | +50% |
| 70% | rocSPARSE | 1.89ms | 121.69ms | 1,324,344 | 20,543 | 79,461 | 4,109 | +99% |
| 80% | rocSPARSE | 1.85ms | 92.31ms | 1,351,560 | 27,084 | 54,062 | 5,417 | +98% |
| 85% | rocSPARSE | 0.89ms | 74.27ms | 2,819,691 | 33,659 | 84,591 | 6,732 | +99% |
| 90% | rocSPARSE | 0.81ms | 54.00ms | 3,075,151 | 46,296 | 61,503 | 9,259 | +99% |
| 95% | rocSPARSE | 0.69ms | 30.42ms | 3,607,621 | 82,177 | 36,076 | 16,435 | +98% |
| 99% | rocSPARSE | 0.68ms | 7.75ms | 3,683,090 | 322,663 | 7,366 | 64,533 | +92% |
rocSPARSE has a known performance regression on MI300X for this matrix topology. Both vendor timings (rocBLAS dense and rocSPARSE sparse) are published. ROLV™ absolute latency is consistent: 0.68–1.89ms across all sparsity levels where rocSPARSE is the baseline (70–99%).
Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token, dominated by the prefill pass through all transformer layers. ROLV™ reduces per-layer latency by skipping computation on zero-valued parameters entirely. At 80% sparsity on H200 this cuts each layer from 5.90ms to 0.43ms. Across 32 layers: ~970ms prefill becomes ~71ms.
Tokens per second is the inverse of TTFT per output row — as ROLV™ gets faster, tokens/s grows proportionally. Effective GFLOP/s counts only floating-point operations performed on non-zero values. cuSPARSE and cuBLAS spend cycles on zeros that contribute nothing to the output. ROLV™ skips them, so every FLOP counted is a useful FLOP.
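The prefill arithmetic above is just a ratio: holding the layer count fixed, TTFT speedup equals per-layer speedup. Using the figures quoted in the text:

```python
# Per-layer figures from the text (H200, 80% sparsity): 5.90 ms -> 0.43 ms.
vendor_layer_ms, rolv_layer_ms = 5.90, 0.43
speedup = vendor_layer_ms / rolv_layer_ms

prefill_before_ms = 970                        # ~970 ms prefill, from the text
prefill_after_ms = prefill_before_ms / speedup
print(f"{speedup:.1f}x -> ~{prefill_after_ms:.0f} ms prefill")  # 13.7x -> ~71 ms prefill
```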
Synthetic sweep — worst-case uniform random floor
Uniform-random sparsity. No structural advantage. Published numbers are a floor.
Synthetic matrices use Bernoulli random sparsity — the hardest case for ROLV™ because rows are rarely fully zero. Real pruned LLM weights follow power-law distributions where entire rows collapse to zero, giving significantly higher speedups.
A hash: 76252923 · V hash: 7f9f717a · Peak 83.77× at 85% vs rocSPARSE · Crossover: 5% sparsity
rocSPARSE is the production sparse operator on AMD ROCm — the same role cuSPARSE plays on NVIDIA. rocSPARSE has a known performance regression on MI300X for this matrix topology (121ms vs cuSPARSE’s 4.82ms on H200). Both ROLV™ absolute timing and speedup vs vendor are published. vs rocBLAS dense: ROLV peaks at 8.5×.
| Sp% | Baseline | Vendor ms | ROLV ms | Speedup | Energy | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | PASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | rocBLAS | 5.86 | 5.96 | 0.98× | — | 5.96ms | 5.86ms | 419,486 | 426,294 | 83,897 | ✓ |
| 5% | rocBLAS | 5.93 | 5.31 | 1.12× | +11% | 5.31ms | 5.93ms | 470,648 | 421,931 | 89,423 | ✓ |
| 50% | rocBLAS | 5.78 | 3.00 | 1.92× | +50% | 3.00ms | 5.78ms | 832,076 | 432,525 | 83,208 | ✓ |
| 70% | rocSPARSE | 121.69 | 1.89 | 64.47× | +99% | 1.89ms | 121.69ms | 1,324,344 | 20,543 | 79,461 | ✓ |
| 80% | rocSPARSE | 92.31 | 1.85 | 49.90× | +98% | 1.85ms | 92.31ms | 1,351,560 | 27,084 | 54,062 | ✓ |
| 85% | rocSPARSE | 74.27 | 0.89 | 83.77× | +99% | 0.89ms | 74.27ms | 2,819,691 | 33,659 | 84,591 | ✓ |
| 90% | rocSPARSE | 54.00 | 0.81 | 66.42× | +99% | 0.81ms | 54.00ms | 3,075,151 | 46,296 | 61,503 | ✓ |
| 95% | rocSPARSE | 30.42 | 0.69 | 43.90× | +98% | 0.69ms | 30.42ms | 3,607,621 | 82,177 | 36,076 | ✓ |
| 99% | rocSPARSE | 7.75 | 0.68 | 11.41× | +92% | 0.68ms | 7.75ms | 3,683,090 | 322,663 | 7,366 | ✓ |
ROLV™ beats rocBLAS from 5% sparsity onwards. rocSPARSE baseline applies at 70%+. ROLV™ absolute latency stable at 0.69–1.89ms across 70–99% sparsity. 22/22 PASS, max_abs=0.000.
05a — VRAM savings — scales exactly with sparsity
Less VRAM means larger models or larger batches on the same GPU.
ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.
5× less VRAM at 80% (47 MB vs 235 MB) · 10× at 90% (23 MB vs 235 MB) · 20× at 95% (12 MB vs 235 MB) · 100× at 99% (2.3 MB vs 235 MB)
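The VRAM figures follow directly from row counts. A sketch assuming 4-byte FP32 storage (the assumption that reproduces the 235 MB dense figure for 14336×4096):

```python
# Reproducing the VRAM arithmetic: only surviving (non-zero) rows are stored.
rows, cols, bytes_per = 14336, 4096, 4
dense_mb = rows * cols * bytes_per / 1e6      # ~234.9 MB dense

def rolv_mb(sparsity: float) -> float:
    """Storage for the active rows only, under the FP32 assumption."""
    active_rows = round(rows * (1 - sparsity))
    return active_rows * cols * bytes_per / 1e6

for s in (0.80, 0.90, 0.95, 0.99):
    print(f"{s:.0%}: {rolv_mb(s):.1f} MB vs {dense_mb:.0f} MB dense")
```

The printed values are consistent with the 47 / 23 / 12 / 2.3 MB figures above.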
Independent Verification
Four hashes eliminate the need for trust.
Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.
The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.
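A verifier's side of the protocol can be sketched in a few lines. The exact serialisation (dtype, byte order, memory layout) would need to match the Validation Kit's conventions for hashes to be comparable; this only illustrates the commitment property:

```python
import hashlib
import numpy as np

def tensor_sha256(x: np.ndarray) -> str:
    """SHA-256 of a tensor's raw bytes (a sketch of the commitment scheme;
    serialisation must match the Validation Kit's for hashes to compare)."""
    return hashlib.sha256(np.ascontiguousarray(x).tobytes()).hexdigest()

A = np.arange(12, dtype=np.float32).reshape(3, 4)   # stand-in weight matrix
h = tensor_sha256(A)
assert len(h) == 64                   # 64 hex chars
assert tensor_sha256(A) == h          # deterministic: recomputation matches

# The perturbation idea in miniature: any change to any element yields a
# different hash, so a matching hash pins down the exact tensor.
A2 = A.copy()
A2[0, 0] += 1e-6
assert tensor_sha256(A2) != h
```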
Small numerical deviations from the dense baseline are expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.