Tested on NVIDIA H200, B200, Tesla T4, Intel CPU, and AMD EPYC across 22 sparsity levels, real LLaMA-3.1-8B weights, and real production weight matrices. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU, and beats MKL from 0% on CPU. Confirmed correct in BF16. Energy reductions measured directly via pynvml. 528 verified test cases across 6 hardware platforms.
Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.
No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.
Fewer operations means less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 528/528 PASS.
MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.
| Sparsity | Active params | Compr. | Best vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy† | Pass |
|---|---|---|---|---|---|---|---|---|
| 80% | 2,867 | 5× | 5.8984 cuSPARSE | 0.6190 | 9.53× | 4.01× | +89.5% | ✓ |
| 90% | 1,434 | 10× | 3.0077 cuSPARSE | 0.3475 | 8.66× | 7.14× | +88.4% | ✓ |
| 95% ★ | 717 | 20× | 1.5547 cuSPARSE | 0.2265 | 6.86× | 10.96× | +85.4% | ✓ |
| 99% | 143 | 100× | 0.4415 cuSPARSE | 0.1720 | 2.57× | 14.43× | +61.0% | ✓ |
★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run — clearly labelled). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS
Tested on Mistral-7B-Instruct, Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Qwen-7B, and LLaMA-2-7B (NeuralMagic 50% pre-pruned) on NVIDIA B200. Large embedding and MLP layers see 10–19× speedup at 70%+ sparsity. Small GQA k/v matrices (<512 rows) are below the minimum-latency floor — ROLV does not claim speedup there. All 96 test cases PASS. Below 70%: vs cuBLAS. Above 70%: vs cuSPARSE.
| Model | Layer | Sparsity | vs | Speedup | Energy | Pass |
|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | ✓ |
| Mistral-7B-Instruct-v0.3 | q_proj | 80% | cuSPARSE | 2.97× | +66.3% | ✓ |
| Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | ✓ |
| Qwen2.5-7B-Instruct | q_proj | 70% | cuSPARSE | 3.32× | +69.9% | ✓ |
| DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | ✓ |
| LLaMA-2-7B (NeuralMagic 50% pre-pruned) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | ✓ |
| LLaMA-2-7B (NeuralMagic 50% pre-pruned) | v_proj | 95% | cuSPARSE | 3.37× | +70.3% | ✓ |
★ = peak. Italic = cuSPARSE. Large embedding/MLP layers: 10–19×. Small GQA k/v matrices (<512 rows) below minimum-latency floor — ROLV does not claim speedup there. NVIDIA B200 · 96/96 correctness PASS · 4 SHA-256 hashes per case.
meta-llama/Llama-3.1-8B downloaded directly from HuggingFace. Gate, up, and down projection layers at three depths. 60/60 PASS across all layer types and sparsity levels. The MLP speedup is identical at every layer depth — not cherry-picked. Synthetic benchmarks predict real weights to within 0.5%.
| Layer | Shape | Best speedup | Energy | At sparsity | Consistent |
|---|---|---|---|---|---|
| embed_tokens ★ | 128256×4096 | 11.24× | +91% | 80% | ✓ |
| mlp.gate_proj (L0/16/31) | 14336×4096 | 10.42× | +99% | 70% | All 3 depths |
| mlp.up_proj (L0/16/31) | 14336×4096 | 10.42× | +99% | 70% | All 3 depths |
| mlp.down_proj (L0/31) | 4096×14336 | 8.65× | +99% | 70% | Both depths |
| q_proj | 4096×4096 | 6.63× | +99% | 70% | ✓ |
| k/v proj (GQA) | 1024×4096 | 3.46× | +71% | 70%† | ✓ |
★ = peak. Real weights, no synthetic pruning. Magnitude row pruning applied. NVIDIA B200 · batch=512 · 200 iters · 60/60 PASS · 59/60 perturbation PASS · 4 SHA-256 hashes per case. Cache deleted after run. † GQA single-layer; use layer-batching for production (15.62× proven).
Synthetic weights at the exact dimensions of every layer type in LLaMA-3.1-8B (H=4096, I=14336) and 70B (H=8192, I=28672). 7 layer types × 2 models × 6 sparsity levels = 84 cases. 84/84 PASS. Larger models benefit more — 70B consistently outperforms 8B on every layer type.
| Layer | Shape | 8B peak | 70B peak | Energy saving | At sparsity |
|---|---|---|---|---|---|
| embed_tokens | 128256×H | 11.27× | 11.95× | +91–93% | 80% |
| mlp.gate_proj | I×H | 10.47× | 11.45× | +91–99% | 70% |
| mlp.up_proj | I×H | 10.45× | 11.44× | +91–99% | 70% |
| mlp.down_proj | H×I | 8.47× | 10.83× | +91–99% | 70% |
| q_proj | H×H | 6.70× | 8.53× | +75–99% | 70% |
| k_proj / v_proj (GQA) | kv_dim×H | 3.32× | 4.43× | +49–77% | 70%† |
8B: H=4096 I=14336. 70B: H=8192 I=28672. Both: vocab=128256, NKV=8. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 84/84 PASS. † GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).
Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.
H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).
ROLV runs in native BF16 throughout — weights, compute, and output all in BF16 using the same tensor cores as cuBLAS. At 0% sparsity ROLV matches cuBLAS exactly (1.00×). At 70%+ sparsity ROLV outperforms cuBLAS-BF16 on every layer tested. 70/70 PASS.
LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.
Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV because no rows are fully zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.
At production serving batch sizes (512–2048), ROLV achieves 8–11× speedup on the MLP layers that dominate LLaMA inference. cuSPARSE time scales linearly with batch — ROLV scales sub-linearly. The larger the batch, the greater the advantage.
14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · PASS. cuSPARSE/token cost plateaus; ROLV/token keeps falling as batch grows. At batch=2048 ROLV uses 0.41µs per token vs cuSPARSE’s 4.44µs.
Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token. It is dominated by the prefill pass — a forward pass through all transformer layers. ROLV reduces the time of each weight-matrix multiply by eliminating redundant computation, cutting per-layer latency from 30ms to 2ms at 80% sparsity. Across 32 layers that compounds to a 13.6× prefill speedup.
Effective GFLOP/s counts only the floating-point operations actually performed on non-zero data — not the full matrix. At 80% sparsity ROLV does 2.73× more useful arithmetic per second than cuSPARSE, because cuSPARSE processes inactive elements that contribute nothing to the output. The metric that matters for your SLA is still wall-clock TTFT, which is what we measure and report.
ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.
Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.
The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.
The ROLV Primitive© is exact on its compressed submatrix — no approximation is introduced by ROLV Primitive© itself. The only source of output error is pruning, which zeroes low-magnitude rows before ROLV Primitive© is built.
This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.
Find the exact sparsity threshold where sparse storage beats dense for your dtype.
Finds the exact sparsity where vendor dense hits VRAM congestion first — your switch point to sparse.