Tested on NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC, AMD Instinct MI300X, and Google Axion ARM across 22 sparsity levels, real LLaMA-3.1-8B weights, and real production weight matrices. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU, and beats MKL from 0% on CPU. Confirmed correct in BF16. Energy reductions measured directly via pynvml. 528 verified test cases across 6 hardware platforms.
Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.
No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.
Fewer operations means less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
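The skip-zero-rows principle can be sketched in a few lines of NumPy. This is an illustration of the idea only, not ROLV's actual operator (which is proprietary); `rowskip_matvec` is a hypothetical name, and the shapes are arbitrary.

```python
import numpy as np

def rowskip_matvec(W, x):
    """Compute W @ x, but touch only rows of W that contain non-zeros.

    Illustrative sketch of the skip-zero-rows principle; the real ROLV
    operator is proprietary and its internals are not reproduced here.
    """
    out = np.zeros(W.shape[0], dtype=W.dtype)
    active = np.flatnonzero(np.any(W != 0, axis=1))  # rows with any non-zero
    out[active] = W[active] @ x                      # arithmetic only on active rows
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 256))
W[rng.random(1000) < 0.9] = 0.0        # roughly 90% of rows zeroed
x = rng.standard_normal(256)
assert np.allclose(rowskip_matvec(W, x), W @ x)   # identical result, far fewer MACs
```

At 90% row sparsity, roughly 90% of the multiply-accumulate work in the loop above simply never executes, which is the source of the proportional energy claim.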
No satellite uplink. No cloud API. No GPU. Two thousand miles from the nearest port, a crew member searched ten years of maintenance manuals, cargo manifests, and safety procedures — in seconds — using AI running entirely on the ship’s existing hardware.
That’s what ROLV makes possible. A 42× speedup on a consumer CPU turns a pruned 70B model from unusable to deployable — on any machine, anywhere, with no connection required.
42× speedup on Intel i7 · 33× on 70B embed · no GPU required · 1142/1142 PASS
NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 1142/1142 PASS.
MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.
| Sparsity | Active rows | Compr. | Best vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy† | Pass |
|---|---|---|---|---|---|---|---|---|
| 80% | 2,867 | 5× | 5.8984 cuSPARSE | 0.6190 | 9.53× | 4.01× | +89.5% | ✓ |
| 90% | 1,434 | 10× | 3.0077 cuSPARSE | 0.3475 | 8.66× | 7.14× | +88.4% | ✓ |
| 95% ★ | 717 | 20× | 1.5547 cuSPARSE | 0.2265 | 6.86× | 10.96× | +85.4% | ✓ |
| 99% | 143 | 100× | 0.4415 cuSPARSE | 0.1720 | 2.57× | 14.43× | +61.0% | ✓ |
★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS
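Magnitude row pruning of the kind used in these runs can be sketched as follows. The exact ranking criterion is an assumption (L2 row norm here; the report does not specify it), and this sketch substitutes random data at the same up_proj shape rather than the HuggingFace weights.

```python
import numpy as np

def magnitude_row_prune(W, sparsity):
    """Zero out the (sparsity * n_rows) rows with the smallest L2 norm.

    Assumption: 'magnitude row pruning' means ranking rows by norm;
    the published results do not state which norm is used.
    """
    n_prune = int(round(sparsity * W.shape[0]))
    norms = np.linalg.norm(W, axis=1)
    W = W.copy()
    W[np.argsort(norms)[:n_prune]] = 0.0   # weakest rows first
    return W

rng = np.random.default_rng(0)
W = rng.standard_normal((14336, 4096), dtype=np.float32)  # up_proj shape
W95 = magnitude_row_prune(W, 0.95)
active = int(np.count_nonzero(np.any(W95 != 0, axis=1)))
print(active)  # 717 active rows, matching the 95% row of the table
```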
| Model | Layer | Sp% | vs | Speedup | Energy | Pass |
|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | ✓ |
| Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | ✓ |
| DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | ✓ |
| LLaMA-2-7B (NeuralMagic 50%) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | ✓ |
| Qwen2.5-72B-Instruct | embed_tokens | 70% | cuSPARSE | 11.72× ★ | +91% | ✓ |
| Qwen2.5-72B-Instruct | mlp.gate_proj | 70% | cuSPARSE | 11.39× | +91% | ✓ |
| DeepSeek-V3 (671B/37B active) | embed_tokens | 80% | cuSPARSE | 11.80× ★ | +92% | ✓ |
| DeepSeek-V3 (671B/37B active) | q_proj | 70% | cuSPARSE | 9.96× | +90% | ✓ |
| Kimi K2 (1T/32B active) | embed_tokens | 70% | cuSPARSE | 11.90× ★ | +92% | ✓ |
| Kimi K2 (1T/32B active) | q_proj | 70% | cuSPARSE | 9.98× | +90% | ✓ |
| Llama 4 Scout/Maverick ★ | embed_tokens | 70% | cuSPARSE | 11.91× | +92% | ✓ |
| Mistral Large 3 (675B/41B) | shared_expert.down | 70% | cuSPARSE | 9.40× | +89% | ✓ |
| Qwen3-235B-A22B | embed_tokens | 70% | cuSPARSE | 11.47× | +91% | ✓ |
| Microsoft Phi-4 (14B dense) | mlp.gate_proj | 70% | cuSPARSE | 9.34× | +89% | ✓ |
| GPT-OSS 120B/20B (OpenAI) | embed_tokens | 70% | cuSPARSE | 11.33× | +91% | ✓ |
★ = peak. NVIDIA B200 · 582/582 correctness PASS · 4 SHA-256 hashes per case. Qwen2.5-72B: 36/36 PASS, 11.72× peak embed, 11.39× MLP. Small GQA k/v (<512 rows) below minimum-latency floor — not claimed.
★ = peak. Real model weights (no synthetic matrices); magnitude row pruning applied. NVIDIA B200 · batch=512 · 200 iters · 60/60 PASS · 59/60 perturbation PASS · 4 SHA-256 hashes per case. Cache deleted after run. † GQA single-layer; use layer-batching for production (15.62× proven).
8B: H=4096 I=14336. 70B: H=8192 I=28672. Both: vocab=128256, NKV=8. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 84/84 PASS. † GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).
Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.
H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).
LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.
Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV: non-zero values are scattered across every row so no row is entirely zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.
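The structural difference is easy to quantify: under uniform Bernoulli sparsity the probability that a whole row is zero is sparsity^cols, which is astronomically small, while row-level pruning zeroes entire rows by construction. A small NumPy check (arbitrary shapes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, sparsity = 4096, 1024, 0.90

# Uniform Bernoulli sparsity: each element is zeroed independently.
bern = rng.standard_normal((rows, cols))
bern[rng.random((rows, cols)) < sparsity] = 0.0
zero_rows_bern = int(np.sum(~np.any(bern != 0, axis=1)))

# Row-structured sparsity (what magnitude pruning produces): whole rows zero.
struct = rng.standard_normal((rows, cols))
struct[: int(sparsity * rows)] = 0.0
zero_rows_struct = int(np.sum(~np.any(struct != 0, axis=1)))

# P(row all-zero) under Bernoulli = 0.9**1024, about 1e-47: effectively none.
print(zero_rows_bern, zero_rows_struct)  # 0 3686
```

A row-skipping operator finds nothing to skip in the first matrix and 3,686 full rows to skip in the second, at identical element-level sparsity. That is why the uniform-random numbers are a floor.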
Vendor sparse operators scale linearly with work: double the batch, double the time; double the matrix, roughly double the time. ROLV does not. It operates only on the active subset of the weight matrix and skips zero rows entirely, so as batch size grows, as matrices get larger with bigger models, and as iteration counts increase, ROLV pulls further ahead. The advantage is structural, not incidental.
cuSPARSE latency scales linearly with batch. ROLV scales sub-linearly — fixed overhead amortised across more tokens. At batch=2,048 ROLV uses 0.41µs/token vs cuSPARSE’s 4.44µs/token.
Larger models have larger weight matrices. ROLV’s skip fraction stays constant while the absolute rows skipped grows. Speedup consistently increases from 8B to 70B to 405B — the biggest models benefit most.
ROLV is built once from a weight matrix, then reused across every inference call. Build cost is fully amortised after the first few thousand iterations. At production scale — millions of daily requests — it never appears in the cost.
Batch scaling: 14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · 9/9 PASS. Model scaling: LLaMA-3.1 exact dimensions · B200 · batch=512 · 84/84 PASS. ROLV's advantage over vendors is structural — ROLV skips work that vendors must perform.
TTFT, tokens/second, and effective GFLOP/s measured directly at each sparsity level across all four platforms. NVIDIA H200 shown by default.
A hash: b2687223 · V hash: f8b47533
| Sparsity | Baseline | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | GFLOP/s Vendor | Energy |
|---|---|---|---|---|---|---|---|---|
| 0% | cuBLAS | 2.51ms | 2.48ms | 1,003,984 | 992,032 | 100,842 | 100,842 | ref |
| 50% | cuBLAS | 1.31ms | 2.48ms | 1,908,397 | 992,032 | 52,441 | 100,842 | +47% |
| 70% | cuSPARSE | 0.68ms | 4.82ms | 7,352,941 | 1,247,000 | 22,134 | 12,502 | +86% |
| 80% | cuSPARSE | 0.43ms | 5.90ms | 11,627,907 | 1,694,915 | 44,987 | 16,485 | +97% |
| 90% | cuSPARSE | 0.28ms | 3.71ms | 17,857,143 | 1,347,709 | 26,762 | 16,189 | +99% |
| 95% | cuSPARSE | 0.19ms | 2.02ms | 26,315,789 | 1,237,624 | 17,841 | 14,887 | +99% |
| 99% | cuSPARSE | 0.08ms | 0.61ms | 62,500,000 | 8,196,721 | 5,120 | 98,000 | +99% |
At 80% sparsity: 32-layer prefill goes from ~970ms → ~71ms. GFLOP/s counts only arithmetic on non-zero data.
Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token, dominated by the prefill pass through all transformer layers. ROLV™ reduces per-layer latency by skipping computation on zero-valued parameters entirely. At 80% sparsity on H200 this cuts the benchmarked matmul from 5.90ms to 0.43ms; applied across every matmul in a 32-layer prefill, ~970ms becomes ~71ms.
Tokens per second is the number of output rows produced divided by TTFT — as ROLV™ gets faster, tokens/s grows proportionally. Effective GFLOP/s counts only floating-point operations performed on non-zero values. cuSPARSE and cuBLAS spend cycles on zeros that contribute nothing to the output; ROLV™ skips them, so every FLOP counted is a useful FLOP.
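The effective-GFLOP/s definition above is just useful work divided by time. The batch size below is an assumption (it is not stated for this table), so the result will not exactly match the published figure; the active-row count follows the 80%-sparsity up_proj row.

```python
# Effective GFLOP/s: count only arithmetic on non-zero data.
# batch is an assumed value, so this is a worked illustration of the
# formula, not a reproduction of the published 80%-sparsity number.
rows, cols, batch = 14336, 4096, 1024
active_rows = 2867            # 20% of 14336 rows survive at 80% sparsity
time_s = 0.43e-3              # ROLV latency at 80% sparsity (ms -> s)

useful_flops = 2 * active_rows * cols * batch   # one multiply + one add each
eff_gflops = useful_flops / time_s / 1e9
print(round(eff_gflops))
```

A dense baseline at the same latency would be credited with 5× more FLOPs, 80% of which are multiplications by zero; this metric refuses to count them.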
First published ROLV results on the ARM64 architecture. Google Axion (Neoverse V2) — Google Cloud C4A instance. ROLV performs on ARM exactly as it does on x86: same algorithm, same advantage, same correctness. 22/22 PASS · max error 0.00e+00.
| Sparsity | vs | Vendor ms | ROLV ms | Speedup | Energy | Tok/s ROLV | TTFT ROLV | Pass |
|---|---|---|---|---|---|---|---|---|
| 0% | OpenBLAS | 46.85 | 46.53 | 1.01× | ref | 10,745 | 46.53ms | ✓ |
| 50% | OpenBLAS | 46.49 | 24.05 | 1.93× | +48% | 20,787 | 24.05ms | ✓ |
| 70% | CPU-CSR | 62.08 | 14.48 | 4.29× | +77% | 34,527 | 14.48ms | ✓ |
| 80% ★ | CPU-CSR | 42.41 | 9.79 | 4.33× | +77% | 51,078 | 9.79ms | ✓ |
| 90% | CPU-CSR | 20.94 | 5.35 | 3.92× | +76% | 93,534 | 5.35ms | ✓ |
| 95% | CPU-CSR | 10.65 | 2.83 | 3.76× | +73% | 176,578 | 2.83ms | ✓ |
| 99% | CPU-CSR | 2.37 | 0.82 | 2.89× | +65% | 608,004 | 0.82ms | ✓ |
★ = peak. Google Cloud C4A · Google Axion (ARM Neoverse V2, aarch64) · 2000×2000 · batch=500 · 100 iters · 22/22 PASS · 4 SHA-256 hashes. A: 82371dc0 · V: 3107f98a
First ever ROLV Primitive© benchmark on ARM architecture. Google Cloud C4A instance running Google Axion (Neoverse V2) — the same AArch64 instruction set that powers AWS Graviton and Apple Silicon. ROLV outperforms OpenBLAS from just 5% sparsity, reaching 5.12× at 70% vs CPU-CSR. Same software operator, zero changes — ARM just works.
Google Cloud C4A · Google Axion (ARM Neoverse V2) · aarch64 · 3000×3000 · batch=1000 · iters=1000 · 22/22 PASS · max error 0.00e+00 · 4 SHA-256 hashes per run.
Exact LLaMA-3.1-8B and 70B layer dimensions on a consumer Intel Core i7 laptop CPU. vs MKL below 70% sparsity, vs CPU-CSR at 70% and above. ROLV on a laptop CPU outperforms CPU-CSR by up to 42× — making sparse LLM inference viable on edge hardware without a GPU.
| Layer | Shape | Sp% | vs | Vendor ms | ROLV ms | Speedup | Energy | Pass |
|---|---|---|---|---|---|---|---|---|
| 8B embed_tokens | 128256×4096 | 70% | CPU-CSR | 13,868 | 700.6 | 19.8× | +81% | ✓ |
| 8B mlp.gate_proj | 14336×4096 | 80% | CPU-CSR | 927.8 | 43.6 | 21.3× | +79% | ✓ |
| 8B mlp.down_proj | 4096×14336 | 70% | CPU-CSR | 2,123.8 | 53.2 | 39.9× | +89% | ✓ |
| 8B mlp.down_proj ★ | 4096×14336 | 95% | CPU-CSR | 355.9 | 8.4 | 42.4× | +91% | ✓ |
| 8B q_proj | 4096×4096 | 70% | CPU-CSR | 385.6 | 16.4 | 23.5× | +83% | ✓ |
| 70B embed_tokens ★ | 128256×8192 | 70% | CPU-CSR | 39,974 | 1,187.8 | 33.6× | +88% | ✓ |
| 70B mlp.gate_proj | 28672×8192 | 70% | CPU-CSR | 8,564.8 | 261.3 | 32.8× | +88% | ✓ |
| 70B mlp.up_proj | 28672×8192 | 70% | CPU-CSR | 8,982.8 | 509.0 | 17.6× | +83% | ✓ |
★ = peak. Intel Core i7 (Intel64 Family 6 Model 140, 68.4 GB RAM) · batch=512 · synthetic weights at exact LLaMA-3.1-8B and 70B dimensions · 84/84 PASS · 4 SHA-256 hashes per case. vs MKL below 70%, vs CPU-CSR at 70% and above.
Synthetic matrices use Bernoulli random sparsity — the hardest case for ROLV™ because rows are rarely fully zero. Real pruned LLM weights follow power-law distributions where entire rows collapse to zero, giving significantly higher speedups.
A hash: b2687223 · V hash: f8b47533 · Peak 13.64× at 80% vs cuSPARSE
| Sp% | Baseline | Vendor ms | ROLV ms | Speedup | Energy | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | PASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | cuBLAS | 2.48 | 2.51 | 0.99× | — | 2.51ms | 2.48ms | 1,003,984 | 992,032 | 100,842 | ✓ |
| 50% | cuBLAS | 2.48 | 1.31 | 1.89× | +47% | 1.31ms | 2.48ms | 1,908,397 | 992,032 | 52,441 | ✓ |
| 70% | cuSPARSE | 4.82 | 0.68 | 7.09× | +86% | 0.68ms | 4.82ms | 7,352,941 | 1,247,000 | 22,134 | ✓ |
| 80% | cuSPARSE | 5.9 | 0.43 | 13.64× | +97% | 0.43ms | 5.9ms | 11,627,907 | 1,694,915 | 44,987 | ✓ |
| 90% | cuSPARSE | 3.71 | 0.28 | 13.25× | +99% | 0.28ms | 3.71ms | 17,857,143 | 1,347,709 | 26,762 | ✓ |
| 95% | cuSPARSE | 2.02 | 0.19 | 10.63× | +99% | 0.19ms | 2.02ms | 26,315,789 | 1,237,624 | 17,841 | ✓ |
| 99% | cuSPARSE | 0.61 | 0.08 | 7.63× | +99% | 0.08ms | 0.61ms | 62,500,000 | 8,196,721 | 5,120 | ✓ |
At 80% sparsity: 32-layer prefill ~970ms → ~71ms. GFLOP/s = arithmetic on non-zero data only.
ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.
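The storage claim checks out with a first-order calculation (index overhead ignored here; the active-row count is the 143 from the H200 table):

```python
# First-order check of the 99%-sparsity storage claim for up_proj.
# Index/metadata overhead is ignored, as a rough sanity check only.
rows, cols, itemsize = 14336, 4096, 4        # FP32 up_proj
dense_mb = rows * cols * itemsize / 1e6
active_rows = 143                            # rows surviving 99% pruning
sparse_mb = active_rows * cols * itemsize / 1e6
print(f"{dense_mb:.0f} MB dense vs {sparse_mb:.2f} MB active")  # 235 MB dense vs 2.34 MB active
```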
Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.
The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.
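The four-hash protocol can be sketched with `hashlib`. The canonical byte encoding of each tensor is an assumption here (raw row-major buffer); the Validation Kit defines the exact form, and the seed and shapes below are placeholders.

```python
import hashlib
import numpy as np

def h(arr):
    """First 8 hex chars of SHA-256 over the raw buffer.

    Assumption: hashes cover the contiguous row-major bytes; the
    Validation Kit specifies the actual canonical encoding.
    """
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()[:8]

rng = np.random.default_rng(42)                        # placeholder seed
A = rng.standard_normal((128, 64)).astype(np.float32)  # weight matrix
V = rng.standard_normal(64).astype(np.float32)         # input vector
dense_out = A @ V
rolv_out = dense_out.copy()   # stand-in for the operator's output

# The four hashes published per benchmark case:
print(h(A), h(V), h(dense_out), h(rolv_out))
assert h(dense_out) == h(rolv_out)   # outputs agree byte-for-byte
```

Because the model, layer, seed, and sparsity level are all published, anyone can regenerate the same bytes and confirm the same four hashes independently.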
The ROLV Primitive© is exact on its compressed submatrix — no approximation is introduced by the operator itself. The only source of output error is pruning, which zeroes low-magnitude rows before the primitive is built.
This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.
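The error decomposition above can be demonstrated directly: pruning changes the answer, while computing only the surviving rows does not. This sketch uses a generic submatrix multiply in place of ROLV (whose internals are not public) and arbitrary shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
W_full = rng.standard_normal((512, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

# Step 1: pruning zeroes low-magnitude rows -- the ONLY error source.
norms = np.linalg.norm(W_full, axis=1)
W = W_full.copy()
W[np.argsort(norms)[: int(0.8 * 512)]] = 0.0     # 80% of rows pruned

# Step 2: a submatrix-only multiply (stand-in for ROLV) reproduces the
# pruned dense result; it introduces no additional approximation.
active = np.flatnonzero(np.any(W != 0, axis=1))
compressed = np.zeros(512, dtype=np.float32)
compressed[active] = W[active] @ x

pruning_err = float(np.max(np.abs(W @ x - W_full @ x)))   # real, from pruning
operator_err = float(np.max(np.abs(compressed - W @ x)))  # numerically zero
assert operator_err <= 1e-3 and pruning_err > operator_err
```

All the accuracy lives in the pruning decision; the tolerance budget is spent before the operator ever runs.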
Standard benchmarks prove a specific computation was run on specific data. This goes further: you supply the input numbers — only you know them, only you know the expected output. If ROLV returns the value you computed yourself on a calculator, it cannot have pre-computed that result. The proof is zero-trust by construction.
The app verifies one specific claim: that ROLV produces the same numerical output as a standard dense matrix multiply. It does this without any ROLV code in the app itself — just arithmetic you can check by hand.
The key insight: the app contains no hidden ROLV secret. It runs a sparse matrix operator and a standard dense multiply on the same input and shows you both answers. The ROLV claim is simply that they agree — and you can check that yourself without trusting us.
Powered by huggingface.co/spaces/rolvai/rolv-verify · opens as overlay · no account needed
The verification matrix is deterministic (seed 20260101) — same on every machine. Active rows are published. Publish your SHA-256 hashes of W and x to let anyone independently reproduce your exact run.
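A deterministic run of this kind looks roughly as follows. The distribution, dtype, and shape are placeholder assumptions (the app defines the actual generation procedure); only the seed 20260101 comes from the text.

```python
import hashlib
import numpy as np

# Sketch of the deterministic-verification recipe. Shape, dtype, and
# distribution are assumed placeholders; the published seed is real.
SEED = 20260101
rng = np.random.default_rng(SEED)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

w_hash = hashlib.sha256(W.tobytes()).hexdigest()
x_hash = hashlib.sha256(x.tobytes()).hexdigest()
print(w_hash[:8], x_hash[:8])   # publish these to pin down your exact run

# Re-seeding reproduces byte-identical data on any machine:
rng2 = np.random.default_rng(SEED)
assert np.array_equal(W, rng2.standard_normal((256, 256)).astype(np.float32))
```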
Finds the exact sparsity threshold where sparse storage beats dense for your dtype.
Finds the sparsity level at which the vendor's dense path first hits VRAM pressure — the point where switching to sparse pays off.
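As a worked example of the storage threshold: for element-wise CSR with int32 column indices, break-even density is roughly itemsize / (itemsize + 4). This is a generic CSR model, not ROLV's row-level format, whose overheads differ.

```python
# Storage break-even for generic element-wise CSR (NOT ROLV's format):
#   dense bytes:  R * C * itemsize
#   CSR bytes:    nnz * (itemsize + index_bytes) + row-pointer overhead
# Ignoring the small row-pointer term, CSR wins once density drops
# below itemsize / (itemsize + index_bytes).
def breakeven_sparsity(itemsize, index_bytes=4):
    return 1.0 - itemsize / (itemsize + index_bytes)

print(f"FP32: {breakeven_sparsity(4):.1%}")   # 50.0%
print(f"BF16: {breakeven_sparsity(2):.1%}")   # 66.7%
```

Narrower dtypes need more sparsity before element-wise indexing pays for itself, which is one reason row-level formats are attractive for BF16 weights.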