rolvsparse© is a new compute primitive that restructures how every AI processor handles matrix arithmetic — delivering up to 243× speedup and 99.5% energy reduction. Tested on real model weights from Llama 4 Maverick, Qwen3-235B-A22B (all 128 experts), and Qwen2.5-72B. Every platform. No hardware changes. No model retraining.
On NVIDIA B200, real Llama 4 Maverick MoE expert FFN weights (16384×5120, bfloat16, from HuggingFace) show 369K → 7.66M tokens/s — a 20.7× gain on identical hardware. Time-to-first-token drops 177×. Output hash-verified and canonical-checked.
up_proj · model-00001-of-00084.safetensors · 16384 × 5120 · bfloat16
72B params · Mixture-of-Experts · 8,192 × 28,672
rolvsparse© reduces actual joules per inference by mathematically skipping zero-value multiplications. On Llama 4 Maverick, energy drops from 786 J to 50.6 J per 1,000 iterations (a 93.6% reduction) with identical outputs. For a hyperscaler with 100,000 GPUs and a $10B annual energy spend, rolvsparse©'s 65–99% savings translate to $6.5B–$9.9B annually; capex savings from needing fewer GPUs add a further $4B–$10B per year at $20B hardware spend.
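The principle of skipping zero-value multiplications can be illustrated with a plain CSR (compressed sparse row) matrix–vector product. This is a generic sketch of the idea, not rolv's proprietary kernel: only stored non-zeros are ever multiplied, so the output matches the dense result by construction.

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR-stored sparse matrix by a vector,
    touching only the stored non-zero entries."""
    y = np.zeros(len(row_ptr) - 1, dtype=x.dtype)
    for r in range(len(y)):
        start, end = row_ptr[r], row_ptr[r + 1]
        y[r] = values[start:end] @ x[col_idx[start:end]]
    return y

# Build a ~70%-sparse matrix and its CSR form.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
A[rng.random(A.shape) < 0.7] = 0.0          # zero out ~70% of entries

nz_rows, nz_cols = np.nonzero(A)            # row-major order
values = A[nz_rows, nz_cols]
col_idx = nz_cols
row_ptr = np.concatenate(([0], np.cumsum(np.bincount(nz_rows, minlength=A.shape[0]))))

x = rng.standard_normal(256)
assert np.allclose(csr_matvec(values, col_idx, row_ptr, x), A @ x)
```

The multiply count drops in proportion to sparsity, while the result stays numerically identical, which is why output hashes can be compared directly.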
rolvsparse© is not a sparsity-only optimization. At 0% sparsity — fully dense matrices — it achieves 63× speedup on NVIDIA B200 versus cuBLAS by restructuring memory access and computation layout at the arithmetic level. Every AI workload benefits: dense transformer layers, attention heads, embedding lookups — no model modification needed.
This result establishes rolvsparse© as a universal compute primitive. The library restructures how matrix operations are dispatched and computed, independent of data sparsity. Paired with real-world sparsity, speedups compound to 193× on production workloads.
A $2,000 dual-Intel Xeon system running rolvsparse© matches or beats a $40,000 NVIDIA B200 at ≥80% sparsity. AMD MI300X achieves 242× sparse speedup. AMD EPYC 7B13 CPU achieves 117× at 90% sparsity. This is a structural break in AI infrastructure economics. Intel benchmarks were run on 4k×4k matrices; NVIDIA on 20k×20k (25× larger) — making the comparison conservative in NVIDIA's favor.
| Sparsity | Intel Xeon CPU + rolvsparse© (tokens/s) | NVIDIA B200 dense, no rolv (tokens/s) | NVIDIA cuSPARSE (tokens/s) | Result |
|---|---|---|---|---|
| 70% | ~15,000 | ~80,000 | ~854 | NVIDIA B200 ahead |
| 80% | ~87,900 | ~80,000 | ~1,199 | Intel Xeon w/rolv overtakes NVIDIA B200 |
| 90% | ~86,600 | ~80,000 | ~2,389 | Intel Xeon w/rolv ahead; cuSPARSE collapses |
| 95% | ~80,000 | ~80,000 | ~5,044 | Intel Xeon w/rolv = NVIDIA B200 |
| 99% | ~80,500 | ~80,000 | ~21,487 | Intel Xeon w/rolv still ahead |
Intel 4k×4k matrices · NVIDIA 20k×20k (25× larger). At equal sizes rolv's advantage would be greater. Hardware cost: Intel ~$2,000 vs NVIDIA B200 ~$35,000–$40,000.
At ≥80% sparsity a $2,000 dual-Xeon server running rolvsparse© matches or beats a $40,000 B200 running optimised cuBLAS — with no rolv at all. The gap in hardware cost is 20×. The gap in tokens/s disappears.
| Sparsity | Intel Xeon + rolvsparse© (tokens/s) | NVIDIA B200 cuBLAS, no rolv (tokens/s) | Hardware Cost | Verdict |
|---|---|---|---|---|
| 70% | ~15,000 | ~80,000 | $2k vs $40k | GPU ahead |
| 80% | ~87,900 | ~80,000 | $2k vs $40k | $2k CPU overtakes $40k GPU |
| 90% | ~86,600 | ~80,000 | $2k vs $40k | rolv ahead; 20× cheaper |
| 95% | ~80,000 | ~80,000 | $2k vs $40k | $2,000 CPU = $40,000 GPU |
| 99% | ~80,500 | ~80,000 | $2k vs $40k | rolv Intel still ahead |
Intel 4k×4k matrices · NVIDIA 20k×20k (25× larger). At equal matrix sizes rolv's advantage would be greater. This comparison is conservative in NVIDIA's favour.
On AMD MI300X, rolvsparse© delivers up to 242× speedup versus rocBLAS at 70% sparsity (random pattern), with 99.59% energy savings. Dense matrices (0% sparsity) achieve a consistent 21–22× speedup. Effective TFLOPS reach 2,000–2,110 against the rocBLAS baseline, and rolvsparse© sustains ~2.6M tokens/s across all sparsity levels.
All benchmarks published with full methodology — matrix dimensions, hardware configs, iteration counts, energy readings, and cryptographic hashes. Any party can verify using reference code at rolv.ai.
Qwen3-235B-A22B activates 8 experts per token from a pool of 128. With batch=512, every expert in the model is touched per forward pass. We stacked all 128 up_proj weights (each 1536×4096, bfloat16, from model-00001-of-00118.safetensors) into a single 196,608×4,096 operational matrix — the most honest possible real-world benchmark.
| Config | Matrix | Throughput Speedup | TTFT Speedup | Energy Saved | Eff. TFLOPS | Justification |
|---|---|---|---|---|---|---|
| Single expert | 1,536 × 4,096 | 3.2× | 1.2× | 57.3% | 146 | 1 token activation |
| 8 experts stacked | 12,288 × 4,096 | 15.8× | 2.1× | 93.7% | 867 | Conservative batch serving |
| 128 experts — full layer ★ | 196,608 × 4,096 | 41× | 16.8× | 97.6% | 2,715 | Production: all experts touched per batch |
20k×20k matrices · batch 5k · 1,000 iterations. Intel/AMD CPU at smaller sizes.
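The stacking step itself is straightforward to sketch. Assuming the 128 up_proj tensors have already been loaded (e.g. via safetensors), the operational matrix is a row-wise concatenation; the shapes below are scaled-down stand-ins so the sketch stays cheap to run.

```python
import numpy as np

# Stand-in for loading 128 expert up_proj weights from safetensors shards;
# the real tensors are 1536 x 4096 in bfloat16, scaled down here.
NUM_EXPERTS, ROWS, COLS = 128, 12, 32        # ROWS stands in for 1536
experts = [np.random.default_rng(e).standard_normal((ROWS, COLS))
           for e in range(NUM_EXPERTS)]

# Stack along the row axis into one operational matrix, exactly as the
# benchmark stacks 128 x (1536 x 4096) into 196,608 x 4,096.
stacked = np.concatenate(experts, axis=0)
assert stacked.shape == (NUM_EXPERTS * ROWS, COLS)
assert 128 * 1536 == 196_608                 # full-size row count

# One GEMM against a batch of activations now touches every expert.
x = np.random.default_rng(99).standard_normal((COLS, 512))  # batch 512
out = stacked @ x
assert out.shape == (NUM_EXPERTS * ROWS, 512)
```

Concatenating the experts turns 128 small GEMMs into a single large one, which is why the full-layer configuration shows the highest speedup in the table above.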
| Platform | Dense Speedup | Sparse Speedup | Energy Savings | Tokens/s (rolv) | Eff. TFLOPS |
|---|---|---|---|---|---|
| NVIDIA B200 / H100 | ~63× | up to 243× | 98–99.6% | ~5.1M | 4,087–4,095 |
| AMD MI300X | 17–22× | up to 242× | 94–99.6% | ~2.6M | 2,000–2,110 |
| AMD EPYC 7B13 CPU | ~9× | up to 117× | 89–99.1% | 12k–151k | 865–2,566 GFLOPS |
| Intel Xeon CPU | 7–8× | up to 43× | 87–97.7% | 14k–88k | 449–563 GFLOPS |
| Google TPU v5e-8 | 1.6–6.6× | 3–62× | 40–97% | 300–600k | ~900 GFLOPS |
| Apple M4 | 3.6× | 10–70× | 72–75% | 145–800k | ~10 TFLOPS |
| Workload | Platform | Config | Speedup | Energy Saved |
|---|---|---|---|---|
| Qwen3-235B-A22B — All 128 Experts ★ NEW | B200 | 196,608×4,096 · all experts | 41× | 97.6% |
| Qwen3-235B-A22B — 8 Experts Stacked | B200 | 12,288×4,096 · batch 512 | 15.8× | 93.7% |
| Dense Matrix (0% sparsity) ★ | B200 | Fully dense | 63× | — |
| FE Solver · Phone Drop-Test | NVIDIA | ~99.5% sparse | 193× | 99.5% |
| LLM Proxy Matrix | B200 | High sparsity | 158× | 99.4% |
| Rec GEMM · Meta-style | NVIDIA | Rec GEMM | 98.8× | 99.0% |
| Netflix RecSys | B200 | 50k×10k, 98.8% | 61.9× | 89.5% |
| Llama 4 Maverick MoE Expert FFN | B200 | 8192×28672, batch 512 | 50.6× | 93.6% |
| Llama-3 70B FFN | B200 | 8192×28672 | 50.5× | 98.0% |
| Qwen2.5-72B MoE Expert FFN | B200 | 8192×28672, batch 512 | 50.5× | 91.4% |
| Graph GNN · ogbn-products | NVIDIA | GNN sparse | 49.2× | 98.0% |
| Mistral-7B Wanda | B200 | 70% sparse | 39.1× | 97.4% |
| GPT-J-6B MLP Pruned | B200 | 4096×16384, 40% | 35.7× | 96.9% |
| Llama-2-7B Pruned 70% | NVIDIA | 70% sparse | 29.6× | 96.0% |
| Llama-2-7B FFN 70% | H100 NVL | 4096×11008 | 22× | 95% |
| Reddit GNN | B200 | 114M edges, 99.79% | 18.2× | 94.5% |
| MusicGen-large FFN | NVIDIA | FFN sparse | 18.8× | 94.7% |
| KIMI K2.5 Expert Matrix | NVIDIA | Expert sparse | 9.7× | 89.7% |
| BERT-Base Pruned 90% | B200 | 768×3072, 90% | 6.2× | 79% |
| Google ViT-Huge Attention | B200 | 1280×1280, 90% | 4.0× | 75% |
| Synthetic 40–70% sparsity | B200 | 20k×20k, batch 5k | 46–63× | 98% |
| Pattern / Zeros | Platform | Config | Speedup | Energy Saved | Tokens/s |
|---|---|---|---|---|---|
| Random — 0% (fully dense) | MI300X | 20k×20k, rocBLAS | 21.52× | 95.35% | 2,637,715 |
| Random — 10–60% | MI300X | 20k×20k, rocBLAS | 21–22× | 95.4% | ~2.62–2.64M |
| Random — 70% sparse ★ | MI300X | 20k×20k, rocBLAS | 242× | 99.59% | 2,554,488 |
| Random — 80% sparse | MI300X | 20k×20k, rocBLAS | 163× | 99.39% | 2,549,781 |
| Random — 90% sparse | MI300X | 20k×20k, rocBLAS | 84.56× | 98.82% | 2,569,426 |
| Random — 95% sparse | MI300X | 20k×20k, rocBLAS | 43.60× | 97.71% | 2,544,139 |
| Power_law — 70% sparse | MI300X | 20k×20k, rocBLAS | 226× | 99.56% | 2,546,798 |
| Mistral-7B Wanda | MI300X | 70% sparse | 15.8× | 93.7% | — |
rolv hash always identical: 8dbe5f139fd946d4cd84e8cc…dad56dd8dd. rolv tokens/s ~2.6M across all sparsity levels. Effective TFLOPS: 2,000–2,110. Full Benchmarks PDF →
| Pattern / Zeros | Platform | Config | Speedup | Energy Saved | Tokens/s |
|---|---|---|---|---|---|
| Random — 0% (fully dense) | AMD EPYC 7B13 | 6k×6k, Batch 256 | 9.23× | 89.17% | 12,015 |
| Random — 10–70% | AMD EPYC 7B13 | 6k×6k, Batch 256 | 9.15–9.34× | 89.07–89.29% | ~12k |
| Random — 75% sparse ★ threshold | AMD EPYC 7B13 | 6k×6k, Batch 256 | 109.61× | 99.09% | 142,609 |
| Random — 80% sparse | AMD EPYC 7B13 | 6k×6k, Batch 256 | 107.58× | 99.07% | 140,068 |
| Random — 85% sparse | AMD EPYC 7B13 | 6k×6k, Batch 256 | 108.49× | 99.08% | 141,200 |
| Random — 90% sparse | AMD EPYC 7B13 | 6k×6k, Batch 256 | 116.67× | 99.14% | 151,039 |
| Random — 95% sparse | AMD EPYC 7B13 | 6k×6k, Batch 256 | 109.25× | 99.08% | 142,357 |
| Random — 99% sparse | AMD EPYC 7B13 | 6k×6k, Batch 256 | 95.93× | 98.96% | 124,606 |
rolvsparse© Sparse Memory Threshold (RSMT) activates at 75% zeros on AMD EPYC 7B13 CPU. rolv hash always identical: 8dbe5f139fd946d4cd84e8cc…dad56dd8dd. Dense baseline: 865 GFLOPS. Full Benchmarks PDF →
| Pattern / Zeros | Platform | Config | Speedup | Energy Saved | Tokens/s |
|---|---|---|---|---|---|
| Random — 0% (fully dense) | Intel Xeon | 4k×4k, Batch 500 | 7.93× | 87.40% | 14,029 |
| Random — 10–70% | Intel Xeon | 4k×4k, Batch 500 | 7.2–7.7× | 86.1–87.0% | 13,490–15,350 |
| Random — 80% sparse ★ threshold | Intel Xeon | 4k×4k, Batch 500 | 43.03× | 97.68% | 87,931 |
| Random — 90% sparse | Intel Xeon | 4k×4k, Batch 500 | 42.38× | 97.64% | 86,652 |
| Random — 95% sparse | Intel Xeon | 4k×4k, Batch 500 | 39.18× | 97.45% | 80,070 |
| Random — 99% sparse | Intel Xeon | 4k×4k, Batch 500 | 39.43× | 97.46% | 80,580 |
| Power_law — 80% | Intel Xeon | 4k×4k, Batch 500 | 37.78× | 97.35% | 77,501 |
| vs NVIDIA B200 dense (≥80%) | Intel Xeon | $2k vs $40k | Matches/Beats B200 | — | — |
Intel benchmarks: 4k×4k. NVIDIA: 20k×20k (25× larger). rolvsparse© Sparse Memory Threshold (RSMT) activates at 80% zeros on Intel Xeon. rolv hash always identical: 8dbe5f139fd946d4cd84e8cc…dad56dd8dd. Full Benchmarks PDF →
| Workload | Platform | Config | Speedup | Energy Saved |
|---|---|---|---|---|
| Synthetic 60–80% sparsity | TPU v5e-8 | JAX BCOO baseline | 30–62× | 97% |
| Dense baseline | TPU v5e-8 | XLA dense | 1.6–6.6× | 40–83% |
| rolv tokens/s | TPU v5e-8 | Production scale | 300–600k | — |
| Workload | Platform | Config | Speedup | Energy Saved |
|---|---|---|---|---|
| Synthetic 50–70% sparsity | Apple M4 | MPS Dense baseline | 3.6× | 72–75% |
| Sparse inference | Apple M-series | vs MPS sparse (numerically incorrect) | 10–70× | 72% |
| ViT-Base · Android On-Device | Mobile SoC | On-device sparse | 2.2× | 54.6% |
| EV First-Layer Vision Safety | Mobile/EV | Embedded inference | 2.3× | +36.7% range |
| EV Battery Mgmt & Range | Mobile/EV | Embedded inference | 2.1× | +33.4% range |
Note: Apple MPS sparse path produces incorrect outputs. rolvsparse© is the only numerically correct sparse path on Apple Silicon.
| Platform | Config | Dense Speedup | vs. Baseline |
|---|---|---|---|
| NVIDIA B200 (0% sparsity ★) | 20k×20k, fully dense | 63× | cuBLAS — 0% sparsity |
| NVIDIA B200 (40–70% sparsity) | 20k×20k | 46–63× | vs cuBLAS |
| NVIDIA H100 NVL | Dense baseline | 18–22× | vs cuBLAS/CSR |
| AMD MI300X | Dense baseline | 17–22× | vs rocBLAS |
| AMD EPYC 7B13 CPU | Dense baseline | ~9× | vs OpenBLAS |
| Intel Xeon | Dense baseline | 7–8× | vs MKL |
| Google TPU v5e-8 | Dense baseline | 1.6–6.6× | vs XLA |
| Apple M4 | Dense baseline | 3.6× | vs MPS |
★ 0% sparsity = fully dense matrix. rolvsparse© restructures computation layout regardless of data sparsity — applicable to any workload.
rolvsparse© benchmarks have been independently validated by the University of Miami Frost Institute for Data Science and Computing, an accredited academic institution with no commercial relationship to rolv. The institute confirmed the results as deterministic and fully reproducible across all tested hardware platforms, with identical numerical outputs on NVIDIA, AMD, Intel, TPU, and Apple hardware. Cryptographic output hashes are published for independent third-party verification, and all results ship with full methodology.
"Deterministic and reproducible results confirmed across all tested platforms." — Frost Institute Validation Report
rolvsparse© democratizes AI inference. Run our validation script on any hardware — a laptop, a cheap cloud VM, your workstation — and generate your own SHA-256 baseline hash. Send it to us and we'll return a full "Us vs. Them" report showing exactly how much faster and more efficient your workload becomes with rolvsparse©. The math proves itself.
The baseline hash is yours — generated entirely on your own hardware, from your own run. rolvsparse© must produce the exact same result hash to prove no precision is lost. That's the guarantee.
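The validation kit at rolv.ai is the authoritative script; a minimal sketch of the baseline-hash idea, assuming a fixed seed and a float64 reference matmul, looks like this:

```python
import hashlib
import numpy as np

def baseline_hash(matrix: np.ndarray, x: np.ndarray) -> str:
    """Run a deterministic reference matmul and hash the raw output bytes.
    Any implementation claiming bit-identical results must reproduce
    this digest exactly."""
    y = matrix.astype(np.float64) @ x.astype(np.float64)
    return hashlib.sha256(y.tobytes()).hexdigest()

rng = np.random.default_rng(42)              # fixed seed -> reproducible inputs
A = rng.standard_normal((512, 512))
x = rng.standard_normal(512)

h1 = baseline_hash(A, x)
h2 = baseline_hash(A, x)
assert h1 == h2                              # deterministic across runs
```

Because the digest is taken over the raw output bytes, any deviation in precision, even a single flipped bit, produces a completely different hash.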
Download Validation Kit →
The Frost Institute confirmed all rolvsparse© benchmarks as deterministic and reproducible on real hardware. No commercial interest; engaged solely to verify the accuracy and reproducibility of published results.
View Validation PDF →
A deterministic tolerance harness using NVIDIA Nsight confirms rolvsparse© outputs match the cuBLAS baseline within a validated floating-point tolerance. Reference code publicly available.
Download Validation Test →
Covers NVIDIA B200/H100, AMD MI300X, Intel Xeon, Google TPU v5e-8, and Apple M-series. Matrix dimensions, hardware configs, iteration counts, energy readings, and output hashes are all published.
Download Full Benchmarks →
The ROLV paper and all supporting materials are published on Zenodo, CERN's open research repository, with a permanent DOI. Indexed in OpenAIRE and citable in academic work.
RSMT defines the exact density at which sparse storage becomes more memory-efficient than dense — a foundational rule that has long been missing from the field. VRAM, not compute, is the dominant bottleneck in large-scale inference. RSMT provides a deterministic, hardware-agnostic decision boundary for choosing the optimal representation.
| Value Type | Index Type | b (value bytes) | i (index bytes) | RSMT d | Use sparse when… |
|---|---|---|---|---|---|
| float32 | int64 | 4 | 8 | 0.333 | density < 33% |
| float16 / BF16 | int64 | 2 | 8 | 0.200 | density < 20% |
| float32 | int32 | 4 | 4 | 0.500 | density < 50% |
| int8 | int32 | 1 | 4 | 0.200 | density < 20% |
Composite efficiency: (Sparsity × Energy Savings) / 100
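The thresholds in the table are consistent with a simple break-even rule: dense storage costs m·n·b bytes, while CSR-style sparse storage (one column index per non-zero, ignoring the small row-pointer array) costs nnz·(b+i), so sparse wins below density d = b/(b+i). The sketch below is inferred from the table, not taken from rolv's published derivation:

```python
def rsmt(value_bytes: int, index_bytes: int) -> float:
    """RSMT break-even density: sparse storage (value + index per non-zero)
    beats dense storage (value per element) when density < b / (b + i)."""
    return value_bytes / (value_bytes + index_bytes)

# Reproduce the table rows.
assert round(rsmt(4, 8), 3) == 0.333   # float32 / int64
assert round(rsmt(2, 8), 3) == 0.200   # float16 or bf16 / int64
assert round(rsmt(4, 4), 3) == 0.500   # float32 / int32
assert round(rsmt(1, 4), 3) == 0.200   # int8 / int32

def use_sparse(density: float, value_bytes: int, index_bytes: int) -> bool:
    """Deterministic, hardware-agnostic representation choice."""
    return density < rsmt(value_bytes, index_bytes)

assert use_sparse(0.30, 4, 8)          # 30% dense float32/int64: sparse wins
assert not use_sparse(0.40, 4, 8)      # 40% dense: stay dense
```

Note that narrower index types raise the threshold: int32 indices make sparse storage worthwhile at densities where int64 indices would not.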
rolv E. Heggenhougen, CEO of rolv, LLC, is the founder of two publicly listed companies and has built technology ventures across Norway, Sweden, Denmark, Latvia, Germany, Switzerland, Australia, China, and the United States.
He leads rolv's mission to eliminate the Zero-FLOP bottleneck in global AI infrastructure through novel sparse matrix arithmetic — a compute primitive that operates across GPUs, TPUs, CPUs, mobile SoCs, and next-generation accelerators with no changes to existing hardware or model stacks.
Mr. Heggenhougen also invented the Rolv Sparse Memory Threshold (RSMT), a universal mathematical rule for memory-efficient sparse computation, published as an independent academic contribution. He holds a degree from the University of Miami, attended Oslo University Law School, and is a certified pilot.
Fluent in Norwegian, Danish, and Swedish; working knowledge of German.