rolvsparse© is a new compute primitive that restructures how every AI processor handles matrix arithmetic — delivering up to 133.5× real-world speedup on Llama-4 Maverick and 99.9% energy reduction. Real weights. Every platform. No hardware changes. No model retraining.
On NVIDIA B200, real Llama-4 Maverick MoE expert FFN weights deliver a 133.5× throughput gain — with 99.9% energy saved and 52.1× TTFT speedup. Llama-4 400B hits 125.3× speedup and 100.9× TTFT. DeepSeek-R1 delivers 44.2×. Output hash-verified and canonical-checked.
Sample real-weight tensors used in these benchmarks:
- up_proj · model-00001-of-00084.safetensors · 16,384 × 5,120 · bfloat16
- 72B params · Mixture-of-Experts · 8,192 × 28,672
- up_proj · 256 experts × 2,048 × 7,168 → 524,288 × 7,168 stacked · bfloat16 → fp32 · Sparsity 0.006% · Build time 0.11 s
We benchmarked the FFN layer at the architecture scale of GPT-4o and Claude 3.5 Sonnet across the batch sizes serving operators actually use, B=1 through B=512. The speedup increases as concurrency grows. At B=512, where cuBLAS is fully optimised, ROLV delivers 68.7× (GPT-4o class) and 83.0× (Claude 3.5 class). Architecture-matched dimensions, synthetic fp32 weights (standard methodology for closed models). NVIDIA B200.
| Batch | Serving context | GPT-4o Class speedup vs cuBLAS | Claude 3.5 Class speedup vs cuBLAS | GPT-4o p99 (ms) | Claude 3.5 p99 (ms) | Energy saved |
|---|---|---|---|---|---|---|
| 1 | Single user · SLA-critical | 23.6× | 36.3× | 0.061 | 0.066 | 95–97% |
| 4 | Small burst | 33.0× | 59.7× | 0.057 | 0.053 | 97–98% |
| 16 | Enterprise API | 31.1× | 61.2× | 0.074 | 0.077 | 97–98% |
| 64 | High concurrency | 38.8× | 59.3× | 0.075 | 0.088 | 97–98% |
| 128 | Heavy serving | 52.1× | 68.7× | 0.100 | 0.134 | 98% |
| 256 | Datacenter batch | 60.5× | 77.5× | 0.151 | 0.202 | 98–99% |
| 512 | Max throughput — cuBLAS comfort zone | 68.7× | 83.0× | 0.252 | 0.360 | 98.5–98.8% |
GPT-4o class: 8 experts × (18,432×7,168) = 147,456×7,168. Claude 3.5 class: 8 experts × (28,672×8,192) = 229,376×8,192. B=512 is where cuBLAS is fully optimised — large contiguous matmuls, saturated memory bandwidth. cuBLAS p99 at B=512: 16.6 ms (GPT-4o), 29.0 ms (Claude 3.5). ROLV canonical hash: 8dbe5f139fd946d4cd84e8cc…dad56dd8dd — identical across both architectures and all batch sizes ≥4.
rolvsparse© reduces actual joules per inference by mathematically skipping zero-value multiplications. On Llama-4 Maverick, energy drops from 786 J to 50.6 J per 1,000 iterations, a 93.6% reduction, with identical outputs.
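As a rough illustration of the principle (not the proprietary rolvsparse© kernel), the sketch below uses a generic SciPy CSR matrix-vector product to show how touching only stored non-zeros removes the multiply-accumulates, and therefore the joules, that a dense matmul spends on zero-value entries; the matrix size and sparsity level are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustration only: a generic CSR mat-vec, not the rolvsparse© kernel.
rng = np.random.default_rng(0)
dense = rng.standard_normal((4096, 4096)).astype(np.float32)
dense[rng.random(dense.shape) < 0.90] = 0.0   # ~90% zeros (illustrative)
x = rng.standard_normal(4096).astype(np.float32)

sparse = csr_matrix(dense)     # stores only the non-zero values plus indices

y_dense = dense @ x            # dense matmul: 2*M*N FLOPs regardless of zeros
y_sparse = sparse @ x          # sparse mat-vec: ~2*nnz FLOPs, identical output

assert np.allclose(y_dense, y_sparse, rtol=1e-3, atol=1e-3)
print(f"dense FLOPs ~ {2 * dense.size:,}, sparse FLOPs ~ {2 * sparse.nnz:,}")
```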
For a hyperscaler with 100,000 GPUs and a $10B annual energy spend, rolvsparse©'s 65–99% savings translate to $6.5B–$9.9B annually. Hardware capex savings from needing fewer GPUs add a further $4B–$10B per year on a $20B hardware spend.
| Model / Workload | Hardware | Speedup | Energy Saved | Tokens/sec (rolv) | TTFT Speedup |
|---|---|---|---|---|---|
| Llama-4 Maverick 400B real weights | NVIDIA B200 | 133.5× | 99.9% | 149,514 | 52.1× |
| Llama-4 400B real weights | NVIDIA B200 | 125.3× | 99.4% | 852,680 | 100.9× |
| DeepSeek-R1 · 256 MoE experts real weights | NVIDIA B200 | 44.2× | 98.7% | 704,363 | 41.6× |
| Mixtral-8×22B (56 layers) | NVIDIA B200 | 35.1× | 98.2% | 2,266,374 | 28.2× |
| Llama-3 70B FFN real weights | NVIDIA B200 | 50.5× | 98.0% | 7,179,519 | — |
| Mistral-7B Wanda | NVIDIA B200 | 39.1× | 97.4% | — | — |
| Mistral-7B Wanda | AMD MI300X | 15.8× | 93.7% | — | — |
| Qwen2.5-72B MoE FFN | NVIDIA B200 | 50.5× | 91.4% | 6,740,529 | — |
| Kimi K2.5 (~1T MoE) | NVIDIA B200 | 10.5× | 90.6% | 490,929 | 29.7× |
| Qwen2.5-32B FFN real HF weights · 27,648×5,120 | Google TPU v5e-8 | 5.9× | 83.0% | 3,924,124 | — |
| Qwen3.5-35B-A3B-GPTQ-Int4 · 64 experts stacked real HF weights · GPTQ-Int4 | NVIDIA B200 | 9.4× | 89.3% | 127,076,958 | — |
| GLM-OCR · 24 layers stacked real HF weights · #1 OmniDocBench · 0.9B | NVIDIA B200 | 50.0× | 98.0% | 318,848,172 | — |
| BERT-Base Real FFN real HF weights · 0% sparsity | Intel Xeon | 12.3× | 91.8% | 103,801 | — |
| GPT-4o Class · B=512 synthetic | NVIDIA B200 | 68.7× | 98.5% | 2,125,994 | 40.1× |
| Claude 3.5 Sonnet Class · B=512 synthetic | NVIDIA B200 | 83.0× | 98.8% | 1,467,584 | 56.3× |
A $2,000 dual-Intel Xeon system running rolvsparse© matches or beats a $40,000 NVIDIA B200 at ≥80% sparsity. AMD MI300X achieves 242× sparse speedup. AMD EPYC 7B13 CPU achieves 117× at 90% sparsity. This is a structural break in AI infrastructure economics. Intel benchmarks were run on 4k×4k matrices; NVIDIA on 20k×20k (25× larger), making the comparison conservative in NVIDIA's favour.
At ≥80% sparsity, a $2,000 dual-Xeon server running rolvsparse© matches or beats a $40,000 B200 running optimised cuBLAS without any rolv acceleration. The hardware-cost gap is 20×; the tokens/s gap disappears. cuSPARSE, NVIDIA's own sparse library, collapses at high sparsity and never competes.
| Sparsity | Intel Xeon + rolvsparse© (tokens/s) | NVIDIA B200 cuBLAS · no rolv (tokens/s) | NVIDIA B200 cuSPARSE (tokens/s) | Hardware Cost | Verdict |
|---|---|---|---|---|---|
| 70% | ~15,000 | ~80,000 | ~854 | $2k vs $40k | GPU ahead |
| 80% | ~87,900 | ~80,000 | ~1,199 | $2k vs $40k | $2k CPU overtakes $40k GPU |
| 90% | ~86,600 | ~80,000 | ~2,389 | $2k vs $40k | rolv ahead; cuSPARSE collapses; 20× cheaper |
| 95% | ~80,000 | ~80,000 | ~5,044 | $2k vs $40k | $2,000 CPU = $40,000 GPU |
| 99% | ~80,500 | ~80,000 | ~21,487 | $2k vs $40k | rolv Intel still ahead |
Intel 4k×4k matrices · NVIDIA 20k×20k (25× larger). At equal matrix sizes rolv's advantage would be greater. This comparison is conservative in NVIDIA's favour. Hardware cost: Intel ~$2,000 vs NVIDIA B200 ~$35,000–$40,000.
rolvsparse© runs on-device — Android SoCs, automotive compute modules, embedded safety systems. No cloud dependency. No hardware swap. The same operator that accelerates frontier LLMs on NVIDIA B200 runs on a $200 phone chip and extends EV battery range by restructuring how sparse matrices are computed at the arithmetic level.
Real model weights from HuggingFace for all open models. Architecture-matched synthetic fp32 used only for closed models (GPT-4o / Claude 3.5) — the standard methodology. All results available upon request with full methodology, hash verification, and independent validation below.
Real model weights from HuggingFace where available. Architecture-matched synthetic fp32 for closed models (GPT-4o / Claude 3.5 Sonnet) — standard methodology.
| Model | Matrix | Speedup | Energy Saved | Eff. TFLOPS † | Tokens/sec | TTFT Speedup |
|---|---|---|---|---|---|---|
| Llama-4 Maverick 400B | 655,360×16,384 · 128E | 133.5× | 99.9% | 3,210.8 | 149,514 | 52.1× |
| Llama-4 400B | 393,216×16,384 · 8E | 125.3× | 99.4% | 10,986.7 | 852,680 | 100.9× |
| Llama-4 Scout | 40,960×16,384 · 16E | 34.0× | 98.8% | 5,096.4 | 3,797,089 | 11.7× |
| DeepSeek-R1 | 524,288×7,168 · 256E | 44.2× | 98.7% | 5,294.1 | 704,363 | 41.6× |
| DeepSeek-V3 | 524,288×7,168 · 256E | 1.4× | 98.7% | 5,072.5 | 674,873 | 4.5× |
| Mixtral-8×22B (56 layers) | 131,072×6,144 · 8E | 35.1× | 98.2% | 3,650.3 | 2,266,374 | 28.2× |
| Kimi K2.5 (~1T MoE) | 786,432×896 · 384E | 10.5× | 90.6% | 691.9 | 490,929 | 29.7× |
| Qwen3-235B (16E) | 24,576×4,096 · 16E | 7.8× | 95.5% | 1,357.0 | 6,740,529 | 3.4× |
| Qwen3-235B (8E) | 12,288×4,096 · 8E | 4.3× | 93.7% | 867.4 | 8,616,776 | 2.1× |
| GPT-4o Class · B=512 ★ | 147,456×7,168 · 8E · synthetic | 68.7× | 98.5% | 4,494.2 | 2,125,994 | 40.1× |
| Claude 3.5 Sonnet Class · B=512 ★ | 229,376×8,192 · 8E · synthetic | 83.0× | 98.8% | 5,515.3 | 1,467,584 | 56.3× |
★ GPT-4o and Claude 3.5 Sonnet weights are not public — architecture-matched synthetic fp32, standard methodology. All other rows use real weights from HuggingFace. NVIDIA B200. Hash-verified.
† Effective TFLOPS explained: This column shows effective TFLOPS — computed as the nominal FLOPs of the equivalent dense matmul (2 × M × K × N) divided by ROLV's actual wall-clock time. When ROLV's result exceeds the hardware's theoretical peak (e.g. ~4.5 PFLOPS bfloat16 on B200), it means ROLV is doing far fewer multiply-accumulate operations than the dense baseline to produce the same output — not that the silicon is running faster than physics allows. The metric answers the question: how many dense-equivalent FLOPs per second is ROLV delivering? Values above hardware peak are proof of work reduction, not a measurement error.
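In code form, the metric defined above reduces to a one-line formula; the helper below is a hypothetical illustration, not part of any published rolv tooling, and the example inputs are placeholders rather than measured figures from the tables.

```python
def effective_tflops(m: int, k: int, n: int, wall_clock_s: float) -> float:
    """Dense-equivalent throughput: nominal FLOPs of a dense (M×K)·(K×N) matmul
    divided by the measured wall-clock time, reported in TFLOPS."""
    return (2 * m * k * n) / wall_clock_s / 1e12

# Placeholder inputs for illustration only (not numbers from the tables above):
print(f"{effective_tflops(147_456, 7_168, 512, 0.0002):,.1f} effective TFLOPS")
```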
Real weights. Multiple hardware platforms. † Eff. TFLOPS = nominal dense FLOPs ÷ ROLV wall-clock time — values above hardware peak reflect work reduction, not a measurement error.
| Model | Hardware | Sparsity | Speedup | Energy Saved | Eff. TFLOPS † | Tokens/sec |
|---|---|---|---|---|---|---|
| Llama-3 70B FFN | NVIDIA B200 | 50% | 50.53× | 98.0% | 3,372.7 | 7,179,519 |
| Mistral-7B Wanda | NVIDIA B200 | 55% | 39.1× | 97.4% | — | — |
| Llama-2-7B FFN | NVIDIA H100 | 70% | 22.06× | 95.5% | 236.9 | 8,757,286 |
| Mistral-7B Wanda | AMD MI300X | 55% | 15.8× | 93.7% | — | — |
| Llama-3.1-8B FFN | Google TPU v5e-1 | 0% | 8.4× | 88.2% | — | 12,902,131 |
| Qwen2.5-32B FFN real HF weights | Google TPU v5e-8 | 0% | 5.9× | 83.0% | — | 3,924,124 |
| Qwen3.5-35B-A3B · 64 experts stacked real HF weights · GPTQ-Int4 · 81,920×512 | NVIDIA B200 | 61.7% | 9.4× | 89.3% | 10,659.99 | 127,076,958 |
| GLM-OCR · 24 layers stacked real HF weights · gate_proj · 98,304×1,024 | NVIDIA B200 | 0% | 50.0× | 98.0% | 64,192.62 † | 318,848,172 |
| BERT-Large FFN | AMD MI300X | 70% | 4.84× | 79.3% | 26.6 | 10,578,270 |
| BERT-Large FFN | NVIDIA H100 | 70% | 3.55× | 71.8% | 39.5 | 15,694,435 |
A $2,000 Intel Xeon system matches or beats a $40,000 NVIDIA B200 at ≥80% sparsity. Real model weights.
| Model / Use Case | Sparsity | Speedup vs Dense | Energy Saved | Tokens/sec (ROLV) |
|---|---|---|---|---|
| GPT-J-6B FFN | 40% | 314.6× | 99.7% | 38,154 |
| Mistral-7B FFN | 0% | 253.6× | 99.6% | 36,450 |
| Llama-2-7B FFN | 70% | 169.2× | 99.4% | 43,116 |
| Kimi K2.5 Expert Slice | — | 40.3× | 97.9% | 84,413 |
| BERT-Base FFN | 0% | 12.3× | 91.8% | 103,801 |
| Finite Element Solver | 80% | 112.48× | 99.1% | — |
Randomised and structured sparsity patterns on each hardware platform, with architecture-matched dimensions.
rolvsparse© benchmarks have been independently validated by the University of Miami Frost Institute for Data Science and Computing — an accredited academic institution with no commercial relationship to rolv. All results are deterministic, reproducible, and hash-verified across every platform.
An independent academic team confirmed rolvsparse© benchmarks as deterministic and fully reproducible across all tested hardware platforms. Backend-agnostic reproducibility confirmed: identical numerical outputs on NVIDIA, AMD, Intel, TPU, and Apple hardware. Cryptographic SHA-256 output hashes published for independent third-party verification.
"Deterministic and reproducible results confirmed across all tested platforms." — Frost Institute Validation Report
Run our verification script on your own hardware and get a cryptographic SHA-256 fingerprint of the result. Email the JSON to rolv@rolv.ai — we run the same computation through rolvsparse© on identical inputs, produce the identical output hash, and return a full "Us vs. Them" comparison report showing your exact speedup and energy savings.
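A minimal sketch of this kind of fingerprinting, assuming SHA-256 over a canonical byte layout of the result tensor; the function name, the float32 canonicalisation, and the JSON shape are illustrative assumptions, not the format of rolv's actual verification script.

```python
import hashlib
import json
import numpy as np

def fingerprint(result: np.ndarray) -> str:
    """SHA-256 over a canonical (contiguous float32) byte view of a result tensor.
    Illustrative only: the real verification kit defines its own canonicalisation."""
    canonical = np.ascontiguousarray(result, dtype=np.float32)
    return hashlib.sha256(canonical.tobytes()).hexdigest()

# Example: fingerprint a small matmul result and emit JSON for comparison.
a = np.ones((64, 64), dtype=np.float32)
b = np.full((64, 64), 2.0, dtype=np.float32)
print(json.dumps({"shape": list((a @ b).shape), "sha256": fingerprint(a @ b)}))
```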
The Frost Institute confirmed all rolvsparse© benchmarks as deterministic and reproducible on real hardware across every tested platform. No commercial interest. Engaged solely to verify accuracy and reproducibility.
Identical numerical outputs confirmed on NVIDIA, AMD, Intel, TPU, and Apple hardware. The cryptographic hash 8dbe5f139fd946d4cd84e8cc…dad56dd8dd is the same across every platform and sparsity level.
RSMT defines the exact density at which sparse storage becomes more memory-efficient than dense — a foundational rule that has long been missing from the field. VRAM, not compute, is the dominant bottleneck in large-scale inference. RSMT provides a deterministic, hardware-agnostic decision boundary for choosing the optimal representation; the sketch after the table below illustrates the implied rule.
| Value Type | Index Type | b (value bytes) | i (index bytes) | RSMT density d | Use sparse when… |
|---|---|---|---|---|---|
| float32 | int64 | 4 | 8 | 0.333 | density < 33% |
| float16 / BF16 | int64 | 2 | 8 | 0.200 | density < 20% |
| float32 | int32 | 4 | 4 | 0.500 | density < 50% |
| int8 | int32 | 1 | 4 | 0.200 | density < 20% |
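A minimal sketch of the rule the table implies, assuming sparse storage costs one index per stored value and dense storage costs only the values; the formula d = b / (b + i) is inferred from the thresholds above rather than quoted from rolv's publication.

```python
def rsmt_density(value_bytes: int, index_bytes: int) -> float:
    """Inferred RSMT decision boundary.

    Dense storage of N values costs N * value_bytes bytes.
    Sparse storage with one index per stored value costs roughly
    density * N * (value_bytes + index_bytes) bytes.
    Sparse wins whenever density < value_bytes / (value_bytes + index_bytes).
    """
    return value_bytes / (value_bytes + index_bytes)

# Reproduces the thresholds in the table above:
assert round(rsmt_density(4, 8), 3) == 0.333  # float32 values, int64 indices
assert rsmt_density(2, 8) == 0.2              # float16 / BF16 values, int64 indices
assert rsmt_density(4, 4) == 0.5              # float32 values, int32 indices
assert rsmt_density(1, 4) == 0.2              # int8 values, int32 indices
```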
Composite efficiency: (Sparsity × Energy Savings) / 100
Rolv E. Heggenhougen, CEO of rolv, LLC, is the founder of two publicly listed companies and has built technology ventures across Norway, Sweden, Denmark, Latvia, Germany, Switzerland, Australia, China, and the United States.
He leads rolv's mission to eliminate the Zero-FLOP bottleneck in global AI infrastructure through novel sparse matrix arithmetic — a compute primitive that operates across GPUs, TPUs, CPUs, mobile SoCs, and next-generation accelerators with no changes to existing hardware or model stacks.
Mr. Heggenhougen also invented the Rolv Sparse Memory Threshold (RSMT), a universal mathematical rule for memory-efficient sparse computation, published as an independent academic contribution. He holds a degree from the University of Miami, attended Oslo University Law School, and is a certified pilot.
Fluent in Norwegian, Danish, and Swedish; working knowledge of German.