rolvsparse© is a new compute primitive that restructures how every AI processor handles matrix arithmetic — delivering up to 243× speedup and 99.5% energy reduction. Sparse and dense. Every platform. No hardware changes. No model retraining.
On NVIDIA B200, real Llama 4 Maverick MoE expert FFN weights (16384×5120, bfloat16, from HuggingFace) show 369K → 7.66M tokens/s — a 20.7× gain on identical hardware. Time-to-first-token drops 177×. Output hash-verified and canonical-checked.
up_proj · model-00001-of-00084.safetensors · 16384 × 5120 · bfloat16
72B params · Mixture-of-Experts · 8,192 × 28,672
up_proj · 256 experts × 2048×7168 → 524,288×7168 stacked · bfloat16 → fp32 · Sparsity 0.006% · Build time 0.11 s
We benchmarked the FFN layer at the architecture scale of GPT-4o and Claude 3.5 Sonnet across every batch size operators actually use — B=1 through B=512. The speedup increases as concurrency grows. At B=512 — where cuBLAS is fully optimised — ROLV delivers 68.7× (GPT-4o class) and 83× (Claude 3.5 class). Weights: synthetic fp32, architecture-matched dimensions. NVIDIA B200.
GPT-4o and Claude 3.5 Sonnet weights are not public. This benchmark uses synthetic matrices at architecture-matched dimensions — the standard methodology used by cuBLAS, FlashAttention, and vLLM for closed-model benchmarks. Weight distribution: Normal(0, 0.02), fp32. Sparsity: ~0.000009% (natural zeros only). ROLV's advantage is structural — it comes from the operator architecture, not from weight sparsity.
| Batch | Serving context | GPT-4o class speedup vs cuBLAS | Claude 3.5 class speedup vs cuBLAS | GPT-4o p99 (ms) | Claude 3.5 p99 (ms) | Energy saved |
|---|---|---|---|---|---|---|
| 1 | Single user · SLA-critical | 23.6× | 36.3× | 0.061 | 0.066 | 95–97% |
| 4 | Small burst | 33.0× | 59.7× | 0.057 | 0.053 | 97–98% |
| 16 | Enterprise API | 31.1× | 61.2× | 0.074 | 0.077 | 97–98% |
| 64 | High concurrency | 38.8× | 59.3× | 0.075 | 0.088 | 97–98% |
| 128 | Heavy serving | 52.1× | 68.7× | 0.100 | 0.134 | 98% |
| 256 | Datacenter batch | 60.5× | 77.5× | 0.151 | 0.202 | 98–99% |
| 512 | Max throughput — cuBLAS comfort zone | 68.7× | 83.0× | 0.252 | 0.360 | 98.5–98.8% |
GPT-4o class: 8 experts × (18,432×7,168) = 147,456×7,168. Claude 3.5 class: 8 experts × (28,672×8,192) = 229,376×8,192. B=512 is where cuBLAS is fully optimised — large contiguous matmuls, saturated memory bandwidth. cuBLAS p99 at B=512: 16.6 ms (GPT-4o), 29.0 ms (Claude 3.5). ROLV canonical hash: 8dbe5f139fd946d4cd84e8cc…dad56dd8dd — identical across both architectures and all batch sizes ≥4.
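The synthetic-weight methodology described above is straightforward to reproduce. A minimal sketch, assuming a plain Python generator (the function name and seed handling are illustrative, not the benchmark harness itself, and the real runs use the full architecture-matched shapes):

```python
import random

def synthetic_weights(rows, cols, std=0.02, seed=0):
    """Architecture-matched synthetic weights: i.i.d. Normal(0, std) entries,
    matching the Normal(0, 0.02) fp32 setup used for closed-model benchmarks."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

# Tiny illustrative shape; the GPT-4o-class run uses 8 experts x (18432 x 7168).
w = synthetic_weights(4, 8)
```

With a fixed seed, any party regenerates the same inputs, which is what makes published output hashes comparable across independent runs.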
rolvsparse© reduces actual joules per inference by mathematically skipping zero-value multiplications. On Llama 4 Maverick, energy drops from 786 J to 50.6 J per 1,000 iterations (a 93.6% reduction) with identical outputs.
For a hyperscaler with 100,000 GPUs and a $10B annual energy spend, rolvsparse©'s 65–99% savings translate to $6.5B–$9.9B annually. Hardware capex savings from needing fewer GPUs add a further $4B–$10B per year on a $20B spend.
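As a sanity check, the arithmetic behind these figures can be reproduced directly from the numbers quoted above:

```python
# Energy per 1,000 iterations on Llama 4 Maverick (figures from the text).
joules_baseline, joules_rolv = 786.0, 50.6
reduction = 1 - joules_rolv / joules_baseline   # ~0.936, i.e. 93.6%

# Hyperscaler savings at a $10B annual energy spend, 65-99% range.
energy_spend = 10e9
low, high = 0.65 * energy_spend, 0.99 * energy_spend   # $6.5B and $9.9B
```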
rolvsparse© is not a sparsity-only optimization. At 0% sparsity — fully dense matrices — it achieves 63× speedup on NVIDIA B200 versus cuBLAS by restructuring memory access and computation layout at the arithmetic level. Every AI workload benefits: dense transformer layers, attention heads, embedding lookups — no model modification needed.
This result establishes rolvsparse© as a universal compute primitive. The library restructures how matrix operations are dispatched and computed independently of data sparsity. Paired with real-world sparsity, speedups compound to 193× on production workloads.
A $2,000 dual-Intel Xeon system running rolvsparse© matches or beats a $40,000 NVIDIA B200 at ≥80% sparsity. AMD MI300X achieves 242× sparse speedup; AMD EPYC 7B13 CPU achieves 117× at 90% sparsity. This is a structural break in AI infrastructure economics. Intel benchmarks were run on 4k×4k matrices and NVIDIA on 20k×20k (25× more elements), making the comparison conservative in NVIDIA's favor.
At ≥80% sparsity, a $2,000 dual-Xeon server running rolvsparse© matches or beats a $40,000 B200 running optimised cuBLAS without rolv. The hardware-cost gap is 20×; the tokens/s gap disappears. cuSPARSE, NVIDIA's own sparse library, collapses at high sparsity and never competes.
| Sparsity | Intel Xeon + rolvsparse© (tokens/s) | NVIDIA B200 cuBLAS · no rolv (tokens/s) | NVIDIA B200 cuSPARSE (tokens/s) | Hardware cost | Verdict |
|---|---|---|---|---|---|
| 70% | ~15,000 | ~80,000 | ~854 | $2k vs $40k | GPU ahead |
| 80% | ~87,900 | ~80,000 | ~1,199 | $2k vs $40k | $2k CPU overtakes $40k GPU |
| 90% | ~86,600 | ~80,000 | ~2,389 | $2k vs $40k | rolv ahead; cuSPARSE collapses; 20× cheaper |
| 95% | ~80,000 | ~80,000 | ~5,044 | $2k vs $40k | $2,000 CPU = $40,000 GPU |
| 99% | ~80,500 | ~80,000 | ~21,487 | $2k vs $40k | rolv Intel still ahead |
Intel 4k×4k matrices · NVIDIA 20k×20k (25× larger). At equal matrix sizes rolv's advantage would be greater. This comparison is conservative in NVIDIA's favour. Hardware cost: Intel ~$2,000 vs NVIDIA B200 ~$35,000–$40,000.
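One way to read the table is throughput per hardware dollar. A quick sketch using the 90%-sparsity row, with tokens/s and prices as quoted above:

```python
# 90% sparsity row: tokens/s and hardware cost from the table.
cpu_tokens, cpu_cost = 86_600, 2_000     # Intel Xeon + rolvsparse©
gpu_tokens, gpu_cost = 80_000, 40_000    # NVIDIA B200 cuBLAS, no rolv

cpu_eff = cpu_tokens / cpu_cost          # ~43.3 tokens/s per dollar
gpu_eff = gpu_tokens / gpu_cost          # 2.0 tokens/s per dollar
advantage = cpu_eff / gpu_eff            # ~21.6x cost efficiency
```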
On AMD MI300X, rolvsparse© delivers up to 242× speedup versus rocBLAS at 70% sparsity (random pattern), with 99.59% energy savings. Dense matrices (0% sparsity) achieve a consistent 21–22× speedup. Effective throughput reaches 2,000–2,110 TFLOPS against the rocBLAS baseline, and rolvsparse© sustains ~2.6M tokens/s across all sparsity levels.
All benchmarks published with full methodology — matrix dimensions, hardware configs, iteration counts, energy readings, and cryptographic hashes. Any party can verify using reference code at rolv.ai.
20k×20k matrices · batch 5k · 1,000 iterations. Intel/AMD CPU at smaller sizes.
| Platform | Dense speedup | Sparse speedup | Energy savings | Tokens/s (rolv) | Eff. throughput |
|---|---|---|---|---|---|
| NVIDIA B200 / H100 | ~63× | up to 243× | 98–99.6% | ~5.1M | 4,087–4,095 |
| AMD MI300X | 17–22× | up to 242× | 94–99.6% | ~2.6M | 2,000–2,110 |
| AMD EPYC 7B13 CPU | ~9× | up to 117× | 89–99.1% | 12k–151k | 865–2,566 GFLOPS |
| Intel Xeon CPU | 7–8× | up to 43× | 87–97.7% | 14k–88k | 449–563 GFLOPS |
| Google TPU v5e-8 | 1.6–6.6× | 3–62× | 40–97% | 300–600k | ~900 GFLOPS |
| Apple M4 | 3.6× | 10–70× | 72–75% | 145–800k | ~10 TFLOPS |
rolvsparse© benchmarks have been independently validated by the University of Miami Frost Institute for Data Science and Computing — an accredited academic institution with no commercial relationship to rolv. All results are deterministic, reproducible, and published with full methodology.
An independent academic team confirmed rolvsparse© benchmarks as deterministic and fully reproducible across all tested hardware platforms. Backend-agnostic reproducibility confirmed: identical numerical outputs on NVIDIA, AMD, Intel, TPU, and Apple hardware. Cryptographic output hashes published for independent third-party verification.
"Deterministic and reproducible results confirmed across all tested platforms." — Frost Institute Validation Report
rolvsparse© democratizes AI inference. Run our validation script on any hardware — a laptop, a cheap cloud VM, your workstation — and generate your own SHA-256 baseline hash. Send it to us and we'll return a full "Us vs. Them" report showing exactly how much faster and more efficient your workload becomes with rolvsparse©. The math proves itself.
The baseline hash is yours — generated entirely on your own hardware, from your own run. rolvsparse© must produce the exact same result hash to prove no precision is lost. That's the guarantee.
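A minimal sketch of what generating such a baseline hash could look like. The actual validation kit's serialisation format is not specified here, so the function names, the plain reference matmul, and the row-major float32 packing are all assumptions:

```python
import hashlib
import struct

def matmul(a, b):
    """Plain reference matrix multiply over lists of lists of floats."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def baseline_hash(matrix):
    """SHA-256 over a row-major, little-endian float32 serialisation."""
    buf = b"".join(struct.pack("<f", v) for row in matrix for v in row)
    return hashlib.sha256(buf).hexdigest()

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
h = baseline_hash(matmul(a, b))  # deterministic on any machine
```

Because the serialisation is byte-exact, two runs match only if every output value is bit-identical, which is the no-precision-loss guarantee the hash is meant to encode.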
Download Validation Kit (v2.0) →
The Frost Institute confirmed all rolvsparse© benchmarks as deterministic and reproducible on real hardware. No commercial interest. Engaged solely to verify accuracy and reproducibility of published results.
View Validation PDF →
A deterministic tolerance harness using NVIDIA Nsight confirms rolvsparse© produces outputs matching the cuBLAS baseline within validated floating-point tolerance. Reference code publicly available.
Download Validation Test →
Covers NVIDIA B200/H100, AMD MI300X, Intel Xeon, Google TPU v5e-8, and Apple M-series. Matrix dimensions, hardware config, iteration counts, energy readings, and output hashes all published.
Download Full Benchmarks →
RSMT defines the exact density at which sparse storage becomes more memory-efficient than dense, a foundational rule that has long been missing from the field: with b-byte values and i-byte indices (one index per stored value), sparse wins when density d < b/(b+i). VRAM, not compute, is the dominant bottleneck in large-scale inference. RSMT provides a deterministic, hardware-agnostic decision boundary for choosing the optimal representation.
| Value type | Index type | b (value bytes) | i (index bytes) | RSMT d | Use sparse when… |
|---|---|---|---|---|---|
| float32 | int64 | 4 | 8 | 0.333 | density < 33% |
| float16 / BF16 | int64 | 2 | 8 | 0.200 | density < 20% |
| float32 | int32 | 4 | 4 | 0.500 | density < 50% |
| int8 | int32 | 1 | 4 | 0.200 | density < 20% |
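Every row of the table follows a single rule: storing a dense m×n matrix costs m·n·b bytes, while a sparse layout with one i-byte index per stored value costs d·m·n·(b+i) bytes, so sparse wins exactly when d < b/(b+i). A sketch of that decision boundary (the COO-style one-index-per-value layout is an assumption, but it is consistent with every table row):

```python
def rsmt_threshold(value_bytes, index_bytes):
    """RSMT: sparse storage beats dense when density < b / (b + i)."""
    return value_bytes / (value_bytes + index_bytes)

def use_sparse(density, value_bytes, index_bytes):
    """Deterministic representation choice for a given matrix density."""
    return density < rsmt_threshold(value_bytes, index_bytes)

t = rsmt_threshold(4, 8)   # float32 values, int64 indices -> 0.333...
```

The thresholds in the table (0.333, 0.200, 0.500, 0.200) all fall out of this one expression.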
Composite efficiency: (Sparsity × Energy Savings) / 100
Rolv E. Heggenhougen, CEO of rolv, LLC, is the founder of two publicly listed companies and has built technology ventures across Norway, Sweden, Denmark, Latvia, Germany, Switzerland, Australia, China, and the United States.
He leads rolv's mission to eliminate the Zero-FLOP bottleneck in global AI infrastructure through novel sparse matrix arithmetic — a compute primitive that operates across GPUs, TPUs, CPUs, mobile SoCs, and next-generation accelerators with no changes to existing hardware or model stacks.
Mr. Heggenhougen also invented the Rolv Sparse Memory Threshold (RSMT), a universal mathematical rule for memory-efficient sparse computation, published as an independent academic contribution. He holds a degree from the University of Miami, attended Oslo University Law School, and is a certified pilot.
Fluent in Norwegian, Danish, and Swedish; working knowledge of German.