Patent-Pending · Software-Only · No Hardware Changes · No Model Retraining
A software compute primitive that restructures matrix arithmetic to eliminate zero-valued multiply-accumulate operations. Works on fully dense matrices (0% sparsity) through 99%+ sparse. No hardware changes. No model retraining. One SHA-256 hash across every platform.
All models downloaded directly from HuggingFace and run on real hardware. Compared vs vendor-optimised cuBLAS / rocBLAS. Energy via NVML live power polling. SHA-256 output hash confirmed canonical across every platform.
Real production weights — Llama 4, DeepSeek-R1, Qwen, Mixtral, Kimi — downloaded directly and run on NVIDIA B200. Compared vs vendor-optimised cuBLAS (dense). SHA-256 hash verified. Updated March 2026.
University of Miami Frost Institute validated. ↓ Full PDF
The standard objection to sparse operators: they only work on pre-pruned models. rolvsparse© disproves this. NVIDIA Nemotron-3 Super 120B — real FP8 HuggingFace weights, 0.00% sparsity, density exactly 1.0 — delivers 21.8× speedup and 95.4% energy reduction on a fully dense matrix. The same operator that handles 0% also scales continuously to 164.6× total (885× per-iteration) at 99%+ sparsity. One library, no configuration changes.
Same operator · same library · same SHA-256 hash · no configuration changes between sparsity levels · no model retraining · no hardware changes.
HP All-in-One (Intel i7-1165G7, ~$1,000, Windows 11). Real HuggingFace weights, 0% sparsity — fully dense on every model. Microsoft Phi-4 14B: 76× faster than Intel MKL, 109,646 tokens/second, 98.7% less energy. Mistral-7B: 127× faster, 0.7 ms TTFT. At ≥80% sparsity a $2,000 dual-Xeon server matches or overtakes a $40,000 NVIDIA B200 running cuBLAS.
| Sparsity | $2k Xeon+rolv | $40k B200 | Verdict |
|---|---|---|---|
| 70% | ~15k | ~80k | GPU ahead |
| 80% | ~88k | ~80k | $2k overtakes $40k |
| 90% | ~87k | ~80k | 20× cheaper, same speed |
| 99% | ~80k | ~80k | rolv still ahead |
Intel 4k×4k vs NVIDIA 20k×20k — conservative in NVIDIA's favour. cuSPARSE collapses above 80% sparsity.
Numbers represent total speedup including build time. All on NVIDIA B200 vs cuBLAS. Real HuggingFace weights except Claude 3.5-class (architecture-matched synthetic, standard methodology).
Think of it this way: a dense GPU calculation is like paying a full workforce to move 1,000 boxes — even though 900 of them are empty. rolvsparse© only moves the boxes with something in them. The work gets done faster because less work was actually required.
Formally: Effective TFLOPS = nominal dense FLOPs ÷ rolv wall-clock time. When this exceeds the GPU's rated peak (~1,800 TFLOPS on B200) it is not a physics violation — it means fewer multiply-accumulate operations were executed. Dense TFLOPS = true silicon utilisation. Effective TFLOPS = work elimination.
† B200 hardware peak: ~1,800 TFLOPS dense. Values above this reflect work elimination, not measurement error.
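The arithmetic behind that distinction is easy to check. A minimal Python sketch, using the standard 2n³ FLOP convention for a dense n×n matmul and an illustrative wall-clock time (the 1,800 TFLOPS peak is the B200 figure quoted above; the timing is hypothetical, not a measured result):

```python
def effective_tflops(n: int, wall_clock_s: float) -> float:
    """Effective TFLOPS = nominal dense FLOPs / wall-clock time.

    A dense n x n matmul nominally costs 2*n^3 FLOPs (one multiply
    plus one add per element pair), whether or not the kernel
    actually executed every one of those operations.
    """
    nominal_flops = 2 * n**3
    return nominal_flops / wall_clock_s / 1e12

B200_PEAK_TFLOPS = 1_800  # approximate dense hardware peak quoted above

# Illustrative: a 20,000 x 20,000 matmul finishing in 2 ms.
eff = effective_tflops(20_000, 2e-3)
print(f"effective: {eff:,.0f} TFLOPS")  # prints "effective: 8,000 TFLOPS"
if eff > B200_PEAK_TFLOPS:
    # Not a physics violation: the kernel skipped zero-valued
    # multiply-accumulates, so fewer real FLOPs were executed.
    print("exceeds rated peak -> work eliminated, not accelerated")
```

Dense TFLOPS measured against the FLOPs actually executed can never exceed the peak; only the nominal-FLOP numerator can push the effective figure above it.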
Every AI API bills by the token. More tokens per second from the same GPU = lower cost per token, more users served, or both. At 164.6× speedup on Llama 4 400B: a single GPU now produces what previously required a rack of 164 GPUs.
| Model | cuBLAS tok/s | rolv tok/s | Energy saved |
|---|---|---|---|
| Llama 4 Maverick 400B | 169 | 149,514 | 99.9% |
| Llama 4 400B (8E) | 5,180 | 852,680 | 99.4% |
| GLM-OCR · 24 layers | 6.4M | 318M | 98.0% |
| Claude 3.5-class B=512 | 17,680 | 1,467,584 | 98.8% |
| Llama-2-7B pruned (H100) | 397k | 8,757,286 | 95.5% |
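The speedup multiples quoted elsewhere on this page are simply the ratio of the two throughput columns. A quick sanity check in Python, with the values copied from the table above:

```python
# (cuBLAS tok/s, rolv tok/s) pairs copied from the throughput table
rows = {
    "Llama 4 400B (8E)": (5_180, 852_680),
    "Claude 3.5-class B=512": (17_680, 1_467_584),
}

for name, (cublas_tps, rolv_tps) in rows.items():
    speedup = rolv_tps / cublas_tps
    # Llama 4 400B (8E) works out to 164.6x, matching the
    # headline figure quoted in the text.
    print(f"{name}: {speedup:.1f}x")
```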
TTFT is the latency a user feels before seeing the first word of a response. It is a separate metric from throughput speedup — measured independently on the same run. SHA-256 hash: 8dbe5f…dad56dd8dd
| Model | TTFT cuBLAS | TTFT rolv | TTFT Speedup | Throughput Speedup |
|---|---|---|---|---|
| Llama 4 Maverick 400B | 47.46 ms | 0.91 ms | 177.5× | 133.5× total |
| Llama 4 400B (8 experts) | 98.99 ms | 0.98 ms | 100.9× | 164.6× |
| DeepSeek-R1 (256 experts) | 58.06 ms | 1.40 ms | 41.6× | 78.9× |
| Claude 3.5-class · B=512 | 29.0 ms | 0.52 ms | 56.3× | 83.0× |
| Llama 4 Scout (16 experts) | 11.27 ms | 0.96 ms | 11.7× | 81.7× |
| Kimi K2.5 (~1T MoE) | 29.37 ms | 0.99 ms | 29.7× | 10.6× |
TTFT and throughput speedup are measured on the same run but remain independent metrics: Kimi K2.5 posts a 29.7× TTFT speedup against only 10.6× throughput, while Llama 4 Scout shows the opposite pattern (11.7× TTFT, 81.7× throughput).
rolvsparse© changes the unit economics of AI infrastructure in two ways: dramatically lower energy opex, and a fundamental reduction in the number of GPUs required to deliver a given throughput — or equivalently, a massive increase in what your existing fleet can produce.
At 98.8% energy reduction (Claude 3.5-class production serving), the same GPU draws just 1.2% of its previous power for the matrix operation. At scale:
The speedup multiplier works in both directions: multiply the output of the fleet you already own, or shrink the fleet you need to buy. The examples below use a conservative 10× factor, below even the bottom of the verified range (21.8×–164.6× across all tested workloads), so real savings will be higher.
Add rolvsparse© to your existing fleet. At a conservative 10× throughput factor, each GPU now produces the output of 10 standard GPUs. Your hardware multiplies in value without buying a single new processor.
Conservative 10× factor used. Verified speedups range from 21.8× to 164.6× depending on model and workload. B200 list price ~$35,000.
Instead of buying the full GPU count, buy 1/10th as many and run rolvsparse©. Same throughput. Dramatically lower capex and ongoing energy cost. This is the conservative estimate — actual savings are higher.
Conservative 10× factor used throughout. Energy savings not included — they compound the advantage further.
| Workload | Verified speedup | 1 GPU replaces | Value of 1,000 GPUs | GPUs needed for a 100,000-GPU workload (conservative 10×) | Procurement saving |
|---|---|---|---|---|---|
| Llama 4 400B · 8 experts | 164.6× | 164.6 B200s | $5.76B | 10,000 | $3.15B |
| Llama 4 Maverick (total) | 133.5× | 133.5 B200s | $4.67B | 10,000 | $3.15B |
| DeepSeek-R1 · 256 experts | 78.9× | 78.9 B200s | $2.76B | 10,000 | $3.15B |
| Claude 3.5-class · B=512 | 83.0× | 83.0 B200s | $2.91B | 10,000 | $3.15B |
| Nemotron-3 · 0% sparse | 21.8× | 21.8 B200s | $763M | 10,000 | $3.15B |
GPU price assumption: $35,000 per NVIDIA B200 (Blackwell). Conservative 10× factor used for "GPUs needed" column — actual verified speedups shown in column 2. Energy savings not included — they compound the advantage further.
Based on verified 98.8% energy reduction at Claude 3.5-class batch=512. GPU capex uses conservative 10× throughput factor (verified total range: 21.8×–164.6×).
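Under the stated assumptions ($35,000 per B200, conservative 10× factor), the procurement arithmetic in the table above reduces to a few lines. A sketch, with an illustrative fleet size:

```python
B200_PRICE_USD = 35_000    # list-price assumption stated above
CONSERVATIVE_FACTOR = 10   # conservative throughput factor used above

def procurement_saving(baseline_gpus: int) -> dict:
    """GPUs and capex saved when buying 1/10th the fleet at the 10x factor."""
    gpus_needed = baseline_gpus // CONSERVATIVE_FACTOR
    saving = (baseline_gpus - gpus_needed) * B200_PRICE_USD
    return {"gpus_needed": gpus_needed, "capex_saving_usd": saving}

# A workload sized for 100,000 standard B200s:
result = procurement_saving(100_000)
print(result)  # {'gpus_needed': 10000, 'capex_saving_usd': 3150000000}
```

The $3.15B figure in every row of the table is exactly this calculation: 90,000 avoided GPUs × $35,000. Energy savings are not included here and would compound the total further.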
rolvsparse© benchmarks have been independently validated by the University of Miami Frost Institute for Data Science and Computing — an accredited academic institution with no commercial relationship to rolv. All results are deterministic, reproducible, and hash-verified across every platform.
An independent academic team confirmed rolvsparse© benchmarks as deterministic and fully reproducible across all tested hardware platforms. Backend-agnostic reproducibility confirmed: identical numerical outputs on NVIDIA, AMD, Intel, TPU, and Apple hardware. Cryptographic SHA-256 output hashes published for independent third-party verification.
"Deterministic and reproducible results confirmed across all tested platforms." — Frost Institute Validation Report
Run our verification script on your own hardware and get a cryptographic SHA-256 fingerprint of the result. Email the JSON to [email protected] — we run the same computation through rolvsparse© on identical inputs, produce the identical output hash, and return a full "Us vs. Them" comparison report showing your exact speedup and energy savings.
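The verification kit itself is not reproduced here, but the underlying principle is standard practice: serialise the result deterministically, then hash it, so any two machines that produce identical numbers produce identical fingerprints. A minimal sketch (the JSON fields are illustrative, not the kit's actual schema):

```python
import hashlib
import json

def fingerprint(result: dict) -> str:
    """SHA-256 over a canonical JSON serialisation.

    sort_keys plus fixed separators make the byte stream identical
    on every platform, so identical numbers -> identical hash.
    """
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative payload; the real kit defines its own schema.
result = {"matrix": "4096x4096", "row_checksums": [1.0, 2.5, -3.25]}
print(fingerprint(result))  # 64 hex characters, stable across platforms
```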
The Frost Institute confirmed all rolvsparse© benchmarks as deterministic and reproducible on real hardware across every tested platform. No commercial interest. Engaged solely to verify accuracy and reproducibility.
↓ View Validation Letter →
Identical numerical outputs confirmed on NVIDIA, AMD, Intel, TPU, and Apple hardware. The cryptographic hash 8dbe5f139fd946d4cd84e8cc…dad56dd8dd is the same across every platform and sparsity level.
↓ Download Verification Kit →
RSMT defines the exact density at which sparse storage becomes more memory-efficient than dense: a foundational rule that has long been missing from the field. VRAM, not compute, is the dominant bottleneck in large-scale inference. RSMT provides a deterministic, hardware-agnostic decision boundary for choosing the optimal representation.
| Value Type | Index Type | b (value bytes) | i (index bytes) | RSMT d = b/(b+i) | Use sparse when… |
|---|---|---|---|---|---|
| float32 | int64 | 4 | 8 | 0.333 | density < 33% |
| float16 / BF16 | int64 | 2 | 8 | 0.200 | density < 20% |
| float32 | int32 | 4 | 4 | 0.500 | density < 50% |
| int8 | int32 | 1 | 4 | 0.200 | density < 20% |
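The RSMT column follows a single rule recoverable from the table itself: storing a nonzero sparsely costs b value bytes plus i index bytes, versus b bytes for every element dense, so sparse storage wins when density d < b / (b + i). A sketch (assuming one index per stored value, which reproduces every row above):

```python
def rsmt(value_bytes: int, index_bytes: int) -> float:
    """Density below which sparse storage beats dense.

    Dense:  N * b bytes for N elements.
    Sparse: nnz * (b + i) bytes, one index per nonzero.
    Sparse < dense  <=>  density < b / (b + i).
    """
    return value_bytes / (value_bytes + index_bytes)

print(f"float32 + int64: {rsmt(4, 8):.3f}")  # 0.333 -> sparse below 33%
print(f"fp16    + int64: {rsmt(2, 8):.3f}")  # 0.200
print(f"float32 + int32: {rsmt(4, 4):.3f}")  # 0.500
print(f"int8    + int32: {rsmt(1, 4):.3f}")  # 0.200
```

Compressed formats such as CSR amortise part of the index cost and shift the exact boundary slightly, which is why the one-index-per-value assumption matters when applying the rule.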
Composite efficiency: (Sparsity × Energy Savings) / 100
Rolv E. Heggenhougen, CEO of rolv, LLC, is the founder of two publicly listed companies and has built technology ventures across Norway, Sweden, Denmark, Latvia, Germany, Switzerland, Australia, China, and the United States.
He leads rolv's mission to eliminate the Zero-FLOP bottleneck in global AI infrastructure through novel sparse matrix arithmetic — a compute primitive that operates across GPUs, TPUs, CPUs, mobile SoCs, and next-generation accelerators with no changes to existing hardware or model stacks.
Mr. Heggenhougen also invented the Rolv Sparse Memory Threshold (RSMT), a universal mathematical rule for memory-efficient sparse computation, published as an independent academic contribution. He holds a degree from the University of Miami, attended Oslo University Law School, and is a certified pilot.
Fluent in Norwegian, Danish, and Swedish; working knowledge of German.