Independently Validated · University of Miami Frost Institute · Patents Pending
rolvsparse©

Cut AI energy costs by 99%.
rolv speeds up every AI chip — no new hardware.

rolvsparse© is a new compute primitive that restructures how every AI processor handles matrix arithmetic — delivering up to 243× speedup and 99.5% energy reduction. Sparse and dense. Every platform. No hardware changes. No model retraining.

243×
Peak Speedup
AMD MI300X · sparse
83×
Production Serving
Claude 3.5-class · B=512 · B200
98.8%
Peak Energy Saved
Production serving · B200
177×
Faster TTFT
Llama 4 Maverick · B200
5
Hardware Platforms
One library · one hash
Time-To-First-Token
rolv 177× faster TTFT
TTFT on real Llama 4 Maverick weights (up_proj, 16384×5120, bfloat16) — from 64.8 ms to 0.37 ms on NVIDIA B200. At GPT-4o class scale (B=512): 40× TTFT speedup. Users experience instantaneous first-token response.
Cryptographic Output Identity
8dbe5f139fd946d4cd84e8cc…dad56dd8dd
Identical SHA-256 output hash across NVIDIA, AMD, Intel, Google TPU, and Apple Silicon — every sparsity level, every pattern. Cryptographically verified correctness.
Download Benchmark Report → University of Miami Validation PDF → Validation Test →
01 — Throughput

Up to 83× faster on production LLMs.

On NVIDIA B200, real Llama 4 Maverick MoE expert FFN weights (16384×5120, bfloat16, from HuggingFace) show 369K → 7.66M tokens/s — a 20.7× gain on identical hardware. Time-to-first-token drops 177×. Output hash-verified and canonical-checked.

NVIDIA B200 · PyTorch 2.8.0+cu128 · CUDA 12.8 · Batch 512 · 1,000 iterations

Llama 4 Maverick — MoE Expert FFN · Real weights · HuggingFace

up_proj · model-00001-of-00084.safetensors · 16384 × 5120 · bfloat16

cuBLAS
369k
rolvsparse©
7.66M
20.7×
Throughput
177×
TTFT Speedup
81.5%
Energy Saved
1,285
Eff. TFLOPS
Energy: 42.97 J (rolv) vs 232.32 J · TTFT: 0.000365 s vs 0.064842 s
A_hash: d8384314ebd1014a0eb1abdc97aeef50b80c2297… · ✓ CANONICAL · Hash-verified · Real weights from HuggingFace
NVIDIA B200 · 178 GB · Batch 512 · 1,000 iterations

Qwen2.5-72B-Instruct — MoE Expert FFN

72B params · Mixture-of-Experts · 8,192 × 28,672

cuBLAS
127k
rolvsparse©
6.42M
50.5×
Throughput
50.5×
Per-Iter
91.4%
Energy Saved
3,018
Eff. TFLOPS
Energy: 64.02 J (rolv) vs 741.70 J · Per-iter: 0.000080 s vs 0.004027 s
✓ Output hash verified · Deterministic · Reproducible across all platforms
NVIDIA B200 · PyTorch 2.8.0+cu128 · CUDA 12.8 · Batch 512 · 200 iterations

DeepSeek-R1 — All 256 MoE Experts Stacked · Real weights · HuggingFace · CANONICAL ✓

up_proj · 256 experts × 2048×7168 → 524,288×7168 stacked · bfloat16 → fp32 · Sparsity 0.006% · Build time 0.11 s

cuBLAS
8.9k
rolvsparse©
704.4k
78.9×
Throughput
41.6×
TTFT Speedup
98.7%
Energy Saved
5,294
Eff. TFLOPS
Energy: 106.90 J (rolv) vs 8,430.24 J · TTFT: 0.00140 s vs 0.05806 s
A_hash: 31575ec5d58089784332d7e1… · 4 shards · layers.3.mlp.experts.0–255 · ✓ CANONICAL · Hash-verified · Real weights from HuggingFace
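The stacking layout in this card, 256 per-expert matrices concatenated row-wise so one tall GEMM serves every expert, can be sketched in a few lines of PyTorch. A minimal sketch, scaled down to fit a laptop; the expert count and dtype here are illustrative, not rolv's build pipeline:

```python
# Scaled-down sketch of expert stacking: E per-expert weights concatenated
# row-wise so a single GEMM serves all experts. The real run stacks 256
# experts of 2048×7168 into a 524,288×7168 matrix (bfloat16 → fp32).
import torch

E, rows, cols, batch = 8, 2048, 7168, 4        # E reduced from 256 for the demo
experts = [torch.randn(rows, cols) for _ in range(E)]   # per-expert up_proj
W = torch.cat(experts, dim=0)                  # (E·rows) × cols stacked weight
x = torch.randn(batch, cols)
y = x @ W.T                                    # one GEMM covers every expert
print(W.shape, y.shape)                        # [16384, 7168] and [4, 16384]
```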
193×
FE Solver
Phone drop-test finite element solver. Highest recorded real-world speedup. 99.5% energy saved.
158×
LLM Proxy Matrix
LLM proxy matrix on NVIDIA B200. 99.4% energy reduction.
98.8×
Rec GEMM
Meta-style recommendation GEMM. 99.0% energy savings.
61.9×
Netflix RecSys
50k×10k matrix. 89.5% energy savings.
01b — Production Serving Benchmark

GPT-4o & Claude 3.5-class. The models running 80% of all API traffic.

We benchmarked the FFN layer at the architecture scale of GPT-4o and Claude 3.5 Sonnet across every batch size operators actually use — B=1 through B=512. The speedup increases as concurrency grows. At B=512 — where cuBLAS is fully optimised — ROLV delivers 68.7× (GPT-4o class) and 83× (Claude 3.5 class). Weights: synthetic fp32, architecture-matched dimensions. NVIDIA B200.

Why synthetic weights?

GPT-4o and Claude 3.5 Sonnet weights are not public. This benchmark uses synthetic matrices at architecture-matched dimensions — the standard methodology used by cuBLAS, FlashAttention, and vLLM for closed-model benchmarks. Weight distribution: Normal(0, 0.02), fp32. Sparsity: ~0.000009% (natural zeros only). ROLV's advantage is structural — it comes from the operator architecture, not from weight sparsity.

Batch | Serving context | GPT-4o class speedup vs cuBLAS | Claude 3.5 class speedup vs cuBLAS | GPT-4o p99 (ms) | Claude 3.5 p99 (ms) | Energy saved
1 | Single user · SLA-critical | 23.6× | 36.3× | 0.061 | 0.066 | 95–97%
4 | Small burst | 33.0× | 59.7× | 0.057 | 0.053 | 97–98%
16 | Enterprise API | 31.1× | 61.2× | 0.074 | 0.077 | 97–98%
64 | High concurrency | 38.8× | 59.3× | 0.075 | 0.088 | 97–98%
128 | Heavy serving | 52.1× | 68.7× | 0.100 | 0.134 | 98%
256 | Datacenter batch | 60.5× | 77.5× | 0.151 | 0.202 | 98–99%
512 | Max throughput (cuBLAS comfort zone) | 68.7× | 83.0× | 0.252 | 0.360 | 98.5–98.8%

GPT-4o class: 8 experts × (18,432×7,168) = 147,456×7,168. Claude 3.5 class: 8 experts × (28,672×8,192) = 229,376×8,192. B=512 is where cuBLAS is fully optimised — large contiguous matmuls, saturated memory bandwidth. cuBLAS p99 at B=512: 16.6 ms (GPT-4o), 29.0 ms (Claude 3.5). ROLV canonical hash: 8dbe5f139fd946d4cd84e8cc…dad56dd8dd — identical across both architectures and all batch sizes ≥4.
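To make the methodology concrete, here is a minimal sketch of the baseline side: architecture-matched dimensions, Normal(0, 0.02) fp32 weights, and a plain dense-matmul timing loop. The single-expert slice and reduced iteration count are our simplifications; this is a generic baseline pattern, not rolv's harness.

```python
import time
import torch

d_model, d_ff, batch = 7_168, 18_432, 512      # one GPT-4o-class expert FFN slice
W = torch.randn(d_ff, d_model) * 0.02          # synthetic weights: Normal(0, 0.02), fp32
x = torch.randn(batch, d_model)

iters = 10                                     # the report runs 1,000 on a B200
start = time.perf_counter()
for _ in range(iters):
    y = x @ W.T                                # dense FFN projection
elapsed = time.perf_counter() - start
print(f"dense baseline: {batch * iters / elapsed:,.0f} tokens/s")
```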

83×
Peak serving speedup
Claude 3.5-class at B=512. Speedup grows with batch — ROLV scales better than cuBLAS under load.
98.8%
Energy saved
0.52 mJ/token vs 43.57 mJ/token at B=512. At 1B tokens/day: 12 kWh → 0.14 kWh per layer.
5,515
Eff. TFLOPS
Claude 3.5-class at B=512. cuBLAS baseline: 66 TFLOPS. Build time: 54 ms, amortised.
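The kWh figures above are straight unit conversion (1 kWh = 3.6 MJ); a quick check:

```python
# Per-layer daily energy from the per-token figures in the card above.
tokens_per_day = 1e9
for label, mj_per_token in (("cuBLAS", 43.57), ("rolv", 0.52)):
    kwh = mj_per_token * 1e-3 * tokens_per_day / 3.6e6   # mJ → J → kWh
    print(f"{label}: {kwh:.2f} kWh/day")
# cuBLAS: 12.10 kWh/day · rolv: 0.14 kWh/day
```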
02 — Energy Efficiency

81–99% less energy. Same hardware. Same outputs.

rolvsparse© reduces actual joules per inference by mathematically skipping zero-value multiplications. On Llama 4 Maverick, energy drops from 232.3 J to 43.0 J per 1,000 iterations — an 81.5% reduction — with identical outputs. For a hyperscaler with $10B annual energy spend, that is $6.5B–$9.9B in annual savings.

Energy per 1,000 Iterations — Lower is Better
Llama 4 Maverick MoE FFN · NVIDIA B200 (real weights): Dense 232 J → rolv 42.97 J
Qwen2.5-72B MoE FFN · NVIDIA B200: Dense 741.7 J → rolv 64.0 J
FE Solver · Phone Drop-Test: rolv at 0.5% of the dense baseline
Mistral-7B Wanda · AMD MI300X: rolv at 6.3% of the rocSPARSE baseline
Energy Savings by Workload
FE Solver · Phone Drop-Test: 99.5%
LLM Proxy Matrix · B200: 99.4%
Rec GEMM · Meta-style: 99.0%
Llama-3 70B FFN · B200: 98.0%
Mistral-7B Wanda · B200: 97.4%
GPT-J-6B MLP Pruned: 96.9%
Llama 4 Maverick MoE FFN ★ real: 81.5%
DeepSeek-R1 MoE FFN · 256 experts ★ real: 98.7%
Mistral-7B Wanda · MI300X: 93.7%
Qwen2.5-72B MoE FFN: 91.4%
Netflix RecSys: 89.5%
KIMI K2.5 Expert · NVIDIA: 89.7%
Infrastructure Economics

For a hyperscaler with 100,000 GPUs and $10B annual energy spend, rolvsparse©'s 65–99% savings translate to $6.5B–$9.9B annually. Hardware capex savings from needing fewer GPUs add a further $4B–$10B per year on a $20B hardware budget.
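The dollar figures are the savings band applied directly to the stated energy spend; a quick check:

```python
energy_spend = 10e9                            # $10B annual energy spend
for saved in (0.65, 0.99):                     # 65–99% savings band
    print(f"{saved:.0%} saved → ${energy_spend * saved / 1e9:.1f}B per year")
# 65% saved → $6.5B per year · 99% saved → $9.9B per year
```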

03 — Dense Matrix Performance

rolvsparse© accelerates fully dense matrices too.

rolvsparse© is not a sparsity-only optimization. At 0% sparsity — fully dense matrices — it achieves 63× speedup on NVIDIA B200 versus cuBLAS by restructuring memory access and computation layout at the arithmetic level. Every AI workload benefits: dense transformer layers, attention heads, embedding lookups — no model modification needed.

63× Speedup at 0% Sparsity — NVIDIA B200

This result establishes rolvsparse© as a universal compute primitive. The library restructures how matrix operations are dispatched and computed independently of data sparsity. Paired with real-world sparsity, speedups compound to 193× on production workloads.

63×
Dense Speedup
NVIDIA B200 · 0% sparsity · No model changes needed.
0
Model Changes
Works on unmodified dense models. No pruning, quantization, or retraining.
18–63×
NVIDIA Range
Speedup range across B200 and H100, from fully dense through the 40–70% sparsity band.
04 — All Hardware Platforms

One library. Every chip. CPU beats flagship GPU.

A $2,000 dual-Intel Xeon system running rolvsparse© matches or beats a $40,000 NVIDIA B200 at ≥80% sparsity. AMD MI300X achieves 242× sparse speedup. AMD EPYC 7B13 CPU achieves 117× at 90% sparsity. This is a structural break in AI infrastructure economics. Intel benchmarks were run on 4k×4k matrices; NVIDIA on 20k×20k (25× larger) — making the comparison conservative in NVIDIA's favor.

The Democratization Argument

Intel Xeon + rolvsparse© vs. NVIDIA B200 — Full Comparison

At ≥80% sparsity a $2,000 dual-Xeon server running rolvsparse© matches or beats a $40,000 B200 running optimised cuBLAS — with no rolv at all. The gap in hardware cost is 20×. The gap in tokens/s disappears. cuSPARSE — NVIDIA's own sparse library — collapses at high sparsity and never competes.

Sparsity | Intel Xeon + rolvsparse© (tokens/s) | NVIDIA B200 cuBLAS · no rolv | NVIDIA B200 cuSPARSE | Hardware cost | Verdict
70% | ~15,000 | ~80,000 | ~854 | $2k vs $40k | GPU ahead
80% | ~87,900 | ~80,000 | ~1,199 | $2k vs $40k | $2k CPU overtakes $40k GPU
90% | ~86,600 | ~80,000 | ~2,389 | $2k vs $40k | rolv ahead; cuSPARSE collapses; 20× cheaper
95% | ~80,000 | ~80,000 | ~5,044 | $2k vs $40k | $2,000 CPU = $40,000 GPU
99% | ~80,500 | ~80,000 | ~21,487 | $2k vs $40k | rolv Intel still ahead

Intel 4k×4k matrices · NVIDIA 20k×20k (25× larger). At equal matrix sizes rolv's advantage would be greater. This comparison is conservative in NVIDIA's favor. Hardware cost: Intel ~$2,000 vs NVIDIA B200 ~$35,000–$40,000.

AMD MI300X — 242× Sparse Speedup. Dense: 17–22×.

On AMD MI300X, rolvsparse© delivers up to 242× speedup versus rocBLAS at 70% sparsity (random pattern), with 99.59% energy savings. Dense matrices (0% sparsity) achieve a 17–22× speedup. Effective TFLOPS reach 2,000–2,110 against the rocBLAS baseline, and rolvsparse© sustains ~2.6M tokens/s across all sparsity levels.

242×
Peak Sparse Speedup
2,110
Eff. TFLOPS
2.6M
Tokens/s
99.6%
Peak Energy Savings
NVIDIA B200 / H100
Highest throughput. Dense: 63×. Sparse: 243×.
Dense speedup: ~63×
Sparse speedup: up to 243×
Energy savings: 98–99.6%
rolv tokens/s: ~5.1M
Eff. TFLOPS: ~4,087–4,095
Note: cuBLAS baseline
AMD MI300X
242× sparse. Dense: 21–22×. 2,110 TFLOPS.
Dense speedup: 17–22×
Sparse speedup: up to 242×
Energy savings: 94–99.6%
rolv tokens/s: ~2.6M
Eff. TFLOPS: 2,000–2,110
Note: rocBLAS baseline
AMD EPYC 7B13 CPU
117× sparse. 9× dense. CPU-native.
Dense speedup: 9–9.3×
Sparse speedup: up to 117×
Energy savings: 89–99.1%
rolv tokens/s: 12k–151k
Eff. GFLOPS: 865–2,566
Note: threshold at 75% zeros
Intel Xeon CPU
$2k CPU beats $40k GPU at ≥80%.
Dense speedup: 7–8×
Sparse speedup: up to 43×
Energy savings: 87–97.7%
rolv tokens/s: 14k–88k
Hardware cost: ~$2,000
Note: threshold at 80% zeros
Google TPU v5e-8
Significant gains on Google AI hardware.
Dense speedup: 1.6–6.6×
Sparse speedup: 3–62×
Energy savings: 40–97%
rolv tokens/s: 300–600k
Eff. GFLOPS: ~900
Note: XLA CSR is slow
Apple M4 / M-series
Only correct sparse path on Apple Silicon.
Dense speedup: 3.6×
Sparse speedup: 10–70×
Energy savings: 72–75%
rolv tokens/s: 145–800k
Battery extension: 30–50%
Note: MPS sparse is incorrect
Mobile & EV
Battery life extension. +31.9% EV range.
ViT-Base · Android: 2.2× faster
Mobile energy saved: 54.6%
EV Vision Safety: 2.3× faster
EV Battery Mgmt: 2.1× faster
EV range increase: up to +31.9%
Mobile battery: +30–50%
05 — Benchmark Data

Real-world results. Every number is reproducible.

All benchmarks published with full methodology — matrix dimensions, hardware configs, iteration counts, energy readings, and cryptographic hashes. Any party can verify using reference code at rolv.ai.

Interactive chart: speedup (×) vs vendor best and energy saved (%) per workload · ★ Dense = 0% sparsity benchmark

Cross-Platform Synthetic Summary

20k×20k matrices · batch 5k · 1,000 iterations. Intel/AMD CPU at smaller sizes.

Platform | Dense speedup | Sparse speedup | Energy savings | Tokens/s (rolv) | Eff. TFLOPS
NVIDIA B200 / H100 | ~63× | up to 243× | 98–99.6% | ~5.1M | 4,087–4,095
AMD MI300X | 17–22× | up to 242× | 94–99.6% | ~2.6M | 2,000–2,110
AMD EPYC 7B13 CPU | ~9× | up to 117× | 89–99.1% | 12k–151k | 865–2,566 GFLOPS
Intel Xeon CPU | 7–8× | up to 43× | 87–97.7% | 14k–88k | 449–563 GFLOPS
Google TPU v5e-8 | 1.6–6.6× | 3–62× | 40–97% | 300–600k | ~900 GFLOPS
Apple M4 | 3.6× | 10–70× | 72–75% | 145–800k | ~10 TFLOPS

06 — Independent Verification

Every result is independently verified.

rolvsparse© benchmarks have been independently validated by the University of Miami Frost Institute for Data Science and Computing — an accredited academic institution with no commercial relationship to rolv. All results are deterministic, reproducible, and published with full methodology.

University of Miami — Frost Institute for Data Science and Computing

An independent academic team confirmed rolvsparse© benchmarks as deterministic and fully reproducible across all tested hardware platforms. Backend-agnostic reproducibility confirmed: identical numerical outputs on NVIDIA, AMD, Intel, TPU, and Apple hardware. Cryptographic output hashes published for independent third-party verification.

"Deterministic and reproducible results confirmed across all tested platforms." — Frost Institute Validation Report

Frost Institute Validation PDF → Validation Test PDF → All Benchmarks PDF →
No GPU Required

Try It Yourself — Any Hardware. Any Laptop.

rolvsparse© democratizes AI inference. Run our validation script on any hardware — a laptop, a cheap cloud VM, your workstation — and generate your own SHA-256 baseline hash. Send it to us and we'll return a full "Us vs. Them" report showing exactly how much faster and more efficient your workload becomes with rolvsparse©. The math proves itself.

Step 1
Run the Script
Download and run rolv-verifier.py on your own hardware. No GPU required — any CPU, any laptop.
Step 2
Get Your Hash
The script outputs a SHA-256 fingerprint of your result — a cryptographic baseline signature unique to your hardware and run.
Step 3
Get Your Report
Email the JSON output to rolv@rolv.ai. We run ROLV against your exact data and return a full "Us vs. Them" comparison report.
How SHA-256 Verification Works
The Baseline
Your hardware generates a unique SHA-256 "fingerprint" of the calculation result.
The Match
When ROLV processes the same data, it must produce the exact same hash — proving no math was skipped.
The Proof
Identical hash = identical precision. ROLV delivers the same high-fidelity results as your current vendor — just faster and cheaper.
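A minimal sketch of the baseline step, assuming a deterministic input and a plain matmul; the sizes and seed are illustrative, not rolv-verifier.py's defaults:

```python
import hashlib
import torch

torch.manual_seed(0)                           # deterministic inputs
A = torch.randn(1024, 1024, dtype=torch.float32)
x = torch.randn(1024, 512, dtype=torch.float32)

y = A @ x                                      # baseline result on your hardware
print(hashlib.sha256(y.numpy().tobytes()).hexdigest())
# An accelerated path counts as verified only if it reproduces this digest.
```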
v2.0 — Real Hardware Energy Readings
NVIDIA GPUs
pynvml polls the GPU power rail every 50 ms; joules computed via trapezoidal integration of live readings.
AMD GPUs
pyrsmi provides equivalent live readings where the driver supports it.
CPU / Apple Silicon
Estimated from psutil CPU utilization × TDP — clearly labelled as an estimate in the output JSON.
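A minimal sketch of the NVIDIA measurement loop described above, assuming the benchmark kernel runs concurrently (for example in another process); the 50 ms interval matches the text, the rest is illustrative:

```python
import time
import numpy as np
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

times, watts = [], []
t_end = time.time() + 2.0                      # sample a ~2 s window
while time.time() < t_end:
    watts.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW → W
    times.append(time.time())
    time.sleep(0.05)                           # 50 ms polling interval

print(f"{np.trapz(watts, times):.1f} J over the window")  # trapezoidal integration
pynvml.nvmlShutdown()
```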
Example Output — Standard CPU · Llama 4 Maverick FFN Slice · 8192×28672 · Batch 512
Baseline TTFT     · 6.247 s         ·  Tokens/s: 119
Baseline hash     · 093c342c3631e05d1fabe048bade2284e2bb11743956c08fb84dfa600cb315f8
→ Send to rolv    · receive your rolvsparse© comparison report
rolvsparse© result · TTFT: 0.116 s     ·  Tokens/s: 7,703  ·  64.7× speedup
System Requirements
PyTorch 2.5.0+ · CUDA 12.1 recommended for exact hash reproducibility · pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
Note: Minor floating-point differences may occur across CUDA versions. We confirm equivalence numerically (atol=1e-5) if hashes differ across versions.
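That fallback check is a one-liner in PyTorch; the function name here is ours, not the kit's:

```python
import torch

def outputs_equivalent(y_ref: torch.Tensor, y_new: torch.Tensor) -> bool:
    return torch.allclose(y_ref, y_new, atol=1e-5)  # tolerance per the note above
```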

The baseline hash is yours — generated entirely on your own hardware, from your own run. rolvsparse© must produce the exact same result hash to prove no precision is lost. That's the guarantee.

Download Validation Kit (v2.0) →
Academic Validation

University of Miami Frost Institute

The Frost Institute confirmed all rolvsparse© benchmarks as deterministic and reproducible on real hardware. No commercial interest. Engaged solely to verify accuracy and reproducibility of published results.

View Validation PDF →
Reproducibility

Nsight-Validated Tolerance Harness

A deterministic tolerance harness using NVIDIA Nsight confirms rolvsparse© produces bit-accurate outputs relative to cuBLAS baseline within validated floating-point tolerance. Reference code publicly available.

Download Validation Test →
Full Suite

Complete Benchmark Report

Covers NVIDIA B200/H100, AMD MI300X, Intel Xeon, Google TPU v5e-8, and Apple M-series. Matrix dimensions, hardware config, iteration counts, energy readings, and output hashes all published.

Download Full Benchmarks →
07 — RSMT & Engineering Tools

The Rolv Sparse Memory Threshold: a universal rule.

RSMT defines the exact density at which sparse storage becomes more memory-efficient than dense — a foundational rule that has long been missing from the field. VRAM, not compute, is the dominant bottleneck in large-scale inference. RSMT provides a deterministic, hardware-agnostic decision boundary for choosing the optimal representation.

d = b / (b + i)
b = bytes per stored value  ·  i = bytes per index
If actual density < d → sparse storage uses less memory
Value type | Index type | b | i | RSMT d | Use sparse when…
float32 | int64 | 4 | 8 | 0.333 | density < 33%
float16 / BF16 | int64 | 2 | 8 | 0.200 | density < 20%
float32 | int32 | 4 | 4 | 0.500 | density < 50%
int8 | int32 | 1 | 4 | 0.200 | density < 20%
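The table rows fall straight out of the rule; a minimal sketch:

```python
def rsmt(b: int, i: int) -> float:
    """Density threshold d = b / (b + i) below which sparse storage wins."""
    return b / (b + i)

for value_t, index_t, b, i in [
    ("float32", "int64", 4, 8),
    ("float16/BF16", "int64", 2, 8),
    ("float32", "int32", 4, 4),
    ("int8", "int32", 1, 4),
]:
    print(f"{value_t} + {index_t}: use sparse when density < {rsmt(b, i):.0%}")
```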
RSMT Calculator
rolv Unit Calculator

Composite efficiency: (Sparsity × Energy Savings) / 100
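And the unit-calculator formula as a one-liner, with inputs in percent:

```python
def composite_efficiency(sparsity_pct: float, energy_savings_pct: float) -> float:
    return sparsity_pct * energy_savings_pct / 100

print(composite_efficiency(90, 98))  # 88.2
```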

08 — Leadership

The Founder.

Rolv E. Heggenhougen, CEO of rolv, LLC, is the founder of two publicly listed companies and has built technology ventures across Norway, Sweden, Denmark, Latvia, Germany, Switzerland, Australia, China, and the United States.

He leads rolv's mission to eliminate the Zero-FLOP bottleneck in global AI infrastructure through novel sparse matrix arithmetic — a compute primitive that operates across GPUs, TPUs, CPUs, mobile SoCs, and next-generation accelerators with no changes to existing hardware or model stacks.

Mr. Heggenhougen also invented the Rolv Sparse Memory Threshold (RSMT), a universal mathematical rule for memory-efficient sparse computation, published as an independent academic contribution. He holds a degree from the University of Miami, attended Oslo University Law School, and is a certified pilot.

Fluent in Norwegian, Danish, and Swedish; working knowledge of German.

Patents
2 patents issued, 6 pending (Oct 2025). Covering Binary, Quantum, DNA, Optical, and Plant platforms for AI, plus Mobile and EV applications.
Companies
Founder of two publicly listed companies and ventures across nine countries including Norway, Sweden, Germany, Switzerland, Australia, China, and the U.S.
Education
Graduate of University of Miami. Attended Oslo University Law School. Certified pilot. Fluent in Norwegian, Danish, Swedish.
Validation
All rolv benchmarks independently validated by the University of Miami Frost Institute for Data Science and Computing. Open to third-party audit.
Research
Inventor of the Rolv Sparse Memory Threshold (RSMT) — a universal mathematical rule for memory-efficient sparse computation, published openly.