Patent-Pending Software Compute Primitive · Independently Validated · University of Miami

Cut AI energy costs by up to 99.9%.
Up to 164.6× faster on real HuggingFace weights.

Patent-Pending · Software-Only · No Hardware Changes · No Model Retraining

A software compute primitive that restructures matrix arithmetic to eliminate zero-valued multiply-accumulate operations. Works on fully dense matrices (0% sparsity) through 99%+ sparse. No hardware changes. No model retraining. One SHA-256 hash across every platform.

99.9%
Peak Energy Saved
Llama 4 Maverick · real weights · B200
164.6×
Total Speedup
Llama 4 400B (8 experts) · incl. build time · B200
83×
Production Serving
Claude 3.5-class · B=512 · B200
0%→99%
Sparsity Range
Dense to maximally sparse — one operator
5
Hardware Platforms
NVIDIA · AMD · Intel · TPU · Apple
University of Miami · Validation Letter · Validation Kit v2.0
01 — Benchmark Results

Real weights. Real hardware. Every number hash-verified.

All models downloaded directly from HuggingFace and run on real hardware. Compared vs vendor-optimised cuBLAS / rocBLAS. Energy via NVML live power polling. SHA-256 output hash confirmed canonical across every platform.

② Real HuggingFace Models — Speedup & Energy

Up to 164.6× faster. Every model from HuggingFace.

Real production weights — Llama 4, DeepSeek-R1, Qwen, Mixtral, Kimi — downloaded directly and run on NVIDIA B200. Compared vs vendor-optimised cuBLAS (dense). SHA-256 hash verified. Updated March 2026.

[Chart: per-model speedup vs cuBLAS and energy saved]

University of Miami Frost Institute validated.  ↓ Full PDF

③ 0% to 99% Sparse — One Operator, the Full Range

Works on fully dense matrices. No sparsity required.

The standard objection to sparse operators: they only work on pre-pruned models. rolvsparse© disproves this. NVIDIA Nemotron-3 Super 120B — real FP8 HuggingFace weights, 0.00% sparsity, density exactly 1.0 — delivers 21.8× speedup and 95.4% energy reduction on a fully dense matrix. The same operator that handles 0% also scales continuously to 164.6× total (885× per-iteration) at 99%+ sparsity. One library, no configuration changes.

0% Sparsity — Fully Dense
21.8× · 95.4% energy saved
Nemotron-3 Super 120B · real FP8 HuggingFace weights · density 1.0 · NVIDIA B200 · 27.7M tokens/s · SHA-256: 8dbe5f…dad56dd8dd
Peak Sparsity — 885×
133.5× total · 885× per-iter · 99.9% energy
Llama 4 Maverick 400B · 128 experts fused · NVIDIA B200 · real weights · 133.5× total speedup (incl. 0.60s build time) · 885× per-iter · same SHA-256 hash confirmed canonical.
Sparsity range: 0% (fully dense · 21.8×) → 99% sparse (164.6× total · 885× per-iter)

Same operator · same library · same SHA-256 hash · no configuration changes between sparsity levels · no model retraining · no hardware changes.
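To picture what eliminating zero-valued multiply-accumulate operations means, the toy sketch below compares a naive dense matrix multiply with one that visits only stored non-zeros. It is purely illustrative: the loop structure, sizes, and random seed are placeholder assumptions, and it is not the rolvsparse© algorithm. The point is that skipping zero-valued MACs removes work without changing the numerical result, which is why the output hash stays identical.

```python
# Toy illustration only; this is NOT the rolvsparse© algorithm.
# It shows why skipping zero-valued multiply-accumulates removes work
# without changing the numerical result (and therefore its hash).
import hashlib
import numpy as np

def matmul_dense(A, B):
    # Executes every multiply-accumulate, including those where A[i, k] == 0.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(m):
        for kk in range(k):
            C[i] += A[i, kk] * B[kk]          # runs even when A[i, kk] is zero
    return C

def matmul_skip_zeros(A, B):
    # Same arithmetic, but multiply-accumulates with A[i, k] == 0 are skipped.
    m, _ = A.shape
    n = B.shape[1]
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(m):
        for kk in np.nonzero(A[i])[0]:        # visit only stored non-zeros
            C[i] += A[i, kk] * B[kk]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
A[rng.random(A.shape) < 0.9] = 0.0            # ~90% sparse weight matrix
B = rng.standard_normal((64, 64)).astype(np.float32)

C_dense, C_skip = matmul_dense(A, B), matmul_skip_zeros(A, B)
assert np.array_equal(C_dense, C_skip)        # identical result, far fewer MACs executed
print(hashlib.sha256(C_skip.tobytes()).hexdigest())
```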

④ CPU — The Democratization Story

A $1,000 desktop. 76× faster on Phi-4 14B. 127× on Mistral-7B. Zero sparsity required.

HP All-in-One (Intel i7-1165G7, ~$1,000, Windows 11). Real HuggingFace weights, 0% sparsity — fully dense on every model. Microsoft Phi-4 14B: 76× faster than Intel MKL, 109,646 tokens/second, 98.7% less energy. Mistral-7B: 127× faster, 0.7 ms TTFT. At ≥80% sparsity a $2,000 dual-Xeon server matches or overtakes a $40,000 NVIDIA B200 running cuBLAS.

New · March 2026 · Microsoft Phi-4 14B
76× · 98.7% energy · 109,646 tok/s
Real HuggingFace weights · 0% sparsity, fully dense · HP i7-1165G7 · $1,000 desktop · vs Intel MKL · SHA-256: 8dbe5f…dad56dd8dd
76×
Speedup vs MKL
98.7%
Energy saved
109k
Tokens / sec
5.7ms
TTFT
HP All-in-One i7 · $1,000 · Windows 11 · Real HuggingFace Weights · vs Intel MKL Dense
[Chart: per-model speedup vs dense baseline and energy saved]
$2k Xeon + rolvsparse© vs $40k NVIDIA B200
Sparsity | $2k Xeon + rolv | $40k B200 | Verdict
70% | ~15k | ~80k | GPU ahead
80% | ~88k | ~80k | $2k overtakes $40k
90% | ~87k | ~80k | 20× cheaper, same speed
99% | ~80k | ~80k | rolv still ahead

Intel 4k×4k vs NVIDIA 20k×20k — conservative in NVIDIA's favour. cuSPARSE collapses above 80% sparsity.

Full Platform Data on Synthetics — All Processors
Per-platform interactive browser — NVIDIA · AMD · Intel · TPU · Apple — random, power-law, structured & dense patterns. Includes synthetic proxy benchmarks for GPT-4o and Claude 3.5-class serving architectures.
02 — FLOPS, Tokens & What the Numbers Mean

What does 164.6× faster actually mean?

Plain English
164.6 GPUs
One GPU running rolvsparse© delivers the same output as 164.6 GPUs running standard software. You either need 164× fewer processors for the same job, or do 164× more work with the same hardware.
164× less waiting
A job that takes 164 minutes on a standard GPU cluster finishes in 1 minute with rolvsparse©. A batch that takes 1 hour runs in under 22 seconds. Response latency that was 5 seconds drops to 30 ms.
99.4% less power
The same GPU draws 1/164th of the energy to produce the same result. A data centre that previously needed 10 MW of power for this workload now needs 60 kW. Same output. Same hardware. No changes to models or infrastructure.
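The three conversions above are simple ratios of the speedup factor. A minimal worked sketch using the 164.6× figure and the 10 MW example from the text:

```python
# The "Plain English" conversions above are simple ratios of the speedup factor.
speedup = 164.6          # Llama 4 400B (8 experts) total speedup, from this page

gpus_replaced = speedup                 # one rolv GPU does the work of ~165 standard GPUs
job_minutes   = 164 / speedup           # a 164-minute job finishes in ~1 minute
batch_seconds = 3600 / speedup          # a 1-hour batch finishes in ~22 seconds
latency_ms    = 5000 / speedup          # a 5-second response drops to ~30 ms
facility_kw   = 10_000 / speedup        # a 10 MW workload draws roughly 60 kW

print(f"{gpus_replaced:.1f} GPUs replaced · {batch_seconds:.0f} s batch · "
      f"{latency_ms:.0f} ms latency · {facility_kw:.0f} kW facility draw")
```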
How many GPUs does 1 rolv GPU replace? — across workloads
164.6
GPUs replaced
Llama 4 400B · 8E
133.5
GPUs replaced
Llama 4 Maverick
83
GPUs replaced
Claude 3.5-class B=512
78.9
GPUs replaced
DeepSeek-R1
21.8
GPUs replaced
Nemotron-3 · 0% sparse

Numbers represent total speedup including build time. All on NVIDIA B200 vs cuBLAS. Real HuggingFace weights except Claude 3.5-class (architecture-matched synthetic, standard methodology).

What is "Effective TFLOPS"? — For the technically minded

Think of it this way: a dense GPU calculation is like paying a full workforce to move 1,000 boxes — even though 900 of them are empty. rolvsparse© only moves the boxes with something in them. The work gets done faster because less work was actually required.

Formally: Effective TFLOPS = nominal dense FLOPs ÷ rolv wall-clock time. When this exceeds the GPU's rated peak (~1,800 TFLOPS on B200) it is not a physics violation — it means fewer multiply-accumulate operations were executed. Dense TFLOPS = true silicon utilisation. Effective TFLOPS = work elimination.
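A minimal sketch of that definition, Effective TFLOPS = nominal dense FLOPs ÷ wall-clock seconds. The matrix sizes, dtype, and PyTorch timing harness here are illustrative assumptions; only the formula comes from the text.

```python
# Effective TFLOPS as defined above: nominal dense FLOPs / measured wall-clock seconds.
# Matrix sizes, dtype, and the PyTorch timing harness are illustrative assumptions.
import time
import torch

def effective_tflops(m: int, k: int, n: int, seconds: float) -> float:
    dense_flops = 2 * m * k * n            # nominal multiply-accumulate count of a dense GEMM
    return dense_flops / seconds / 1e12

m = k = n = 8192
A = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
B = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)

torch.cuda.synchronize()
start = time.perf_counter()
C = A @ B                                  # time any implementation the same way
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"Effective TFLOPS: {effective_tflops(m, k, n, elapsed):.1f}")
# Dense TFLOPS uses the same formula with the operations actually executed;
# an Effective value above the rated peak means work was eliminated, not that
# the silicon ran faster.
```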

5,294
Eff. TFLOPS
DeepSeek-R1 · B200
3,210
Eff. TFLOPS
Llama 4 Maverick · B200
5,515
Eff. TFLOPS
Claude 3.5-class · B=512

† B200 hardware peak: ~1,800 TFLOPS dense. Values above this reflect work elimination, not measurement error.

Tokens Per Second — What Operators Actually Pay For

Every AI API bills by the token. More tokens per second from the same GPU = lower cost per token, more users served, or both. At 164.6× speedup on Llama 4 400B: a single GPU now produces what previously required a rack of 164 GPUs.

Model | cuBLAS tok/s | rolv tok/s | Energy saved
Llama 4 Maverick 400B | 169 | 149,514 | 99.9%
Llama 4 400B (8E) | 5,180 | 852,680 | 99.4%
GLM-OCR · 24 layers | 6.4M | 318M | 98.0%
Claude 3.5-class B=512 | 17,680 | 1,467,584 | 98.8%
Llama-2-7B pruned (H100) | 397k | 8,757,286 | 95.5%
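To see why tokens per second translates directly into cost per token, divide an hourly GPU rate by throughput. A minimal sketch: the $4.00/hour B200 rental rate is a placeholder assumption, not a quoted price, and the two throughput figures are the Claude 3.5-class row from the table above.

```python
# Cost per token falls in direct proportion to tokens per second from the same GPU.
# The $4.00/hour B200 rental rate is a placeholder assumption, not a quoted price.
def usd_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    return gpu_usd_per_hour / (tokens_per_second * 3600) * 1_000_000

print(usd_per_million_tokens(4.00, 17_680))      # cuBLAS baseline, Claude 3.5-class B=512
print(usd_per_million_tokens(4.00, 1_467_584))   # same GPU with rolvsparse, same row
```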
Time-To-First-Token (TTFT) — Separate from Throughput Speedup

TTFT is the latency a user feels before seeing the first word of a response. It is a separate metric from throughput speedup — measured independently on the same run. SHA-256 hash: 8dbe5f…dad56dd8dd

Model | TTFT cuBLAS | TTFT rolv | TTFT Speedup | Throughput Speedup
Llama 4 Maverick 400B | 47.46 ms | 0.91 ms | 177.5× | 133.5× total
Llama 4 400B (8 experts) | 98.99 ms | 0.98 ms | 100.9× | 164.6×
DeepSeek-R1 (256 experts) | 58.06 ms | 1.40 ms | 41.6× | 78.9×
Claude 3.5-class · B=512 | 29.0 ms | 0.52 ms | 56.3× | 83.0×
Llama 4 Scout (16 experts) | 11.27 ms | 0.96 ms | 11.7× | 81.7×
Kimi K2.5 (~1T MoE) | 29.37 ms | 0.99 ms | 29.7× | 10.6×

TTFT and throughput speedup are measured on the same run but are independent metrics — a model can have very high TTFT speedup with moderate throughput speedup, or vice versa.

03 — Infrastructure Economics

What 164.6× faster means in dollars and processors.

rolvsparse© changes the unit economics of AI infrastructure in two ways: dramatically lower energy opex, and a fundamental reduction in the number of GPUs required to deliver a given throughput — or equivalently, a massive increase in what your existing fleet can produce.

Energy Opex Savings

At 98.8% energy reduction (Claude 3.5-class production serving), the same GPU draws just 1.2% of its previous power for the matrix operation. At scale:

$6.5B–$9.9B
Annual energy savings
100k GPUs · $10B energy spend · 65–99% reduction
0.14 kWh
Per layer · 1B tokens/day
vs 12 kWh without rolv · Claude 3.5-class
$4B–$10B
Additional capex freed
Cooling · power · fewer GPUs · $20B spend
GPU Capex — Two Ways to Think About It

The speedup multiplier works in both directions. The examples below use a conservative 10× factor — well within verified results across all tested workloads (verified range: 21.8×–164.6×). Real savings will be higher.

Scenario A — You already have GPUs

Add rolvsparse© to your existing fleet. At a conservative 10× throughput factor, each GPU now produces the output of 10 standard GPUs. Your existing hardware multiplies in value without a single new processor being purchased.

Example: You have 100,000 NVIDIA B200s at $35,000 each = $3.5B invested.
At a conservative 10× throughput factor, your fleet now produces the output of 1,000,000 B200s, equivalent to $35B in effective throughput capacity, on the same hardware you already own.

Conservative 10× factor used. Verified speedups range from 21.8× to 164.6× depending on model and workload. B200 list price ~$35,000.

Scenario B — You need to hit a throughput target

Instead of buying the full GPU count, buy 1/10th as many and run rolvsparse©. Same throughput. Dramatically lower capex and ongoing energy cost. This is the conservative estimate — actual savings are higher.

Example: Your workload requires 100,000 NVIDIA B200s at $35,000 each = $3.5B procurement.
At a conservative 10× factor: you need only 10,000 B200s = $350M — saving $3.15B (90% capex reduction).
At verified speedups (21.8×–164.6×) actual savings are substantially higher.

Conservative 10× factor used throughout. Energy savings not included — they compound the advantage further.

Verified speedup range — $35k per B200 (Blackwell) · Conservative 10× used in examples above
Workload | Speedup | 1 GPU replaces | Value of 1,000 GPUs | GPUs needed (conservative 10×) | Procurement saving
Llama 4 400B · 8 experts | 164.6× | 164.6 B200s | $5.76B | 10,000 | $3.15B
Llama 4 Maverick (total) | 133.5× | 133.5 B200s | $4.67B | 10,000 | $3.15B
DeepSeek-R1 · 256 experts | 78.9× | 78.9 B200s | $2.76B | 10,000 | $3.15B
Claude 3.5-class · B=512 | 83.0× | 83.0 B200s | $2.91B | 10,000 | $3.15B
Nemotron-3 · 0% sparse | 21.8× | 21.8 B200s | $763M | 10,000 | $3.15B

GPU price assumption: $35,000 per NVIDIA B200 (Blackwell). Conservative 10× factor used for "GPUs needed" column — actual verified speedups shown in column 2. Energy savings not included — they compound the advantage further.

ROI Calculator

Estimate your energy & hardware savings

Based on verified 98.8% energy reduction at Claude 3.5-class batch=512. GPU capex uses conservative 10× throughput factor (verified total range: 21.8×–164.6×).
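The sketch below shows, under the figures quoted on this page (98.8% energy reduction, a conservative 10× throughput factor, $35,000 per B200), the kind of arithmetic such a calculator performs; the function name and inputs are illustrative, not the calculator's actual interface.

```python
# Rough sketch of the ROI arithmetic. The fleet size and energy spend are
# placeholder inputs; the 98.8% energy reduction and conservative 10x
# throughput factor are the figures quoted on this page.
def roi_estimate(num_gpus: int, gpu_price_usd: float, annual_energy_spend_usd: float,
                 energy_reduction: float = 0.988, throughput_factor: float = 10.0) -> dict:
    baseline_capex = num_gpus * gpu_price_usd
    gpus_needed    = num_gpus / throughput_factor        # Scenario B: buy 1/10th as many GPUs
    capex_needed   = gpus_needed * gpu_price_usd
    return {
        "gpus_needed": gpus_needed,
        "capex_saving_usd": baseline_capex - capex_needed,
        "annual_energy_saving_usd": annual_energy_spend_usd * energy_reduction,
    }

# Scenario B example: 100,000 B200s at $35,000 each, $10B annual energy spend.
print(roi_estimate(100_000, 35_000, 10_000_000_000))
# -> 10,000 GPUs needed, $3.15B capex saving, $9.88B annual energy saving
```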

05 — Independent Verification

Every result is independently verified.

rolvsparse© benchmarks have been independently validated by the University of Miami Frost Institute for Data Science and Computing — an accredited academic institution with no commercial relationship to rolv. All results are deterministic, reproducible, and hash-verified across every platform.

University of Miami — Frost Institute for Data Science and Computing

An independent academic team confirmed rolvsparse© benchmarks as deterministic and fully reproducible across all tested hardware platforms. Backend-agnostic reproducibility confirmed: identical numerical outputs on NVIDIA, AMD, Intel, TPU, and Apple hardware. Cryptographic SHA-256 output hashes published for independent third-party verification.

"Deterministic and reproducible results confirmed across all tested platforms." — Frost Institute Validation Report

Frost Institute · Validation Letter · Verification Kit v2.0
No GPU Required

Try It Yourself — Any Hardware. Any Laptop.

Run our verification script on your own hardware and get a cryptographic SHA-256 fingerprint of the result. Email the JSON to [email protected] — we run the same computation through rolvsparse© on identical inputs, produce the identical output hash, and return a full "Us vs. Them" comparison report showing your exact speedup and energy savings.

Step 1
Run the Script
Download and run rolv-verifier.py on your hardware. No GPU required — any CPU, any laptop. Novice users: paste into Jupyter and press Shift+Enter.
Step 2
Get Your Hash
The script outputs a SHA-256 fingerprint of your result — a cryptographic baseline unique to your hardware and run. It also captures full hardware specs and energy readings.
Step 3
Get Your Report
Email the .json file to [email protected]. We run ROLV against your exact inputs and return a full comparison showing speedup, energy savings, and matching hash.
How SHA-256 Verification Works
The Baseline
Your hardware generates a unique SHA-256 fingerprint of the matrix result — produced entirely on your own machine.
The Match
ROLV processes the same data on our infrastructure and must produce the exact same hash, proving the result was not approximated and no precision was lost.
The Proof
Identical hash = identical precision. In rare cases of CUDA version drift we confirm numerically (atol=1e-5). The guarantee stands.
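The verifier script itself ships with the Validation Kit; the snippet below is only a minimal illustration of the baseline step, hashing the raw bytes of a deterministically seeded matrix result so that fingerprints can be compared across machines. The field names and sizes are placeholders, not the kit's real schema.

```python
# Minimal illustration of the baseline step (NOT the actual rolv-verifier.py):
# compute a deterministically seeded matrix result, hash its raw bytes, and
# write a small JSON record for comparison. Field names are placeholders.
import hashlib
import json
import platform
import time

import numpy as np

rng = np.random.default_rng(42)                        # fixed seed: same inputs on every machine
A = rng.standard_normal((1024, 1024))
B = rng.standard_normal((1024, 1024))

start = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - start

fingerprint = hashlib.sha256(C.tobytes()).hexdigest()  # cryptographic fingerprint of the result

record = {
    "sha256": fingerprint,
    "elapsed_s": elapsed,
    "shape": [1024, 1024],
    "dtype": str(C.dtype),
    "machine": platform.platform(),
}
with open("rolv_baseline_example.json", "w") as f:
    json.dump(record, f, indent=2)

print(fingerprint)
# Note: a generic BLAS matmul is not guaranteed to be bit-identical across
# backends; producing the same hash everywhere requires a bit-deterministic
# kernel, which is what the verification workflow is designed to demonstrate.
```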
v2.0 — Real Hardware Energy Readings
NVIDIA GPUs
pynvml polls the GPU power rail every 50 ms; joules computed via trapezoidal integration of live readings (see the sketch below).
AMD GPUs
pyrsmi provides equivalent live readings where the driver supports it. Falls back to estimate if unavailable.
CPU / Apple Silicon
Estimated from psutil CPU utilization × TDP — clearly labelled as an estimate in the output JSON via energy_measurement_method.
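For the NVIDIA path, a minimal sketch of the approach described above (poll board power via NVML roughly every 50 ms, then integrate the samples with the trapezoidal rule to obtain joules). The threading harness and function signature are illustrative assumptions; the kit's actual implementation may differ.

```python
# Sketch of the NVIDIA measurement path described above: poll board power via
# NVML roughly every 50 ms while the workload runs, then integrate the samples
# with the trapezoidal rule to obtain joules.
import threading
import time

import pynvml

def measure_energy_joules(workload, device_index: int = 0, interval_s: float = 0.05) -> float:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []                                  # (timestamp, watts)
    stop = threading.Event()

    def poll():
        while not stop.is_set():
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # API reports milliwatts
            samples.append((time.perf_counter(), watts))
            time.sleep(interval_s)

    poller = threading.Thread(target=poll, daemon=True)
    poller.start()
    workload()                                    # run the kernel being measured
    stop.set()
    poller.join()
    pynvml.nvmlShutdown()

    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)     # trapezoidal integration
    return joules
```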
Recommended Cloud Environments
RunPod.io
NVIDIA & AMD GPU testing — A100, H100, B200, MI300X. Clean CUDA/ROCm stacks, accurate NVML/AMD SMI telemetry.
Google Cloud
AMD and Intel CPU instances — stable OS images, predictable performance, no power throttling. Ideal for EPYC benchmarks.
Google Colab
Intel Xeon CPU & Google TPU v5e-1 and v6e-1 — free tier available, standardised PyTorch/XLA environments.
Kaggle
Free Google TPU v5e-8 access — ideal for reproducing our TPU benchmarks. No credit card required.
System Requirements
PyTorch 2.5.0+ · CUDA 12.1 recommended · pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121
Output file format: rolv_baseline_<email>_<timestamp>.json — email to [email protected].
↓ Download Validation Kit v2.0
Academic Validation

University of Miami Frost Institute

The Frost Institute confirmed all rolvsparse© benchmarks as deterministic and reproducible on real hardware across every tested platform. No commercial interest. Engaged solely to verify accuracy and reproducibility.

↓ View Validation Letter →
Reproducibility

SHA-256 Hash-Verified · Cross-Platform

Identical numerical outputs confirmed on NVIDIA, AMD, Intel, TPU, and Apple hardware. The cryptographic hash 8dbe5f139fd946d4cd84e8cc…dad56dd8dd is the same across every platform and sparsity level.

↓ Download Verification Kit →
06 — RSMT & Engineering Tools

The Rolv Sparse Memory Threshold: a universal rule.

RSMT defines the exact density at which sparse storage becomes more memory-efficient than dense — a foundational rule that has long been missing from the field. VRAM, not compute, is the dominant bottleneck in large-scale inference. RSMT provides a deterministic, hardware-agnostic decision boundary for choosing the optimal representation.

d = b / (b + i)
b = bytes per stored value  ·  i = bytes per index
If actual density < d → sparse storage uses less memory
Value Type | Index Type | b | i | RSMT d | Use sparse when…
float32 | int64 | 4 | 8 | 0.333 | density < 33%
float16 / BF16 | int64 | 2 | 8 | 0.200 | density < 20%
float32 | int32 | 4 | 4 | 0.500 | density < 50%
int8 | int32 | 1 | 4 | 0.200 | density < 20%
RSMT Calculator
rolv Unit Calculator

Composite efficiency: (Sparsity × Energy Savings) / 100
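Both formulas are one-liners. A minimal sketch, with function names chosen purely for illustration:

```python
# RSMT rule from above: sparse storage wins when density < b / (b + i).
def rsmt_threshold(bytes_per_value: float, bytes_per_index: float) -> float:
    return bytes_per_value / (bytes_per_value + bytes_per_index)

def use_sparse(density: float, bytes_per_value: float, bytes_per_index: float) -> bool:
    return density < rsmt_threshold(bytes_per_value, bytes_per_index)

# Composite efficiency as defined above: (sparsity % x energy savings %) / 100.
def composite_efficiency(sparsity_pct: float, energy_savings_pct: float) -> float:
    return sparsity_pct * energy_savings_pct / 100

print(rsmt_threshold(4, 8))             # float32 values, int64 indices -> 0.333...
print(use_sparse(0.25, 4, 8))           # 25% density -> True, sparse storage is smaller
print(composite_efficiency(90, 98.8))   # -> 88.92
```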

07 — Leadership

The Founder.

rolv E. Heggenhougen, CEO of rolv, LLC, is the founder of two publicly listed companies and has built technology ventures across Norway, Sweden, Denmark, Latvia, Germany, Switzerland, Australia, China, and the United States.

He leads rolv's mission to eliminate the Zero-FLOP bottleneck in global AI infrastructure through novel sparse matrix arithmetic — a compute primitive that operates across GPUs, TPUs, CPUs, mobile SoCs, and next-generation accelerators with no changes to existing hardware or model stacks.

Mr. Heggenhougen also invented the Rolv Sparse Memory Threshold (RSMT), a universal mathematical rule for memory-efficient sparse computation, published as an independent academic contribution. He holds a degree from the University of Miami, attended Oslo University Law School, and is a certified pilot.

Fluent in Norwegian, Danish, and Swedish; working knowledge of German.

Patents
2 patents issued, 6 pending (Oct 2025). Covering Binary, Quantum, DNA, Optical, and Plant platforms for AI, plus Mobile and EV applications.
Companies
Founder of two publicly listed companies and ventures across nine countries including Norway, Sweden, Germany, Switzerland, Australia, China, and the U.S.
Education
Graduate of University of Miami. Attended Oslo University Law School. Certified pilot. Fluent in Norwegian, Danish, Swedish.
Validation
All rolv benchmarks independently validated by the University of Miami Frost Institute for Data Science and Computing. Open to third-party audit.
Research
Inventor of the Rolv Sparse Memory Threshold (RSMT) — a universal mathematical rule for memory-efficient sparse computation, published openly.