No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.
Energy follows compute
Fewer operations mean less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
02 — Benchmark Results
Real production weights and synthetic sweep · all verified.
NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 550/550 PASS.
Baseline selection: below 70% sparsity we compare ROLV™ to cuBLAS — the operator production inference engines use for dense or lightly sparse weights. At 70% and above we compare to cuSPARSE — the operator production inference engines deploy specifically for sparse weight matrices, regardless of whether cuBLAS is faster in raw timing at that level. Comparing against cuBLAS above 70% would mean measuring ROLV™ against an operator that computes wasted arithmetic on zero values: accurate in a lab, but not what any real inference engine does. Both vendor timings are recorded and published in every result.
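As a minimal sketch, the baseline-selection rule above reduces to a single threshold check (the 70% cut-off is the one stated here; the function name is hypothetical, not part of any shipped API):

```python
# Sketch of the baseline-selection rule described above (illustrative only).
def select_vendor_baseline(sparsity: float) -> str:
    """Pick the vendor operator a production inference engine would deploy."""
    if sparsity < 0.70:
        return "cuBLAS"    # dense / lightly sparse weights
    return "cuSPARSE"      # sparse weights, even where cuBLAS wins on raw timing

print(select_vendor_baseline(0.50), select_vendor_baseline(0.80))  # cuBLAS cuSPARSE
```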
550/550 PASS · all verified · 6 platforms · real LLaMA weights · max error 9.87×10⁻⁷
4 SHA-256 verification hashes per case — weight matrix · input vector · dense baseline · ROLV output — plus a perturbation test for every case
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · Real weights from HuggingFace · 4/4 PASS
Real model. Real weights. Up to 9.53× faster · up to 89.5% energy reduction.
MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.
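The pruning recipe described here can be sketched in a few lines of NumPy. This is an illustration of magnitude row pruning only, on a small stand-in matrix; the production pipeline may prune per-element or per-group instead:

```python
import numpy as np

def magnitude_row_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the rows of W with the smallest L2 norms until the requested
    fraction of rows is zero (a sketch, not the production recipe)."""
    n_prune = int(round(W.shape[0] * sparsity))
    prune_idx = np.argsort(np.linalg.norm(W, axis=1))[:n_prune]
    W_pruned = W.copy()
    W_pruned[prune_idx, :] = 0.0
    return W_pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 16))          # small stand-in for the 14336x4096 layer
zero_rows = int((np.abs(magnitude_row_prune(W, 0.80)).sum(axis=1) == 0).sum())
print(zero_rows)  # 80: exactly 80% of rows fully zeroed
```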
Vendor note: cuBLAS runs at 2.48ms throughout. cuSPARSE is slower than cuBLAS at 80% sparsity (5.90ms vs 2.48ms) but faster at 95%+. Speedup below is always vs the best available vendor at each level. "vs cuBLAS" column shown separately.
A (weight matrix)9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector)32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e
★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run — clearly labelled). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS
HuggingFace Models — NVIDIA B200 — 96/96 PASS
Real weights from 5 production LLMs. Up to 19.42× speedup · 99% energy saved.
99% energy saved · 19.42× peak speedup · 6+ platforms · 96/96 correctness · 44,987 GFLOP/s · 19.3M tok/s · 0.23ms TTFT · 4×SHA-256 verified
| Model | Layer | Sp% | vs | Speedup | Energy | Pass |
|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | ✓ |
| Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | ✓ |
| DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | ✓ |
| LLaMA-2-7B (NeuralMagic 50%) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | ✓ |
★ = peak. NVIDIA B200 · 96/96 correctness PASS · 4 SHA-256 hashes per case. Small GQA k/v (<512 rows) below minimum-latency floor — not claimed.
The larger the model, the greater the advantage. 15.22× peak on 405B.
Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.
15.22× — peak, 405B down_proj (16384×28672 · 80% · +92.6% energy)
13.37× — 405B embed_tokens (128256×16384 · 80% · +92.9% energy)
49/49 — correctness PASS (all layers · all sparsity levels · max error 3.2×10⁻⁶)
Scaling across model sizes — mlp.gate_proj (same layer type):
LLaMA-3.1-8B: 10.47× (14336×4096 · 70%)
LLaMA-3.1-70B: 11.45× (28672×8192 · 70%)
LLaMA-3.1-405B ★: 13.02× (28672×16384 · 70%)
H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).
BF16 production dtype · LLaMA-3.1-8B & 70B · NVIDIA B200 · 70/70 PASS. 1.00× at 0% · 2.4× vs cuBLAS-BF16 at 70%.
LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.
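The looser ATOL=0.05 reflects BF16's 7-bit mantissa. A minimal sketch of BF16 truncation (round-to-nearest omitted for brevity; real conversions round to nearest even) shows that rounding noise alone sits far above FP32-level tolerances:

```python
import numpy as np

def to_bf16(x):
    """Truncate float32 to bfloat16 precision by dropping the low 16
    mantissa bits (a sketch; real BF16 conversion rounds to nearest even)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# BF16 keeps only 7 explicit mantissa bits:
assert float(to_bf16(np.float32(1 + 2**-7))) == 1 + 2**-7   # bit 7 survives
assert float(to_bf16(np.float32(1 + 2**-9))) == 1.0         # finer bits are lost

# Rounding noise in even a small matvec dwarfs FP32-level tolerances,
# which is why BF16 comparisons need a looser ATOL budget.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
v = rng.standard_normal(64).astype(np.float32)
max_err = float(np.max(np.abs(A @ v - to_bf16(A) @ to_bf16(v))))
print(f"max abs error from BF16 rounding alone: {max_err:.4f}")
```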
Sparsity structure · why our synthetic benchmarks are a floor
Real pruned weights outperform our published numbers.
Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV: non-zero values are scattered across every row so no row is entirely zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.
A — Uniform random: 1.00× at 70–95% sparsity. Every row retains at least one non-zero value, so no rows can be skipped and CRCS™ compression = 1.0×. This is our published synthetic and the absolute worst case for ROLV.
B — Power-law rows: 7.6–9.2× at 70–95% sparsity (70–95% inactive blocks). Matches magnitude pruning on real LLM weights; ROLV eliminates computation on all inactive blocks.
C — Block structured: 7.8–9.4× at 70–95% sparsity (70–95% inactive blocks). Matches structured head pruning: entire parameter groups are eliminated, and ROLV skips complete inactive groups.
Hardware: NVIDIA B200 · 5000×5000 · batch 1,000. Correctness: 12/12 PASS · 4 SHA-256 hashes per case. Conclusion: power-law vs uniform +659%; block-structured vs uniform +677%.
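The contrast between the uniform-random and row-structured cases can be sketched with NumPy (sizes and seed are illustrative; the second case uses simple whole-row zeroing as a crude stand-in for a true power-law draw):

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, sparsity = 1000, 64, 0.90

# Uniform random: zeros scattered element-wise, rows almost never fully zero.
uniform = rng.standard_normal((rows, cols))
uniform[rng.random((rows, cols)) < sparsity] = 0.0

# Row-collapsed: whole rows zeroed, as magnitude pruning tends to
# produce on real LLM weights.
row_pruned = rng.standard_normal((rows, cols))
row_pruned[rng.random(rows) < sparsity, :] = 0.0

def skippable_rows(W):
    """Rows a row-skipping operator can eliminate entirely."""
    return int((np.abs(W).sum(axis=1) == 0).sum())

# Same element-level sparsity, radically different skippable structure:
print(skippable_rows(uniform), "vs", skippable_rows(row_pruned), "of", rows)
```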
ROLV advantage compounds as workloads grow — in every dimension.
Vendor sparse operators scale linearly with work: double the batch, double the time; double the matrix, roughly double the time. ROLV does not. It operates only on the active subset of the weight matrix and skips zero rows entirely, so as batch size grows, as matrices get larger with bigger models, and as iteration counts increase, ROLV pulls further ahead. The advantage is structural, not incidental.
Batch size ↑ — cuSPARSE latency scales linearly with batch; ROLV scales sub-linearly, with fixed overhead amortised across more tokens. At batch=2,048 ROLV uses 0.41µs/token vs cuSPARSE’s 4.44µs/token. Speedup: 1.24× at batch 1 · 7.92× at batch 512 · 10.90× at batch 2,048.
Model size ↑ — larger models have larger weight matrices: ROLV’s skip fraction stays constant while the absolute number of rows skipped grows. Speedup increases consistently from 8B to 70B to 405B; the biggest models benefit most. Speedup: 10.5× on LLaMA 8B · 11.45× on LLaMA 70B · 12.2× on LLaMA 405B.
Iteration count ↑ — ROLV is built once from a weight matrix, then reused across every inference call. The build cost is fully amortised after the first few thousand iterations; at production scale — millions of daily requests — it never appears in the cost. ~0 build cost · 10.90× every call · ∞ at scale.
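The amortisation argument is simple arithmetic. A sketch with a hypothetical 50 ms build cost (illustrative, not a measured figure):

```python
def amortised_build_us(build_cost_us: float, calls: int) -> float:
    """Per-call share of the one-time operator build cost."""
    return build_cost_us / calls

# Hypothetical: a 50 ms (50,000 µs) build spread over one day of 1,000,000 requests.
per_call_us = amortised_build_us(50_000, 1_000_000)
print(per_call_us)  # 0.05 µs per call: negligible next to per-call latency
```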
Batch scaling: 14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · 9/9 PASS. Model scaling: LLaMA-3.1 exact dimensions · B200 · batch=512 · 84/84 PASS. The vendor advantage is always structural — ROLV skips work that vendors must perform.
A hash: 76252923 · V hash: 7f9f717a · Peak 83.77× at 85% vs rocSPARSE · vs rocBLAS dense: 8.5× peak
| Sparsity | Baseline | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | GFLOP/s Vendor | Energy |
|---|---|---|---|---|---|---|---|---|
| 0% | rocBLAS | 5.96ms | 5.86ms | 419,486 | 426,294 | 83,897 | 85,259 | ref |
| 5% | rocBLAS | 5.31ms | 5.93ms | 470,648 | 421,931 | 89,423 | 84,386 | +11% |
| 50% | rocBLAS | 3.00ms | 5.78ms | 832,076 | 432,525 | 83,208 | 86,505 | +50% |
| 70% | rocSPARSE | 1.89ms | 121.69ms | 1,324,344 | 20,543 | 79,461 | 4,109 | +99% |
| 80% | rocSPARSE | 1.85ms | 92.31ms | 1,351,560 | 27,084 | 54,062 | 5,417 | +98% |
| 85% | rocSPARSE | 0.89ms | 74.27ms | 2,819,691 | 33,659 | 84,591 | 6,732 | +99% |
| 90% | rocSPARSE | 0.81ms | 54.00ms | 3,075,151 | 46,296 | 61,503 | 9,259 | +99% |
| 95% | rocSPARSE | 0.69ms | 30.42ms | 3,607,621 | 82,177 | 36,076 | 16,435 | +98% |
| 99% | rocSPARSE | 0.68ms | 7.75ms | 3,683,090 | 322,663 | 7,366 | 64,533 | +92% |
rocSPARSE has a known performance regression on MI300X for this matrix topology. Both vendor timings (rocBLAS dense and rocSPARSE sparse) are published. ROLV™ absolute latency is consistent: 0.68–1.89ms across all sparsity levels where rocSPARSE is the baseline (70–99%).
Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token, dominated by the prefill pass through all transformer layers. ROLV™ reduces per-layer latency by skipping computation on zero-valued parameters entirely. At 80% sparsity on H200 this cuts each layer from 5.90ms to 0.43ms. Across 32 layers: ~970ms prefill becomes ~71ms.
Tokens per second is the inverse of TTFT per output row — as ROLV™ gets faster, tokens/s grows proportionally. Effective GFLOP/s counts only floating-point operations performed on non-zero values. cuSPARSE and cuBLAS spend cycles on zeros that contribute nothing to the output. ROLV™ skips them, so every FLOP counted is a useful FLOP.
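The prefill arithmetic above is just a ratio: holding the layer count fixed, TTFT speedup equals per-layer speedup. Using the figures quoted in the text:

```python
# Per-layer figures from the text (H200, 80% sparsity): 5.90 ms -> 0.43 ms.
vendor_layer_ms, rolv_layer_ms = 5.90, 0.43
speedup = vendor_layer_ms / rolv_layer_ms

prefill_before_ms = 970                        # ~970 ms prefill, from the text
prefill_after_ms = prefill_before_ms / speedup
print(f"{speedup:.1f}x -> ~{prefill_after_ms:.0f} ms prefill")  # 13.7x -> ~71 ms prefill
```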
Synthetic sweep — worst-case uniform random floor
Uniform-random sparsity. No structural advantage. Published numbers are a floor.
Synthetic matrices use Bernoulli random sparsity — the hardest case for ROLV™ because rows are rarely fully zero. Real pruned LLM weights follow power-law distributions where entire rows collapse to zero, giving significantly higher speedups.
A hash: 76252923 · V hash: 7f9f717a · Peak 83.77× at 85% vs rocSPARSE · Crossover: 5% sparsity
rocSPARSE is the production sparse operator on AMD ROCm — the same role cuSPARSE plays on NVIDIA. rocSPARSE has a known performance regression on MI300X for this matrix topology (121ms vs cuSPARSE’s 4.82ms on H200). Both ROLV™ absolute timing and speedup vs vendor are published. vs rocBLAS dense: ROLV peaks at 8.5×.
| Sp% | Baseline | Vendor ms | ROLV ms | Speedup | Energy | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | PASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | rocBLAS | 5.86 | 5.96 | 0.98× | — | 5.96ms | 5.86ms | 419,486 | 426,294 | 83,897 | ✓ |
| 5% | rocBLAS | 5.93 | 5.31 | 1.12× | +11% | 5.31ms | 5.93ms | 470,648 | 421,931 | 89,423 | ✓ |
| 50% | rocBLAS | 5.78 | 3.00 | 1.92× | +50% | 3.00ms | 5.78ms | 832,076 | 432,525 | 83,208 | ✓ |
| 70% | rocSPARSE | 121.69 | 1.89 | 64.47× | +99% | 1.89ms | 121.69ms | 1,324,344 | 20,543 | 79,461 | ✓ |
| 80% | rocSPARSE | 92.31 | 1.85 | 49.90× | +98% | 1.85ms | 92.31ms | 1,351,560 | 27,084 | 54,062 | ✓ |
| 85% | rocSPARSE | 74.27 | 0.89 | 83.77× | +99% | 0.89ms | 74.27ms | 2,819,691 | 33,659 | 84,591 | ✓ |
| 90% | rocSPARSE | 54.00 | 0.81 | 66.42× | +99% | 0.81ms | 54.00ms | 3,075,151 | 46,296 | 61,503 | ✓ |
| 95% | rocSPARSE | 30.42 | 0.69 | 43.90× | +98% | 0.69ms | 30.42ms | 3,607,621 | 82,177 | 36,076 | ✓ |
| 99% | rocSPARSE | 7.75 | 0.68 | 11.41× | +92% | 0.68ms | 7.75ms | 3,683,090 | 322,663 | 7,366 | ✓ |
ROLV™ beats rocBLAS from 5% sparsity onwards. rocSPARSE baseline applies at 70%+. ROLV™ absolute latency stable at 0.69–1.89ms across 70–99% sparsity. 22/22 PASS, max_abs=0.000.
05a — VRAM savings — scales exactly with sparsity
Less VRAM means larger models or larger batches on the same GPU.
ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.
5× less VRAM at 80% (47 MB vs 235 MB) · 10× at 90% (23 MB vs 235 MB) · 20× at 95% (12 MB vs 235 MB) · 100× at 99% (2.3 MB vs 235 MB)
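The VRAM figures follow directly from row counts. A sketch assuming 4-byte FP32 storage (the assumption that reproduces the 235 MB dense figure for 14336×4096):

```python
# Reproducing the VRAM arithmetic: only surviving (non-zero) rows are stored.
rows, cols, bytes_per = 14336, 4096, 4
dense_mb = rows * cols * bytes_per / 1e6      # ~234.9 MB dense

def rolv_mb(sparsity: float) -> float:
    """Storage for the active rows only, under the FP32 assumption."""
    active_rows = round(rows * (1 - sparsity))
    return active_rows * cols * bytes_per / 1e6

for s in (0.80, 0.90, 0.95, 0.99):
    print(f"{s:.0%}: {rolv_mb(s):.1f} MB vs {dense_mb:.0f} MB dense")
```

The printed values are consistent with the 47 / 23 / 12 / 2.3 MB figures above.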
Independent Verification
Four hashes eliminate the need for trust.
Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.
The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.
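A verifier's side of the protocol can be sketched in a few lines. The exact serialisation (dtype, byte order, memory layout) would need to match the Validation Kit's conventions for hashes to be comparable; this only illustrates the commitment property:

```python
import hashlib
import numpy as np

def tensor_sha256(x: np.ndarray) -> str:
    """SHA-256 of a tensor's raw bytes (a sketch of the commitment scheme;
    serialisation must match the Validation Kit's for hashes to compare)."""
    return hashlib.sha256(np.ascontiguousarray(x).tobytes()).hexdigest()

A = np.arange(12, dtype=np.float32).reshape(3, 4)   # stand-in weight matrix
h = tensor_sha256(A)
assert len(h) == 64                   # 64 hex chars
assert tensor_sha256(A) == h          # deterministic: recomputation matches

# The perturbation idea in miniature: any change to any element yields a
# different hash, so a matching hash pins down the exact tensor.
A2 = A.copy()
A2[0, 0] += 1e-6
assert tensor_sha256(A2) != h
```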
Small numerical deviations from the dense baseline are expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.