One primitive — no model changes — GPU · CPU · any platform

AI inference up to 106× faster
and 99% less energy.
Same hardware. Same model. One import line.

Get your free token
Already have a token?
GPU · CPU · Windows · Mac · Linux · ~15 min
Speedup vs MKL
3× → 106×
avg → peak · CPU
Energy saved
75% → 99%
typical → peak · pynvml
Verified
377/377
tests pass · 9 models
Business case ↓ How it works ↓ Benchmarks ↓ Calculators ↓ Enterprise ↓
What you will see when you run it

Real results. Real weights. Your hardware.

intel
CPU
106×
vs MKL
99%
energy saved
nvidia
GPU
74×
vs cuSPARSE
4.75× vs cuBLAS
90%
energy saved
amd
GPU
8.5×
vs rocBLAS
MI300X
88%
energy saved
amd
CPU
21×
vs OpenBLAS
EPYC
95%
energy saved
google
ARM CPU
28×
vs MKL
Colab Xeon
96%
energy saved
apple
ARM CPU
27×
vs MKL
M1–M4 · aarch64
96%
energy saved
Real HuggingFace weights · no synthetic data · results signed to your processor · DOI 10.5281/zenodo.19221455
Infrastructure cost

What does 3× → 106× faster mean for your bill?

Every GPU you run today spends 70–99% of its MoE inference compute on rows that are guaranteed to be zero. ROLV eliminates them. Fewer GPUs. Lower energy bills. Same output.

At 3× — conservative
100,000 GPUs →
33,333
GPUs
66,667 freed
$10B+
capex avoided
owned · ~$150k/GPU H100 purchase
$1.2B/yr
cloud rental avoided
66,667 × $2/hr × 8,760 hrs
At 32× — measured avg
100,000 GPUs →
3,125
GPUs
96,875 freed
$15B+
capex avoided
owned · ~$150k/GPU H100 purchase
$1.7B/yr
cloud rental avoided
96,875 × $2/hr × 8,760 hrs
Energy — 75% → 99% saved
100k GPUs × 700W × 8,760 hrs
= 70 MW → 700 kW at 99% reduction
$48M/yr
electricity saved
owned · $0.08/kWh bulk rate
$350M/yr
energy component in cloud bills
100k × $0.40/hr energy × 8,760 hrs
pynvml verified · H200 · no new hardware
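The fleet arithmetic above can be reproduced with a short script. This is a back-of-envelope sketch using the page's stated assumptions (~$150k per H100, $2/hr cloud rate, 700 W per GPU, $0.08/kWh bulk rate, 8,760 hours/year); it is not an official ROLV calculator.

```python
# Illustrative model of the fleet figures above. All rates are the
# assumptions quoted on this page, not measured values.

def fleet_savings(gpus=100_000, speedup=3.0,
                  capex_per_gpu=150_000, cloud_rate_hr=2.0,
                  watts_per_gpu=700, kwh_rate=0.08, hours_yr=8_760):
    needed = gpus / speedup                  # GPUs required at the new throughput
    freed = gpus - needed
    capex_avoided = freed * capex_per_gpu    # if the fleet is owned
    rental_avoided = freed * cloud_rate_hr * hours_yr
    kwh_saved = freed * watts_per_gpu / 1_000 * hours_yr
    return freed, capex_avoided, rental_avoided, kwh_saved * kwh_rate

freed, capex, rental, electricity = fleet_savings(speedup=3.0)
print(f"{freed:,.0f} GPUs freed, ${capex/1e9:.1f}B capex avoided, "
      f"${rental/1e6:.0f}M/yr rental, ${electricity/1e6:.1f}M/yr electricity")
```

Swap in your own fleet size, speedup, and rates to get your own numbers.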

No sign-up · runs in your browser · your own numbers

Technical Foundation

One operator. Exact output.
Proportionally fewer multiplications.

ROLV Primitive© is a drop-in replacement for cuBLAS and cuSPARSE that exploits the natural zero structure of AI weight matrices. No approximation. No accuracy cost. Deterministic on every platform.

MoE Natural Sparsity — Real Model Results

Mixtral. Qwen3. Llama-4. DeepSeek. Jamba. All PASS.
Real weights. Zero pruning. Independently verified.

MoE routers zero out 75–99% of expert weights per token — architecturally, exactly. cuBLAS computes them all. ROLV doesn’t. The speedup is proportional and provable.
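The proportionality is simple arithmetic: if a router activates k of E experts, a dense kernel performs E/k times the useful expert work. A minimal sketch, with expert counts and top-k values taken from the models on this page:

```python
# Natural sparsity and the ideal (upper-bound) speedup from skipping
# the inactive experts a MoE router zeroes out.

def moe_waste(num_experts, top_k):
    sparsity = 1 - top_k / num_experts    # fraction of expert rows that are zero
    ideal_speedup = num_experts / top_k   # upper bound from skipping zero rows
    return sparsity, ideal_speedup

for name, e, k in [("Mixtral-8x7B", 8, 2), ("DeepSeek-V3", 256, 8)]:
    sp, s = moe_waste(e, k)
    print(f"{name}: {sp:.1%} natural sparsity, ideal {s:.0f}x over dense")
```

Measured speedups land below the ideal bound because routing, memory traffic, and the dense remainder of the layer still cost time.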

43×
vs cuSPARSE · 2.49× cuBLAS
OLMoE-1B-7B
87.5% natural sp · REAL
40×
vs cuSPARSE · 2.94× cuBLAS
DeepSeek-V2-Lite
90.6% natural sp · REAL
74×
vs cuSPARSE · 3.38× cuBLAS
Phi-3.5-MoE
87.5% natural sp · REAL
35×
vs cuSPARSE · 3.37× cuBLAS
Qwen1.5-MoE-A2.7B
93.3% natural sp · REAL
Universal Compatibility

Works on every platform. Today and tomorrow.

NVIDIA · AMD · Intel · ARM · Apple · Google TPU · Custom ASICs · FPGAs · Photonic · Quantum · Any hardware that does matrix multiply.

A Story About Waste at Scale

ROLV Makes AI Available to Anyone,
Anywhere with a PC.

Picture a container ship crossing the Pacific. It carries 20,000 containers. The manifest says 5,000 of them are empty — have always been empty, will be empty on arrival. But the ship cannot leave them behind. Its loading system was built decades ago and it can only operate one way: load everything, sail everything, unload everything.

It burns fuel proportional to its total cargo — including the 5,000 empty containers. The crew works proportional to total cargo. The port fees are proportional to total cargo. Every crossing. Every time.

This is what cuBLAS does with MoE inference. The empty containers are the inactive experts — architecturally zero, guaranteed by the router, known before the computation starts. cuBLAS has no mechanism to leave them on the dock. It computes all of them, every token, every layer, every inference call.

ROLV Primitive© is the loading system that reads the manifest first. It identifies the empty containers before departure. It sails only what carries cargo. Same destination. Same output. A fraction of the fuel.

The numbers behind the analogy
DeepSeek-V3 — 256 experts, top-8 active
248
empty containers per token
96.9% of all compute wasted by cuBLAS
ROLV Primitive© computes only
8
active experts — exactly
8.76× faster · 110× vs cuSPARSE · PASS
Mixtral-8×7B — 8 experts, top-2 active
6
empty containers per token
75% of all compute wasted by cuBLAS
ROLV computes only
2
active experts — exactly
1.86× faster · 109× vs cuSPARSE · PASS

Every frontier model crossing the Pacific today carries empty containers. ROLV leaves them on the dock.

Benchmarks — Real Weights · SHA-256 Verified · 1,000 iters

Full results. Every claim verified.

Model | Src | Nat sp% | vs cuBLAS | vs cuSPARSE | Energy% | Tokens/s | PASS
Mixtral-8×7B | REAL | 75.0% | 1.86× | 109× | 46% | 2,185,075 | ✓
Mixtral-8×22B | synth | 75.0% | 2.43× | 107× | 59% | 1,073,568 | ✓
Qwen2-57B-A14B | synth | 87.5% | 3.37× | 70× | 70% | 2,374,040 | ✓
Qwen3-30B-A3B | REAL | 93.8% | 3.43× | 32× | 71% | 6,650,774 | ✓
Llama-4-Scout ★ | REAL | 93.8% | 4.75× | 103× | 79% | 5,795,875 | ✓
DeepSeek-V3/R1 | synth | 96.9% | 8.76× | 110× | 89% | 1,758,046 | ✓

NVIDIA B200 · BF16 · TF32 ON · 1,000 iters · ATOL=0.05 col-norm fp64 · 4 SHA-256 hashes + perturbation PASS · †exact production dims

Model / Layer | GPU | Sparsity | vs cuBLAS | vs vendor sparse | PASS
LLaMA-3.1-8B up_proj [REAL] | H200 | 80% | 2.17× | 9.53× | ✓
LLaMA-3.1-8B up_proj [REAL] | H200 | 90% | 2.79× | 8.66× | ✓
DeepSeek-R1 embed [REAL] | B200 | 95% | 19.42× | 19.42× | ✓
10k×10k synthetic | B200 | 70% | 3.11× | 12.06× | ✓
10k×10k synthetic | MI300X | 85% | 8.5× | 83.77× | ✓
Tesla T4 synthetic | T4 | 90% | 5.8× | 14.2× | ✓

1,684/1,684 total PASS across all GPU benchmarks · BF16 · TF32 ON · ATOL=0.05 · AMD MI300X: rocBLAS 8.5× (rocSPARSE has known regression at high sparsity)

Model / Layer | CPU | Sparsity | vs MKL (iter) | vs MKL (total+build) | Energy↓ | PASS | Pert
Mistral-7B q_proj [REAL] | Intel i7 | 95% | 21.45× | 18.58× | 95% | ✓ | ✓
Mistral-7B up_proj [REAL] | Intel i7 | 95% | 17.98× | 15.73× | 94% | ✓ | ✓
Mistral-7B down_proj [REAL] | Intel i7 | 95% | 18.86× | 16.32× | 95% | ✓ | ✓
Mistral-7B v_proj [REAL] | Intel i7 | 95% | 20.12× | 18.32× | 95% | ✓ | ✓
Mistral-7B gate_proj [REAL] | Intel i7 | 95% | 15.70× | 13.90× | 94% | ✓ | ✓
Mistral-7B k_proj [REAL] | Intel i7 | 95% | 17.02× | 15.57× | 94% | ✓ | ✓
Mistral-7B o_proj [REAL] | Intel i7 | 95% | 14.24× | 12.59× | 93% | ✓ | ✓
Mistral-7B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Mistral-7B avg all layers [REAL] | Intel i7 | 70–95% | 8.49× | — | 83% | 28/28 | 28/28
Qwen3-8B — peak results at 95% sparsity
Qwen3-8B down_proj [REAL] ★ | Intel i7 | 95% | 20.86× | 17.88× | 95% | ✓ | ✓
Qwen3-8B q_proj [REAL] | Intel i7 | 95% | 19.38× | 16.61× | 95% | ✓ | ✓
Qwen3-8B gate_proj [REAL] | Intel i7 | 95% | 18.05× | 15.14× | 95% | ✓ | ✓
Qwen3-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Qwen3-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.59× | — | 84% | 28/28 | 28/28
Gemma4-E4B (Google) — peak results at 95% sparsity
Gemma4-E4B up_proj [REAL] ★ | Intel i7 | 95% | 19.56× | 17.29× | 95% | ✓ | ✓
Gemma4-E4B o_proj [REAL] | Intel i7 | 95% | 17.58× | 15.98× | 94% | ✓ | ✓
Gemma4-E4B gate_proj [REAL] | Intel i7 | 95% | 16.07× | 14.65× | 94% | ✓ | ✓
Gemma4-E4B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Gemma4-E4B avg all layers [REAL] | Intel i7 | 70–95% | 7.20× | — | 81% | 28/28 | 28/28
Phi-4 (Microsoft) — 2 layer types · 8/8 PASS
Phi-4 down_proj [REAL] ★ | Intel i7 | 95% | 14.82× | 13.23× | 93% | ✓ | ✓
Phi-4 o_proj [REAL] | Intel i7 | 95% | 13.32× | 11.62× | 93% | ✓ | ✓
DeepSeek-R1-7B — peak results at 95% sparsity · 28/28 PASS
DeepSeek-R1-7B down_proj [REAL] ★ | Intel i7 | 95% | 17.22× | 15.22× | 94% | ✓ | ✓
DeepSeek-R1-7B gate_proj [REAL] | Intel i7 | 90% | 13.48× | 11.82× | 93% | ✓ | ✓
DeepSeek-R1-7B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 85% | 28/28 | 28/28
Qwen2.5-7B (Alibaba) — peak results at 95% sparsity · 28/28 PASS
Qwen2.5-7B down_proj [REAL] ★ | Intel i7 | 95% | 17.40× | 15.54× | 94% | ✓ | ✓
Qwen2.5-7B q_proj [REAL] | Intel i7 | 95% | 16.44× | 14.74× | 94% | ✓ | ✓
Qwen2.5-7B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 83% | 28/28 | 28/28
Llama-3.2-3B + Llama-3.1-8B (Meta) · 56/56 PASS
Llama-3.2-3B down_proj [REAL] ★ | Intel i7 | 95% | 18.07× | 15.60× | 94% | ✓ | ✓
Llama-3.1-8B down_proj [REAL] | Intel i7 | 95% | 15.08× | 13.25× | 93% | ✓ | ✓
Llama-3.2-3B avg / Llama-3.1-8B avg [REAL] | Intel i7 | 70–95% | 7.4× / 7.5× | — | 83% | 56/56 | 56/56
Gemma-2-2B (Google) — peak results at 95% sparsity · 28/28 PASS
Gemma-2-2B down_proj [REAL] ★ | Intel i7 | 95% | 18.71× | 16.81× | 95% | ✓ | ✓
Gemma-2-2B gate_proj [REAL] | Intel i7 | 95% | 15.95× | 14.35× | 94% | ✓ | ✓
Gemma-2-2B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 85% | 28/28 | 28/28
TOTAL CPU: 9 models · 332/332 PASS · Avg 7.37× · Peak 24.27×
Phi-4 (Microsoft) — peak at 95%
Phi-4 o_proj [REAL] ★ | Intel i7 | 95% | 19.44× | 17.04× | 95% | ✓ | ✓
Phi-4 down_proj [REAL] | Intel i7 | 95% | 16.12× | 13.93× | 94% | ✓ | ✓
DeepSeek-R1-Distill-7B — peak at 95%
DeepSeek-R1-7B q_proj [REAL] ★ | Intel i7 | 95% | 21.11× | 18.68× | 95% | ✓ | ✓
DeepSeek-R1-7B down_proj [REAL] | Intel i7 | 95% | 20.41× | 17.65× | 95% | ✓ | ✓
Qwen2.5-7B — peak at 95%
Qwen2.5-7B gate_proj [REAL] ★ | Intel i7 | 95% | 59.70× | — | 98% | ✓ | ✓
Qwen2.5-7B down_proj [REAL] | Intel i7 | 95% | 21.26× | — | 95% | ✓ | ✓
Llama-3.2-3B (Meta) — peak at 95%
Llama-3.2-3B up_proj [REAL] ★ | Intel i7 | 95% | 19.83× | — | 95% | ✓ | ✓
Llama-3.2-3B gate_proj [REAL] | Intel i7 | 95% | 17.23× | — | 94% | ✓ | ✓
Llama-3.1-8B (Meta) — peak at 95%
Llama-3.1-8B q_proj [REAL] ★ | Intel i7 | 95% | 24.44× | 22.20× | 96% | ✓ | ✓
Llama-3.1-8B down_proj [REAL] | Intel i7 | 95% | 19.29× | 18.12× | 95% | ✓ | ✓
Llama-3.1-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Llama-3.1-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.26× | — | 84% | 28/28 | 28/28
Gemma-2-2B (Google) — peak at 95%
Gemma-2-2B up_proj [REAL] ★ | Intel i7 | 95% | 20.07× | 20.31× | 95% | ✓ | ✓
Gemma-2-2B down_proj [REAL] | Intel i7 | 95% | 18.67× | 16.95× | 95% | ✓ | ✓
Gemma-2-2B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Gemma-2-2B avg all layers [REAL] | Intel i7 | 70–95% | 8.48× | — | 83% | 28/28 | 28/28
▶ Google Colab Intel Xeon @ 2.20GHz · 2 cores · 13GB RAM · FP32 · rolvprimitive wheel — 125/125 PASS · 5 sparsity levels (70–99%)
SmolLM2-1.7B (HuggingFace) ★ — peak 27.26×
SmolLM2-1.7B gate_proj [REAL] ★ | Xeon Colab | 95% | 27.26× | — | 96% | ✓ | ✓
SmolLM2-1.7B up_proj [REAL] | Xeon Colab | 95% | 24.29× | — | 96% | ✓ | ✓
SmolLM2-1.7B avg all layers [REAL] | Xeon Colab | 70–95% | 8.67× | — | 79% | 20/20 | 20/20
Qwen2.5-1.5B (Alibaba) — peak 27.61×
Qwen2.5-1.5B up_proj [REAL] ★ | Xeon Colab | 95% | 27.61× | — | 96% | ✓ | ✓
Qwen2.5-1.5B gate_proj [REAL] | Xeon Colab | 95% | 17.04× | — | 94% | ✓ | ✓
Qwen2.5-1.5B avg all layers [REAL] | Xeon Colab | 70–95% | 6.70× | — | 76% | 20/20 | 20/20
Llama-3.2-1B (Meta) — peak 25.97×
Llama-3.2-1B up_proj [REAL] ★ | Xeon Colab | 95% | 25.97× | — | 95% | ✓ | ✓
Llama-3.2-1B avg all layers [REAL] | Xeon Colab | 70–95% | 7.15× | — | 78% | 20/20 | 20/20
Gemma-2-2B on Colab Xeon — peak 28.62× (confirms i7 results on a different CPU)
Gemma-2-2B gate_proj [REAL] ★ | Xeon Colab | 95% | 28.62× | — | 95% | ✓ | ✓
Gemma-2-2B avg all layers [REAL] | Xeon Colab | 70–95% | 7.09× | — | 78% | 20/20 | 20/20
Llama-3.2-3B on Colab Xeon — peak 27.16×
Llama-3.2-3B up_proj [REAL] ★ | Xeon Colab | 95% | 27.16× | — | 96% | ✓ | ✓
Llama-3.2-3B gate_proj [REAL] | Xeon Colab | 95% | 16.99× | — | 94% | ✓ | ✓
Llama-3.2-3B avg all layers [REAL] | Xeon Colab | 70–95% | 8.09× | — | 81% | 20/20 | 20/20
i7 combined total: 252/252 PASS · Meta · Alibaba · Google · Microsoft · DeepSeek · same Intel i7 laptop
AMD EPYC 7B13 synthetic | EPYC | 90% | 8.5× | — | 89% | ✓ | ✓
★ NEW — Prereq-compliant harness · SGX sim · 45/45 PASS · 4 SHA-256 hashes + perturbation PASS every case · scipy CSR baseline included
Model / Layer | CPU | Sparsity | vs MKL | vs scipy CSR | Energy↓ | PASS | Pert
SmolLM2-1.7B (HuggingFace) — gate/up/down_proj — 70–99% induced row sparsity
SmolLM2-1.7B gate_proj [REAL] | Intel i7 | 95% | 45.12× | 18.34× | 98% | ✓ | ✓
SmolLM2-1.7B up_proj [REAL] ★ | Intel i7 | 99% | 98.71× | 8.65× | 99% | ✓ | ✓
SmolLM2-1.7B down_proj [REAL] | Intel i7 | 90% | 20.06× | 17.56× | 95% | ✓ | ✓
Qwen2.5-1.5B (Alibaba) — gate/up/down_proj — 70–99% induced row sparsity
Qwen2.5-1.5B gate_proj [REAL] | Intel i7 | 95% | 43.71× | 16.84× | 98% | ✓ | ✓
Qwen2.5-1.5B up_proj [REAL] | Intel i7 | 95% | 40.18× | 17.08× | 98% | ✓ | ✓
Qwen2.5-1.5B down_proj [REAL] ★ | Intel i7 | 99% | 101.32× | 9.16× | 99% | ✓ | ✓
Llama-3.2-1B (Meta) — gate/up/down_proj — 70–99% induced row sparsity
Llama-3.2-1B gate_proj [REAL] | Intel i7 | 95% | 44.85× | 16.74× | 98% | ✓ | ✓
Llama-3.2-1B up_proj [REAL] ★ | Intel i7 | 99% | 106.34× | 9.85× | 99% | ✓ | ✓
Llama-3.2-1B down_proj [REAL] ★ PEAK | Intel i7 | 99% | 106.65× | 9.07× | 99% | ✓ | ✓
NEW TOTAL: 45/45 PASS · Peak 106.65× · Avg 32.03× · All errors at machine epsilon · 4 SHA-256 hashes + perturbation PASS · scipy CSR second baseline

Intel i7 laptop (4 cores, 68GB RAM) · Mistral-7B + Qwen3-8B + Gemma4-E4B + Phi-4 + DeepSeek-R1-7B + Qwen2.5-7B + Llama-3.2-3B + Llama-3.1-8B + Gemma-2-2B real HuggingFace weights · MKL baseline · Speedup includes ROLV build time · 252/252 PASS (i7) + 125/125 PASS (Colab Xeon wheel, 5-level) = 377/377 total · 377/377 perturbation PASS · 1,000 iters · ATOL=0.05

Hardware | Matrix | Sparsity | Baseline ms (cuSPARSE / MKL) | ROLV ms | ROLV wins | PASS
NVIDIA H200 | LLaMA up_proj | 80% | 5.90 | 0.61 | 9.53× | ✓
NVIDIA H200 | LLaMA up_proj | 90% | 3.01 | 0.34 | 8.66× | ✓
NVIDIA B200 | Mixtral-8×7B MoE | 75% | 25.65 | 0.234 | 109× | ✓
NVIDIA B200 | Llama-4-Scout MoE | 94% | 9.14 | 0.088 | 103× | ✓
NVIDIA B200 | 10k×10k synthetic | 70% | 4.31 | 0.36 | 12.06× | ✓
AMD MI300X | 10k×10k synthetic | 85% | 74.27 | 0.89 | 83.77× | ✓
Intel i7 CPU | Mistral-7B q_proj | 95% | 66.4 | 3.18 | 14.01× | ✓

cuSPARSE is NVIDIA’s own sparse library — tuned by hundreds of engineers. ROLV beats it everywhere because dense matmul on a small submatrix outperforms CSR index lookups for LLM weight patterns. AMD MI300X uses rocSPARSE which has a known performance regression at high sparsity — rocBLAS 8.5× comparison also published.
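The mechanism can be illustrated in a few lines of NumPy. This is a sketch of the general row-skipping idea described above, not ROLV's actual implementation: gather the rows of W that contain any non-zero, run a dense matmul on that submatrix, and leave the rest of the output at zero.

```python
import numpy as np

def row_sparse_matmul(W, x):
    """Dense matmul restricted to the live (non-zero) rows of W."""
    live = np.flatnonzero(np.any(W != 0, axis=1))   # rows with any non-zero entry
    y = np.zeros(W.shape[0], dtype=W.dtype)
    y[live] = W[live] @ x                           # dense GEMV on the submatrix
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))
W[rng.random(1024) < 0.9] = 0.0                     # ~90% of rows exactly zero
x = rng.standard_normal(512)
assert np.allclose(row_sparse_matmul(W, x), W @ x)  # same result, ~10% of the work
```

The gathered submatrix keeps a dense layout, so the work runs through the ordinary GEMM/GEMV path instead of per-element CSR index lookups — which is the effect the comparison above is measuring.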

Calculators

Measure. Switch. Save.

Three tools to quantify ROLV’s impact on your infrastructure.

△ ROLV Unit™ — Measure True Compute Efficiency

The ROLV Unit™ is a normalised measure of compute efficiency that accounts for sparsity. Unlike TFLOPS (which measures peak theoretical throughput) or tokens/s (which conflates hardware and software), the ROLV Unit measures useful compute — work done on non-zero elements only.

1 ROLV Unit = 1 TFLOP of compute on live (non-zero) matrix elements per second, at full precision, verified by SHA-256 hash.

Your Compute in ROLV Units
Without ROLV
562 RU
wasted on zero rows
With ROLV
2,250 RU
all compute is useful
Cluster efficiency gain
4.0× more useful compute — same hardware
ROLV Unit = TFLOPS on verified non-zero elements. Vendor TFLOPS counts all compute including zero rows.
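The card above is straightforward to reproduce. A sketch of the arithmetic, assuming the example cluster delivers 2,250 TFLOPS of raw throughput at 75% row sparsity (my reading of the example's inputs, not a published spec):

```python
# ROLV Units as defined above: TFLOPS that land on live (non-zero) rows.
# A dense kernel spends (1 - sparsity) of its throughput on useful work;
# a row-skipping kernel spends all of it.

def rolv_units(hardware_tflops, row_sparsity, skips_zero_rows):
    if skips_zero_rows:
        return hardware_tflops                    # every FLOP lands on live rows
    return hardware_tflops * (1 - row_sparsity)   # the rest is wasted on zeros

dense_ru = rolv_units(2_250, 0.75, skips_zero_rows=False)
rolv_ru = rolv_units(2_250, 0.75, skips_zero_rows=True)
print(f"{dense_ru:.0f} RU -> {rolv_ru:.0f} RU "
      f"({rolv_ru / dense_ru:.1f}x more useful compute, same hardware)")
```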
▶ ROLVswitch™ & VRAM — Crossover & Memory Calculator

ROLVswitch™ finds the exact sparsity where ROLV beats dense, and whether your matrix fits in VRAM. Enter your matrix dimensions and hardware to get the switch point and memory analysis.

Crossover is dtype + index dependent. VRAM analysis uses dense size = M×K×dtype bytes.
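The note above pins down the memory model, which fits in a few lines. A sketch assuming a CSR-style sparse layout with int32 indices, with matrix dimensions borrowed from a LLaMA-style up_proj purely as an example:

```python
# VRAM check from the note above: dense bytes = M*K*dtype_bytes; a CSR-style
# layout stores (dtype_bytes + index_bytes) per non-zero plus M+1 row pointers.

def vram_report(M, K, sparsity, dtype_bytes=2, index_bytes=4, vram_gb=80):
    dense = M * K * dtype_bytes
    nnz = int(M * K * (1 - sparsity))
    sparse = nnz * (dtype_bytes + index_bytes) + (M + 1) * index_bytes
    return dense, sparse, dense <= vram_gb * 1024**3

dense, sparse, fits = vram_report(M=14_336, K=4_096, sparsity=0.95)
print(f"dense {dense/1e6:.0f} MB, sparse {sparse/1e6:.1f} MB, fits in VRAM: {fits}")
```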
■ RSMT™ — Sparse Storage Threshold Calculator

RSMT™ finds the exact sparsity threshold where sparse storage beats dense for your dtype. Below the threshold, dense storage wins on memory. Above it, sparse wins — and ROLV wins on compute too.

Why RSMT™ Matters

The crossover point depends entirely on your dtype. With bfloat16 values (2 bytes) and int32 indices (4 bytes), sparse storage costs 6 bytes per non-zero — 3× the 2 bytes dense storage spends per element. Sparse wins only when you have enough zeros to overcome the index overhead.

Your MoE models at bfloat16
Mixtral-8×7B: 75%  ✓ well above crossover
Qwen3-30B-A3B: 93.8%  ✓ far above crossover
Llama-4-Scout: 93.8%  ✓ far above crossover
DeepSeek-V3: 96.9%  ✓ extreme advantage

RSMT™ is computed analytically — no approximation. The crossover is mathematically exact for any dtype combination.
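Since the crossover is analytic, it fits in a few lines. A sketch assuming a CSR-style format with one index per non-zero (the 3× byte overhead quoted above):

```python
# Sparse storage wins on memory when nnz*(value_bytes + index_bytes)
# < total*value_bytes, i.e. density < value_bytes/(value_bytes + index_bytes).

def rsmt_threshold(value_bytes, index_bytes):
    max_density = value_bytes / (value_bytes + index_bytes)
    return 1 - max_density            # minimum sparsity for sparse to win

bf16_int32 = rsmt_threshold(2, 4)     # bfloat16 values, int32 indices
print(f"bfloat16 + int32: sparse storage wins above {bf16_int32:.1%} sparsity")
for model, sp in [("Mixtral-8x7B", 0.75), ("DeepSeek-V3", 0.969)]:
    side = "above" if sp > bf16_int32 else "below"
    print(f"{model}: {sp:.1%} is {side} the crossover")
```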

Enterprise & Institutional Evaluation

Three ways to evaluate.
From open demo to hardware-locked secure deployment.

All enterprise results are RolvKey™-signed — SHA-256 over your speedup, your processor fingerprint, and a time-bounded attestation.

Open Demo

Free token.
One click.

Docker container or standalone Python file. Any hardware. No commitment.

Get free token ↑
Recommended
Secure Container

RolvKey™ authenticated.
Hardware-locked.

Evaluation licence + NDA. Processor fingerprint binding. Optional Intel SGX hardware encryption.

Contact rolv@rolv.ai →
Direct Hardware

No Docker.
Single authenticated file.

Bare-metal servers, air-gapped environments. PyArmor-obfuscated, processor-bound, heartbeat-protected.

Contact rolv@rolv.ai →
RolvKey™ — New IP — Patent Pending

A second invention, born from protecting the first.

In building the secure distribution system for ROLV Primitive©, we developed a novel software protection architecture that we believe has standalone commercial value, entirely apart from ROLV itself.

RolvKey™ uses a proprietary multi-layer mathematical key derivation system. Every key exchange is unique and time-bounded to a window of seconds. A captured response is worthless moments later. An attacker who somehow breaks the first layer immediately faces a second independent layer, then a third — each seeded with a completely different secret.

The only viable attack requires simultaneously compromising multiple independent systems within a narrow time window. For any commercial adversary this is not a realistic threat model.
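RolvKey's internals are proprietary, but the time-bounding and layering properties described above can be illustrated with a generic sketch. This is NOT the RolvKey algorithm — just standard HMAC chaining over a rolling time window, shown to make the threat-model claim concrete: a captured response stops verifying once the window rolls over, and forging one requires every layer's secret at once.

```python
import hashlib
import hmac
import time

# Placeholder secrets for illustration only — one independent secret per layer.
SECRETS = [b"layer-one-secret", b"layer-two-secret", b"layer-three-secret"]

def sign(message: bytes, window_s: int = 30, now=None) -> str:
    window = int((now or time.time()) // window_s)   # current time window
    tag = message + str(window).encode()
    for secret in SECRETS:                           # chain through each layer
        tag = hmac.new(secret, tag, hashlib.sha256).digest()
    return tag.hex()

def verify(message: bytes, tag_hex: str, window_s: int = 30, now=None) -> bool:
    return hmac.compare_digest(tag_hex, sign(message, window_s, now))

t = 1_700_000_010                                    # fixed timestamp for the demo
tag = sign(b"benchmark-attestation", now=t)
assert verify(b"benchmark-attestation", tag, now=t)       # inside the window
assert not verify(b"benchmark-attestation", tag, now=t + 60)  # replay expires
```

A production scheme would also handle clock skew across the window boundary; this sketch omits that to stay short.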

Market opportunity

Every software company shipping proprietary compiled code faces the same distribution security problem. Current solutions — hardware dongles, standard license servers, code obfuscation — have well-documented weaknesses. The academic literature identified this specific application — software distribution key management and API attestation — as commercially unsolved. RolvKey™ addresses it.

Live right now

RolvKey™ is protecting ROLV Primitive© today. Every Docker container download, every key exchange, every benchmark run on every machine worldwide is secured by this system. It has been exercised thousands of times in production.

Licensing and partnership enquiries: rolv@rolv.ai

Independent Verification

Every result is independently verifiable.

4 SHA-256 hashes per case. Perturbation test on every result. ATOL=0.05 on column-normalised fp64. 1,684/1,684 GPU PASS · 332/332 CPU PASS. Download the full validation kit with harness code, raw outputs, and reproduction instructions.

↓ Full Benchmark PDF
Contact

Contact Us

rolv@rolv.ai
Patent Pending · Provisional App. 64/040,896
ROLV LLC · Fort Lauderdale, FL