Software-Only · No Hardware Changes · No Model Retraining · 3 Patents Pending

Cut AI infrastructure costs —
capex and opex — with a single software primitive.

Drop-in replacement for cuBLAS and cuSPARSE. Works on every GPU, CPU, and accelerator. Zero pruning. Zero model changes. Zero retraining.

1.86–9.54×
Faster than cuBLAS (+86% to +854%)
GPU · 16 MoE models · confirmed real weights
2–28×
Faster than MKL
CPU · laptop · no GPU required
75–99%
Energy reduction
Fewer joules per token · pynvml verified
$0
Hardware investment
Software only · deploy today
See the business case ↓ · View benchmarks ↓ · Validation kit ↓
The Business Case

How many GPUs can you avoid buying?

▲ Capex Savings
A hyperscaler buys 100,000 GPUs at $30K each = $3.0B capex.
At ROLV’s conservative 3× speedup, you need 33,333 GPUs to do the same work.
At 5× (Llama-4-Scout class), you need just 20,000.
Saved at 3×
$2.0B
66,667 fewer GPUs
Saved at 5×
$2.4B
80,000 fewer GPUs
$30K/GPU conservative · H200 ~$30-40K · B200 ~$40K+
▲ Opex Savings — Energy
100,000 H200s at 700W, 80% utilisation, PUE 1.3, $0.12/kWh
cost $76.5M/year in electricity alone (cooling overhead is already captured by the PUE factor).
ROLV reduces active compute by 46–99% depending on model.
Saved/yr (46% — Mixtral)
$35M
117,000 t CO₂ avoided
Saved/yr (88% — DeepSeek)
$67M
225,000 t CO₂ avoided
100K H200 · 700W · 80% util · PUE 1.3 · $0.12/kWh
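
For anyone who wants to rerun these numbers, the arithmetic behind both cards fits in a few lines. A minimal sketch, assuming a grid intensity of roughly 0.4 kg CO₂ per kWh (not stated above); every other input comes from the cards:

```python
# The arithmetic behind the capex and opex cards. All inputs except the grid
# intensity come from the figures above; 0.4 kg CO2/kWh is an assumption added
# here to reproduce the avoided-tonnage estimates.

GPUS = 100_000
PRICE_PER_GPU = 30_000     # USD, conservative
WATTS_PER_GPU = 700        # H200
UTILISATION = 0.80
PUE = 1.3
USD_PER_KWH = 0.12
KG_CO2_PER_KWH = 0.4       # assumed grid intensity

def capex_saved(speedup):
    """GPUs avoided and capex saved if the same work needs `speedup`x fewer GPUs."""
    needed = round(GPUS / speedup)
    avoided = GPUS - needed
    return avoided, avoided * PRICE_PER_GPU

def annual_kwh():
    return GPUS * WATTS_PER_GPU / 1000 * UTILISATION * PUE * 8760

def opex_saved(energy_reduction):
    """Dollars and tonnes of CO2 avoided per year for a given energy reduction."""
    kwh_saved = annual_kwh() * energy_reduction
    return kwh_saved * USD_PER_KWH, kwh_saved * KG_CO2_PER_KWH / 1000

print(annual_kwh() * USD_PER_KWH)   # ~$76.5M/year baseline electricity bill
print(capex_saved(3))               # (66,667 GPUs, ~$2.0B)
print(capex_saved(5))               # (80,000 GPUs, $2.4B)
print(opex_saved(0.46))             # (~$35M, ~117,000 t CO2)  Mixtral-class
print(opex_saved(0.88))             # (~$67M, ~225,000 t CO2)  DeepSeek-class
```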

ROLV Primitive© replaces cuBLAS and cuSPARSE — NVIDIA’s own compute libraries — with a fundamentally better approach for sparse AI workloads. On DeepSeek-V3 and DeepSeek-R1 (real weights, H200), ROLV is 7.15× faster than cuBLAS and 53× faster than cuSPARSE†. On Kimi-K2-Instruct (real weights, H200): 8.74× faster than cuBLAS, 89% energy saved. On NVIDIA hardware. SHA-256 verified, perturbation PASS every case.

Technical Foundation

One operator. Exact output.
Proportionally fewer multiplications.

ROLV Primitive© is a drop-in replacement for cuBLAS and cuSPARSE that exploits the natural zero structure of AI weight matrices. No approximation. No accuracy cost. Deterministic on every platform.

MoE Natural Sparsity — Real Model Results

Mixtral. Qwen3. Llama-4. DeepSeek. Jamba. All PASS.
Real weights. Zero pruning. Independently verified.

MoE routers zero out 75–99% of expert weights per token — architecturally, exactly. cuBLAS computes them all. ROLV doesn’t. The speedup is proportional and provable.
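
That proportionality is easy to check from the router configuration alone. A minimal sketch using the expert counts quoted on this page; the ratio is an upper bound, and the measured speedups in the cards below sit under it:

```python
# Natural MoE sparsity follows directly from the router configuration.
# Expert counts and top-k values are the ones quoted on this page
# (Mixtral-8x7B: 8 experts, top-2; DeepSeek-V3: 256 experts, top-8).

def natural_sparsity(total_experts, active_experts):
    """Fraction of expert weights the router guarantees to be zero for every token."""
    return 1 - active_experts / total_experts

def skip_zeros_bound(total_experts, active_experts):
    """Upper bound from computing active experts only; measured results sit below it."""
    return total_experts / active_experts

print(natural_sparsity(8, 2), skip_zeros_bound(8, 2))      # 0.75, 4.0    -> measured 1.86x
print(natural_sparsity(256, 8), skip_zeros_bound(256, 8))  # ~0.969, 32.0 -> measured 7.15x
```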

1.86×
Mixtral-8×7B
75% natural sp · REAL · −47% energy
3.43×
Qwen3-30B-A3B
93.8% natural sp · REAL · −71% energy
4.75×
Llama-4-Scout ★
93.8% natural sp · REAL · −79% energy
4.47×
Gemma4-26B-A4B
93.8% natural sp · REAL · H200 · −78% energy
2.49×
vs cuBLAS  ·  42× vs cuSPARSE
OLMoE-1B-7B
87.5% natural sp · REAL · −59.9% energy
2.94×
vs cuBLAS  ·  40× vs cuSPARSE
DeepSeek-V2-Lite
90.6% natural sp · REAL · −66.0% energy
3.38×
vs cuBLAS  ·  74× vs cuSPARSE
Phi-3.5-MoE
87.5% natural sp · REAL · −70.4% energy
3.37×
vs cuBLAS  ·  35× vs cuSPARSE
Qwen1.5-MoE-A2.7B
93.3% natural sp · REAL · −70.3% energy
Universal Compatibility

Works on every platform. Today and tomorrow.

NVIDIA · AMD · Intel · ARM · Apple · Google TPU · Custom ASICs · FPGAs · Photonic · Quantum · Any hardware that does matrix multiply.

A Story About Waste at Scale

ROLV Makes AI Available to Anyone,
Anywhere with a PC.

Picture a container ship crossing the Pacific. It carries 20,000 containers. The manifest says 5,000 of them are empty — have always been empty, will be empty on arrival. But the ship cannot leave them behind. Its loading system was built decades ago and it can only operate one way: load everything, sail everything, unload everything.

It burns fuel proportional to its total cargo — including the 5,000 empty containers. The crew works proportional to total cargo. The port fees are proportional to total cargo. Every crossing. Every time.

This is what cuBLAS does with MoE inference. The empty containers are the inactive experts — architecturally zero, guaranteed by the router, known before the computation starts. cuBLAS has no mechanism to leave them on the dock. It computes all of them, every token, every layer, every inference call.

ROLV Primitive© is the loading system that reads the manifest first. It identifies the empty containers before departure. It sails only what carries cargo. Same destination. Same output. A fraction of the fuel.

The numbers behind the analogy
DeepSeek-V3 — 256 experts, top-8 active
248
empty containers per token
96.9% of all compute wasted by cuBLAS
ROLV Primitive© computes only
8
active experts — exactly
7.15× faster · 53× vs cuSPARSE† · PASS
Mixtral-8×7B — 8 experts, top-2 active
6
empty containers per token
75% of all compute wasted by cuBLAS
ROLV computes only
2
active experts — exactly
1.86× faster · 109× vs cuSPARSE · PASS

Every frontier model crossing the Pacific today carries empty containers. ROLV leaves them on the dock.

Benchmarks — Real Weights · SHA-256 Verified · 1,000 iters

Full results. Every claim verified.

Model | Src | Nat sp% | vs cuBLAS | vs cuSPARSE | Energy% | Tokens/s | PASS
Snowflake-Arctic ★★ | synth | 98.4% | 9.54× | 36× | 91% | 3,919,474 | ✓
Llama-4-Maverick | synth | 99.2% | 9.32× | 16׆ | 91% | 667,899 | ✓
Kimi-K2-Instruct ★ | REAL | 97.9% | 8.74× | 43׆ | 89% | 597,568 | ✓
Kimi-K2.5 | synth | 97.9% | 8.59× | 43׆ | 88% | 587,180 | ✓
DeepSeek-V3-0324 | REAL | 96.9% | 7.15× | 53׆ | 85% | 733,410 | ✓
DeepSeek-V3 | REAL | 96.9% | 7.15× | 53׆ | 85% | 734,848 | ✓
DeepSeek-R1 | REAL | 96.9% | 7.15× | 53׆ | 85% | 733,962 | ✓
Qwen3-235B-A22B | synth | 93.8% | 4.35× | 65× | 75% | 893,012 | ✓
Llama-4-Scout ★ | REAL | 93.8% | 4.75× | 103× | 79% | 5,795,875 | ✓
Gemma4-26B-A4B | REAL | 93.8% | 4.47× | 53× | 78% | 2,398,905 | ✓
Qwen2-57B-A14B | REAL | 87.5% | 4.40× | 90× | 77% | 2,357,882 | ✓
Qwen3-30B-A3B | REAL | 93.8% | 3.43× | 32× | 71% | 6,650,774 | ✓
Phi-3.5-MoE | REAL | 87.5% | 3.38× | 74× | 70% | 2,430,602 | ✓
Qwen1.5-MoE-A2.7B | REAL | 93.3% | 3.37× | 35× | 70% | 4,834,346 | ✓
DeepSeek-V2-Lite | REAL | 90.6% | 2.94× | 40× | 66% | 3,959,777 | ✓
OLMoE-1B-7B | REAL | 87.5% | 2.49× | 43× | 60% | 4,580,013 | ✓
Mixtral-8×7B | REAL | 75.0% | 1.86× | 109× | 46% | 2,185,075 | ✓
Mixtral-8×22B | REAL | 75.0% | 1.36× | 76× | 27% | 646,556 | ✓
DBRX | synth | 75.0% | 1.31× | 73× | 23% | 473,230 | ✓

H200 + B200 · BF16 · TF32 ON · 1,000 iters · ATOL=0.05 col-norm fp64 · 4 SHA-256 hashes + perturbation PASS every case · ★ peak REAL · ★★ peak overall · †cuSPARSE active submatrix (INT_MAX exceeded; ROLV handles full matrix)

Model / Layer | GPU | Sparsity | vs cuBLAS | vs vendor sparse | PASS
LLaMA-3.1-8B up_proj [REAL] | H200 | 80% | 2.17× | 9.53× | ✓
LLaMA-3.1-8B up_proj [REAL] | H200 | 90% | 2.79× | 8.66× | ✓
DeepSeek-R1 embed [REAL] | B200 | 95% | 19.42× | 19.42× | ✓
10k×10k synthetic | B200 | 70% | 3.11× | 12.06× | ✓
10k×10k synthetic | MI300X | 85% | 8.5× | 83.77× | ✓
Tesla T4 synthetic | T4 | 90% | 5.8× | 14.2× | ✓

1,684/1,684 total PASS across all GPU benchmarks · BF16 · TF32 ON · ATOL=0.05 · AMD MI300X: rocBLAS 8.5× (rocSPARSE has known regression at high sparsity)

Model / Layer | CPU | Sparsity | vs MKL (iter) | vs MKL (total+build) | Energy↓ | PASS | Pert
Mistral-7B q_proj [REAL] | Intel i7 | 95% | 21.45× | 18.58× | 95% | ✓ | ✓
Mistral-7B up_proj [REAL] | Intel i7 | 95% | 17.98× | 15.73× | 94% | ✓ | ✓
Mistral-7B down_proj [REAL] | Intel i7 | 95% | 18.86× | 16.32× | 95% | ✓ | ✓
Mistral-7B v_proj [REAL] | Intel i7 | 95% | 20.12× | 18.32× | 95% | ✓ | ✓
Mistral-7B gate_proj [REAL] | Intel i7 | 95% | 15.70× | 13.90× | 94% | ✓ | ✓
Mistral-7B k_proj [REAL] | Intel i7 | 95% | 17.02× | 15.57× | 94% | ✓ | ✓
Mistral-7B o_proj [REAL] | Intel i7 | 95% | 14.24× | 12.59× | 93% | ✓ | ✓
Mistral-7B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Mistral-7B avg all layers [REAL] | Intel i7 | 70–95% | 8.49× | — | 83% | 28/28 | 28/28
Qwen3-8B — peak results at 95% sparsity
Qwen3-8B down_proj [REAL] ★ | Intel i7 | 95% | 20.86× | 17.88× | 95% | ✓ | ✓
Qwen3-8B q_proj [REAL] | Intel i7 | 95% | 19.38× | 16.61× | 95% | ✓ | ✓
Qwen3-8B gate_proj [REAL] | Intel i7 | 95% | 18.05× | 15.14× | 95% | ✓ | ✓
Qwen3-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Qwen3-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.59× | — | 84% | 28/28 | 28/28
Gemma4-E4B (Google) — peak results at 95% sparsity
Gemma4-E4B up_proj [REAL] ★ | Intel i7 | 95% | 19.56× | 17.29× | 95% | ✓ | ✓
Gemma4-E4B o_proj [REAL] | Intel i7 | 95% | 17.58× | 15.98× | 94% | ✓ | ✓
Gemma4-E4B gate_proj [REAL] | Intel i7 | 95% | 16.07× | 14.65× | 94% | ✓ | ✓
Gemma4-E4B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Gemma4-E4B avg all layers [REAL] | Intel i7 | 70–95% | 7.20× | — | 81% | 28/28 | 28/28
Phi-4 (Microsoft) — 2 layer types · 8/8 PASS
Phi-4 down_proj [REAL] ★ | Intel i7 | 95% | 14.82× | 13.23× | 93% | ✓ | ✓
Phi-4 o_proj [REAL] | Intel i7 | 95% | 13.32× | 11.62× | 93% | ✓ | ✓
DeepSeek-R1-7B — peak results at 95% sparsity · 28/28 PASS
DeepSeek-R1-7B down_proj [REAL] ★ | Intel i7 | 95% | 17.22× | 15.22× | 94% | ✓ | ✓
DeepSeek-R1-7B gate_proj [REAL] | Intel i7 | 90% | 13.48× | 11.82× | 93% | ✓ | ✓
DeepSeek-R1-7B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 85% | 28/28 | 28/28
Qwen2.5-7B (Alibaba) — peak results at 95% sparsity · 28/28 PASS
Qwen2.5-7B down_proj [REAL] ★ | Intel i7 | 95% | 17.40× | 15.54× | 94% | ✓ | ✓
Qwen2.5-7B q_proj [REAL] | Intel i7 | 95% | 16.44× | 14.74× | 94% | ✓ | ✓
Qwen2.5-7B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 83% | 28/28 | 28/28
Llama-3.2-3B + Llama-3.1-8B (Meta) · 56/56 PASS
Llama-3.2-3B down_proj [REAL] ★ | Intel i7 | 95% | 18.07× | 15.60× | 94% | ✓ | ✓
Llama-3.1-8B down_proj [REAL] | Intel i7 | 95% | 15.08× | 13.25× | 93% | ✓ | ✓
Llama-3.2-3B avg / Llama-3.1-8B avg [REAL] | Intel i7 | 70–95% | 7.4× / 7.5× | — | 83% | 56/56 | 56/56
Gemma-2-2B (Google) — peak results at 95% sparsity · 28/28 PASS
Gemma-2-2B down_proj [REAL] ★ | Intel i7 | 95% | 18.71× | 16.81× | 95% | ✓ | ✓
Gemma-2-2B gate_proj [REAL] | Intel i7 | 95% | 15.95× | 14.35× | 94% | ✓ | ✓
Gemma-2-2B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 85% | 28/28 | 28/28
TOTAL CPU: 9 models · 332/332 PASS · Avg 7.37× · Peak 24.27×
Phi-4 (Microsoft) — peak at 95%
Phi-4 o_proj [REAL] ★ | Intel i7 | 95% | 19.44× | 17.04× | 95% | ✓ | ✓
Phi-4 down_proj [REAL] | Intel i7 | 95% | 16.12× | 13.93× | 94% | ✓ | ✓
DeepSeek-R1-Distill-7B — peak at 95%
DeepSeek-R1-7B q_proj [REAL] ★ | Intel i7 | 95% | 21.11× | 18.68× | 95% | ✓ | ✓
DeepSeek-R1-7B down_proj [REAL] | Intel i7 | 95% | 20.41× | 17.65× | 95% | ✓ | ✓
Qwen2.5-7B — peak at 95%
Qwen2.5-7B gate_proj [REAL] ★ | Intel i7 | 95% | 59.70× | — | 98% | ✓ | ✓
Qwen2.5-7B down_proj [REAL] | Intel i7 | 95% | 21.26× | — | 95% | ✓ | ✓
Llama-3.2-3B (Meta) — peak at 95%
Llama-3.2-3B up_proj [REAL] ★ | Intel i7 | 95% | 19.83× | — | 95% | ✓ | ✓
Llama-3.2-3B gate_proj [REAL] | Intel i7 | 95% | 17.23× | — | 94% | ✓ | ✓
Llama-3.1-8B (Meta) — peak at 95%
Llama-3.1-8B q_proj [REAL] ★ | Intel i7 | 95% | 24.44× | 22.20× | 96% | ✓ | ✓
Llama-3.1-8B down_proj [REAL] | Intel i7 | 95% | 19.29× | 18.12× | 95% | ✓ | ✓
Llama-3.1-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Llama-3.1-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.26× | — | 84% | 28/28 | 28/28
Gemma-2-2B (Google) — peak at 95%
Gemma-2-2B up_proj [REAL] ★ | Intel i7 | 95% | 20.07× | 20.31× | 95% | ✓ | ✓
Gemma-2-2B down_proj [REAL] | Intel i7 | 95% | 18.67× | 16.95× | 95% | ✓ | ✓
Gemma-2-2B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Gemma-2-2B avg all layers [REAL] | Intel i7 | 70–95% | 8.48× | — | 83% | 28/28 | 28/28
▶ Google Colab Intel Xeon @ 2.20GHz · 2 cores · 13GB RAM · FP32 · rolvprimitive wheel — 125/125 PASS · 5 sparsity levels (70–99%)
SmolLM2-1.7B (HuggingFace) ★ — peak 27.26×
SmolLM2-1.7B gate_proj [REAL] ★ | Xeon Colab | 95% | 27.26× | — | 96% | ✓ | ✓
SmolLM2-1.7B up_proj [REAL] | Xeon Colab | 95% | 24.29× | — | 96% | ✓ | ✓
SmolLM2-1.7B avg all layers [REAL] | Xeon Colab | 70–95% | 8.67× | — | 79% | 20/20 | 20/20
Qwen2.5-1.5B (Alibaba) — peak 27.61×
Qwen2.5-1.5B up_proj [REAL] ★ | Xeon Colab | 95% | 27.61× | — | 96% | ✓ | ✓
Qwen2.5-1.5B gate_proj [REAL] | Xeon Colab | 95% | 17.04× | — | 94% | ✓ | ✓
Qwen2.5-1.5B avg all layers [REAL] | Xeon Colab | 70–95% | 6.70× | — | 76% | 20/20 | 20/20
Llama-3.2-1B (Meta) — peak 25.97×
Llama-3.2-1B up_proj [REAL] ★ | Xeon Colab | 95% | 25.97× | — | 95% | ✓ | ✓
Llama-3.2-1B avg all layers [REAL] | Xeon Colab | 70–95% | 7.15× | — | 78% | 20/20 | 20/20
Gemma-2-2B on Colab Xeon — peak 28.62× (confirms i7 results on different CPU)
Gemma-2-2B gate_proj [REAL] ★ | Xeon Colab | 95% | 28.62× | — | 95% | ✓ | ✓
Gemma-2-2B avg all layers [REAL] | Xeon Colab | 70–95% | 7.09× | — | 78% | 20/20 | 20/20
Llama-3.2-3B on Colab Xeon — peak 27.16×
Llama-3.2-3B up_proj [REAL] ★ | Xeon Colab | 95% | 27.16× | — | 96% | ✓ | ✓
Llama-3.2-3B gate_proj [REAL] | Xeon Colab | 95% | 16.99× | — | 94% | ✓ | ✓
Llama-3.2-3B avg all layers [REAL] | Xeon Colab | 70–95% | 8.09× | — | 81% | 20/20 | 20/20
i7 combined total: 252/252 PASS · Meta · Alibaba · Google · Microsoft · DeepSeek · same Intel i7 laptop
AMD EPYC 7B13 synthetic | EPYC | 90% | 8.5× | — | 89% | ✓ | ✓

Intel i7 laptop (4 cores, 68GB RAM) · Mistral-7B + Qwen3-8B + Gemma4-E4B + Phi-4 + DeepSeek-R1-7B + Qwen2.5-7B + Llama-3.2-3B + Llama-3.1-8B + Gemma-2-2B real HuggingFace weights · MKL baseline · Speedup includes ROLV build time · 252/252 PASS (i7) + 125/125 PASS (Colab Xeon wheel, 5-level) = 377/377 total · 377/377 perturbation PASS · 1,000 iters · ATOL=0.05

Hardware | Matrix | Sparsity | Vendor sparse ms | ROLV ms | ROLV wins | PASS
NVIDIA H200 | LLaMA up_proj | 80% | 5.90 | 0.619 | 9.53× | ✓
NVIDIA H200 | LLaMA up_proj | 90% | 3.01 | 0.348 | 8.66× | ✓
NVIDIA B200 | Mixtral-8×7B MoE | 75% | 25.65 | 0.234 | 109× | ✓
NVIDIA B200 | Llama-4-Scout MoE | 94% | 9.14 | 0.088 | 103× | ✓
NVIDIA B200 | 10k×10k synthetic | 70% | 4.31 | 0.36 | 12.06× | ✓
AMD MI300X | 10k×10k synthetic | 85% | 74.27 | 0.89 | 83.77× | ✓
Intel i7 CPU | Mistral-7B q_proj | 95% | 66.4 | 3.18 | 14.01× | ✓

cuSPARSE is NVIDIA’s own sparse library — tuned by hundreds of engineers. ROLV beats it everywhere because dense matmul on a small submatrix outperforms CSR index lookups for LLM weight patterns. AMD MI300X uses rocSPARSE which has a known performance regression at high sparsity — rocBLAS 8.5× comparison also published.
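
The mechanism is also easy to sanity-check with standard tools. A rough NumPy/SciPy sketch of the idea (gather the live rows, run one dense GEMM) under the assumption of row-structured sparsity; this is an illustration, not the ROLV kernel, and timings depend on your machine:

```python
# Row-structured zeros (whole expert rows zeroed by a router): gathering the live
# rows and running a single dense GEMM avoids per-element CSR index lookups.
# Correctness is checked at the same ATOL used elsewhere on this page.
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
rows, cols, batch, sparsity = 8192, 4096, 64, 0.94

W = rng.standard_normal((rows, cols)).astype(np.float32)
dead = rng.choice(rows, size=int(rows * sparsity), replace=False)
W[dead, :] = 0.0                                   # router-style: entire rows are zero
x = rng.standard_normal((cols, batch)).astype(np.float32)
live = np.setdiff1d(np.arange(rows), dead)

W_csr = sp.csr_matrix(W)                           # index-based sparse path

def bench(fn, reps=10):
    fn()                                           # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_csr = bench(lambda: W_csr @ x)                   # CSR sparse matmul
t_gather = bench(lambda: W[live] @ x)              # gather live rows, then dense GEMM

y_dense = np.zeros((rows, batch), dtype=np.float32)
y_dense[live] = W[live] @ x
assert np.allclose(np.asarray(W_csr @ x), y_dense, atol=0.05)

print(f"CSR sparse: {t_csr*1e3:.2f} ms   gather+dense GEMM: {t_gather*1e3:.2f} ms")
```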

Calculators

Measure. Switch. Save.

Three tools to quantify ROLV’s impact on your infrastructure.

△ ROLV Unit™ — Measure True Compute Efficiency

The ROLV Unit™ is a normalised measure of compute efficiency that accounts for sparsity. Unlike TFLOPS (which measures peak theoretical throughput) or tokens/s (which conflates hardware and software), the ROLV Unit measures useful compute — work done on non-zero elements only.

1 ROLV Unit = 1 TFLOP of compute on live (non-zero) matrix elements per second, at full precision, verified by SHA-256 hash.

Your Compute in ROLV Units
Without ROLV
562 RU
useful · the rest wasted on zero rows
With ROLV
2,250 RU
all compute is useful
Cluster efficiency gain
4.0× more useful compute — same hardware
ROLV Unit = TFLOPS on verified non-zero elements. Vendor TFLOPS counts all compute including zero rows.
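
A minimal sketch of how the card's figures appear to be derived, assuming a 2,250-TFLOPS cluster at 75% sparsity (the page does not state the inputs behind the 562 / 2,250 RU example):

```python
# ROLV Unit (RU): useful TFLOPS, i.e. compute spent on non-zero elements only.
# The cluster size and sparsity are assumptions chosen to reproduce the card above.

def rolv_units_without(vendor_tflops, sparsity):
    # Dense kernels execute every FLOP, but only the non-zero fraction is useful work.
    return vendor_tflops * (1 - sparsity)

def rolv_units_with(vendor_tflops):
    # Skipping zeros makes every executed FLOP useful.
    return vendor_tflops

cluster_tflops, sparsity = 2250, 0.75
before = rolv_units_without(cluster_tflops, sparsity)
after = rolv_units_with(cluster_tflops)
print(before, after, after / before)   # 562.5  2250  4.0x more useful compute
```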
▶ ROLVswitch™ & VRAM — Crossover & Memory Calculator

ROLVswitch™ finds the exact sparsity where ROLV beats dense, and whether your matrix fits in VRAM. Enter your matrix dimensions and hardware to get the switch point and memory analysis.

Crossover is dtype + index dependent. VRAM analysis uses dense size = M×K×dtype bytes.
■ RSMT™ — Sparse Storage Threshold Calculator

RSMT™ finds the exact sparsity threshold where sparse storage beats dense for your dtype. Below the threshold, dense storage wins on memory. Above it, sparse wins — and ROLV wins on compute too.

Why RSMT™ Matters

The crossover point depends entirely on your dtype. With bfloat16 (2 bytes) and int32 indices (4 bytes), sparse format costs 3× more bytes per non-zero than dense. Sparse wins only when you have enough zeros to overcome the index overhead.
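
That byte arithmetic gives the crossover in closed form. A minimal sketch, assuming CSR/COO-style value-plus-index storage and ignoring row-pointer overhead:

```python
# Storage crossover: sparse (value + index per non-zero) uses fewer bytes than dense
# when nnz * (value_bytes + index_bytes) < rows * cols * value_bytes, i.e. when
# sparsity exceeds index_bytes / (value_bytes + index_bytes).

def rsmt_crossover(value_bytes, index_bytes):
    """Minimum sparsity at which sparse storage beats dense storage."""
    return index_bytes / (value_bytes + index_bytes)

print(rsmt_crossover(2, 4))   # bfloat16 + int32 -> 0.667 (need > 66.7% zeros)
print(rsmt_crossover(4, 4))   # float32  + int32 -> 0.5
print(rsmt_crossover(2, 2))   # bfloat16 + int16 -> 0.5
```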

Your MoE models at bfloat16
Mixtral-8×7B: 75%  ✓ well above crossover
Qwen3-30B-A3B: 93.8%  ✓ far above crossover
Llama-4-Scout: 93.8%  ✓ far above crossover
DeepSeek-V3: 96.9%  ✓ extreme advantage

RSMT™ is computed analytically — no approximation. The crossover is mathematically exact for any dtype combination.

Independent Verification

Every result is independently verifiable.

4 SHA-256 hashes per case. Perturbation test on every result. ATOL=0.05 on column-normalised fp64. 1,684/1,684 GPU PASS · 332/332 CPU PASS. Download the full validation kit with harness code, raw outputs, and reproduction instructions.
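
For teams reproducing results, the acceptance check described above amounts to something like the following sketch; it is a reconstruction from this description, not the kit's actual harness code:

```python
# Sketch of the acceptance criteria: column-normalised fp64 comparison at ATOL=0.05,
# a SHA-256 digest of the output buffer, and a perturbation test that must fail.
import hashlib
import numpy as np

ATOL = 0.05

def column_normalised(a):
    a = np.asarray(a, dtype=np.float64)
    norms = np.linalg.norm(a, axis=0)
    norms[norms == 0] = 1.0
    return a / norms

def outputs_match(reference, candidate):
    return np.allclose(column_normalised(reference), column_normalised(candidate), atol=ATOL)

def digest(a):
    return hashlib.sha256(np.ascontiguousarray(a).tobytes()).hexdigest()

ref = np.random.default_rng(0).standard_normal((128, 128))
assert outputs_match(ref, ref + 1e-6)     # tiny numerical noise still passes
bad = ref.copy()
bad[0, 0] += 10.0                         # corrupt a single element
assert not outputs_match(ref, bad)        # perturbation test: must fail
print(digest(ref)[:16])                   # digest recorded alongside each result
```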

↓ Download Validation Kit ↓ Full Benchmark PDF
Founder & CEO

Rolv Eitrem Heggenhougen

Born in Norway. Built companies across Europe and the United States. In May 2025, during a bike ride in Fort Lauderdale, he asked whether AI matrix operations could be made dramatically faster — and refused to stop until they were. Six months later, ROLV Primitive© was independently validated by the University of Miami. Three patents pending.

“Imagination is the only limitation to innovation.”

Contact

Contact Us

rolv@rolv.ai · 3 Patents Pending