Software-Only · No Hardware Changes · No Model Retraining · 3 Patents Pending

Cut AI infrastructure costs —
capex and opex — from a single software primitive.

Drop-in replacement for cuBLAS and cuSPARSE. Works on every GPU, CPU, and accelerator. Zero pruning. Zero model changes. Zero retraining.

2–9×
Faster than cuBLAS & cuSPARSE
6 MoE models · NVIDIA's own libraries beaten
75–99%
Energy reduction
Fewer joules per token · pynvml verified
$0
Hardware investment
Software only · deploy today
See the business case ↓ · View benchmarks ↓ · Validation kit ↓
The Business Case

How many GPUs can you not buy?

▲ Capex Savings
A hyperscaler buys 100,000 GPUs at $30K each = $3.0B capex.
At ROLV’s conservative 3× speedup, you need only 33,333 GPUs to do the same work.
At 5× (Llama-4-Scout class), you need just 20,000.
Saved at 3×
$2.0B
66,667 fewer GPUs
Saved at 5×
$2.4B
80,000 fewer GPUs
$30K/GPU conservative · H200 ~$30–40K · B200 ~$40K+
▲ Opex Savings — Energy
100,000 H200s at 700W, 80% utilisation, PUE 1.3, $0.12/kWh
costs $76.5M/year in electricity alone — before cooling overhead.
ROLV reduces active compute by 46–99% depending on model.
Saved/yr (46% — Mixtral)
$35M
117,000 t CO₂ avoided
Saved/yr (88% — DeepSeek)
$67M
225,000 t CO₂ avoided
100K H200 · 700W · 80% util · PUE 1.3 · $0.12/kWh
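The savings figures above follow from two lines of arithmetic. A minimal sketch with the stated inputs ($30K/GPU, 700W, 80% utilisation, PUE 1.3, $0.12/kWh); the helper names are illustrative, not part of any ROLV tooling:

```python
def capex_saved(n_gpus, price_usd, speedup):
    """GPUs no longer needed once each remaining GPU does `speedup`x the work."""
    still_needed = round(n_gpus / speedup)
    return (n_gpus - still_needed) * price_usd

def energy_cost_per_year(n_gpus, watts, utilisation, pue, usd_per_kwh):
    """Annual electricity bill for a fleet, including the PUE overhead."""
    kw_drawn = n_gpus * watts * utilisation * pue / 1000
    return kw_drawn * 8760 * usd_per_kwh  # 8,760 hours per year

# 100,000 GPUs at $30K each, 3x speedup: ~$2.0B of capex avoided
capex = capex_saved(100_000, 30_000, 3)

# 100,000 H200s at 700W, 80% utilisation, PUE 1.3, $0.12/kWh: ~$76.5M/yr
opex = energy_cost_per_year(100_000, 700, 0.80, 1.3, 0.12)
```

Plugging in 5× instead of 3× reproduces the $2.4B figure the same way.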

ROLV Primitive© replaces cuBLAS and cuSPARSE — NVIDIA’s own compute libraries — with a fundamentally better approach for sparse AI workloads. On MoE models like DeepSeek-V3, ROLV is 8.76× faster than cuBLAS and 110× faster than cuSPARSE. On NVIDIA hardware. Verified with SHA-256 hashes and perturbation tests.

Technical Foundation

One operator. Exact output.
Proportionally fewer multiplications.

ROLV Primitive© is a drop-in replacement for cuBLAS and cuSPARSE that exploits the natural zero structure of AI weight matrices. No approximation. No accuracy cost. Deterministic on every platform.

MoE Natural Sparsity — Real Model Results

Mixtral. Qwen3. Llama-4. All PASS.
Real weights. Zero pruning. Independently verified.

MoE routers zero out 75–99% of expert weights per token — architecturally, exactly. cuBLAS computes them all. ROLV doesn’t. The speedup is proportional and provable.
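As an illustration of that claim (a NumPy sketch, not ROLV's actual kernel), here is a Mixtral-style top-2 router: the dense path multiplies through all eight experts even though six gates are zero, while the sparse path reads the gates first and computes only the two active experts, producing the identical output:

```python
import numpy as np

rng = np.random.default_rng(0)
E, k, d = 8, 2, 64                          # 8 experts, top-2 active (Mixtral-style)
experts = rng.standard_normal((E, d, d))    # one weight matrix per expert
x = rng.standard_normal(d)

gates = np.zeros(E)                         # router output: zero except the top-k
active = [1, 5]                             # indices chosen by the router
gates[active] = 0.5

# Dense path: multiply through every expert, even the six with zero gates
dense_out = sum(gates[e] * (x @ experts[e]) for e in range(E))

# Sparse path: read the gates first, compute only the k active experts
sparse_out = sum(gates[e] * (x @ experts[e]) for e in active)

assert np.allclose(dense_out, sparse_out)   # identical output, E/k fewer matmuls
```

The expert matmuls drop from E to k, which is the "proportional and provable" part: at top-2 of 8 that is 4× less expert compute.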

1.86×
Mixtral-8×7B
75% natural sp · REAL
3.43×
Qwen3-30B-A3B
93.8% natural sp · REAL
4.75×
Llama-4-Scout ★
93.8% natural sp · REAL
4.40×
Qwen2-57B-A14B
87.5% natural sp · REAL · H200
Universal Compatibility

Works on every platform. Today and tomorrow.

NVIDIA · AMD · Intel · ARM · Apple · Google TPU · Custom ASICs · FPGAs · Photonic · Quantum · Any hardware that does matrix multiply.

A Story About Waste at Scale

ROLV Makes AI Available to Anyone,
Anywhere with a PC.

Picture a container ship crossing the Pacific. It carries 20,000 containers. The manifest says 5,000 of them are empty — have always been empty, will be empty on arrival. But the ship cannot leave them behind. Its loading system was built decades ago and it can only operate one way: load everything, sail everything, unload everything.

It burns fuel proportional to its total cargo — including the 5,000 empty containers. The crew works proportional to total cargo. The port fees are proportional to total cargo. Every crossing. Every time.

This is what cuBLAS does with MoE inference. The empty containers are the inactive experts — architecturally zero, guaranteed by the router, known before the computation starts. cuBLAS has no mechanism to leave them on the dock. It computes all of them, every token, every layer, every inference call.

ROLV Primitive© is the loading system that reads the manifest first. It identifies the empty containers before departure. It sails only what carries cargo. Same destination. Same output. A fraction of the fuel.

The numbers behind the analogy
DeepSeek-V3 — 256 experts, top-8 active
248
empty containers per token
96.9% of all compute wasted by cuBLAS
ROLV Primitive© computes only
8
active experts — exactly
8.76× faster · 110× vs cuSPARSE · PASS
Mixtral-8×7B — 8 experts, top-2 active
6
empty containers per token
75% of all compute wasted by cuBLAS
ROLV computes only
2
active experts — exactly
1.86× faster · 109× vs cuSPARSE · PASS
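The waste percentages in these cards are simple ratios; a hedged one-liner (the helper name is ours, not ROLV's):

```python
def wasted_fraction(total_experts, active_experts):
    """Share of expert compute a dense kernel spends on inactive experts."""
    return (total_experts - active_experts) / total_experts

deepseek = wasted_fraction(256, 8)   # 248/256 = 0.96875, i.e. ~96.9%
mixtral = wasted_fraction(8, 2)      # 6/8 = 0.75, i.e. 75%
```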

Every frontier model crossing the Pacific today carries empty containers. ROLV leaves them on the dock.

Benchmarks — Real Weights · SHA-256 Verified · 1,000 iters

Full results. Every claim verified.

Model | Src | Nat sp % | vs cuBLAS | vs cuSPARSE | Energy ↓ | Tokens/s | PASS
Mixtral-8×7B | REAL | 75.0% | 1.86× | 109× | 46% | 2,185,075 | PASS
Mixtral-8×22B | synth | 75.0% | 2.43× | 107× | 59% | 1,073,568 | PASS
Qwen2-57B-A14B | synth | 87.5% | 3.37× | 70× | 70% | 2,374,040 | PASS
Qwen3-30B-A3B | REAL | 93.8% | 3.43× | 32× | 71% | 6,650,774 | PASS
Llama-4-Scout ★ | REAL | 93.8% | 4.75× | 103× | 79% | 5,795,875 | PASS
DeepSeek-V3/R1 | synth | 96.9% | 8.76× | 110× | 89% | 1,758,046 | PASS

NVIDIA B200 · BF16 · TF32 ON · 1,000 iters · ATOL=0.05 col-norm fp64 · 4 SHA-256 hashes + perturbation PASS · †exact production dims

Model / Layer | GPU | Sparsity | vs cuBLAS | vs vendor sparse | PASS
LLaMA-3.1-8B up_proj [REAL] | H200 | 80% | 2.17× | 9.53× | PASS
LLaMA-3.1-8B up_proj [REAL] | H200 | 90% | 2.79× | 8.66× | PASS
DeepSeek-R1 embed [REAL] | B200 | 95% | 19.42× | 19.42× | PASS
10k×10k synthetic | B200 | 70% | 3.11× | 12.06× | PASS
10k×10k synthetic | MI300X | 85% | 8.5× | 83.77× | PASS
Tesla T4 synthetic | T4 | 90% | 5.8× | 14.2× | PASS

1,684/1,684 total PASS across all GPU benchmarks · BF16 · TF32 ON · ATOL=0.05 · AMD MI300X: rocBLAS 8.5× (rocSPARSE has known regression at high sparsity)

Model / Layer | CPU | Sparsity | vs MKL (iter) | vs MKL (total+build) | Energy ↓ | PASS | Pert
Mistral-7B q_proj [REAL] | Intel i7 | 95% | 21.45× | 18.58× | 95% | PASS | PASS
Mistral-7B up_proj [REAL] | Intel i7 | 95% | 17.98× | 15.73× | 94% | PASS | PASS
Mistral-7B down_proj [REAL] | Intel i7 | 95% | 18.86× | 16.32× | 95% | PASS | PASS
Mistral-7B v_proj [REAL] | Intel i7 | 95% | 20.12× | 18.32× | 95% | PASS | PASS
Mistral-7B gate_proj [REAL] | Intel i7 | 95% | 15.70× | 13.90× | 94% | PASS | PASS
Mistral-7B k_proj [REAL] | Intel i7 | 95% | 17.02× | 15.57× | 94% | PASS | PASS
Mistral-7B o_proj [REAL] | Intel i7 | 95% | 14.24× | 12.59× | 93% | PASS | PASS
Mistral-7B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Mistral-7B avg all layers [REAL] | Intel i7 | 70–95% | 8.49× | – | 83% | 28/28 | 28/28
Qwen3-8B — peak results at 95% sparsity
Qwen3-8B down_proj [REAL] ★ | Intel i7 | 95% | 20.86× | 17.88× | 95% | PASS | PASS
Qwen3-8B q_proj [REAL] | Intel i7 | 95% | 19.38× | 16.61× | 95% | PASS | PASS
Qwen3-8B gate_proj [REAL] | Intel i7 | 95% | 18.05× | 15.14× | 95% | PASS | PASS
Qwen3-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS
Qwen3-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.59× | – | 84% | 28/28 | 28/28
Combined: 56/56 PASS · two model families · same Intel i7 laptop
AMD EPYC 7B13 synthetic | EPYC | 90% | 8.5× | – | 89% | – | –

Intel i7 laptop (4 cores, 68GB RAM) · Mistral-7B + Qwen3-8B real HuggingFace weights · MKL baseline · Speedup includes ROLV build time · 56/56 PASS · 56/56 perturbation PASS · 1,000 iters · ATOL=0.05

Hardware | Matrix | Sparsity | cuSPARSE ms | ROLV ms | ROLV wins | PASS
NVIDIA H200 | LLaMA up_proj | 80% | 5.90 | 0.619 | 9.53× | PASS
NVIDIA H200 | LLaMA up_proj | 90% | 3.01 | 0.348 | 8.66× | PASS
NVIDIA B200 | Mixtral-8×7B MoE | 75% | 25.65 | 0.234 | 109× | PASS
NVIDIA B200 | Llama-4-Scout MoE | 94% | 9.14 | 0.088 | 103× | PASS
NVIDIA B200 | 10k×10k synthetic | 70% | 4.31 | 0.36 | 12.06× | PASS
AMD MI300X | 10k×10k synthetic | 85% | 74.27 | 0.89 | 83.77× | PASS
Intel i7 CPU | Mistral-7B q_proj | 95% | 66.4 | 3.18 | 14.01× | PASS

cuSPARSE is NVIDIA’s own sparse library — tuned by hundreds of engineers. ROLV beats it everywhere because dense matmul on a small submatrix outperforms CSR index lookups for LLM weight patterns. AMD MI300X uses rocSPARSE which has a known performance regression at high sparsity — rocBLAS 8.5× comparison also published.
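One way to read that explanation, as an illustrative NumPy sketch rather than ROLV's actual implementation: when the zeros are concentrated in whole rows, you can gather the live rows once and run a single dense matmul on the small submatrix, with no per-element index lookups, and still recover exactly the dense result:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((1024, 1024))
live = rng.random(1024) > 0.9            # keep roughly 10% of rows (90% sparsity)
W[~live] = 0.0                           # zero rows stand in for inactive structure
x = rng.standard_normal((1024, 16))

# CSR kernels pay an index lookup per non-zero. With row-structured zeros you
# can instead gather the live rows once and run one dense GEMM on the submatrix.
rows = np.flatnonzero(live)              # build step: locate the live rows
y = np.zeros((1024, 16))
y[rows] = W[rows] @ x                    # dense matmul on the small submatrix

assert np.allclose(y, W @ x)             # exactly the dense result
```

The dense GEMM on `W[rows]` streams contiguous memory and hits vendor-tuned kernels, which is the intuition behind beating CSR on these weight patterns.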

Calculators

Measure. Switch. Save.

Three tools to quantify ROLV’s impact on your infrastructure.

△ ROLV Unit™ — Measure True Compute Efficiency

The ROLV Unit™ is a normalised measure of compute efficiency that accounts for sparsity. Unlike TFLOPS (which measures peak theoretical throughput) or tokens/s (which conflates hardware and software), the ROLV Unit measures useful compute — work done on non-zero elements only.

1 ROLV Unit = 1 TFLOP of compute on live (non-zero) matrix elements per second, at full precision, verified by SHA-256 hash.

Your Compute in ROLV Units
Without ROLV
562 RU
wasted on zero rows
With ROLV
2,250 RU
all compute is useful
Cluster efficiency gain
4.0× more useful compute — same hardware
ROLV Unit = TFLOPS on verified non-zero elements. Vendor TFLOPS counts all compute including zero rows.
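The card's numbers can be reproduced from the definition if you assume 2,250 TFLOPS of hardware at 75% sparsity (our reading of the card's inputs, not stated on it):

```python
def rolv_units(peak_tflops, sparsity, with_rolv):
    """Useful TFLOPS: only work done on non-zero elements counts."""
    return peak_tflops if with_rolv else peak_tflops * (1 - sparsity)

without = rolv_units(2250, 0.75, with_rolv=False)     # 562.5 RU useful
using_rolv = rolv_units(2250, 0.75, with_rolv=True)   # 2,250 RU useful
gain = using_rolv / without                           # 4.0x more useful compute
```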
▶ ROLVswitch™ & VRAM — Crossover & Memory Calculator

ROLVswitch™ finds the exact sparsity where ROLV beats dense, and whether your matrix fits in VRAM. Enter your matrix dimensions and hardware to get the switch point and memory analysis.

Crossover is dtype + index dependent. VRAM analysis uses dense size = M×K×dtype bytes.
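A sketch of the VRAM side of the check, using the stated dense-size formula; the `headroom` factor is our assumption for illustration, not part of the calculator:

```python
def fits_in_vram(m, k, dtype_bytes, vram_gb, headroom=0.9):
    """Check a dense M x K matrix against usable VRAM.

    Dense size = M x K x dtype bytes, as stated above; `headroom`
    (our assumption) reserves space for activations and workspace.
    """
    return m * k * dtype_bytes <= vram_gb * 1e9 * headroom

# A 131,072 x 16,384 bf16 matrix is ~4.3 GB and fits easily on an 80 GB card
assert fits_in_vram(131_072, 16_384, 2, 80)
```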
■ RSMT™ — Sparse Storage Threshold Calculator

RSMT™ finds the exact sparsity threshold where sparse storage beats dense for your dtype. Below the threshold, dense storage wins on memory. Above it, sparse wins — and ROLV wins on compute too.

Why RSMT™ Matters

The crossover point depends entirely on your dtype. With bfloat16 values (2 bytes) and int32 indices (4 bytes), sparse format costs 3× as many bytes per non-zero as dense storage spends per element. Sparse wins only when you have enough zeros to overcome the index overhead.

Your MoE models at bfloat16
Mixtral-8×7B: 75%  ✓ well above crossover
Qwen3-30B-A3B: 93.8%  ✓ far above crossover
Llama-4-Scout: 93.8%  ✓ far above crossover
DeepSeek-V3: 96.9%  ✓ extreme advantage

RSMT™ is computed analytically — no approximation. The crossover is mathematically exact for any dtype combination.
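Ignoring the small row-pointer array, that crossover can be written in a few lines (a sketch of the analytic rule, not the RSMT™ tool itself); with bfloat16 values and int32 indices it lands near 66.7% sparsity, which is why every model listed above clears it:

```python
def rsmt_threshold(value_bytes, index_bytes):
    """Sparsity above which CSR-style storage (value + index per non-zero)
    takes fewer bytes than dense storage (value per element)."""
    # dense: n * value_bytes      sparse: nnz * (value_bytes + index_bytes)
    # sparse < dense  <=>  nnz/n < value_bytes / (value_bytes + index_bytes)
    return 1 - value_bytes / (value_bytes + index_bytes)

threshold = rsmt_threshold(2, 4)   # bf16 + int32: sparsity must exceed ~66.7%
```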

Independent Verification

Every result is independently verifiable.

4 SHA-256 hashes per case. Perturbation test on every result. ATOL=0.05 on column-normalised fp64. 1,684/1,684 GPU PASS · 56/56 CPU PASS. Download the full validation kit with harness code, raw outputs, and reproduction instructions.
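A minimal sketch of what such a check might look like (the validation kit defines its own harness; the function and variable names here are ours): hash the reference output, compare in column-normalised fp64 within ATOL, and confirm that a perturbed output fails:

```python
import hashlib
import numpy as np

def verify(y_candidate, y_reference, atol=0.05):
    """Column-normalise both outputs in fp64, compare within ATOL, and
    return a SHA-256 digest of the reference for independent checking."""
    a = np.asarray(y_candidate, dtype=np.float64)
    b = np.asarray(y_reference, dtype=np.float64)
    norms = np.linalg.norm(b, axis=0)
    norms[norms == 0] = 1.0                      # guard all-zero columns
    ok = float(np.max(np.abs(a / norms - b / norms))) <= atol
    return ok, hashlib.sha256(b.tobytes()).hexdigest()

rng = np.random.default_rng(2)
y = rng.standard_normal((64, 64))
ok, digest = verify(y + 1e-4, y)   # tiny numerical noise: within tolerance
bad, _ = verify(y + 1.0, y)        # perturbation test: a shifted output fails
assert ok and not bad
```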

Download Validation Kit ↓ · Full Benchmark PDF ↓
Founder & CEO

Rolv Eitrem Heggenhougen

Born in Norway. Built companies across Europe and the United States. In May 2025, during a bike ride in Fort Lauderdale, he asked whether AI matrix operations could be made dramatically faster — and refused to stop until they were. Six months later, ROLV Primitive© was independently validated by the University of Miami. Three patents pending.

“Imagination is the only limitation to innovation.”

Contact

Contact Us

rolv@rolv.ai · 3 Patents Pending