Drop-in replacement for cuBLAS and cuSPARSE. Works on every GPU, CPU, and accelerator. Zero pruning. Zero model changes. Zero retraining.
ROLV Primitive© replaces cuBLAS and cuSPARSE — NVIDIA’s own compute libraries — with a fundamentally better approach for sparse AI workloads. On DeepSeek-V3 and DeepSeek-R1 (real weights, H200), ROLV is 7.15× faster than cuBLAS and 53× faster than cuSPARSE†. On Kimi-K2-Instruct (real weights, H200): 8.74× faster than cuBLAS, 89% energy saved. On NVIDIA hardware. SHA-256 verified, perturbation PASS every case.
ROLV Primitive© is a drop-in replacement for cuBLAS and cuSPARSE that exploits the natural zero structure of AI weight matrices. No approximation. No accuracy cost. Deterministic on every platform.
MoE routers zero out 75–99% of expert weights per token — architecturally, exactly. cuBLAS computes them all. ROLV doesn’t. The speedup is proportional and provable.
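The arithmetic behind that claim can be sketched in a few lines. This is an illustration of the principle, not ROLV's implementation: given the router's mask, the inactive experts' weight blocks never need to enter the matmul, and the result is numerically identical.

```python
import numpy as np

def moe_dense(x, expert_weights, active):
    """Dense baseline: compute every expert, then discard the inactive ones
    -- the pattern the text attributes to cuBLAS-style MoE inference."""
    outs = [x @ W for W in expert_weights]            # ALL experts computed
    return sum(o for o, a in zip(outs, active) if a)

def moe_skip(x, expert_weights, active):
    """Skip inactive experts entirely: same output, FLOPs proportional
    to the number of active experts."""
    return sum(x @ W for W, a in zip(expert_weights, active) if a)

rng = np.random.default_rng(0)
experts = [rng.standard_normal((64, 32)) for _ in range(8)]
x = rng.standard_normal(64)
active = [False] * 8
active[1] = active[5] = True                          # router picks 2 of 8 experts

assert np.allclose(moe_dense(x, experts, active), moe_skip(x, experts, active))
```

With 2 of 8 experts active, the skip path does 25% of the multiply-adds for the same answer, which is the proportionality the text refers to.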
NVIDIA · AMD · Intel · ARM · Apple · Google TPU · Custom ASICs · FPGAs · Photonic · Quantum · Any hardware that does matrix multiply.
Picture a container ship crossing the Pacific. It carries 20,000 containers. The manifest says 5,000 of them are empty — have always been empty, will be empty on arrival. But the ship cannot leave them behind. Its loading system was built decades ago and it can only operate one way: load everything, sail everything, unload everything.
It burns fuel proportional to its total cargo — including the 5,000 empty containers. The crew works proportional to total cargo. The port fees are proportional to total cargo. Every crossing. Every time.
This is what cuBLAS does with MoE inference. The empty containers are the inactive experts — architecturally zero, guaranteed by the router, known before the computation starts. cuBLAS has no mechanism to leave them on the dock. It computes all of them, every token, every layer, every inference call.
ROLV Primitive© is the loading system that reads the manifest first. It identifies the empty containers before departure. It sails only what carries cargo. Same destination. Same output. A fraction of the fuel.
Every frontier model crossing the Pacific today carries empty containers. ROLV leaves them on the dock.
| Model | Source | Natural sparsity | vs cuBLAS | vs cuSPARSE | Energy saved | Tokens/s | PASS |
|---|---|---|---|---|---|---|---|
| Snowflake-Arctic ★★ | synth | 98.4% | 9.54× | 36× | 91% | 3,919,474 | ✓ |
| Llama-4-Maverick | synth | 99.2% | 9.32× | 16×† | 91% | 667,899 | ✓ |
| Kimi-K2-Instruct ★ | REAL | 97.9% | 8.74× | 43×† | 89% | 597,568 | ✓ |
| Kimi-K2.5 | synth | 97.9% | 8.59× | 43×† | 88% | 587,180 | ✓ |
| DeepSeek-V3-0324 | REAL | 96.9% | 7.15× | 53×† | 85% | 733,410 | ✓ |
| DeepSeek-V3 | REAL | 96.9% | 7.15× | 53×† | 85% | 734,848 | ✓ |
| DeepSeek-R1 | REAL | 96.9% | 7.15× | 53×† | 85% | 733,962 | ✓ |
| Qwen3-235B-A22B | synth | 93.8% | 4.35× | 65× | 75% | 893,012 | ✓ |
| Llama-4-Scout ★ | REAL | 93.8% | 4.75× | 103× | 79% | 5,795,875 | ✓ |
| Gemma4-26B-A4B | REAL | 93.8% | 4.47× | 53× | 78% | 2,398,905 | ✓ |
| Qwen2-57B-A14B | REAL | 87.5% | 4.40× | 90× | 77% | 2,357,882 | ✓ |
| Qwen3-30B-A3B | REAL | 93.8% | 3.43× | 32× | 71% | 6,650,774 | ✓ |
| Phi-3.5-MoE | REAL | 87.5% | 3.38× | 74× | 70% | 2,430,602 | ✓ |
| Qwen1.5-MoE-A2.7B | REAL | 93.3% | 3.37× | 35× | 70% | 4,834,346 | ✓ |
| DeepSeek-V2-Lite | REAL | 90.6% | 2.94× | 40× | 66% | 3,959,777 | ✓ |
| OLMoE-1B-7B | REAL | 87.5% | 2.49× | 43× | 60% | 4,580,013 | ✓ |
| Mixtral-8×7B | REAL | 75.0% | 1.86× | 109× | 46% | 2,185,075 | ✓ |
| Mixtral-8×22B | REAL | 75.0% | 1.36× | 76× | 27% | 646,556 | ✓ |
| MiniMax-M2.5 — custom architecture · full matrix · cuSPARSE CAN run · ROLV wins 25× | |||||||
| MiniMax-M2.5 ★ | REAL | 96.9% | 3.95× | 25× | 77% | 1,314,909 | ✓ |
| DBRX | synth | 75.0% | 1.31× | 73× | 23% | 473,230 | ✓ |
H200 + B200 · BF16 · TF32 ON · 1,000 iterations · ATOL=0.05 on column-normalised fp64 · 4 SHA-256 hashes + perturbation PASS on every case · ★ peak REAL · ★★ peak overall · † cuSPARSE run on the active submatrix only (full matrix exceeds its INT_MAX index limit; ROLV handles the full matrix)
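The acceptance criterion quoted in the footnote ("ATOL=0.05 on column-normalised fp64", four SHA-256 hashes per case) can be read as roughly the following check. This is our sketch of one plausible reading, not the validation harness's actual code:

```python
import hashlib
import numpy as np

def columns_match(result, reference, atol=0.05):
    """Compare in fp64 after normalising by each reference column's norm,
    so the tolerance is relative to column scale rather than absolute."""
    r = np.asarray(result, dtype=np.float64)
    g = np.asarray(reference, dtype=np.float64)
    norms = np.linalg.norm(g, axis=0)
    norms[norms == 0.0] = 1.0            # guard all-zero columns
    return bool(np.all(np.abs(r - g) / norms <= atol))

def fingerprint(arr):
    """SHA-256 over the raw bytes: identical outputs hash identically,
    so third parties can confirm results without rerunning anything."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

a = np.eye(4)
assert columns_match(a + 1e-3, a)                    # within tolerance
assert fingerprint(a) == fingerprint(np.eye(4))      # deterministic hash
```

Hashing the raw output bytes is what makes the determinism claim checkable: two runs that print the same SHA-256 produced bit-identical matrices.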
| Model / Layer | GPU | Sparsity | vs cuBLAS | vs vendor sparse | PASS |
|---|---|---|---|---|---|
| LLaMA-3.1-8B up_proj [REAL] | H200 | 80% | 2.17× | 9.53× | ✓ |
| LLaMA-3.1-8B up_proj [REAL] | H200 | 90% | 2.79× | 8.66× | ✓ |
| DeepSeek-R1 embed [REAL] | B200 | 95% | 19.42× | 19.42× | ✓ |
| 10k×10k synthetic | B200 | 70% | 3.11× | 12.06× | ✓ |
| 10k×10k synthetic | MI300X | 85% | 8.5× | 83.77× | ✓ |
| Synthetic | T4 | 90% | 5.8× | 14.2× | ✓ |
1,684/1,684 total PASS across all GPU benchmarks · BF16 · TF32 ON · ATOL=0.05 · AMD MI300X: 8.5× vs rocBLAS (rocSPARSE has a known performance regression at high sparsity)
| Model / Layer | CPU | Sparsity | vs MKL (per-iter) | vs MKL (incl. build) | Energy saved | PASS | Perturb |
|---|---|---|---|---|---|---|---|
| Mistral-7B q_proj [REAL] | Intel i7 | 95% | 21.45× | 18.58× | 95% | ✓ | ✓ |
| Mistral-7B up_proj [REAL] | Intel i7 | 95% | 17.98× | 15.73× | 94% | ✓ | ✓ |
| Mistral-7B down_proj [REAL] | Intel i7 | 95% | 18.86× | 16.32× | 95% | ✓ | ✓ |
| Mistral-7B v_proj [REAL] | Intel i7 | 95% | 20.12× | 18.32× | 95% | ✓ | ✓ |
| Mistral-7B gate_proj [REAL] | Intel i7 | 95% | 15.70× | 13.90× | 94% | ✓ | ✓ |
| Mistral-7B k_proj [REAL] | Intel i7 | 95% | 17.02× | 15.57× | 94% | ✓ | ✓ |
| Mistral-7B o_proj [REAL] | Intel i7 | 95% | 14.24× | 12.59× | 93% | ✓ | ✓ |
| Mistral-7B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | |||||||
| Mistral-7B avg all layers [REAL] | Intel i7 | 70–95% | 8.49× | — | 83% | 28/28 | 28/28 |
| Qwen3-8B — peak results at 95% sparsity | |||||||
| Qwen3-8B down_proj [REAL] ★ | Intel i7 | 95% | 20.86× | 17.88× | 95% | ✓ | ✓ |
| Qwen3-8B q_proj [REAL] | Intel i7 | 95% | 19.38× | 16.61× | 95% | ✓ | ✓ |
| Qwen3-8B gate_proj [REAL] | Intel i7 | 95% | 18.05× | 15.14× | 95% | ✓ | ✓ |
| Qwen3-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | |||||||
| Qwen3-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.59× | — | 84% | 28/28 | 28/28 |
| Gemma4-E4B (Google) — peak results at 95% sparsity | |||||||
| Gemma4-E4B up_proj [REAL] ★ | Intel i7 | 95% | 19.56× | 17.29× | 95% | ✓ | ✓ |
| Gemma4-E4B o_proj [REAL] | Intel i7 | 95% | 17.58× | 15.98× | 94% | ✓ | ✓ |
| Gemma4-E4B gate_proj [REAL] | Intel i7 | 95% | 16.07× | 14.65× | 94% | ✓ | ✓ |
| Gemma4-E4B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | |||||||
| Gemma4-E4B avg all layers [REAL] | Intel i7 | 70–95% | 7.20× | — | 81% | 28/28 | 28/28 |
| Phi-4 (Microsoft) — 2 layer types · 8/8 PASS | |||||||
| Phi-4 down_proj [REAL] ★ | Intel i7 | 95% | 14.82× | 13.23× | 93% | ✓ | ✓ |
| Phi-4 o_proj [REAL] | Intel i7 | 95% | 13.32× | 11.62× | 93% | ✓ | ✓ |
| DeepSeek-R1-7B — peak results at 95% sparsity · 28/28 PASS | |||||||
| DeepSeek-R1-7B down_proj [REAL] ★ | Intel i7 | 95% | 17.22× | 15.22× | 94% | ✓ | ✓ |
| DeepSeek-R1-7B gate_proj [REAL] | Intel i7 | 90% | 13.48× | 11.82× | 93% | ✓ | ✓ |
| DeepSeek-R1-7B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 85% | 28/28 | 28/28 |
| Qwen2.5-7B (Alibaba) — peak results at 95% sparsity · 28/28 PASS | |||||||
| Qwen2.5-7B down_proj [REAL] ★ | Intel i7 | 95% | 17.40× | 15.54× | 94% | ✓ | ✓ |
| Qwen2.5-7B q_proj [REAL] | Intel i7 | 95% | 16.44× | 14.74× | 94% | ✓ | ✓ |
| Qwen2.5-7B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 83% | 28/28 | 28/28 |
| Llama-3.2-3B + Llama-3.1-8B (Meta) · 56/56 PASS | |||||||
| Llama-3.2-3B down_proj [REAL] ★ | Intel i7 | 95% | 18.07× | 15.60× | 94% | ✓ | ✓ |
| Llama-3.1-8B down_proj [REAL] | Intel i7 | 95% | 15.08× | 13.25× | 93% | ✓ | ✓ |
| Llama-3.2-3B avg / Llama-3.1-8B avg [REAL] | Intel i7 | 70–95% | 7.4× / 7.5× | — | 83% | 56/56 | 56/56 |
| Gemma-2-2B (Google) — peak results at 95% sparsity · 28/28 PASS | |||||||
| Gemma-2-2B down_proj [REAL] ★ | Intel i7 | 95% | 18.71× | 16.81× | 95% | ✓ | ✓ |
| Gemma-2-2B gate_proj [REAL] | Intel i7 | 95% | 15.95× | 14.35× | 94% | ✓ | ✓ |
| Gemma-2-2B avg all layers [REAL] | Intel i7 | 70–95% | 7.0× | — | 85% | 28/28 | 28/28 |
| TOTAL CPU: 9 models · 332/332 PASS · Avg 7.37× · Peak 24.27× | |||||||
| Phi-4 (Microsoft) — peak at 95% | |||||||
| Phi-4 o_proj [REAL] ★ | Intel i7 | 95% | 19.44× | 17.04× | 95% | ✓ | ✓ |
| Phi-4 down_proj [REAL] | Intel i7 | 95% | 16.12× | 13.93× | 94% | ✓ | ✓ |
| DeepSeek-R1-Distill-7B — peak at 95% | |||||||
| DeepSeek-R1-7B q_proj [REAL] ★ | Intel i7 | 95% | 21.11× | 18.68× | 95% | ✓ | ✓ |
| DeepSeek-R1-7B down_proj [REAL] | Intel i7 | 95% | 20.41× | 17.65× | 95% | ✓ | ✓ |
| Qwen2.5-7B — peak at 95% | |||||||
| Qwen2.5-7B gate_proj [REAL] ★ | Intel i7 | 95% | 59.70× | — | 98% | ✓ | ✓ |
| Qwen2.5-7B down_proj [REAL] | Intel i7 | 95% | 21.26× | — | 95% | ✓ | ✓ |
| Llama-3.2-3B (Meta) — peak at 95% | |||||||
| Llama-3.2-3B up_proj [REAL] ★ | Intel i7 | 95% | 19.83× | — | 95% | ✓ | ✓ |
| Llama-3.2-3B gate_proj [REAL] | Intel i7 | 95% | 17.23× | — | 94% | ✓ | ✓ |
| Llama-3.1-8B (Meta) — peak at 95% | |||||||
| Llama-3.1-8B q_proj [REAL] ★ | Intel i7 | 95% | 24.44× | 22.20× | 96% | ✓ | ✓ |
| Llama-3.1-8B down_proj [REAL] | Intel i7 | 95% | 19.29× | 18.12× | 95% | ✓ | ✓ |
| Llama-3.1-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | |||||||
| Llama-3.1-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.26× | — | 84% | 28/28 | 28/28 |
| Gemma-2-2B (Google) — peak at 95% | |||||||
| Gemma-2-2B up_proj [REAL] ★ | Intel i7 | 95% | 20.07× | 20.31× | 95% | ✓ | ✓ |
| Gemma-2-2B down_proj [REAL] | Intel i7 | 95% | 18.67× | 16.95× | 95% | ✓ | ✓ |
| Gemma-2-2B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | |||||||
| Gemma-2-2B avg all layers [REAL] | Intel i7 | 70–95% | 8.48× | — | 83% | 28/28 | 28/28 |
| ▶ Google Colab Intel Xeon @ 2.20GHz · 4 cores · 54.8GB RAM · FP32 · rolvprimitive wheel — 105/105 PASS · 5 sparsity levels (70–99%) · 3 models · 7 layers each | |||||||
| Llama-3.1-8B (Meta) — ★★ CPU peak 77.38× (o_proj, 99%) — 35/35 PASS · exact FP32 all cases | |||||||
| Llama-3.1-8B o_proj [REAL] ★★ | Xeon Colab | 99% | 77.38× | — | 98.7% | ✓ | ✓ |
| Llama-3.1-8B gate_proj [REAL] | Xeon Colab | 90% | 10.67× | — | 90.6% | ✓ | ✓ |
| Llama-3.1-8B down_proj [REAL] | Xeon Colab | 95% | 22.36× | — | 95.5% | ✓ | ✓ |
| Llama-3.1-8B avg all layers [REAL] | Xeon Colab | 70–99% | ~8–77× | — | 99% | 35/35 | 35/35 |
| Qwen3-8B (Alibaba) — ★ CPU peak 73.22× (up_proj, 99%) — 35/35 PASS · exact FP32 all cases | |||||||
| Qwen3-8B up_proj [REAL] ★ | Xeon Colab | 99% | 73.22× | — | 98.6% | ✓ | ✓ |
| Qwen3-8B gate_proj [REAL] | Xeon Colab | 90% | 10.83× | — | 90.8% | ✓ | ✓ |
| Qwen3-8B down_proj [REAL] | Xeon Colab | 95% | 20.20× | — | 95.1% | ✓ | ✓ |
| Qwen3-8B avg all layers [REAL] | Xeon Colab | 70–99% | ~7–73× | — | 99% | 35/35 | 35/35 |
| Qwen2.5-7B (Alibaba) — peak 64.21× (down_proj, 99%) — 35/35 PASS | |||||||
| Qwen2.5-7B down_proj [REAL] ★ | Xeon Colab | 99% | 64.21× | — | 98.4% | ✓ | ✓ |
| Qwen2.5-7B gate_proj [REAL] | Xeon Colab | 95% | 17.80× | — | 94.4% | ✓ | ✓ |
| Qwen2.5-7B o_proj [REAL] | Xeon Colab | 90% | 11.03× | — | 90.9% | ✓ | ✓ |
| Qwen2.5-7B avg all layers [REAL] | Xeon Colab | 70–99% | ~4–64× | — | 99% | 35/35 | 35/35 |
| ▶ Google Colab Intel Xeon @ 2.20GHz · 2 cores · 13GB RAM · FP32 · rolvprimitive wheel — 125/125 PASS · 5 sparsity levels (70–99%) · smaller models | |||||||
| SmolLM2-1.7B (HuggingFace) ★ — peak 27.26× | |||||||
| SmolLM2-1.7B gate_proj [REAL] ★ | Xeon Colab | 95% | 27.26× | — | 96% | ✓ | ✓ |
| SmolLM2-1.7B up_proj [REAL] | Xeon Colab | 95% | 24.29× | — | 96% | ✓ | ✓ |
| SmolLM2-1.7B avg all layers [REAL] | Xeon Colab | 70–95% | 8.67× | — | 79% | 20/20 | 20/20 |
| Qwen2.5-1.5B (Alibaba) — peak 27.61× | |||||||
| Qwen2.5-1.5B up_proj [REAL] ★ | Xeon Colab | 95% | 27.61× | — | 96% | ✓ | ✓ |
| Qwen2.5-1.5B gate_proj [REAL] | Xeon Colab | 95% | 17.04× | — | 94% | ✓ | ✓ |
| Qwen2.5-1.5B avg all layers [REAL] | Xeon Colab | 70–95% | 6.70× | — | 76% | 20/20 | 20/20 |
| Llama-3.2-1B (Meta) — peak 25.97× | |||||||
| Llama-3.2-1B up_proj [REAL] ★ | Xeon Colab | 95% | 25.97× | — | 95% | ✓ | ✓ |
| Llama-3.2-1B avg all layers [REAL] | Xeon Colab | 70–95% | 7.15× | — | 78% | 20/20 | 20/20 |
| Gemma-2-2B on Colab Xeon — peak 28.62× (confirms i7 results on different CPU) | |||||||
| Gemma-2-2B gate_proj [REAL] ★ | Xeon Colab | 95% | 28.62× | — | 95% | ✓ | ✓ |
| Gemma-2-2B avg all layers [REAL] | Xeon Colab | 70–95% | 7.09× | — | 78% | 20/20 | 20/20 |
| Llama-3.2-3B on Colab Xeon — peak 27.16× | |||||||
| Llama-3.2-3B up_proj [REAL] ★ | Xeon Colab | 95% | 27.16× | — | 96% | ✓ | ✓ |
| Llama-3.2-3B gate_proj [REAL] | Xeon Colab | 95% | 16.99× | — | 94% | ✓ | ✓ |
| Llama-3.2-3B avg all layers [REAL] | Xeon Colab | 70–95% | 8.09× | — | 81% | 20/20 | 20/20 |
| i7 combined total: 252/252 PASS · Meta · Alibaba · Google · Microsoft · DeepSeek · same Intel i7 laptop | |||||||
| AMD EPYC 7B13 synthetic | EPYC | 90% | 8.5× | — | 89% | ✓ | ✓ |
Intel i7 laptop (4 cores, 68GB RAM) · Mistral-7B + Qwen3-8B + Gemma4-E4B + Phi-4 + DeepSeek-R1-7B + Qwen2.5-7B + Llama-3.2-3B + Llama-3.1-8B + Gemma-2-2B real HuggingFace weights · MKL baseline · Speedup includes ROLV build time · 252/252 PASS (i7) + 230/230 PASS (Colab Xeon wheel, 5-level) = 482/482 total · 482/482 perturbation PASS · CPU peak 77.38× · 1,000 iters · ATOL=0.05
| Hardware | Matrix | Sparsity | cuSPARSE ms | ROLV ms | ROLV wins | PASS |
|---|---|---|---|---|---|---|
| NVIDIA H200 | LLaMA up_proj | 80% | 5.90 | 0.619 | 9.53× | ✓ |
| NVIDIA H200 | LLaMA up_proj | 90% | 3.01 | 0.348 | 8.66× | ✓ |
| NVIDIA B200 | Mixtral-8×7B MoE | 75% | 25.65 | 0.234 | 109× | ✓ |
| NVIDIA B200 | Llama-4-Scout MoE | 94% | 9.14 | 0.088 | 103× | ✓ |
| NVIDIA B200 | 10k×10k synthetic | 70% | 4.31 | 0.36 | 12.06× | ✓ |
| AMD MI300X | 10k×10k synthetic | 85% | 74.27 | 0.89 | 83.77× | ✓ |
| Intel i7 CPU | Mistral-7B q_proj | 95% | 66.4 | 3.18 | 14.01× | ✓ |
cuSPARSE is NVIDIA’s own sparse library, tuned by hundreds of engineers. ROLV outperforms it across every benchmark above because, for the zero patterns typical of LLM weights, a dense matmul over the compacted nonzero submatrix beats CSR index lookups. AMD MI300X uses rocSPARSE, which has a known performance regression at high sparsity; an 8.5× comparison against rocBLAS is also published.
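The mechanism is easy to demonstrate: when whole rows (or expert blocks) are structurally zero, one dense GEMM over just the nonzero rows reads memory contiguously, while a CSR kernel pays a gather through its index array for every nonzero. A pure-NumPy sketch of the two strategies, illustrative rather than ROLV's kernel:

```python
import numpy as np

def csr_from_dense(W):
    """Build the three CSR arrays (indptr, indices, data) from a dense matrix."""
    indptr, indices, data = [0], [], []
    for row in W:
        nz = np.flatnonzero(row)
        indices.extend(nz)
        data.extend(row[nz])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def spmm_csr(indptr, indices, data, X, n_rows):
    """CSR times dense: every nonzero costs an indexed lookup into X."""
    out = np.zeros((n_rows, X.shape[1]))
    for i in range(n_rows):
        lo, hi = indptr[i], indptr[i + 1]
        out[i] = data[lo:hi] @ X[indices[lo:hi]]
    return out

def spmm_submatrix(W, X):
    """Submatrix strategy: one contiguous dense GEMM over the nonzero rows."""
    rows = np.flatnonzero(np.abs(W).sum(axis=1))      # rows carrying any weight
    out = np.zeros((W.shape[0], X.shape[1]))
    out[rows] = W[rows] @ X
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
W[rng.random(256) < 0.9] = 0.0                        # ~90% of rows zeroed out
X = rng.standard_normal((128, 32))

assert np.allclose(spmm_csr(*csr_from_dense(W), X, 256), spmm_submatrix(W, X))
```

Both paths produce the same product; the difference is purely access pattern, which is why a well-tuned CSR library can still lose to a compacted dense GEMM on these structures.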
Sparse Compute Advisor — three integrated calculators: cuSPARSE INT_MAX ceiling, path selector, and speedup vs the correct vendor baseline for your sparsity level.
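The INT_MAX ceiling comes from the 32-bit index types in cuSPARSE's common code paths: once a matrix's total element count or nnz exceeds 2³¹ − 1, those paths cannot address it. A back-of-the-envelope feasibility check, a sketch under that assumption rather than the Advisor itself:

```python
INT_MAX = 2**31 - 1   # ceiling for 32-bit index types

def fits_in_int32_path(rows, cols, density):
    """Rough check for 32-bit-index sparse routines (a sketch, not the
    Sparse Compute Advisor): both total element count and nnz can hit
    the ceiling, depending on format and API entry point."""
    elements = rows * cols
    nnz = int(elements * density)
    return elements <= INT_MAX and nnz <= INT_MAX

# A single large projection fits comfortably (illustrative sizes):
assert fits_in_int32_path(129_280, 7_168, 0.05)
# A full stacked MoE expert matrix does not (256 experts x 2,048 x 7,168):
assert not fits_in_int32_path(256 * 2_048, 7_168, 0.03)
```

This is why the † rows in the GPU table benchmark cuSPARSE on the active submatrix only: the full stacked matrix simply does not fit the 32-bit index paths.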
4 SHA-256 hashes per case. Perturbation test on every result. ATOL=0.05 on column-normalised fp64. 1,684/1,684 GPU PASS · 332/332 CPU PASS. Download the full validation kit with harness code, raw outputs, and reproduction instructions.
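The perturbation test mentioned here guards against a harness silently comparing a cached result to itself: nudge one input element, rerun, and require the output to change. A minimal sketch of that idea, our reading of the methodology rather than the kit's code:

```python
import numpy as np

def perturbation_pass(kernel, W, X, eps=1e-3):
    """PASS if the kernel genuinely computes from its inputs:
    perturbing a single weight must change the output."""
    base = kernel(W, X)
    Wp = W.copy()
    Wp.flat[int(np.flatnonzero(W)[0])] += eps    # nudge the first nonzero weight
    return not np.allclose(kernel(Wp, X), base)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
X = rng.standard_normal((64, 32))

assert perturbation_pass(lambda w, x: w @ x, W, X)        # real kernel: PASS
assert not perturbation_pass(lambda w, x: W @ x, W, X)    # ignores its input: FAIL
```

The second assertion shows what the test catches: a "kernel" that returns a result derived from anything other than the supplied inputs fails immediately.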
Born in Norway, he built companies across Europe and the United States. In May 2025, on a bike ride in Fort Lauderdale, he asked whether AI matrix operations could be made dramatically faster — and refused to stop until they were. Six months later, ROLV Primitive© was independently validated by the University of Miami. Three patents pending.
“Imagination is the only limitation to innovation.”
Read the full story →