Drop-in replacement for cuBLAS and cuSPARSE. Works on every GPU, CPU, and accelerator. Zero pruning. Zero model changes. Zero retraining.
ROLV Primitive© replaces cuBLAS and cuSPARSE — NVIDIA’s own compute libraries — with a fundamentally better approach for sparse AI workloads. On MoE models like DeepSeek-V3, ROLV is 8.76× faster than cuBLAS and 110× faster than cuSPARSE. On NVIDIA hardware. Verified with SHA-256 hashes and perturbation tests.
ROLV Primitive© is a drop-in replacement for cuBLAS and cuSPARSE that exploits the natural zero structure of AI weight matrices. No approximation. No accuracy cost. Deterministic on every platform.
MoE routers zero out 75–99% of expert weights per token — architecturally, exactly. cuBLAS computes them all. ROLV doesn’t. The speedup is proportional and provable.
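The sparsity arithmetic is direct: a router that activates k of E routed experts leaves a 1 - k/E fraction of expert weights at exactly zero for that token. A minimal sketch, using commonly reported routed-expert counts (an assumption on our part, not taken from this document) that reproduce the natural-sparsity figures in the benchmark tables:

```python
# Natural per-token expert sparsity in MoE models: a router that
# activates k of E routed experts leaves a (1 - k/E) fraction of
# expert weights architecturally zero for that token.
def natural_sparsity(total_experts: int, active_experts: int) -> float:
    return 1.0 - active_experts / total_experts

# Routed-expert counts as commonly reported for these models
# (shared/always-on experts, if any, are ignored in this sketch).
configs = {
    "Mixtral-8x7B":   (8, 2),     # 2 of 8 experts per token
    "Qwen2-57B-A14B": (64, 8),
    "Qwen3-30B-A3B":  (128, 8),
    "DeepSeek-V3":    (256, 8),
}
for name, (E, k) in configs.items():
    print(f"{name}: {natural_sparsity(E, k):.1%} of expert weights inactive")
```

Running this yields 75.0%, 87.5%, 93.8%, and 96.9% — the "natural sparsity" column in the GPU table.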
NVIDIA · AMD · Intel · ARM · Apple · Google TPU · Custom ASICs · FPGAs · Photonic · Quantum · Any hardware that does matrix multiply.
Picture a container ship crossing the Pacific. It carries 20,000 containers. The manifest says 5,000 of them are empty — have always been empty, will be empty on arrival. But the ship cannot leave them behind. Its loading system was built decades ago and it can only operate one way: load everything, sail everything, unload everything.
It burns fuel proportional to its total cargo — including the 5,000 empty containers. The crew works proportional to total cargo. The port fees are proportional to total cargo. Every crossing. Every time.
This is what cuBLAS does with MoE inference. The empty containers are the inactive experts — architecturally zero, guaranteed by the router, known before the computation starts. cuBLAS has no mechanism to leave them on the dock. It computes all of them, every token, every layer, every inference call.
ROLV Primitive© is the loading system that reads the manifest first. It identifies the empty containers before departure. It sails only what carries cargo. Same destination. Same output. A fraction of the fuel.
Every frontier model crossing the Pacific today carries empty containers. ROLV leaves them on the dock.
| Model | Source | Natural sparsity | vs cuBLAS | vs cuSPARSE | Energy saved | Tokens/s | PASS |
|---|---|---|---|---|---|---|---|
| Mixtral-8×7B | REAL | 75.0% | 1.86× | 109× | 46% | 2,185,075 | ✓ |
| Mixtral-8×22B | synth | 75.0% | 2.43× | 107× | 59% | 1,073,568 | ✓ |
| Qwen2-57B-A14B | synth | 87.5% | 3.37× | 70× | 70% | 2,374,040 | ✓ |
| Qwen3-30B-A3B | REAL | 93.8% | 3.43× | 32× | 71% | 6,650,774 | ✓ |
| Llama-4-Scout ★ | REAL | 93.8% | 4.75× | 103× | 79% | 5,795,875 | ✓ |
| DeepSeek-V3/R1 | synth | 96.9% | 8.76× | 110× | 89% | 1,758,046 | ✓ |
NVIDIA B200 · BF16 · TF32 ON · 1,000 iters · ATOL=0.05 col-norm fp64 · 4 SHA-256 hashes + perturbation PASS · ★ exact production dims
| Model / Layer | GPU | Sparsity | vs cuBLAS | vs vendor sparse | PASS |
|---|---|---|---|---|---|
| LLaMA-3.1-8B up_proj [REAL] | H200 | 80% | 2.17× | 9.53× | ✓ |
| LLaMA-3.1-8B up_proj [REAL] | H200 | 90% | 2.79× | 8.66× | ✓ |
| DeepSeek-R1 embed [REAL] | B200 | 95% | 19.42× | 19.42× | ✓ |
| 10k×10k synthetic | B200 | 70% | 3.11× | 12.06× | ✓ |
| 10k×10k synthetic | MI300X | 85% | 8.5× | 83.77× | ✓ |
| Tesla T4 synthetic | T4 | 90% | 5.8× | 14.2× | ✓ |
1,684/1,684 total PASS across all GPU benchmarks · BF16 · TF32 ON · ATOL=0.05 · AMD MI300X: rocBLAS 8.5× (rocSPARSE has known regression at high sparsity)
| Model / Layer | CPU | Sparsity | vs MKL (iter) | vs MKL (total+build) | Energy↓ | PASS | Pert |
|---|---|---|---|---|---|---|---|
| Mistral-7B q_proj [REAL] | Intel i7 | 95% | 21.45× | 18.58× | 95% | ✓ | ✓ |
| Mistral-7B up_proj [REAL] | Intel i7 | 95% | 17.98× | 15.73× | 94% | ✓ | ✓ |
| Mistral-7B down_proj [REAL] | Intel i7 | 95% | 18.86× | 16.32× | 95% | ✓ | ✓ |
| Mistral-7B v_proj [REAL] | Intel i7 | 95% | 20.12× | 18.32× | 95% | ✓ | ✓ |
| Mistral-7B gate_proj [REAL] | Intel i7 | 95% | 15.70× | 13.90× | 94% | ✓ | ✓ |
| Mistral-7B k_proj [REAL] | Intel i7 | 95% | 17.02× | 15.57× | 94% | ✓ | ✓ |
| Mistral-7B o_proj [REAL] | Intel i7 | 95% | 14.24× | 12.59× | 93% | ✓ | ✓ |
| Mistral-7B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | | | | | | | |
| Mistral-7B avg all layers [REAL] | Intel i7 | 70–95% | 8.49× | — | 83% | 28/28 | 28/28 |
| Qwen3-8B · peak results at 95% sparsity | | | | | | | |
| Qwen3-8B down_proj [REAL] ★ | Intel i7 | 95% | 20.86× | 17.88× | 95% | ✓ | ✓ |
| Qwen3-8B q_proj [REAL] | Intel i7 | 95% | 19.38× | 16.61× | 95% | ✓ | ✓ |
| Qwen3-8B gate_proj [REAL] | Intel i7 | 95% | 18.05× | 15.14× | 95% | ✓ | ✓ |
| Qwen3-8B avg · 7 layer types · 70–95% sparsity · 28/28 PASS | | | | | | | |
| Qwen3-8B avg all layers [REAL] | Intel i7 | 70–95% | 8.59× | — | 84% | 28/28 | 28/28 |
| Combined: 56/56 PASS · two model families · same Intel i7 laptop | | | | | | | |
| AMD EPYC 7B13 synthetic | EPYC | 90% | 8.5× | — | 89% | ✓ | ✓ |
Intel i7 laptop (4 cores, 68GB RAM) · Mistral-7B + Qwen3-8B real HuggingFace weights · MKL baseline · Speedup includes ROLV build time · 56/56 PASS · 56/56 perturbation PASS · 1,000 iters · ATOL=0.05
| Hardware | Matrix | Sparsity | Vendor sparse ms | ROLV ms | Speedup | PASS |
|---|---|---|---|---|---|---|
| NVIDIA H200 | LLaMA up_proj | 80% | 5.90 | 0.619 | 9.53× | ✓ |
| NVIDIA H200 | LLaMA up_proj | 90% | 3.01 | 0.348 | 8.66× | ✓ |
| NVIDIA B200 | Mixtral-8×7B MoE | 75% | 25.65 | 0.234 | 109× | ✓ |
| NVIDIA B200 | Llama-4-Scout MoE | 94% | 9.14 | 0.088 | 103× | ✓ |
| NVIDIA B200 | 10k×10k synthetic | 70% | 4.31 | 0.36 | 12.06× | ✓ |
| AMD MI300X | 10k×10k synthetic | 85% | 74.27 | 0.89 | 83.77× | ✓ |
| Intel i7 CPU | Mistral-7B q_proj | 95% | 66.4 | 3.18 | 14.01× | ✓ |
cuSPARSE is NVIDIA’s own sparse library, tuned by hundreds of engineers. ROLV beats it in every benchmark above because a dense matmul on the small live submatrix outperforms CSR index lookups for LLM weight patterns. The AMD MI300X row uses rocSPARSE, which has a known performance regression at high sparsity; the rocBLAS comparison (8.5×) is also published.
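The ROLV kernel itself is not reproduced here, but the principle — a dense GEMM on the live submatrix versus an index-driven CSR multiply — can be sketched in NumPy/SciPy with illustrative dimensions (everything below is a stand-in, not ROLV code):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
m, k, n, sparsity = 2048, 2048, 512, 0.95

# Weight matrix whose zero rows are known ahead of time
# (stand-in for architecturally inactive expert rows).
W = rng.standard_normal((m, k)).astype(np.float32)
dead_rows = rng.choice(m, size=int(m * sparsity), replace=False)
W[dead_rows, :] = 0.0
X = rng.standard_normal((k, n)).astype(np.float32)

# CSR path: per-element index lookups drive the multiply.
Y_csr = csr_matrix(W) @ X

# Submatrix path: gather the live rows once, run one dense GEMM
# on the small submatrix, scatter back. Same output, no per-element
# index traffic in the inner loop.
live = np.flatnonzero(np.abs(W).sum(axis=1))
Y_sub = np.zeros((m, n), dtype=np.float32)
Y_sub[live] = W[live] @ X

assert np.allclose(Y_csr, Y_sub, atol=1e-2)
```

The dense-submatrix path trades index bookkeeping for one contiguous GEMM, which is exactly the regime where vendor dense kernels excel.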
Three tools to quantify ROLV’s impact on your infrastructure.
The ROLV Unit™ is a normalised measure of compute efficiency that accounts for sparsity. Unlike TFLOPS (which measures peak theoretical throughput) or tokens/s (which conflates hardware and software), the ROLV Unit measures useful compute — work done on non-zero elements only.
1 ROLV Unit = 1 TFLOP of compute on live (non-zero) matrix elements per second, at full precision, verified by SHA-256 hash.
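Assuming that definition, converting a measured run into ROLV Units is simple arithmetic; a minimal sketch (the function name and example figures are illustrative):

```python
def rolv_units(nnz: int, batch_cols: int, seconds: float) -> float:
    """ROLV Units: TFLOPs of useful work per second, counting only
    non-zero weight elements. Each non-zero contributes one multiply
    and one add per output column (2 FLOPs)."""
    useful_flops = 2.0 * nnz * batch_cols
    return useful_flops / seconds / 1e12

# Example: a 10k x 10k layer at 95% sparsity (5M non-zeros),
# a batch of 512 columns, completing in 0.5 ms.
nnz = int(10_000 * 10_000 * 0.05)
print(rolv_units(nnz, batch_cols=512, seconds=0.5e-3))  # ≈ 10.24 ROLV Units
```

Note that dense TFLOPS would credit the same run with 20× more "work"; the ROLV Unit deliberately refuses to count multiplications by zero.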
ROLVswitch™ finds the exact sparsity at which ROLV beats dense execution, and checks whether your matrix fits in VRAM. Enter your matrix dimensions and hardware to get the switch point and a memory analysis.
RSMT™ finds the exact sparsity threshold where sparse storage beats dense for your dtype. Below the threshold, dense storage wins on memory. Above it, sparse wins — and ROLV wins on compute too.
The crossover point depends entirely on your dtype. With bfloat16 values (2 bytes) and int32 indices (4 bytes), sparse format costs 3× as many bytes per non-zero as dense (6 bytes vs 2). Sparse wins only when you have enough zeros to overcome the index overhead.
RSMT™ is computed analytically — no approximation. The crossover is mathematically exact for any dtype combination.
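The crossover follows directly from the byte counts: dense spends value-bytes per element, while a COO/CSR-style format spends value-bytes plus index-bytes per non-zero, so sparse wins on memory when density falls below value_bytes / (value_bytes + index_bytes). A sketch of that derivation (ignoring the small CSR row-pointer overhead, which the shipped RSMT™ tool may treat more precisely):

```python
def rsmt_threshold(value_bytes: int, index_bytes: int) -> float:
    """Minimum sparsity at which per-element sparse storage beats dense.
    Dense stores value_bytes per element; sparse stores
    (value_bytes + index_bytes) per non-zero. Sparse wins when
        nnz * (value_bytes + index_bytes) < total * value_bytes,
    i.e. density < value_bytes / (value_bytes + index_bytes)."""
    density_crossover = value_bytes / (value_bytes + index_bytes)
    return 1.0 - density_crossover

# bfloat16 values (2 bytes) with int32 indices (4 bytes): each
# non-zero costs 6 bytes vs 2 dense, so sparsity must exceed 1 - 2/6.
print(f"{rsmt_threshold(2, 4):.1%}")  # bf16 + int32 -> 66.7%
print(f"{rsmt_threshold(4, 4):.1%}")  # fp32 + int32 -> 50.0%
```

Wider value dtypes lower the threshold: the heavier each stored value, the cheaper the relative cost of carrying its index.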
4 SHA-256 hashes per case. Perturbation test on every result. ATOL=0.05 on column-normalised fp64. 1,684/1,684 GPU PASS · 56/56 CPU PASS. Download the full validation kit with harness code, raw outputs, and reproduction instructions.
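The full harness ships in the validation kit; as a sketch of what a hash-plus-perturbation gate can look like under the stated tolerances (function names and the candidate path below are illustrative, not the kit's actual code):

```python
import hashlib
import numpy as np

def column_normalised(Y):
    """Promote to fp64 and scale each column to unit L2 norm, so the
    ATOL=0.05 comparison is independent of output magnitude."""
    Y64 = np.asarray(Y, dtype=np.float64)
    norms = np.linalg.norm(Y64, axis=0)
    return Y64 / np.where(norms == 0.0, 1.0, norms)

def digest(Y):
    """SHA-256 over the column-normalised result, rounded so that
    numerically identical runs hash identically."""
    return hashlib.sha256(np.round(column_normalised(Y), 6).tobytes()).hexdigest()

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256))
W[rng.random(256) < 0.9, :] = 0.0            # ~90% of rows architecturally zero
X = rng.standard_normal((256, 64))

ref = W @ X                                  # baseline dense path

# Candidate path: gather live rows, dense GEMM on the submatrix.
live = np.flatnonzero(np.any(W != 0.0, axis=1))
cand = np.zeros_like(ref)
cand[live] = W[live] @ X

# Accuracy gate: ATOL=0.05 on column-normalised fp64 results.
assert np.allclose(column_normalised(cand), column_normalised(ref), atol=0.05)

# Perturbation gate: a genuinely different input must change the hash,
# proving the hash tracks the computed values rather than metadata.
assert digest(ref) != digest(W @ (X + 1e-3))
```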
Born in Norway. Built companies across Europe and the United States. In May 2025, during a bike ride in Fort Lauderdale, he asked whether AI matrix operations could be made dramatically faster — and refused to stop until they were. Six months later, ROLV Primitive© was independently validated by the University of Miami. Three patents pending.
“Imagination is the only limitation to innovation.”