NVIDIA · AMD · Intel · ARM · Apple · Google TPU · Custom ASICs · FPGAs · Photonic · Quantum · Any hardware that does matrix multiply.
The left panel runs standard matrix multiply in your browser — your actual hardware. The right panel runs ROLV on our server with identical inputs. Both signed and explained.
Picture a container ship crossing the Pacific. It carries 20,000 containers. The manifest says 5,000 of them are empty — have always been empty, will be empty on arrival. But the ship cannot leave them behind. Its loading system was built decades ago and it can only operate one way: load everything, sail everything, unload everything.
It burns fuel proportional to its total cargo — including the 5,000 empty containers. The crew works proportional to total cargo. The port fees are proportional to total cargo. Every crossing. Every time.
This is what cuBLAS does with MoE inference. The empty containers are the inactive experts — architecturally zero, guaranteed by the router, known before the computation starts. cuBLAS has no mechanism to leave them on the dock. It computes all of them, every token, every layer, every inference call.
ROLV Primitive© is the loading system that reads the manifest first. It identifies the empty containers before departure. It sails only what carries cargo. Same destination. Same output. A fraction of the fuel.
Every frontier model crossing the Pacific today carries empty containers. ROLV leaves them on the dock.
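The container-ship idea maps directly onto MoE math. Below is a minimal NumPy sketch (not ROLV's kernel, and the routing setup is illustrative) contrasting the "load everything" approach, which multiplies every token through every expert and masks afterwards, with the "read the manifest first" approach, which multiplies only the tokens each expert actually receives. Both produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, d_ff, n_tokens, top_k = 8, 64, 256, 16, 2

# Per-expert weight matrices and a router assignment (top-k experts per token).
W = rng.standard_normal((n_experts, d_model, d_ff))
x = rng.standard_normal((n_tokens, d_model))
routes = np.stack([rng.choice(n_experts, size=top_k, replace=False)
                   for _ in range(n_tokens)])

# "Load everything": run every expert on every token, then mask the inactive ones.
dense_out = np.zeros((n_tokens, d_ff))
for e in range(n_experts):
    mask = (routes == e).any(axis=1)        # tokens that actually use expert e
    everything = x @ W[e]                   # computed for ALL tokens regardless
    dense_out += everything * mask[:, None] * (1.0 / top_k)

# "Read the manifest first": multiply only the tokens routed to each expert.
sparse_out = np.zeros((n_tokens, d_ff))
for e in range(n_experts):
    idx = np.flatnonzero((routes == e).any(axis=1))
    if idx.size:
        sparse_out[idx] += (x[idx] @ W[e]) * (1.0 / top_k)

assert np.allclose(dense_out, sparse_out)   # same destination, same output
```

With top-2 routing over 8 experts, the second loop touches roughly a quarter of the token-expert pairs the first loop does; the gap widens as expert count grows.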
| Model | Source | Natural sparsity | vs cuBLAS | vs cuSPARSE | Energy saved | Tokens/s | PASS |
|---|---|---|---|---|---|---|---|
| Mixtral-8×7B | REAL | 75.0% | 1.86× | 109× | 46% | 2,185,075 | ✓ |
| Mixtral-8×22B | synth | 75.0% | 2.43× | 107× | 59% | 1,073,568 | ✓ |
| Qwen2-57B-A14B | synth | 87.5% | 3.37× | 70× | 70% | 2,374,040 | ✓ |
| Qwen3-30B-A3B | REAL | 93.8% | 3.43× | 32× | 71% | 6,650,774 | ✓ |
| Llama-4-Scout ★ | REAL | 93.8% | 4.75× | 103× | 79% | 5,795,875 | ✓ |
| DeepSeek-V3/R1 | synth | 96.9% | 8.76× | 110× | 89% | 1,758,046 | ✓ |
NVIDIA B200 · BF16 · TF32 ON · 1,000 iters · ATOL=0.05 col-norm fp64 · 4 SHA-256 hashes + perturbation PASS
| Model / Layer | GPU | Sparsity | vs cuBLAS | vs vendor sparse | PASS |
|---|---|---|---|---|---|
| LLaMA-3.1-8B up_proj [REAL] | H200 | 80% | 2.17× | 9.53× | ✓ |
| LLaMA-3.1-8B up_proj [REAL] | H200 | 90% | 2.79× | 8.66× | ✓ |
| DeepSeek-R1 embed [REAL] | B200 | 95% | 19.42× | 19.42× | ✓ |
| 10k×10k synthetic | B200 | 70% | 3.11× | 12.06× | ✓ |
| 10k×10k synthetic | MI300X | 85% | 8.5× | 83.77× | ✓ |
| Tesla T4 synthetic | T4 | 90% | 5.8× | 14.2× | ✓ |
1,684/1,684 total PASS · BF16 · TF32 ON · ATOL=0.05 · AMD MI300X: rocBLAS 8.5× (rocSPARSE has known regression at high sparsity)
| Model / Layer | CPU | Sparsity | vs MKL (per-iter) | vs MKL (incl. ROLV build) | Energy saved | PASS |
|---|---|---|---|---|---|---|
| Mistral-7B q_proj [REAL] | Intel i7 | 95% | 21.45× | 18.58× | 95% | ✓ |
| Qwen3-8B down_proj [REAL] ★ | Intel i7 | 95% | 20.86× | 17.88× | 95% | ✓ |
| Gemma4-E4B up_proj [REAL] ★ | Intel i7 | 95% | 19.56× | 17.29× | 95% | ✓ |
| Llama-3.1-8B q_proj [REAL] ★ | Intel i7 | 95% | 24.44× | 22.20× | 96% | ✓ |
| Qwen2.5-7B gate_proj [REAL] ★ | Intel i7 | 95% | 59.70× | — | 98% | ✓ |
| SmolLM2-1.7B · Qwen2.5-1.5B · Llama-3.2-1B on Colab Xeon · 125/125 PASS at 70–99% induced sparsity | ||||||
| SmolLM2-1.7B gate_proj [REAL] | Xeon Colab | 95% | 27.26× | — | 96% | ✓ |
| Llama-3.2-1B down_proj [REAL] ★ PEAK | Intel i7 | 99% | 106.65× | 9.07× | 99% | ✓ |
| TOTAL CPU: 9 models · 332/332 PASS · Avg 7.37× · Peak 106.65× | ||||||
Intel i7 laptop (4 cores, 68GB RAM) · Mistral-7B + Qwen3-8B + Gemma4-E4B + Phi-4 + DeepSeek-R1-7B + Qwen2.5-7B + Llama-3.2-3B + Llama-3.1-8B + Gemma-2-2B real HuggingFace weights · MKL baseline · Speedup includes ROLV build time · 252/252 PASS (i7) + 125/125 PASS (Colab Xeon wheel, 5-level) = 377/377 total · 377/377 perturbation PASS · 1,000 iters · ATOL=0.05
| Hardware | Matrix | Sparsity | cuSPARSE ms | ROLV ms | ROLV wins | PASS |
|---|---|---|---|---|---|---|
| NVIDIA H200 | LLaMA up_proj | 80% | 5.90 | 0.619 | 9.53× | ✓ |
| NVIDIA H200 | LLaMA up_proj | 90% | 3.01 | 0.348 | 8.66× | ✓ |
| NVIDIA B200 | Mixtral-8×7B MoE | 75% | 25.65 | 0.234 | 109× | ✓ |
| NVIDIA B200 | Llama-4-Scout MoE | 94% | 9.14 | 0.088 | 103× | ✓ |
| NVIDIA B200 | 10k×10k synthetic | 70% | 4.31 | 0.36 | 12.06× | ✓ |
| AMD MI300X | 10k×10k synthetic | 85% | 74.27 | 0.89 | 83.77× | ✓ |
| Intel i7 CPU | Mistral-7B q_proj | 95% | 66.4 | 3.18 | 20.88× | ✓ |
cuSPARSE is NVIDIA’s own sparse library — tuned by hundreds of engineers. ROLV beats it everywhere because dense matmul on a small submatrix outperforms CSR index lookups for LLM weight patterns. AMD MI300X uses rocSPARSE which has a known performance regression at high sparsity — rocBLAS 8.5× comparison also published.
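The exact packing ROLV uses is not published; the general gather-then-dense idea can be sketched in NumPy under the assumption of row-structured sparsity (whole rows dead, as in pruned LLM weights). Instead of walking CSR index arrays per non-zero, gather the live rows once and run a single dense GEMM on the compact submatrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m, sparsity = 4096, 4096, 64, 0.9

W = rng.standard_normal((n, k))
dead = rng.random(n) < sparsity      # ~90% of rows are architecturally zero
W[dead] = 0.0
x = rng.standard_normal((k, m))

# Gather-then-dense: one dense GEMM over only the live rows.
live = np.flatnonzero(~dead)
y = np.zeros((n, m))
y[live] = W[live] @ x                # dense matmul on ~10% of the rows

assert np.allclose(y, W @ x)         # identical to the full dense product
```

The gathered GEMM keeps contiguous memory access and full use of dense matmul hardware, which is exactly what per-element CSR traversal gives up.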
Quantify ROLV's impact on your infrastructure. The two primary calculators below cover capital and operating expense; below them, three advanced tools for deeper analysis.
The ROLV Unit™ is a normalised measure of compute efficiency that accounts for sparsity. Unlike TFLOPS (which measures peak theoretical throughput) or tokens/s (which conflates hardware and software), the ROLV Unit measures useful compute — work done on non-zero elements only.
1 ROLV Unit = 1 TFLOP of compute on live (non-zero) matrix elements per second, at full precision, verified by SHA-256 hash.
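Applying that definition is a one-liner. The helper below is hypothetical (the name `rolv_units` and its signature are ours, and the SHA-256 verification step is omitted): for C = A @ B with a sparse A, each non-zero of A contributes one multiply and one add per column of B, and only those FLOPs count as useful.

```python
def rolv_units(nnz_a: int, n_cols_b: int, seconds: float) -> float:
    """ROLV Units for C = A @ B: TFLOP/s counted on non-zero elements only."""
    useful_flops = 2 * nnz_a * n_cols_b   # one multiply-add per nnz per output column
    return useful_flops / seconds / 1e12

# 250 B non-zeros, 2 output columns, 1 second -> 1e12 useful FLOP/s = 1 ROLV Unit.
assert rolv_units(250_000_000_000, 2, 1.0) == 1.0
```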
ROLVswitch™ finds the exact sparsity where ROLV beats dense, and whether your matrix fits in VRAM.
RSMT™ finds the exact sparsity threshold where sparse storage beats dense for your dtype.
The crossover point depends entirely on your dtype. With bfloat16 (2 bytes) and int32 indices (4 bytes), sparse format costs 3× more bytes per non-zero than dense. Sparse wins only when you have enough zeros to overcome the index overhead.
RSMT™ is computed analytically — no approximation.
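The analytic form follows from the byte counts above. A minimal sketch (our own simplification of the same argument, ignoring CSR's row-pointer array): dense storage costs `value_bytes` per element, sparse costs `value_bytes + index_bytes` per non-zero, so sparse wins below a density of `value_bytes / (value_bytes + index_bytes)`.

```python
def rsmt_crossover(value_bytes: int = 2, index_bytes: int = 4) -> float:
    """Density below which per-nnz sparse storage beats dense.

    Dense:  value_bytes per element.
    Sparse: (value_bytes + index_bytes) per non-zero.
    Sparse wins when density * (value_bytes + index_bytes) < value_bytes.
    """
    return value_bytes / (value_bytes + index_bytes)

# bfloat16 values + int32 indices: sparse pays 3x per stored element,
# so it only wins below 1/3 density, i.e. above ~66.7% zeros.
assert abs(rsmt_crossover(2, 4) - 1 / 3) < 1e-12
```

With fp32 values and int32 indices the crossover rises to 0.5, which is why the threshold depends entirely on dtype.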
Two deployment tiers for serious evaluation on your own models, your own data, your own processors. If you just want to see ROLV working end-to-end first, the live benchmark above runs in under two minutes with no install. All enterprise runs are RolvKey™-signed — SHA-256 over your speedup, processor fingerprint, and a time-bounded attestation.
Evaluation licence + NDA. Container binds to your processor fingerprint at first run — will not execute on any other machine. Optional Intel SGX hardware encryption for regulated environments.
Contact rolv@rolv.ai →

Bare-metal servers and air-gapped environments where Docker is not permitted. Processor-bound binary with live heartbeat attestation. Evaluation licence + NDA required.
Contact rolv@rolv.ai →

In building the secure distribution system for ROLV Primitive© we developed a novel software protection architecture that we believe has standalone commercial value entirely apart from ROLV itself.
RolvKey™ uses a proprietary multi-layer mathematical key derivation system. Every key exchange is unique and time-bounded to a window of seconds. A captured response is worthless moments later. An attacker who somehow breaks the first layer immediately faces a second independent layer, then a third — each seeded with a completely different secret.
The only viable attack requires simultaneously compromising multiple independent systems within a narrow time window. For any commercial adversary, that is not a realistic attack.
Every software company shipping proprietary compiled code faces the same distribution security problem. Current solutions — hardware dongles, standard license servers, code obfuscation — have well-documented weaknesses. The academic literature identified this specific application — software distribution key management and API attestation — as commercially unsolved. RolvKey™ addresses it.
RolvKey™ is protecting ROLV Primitive© today. Every Docker container download, every key exchange, every benchmark run on every machine worldwide is secured by this system. It has been exercised thousands of times in production.
Licensing and partnership enquiries: rolv@rolv.ai
4 SHA-256 hashes per case. Perturbation test on every result. ATOL=0.05 on column-normalised fp64. 1,684/1,684 GPU PASS · 332/332 CPU PASS. Download the full validation kit with harness code, raw outputs, and reproduction instructions.
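The full harness is in the validation kit; a minimal sketch of one check, under our own simplifying assumptions (a single result hash rather than the four per case, and our reading of "column-normalised fp64"), looks like this: promote both results to fp64, scale the absolute error by each reference column's norm, compare against ATOL, and hash the output bytes.

```python
import hashlib
import numpy as np

def verify(ref: np.ndarray, out: np.ndarray, atol: float = 0.05):
    """Column-normalised fp64 tolerance check plus a SHA-256 result hash."""
    ref64, out64 = ref.astype(np.float64), out.astype(np.float64)
    scale = np.linalg.norm(ref64, axis=0)           # per-column norms
    scale[scale == 0] = 1.0                         # avoid divide-by-zero columns
    err = np.abs(out64 - ref64) / scale             # normalise error per column
    ok = bool(np.all(err <= atol))
    digest = hashlib.sha256(np.ascontiguousarray(out64).tobytes()).hexdigest()
    return ok, digest

ref = np.random.default_rng(2).standard_normal((128, 64)).astype(np.float32)
ok, h = verify(ref, ref + 1e-4)                     # tiny perturbation passes
assert ok and len(h) == 64
```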