Tested on NVIDIA H200, NVIDIA B200, NVIDIA Tesla T4, an Intel CPU, and an AMD EPYC across 22 sparsity levels, on real LLaMA-3.1-8B weights and real production weight matrices. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU and beats MKL from 0% sparsity on CPU. Energy is measured directly via pynvml on GPU.
ROLV Primitive© is a software operator that restructures matrix arithmetic to skip zero-valued multiply-accumulate operations. At high sparsity levels — where 90% or more of a weight matrix is zero — this approach delivers substantial reductions in compute time and energy consumption.
Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.
No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.
Fewer operations means less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
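The row-skipping idea can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the ROLV implementation; the function name and shapes are assumptions for the example:

```python
import numpy as np

def row_skipping_matmul(W, x):
    """Illustrative sketch: compute only the rows of W that contain
    any nonzero, and leave the rest of the output at zero.
    W: (m, n) weights, x: (n, b) activations."""
    live = np.abs(W).max(axis=1) > 0          # mask of non-zero rows
    y = np.zeros((W.shape[0], x.shape[1]), dtype=x.dtype)
    y[live] = W[live] @ x                     # dense GEMM on live rows only
    return y

# 90% row sparsity: only 1 row in 10 does any work
rng = np.random.default_rng(0)
W = np.zeros((100, 64), dtype=np.float32)
W[::10] = rng.standard_normal((10, 64)).astype(np.float32)
x = rng.standard_normal((64, 8)).astype(np.float32)
assert np.allclose(row_skipping_matmul(W, x), W @ x)
```

The skipped multiply-accumulates are never issued at all, which is why both time and energy fall with sparsity.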
Tested on NVIDIA H200, NVIDIA B200, an Intel CPU, and an AMD EPYC 7B13 across 22 sparsity levels each, on real LLaMA-3.1-8B weights from HuggingFace and 4 real production LLM weight-matrix pairs on Tesla T4. Each result includes 4 SHA-256 hashes and a perturbation test. Energy is measured directly via pynvml on GPU and via a proxy on CPU.
MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B, downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶, roughly 250× tighter than ATOL=0.001. All four perturbation tests pass.
| Sparsity | Live rows | Compr. | Best vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy saved† | Pass |
|---|---|---|---|---|---|---|---|---|
| 80% | 2,867 | 5× | 5.8984 cuSPARSE | 0.6190 | 9.53× | 4.01× | +89.5% | ✓ |
| 90% | 1,434 | 10× | 3.0077 cuSPARSE | 0.3475 | 8.66× | 7.14× | +88.4% | ✓ |
| 95% ★ | 717 | 20× | 1.5547 cuSPARSE | 0.2265 | 6.86× | 10.96× | +85.4% | ✓ |
| 99% | 143 | 100× | 0.4415 cuSPARSE | 0.1720 | 2.57× | 14.43× | +61.0% | ✓ |
★ = best ratio vs dense. † = energy estimated via time-ratio proxy (pynvml unavailable in this run). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · CRCS · 4/4 perturbation PASS
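The magnitude row pruning used above can be sketched as follows. This is an illustrative reimplementation under an assumed L2-norm ranking, not the exact pruning code used in the benchmarks; the matrix here is a scaled-down stand-in for the 14336×4096 up_proj layer:

```python
import numpy as np

def magnitude_row_prune(W, sparsity):
    """Zero the lowest-magnitude (L2 norm) rows of W until the given
    fraction of rows is all-zero. Returns a pruned copy."""
    n_drop = int(round(W.shape[0] * sparsity))
    drop = np.argsort(np.linalg.norm(W, axis=1))[:n_drop]
    Wp = W.copy()
    Wp[drop] = 0.0
    return Wp

# scaled-down stand-in for the up_proj layer
rng = np.random.default_rng(0)
W = rng.standard_normal((896, 256)).astype(np.float32)
Wp = magnitude_row_prune(W, 0.95)
live = int(np.count_nonzero(np.abs(Wp).max(axis=1)))
print(live)  # 45 live rows out of 896
```

The surviving rows are copied verbatim, so all downstream error comes from the zeroed rows, not from the pruning bookkeeping.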
Tested on LLaMA-2-7B (NeuralMagic 50% pre-pruned variants) and LLaMA-3.1-8B across 4 layers and 6 sparsity levels each. The pre-pruned models from NeuralMagic use SparseGPT with knowledge distillation and recover full accuracy. All 72 test cases PASS. Below 70% sparsity the baseline is cuBLAS; above 70%, cuSPARSE.
| Model | Layer | Sparsity | Baseline | Speedup | Energy saved | Pass |
|---|---|---|---|---|---|---|
| LLaMA-2-7B (GSM8K, 50% pre-pruned) | embed_tokens | 70% | cuSPARSE | 15.85× | +100% | ✓ |
| LLaMA-2-7B (GSM8K, 50% pre-pruned) | v_proj | 95% | cuSPARSE | 8.05× | +87.6% | ✓ |
| LLaMA-2-7B (Dolphin, 50% pre-pruned) | embed_tokens | 70% | cuSPARSE | 15.75× | +100% | ✓ |
| LLaMA-3.1-8B | embed_tokens | 95% ★ | cuSPARSE | 18.76× | +100% | ✓ |
| LLaMA-3.1-8B | q_proj | 80% | cuSPARSE | 6.81× | +85.3% | ✓ |
| LLaMA-3.1-8B | k_proj | 70% | cuSPARSE | 3.79× | +73.6% | ✓ |
★ = peak speedup. cuSPARSE is the correct baseline above 70% sparsity. Pre-pruned models: NeuralMagic SparseGPT with knowledge distillation, full accuracy recovery. NVIDIA H200 · 72/72 correctness PASS · 72/72 perturbation PASS · 4 SHA-256 hashes per case.
22 sparsity levels. Below 70% sparsity the baseline is cuBLAS; above 70%, cuSPARSE (the correct baseline, and what engineers actually use at high sparsity). The hybrid operator auto-calibrates per level. Energy via pynvml.
| Sparsity | Baseline | Vendor ms | ROLV ms | Speedup | Energy saved | Pass |
|---|---|---|---|---|---|---|
| 0% ← | cuBLAS | 10.26 | 10.26 | 1.00× | +1.8% | ✓ |
| 5% | cuBLAS | 10.26 | 10.03 | 1.02× | +2.5% | ✓ |
| 50% | cuBLAS | 10.26 | 5.57 | 1.84× | +46.6% | ✓ |
| 70% | cuSPARSE | 45.37 | 3.41 | 13.29× | +93.4% | ✓ |
| 80% ★ | cuSPARSE | 30.33 | 2.22 | 13.64× | +93.5% | ✓ |
| 90% | cuSPARSE | 15.23 | 1.16 | 13.18× | +94.2% | ✓ |
| 97% | cuSPARSE | 4.65 | 0.388 | 12.00× | +100% | ✓ |
| 99.9% | cuSPARSE | 0.381 | 0.114 | 3.33× | +70.0% | ✓ |
★ = peak speedup. ← = crossover at 0% sparsity (ROLV matches cuBLAS). cuSPARSE is the baseline above 70% sparsity. A hash: b2687223 · V hash: f8b47533 · 20/22 perturbation PASS
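The auto-calibrated hybrid can be sketched as a simple crossover dispatch. The 70% switch point is taken from the table above; in practice ROLV calibrates it per level, and the function names here are illustrative, not the ROLV API:

```python
import numpy as np

CROSSOVER = 0.70  # dense path below, row-compressed path above

def row_sparsity(W):
    """Fraction of rows that are entirely zero."""
    return 1.0 - np.count_nonzero(np.abs(W).max(axis=1)) / W.shape[0]

def hybrid_matmul(W, x):
    """Dense GEMM at low sparsity (cuBLAS regime), row-compressed
    path at high sparsity (cuSPARSE regime)."""
    if row_sparsity(W) < CROSSOVER:
        return W @ x                          # dense path
    live = np.abs(W).max(axis=1) > 0
    y = np.zeros((W.shape[0], x.shape[1]), dtype=x.dtype)
    y[live] = W[live] @ x                     # compressed path
    return y
```

Both branches return the same result; only the amount of work performed differs, which is why the operator can track whichever vendor baseline is faster at each sparsity level.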
ROLV stores only the live rows. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from CRCS build.
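The 100× figure follows directly from the live-row count. A back-of-envelope check, assuming fp32 weights and ignoring index overhead:

```python
rows, cols, fp32 = 14336, 4096, 4          # up_proj shape, bytes per value
dense_mb = rows * cols * fp32 / 1e6        # full dense matrix
live_rows = 143                            # 99% row sparsity (table above)
compressed_mb = live_rows * cols * fp32 / 1e6
print(round(dense_mb), round(compressed_mb, 2), round(dense_mb / compressed_mb))
# 235 2.34 100
```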
The operator is built once from a weight matrix and then used repeatedly for inference. Build time is amortised across thousands of inference calls.
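The build/run split can be sketched as below. The class and attribute names are hypothetical; the real ROLV/CRCS build is not shown. The point is that the row scan and gather are paid once, while each inference call only touches the compressed view:

```python
import numpy as np

class RowCompressedOp:
    """Illustrative build-once operator (hypothetical name): the row
    scan and gather happen at build time; calls reuse the result."""
    def __init__(self, W):
        self.live = np.flatnonzero(np.abs(W).max(axis=1))   # build step
        self.W_live = np.ascontiguousarray(W[self.live])    # compressed copy
        self.m = W.shape[0]

    def __call__(self, x):
        y = np.zeros((self.m, x.shape[1]), dtype=x.dtype)
        y[self.live] = self.W_live @ x                      # per-call work
        return y

# build once...
rng = np.random.default_rng(0)
W = np.zeros((200, 64), dtype=np.float32)
W[::20] = rng.standard_normal((10, 64)).astype(np.float32)
op = RowCompressedOp(W)
# ...then reuse across many inference calls (3 stands in for thousands)
for _ in range(3):
    x = rng.standard_normal((64, 4)).astype(np.float32)
    assert np.allclose(op(x), W @ x)
```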
The ROLV Primitive© is exact on its compressed submatrix: the operator itself introduces no approximation. The only source of output error is pruning, which zeroes low-magnitude rows before the operator is built.
This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.
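That split, exact compressed arithmetic with pruning as the only error source, can be verified directly. A sketch with assumed shapes; ATOL=0.001 is the tolerance quoted in the LLaMA results above:

```python
import numpy as np

ATOL = 1e-3  # tolerance budget quoted above
rng = np.random.default_rng(0)

Wp = rng.standard_normal((256, 64)).astype(np.float32)
Wp[np.argsort(np.linalg.norm(Wp, axis=1))[:230]] = 0.0  # ~90% rows pruned

x = rng.standard_normal((64, 8)).astype(np.float32)

# compressed path on the pruned weights
live = np.abs(Wp).max(axis=1) > 0
y = np.zeros((256, 8), dtype=np.float32)
y[live] = Wp[live] @ x

# vs. dense matmul on the SAME pruned weights: any difference is
# floating-point noise, far inside the budget; the compression
# itself adds no approximation
err = float(np.abs(Wp @ x - y).max())
assert err <= ATOL
```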
Finds the exact sparsity threshold at which sparse storage beats dense for your dtype.
Finds the exact sparsity level at which the vendor dense path first hits VRAM congestion: your switch point to sparse.
For technical enquiries, access to benchmark data, or discussions about the technology, please reach out directly.