rolvsparse© beats cuBLAS and cuSPARSE across the full 80–99% sparsity range on real LLaMA-3.1-8B weight matrices. Every result is correctness-verified to max error 3.9×10⁻⁶ — 250× tighter than the standard ATOL=0.001 threshold. Software-only. No hardware changes.
rolvsparse© is a software operator that restructures matrix arithmetic to skip zero-valued multiply-accumulate operations. At high sparsity levels — where 90% or more of a weight matrix is zero — this approach delivers substantial reductions in compute time and energy consumption.
Works best when matrices are genuinely sparse. At 90%+ sparsity, the operator skips the vast majority of multiply-accumulate operations — the work simply does not happen.
No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.
Fewer operations mean less energy. At 90%+ sparsity, energy savings scale in proportion to the work eliminated, a direct consequence of doing less arithmetic.
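The core idea above can be sketched in a few lines of NumPy. This is an illustration of the general row-pruned pattern, not the rolvsparse© implementation: only rows with nonzero data are ever multiplied, so the zero-row multiply-accumulates simply never run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ~90%-row-sparse weight matrix (illustration only).
W = rng.standard_normal((100, 64)).astype(np.float32)
W[rng.random(100) < 0.9] = 0.0                 # zero out ~90% of rows

live = np.flatnonzero(np.abs(W).sum(axis=1))   # rows with any nonzero entry
x = rng.standard_normal(64).astype(np.float32)

# Dense path: every multiply-accumulate runs, zeros included.
y_dense = W @ x

# Compressed path: only live rows are computed; pruned rows stay zero.
y = np.zeros(100, dtype=np.float32)
y[live] = W[live] @ x

assert np.allclose(y, y_dense)
print(f"rows computed: {live.size}/100")
```

The two paths produce the same output; the compressed path just performs a fraction of the arithmetic.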
The operator is built once from a weight matrix and then used repeatedly for inference. Build time is amortised across thousands of inference calls.
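The build-once, apply-many pattern looks roughly like the following sketch. The class name and structure are hypothetical, not the actual CRCS operator; the point is that the compression cost is paid once at construction while every subsequent call scales with live rows only.

```python
import numpy as np

class CompressedRowOp:
    """Illustrative build-once operator (hypothetical sketch,
    not the CRCS implementation)."""

    def __init__(self, W):
        # One-time build: find live rows and keep only their data.
        self.n_rows = W.shape[0]
        self.live = np.flatnonzero(np.abs(W).sum(axis=1))
        self.data = np.ascontiguousarray(W[self.live])

    def __call__(self, x):
        # Per-call cost scales with live rows, not total rows.
        y = np.zeros((self.n_rows,) + x.shape[1:], dtype=self.data.dtype)
        y[self.live] = self.data @ x
        return y

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 128)).astype(np.float32)
W[rng.random(256) < 0.9] = 0.0

op = CompressedRowOp(W)            # built once
for _ in range(3):                 # reused across many inference calls
    x = rng.standard_normal(128).astype(np.float32)
    assert np.allclose(op(x), W @ x)
```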
On correctness: The CRCS operator is exact on its compressed submatrix — no approximation is introduced by the operator itself. The only source of output error is pruning, which zeroes low-magnitude rows before the operator is built. On live rows, measured max error is 3.9×10⁻⁶ — 250× tighter than the standard ATOL=0.001 threshold. All published results include correctness metrics and four SHA-256 hashes for independent verification.
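The measurement described above can be reproduced in spirit with a high-precision reference: compute the same product in fp64, then take the max absolute error over live rows only. This sketch uses a plain NumPy matmul as a stand-in for the operator; the published harness details are the author's.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((200, 128)).astype(np.float32)
pruned = rng.random(200) < 0.9
W[pruned] = 0.0                    # stand-in for magnitude row pruning

X = rng.standard_normal((128, 32)).astype(np.float32)
ref = W.astype(np.float64) @ X.astype(np.float64)   # high-precision reference
out = W @ X                                         # fp32 operator output

# Measure error on live rows only; pruned rows are zero by design.
live = ~pruned
max_err = np.abs(out[live] - ref[live]).max()
assert max_err < 1e-3              # the standard ATOL threshold
print(f"max error on live rows: {max_err:.2e}")
```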
All benchmarks use real weight matrices from Meta LLaMA-3.1-8B, downloaded from HuggingFace. The primary comparison is always the best available vendor operator: cuBLAS or cuSPARSE, whichever is faster at that sparsity level. Four public SHA-256 hashes per case; correctness verified on live rows to max error 3.9×10⁻⁶.
On NVIDIA H200, cuSPARSE is slower than cuBLAS at moderate sparsity, a known kernel-selection limitation on this hardware; ROLV beats whichever is faster. ROLV also saves VRAM in proportion to sparsity: at 90% sparse, the compressed representation is 10× smaller than the dense matrix, enabling larger models or larger batches on the same GPU.
| Sparsity | Best vendor | Vendor (ms) | ROLV (ms) | Speedup | Energy saved | VRAM vs dense | Max err | Pass |
|---|---|---|---|---|---|---|---|---|
| 80% | cuSPARSE | 5.8984 | 0.6190 | 9.53× | +89.5% | 5× less | 3.9×10⁻⁶ | ✓ |
| 90% | cuBLAS | 2.4821 | 0.3475 | 7.14× | +88.4% | 10× less | 3.9×10⁻⁶ | ✓ |
| 95% | cuSPARSE | 1.5547 | 0.2265 | 6.86× | +85.4% | 20× less | 3.8×10⁻⁶ | ✓ |
| 99% | cuSPARSE | 0.4415 | 0.1720 | 2.57× | +61.0% | 100× less | 3.3×10⁻⁶ | ✓ |
NVIDIA H200 · Meta LLaMA-3.1-8B · layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Event timing · CRCS strategy · best-of(cuBLAS, cuSPARSE) baseline · correctness on live rows · perturbation test passed on all 4 runs
ROLV stores only the live rows and columns of the pruned matrix. VRAM usage scales directly with sparsity: at 90% sparse, the compressed representation is 10× smaller. This enables running larger models or larger batch sizes on the same hardware, improving effective throughput per GPU dollar.
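The VRAM figures follow from simple arithmetic on the benchmarked matrix. This back-of-envelope estimate counts only the fp32 weight data for the 14336×4096 matrix and ignores metadata and activation memory:

```python
# Footprint estimate for a 14336x4096 fp32 weight matrix.
rows, cols, bytes_per_elem = 14336, 4096, 4
dense_mib = rows * cols * bytes_per_elem / 2**20     # 224 MiB

for sparsity in (0.80, 0.90, 0.95, 0.99):
    # Row-pruned storage keeps only the live fraction of the rows.
    live_mib = dense_mib * (1 - sparsity)
    print(f"{sparsity:.0%} sparse: {live_mib:6.1f} MiB "
          f"({dense_mib / live_mib:.0f}x smaller)")
```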
Row-pruned and block-sparse patterns compared against CPU-CSR (MKL-sparse). Batch=500, 1000 iterations, ATOL=0.001. Max error 2.5×10⁻⁷.
CPU · row-pruned & block-sparse · 2000×2000 · Batch=500 · 1000 iters · vs CPU-CSR (MKL-sparse) · ATOL=0.001 · 4 SHA-256 hashes per case
Methodology: Primary comparison is always the best available vendor operator — whichever of cuBLAS or cuSPARSE is faster at that sparsity level on that hardware. GPU timing uses CUDA Events (warmup=10, iters=100). Correctness is measured on live rows only in raw fp32 — pruned rows are intentionally zero by design. Four SHA-256 hashes per run. Perturbation test: modifying one live weight changes the output hash, confirming real computation. Full JSON with all hashes and per-iteration timings available on request.
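The perturbation test described above can be sketched as follows. The hashing helper is hypothetical, but the principle matches the methodology: fingerprint the raw output buffer with SHA-256, flip one live weight, and require the fingerprint to change.

```python
import hashlib
import numpy as np

def output_hash(W, X):
    # SHA-256 fingerprint of the raw fp32 output buffer.
    return hashlib.sha256(np.ascontiguousarray(W @ X).tobytes()).hexdigest()

rng = np.random.default_rng(3)
W = rng.standard_normal((64, 64)).astype(np.float32)
mask = rng.random(64) < 0.9
mask[0] = False                    # guarantee at least one live row
W[mask] = 0.0
X = rng.standard_normal((64, 16)).astype(np.float32)

h_before = output_hash(W, X)

# Flip one live weight: the output hash must change, confirming the
# hashed result came from a real computation over the live data.
live = np.flatnonzero(np.abs(W).sum(axis=1))
W[live[0], 0] += 1.0
assert output_hash(W, X) != h_before
```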
rolvsparse© is not a general-purpose dense operator. It is a specialist tool for workloads where sparsity is high and the operator is applied repeatedly — conditions common in production AI inference.
MoE architectures activate a small fraction of experts per token — often fewer than 5%. The inactive expert weight matrices are naturally 95%+ sparse at inference time. rolvsparse© is designed for exactly this structure.
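The effective sparsity follows directly from the routing arithmetic. The expert counts below are hypothetical, chosen to illustrate a fine-grained MoE in the "fewer than 5% active" regime:

```python
# Back-of-envelope: idle expert weights per token, assuming a
# hypothetical fine-grained MoE with 64 experts and top-2 routing.
n_experts, active_per_token = 64, 2
idle_fraction = 1 - active_per_token / n_experts
print(f"{idle_fraction:.1%} of expert weights idle per token")
```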
Post-training pruning at 90%+ sparsity creates weight matrices with the right structure for rolvsparse©. The operator complements magnitude pruning, structured pruning, and similar compression workflows.
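Magnitude pruning itself is a few lines. This generic sketch (a hypothetical helper, not part of rolvsparse©) zeroes the smallest-magnitude entries to hit a target sparsity, producing the kind of matrix the operator consumes:

```python
import numpy as np

def magnitude_prune(W, sparsity=0.90):
    """Zero the smallest-magnitude entries of W (generic sketch)."""
    k = int(W.size * sparsity)
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(W).ravel(), k)[k]
    return np.where(np.abs(W) < threshold, 0.0, W).astype(W.dtype)

rng = np.random.default_rng(4)
W = rng.standard_normal((512, 512)).astype(np.float32)
W_pruned = magnitude_prune(W, 0.90)
print(f"achieved sparsity: {np.mean(W_pruned == 0):.1%}")
```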
CSR format begins to compete with dense MKL/cuBLAS around 75–80% sparsity. Below this threshold, dense operators typically win. rolvsparse© is honest about this boundary.
Below 70% sparsity, dense cuBLAS and MKL consistently outperform sparse operators on modern hardware. rolvsparse© does not claim otherwise and does not benchmark in this regime.
rolvsparse© is covered by three US patent applications currently pending. The filings cover the core operator methodology and its application to AI inference workloads.
Core operator methodology for sparse matrix acceleration in AI inference pipelines.
Energy reduction techniques through elimination of zero-valued arithmetic operations.
Adaptive operator selection and parameter tuning for varying sparsity conditions.
All applications filed in the United States. Patent-pending status. Details available to qualified parties under NDA.
For technical enquiries, access to benchmark data, or discussions about the technology, please reach out directly.
Rolv E. Heggenhougen
rolv LLC · rolv.ai
@rolveitrem