Patent-Pending · Software-Only · No Hardware Changes · No Model Retraining

Up to 89% energy savings.
Up to 9.5× faster than NVIDIA's own GPU libraries.

rolvsparse© beats cuBLAS and cuSPARSE across the full 80–99% sparsity range on real LLaMA-3.1-8B weight matrices. Every result is correctness-verified to max error 3.9×10⁻⁶ — 250× tighter than the standard ATOL=0.001 threshold. Software-only. No hardware changes.

All tolerance checks PASS  ·  max error 3.9×10⁻⁶ across all runs  ·  perturbation test verified on every case
9.5×
Faster than cuSPARSE
LLaMA-3.1-8B · 80% sparsity · NVIDIA H200 · 4/4 PASS
89%
Energy saved per call
vs cuSPARSE · 80% sparsity · time-ratio proxy · H200
3
Patents Filed
Patent-pending · software-only · no hardware changes
Get in Touch →
01 — What It Is

A compute primitive for sparse AI workloads.

rolvsparse© is a software operator that restructures matrix arithmetic to skip zero-valued multiply-accumulate operations. At high sparsity levels — where 90% or more of a weight matrix is zero — this approach delivers substantial reductions in compute time and energy consumption.

Sparse by design

Works best when matrices are genuinely sparse. At 90%+ sparsity, the operator skips the vast majority of multiply-accumulate operations — the work simply does not happen.

Software-only

No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.

Energy follows compute

Fewer operations mean less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
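The "energy follows compute" claim is arithmetic over skipped operations. A minimal sketch using NumPy and SciPy's generic CSR type (not the rolvsparse operator itself, which is proprietary) shows how a 90%-sparse matrix cuts multiply-accumulates to roughly a tenth:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n = 1000
W = rng.standard_normal((n, n)).astype(np.float32)
W[rng.random((n, n)) < 0.90] = 0.0        # zero out ~90% of the weights
x = rng.standard_normal(n).astype(np.float32)

W_csr = csr_matrix(W)                      # stores only the nonzero entries
dense_macs = n * n                         # dense matvec touches every entry
sparse_macs = W_csr.nnz                    # sparse matvec touches only nonzeros

y_dense = W @ x
y_sparse = W_csr @ x
print(sparse_macs / dense_macs)            # ~0.10: 90% of the MACs never happen
print(np.abs(y_dense - y_sparse).max())    # tiny float32 round-off
```

Energy tracking compute is the same observation at hardware scale: work that is skipped draws no power.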

02 — How It Works

Four steps from dense weight to sparse operator.

The operator is built once from a weight matrix and then used repeatedly for inference. Build time is amortised across thousands of inference calls.

01
Score blocks
Each block of the weight matrix receives an importance score based on its contribution to the output. Low-scoring blocks are candidates for elimination.
02
Prune
Blocks below the sparsity threshold are zeroed out. At a 90% target, 90% of blocks are eliminated. The remaining blocks preserve the most important signal.
03
Quantize
Surviving blocks are quantized to INT8 using per-block scaling. This reduces memory bandwidth and accelerates arithmetic on hardware that benefits from lower precision.
04
Store sparse
The resulting matrix is stored in CSR (Compressed Sparse Row) format. Inference skips all zero blocks entirely — no multiply, no memory access, no energy.
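The four steps above can be sketched end to end in a few dozen lines. This is an illustrative reconstruction, not the patented implementation: plain block magnitude stands in for the proprietary importance score, the block size of 16 is arbitrary, and SciPy's generic CSR type stands in for the tuned storage format.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_operator(W, target_sparsity=0.90, block=16):
    """Illustrative build: score blocks -> prune -> INT8 quantize -> CSR."""
    rows, cols = W.shape[0] // block, W.shape[1] // block
    blocks = W.reshape(rows, block, cols, block)
    # 01 Score blocks: magnitude as a stand-in importance score.
    scores = np.abs(blocks).sum(axis=(1, 3))
    # 02 Prune: zero every block scoring below the sparsity quantile.
    keep = scores >= np.quantile(scores, target_sparsity)
    pruned = blocks * keep[:, None, :, None]
    # 03 Quantize survivors to INT8 with one scale per block.
    scale = np.abs(pruned).max(axis=(1, 3))
    scale[scale == 0] = 1.0                      # avoid divide-by-zero
    q = np.round(pruned / scale[:, None, :, None] * 127).astype(np.int8)
    # 04 Store sparse: CSR keeps only the nonzero entries.
    deq = q.astype(np.float32) * scale[:, None, :, None] / 127
    return csr_matrix(deq.reshape(W.shape))

# Build once, reuse across inference calls.
rng = np.random.default_rng(1)
W = rng.standard_normal((128, 128)).astype(np.float32)
op = build_operator(W)
y = op @ rng.standard_normal(128).astype(np.float32)
```

The build-once, use-many pattern is what makes the amortisation argument work: scoring, pruning, and quantization happen a single time, while the cheap sparse matvec runs on every call.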

On correctness: The CRCS operator is exact on its compressed submatrix — no approximation is introduced by the operator itself. The only source of output error is pruning, which zeroes low-magnitude rows before the operator is built. On live rows, measured max error is 3.9×10⁻⁶ — 250× tighter than the standard ATOL=0.001 threshold. All published results include correctness metrics and four SHA-256 hashes for independent verification.
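The live-row check described above reduces to comparing outputs against the dense baseline only where rows survived pruning. A minimal sketch, with a hypothetical `live_mask` and toy values:

```python
import numpy as np

def check_live_rows(y_dense, y_sparse, live_mask, atol=1e-3):
    """Max absolute error on live (unpruned) rows only, in raw fp32.
    Pruned rows are zero by design and excluded from the check."""
    err = float(np.abs(y_dense[live_mask] - y_sparse[live_mask]).max())
    return err, err <= atol

# Toy usage: rows 0 and 2 are live, row 1 was pruned to zero.
y_dense  = np.array([1.0, 0.5, -2.0], dtype=np.float32)
y_sparse = np.array([1.0, 0.0, -2.0], dtype=np.float32)  # pruned row differs
live     = np.array([True, False, True])
err, ok = check_live_rows(y_dense, y_sparse, live)
print(err, ok)   # 0.0 True
```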

03 — Benchmarks

Verified results across the 80–99% sparsity range.

All benchmarks use real weight matrices from Meta LLaMA-3.1-8B, downloaded from Hugging Face. The primary comparison is always the best available vendor operator — cuBLAS or cuSPARSE, whichever is faster at that sparsity level. Four public SHA-256 hashes per case. Correctness verified on live rows to max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001.

Correctness — all runs
4/4 PASS  ·  max err 3.9×10⁻⁶
250× tighter than the standard ATOL=0.001 threshold. Verified on live rows in raw fp32.
Perturbation test — all runs
4/4 PASS
Modifying one live weight value changes the output hash — proving computation uses real weights, not a precomputed result.
Hash verification — per run
4 SHA-256 hashes
Input matrix · input vector · dense baseline · ROLV output. Published on this page. Anyone can reproduce.
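Both checks are simple to reproduce. The sketch below, using a generic NumPy matvec rather than the rolvsparse kernel, hashes raw output bytes with SHA-256 and shows why perturbing one live weight must change the output hash:

```python
import hashlib
import numpy as np

def sha256_of(arr):
    """Hash an array's raw bytes, as in the published verification hashes."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)

h_before = sha256_of(W @ x)

W_pert = W.copy()
W_pert[0, 0] += 1.0                      # modify one live weight
h_after = sha256_of(W_pert @ x)

print(h_before != h_after)               # True: the output depends on
                                         # the real weights, not a cache
```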
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · up_proj (14336×4096) · 80–99% Sparsity

ROLV beats both cuBLAS and cuSPARSE
at every sparsity level from 80% to 99%.

On NVIDIA H200, cuSPARSE is slower than cuBLAS at moderate sparsity — a known kernel-selection limitation on this hardware. ROLV beats whichever is faster. Note that ROLV also saves VRAM proportionally to sparsity: at 90% sparse, the compressed representation is 10× smaller than the dense matrix, enabling larger models or larger batches on the same GPU.

9.53×
vs cuSPARSE · 80%
0.62 ms vs 5.90 ms
89%
Energy saved · 80%
vs cuSPARSE · time-ratio proxy
10×
Less VRAM · 90%
compressed vs dense storage
3.9×10⁻⁶
Max error · all runs
250× tighter than ATOL=0.001
Sparsity  Best vendor  Vendor ms  ROLV ms  Speedup  Energy saved  VRAM vs dense  Max err   Pass
80%       cuSPARSE     5.8984     0.6190   9.53×    +89.5%        5× less        3.9×10⁻⁶  PASS
90%       cuBLAS       2.4821     0.3475   7.14×    +88.4%        10× less       3.9×10⁻⁶  PASS
95%       cuSPARSE     1.5547     0.2265   6.86×    +85.4%        20× less       3.8×10⁻⁶  PASS
99%       cuSPARSE     0.4415     0.1720   2.57×    +61.0%        100× less      3.3×10⁻⁶  PASS
SHA-256 hashes — LLaMA-3.1-8B up_proj · NVIDIA H200 · fixed across all sparsity runs
A (input matrix)  9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector)  32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e

NVIDIA H200 · Meta LLaMA-3.1-8B · layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Event timing · CRCS strategy · best-of(cuBLAS, cuSPARSE) baseline · correctness on live rows · perturbation test passed on all 4 runs

VRAM savings — compressed storage vs dense

Less VRAM means larger models on the same GPU.

ROLV stores only the live rows and columns of the pruned matrix. VRAM usage scales directly with sparsity — at 90% sparse, the compressed representation is 10× smaller. This enables running larger models or larger batch sizes on the same hardware, multiplying the effective throughput-per-GPU-dollar.
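Under a simple row-pruned storage model (live rows kept densely, plus one index per live row; an assumption, since the exact layout is proprietary), the published ratios fall out directly:

```python
def vram_ratio(shape, sparsity, dtype_bytes=4, index_bytes=4):
    """Dense bytes vs row-pruned storage: live rows stored densely,
    plus one row index per live row (a simplified storage model)."""
    rows, cols = shape
    live = round(rows * (1 - sparsity))
    dense_bytes = rows * cols * dtype_bytes
    compressed_bytes = live * cols * dtype_bytes + live * index_bytes
    return dense_bytes / compressed_bytes

# up_proj (14336x4096) at the four benchmarked sparsity levels.
for s in (0.80, 0.90, 0.95, 0.99):
    print(f"{s:.0%} sparse -> {vram_ratio((14336, 4096), s):.1f}x less VRAM")
```

With 4096-wide rows, the per-row index overhead is negligible, which is why the savings track sparsity almost exactly: roughly 5×, 10×, 20×, and 100×.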

5×
Less VRAM at 80%
80% less storage
10×
Less VRAM at 90%
90% less storage
20×
Less VRAM at 95%
95% less storage
100×
Less VRAM at 99%
99% less storage
CPU · Row-pruned pattern · 80–95% Sparsity · 2000×2000 synthetic

Up to 4.70× faster than CPU-CSR · 73% energy saved · 21/21 PASS

Row-pruned and block-sparse patterns compared against CPU-CSR (MKL-sparse). Batch=500, 1000 iterations, ATOL=0.001. Max error 2.5×10⁻⁷.

CPU · row-pruned & block-sparse · 2000×2000 · Batch=500 · 1000 iters · vs CPU-CSR (MKL-sparse) · ATOL=0.001 · 4 SHA-256 hashes per case

Methodology: Primary comparison is always the best available vendor operator — whichever of cuBLAS or cuSPARSE is faster at that sparsity level on that hardware. GPU timing uses CUDA Events (warmup=10, iters=100). Correctness is measured on live rows only in raw fp32 — pruned rows are intentionally zero by design. Four SHA-256 hashes per run. Perturbation test: modifying one live weight changes the output hash, confirming real computation. Full JSON with all hashes and per-iteration timings available on request.
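As a worked example of the time-ratio proxy, assuming energy saved is defined as 1 - t_ROLV / t_vendor (which matches the published 80% row):

```python
def speedup(t_vendor_ms, t_rolv_ms):
    return t_vendor_ms / t_rolv_ms

def energy_saved(t_vendor_ms, t_rolv_ms):
    """Time-ratio proxy: energy tracks runtime at roughly constant power."""
    return 1.0 - t_rolv_ms / t_vendor_ms

# 80% sparsity row from the H200 table: 5.8984 ms vendor vs 0.6190 ms ROLV.
print(f"{speedup(5.8984, 0.6190):.2f}x")      # 9.53x
print(f"{energy_saved(5.8984, 0.6190):.1%}")  # 89.5%
```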

04 — Where It Works

Designed for high-sparsity inference workloads.

rolvsparse© is not a general-purpose dense operator. It is a specialist tool for workloads where sparsity is high and the operator is applied repeatedly — conditions common in production AI inference.

Best fit

Mixture-of-experts inference

MoE architectures activate a small fraction of experts per token — often fewer than 5%. The inactive expert weight matrices are naturally 95%+ sparse at inference time. rolvsparse© is designed for exactly this structure.

Best fit

Aggressively pruned models

Post-training pruning at 90%+ sparsity creates weight matrices with the right structure for rolvsparse©. The operator complements magnitude pruning, structured pruning, and similar compression workflows.

Marginal fit

Moderate sparsity (50–80%)

CSR format begins to compete with dense MKL/cuBLAS around 75–80% sparsity. Below this threshold, dense operators typically win. rolvsparse© is honest about this boundary.

Not the right tool

Dense or low-sparsity matrices

Below 70% sparsity, dense cuBLAS and MKL consistently outperform sparse operators on modern hardware. rolvsparse© does not claim otherwise and does not benchmark in this regime.

05 — Intellectual Property

Three patent applications filed.

rolvsparse© is covered by three US patent applications currently pending. The filings cover the core operator methodology and its application to AI inference workloads.

Application 8005

System and Method for Enhancing Computational Efficiency

Core operator methodology for sparse matrix acceleration in AI inference pipelines.

Application 8006

Systems and Methods for Energy-Efficient Inference

Energy reduction techniques through elimination of zero-valued arithmetic operations.

Application 8011

System and Method for Optimizing Computational Efficiency Using Metric Comparison

Adaptive operator selection and parameter tuning for varying sparsity conditions.

All applications filed in the United States. Patent-pending status. Details available to qualified parties under NDA.

06 — Contact

Get in touch.

For technical enquiries, access to benchmark data, or discussions about the technology, please reach out directly.

[email protected]

Rolv E. Heggenhougen
rolv LLC · rolv.ai
@rolveitrem