Patent-Pending · Software-Only · No Hardware Changes · No Model Retraining

A sparse matrix operator
built for high-sparsity AI inference.

rolvsparse© accelerates matrix operations on highly sparse weight matrices. It targets the 90%+ sparsity regime — the natural operating point of mixture-of-experts models and aggressively pruned transformers. Software-only. Runs on existing hardware.

90%+
Target Sparsity
Designed for the regime where sparse operators decisively win
Software
Only
No new hardware · no model retraining · drop-in operator
3
Patents Filed
Patent-pending in the United States
Get in Touch →
01 — What It Is

A compute primitive for sparse AI workloads.

rolvsparse© is a software operator that restructures matrix arithmetic to skip zero-valued multiply-accumulate operations. At high sparsity levels — where 90% or more of a weight matrix is zero — this approach delivers substantial reductions in compute time and energy consumption.

Sparse by design

Works best when matrices are genuinely sparse. At 90%+ sparsity, the operator skips the vast majority of multiply-accumulate operations — the work simply does not happen.

Software-only

No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.

Energy follows compute

Fewer operations mean less energy. At 90%+ sparsity, energy savings scale roughly in proportion to the arithmetic eliminated — a direct consequence of doing less work.
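The skip-zero idea can be illustrated with scipy's CSR format (a minimal sketch, not the rolvsparse© implementation itself):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# A 512x512 weight matrix at 90% sparsity: ~90% of entries are zero.
W = rng.standard_normal((512, 512))
W[rng.random(W.shape) < 0.90] = 0.0

x = rng.standard_normal(512)

# CSR stores only the non-zero entries, so the matvec performs
# roughly 10% of the multiply-accumulates a dense matvec would.
W_sparse = csr_matrix(W)
y_sparse = W_sparse @ x
y_dense = W @ x

assert np.allclose(y_sparse, y_dense)
print(f"stored nonzeros: {W_sparse.nnz} of {W.size}")
```

The result matches the dense product exactly here because the zeros are genuine; the operations skipped contribute nothing to the output.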

02 — How It Works

Four steps from dense weight to sparse operator.

The operator is built once from a weight matrix and then used repeatedly for inference. Build time is amortised across thousands of inference calls.

01
Score blocks
Each block of the weight matrix receives an importance score based on its contribution to the output. Low-scoring blocks are candidates for elimination.
02
Prune
Blocks scoring below the threshold implied by the sparsity target are zeroed out. At a 90% target, the lowest-scoring 90% of blocks are eliminated; the surviving blocks preserve the most important signal.
03
Quantize
Surviving blocks are quantized to INT8 using per-block scaling. This reduces memory bandwidth and accelerates arithmetic on hardware that benefits from lower precision.
04
Store sparse
The resulting matrix is stored in CSR (Compressed Sparse Row) format. Inference skips all zero blocks entirely — no multiply, no memory access, no energy.
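The four steps above can be sketched in numpy/scipy. This is a hypothetical illustration only: Frobenius-norm block scoring and per-block max-abs INT8 scaling are stand-in choices, not the patent-pending scoring and quantization methods.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_sparse_operator(W, block=16, sparsity=0.90):
    """Sketch of the build pipeline: score, prune, quantize, store sparse."""
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    # 1. Score blocks (stand-in score: per-block Frobenius norm).
    blocks = W.reshape(rows // block, block, cols // block, block)
    scores = np.linalg.norm(blocks, axis=(1, 3))
    # 2. Prune: zero out the lowest-scoring fraction of blocks.
    threshold = np.quantile(scores, sparsity)
    mask = scores >= threshold  # keeps roughly (1 - sparsity) of blocks
    pruned = blocks * mask[:, None, :, None]
    # 3. Quantize surviving blocks to INT8 with per-block scaling.
    scales = np.abs(pruned).max(axis=(1, 3)) / 127.0
    scales[scales == 0] = 1.0   # avoid divide-by-zero on pruned blocks
    q = np.round(pruned / scales[:, None, :, None]).astype(np.int8)
    deq = q.astype(np.float32) * scales[:, None, :, None]
    # 4. Store sparse: CSR keeps only the surviving entries.
    return csr_matrix(deq.reshape(rows, cols))

W = np.random.default_rng(1).standard_normal((512, 512)).astype(np.float32)
W_op = build_sparse_operator(W)
print(f"kept {W_op.nnz} of {W.size} entries "
      f"({1 - W_op.nnz / W.size:.1%} sparse)")
```

The operator is built once and reused; every subsequent inference call pays only the sparse matvec cost.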

On correctness: rolvsparse© is an approximate operator. Pruning removes weight information, which introduces output error proportional to the sparsity level. This is expected and standard for compressed inference — the goal is to operate within a defined tolerance budget (typically normalized output error under 0.10) while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.
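The tolerance check can be expressed along these lines. Relative L2 error is used here as one plausible form of a normalized metric; the published benchmarks use normalized column error.

```python
import numpy as np

ATOL = 0.10  # tolerance budget from the text

def normalized_error(y_ref, y_approx, eps=1e-12):
    """Relative L2 error of the approximate output vs the dense reference."""
    return np.linalg.norm(y_approx - y_ref) / (np.linalg.norm(y_ref) + eps)

# Toy example: a sparse-operator output with a small approximation error.
y_dense = np.array([1.0, -2.0, 3.0, 0.5])
y_sparse = y_dense + np.array([0.01, -0.02, 0.015, 0.0])

err = normalized_error(y_dense, y_sparse)
print(f"normalized error {err:.4f} -> {'PASS' if err < ATOL else 'FAIL'}")
```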

03 — Benchmarks

Verified results at 99% sparsity.

All benchmarks use real open-source model weights downloaded directly from HuggingFace. Compared against dense matrix multiply as baseline. Results include normalized output error to verify correctness. Full JSON with hashes available on request.

GPU — NVIDIA B200 · Mistral-7B · 99% Sparsity · 4 MLP Layers

Up to 8.09× speedup · up to 87.6% energy saved · 12/12 PASS

4 MLP gate projection layers (14336×4096 each). 99% sparsity target. Batch=2048, 1000 iterations. Normalized output error 0.007–0.008 across all layers — well within tolerance.

8.09×
Peak Speedup
layer0 · 99.1% sparse
87.6%
Peak Energy Saved
layer0 · vs dense baseline
6.29×
Avg Speedup
across 4 layers
12/12
Correctness PASS
all sparsity levels
Layer              Sparsity   Speedup   Energy Saved   Norm Error   Result
layer0.gate_proj   99.1%      8.09×     87.6%          0.0073       PASS
layer1.gate_proj   99.2%      7.73×     87.1%          0.0075       PASS
layer2.gate_proj   99.0%      6.43×     84.5%          0.0077       PASS
layer3.gate_proj   99.0%      2.89×     65.3%          0.0078       PASS

NVIDIA B200 · Mistral-7B-Instruct-v0.3 · Batch=2048 · 1000 iters · torch CSR · vs dense cuBLAS baseline · hash-verified

CPU — Intel i7 Desktop · Mistral-7B · 99% Sparsity · Layer 0

5.19× speedup · 80.7% energy saved · PASS

Same model, same layer, run on a standard desktop CPU. Demonstrates the operator works across hardware. Batch=512, 1000 iterations. Normalized output error 0.0073.

5.19×
Speedup vs Dense
80.7%
Energy Saved
99.1%
Actual Sparsity
PASS
Correctness

Intel i7 desktop CPU · Mistral-7B-Instruct-v0.3 · layer0.gate_proj (14336×4096) · Batch=512 · 1000 iters · scipy CSR · vs dense MKL baseline · hash-verified

Methodology: All results compare ROLV against dense matrix multiply on the same hardware. Correctness is measured using normalized column error (ATOL=0.10). Each run outputs four SHA-256 hashes — input matrix, input vector, dense baseline output, and ROLV output — to verify real computation on real weights. Full benchmark JSON with all hashes available on request.
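The hash-verification step can be reproduced along these lines (a sketch; the exact serialization used by the benchmark harness is not specified here):

```python
import hashlib
import numpy as np

def array_hash(a: np.ndarray) -> str:
    """SHA-256 of an array's raw bytes, with dtype and shape mixed in
    so identical values in different layouts hash differently."""
    h = hashlib.sha256()
    h.update(str(a.dtype).encode())
    h.update(str(a.shape).encode())
    h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

# Four hashes per run: input matrix, input vector, dense output, sparse output.
W = np.arange(12, dtype=np.float32).reshape(3, 4)
x = np.ones(4, dtype=np.float32)
print(array_hash(W)[:16], array_hash(x)[:16], array_hash(W @ x)[:16])
```

Because the hashes are deterministic functions of the data, anyone holding the same weights can confirm that the published outputs came from real computation on those weights.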

04 — Where It Works

Designed for high-sparsity inference workloads.

rolvsparse© is not a general-purpose dense operator. It is a specialist tool for workloads where sparsity is high and the operator is applied repeatedly — conditions common in production AI inference.

Best fit

Mixture-of-experts inference

MoE architectures activate a small fraction of experts per token — often fewer than 5%. The inactive expert weight matrices are naturally 95%+ sparse at inference time. rolvsparse© is designed for exactly this structure.
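The effective sparsity of MoE routing is easy to see in a toy model (a hypothetical setup: 64 experts, top-2 routing):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 64, 128

# Per-token routing: top-2 of 64 experts are active (~3% of experts).
active = rng.choice(n_experts, size=2, replace=False)

# Stacked expert weights for this token: inactive experts
# contribute all-zero blocks.
W = np.zeros((n_experts * d, d))
for e in active:
    W[e * d:(e + 1) * d] = rng.standard_normal((d, d))

sparsity = 1.0 - np.count_nonzero(W) / W.size
print(f"effective sparsity: {sparsity:.1%}")  # 62/64 experts inactive
```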

Best fit

Aggressively pruned models

Post-training pruning at 90%+ sparsity creates weight matrices with the right structure for rolvsparse©. The operator complements magnitude pruning, structured pruning, and similar compression workflows.

Marginal fit

Moderate sparsity (50–80%)

CSR format begins to compete with dense MKL/cuBLAS around 75–80% sparsity. Below this threshold, dense operators typically win. rolvsparse© is honest about this boundary.

Not the right tool

Dense or low-sparsity matrices

Below 70% sparsity, dense cuBLAS and MKL consistently outperform sparse operators on modern hardware. rolvsparse© does not claim otherwise and does not benchmark in this regime.
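The crossover can be probed on any machine with a quick sweep (a rough sketch using scipy CSR as a stand-in; absolute numbers depend heavily on hardware, BLAS build, and matrix size):

```python
import time
import numpy as np
from scipy.sparse import csr_matrix

def time_matvec(op, x, iters=50):
    """Average wall-clock time of one matvec over `iters` repetitions."""
    t0 = time.perf_counter()
    for _ in range(iters):
        y = op @ x
    return (time.perf_counter() - t0) / iters, y

rng = np.random.default_rng(0)
n = 2048
x = rng.standard_normal(n)

for sparsity in (0.50, 0.80, 0.90, 0.99):
    W = rng.standard_normal((n, n))
    W[rng.random(W.shape) < sparsity] = 0.0
    t_dense, y_d = time_matvec(W, x)
    t_sparse, y_s = time_matvec(csr_matrix(W), x)
    assert np.allclose(y_d, y_s)
    print(f"{sparsity:.0%} sparse: dense {t_dense*1e3:.3f} ms, "
          f"CSR {t_sparse*1e3:.3f} ms, ratio {t_dense/t_sparse:.2f}x")
```

On typical hardware the ratio favours dense at 50%, approaches parity around 80%, and favours CSR decisively at 99% — consistent with the boundary described above.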

05 — Intellectual Property

Three patent applications filed.

rolvsparse© is covered by three US patent applications currently pending. The filings cover the core operator methodology and its application to AI inference workloads.

Application 8005

System and Method for Enhancing Computational Efficiency

Core operator methodology for sparse matrix acceleration in AI inference pipelines.

Application 8006

Systems and Methods for Energy-Efficient Inference

Energy reduction techniques through elimination of zero-valued arithmetic operations.

Application 8011

System and Method for Optimizing Computational Efficiency Using Metric Comparison

Adaptive operator selection and parameter tuning for varying sparsity conditions.

All applications filed in the United States. Patent-pending status. Details available to qualified parties under NDA.

06 — Contact

Get in touch.

For technical enquiries, access to benchmark data, or discussions about the technology, please reach out directly.

[email protected]

Rolv E. Heggenhougen
rolv LLC · rolv.ai
@rolveitrem