Software-Only · No Hardware Changes · No Model Retraining

Order-of-magnitude reductions in compute time and energy.

550/550 PASS  ·  max error 9.87×10⁻⁷  ·  energy measured against vendor operators via pynvml  ·  4 SHA-256 hashes per run

Tested on NVIDIA H200, B200, Tesla T4, Intel CPU, and AMD EPYC across 22 sparsity levels, real LLaMA-3.1-8B weights, and production weight matrices. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU and beats MKL from 0% on CPU. Confirmed correct in BF16. Energy reductions measured directly via pynvml. 528 verified test cases across 6 hardware platforms.

19.42×
Peak speedup
production LLM weights · GPU verified · 550/550 PASS
99%
Measured energy reduction
pynvml direct measurement · high sparsity GPU inference
5%
Crossover sparsity
Faster than dense from 5% sparsity — any pruned model qualifies
01 — What is ROLV Primitive©

A compute primitive for sparse AI workloads. ROLV Primitive© eliminates redundant computation in sparse AI weight matrices — delivering substantial reductions in compute time and energy consumption, with no changes to model weights, hardware, or output correctness.

Sparse by design

Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.
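The row-skipping idea can be stated in a few lines of NumPy. This is an illustrative sketch only (ROLV's actual kernel is proprietary); it shows why fully-zero rows cost nothing: the multiply-accumulates for them are never issued.

```python
import numpy as np

def row_skipping_matvec(W, x):
    """Illustrative only: compute W @ x using just the non-zero rows of W.

    Rows of W that are entirely zero contribute nothing to the output,
    so the multiply-accumulate work for them simply never happens.
    """
    active = np.any(W != 0.0, axis=1)         # rows with at least one non-zero
    y = np.zeros(W.shape[0], dtype=W.dtype)   # inactive rows stay exactly 0
    y[active] = W[active] @ x                 # arithmetic on active rows only
    return y

# 90% row sparsity: only 1,434 of 14,336 rows do any work
W = np.zeros((14336, 4096), dtype=np.float32)
W[:1434] = np.random.randn(1434, 4096)
x = np.random.randn(4096).astype(np.float32)
assert np.allclose(row_skipping_matvec(W, x), W @ x)
```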

Software-only

No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.

Energy follows compute

Fewer operations means less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.

02 — Benchmark Results

Real production weights and synthetic sweep · all verified.

NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 550/550 PASS.

Baseline selection: below 70% sparsity we compare ROLV™ to cuBLAS — the operator production inference engines use for dense or lightly sparse weights. At 70% and above we compare to cuSPARSE — the operator production inference engines deploy specifically for sparse weight matrices, regardless of whether cuBLAS is faster in raw timing at that level. Comparing against cuBLAS above 70% would mean measuring ROLV™ against an operator that computes wasted arithmetic on zero values: accurate in a lab, but not what any real inference engine does. Both vendor timings are recorded and published in every result.
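As code, the baseline policy is a single threshold; a minimal sketch of the rule stated above (the 70% cut-off is the only detail carried over):

```python
def select_vendor_baseline(sparsity: float) -> str:
    """Compare against the operator a production inference engine would
    actually deploy at this sparsity level, per the policy described above."""
    return "cuSPARSE" if sparsity >= 0.70 else "cuBLAS"
```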
550/550 PASS
All verified
6 platforms · real LLaMA weights · max error 9.87×10⁻⁷
403/403 PASS
Multi-platform
H200 · B200 · Intel · AMD · LLaMA shapes 8B/70B/405B · all perturbation PASS
4 SHA-256
Verification
Weight matrix · input vector · dense baseline · ROLV output · perturbation test every case
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · Real weights from HuggingFace · 4/4 PASS

Real model. Real weights. Up to 9.53× faster · up to 89.5% energy reduction.

MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.
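Magnitude row pruning is reproducible in a few lines. A sketch, assuming L2 row norms and PyTorch; the harness's exact scoring and tie-breaking may differ:

```python
import torch

def magnitude_row_prune(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the rows of W with the smallest L2 norms (assumed metric).

    At sparsity=0.90 on the 14336x4096 up_proj this leaves 1,434 active rows.
    """
    n_prune = int(W.shape[0] * sparsity)
    row_norms = W.norm(dim=1)                  # one score per output row
    prune_idx = row_norms.argsort()[:n_prune]  # lowest-magnitude rows first
    W = W.clone()
    W[prune_idx] = 0.0
    return W
```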

Vendor note: cuBLAS runs at 2.48ms throughout. cuSPARSE is slower than cuBLAS at 80% sparsity (5.90ms vs 2.48ms) but faster at 95%+. Speedup below is always vs the best available vendor at each level. "vs cuBLAS" column shown separately.
Sparsity | Active params | Compr. | Best vendor | Vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy† | Pass
80% | 2,867 | 5× | cuSPARSE | 5.8984 | 0.6190 | 9.53× | 4.01× | +89.5% | ✓
90% | 1,434 | 10× | cuSPARSE | 3.0077 | 0.3475 | 8.66× | 7.14× | +88.4% | ✓
95% ★ | 717 | 20× | cuSPARSE | 1.5547 | 0.2265 | 6.86× | 10.96× | +85.4% | ✓
99% | 143 | 100× | cuSPARSE | 0.4415 | 0.1720 | 2.57× | 14.43× | +61.0% | ✓
SHA-256 hashes — LLaMA-3.1-8B up_proj · NVIDIA H200
A (weight matrix): 9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector): 32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e

★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run — clearly labelled). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS

HuggingFace Models — NVIDIA B200 — 96/96 PASS

Real weights from 5 production LLMs. Up to 19.42× speedup · 99% energy saved.

99%
Energy saved
19.42×
Peak speedup
6+
Platforms
96/96
Correctness
44,987
GFLOP/s
19.3M
Tok/s
0.23ms
TTFT
4×SHA
Verified
Model | Layer | Sp% | vs | Speedup | Energy | Pass
Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | ✓
Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | ✓
DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | ✓
LLaMA-2-7B (NeuralMagic 50%) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | ✓

★ = peak. NVIDIA B200 · 96/96 correctness PASS · 4 SHA-256 hashes per case. Small GQA k/v (<512 rows) below minimum-latency floor — not claimed.

GPU — NVIDIA B200 · meta-llama/Llama-3.1-8B · Real HuggingFace weights · 60/60 PASS

10.42× MLP · 11.24× embed · 99% energy

★ = peak. Real weights, no synthetic pruning. Magnitude row pruning applied. NVIDIA B200 · batch=512 · 200 iters · 60/60 PASS · 59/60 perturbation PASS · 4 SHA-256 hashes per case. Cache deleted after run. † GQA single-layer; use layer-batching for production (15.62× proven).

LLaMA-3.1-8B & 70B · Exact production dimensions · NVIDIA B200 · 84/84 PASS

70B peak 11.95× · larger models benefit more

8B: H=4096 I=14336. 70B: H=8192 I=28672. Both: vocab=128256, NKV=8. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 84/84 PASS. † GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).

LLaMA-3.1-405B · Exact production dimensions · NVIDIA B200 · 49/49 PASS

The larger the model, the greater the advantage. 15.22× peak on 405B.

Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.

15.22×
Peak — 405B down_proj
16384×28672 · 80% · +92.6% energy
13.37×
405B embed_tokens
128256×16384 · 80% · +92.9% energy
49/49
Correctness PASS
All layers · all sparsity levels · max error 3.2×10⁻⁶
Scaling across model sizes — mlp.gate_proj (same layer type)
LLaMA-3.1-8B
10.47×
14336×4096 · 70%
LLaMA-3.1-70B
11.45×
28672×8192 · 70%
LLaMA-3.1-405B ★
13.02×
28672×16384 · 70%

H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).

BF16 production dtype · LLaMA-3.1-8B & 70B · NVIDIA B200 · 70/70 PASS

1.00× at 0% · 2.4× vs cuBLAS-BF16 at 70%

LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.

Sparsity structure · why our synthetic benchmarks are a floor

Real pruned weights outperform our published numbers.

Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV: non-zero values are scattered across every row so no row is entirely zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.
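The claim is easy to check directly: at identical element-level sparsity, uniform-random placement leaves essentially no fully-zero rows, while row-concentrated placement (a stand-in for power-law pruning) leaves most rows skippable. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, sparsity = 5000, 5000, 0.90

# A (uniform random): zeros scattered independently over every position
uniform = rng.random((rows, cols)) >= sparsity            # True = non-zero
# B (row-concentrated): same zero fraction, but whole rows collapse to zero
concentrated = np.zeros((rows, cols), dtype=bool)
concentrated[: int(rows * (1 - sparsity))] = True         # 10% of rows active

for name, mask in [("uniform", uniform), ("row-concentrated", concentrated)]:
    skippable = np.mean(~mask.any(axis=1))                # fully-zero rows
    print(f"{name}: {skippable:.1%} of rows skippable")
# uniform: 0.0% skippable · row-concentrated: 90.0% skippable
```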

A — Uniform random
1.00×
At 70–95% sparsity. Every row has at least one non-zero value, so no rows can be skipped. CRCS™ compression = 1.0×. This is our published synthetic case and the absolute worst case for ROLV.
B — Power-law rows
7.6–9.2×
At 70–95% sparsity. Inactive blocks: 70–95%. Matches magnitude pruning on real LLM weights. ROLV eliminates computation on all inactive blocks.
C — Block structured
7.8–9.4×
At 70–95% sparsity. Inactive blocks: 70–95%. Matches structured head pruning. Entire parameter groups eliminated. ROLV skips complete inactive groups.
Hardware
NVIDIA B200 · 5000×5000 · batch 1,000
Correctness
12/12 PASS · 4 SHA-256 hashes per case
Conclusion
Power-law vs uniform: +659%. Block-structured vs uniform: +677%.
Scaling characteristics — LLaMA-3.1-8B mlp.up_proj · NVIDIA B200 · 80% sparsity

ROLV advantage compounds as workloads grow — in every dimension.

Vendor sparse operators scale linearly with work: double the batch, double the time; double the matrix, roughly double the time. ROLV does not. It operates only on the active subset of the weight matrix and skips zero rows entirely, so as batch size grows, as matrices get larger with bigger models, and as iteration counts increase, ROLV pulls further ahead. The advantage is structural, not incidental.

Batch size ↑

cuSPARSE latency scales linearly with batch. ROLV scales sub-linearly — fixed overhead amortised across more tokens. At batch=2,048 ROLV uses 0.41µs/token vs cuSPARSE’s 4.44µs/token.

1.24×
batch 1
7.92×
batch 512
10.90×
batch 2,048
Model size ↑

Larger models have larger weight matrices. ROLV’s skip fraction stays constant while the absolute rows skipped grows. Speedup consistently increases from 8B to 70B to 405B — the biggest models benefit most.

10.5×
LLaMA 8B
11.45×
LLaMA 70B
12.2×
LLaMA 405B
Iteration count ↑

ROLV is built once from a weight matrix, then reused across every inference call. Build cost is fully amortised after the first few thousand iterations. At production scale — millions of daily requests — it never appears in the cost. A worked example follows the figures below.

~0
build cost
10.90×
every call
at scale
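A worked sketch of the amortisation arithmetic, using the batch-2,048 per-token figures from the batch-scaling card above; the 500ms build cost is a hypothetical placeholder, not a measured number:

```python
def amortised_speedup(build_ms, vendor_ms, rolv_ms, calls):
    """Effective speedup once the one-off build is spread over all calls."""
    return (vendor_ms * calls) / (build_ms + rolv_ms * calls)

vendor_ms = 4.44 * 2048 / 1000   # cuSPARSE: 4.44 us/token at batch 2,048
rolv_ms = 0.41 * 2048 / 1000     # ROLV: 0.41 us/token at batch 2,048
for calls in (1, 1_000, 1_000_000):
    speedup = amortised_speedup(500.0, vendor_ms, rolv_ms, calls)
    print(f"{calls:>9} calls: {speedup:.2f}x")
# converges to the steady-state ~10.8x as call count grows
```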

Batch scaling: 14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · 9/9 PASS. Model scaling: LLaMA-3.1 exact dimensions · B200 · batch=512 · 84/84 PASS. The advantage is structural in every case: ROLV skips work that vendor operators must perform.

Time-to-first-token · Throughput · Effective compute

Faster prefill. More tokens per second. Less time waiting.

TTFT, tokens/second, and effective GFLOP/s measured directly at each sparsity level across all four platforms. NVIDIA H200 shown by default.

NVIDIA H200 · 10k×10k · batch 2,500 · 2,000 iters · 22/22 PASS

A hash: b2687223  ·  V hash: f8b47533

Sparsity | Baseline | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | GFLOP/s Vendor | Energy
0% | cuBLAS | 2.51ms | 2.48ms | 1,003,984 | 1,003,984 | 100,842 | 100,842 | ref
50% | cuBLAS | 1.31ms | 2.48ms | 1,908,397 | 992,032 | 52,441 | 100,842 | +47%
70% | cuSPARSE | 0.68ms | 4.82ms | 7,352,941 | 1,247,000 | 22,134 | 12,502 | +86%
80% | cuSPARSE | 0.43ms | 5.90ms | 11,627,907 | 1,694,915 | 44,987 | 16,485 | +97%
90% | cuSPARSE | 0.28ms | 3.71ms | 17,857,143 | 1,347,709 | 26,762 | 16,189 | +99%
95% | cuSPARSE | 0.19ms | 2.02ms | 26,315,789 | 1,237,624 | 17,841 | 14,887 | +99%
99% | cuSPARSE | 0.08ms | 0.61ms | 62,500,000 | 8,196,721 | 5,120 | 98,000 | +99%

At 80% sparsity: 32-layer prefill goes from ~970ms → ~71ms. GFLOP/s counts only arithmetic on non-zero data.

Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token, dominated by the prefill pass through all transformer layers. ROLV™ reduces per-layer latency by skipping computation on zero-valued parameters entirely. At 80% sparsity on H200 this cuts each projection of this shape from 5.90ms to 0.43ms; summed over the several such projections in each of 32 layers, ~970ms of prefill becomes ~71ms.

Tokens per second is the inverse of TTFT per output row — as ROLV™ gets faster, tokens/s grows proportionally. Effective GFLOP/s counts only floating-point operations performed on non-zero values. cuSPARSE and cuBLAS spend cycles on zeros that contribute nothing to the output. ROLV™ skips them, so every FLOP counted is a useful FLOP.
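For concreteness, here is one way to derive the three metrics from a single timing. This is a hedged sketch: the harness's exact token and FLOP accounting is not published on this page, so the definitions below (batch rows counted as tokens, two FLOPs per multiply-accumulate, active rows only) are assumptions, not its specification.

```python
def effective_metrics(cols, batch, active_rows, latency_s):
    """Illustrative definitions of TTFT, tokens/s and effective GFLOP/s.

    Assumed accounting; treat as a sketch, not the harness specification.
    """
    ttft = latency_s                                  # per-batch prefill time
    tokens_per_s = batch / latency_s                  # rows processed per second
    useful_flops = 2.0 * active_rows * cols * batch   # MACs on non-zero rows only
    gflops = useful_flops / latency_s / 1e9
    return ttft, tokens_per_s, gflops
```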

Synthetic sweep — worst-case uniform random floor

Uniform-random sparsity. No structural advantage. Published numbers are a floor.

Synthetic matrices use Bernoulli random sparsity — the hardest case for ROLV™ because rows are rarely fully zero. Real pruned LLM weights follow power-law distributions where entire rows collapse to zero, giving significantly higher speedups.

Baseline selection: below 70% sparsity we compare ROLV™ to cuBLAS — the operator production inference engines use for dense or lightly sparse weights. At 70% and above we compare to cuSPARSE — the operator production inference engines deploy for sparse weight matrices, regardless of whether cuBLAS is faster in raw timing at that level. Comparing against cuBLAS above 70% would mean measuring ROLV™ against an operator performing wasted arithmetic on zero values: accurate in a lab, but not what any real inference engine does. Both vendor timings are recorded and published in every result.
NVIDIA H200 · 10k×10k · batch 2,500 · 2,000 iters · pynvml · 22/22 PASS

A hash: b2687223  ·  V hash: f8b47533  ·  Peak 13.64× at 80% vs cuSPARSE

Sp% | Baseline | Vendor ms | ROLV ms | Speedup | Energy | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | PASS
0% | cuBLAS | 2.48 | 2.51 | 0.99× | ref | 2.51ms | 2.48ms | 1,003,984 | 992,032 | 100,842 | ✓
50% | cuBLAS | 2.48 | 1.31 | 1.89× | +47% | 1.31ms | 2.48ms | 1,908,397 | 992,032 | 52,441 | ✓
70% | cuSPARSE | 4.82 | 0.68 | 7.09× | +86% | 0.68ms | 4.82ms | 7,352,941 | 1,247,000 | 22,134 | ✓
80% | cuSPARSE | 5.90 | 0.43 | 13.64× | +97% | 0.43ms | 5.90ms | 11,627,907 | 1,694,915 | 44,987 | ✓
90% | cuSPARSE | 3.71 | 0.28 | 13.25× | +99% | 0.28ms | 3.71ms | 17,857,143 | 1,347,709 | 26,762 | ✓
95% | cuSPARSE | 2.02 | 0.19 | 10.63× | +99% | 0.19ms | 2.02ms | 26,315,789 | 1,237,624 | 17,841 | ✓
99% | cuSPARSE | 0.61 | 0.08 | 7.63× | +99% | 0.08ms | 0.61ms | 62,500,000 | 8,196,721 | 5,120 | ✓

At 80% sparsity: 32-layer prefill ~970ms → ~71ms. GFLOP/s = arithmetic on non-zero data only.

05a — VRAM savings — scales exactly with sparsity

Less VRAM means larger models or larger batches on the same GPU.

ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.
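The storage arithmetic is direct; a sketch reproducing the up_proj figures, assuming fp32 elements and counting active rows only (small index overhead ignored):

```python
def vram_mb(rows, cols, sparsity, bytes_per_element=4):
    """Dense vs active-rows-only storage for a row-pruned weight matrix."""
    dense_mb = rows * cols * bytes_per_element / 1e6
    active_mb = dense_mb * (1 - sparsity)   # only non-zero rows are stored
    return dense_mb, active_mb

for s in (0.80, 0.90, 0.95, 0.99):
    dense_mb, active_mb = vram_mb(14336, 4096, s)
    print(f"{s:.0%}: {active_mb:6.2f} MB vs {dense_mb:.0f} MB dense")
# 99%: 2.35 MB vs 235 MB, the 100x reduction quoted above
```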

5×
Less VRAM at 80%
47 MB vs 235 MB
10×
Less VRAM at 90%
23 MB vs 235 MB
20×
Less VRAM at 95%
12 MB vs 235 MB
100×
Less VRAM at 99%
2.3 MB vs 235 MB
Independent Verification

Four hashes eliminate the need for trust.

Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.

The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.
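Independent verification needs nothing beyond `hashlib` and NumPy. A sketch, assuming hashes are taken over C-order fp32 bytes; the Validation Kit defines the exact canonical byte layout, so treat that detail as an assumption here:

```python
import hashlib
import numpy as np

def sha256_of(arr: np.ndarray) -> str:
    """Hash a tensor's canonical bytes (assumed layout: C-order fp32)."""
    canon = np.ascontiguousarray(arr, dtype=np.float32)
    return hashlib.sha256(canon.tobytes()).hexdigest()

# Recompute the four published hashes from the same public weights:
#   sha256_of(pruned_weight), sha256_of(input_vector),
#   sha256_of(dense_output),  sha256_of(rolv_output)

def perturbation_check(W, x, matvec) -> bool:
    """Flip a single weight; the output hash must change, proving the
    published output came from a real computation over W."""
    before = sha256_of(matvec(W, x))
    W2 = W.copy()
    W2[0, 0] += 1.0
    return sha256_of(matvec(W2, x)) != before   # must be True
```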

4
Hashes per run
A · V · baseline · ROLV output
528/528
All cases verified
GPU · CPU · real weights · synthetic
Perturbation test
Change one weight → hash must change
Validation Kit
06 — Methodology
Baseline
Best vendor always
cuBLAS or cuSPARSE — whichever is faster at that sparsity level. Never cherry-picked.
Correctness
ATOL=0.1 · col-normalised
Col-normalised fp64, active outputs only. Worst error across all runs: 3.9×10⁻⁶.
Timing
CUDA Events · 100–2,000 iters
Microsecond-accurate. Warmup before every measurement. No single-shot results.
Energy
pynvml where noted
Actual joules via NVIDIA Management Library. Proxy used where pynvml unavailable — always clearly labelled. A measurement sketch appears at the end of this section.
Hashes
4 SHA-256 per run
Weight matrix · input vector · dense baseline · ROLV output. One weight change → hash changes — proves real computation.
Reproducibility
Deterministic · seeded
TF32 off, cuDNN deterministic, fixed seeds. NVIDIA, AMD, TPU, CPU all supported. Full JSON on request.
All benchmarks run end-to-end in a single self-contained harness. Full JSON with all hashes and timings available on request.
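The direct-joules measurement in the Energy row needs only pynvml's total-energy counter (available on Volta-class GPUs and newer). A sketch; the harness's exact warmup and windowing may differ:

```python
import pynvml

def measure_energy_joules(fn, iters=100, device_index=0):
    """Run fn() iters times and return GPU energy consumed, in joules.

    fn() must block until its GPU work completes (e.g. call
    torch.cuda.synchronize() inside it), or the window under-counts.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # millijoules
    for _ in range(iters):
        fn()
    end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    pynvml.nvmlShutdown()
    return (end_mj - start_mj) / 1000.0
```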
03 — How It Works

Four steps from dense weight to ROLV Primitive©. Score → prune → quantize → store sparse. The operator is built once per weight matrix and reused across all inference calls. Build time is amortised across thousands of calls.
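The four steps map onto a short pipeline. Everything below is a hypothetical stand-in: the names `build_rolv_like_operator` and `apply_operator`, the L2 row scoring, and the fp16 quantization are illustrative assumptions, since ROLV's real build is proprietary.

```python
import numpy as np

def build_rolv_like_operator(W: np.ndarray, sparsity: float) -> dict:
    """Illustrative four-step build: score -> prune -> quantize -> store sparse."""
    scores = np.linalg.norm(W, axis=1)               # 1. score each row
    n_prune = int(W.shape[0] * sparsity)
    keep = np.sort(scores.argsort()[n_prune:])       # 2. prune lowest-scoring rows
    data = W[keep].astype(np.float16)                # 3. quantize (example dtype)
    return {"rows": keep, "data": data, "n_rows": W.shape[0]}  # 4. sparse store

def apply_operator(op: dict, x: np.ndarray) -> np.ndarray:
    """Reused on every inference call: arithmetic on stored active rows only."""
    y = np.zeros(op["n_rows"], dtype=np.float32)
    y[op["rows"]] = op["data"].astype(np.float32) @ x.astype(np.float32)
    return y
```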

04 — On Correctness

The ROLV Primitive© is exact on its compressed submatrix — no approximation is introduced by ROLV Primitive© itself. The only source of output error is pruning, which zeroes low-magnitude rows before ROLV Primitive© is built.

This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.

3.9×10⁻⁶
Max error
LLaMA-3.1-8B · all sparsity levels
250×
Tighter than ATOL
ATOL=0.001 standard · ROLV achieves 3.9×10⁻⁶
26/26
Correctness PASS
22 H200 + 22 B200 + 22 Intel + 22 AMD + 24 T4 weights + 4 LLaMA levels
✓ ×26
Perturbation tests
One weight change → output hash changes every run
04 — Calculators

RSMT™ & ROLVswitch™

RSMT™ Calculator

Find the exact sparsity threshold where sparse storage beats dense for your dtype.

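A back-of-envelope version of the threshold RSMT™ computes; this sketch assumes generic CSR-style accounting (element bytes plus 4-byte column indices, row pointers ignored), which is an assumption rather than RSMT's published model:

```python
def storage_breakeven_sparsity(dtype_bytes: float, index_bytes: int = 4) -> float:
    """Sparsity above which sparse (CSR-like) storage beats dense.

    Dense:  n_elements * dtype_bytes
    Sparse: nnz * (dtype_bytes + index_bytes)
    Break-even where (1 - s) * (dtype_bytes + index_bytes) = dtype_bytes.
    """
    return 1.0 - dtype_bytes / (dtype_bytes + index_bytes)

print(f"fp32: {storage_breakeven_sparsity(4):.0%}")  # 50%
print(f"bf16: {storage_breakeven_sparsity(2):.0%}")  # 67%
```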
ROLVswitch™

Finds the exact sparsity where vendor dense hits VRAM congestion first — your switch point to sparse.

05 — Contact

Contact Us

[email protected] · 3 Patents Pending