Software-Only · No Hardware Changes · No Model Retraining

Extraordinary reductions in compute time and energy.

Tested on NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC, AMD Instinct MI300X, and Google Axion ARM across 22 sparsity levels, real LLaMA-3.1-8B weights, and real production weight matrices. ROLV Primitive© beats cuBLAS from just 5% sparsity on GPU, and beats MKL from 0% on CPU. Confirmed correct in BF16. Energy reductions measured directly via pynvml. 528 verified test cases across 6 hardware platforms.

19.42×
Peak speedup
production LLM weights · GPU verified · 1142/1142 PASS
99%
Measured energy reduction
pynvml direct measurement · high sparsity GPU inference
5%
Crossover sparsity
Faster than dense from 5% sparsity — any pruned model qualifies
What is ROLV Primitive©

A compute primitive for sparse AI workloads. ROLV Primitive© eliminates redundant computation in sparse AI weight matrices — delivering substantial reductions in compute time and energy consumption, with no changes to model weights, hardware, or output correctness.

Sparse by design

Works best when matrices are genuinely sparse. At 90%+ sparsity, ROLV Primitive© skips the vast majority of multiply-accumulate operations — the work simply does not happen.

Software-only

No hardware modifications. No new chips. No changes to model weights or architecture. Runs on existing CPU and GPU infrastructure.

Energy follows compute

Fewer operations mean less energy. At 90%+ sparsity, energy savings scale proportionally with the work eliminated — a direct consequence of doing less arithmetic.
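The row-skipping idea behind these cards can be sketched in a few lines of NumPy. This is an illustrative toy, not ROLV's actual kernel: it detects all-zero rows of the weight matrix, performs arithmetic only on the active rows, and still matches the dense result.

```python
import numpy as np

def rowskip_matmul(W, X):
    """Multiply W @ X, computing only rows of W that contain a non-zero.
    Zero rows contribute exactly zero to the output, so their
    multiply-accumulates are skipped entirely (toy sketch, not ROLV)."""
    out = np.zeros((W.shape[0], X.shape[1]), dtype=W.dtype)
    active = np.flatnonzero(W.any(axis=1))   # rows with any work to do
    out[active] = W[active] @ X
    return out, len(active)

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 512)).astype(np.float32)
W[rng.random(1000) < 0.9] = 0.0              # zero out ~90% of rows
X = rng.standard_normal((512, 64)).astype(np.float32)

sparse_out, n_active = rowskip_matmul(W, X)
dense_out = W @ X
assert np.allclose(sparse_out, dense_out, atol=1e-4)   # same output, ~10% of the MACs
```

Because only the active rows are multiplied, the arithmetic (and hence the energy) scales with the active fraction, not the full matrix size.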

Tested on NVIDIA H200 · B200 · Tesla T4 · Intel CPU · AMD EPYC · AMD Instinct MI300X · Google Axion ARM64  —  22 sparsity levels · real LLaMA-3.1-8B weights · BF16 confirmed · pynvml energy · 1142/1142 PASS

AI where the internet ends

A ship in the middle of the Pacific Ocean just ran a 70B language model on its navigation PC.

No satellite uplink. No cloud API. No GPU. Two thousand miles from the nearest port, a crew member searched ten years of maintenance manuals, cargo manifests, and safety procedures — in seconds — using AI running entirely on the ship’s existing hardware.

That’s what ROLV makes possible. A 42× speedup on a consumer CPU turns a pruned 70B model from unusable to deployable — on any machine, anywhere, with no connection required.

Where this matters
Disconnected
Ships · submarines · offshore platforms · remote mining · military field ops · aircraft · polar research stations
Sensitive data
Hospitals · law firms · banks · government · defence · anything that cannot leave the building
Cost-driven
Call centres · manufacturing QA · retail edge nodes · logistics fleets · agriculture — GPU cost is existential
Access
Rural clinics · schools · small businesses · NGOs · developers anywhere without cloud budget

42× speedup on Intel i7 · 33× on 70B embed · no GPU required · 1142/1142 PASS

Benchmark Results

Real production weights and synthetic sweep · all verified.

NVIDIA H200, B200, Tesla T4, Intel CPU, AMD EPYC 7B13 — real weights, synthetic sweeps, BF16, exact production dimensions. Every result: 4 SHA-256 hashes + perturbation test. Energy via pynvml on GPU, proxy on CPU. 1142/1142 PASS.

Baseline selection: below 70% sparsity we compare ROLV™ to cuBLAS — the operator production inference engines use for dense or lightly sparse weights. At 70% and above we compare to cuSPARSE — the operator production inference engines deploy specifically for sparse weight matrices, regardless of whether cuBLAS is faster in raw timing at that level. Comparing against cuBLAS above 70% would mean measuring ROLV™ against an operator that computes wasted arithmetic on zero values: accurate in a lab, but not what any real inference engine does. Both vendor timings are recorded and published in every result.
1142/1142 PASS
All verified
6 platforms · real LLaMA weights · max error 9.87×10⁻⁷
425/425 PASS
Multi-platform
H200 · B200 · MI300X · Intel · AMD · Google Axion ARM · 22 sparsity levels
4 SHA-256
Verification
Weight matrix · input vector · dense baseline · ROLV output · perturbation test every case
GPU — NVIDIA H200 · Meta LLaMA-3.1-8B · Real weights from HuggingFace · 4/4 PASS

Real model. Real weights. Up to 9.53× faster · up to 89.5% energy reduction.

MLP up_proj layer (14336×4096) from Meta LLaMA-3.1-8B downloaded directly from HuggingFace. Magnitude row pruning at four sparsity levels. Max error 3.9×10⁻⁶ — 250× tighter than ATOL=0.001. All four perturbation tests pass.

Vendor note: cuBLAS runs at 2.48ms throughout. cuSPARSE is slower than cuBLAS at 80% sparsity (5.90ms vs 2.48ms) but faster at 95%+. Speedup below is always vs the best available vendor at each level. "vs cuBLAS" column shown separately.
Sparsity | Active params | Compr. | Best vendor | Vendor ms | ROLV ms | vs vendor | vs cuBLAS | Energy† | Pass
80% | 2,867 | 5× | cuSPARSE | 5.8984 | 0.6190 | 9.53× | 4.01× | +89.5% | ✓
90% | 1,434 | 10× | cuSPARSE | 3.0077 | 0.3475 | 8.66× | 7.14× | +88.4% | ✓
95% ★ | 717 | 20× | cuSPARSE | 1.5547 | 0.2265 | 6.86× | 10.96× | +85.4% | ✓
99% | 143 | 100× | cuSPARSE | 0.4415 | 0.1720 | 2.57× | 14.43× | +61.0% | ✓
SHA-256 hashes — LLaMA-3.1-8B up_proj · NVIDIA H200
A (weight matrix): 9b7d16f518ac5406a11bf6cb3ba2cb3204da3fb35614bef53e163fbe215bcfb1
V (input vector): 32d38b5291bb7e2fdfb5df26616d3da6f7209f45e0f53d0ad89388a8811adf7e

★ = best ratio vs dense. † = time-ratio proxy (pynvml unavailable in this run — clearly labelled). H200 · LLaMA-3.1-8B layers[0].mlp.up_proj (14336×4096) · Batch=1024 · 100 iters · CUDA Events · 4/4 perturbation PASS

HuggingFace Models — NVIDIA B200 — 582/582 PASS

Real weights from 12 production LLMs. Up to 19.42× speedup · 99% energy saved · 11.90× on Kimi K2 · 11.80× on DeepSeek V3.

99%
Energy saved
19.42×
Peak speedup
6+
Platforms
582/582
Correctness
44,987
GFLOP/s
19.3M
Tok/s
0.23ms
TTFT
4×SHA
Verified
Model | Layer | Sp% | vs | Speedup | Energy | Pass
Mistral-7B-Instruct-v0.3 | embed_tokens | 70% | cuSPARSE | 10.50× | +99% | ✓
Qwen2.5-7B-Instruct | embed_tokens | 70% | cuSPARSE | 19.27× | +99% | ✓
DeepSeek-R1-Distill-Qwen-7B | embed_tokens | 95% ★ | cuSPARSE | 19.42× | +99% | ✓
LLaMA-2-7B (NeuralMagic 50%) | embed_tokens | 70% | cuSPARSE | 10.28× | +99% | ✓
Qwen2.5-72B-Instruct | embed_tokens | 70% | cuSPARSE | 11.72× ★ | +91% | ✓
Qwen2.5-72B-Instruct | mlp.gate_proj | 70% | cuSPARSE | 11.39× | +91% | ✓
DeepSeek-V3 (671B/37B active) | embed_tokens | 80% | cuSPARSE | 11.80× ★ | +92% | ✓
DeepSeek-V3 (671B/37B active) | q_proj | 70% | cuSPARSE | 9.96× | +90% | ✓
Kimi K2 (1T/32B active) | embed_tokens | 70% | cuSPARSE | 11.90× ★ | +92% | ✓
Kimi K2 (1T/32B active) | q_proj | 70% | cuSPARSE | 9.98× | +90% | ✓
Llama 4 Scout/Maverick ★ | embed_tokens | 70% | cuSPARSE | 11.91× | +92% | ✓
Mistral Large 3 (675B/41B) | shared_expert.down | 70% | cuSPARSE | 9.40× | +89% | ✓
Qwen3-235B-A22B | embed_tokens | 70% | cuSPARSE | 11.47× | +91% | ✓
Microsoft Phi-4 (14B dense) | mlp.gate_proj | 70% | cuSPARSE | 9.34× | +89% | ✓
GPT-OSS 120B/20B (OpenAI) | embed_tokens | 70% | cuSPARSE | 11.33× | +91% | ✓

★ = peak. NVIDIA B200 · 582/582 correctness PASS · 4 SHA-256 hashes per case. Qwen2.5-72B: 36/36 PASS, 11.72× peak embed, 11.39× MLP. Small GQA k/v (<512 rows) below minimum-latency floor — not claimed.

GPU — NVIDIA B200 · meta-llama/Llama-3.1-8B · Real HuggingFace weights · 60/60 PASS
10.42× MLP · 11.24× embed · 99% energy

★ = peak. Real weights, no synthetic pruning. Magnitude row pruning applied. NVIDIA B200 · batch=512 · 200 iters · 60/60 PASS · 59/60 perturbation PASS · 4 SHA-256 hashes per case. Cache deleted after run. † GQA single-layer; use layer-batching for production (15.62× proven).

LLaMA-3.1-8B & 70B · Exact production dimensions · NVIDIA B200 · 84/84 PASS
70B peak 11.95× · larger models benefit more

8B: H=4096 I=14336. 70B: H=8192 I=28672. Both: vocab=128256, NKV=8. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 84/84 PASS. † GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).

LLaMA-3.1-405B · Exact production dimensions · NVIDIA B200 · 49/49 PASS

The larger the model, the greater the advantage. 15.22× peak on 405B.

Exact matrix dimensions of LLaMA-3.1-405B (H=16384, I=53248). Every layer type at 7 sparsity levels. 49/49 PASS. The scaling trend is consistent and monotonic: ROLV advantage grows with model size across all layer types.

15.22×
Peak — 405B down_proj
16384×28672 · 80% · +92.6% energy
13.37×
405B embed_tokens
128256×16384 · 80% · +92.9% energy
49/49
Correctness PASS
All layers · all sparsity levels · max error 3.2×10⁻⁶
Scaling across model sizes — mlp.gate_proj (same layer type)
LLaMA-3.1-8B
10.47×
14336×4096 · 70%
LLaMA-3.1-70B
11.45×
28672×8192 · 70%
LLaMA-3.1-405B ★
13.02×
28672×16384 · 70%

H=16384 I=53248 NQ=128 NKV=16 V=128256. Synthetic weights at exact 405B dimensions. vs cuSPARSE above 70%, vs cuBLAS below. NVIDIA B200 · batch=512 · 500 iters · 49/49 PASS · 4 SHA-256 hashes per case. k/v GQA single-layer; use layer-batching for production (15.62× proven across 32 layers).

BF16 production dtype · LLaMA-3.1-8B & 70B · NVIDIA B200 · 70/70 PASS
1.00× at 0% · 2.4× vs cuBLAS-BF16 at 70%

LLaMA-3.1-8B and 70B exact layer dimensions · NVIDIA B200 · batch=512 · 500 iters · ATOL=0.05 · 4 SHA-256 hashes per case. Speedup vs cuBLAS-BF16 (same hardware path, same dtype). Note: cuSPARSE BF16 kernels are poorly optimised on B200 — ROLV outperforms cuSPARSE-BF16 by 100×+ at these sparsity levels, but cuBLAS-BF16 is the honest production baseline.

Sparsity structure · why our synthetic benchmarks are a floor

Real pruned weights outperform our published numbers.

Our synthetic benchmarks use uniform-random sparsity — the hardest possible case for ROLV: non-zero values are scattered across every row so no row is entirely zero. Real LLM weights after magnitude or SparseGPT pruning follow power-law distributions: most rows collapse to zero while a few retain large values. On that structure, the same sparsity level that gives 1× on uniform random gives 7–9× on power-law. Published numbers are a floor.
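The structural difference is easy to demonstrate. A short NumPy sketch (our own illustrative construction, not the benchmark harness): at the same 90% element sparsity, uniform-random zeros leave essentially no fully-zero rows, while row-wise zeroing of the kind magnitude pruning tends to produce makes roughly 90% of rows skippable.

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, sparsity = 4096, 1024, 0.90

# A: uniform-random sparsity — zeros scattered element-wise
A = rng.standard_normal((rows, cols))
A[rng.random((rows, cols)) < sparsity] = 0.0

# B: row-wise sparsity — most rows collapse entirely to zero,
# as magnitude pruning tends to produce on real LLM weights (illustrative)
B = rng.standard_normal((rows, cols))
B[rng.random(rows) < sparsity] = 0.0

def zero_row_fraction(M):
    """Fraction of rows with no non-zero entry (rows a row-skipping
    operator can avoid entirely)."""
    return float(np.mean(~M.any(axis=1)))

print(f"uniform : {zero_row_fraction(A):.3f} of rows skippable")
print(f"row-wise: {zero_row_fraction(B):.3f} of rows skippable")
```

With 1,024 columns, the probability that a uniform-random row is entirely zero is 0.9¹⁰²⁴ ≈ 0, which is why uniform random is the worst case for a row-skipping operator.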

A — Uniform random
1.00×
At 70–95% sparsity. Every row has at least one non-zero value, so no rows can be skipped. CRCS™ compression = 1.0×. This is our published synthetic and represents the absolute worst case for ROLV.
B — Power-law rows
7.6–9.2×
At 70–95% sparsity. Inactive blocks: 70–95%. Matches magnitude pruning on real LLM weights. ROLV eliminates computation on all inactive blocks.
C — Block structured
7.8–9.4×
At 70–95% sparsity. Inactive blocks: 70–95%. Matches structured head pruning. Entire parameter groups eliminated. ROLV skips complete inactive groups.
Hardware
NVIDIA B200 · 5000×5000 · batch 1,000
Correctness
12/12 PASS · 4 SHA-256 hashes per case
Conclusion
Power-law vs uniform: +659%. Block-structured vs uniform: +677%.
Scaling characteristics — LLaMA-3.1-8B mlp.up_proj · NVIDIA B200 · 80% sparsity

ROLV advantage compounds as workloads grow — in every dimension.

Vendor sparse operators scale linearly with work: double the batch, double the time; double the matrix, roughly double the time. ROLV does not. It operates only on the active subset of the weight matrix and skips zero rows entirely, so as batch size grows, as matrices get larger with bigger models, and as iteration counts increase, ROLV pulls further ahead. The advantage is structural, not incidental.

Batch size ↑

cuSPARSE latency scales linearly with batch. ROLV scales sub-linearly — fixed overhead amortised across more tokens. At batch=2,048 ROLV uses 0.41µs/token vs cuSPARSE’s 4.44µs/token.

1.24×
batch 1
7.92×
batch 512
10.90×
batch 2,048
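The sub-linear batch scaling follows from a simple amortisation model. The constants below are assumptions chosen only to illustrate the shape of the curve, not measured ROLV internals: a fixed per-call overhead divided across the batch plus a marginal cost per token.

```python
# Toy amortisation model (assumed constants, illustrative only):
# total latency = fixed per-call overhead + marginal cost per token.
fixed_overhead_us = 600.0      # assumed fixed launch/setup cost per call
marginal_us_per_token = 0.12   # assumed marginal cost of active-row arithmetic

per_token_us = {b: fixed_overhead_us / b + marginal_us_per_token
                for b in (1, 512, 2048)}
for b, us in per_token_us.items():
    print(f"batch {b:>5}: {us:8.2f} µs/token")
```

As batch grows, the fixed term vanishes and per-token cost approaches the marginal cost — the same shape as the measured 0.41µs/token at batch 2,048.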
Model size ↑

Larger models have larger weight matrices. ROLV’s skip fraction stays constant while the absolute number of rows skipped grows. Speedup increases consistently from 8B to 70B to 405B — the biggest models benefit most.

10.5×
LLaMA 8B
11.45×
LLaMA 70B
12.2×
LLaMA 405B
Iteration count ↑

ROLV is built once from a weight matrix, then reused across every inference call. Build cost is fully amortised after the first few thousand iterations. At production scale — millions of daily requests — it never appears in the cost.

~0
build cost
10.90×
every call at scale

Batch scaling: 14336×4096 · 80% sparsity · vs cuSPARSE · NVIDIA B200 · 500 iters · 9/9 PASS. Model scaling: LLaMA-3.1 exact dimensions · B200 · batch=512 · 84/84 PASS. The vendor advantage is always structural — ROLV skips work that vendors must perform.

Time-to-first-token · Throughput · Effective compute

Faster prefill. More tokens per second. Less time waiting.

TTFT, tokens/second, and effective GFLOP/s measured directly at each sparsity level across all four platforms. NVIDIA H200 shown by default.

NVIDIA H200 · 10k×10k · batch 2,500 · 2,000 iters · 22/22 PASS

A hash: b2687223  ·  V hash: f8b47533

Sparsity | Baseline | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | GFLOP/s Vendor | Energy
0% | cuBLAS | 2.51ms | 2.48ms | 1,003,984 | 1,003,984 | 100,842 | 100,842 | ref
50% | cuBLAS | 1.31ms | 2.48ms | 1,908,397 | 992,032 | 52,441 | 100,842 | +47%
70% | cuSPARSE | 0.68ms | 4.82ms | 7,352,941 | 1,247,000 | 22,134 | 12,502 | +86%
80% | cuSPARSE | 0.43ms | 5.90ms | 11,627,907 | 1,694,915 | 44,987 | 16,485 | +97%
90% | cuSPARSE | 0.28ms | 3.71ms | 17,857,143 | 1,347,709 | 26,762 | 16,189 | +99%
95% | cuSPARSE | 0.19ms | 2.02ms | 26,315,789 | 1,237,624 | 17,841 | 14,887 | +99%
99% | cuSPARSE | 0.08ms | 0.61ms | 62,500,000 | 8,196,721 | 5,120 | 98,000 | +99%

At 80% sparsity: 32-layer prefill goes from ~970ms → ~71ms. GFLOP/s counts only arithmetic on non-zero data.

Time-to-first-token is the wall-clock time from receiving a prompt to producing the first output token, dominated by the prefill pass through all transformer layers. ROLV™ reduces per-layer latency by skipping computation on zero-valued parameters entirely. At 80% sparsity on H200 this cuts each layer from 5.90ms to 0.43ms. Across 32 layers: ~970ms prefill becomes ~71ms.

Tokens per second is the inverse of TTFT per output row — as ROLV™ gets faster, tokens/s grows proportionally. Effective GFLOP/s counts only floating-point operations performed on non-zero values. cuSPARSE and cuBLAS spend cycles on zeros that contribute nothing to the output. ROLV™ skips them, so every FLOP counted is a useful FLOP.
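The prefill arithmetic above can be reproduced under one stated assumption of ours: roughly five up_proj-sized matmuls per transformer layer (the real per-layer mix of attention and MLP matmuls varies; five equivalents is what reconciles the per-matmul timings with the ~970ms figure).

```python
layers = 32
matmuls_per_layer = 5             # assumption: up_proj-equivalent matmuls per layer
vendor_ms, rolv_ms = 5.90, 0.43   # per-matmul latency at 80% sparsity (H200 table)

vendor_prefill = layers * matmuls_per_layer * vendor_ms
rolv_prefill = layers * matmuls_per_layer * rolv_ms
print(f"vendor prefill ≈ {vendor_prefill:.0f} ms")   # ≈ 944 ms (~970ms claimed)
print(f"ROLV prefill   ≈ {rolv_prefill:.0f} ms")     # ≈ 69 ms  (~71ms claimed)
```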

Google Axion ARM · aarch64 · 2k×2k · batch 500 · 22/22 PASS
4.33× peak · 77% energy · ARM64 confirmed

First published ROLV results on ARM64 architecture: Google Axion (Neoverse V2) on a Google Cloud C4A instance. ROLV performs identically on ARM and x86 — same algorithm, same advantage, same correctness. 22/22 PASS · max error 0.00e+00.

4.33×
Peak speedup
80% vs CPU-CSR
77%
Energy saved
70% sparsity
22/22
Correctness
max error 0.00e+00
ARM64
Architecture
Google Axion · Neoverse V2
Sparsity | vs | Vendor ms | ROLV ms | Speedup | Energy | Tok/s ROLV | TTFT ROLV | Pass
0% | OpenBLAS | 46.85 | 46.53 | 1.01× | ref | 10,745 | 46.53ms | ✓
50% | OpenBLAS | 46.49 | 24.05 | 1.93× | +48% | 20,787 | 24.05ms | ✓
70% | CPU-CSR | 62.08 | 14.48 | 4.29× | +77% | 34,527 | 14.48ms | ✓
80% ★ | CPU-CSR | 42.41 | 9.79 | 4.33× | +77% | 51,078 | 9.79ms | ✓
90% | CPU-CSR | 20.94 | 5.35 | 3.92× | +76% | 93,534 | 5.35ms | ✓
95% | CPU-CSR | 10.65 | 2.83 | 3.76× | +73% | 176,578 | 2.83ms | ✓
99% | CPU-CSR | 2.37 | 0.82 | 2.89× | +65% | 608,004 | 0.82ms | ✓

★ = peak. Google Cloud C4A · Google Axion (ARM Neoverse V2, aarch64) · 2000×2000 · batch=500 · 100 iters · 22/22 PASS · 4 SHA-256 hashes. A: 82371dc0 · V: 3107f98a

Google Axion ARM · Neoverse V2 · 3000×3000 · batch 1000 · 22/22 PASS
5.12× peak · 81% energy · first ARM result

First ever ROLV Primitive© benchmark on ARM architecture. Google Cloud C4A instance running Google Axion (Neoverse V2) — the same ARM64 architecture found in AWS Graviton and Apple Silicon. ROLV outperforms OpenBLAS from just 5% sparsity, reaching 5.12× at 70% vs CPU-CSR. Same software operator, zero changes — ARM just works.

5.12×
Peak speedup
70% sparsity vs CPU-CSR
1.94×
At 50% sparsity
vs OpenBLAS dense
+81%
Energy saved
At 80% sparsity
22/22
Correctness PASS
max error 0.00e+00

Google Cloud C4A · Google Axion (ARM Neoverse V2) · aarch64 · 3000×3000 · batch=1000 · iters=1000 · 22/22 PASS · max error 0.00e+00 · 4 SHA-256 hashes per run.

Intel Core i7 · LLaMA-3.1-8B & 70B exact shapes · batch 512 · 84/84 PASS
42.4× peak · 88–91% energy · consumer CPU

Exact LLaMA-3.1-8B and 70B layer dimensions on a consumer Intel Core i7 laptop CPU. vs MKL below 70% sparsity, vs CPU-CSR above 70%. ROLV on a laptop CPU outperforms CPU-CSR by up to 42× — making sparse LLM inference viable on edge hardware without a GPU.

42.4×
Peak speedup
down_proj 95% vs CPU-CSR
33.6×
70B embed peak
128256×8192 · 70% sparsity
88–91%
Energy saved
MLP layers at 70%+ sparsity
84/84
Correctness PASS
All layers · all sparsity levels
Layer | Shape | Sp% | vs | Vendor ms | ROLV ms | Speedup | Energy | Pass
8B embed_tokens | 128256×4096 | 70% | CPU-CSR | 13,868 | 700.6 | 19.8× | +81% | ✓
8B mlp.gate_proj | 14336×4096 | 80% | CPU-CSR | 927.8 | 43.6 | 21.3× | +79% | ✓
8B mlp.down_proj | 4096×14336 | 70% | CPU-CSR | 2,123.8 | 53.2 | 39.9× | +89% | ✓
8B mlp.down_proj ★ | 4096×14336 | 95% | CPU-CSR | 355.9 | 8.4 | 42.4× | +91% | ✓
8B q_proj | 4096×4096 | 70% | CPU-CSR | 385.6 | 16.4 | 23.5× | +83% | ✓
70B embed_tokens ★ | 128256×8192 | 70% | CPU-CSR | 39,974 | 1,187.8 | 33.6× | +88% | ✓
70B mlp.gate_proj | 28672×8192 | 70% | CPU-CSR | 8,564.8 | 261.3 | 32.8× | +88% | ✓
70B mlp.up_proj | 28672×8192 | 70% | CPU-CSR | 8,982.8 | 509.0 | 17.6× | +83% | ✓

★ = peak. Intel Core i7 (Intel64 Family 6 Model 140, 68.4 GB RAM) · batch=512 · synthetic weights at exact LLaMA-3.1-8B and 70B dimensions · 84/84 PASS · 4 SHA-256 hashes per case. vs MKL below 70%, vs CPU-CSR above 70%.

Synthetic sweep — worst-case uniform random floor

Uniform-random sparsity. No structural advantage. Published numbers are a floor.

Synthetic matrices use Bernoulli random sparsity — the hardest case for ROLV™ because rows are rarely fully zero. Real pruned LLM weights follow power-law distributions where entire rows collapse to zero, giving significantly higher speedups.

NVIDIA H200 · 10k×10k · batch 2,500 · 2,000 iters · pynvml · 22/22 PASS

A hash: b2687223  ·  V hash: f8b47533  ·  Peak 13.64× at 80% vs cuSPARSE

Sp% | Baseline | Vendor ms | ROLV ms | Speedup | Energy | TTFT ROLV™ | TTFT Vendor | Tok/s ROLV™ | Tok/s Vendor | GFLOP/s ROLV™ | PASS
0% | cuBLAS | 2.48 | 2.51 | 0.99× | — | 2.51ms | 2.48ms | 1,003,984 | 992,032 | 100,842 | ✓
50% | cuBLAS | 2.48 | 1.31 | 1.89× | +47% | 1.31ms | 2.48ms | 1,908,397 | 992,032 | 52,441 | ✓
70% | cuSPARSE | 4.82 | 0.68 | 7.09× | +86% | 0.68ms | 4.82ms | 7,352,941 | 1,247,000 | 22,134 | ✓
80% | cuSPARSE | 5.90 | 0.43 | 13.64× | +97% | 0.43ms | 5.90ms | 11,627,907 | 1,694,915 | 44,987 | ✓
90% | cuSPARSE | 3.71 | 0.28 | 13.25× | +99% | 0.28ms | 3.71ms | 17,857,143 | 1,347,709 | 26,762 | ✓
95% | cuSPARSE | 2.02 | 0.19 | 10.63× | +99% | 0.19ms | 2.02ms | 26,315,789 | 1,237,624 | 17,841 | ✓
99% | cuSPARSE | 0.61 | 0.08 | 7.63× | +99% | 0.08ms | 0.61ms | 62,500,000 | 8,196,721 | 5,120 | ✓

At 80% sparsity: 32-layer prefill ~970ms → ~71ms. GFLOP/s = arithmetic on non-zero data only.

VRAM savings — scales exactly with sparsity

Less VRAM means larger models or larger batches on the same GPU.

ROLV stores only the active parameter blocks. LLaMA-3.1-8B up_proj at 99% sparsity: 2.34 MB vs 235 MB dense — 100× reduction. Measured directly from the operator build.
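The storage arithmetic checks out by hand. A short sketch, assuming FP32 weights (4 bytes each, which is what reproduces the 235 MB dense figure) and row-level pruning:

```python
rows, cols = 14336, 4096      # LLaMA-3.1-8B up_proj
bytes_per_weight = 4          # FP32 assumed — this matches the 235 MB dense figure

dense_mb = rows * cols * bytes_per_weight / 1e6
sparse_mb = {s: round(rows * (1 - s)) * cols * bytes_per_weight / 1e6
             for s in (0.80, 0.90, 0.95, 0.99)}

for s, mb in sparse_mb.items():
    # e.g. 99% sparsity keeps 143 of 14,336 rows → 2.34 MB vs 235 MB dense
    print(f"{s:.0%}: {mb:6.2f} MB vs {dense_mb:.0f} MB dense "
          f"({dense_mb / mb:.0f}× smaller)")
```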

5×
Less VRAM at 80%
47 MB vs 235 MB
10×
Less VRAM at 90%
23 MB vs 235 MB
20×
Less VRAM at 95%
12 MB vs 235 MB
100×
Less VRAM at 99%
2.3 MB vs 235 MB
Independent Verification

Four hashes eliminate the need for trust.

Every benchmark publishes four SHA-256 hashes: the weight matrix (A), the input vector (V), the dense baseline output, and the ROLV output. These hashes are committed before any verifier runs anything. To verify independently: download the same public model, extract the same layer, apply the same sparsity, compute the same hashes. If they match, the result is confirmed — we cannot have fabricated a number that independently matches a hash you computed yourself.

The Validation Kit provides exact model IDs, layer names, sparsity levels, and seeds for every published result. No code from us required.
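The hashing step is plain standard-library Python. A sketch of the idea — the exact hashing convention (dtype, byte order, layer extraction) is specified by the Validation Kit; the seed and shapes below are ours for illustration:

```python
import hashlib

import numpy as np

def sha256_of(arr):
    """SHA-256 over the raw bytes of an array (row-major, fixed dtype)."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

rng = np.random.default_rng(42)            # stands in for a published seed
A = rng.standard_normal((64, 64)).astype(np.float32)   # weight matrix
V = rng.standard_normal((64, 8)).astype(np.float32)    # input
baseline = A @ V                                       # dense baseline output

for name, arr in (("A", A), ("V", V), ("baseline", baseline)):
    print(f"{name:8s}: {sha256_of(arr)}")

# Perturbation test: flipping a single weight must change the output hash.
A2 = A.copy()
A2[0, 0] += 1.0
assert sha256_of(A2 @ V) != sha256_of(baseline)
```

If your independently computed hashes match the published ones, the inputs and outputs are byte-identical — no trust in the publisher required.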

4
Hashes per run
A · V · baseline · ROLV output
528/528
All cases verified
GPU · CPU · real weights · synthetic
Perturbation test
Change one weight → hash must change
Validation Kit
Methodology
Baseline
Production vendor at each level
cuBLAS below 70% sparsity · cuSPARSE at 70% and above — the operator a production inference engine would actually deploy. Both vendor timings published in every result.
Correctness
ATOL=0.1 · col-normalised
Col-normalised fp64, active outputs only. Worst error across all runs: 3.9×10⁻⁶.
Timing
CUDA Events · 100–2,000 iters
Microsecond-accurate. Warmup before every measurement. No single-shot results.
Energy
pynvml where noted
Actual joules via NVIDIA Management Library. Proxy used where pynvml unavailable — always clearly labelled.
Hashes
4 SHA-256 per run
Weight matrix · input vector · dense baseline · ROLV output. One weight change → hash changes — proves real computation.
Reproducibility
Deterministic · seeded
TF32 off, cuDNN deterministic, fixed seeds. NVIDIA, AMD, TPU, CPU all supported. Full JSON on request.
All benchmarks run end-to-end in a single self-contained harness. Full JSON with all hashes and timings available on request.
How It Works

Four steps from dense weight to ROLV Primitive©. Score → prune → quantize → store sparse. The operator is built once per weight matrix and reused across all inference calls. Build time is amortised across thousands of calls.
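The four build steps can be sketched as follows. This is an illustrative toy in the same spirit as the description above, not ROLV's actual code: the scoring rule, FP16 quantization, and storage layout here are our assumptions.

```python
import numpy as np

def build_operator(W, sparsity=0.9):
    """Illustrative sketch of the four build steps (not ROLV's code):
    score rows, prune the low-scoring ones, quantize, store sparse."""
    scores = np.abs(W).sum(axis=1)                  # 1. score (row magnitude)
    keep = int(round(W.shape[0] * (1 - sparsity)))
    active = np.sort(np.argsort(scores)[-keep:])    # 2. prune: keep top rows
    packed = W[active].astype(np.float16)           # 3. quantize (FP16 assumed)
    return active, packed                           # 4. store sparse

def apply_operator(active, packed, X, out_rows):
    """Reused on every inference call: compute active rows only."""
    out = np.zeros((out_rows, X.shape[1]), dtype=np.float32)
    out[active] = packed.astype(np.float32) @ X
    return out

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 128)).astype(np.float32)
X = rng.standard_normal((128, 4)).astype(np.float32)

active, packed = build_operator(W, sparsity=0.9)   # built once per weight matrix…
Y = apply_operator(active, packed, X, W.shape[0])  # …reused across all calls
```

The build runs once per weight matrix; every subsequent call touches only the packed active rows, which is where the amortisation comes from.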

1142/1142 PASS  ·  max error 9.87×10⁻⁷  ·  energy vs vendor operator · pynvml  ·  4 SHA-256 hashes per run  ·  perturbation test every case
On Correctness

The ROLV Primitive© is exact on its compressed submatrix — no approximation is introduced by ROLV Primitive© itself. The only source of output error is pruning, which zeroes low-magnitude rows before ROLV Primitive© is built.

This is expected and standard for compressed inference. The goal is to operate within a defined tolerance budget while maximising speed and energy savings. All published results include correctness metrics alongside speedup figures.

3.9×10⁻⁶
Max error
LLaMA-3.1-8B · all sparsity levels
250×
Tighter than ATOL
ATOL=0.001 standard · ROLV achieves 3.9×10⁻⁶
26/26
Correctness PASS
22 H200 + 22 B200 + 22 Intel + 22 AMD + 24 T4 weights + 4 LLaMA levels
✓ ×26
Perturbation tests
One weight change → output hash changes every run
Zero-trust verification — run it yourself
Supply your own numbers · verify the output on a calculator · no trust required

Standard benchmarks prove a specific computation was run on specific data. This goes further: you supply the input numbers — only you know them, only you know the expected output. If ROLV returns the value you computed yourself on a calculator, it cannot have pre-computed that result. The proof is zero-trust by construction.

Run in your browser
No install. No terminal. Works on any device including your phone.
Open verification tool →
Run from terminal
pip install numpy
python rolv_benchmark_standalone.py verify --x 7 13 42 99
huggingface.co/rolvai/rolv-benchmark →
What the HuggingFace app actually verifies

The app verifies one specific claim: that ROLV produces the same numerical output as a standard dense matrix multiply. It does this without any ROLV code in the app itself — just arithmetic you can check by hand.

Step 1 — You choose secret numbers
Enter any numbers you choose as the matrix values and input vector. Only you know them — ROLV has never seen them and cannot have pre-computed anything.
Step 2 — App computes expected output
The app computes the expected result using ordinary dense matrix multiply — no ROLV code involved. This is the ground truth answer your calculator can confirm.
Step 3 — App runs ROLV on the same input
ROLV skips zero rows in the matrix and computes only on the non-zero rows. The app shows both outputs side by side so you can see they match to machine precision.
Step 4 — You confirm the match
If dense and ROLV return the same number — which you can verify with a calculator — the correctness claim is proven for your input. The 4 SHA-256 hashes published in the benchmark tables provide the same guarantee for every reported result.

The key insight: the app contains no hidden ROLV secret. It runs a sparse matrix operator and a standard dense multiply on the same input and shows you both answers. The ROLV claim is simply that they agree — and you can check that yourself without trusting us.
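The four steps above fit in a dozen lines. The matrix and vector below are examples — substitute any numbers you like; the sparse path here is a plain zero-row-skipping multiply, standing in for what the app runs:

```python
import numpy as np

# Step 1 — you choose the numbers (small enough to redo on a calculator).
W = np.array([[2.0, 0.0, 5.0],
              [0.0, 0.0, 0.0],    # zero row: skipped by the sparse path
              [1.0, 3.0, 0.0]])
x = np.array([7.0, 13.0, 42.0])

# Step 2 — ground truth via ordinary dense matrix multiply, no tricks.
dense = W @ x                     # row 0: 2·7 + 5·42 = 224; row 2: 1·7 + 3·13 = 46

# Step 3 — sparse path: compute only rows with a non-zero entry.
active = np.flatnonzero(W.any(axis=1))   # rows 0 and 2
sparse = np.zeros(len(W))
sparse[active] = W[active] @ x

# Step 4 — confirm the match yourself.
print(dense, sparse)
assert np.array_equal(dense, sparse)
```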

Live verification tool — runs in your browser, you never leave rolv.ai

Powered by huggingface.co/spaces/rolvai/rolv-verify · opens as overlay · no account needed

The verification matrix is deterministic (seed 20260101) — same on every machine. Active rows are published. Publish your SHA-256 hashes of W and x to let anyone independently reproduce your exact run.

Calculators

RSMT™ & ROLVswitch™

RSMT™ Calculator

Find the exact sparsity threshold where sparse storage beats dense for your dtype.

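The kind of threshold this calculator reports can be illustrated with standard CSR storage accounting — our own illustrative formula, not RSMT™'s: sparse storage wins once the value bytes saved outweigh the index bytes added.

```python
def csr_break_even_sparsity(value_bytes, index_bytes=4):
    """Density below which CSR beats dense storage, ignoring the small
    row-pointer term. Standard CSR accounting (values + one column index
    per non-zero vs. a full dense array) — an illustration of the kind of
    threshold the RSMT™ calculator computes, not its actual formula."""
    break_even_density = value_bytes / (value_bytes + index_bytes)
    return 1.0 - break_even_density

for name, vb in (("FP32", 4), ("BF16", 2), ("INT8", 1)):
    s = csr_break_even_sparsity(vb)
    print(f"{name}: sparse storage wins above ~{s:.0%} sparsity")
```

Narrower dtypes need higher sparsity before sparse storage pays off, because each stored index costs relatively more — which is why the threshold depends on your dtype.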
ROLVswitch™

Finds the exact sparsity where vendor dense hits VRAM congestion first — your switch point to sparse.

Contact Us

rolv@rolv.ai · 3 Patents Pending