Canonical source: reliquary-forge/docs/scoring.md. Synced into this site as apps/web/content/synced/scoring.md.
The shift: from "do as many rollouts as you can" → "find me the rollouts that I need to train on." Miners now compete on the best inferences, not the most. Two pricing knobs together collapse the quantity-farming exploit:
- Per-rollout quality: each accepted, unique verdict adds its dense_reward (fraction of clauses satisfied) to the miner's total. Mass-producing low-reward completions earns less than a few high-reward ones.
- Superlinear weighting: totals are squared before normalization, so weight ∝ total². A miner with 2× the raw reward earns 4× the weight; 3× → 9×.
Pipeline
Rollouts → Verdicts → Uniqueness Dedup → Score Aggregation
→ Superlinear Weighting → Normalized Weights
Uniqueness dedup
Each rollout is keyed by SHA-256 of its comma-joined token ids. The
first occurrence of (challenge_id, uniqueness_key) is marked
is_unique=true; duplicates are excluded from scoring entirely.
Submitting the same completion twice across a window earns nothing
extra.
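The dedup rule above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's actual code: the rollout dict shape (`challenge_id`, `token_ids`), the helper names, the hex digest, the UTF-8 encoding, and the separator-only join (no spaces) are all assumptions, and "first occurrence" is taken to mean first in iteration order.

```python
import hashlib


def uniqueness_key(token_ids):
    """SHA-256 hex digest of the comma-joined token ids (encoding is assumed)."""
    joined = ",".join(str(t) for t in token_ids)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def mark_unique(rollouts):
    """Return (rollout, is_unique) pairs; the first (challenge_id, key) wins."""
    seen = set()
    marked = []
    for rollout in rollouts:
        key = (rollout["challenge_id"], uniqueness_key(rollout["token_ids"]))
        marked.append((rollout, key not in seen))
        seen.add(key)
    return marked
```

Because the key includes the challenge id, the same token sequence submitted for two different challenges counts as two distinct rollouts in this sketch.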
Acceptance gate
A verdict is accepted when both:
- proof_valid = true: the sketch proof passed verification
- evaluation.accepted = true: the SAT evaluation completed successfully
Anything else (failed proof, malformed environment, copycat, log-prob drift) is rejected before scoring.
Score aggregation
For each accepted, unique verdict the miner's total accumulates:
totals[miner] = Σ dense_reward across all accepted, unique verdicts
Dense reward is the fraction of clauses satisfied — partial credit for partial solutions. Validators recompute it independently from the clause evaluation; miners cannot self-report.
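As a rough sketch, the acceptance gate and the aggregation sum might combine as follows. The field names `proof_valid`, `evaluation.accepted`, `is_unique`, and `dense_reward` come from this page; the dict layout, the `miner` field, and the function names are hypothetical.

```python
from collections import defaultdict


def is_accepted(verdict):
    """Both gate conditions must hold: valid proof and accepted evaluation."""
    evaluation = verdict.get("evaluation") or {}
    return verdict.get("proof_valid") is True and evaluation.get("accepted") is True


def aggregate_totals(verdicts):
    """Sum dense_reward per miner over accepted, unique verdicts only."""
    totals = defaultdict(float)
    for verdict in verdicts:
        if is_accepted(verdict) and verdict.get("is_unique") is True:
            totals[verdict["miner"]] += verdict["dense_reward"]
    return dict(totals)
```

In this sketch `dense_reward` is assumed to be the validator-recomputed value, never a number supplied by the miner.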
Superlinear weighting
powered[miner] = totals[miner] ^ SUPERLINEAR_EXPONENT // = 2.0
weight[miner] = powered[miner] / Σ powered[*]
Weights sum to 1.0 after normalization. With exponent = 2.0, an Alice-vs-Bob ratio of 2.18× in raw reward becomes about 4.76× in onchain weight (2.4² / 1.1² = 5.76 / 1.21).
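The two formulas above translate almost directly into Python. The sketch below is illustrative only: the function name is hypothetical, and returning all-zero weights when every total is zero is an assumption, since this page does not say how that case is handled.

```python
SUPERLINEAR_EXPONENT = 2.0


def superlinear_weights(totals, exponent=SUPERLINEAR_EXPONENT):
    """Raise each miner's total to `exponent`, then normalize to sum to 1."""
    powered = {miner: total ** exponent for miner, total in totals.items()}
    denom = sum(powered.values())
    if denom == 0:
        # Assumed behavior: nobody earned reward, so nobody gets weight.
        return {miner: 0.0 for miner in totals}
    return {miner: value / denom for miner, value in powered.items()}
```

Fed the totals 2.4, 1.1, and 0.5 for Alice, Bob, and Carol, this returns weights that round to 0.798, 0.168, and 0.035.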
DAPO zone filter (the trainer-side dual of the incentive)
Per §9.1 of the paper, Forge's GRPO trainer only consumes
rollout groups where σ ≥ σ_min — the within-group reward
standard deviation crosses a threshold. Below that, the normalized
advantage (rᵢ − μ)/σ collapses to zero and the group carries no
gradient signal.
The dashboard surfaces this directly in the Training Signal · Zone Filter panel (zone-hit ratio per window, group σ histogram, mean reward inside zone). Miners whose rollouts land in the zone are the ones whose work actually trains the model, and the ones the incentive curve concentrates weight on.
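A minimal sketch of the zone filter, assuming σ is the population standard deviation of the rewards in one rollout group. This page does not say which estimator the trainer uses or what value σ_min takes, so both `pstdev` and the `sigma_min` argument here are placeholders, and the function name is hypothetical.

```python
import statistics


def group_advantages(rewards, sigma_min):
    """Return normalized advantages, or None when the group is filtered out."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma < sigma_min or sigma == 0.0:
        # Low-variance group: every (r - mu) is near zero, so no gradient signal.
        return None
    return [(r - mu) / sigma for r in rewards]
```

A group where every rollout scores the same (for example four rewards of 0.5) has σ = 0 and is dropped, while a group that mixes successes and failures passes the filter and yields non-zero advantages.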
Worked example
| Miner | Accepted + unique | dense_rewards | total | total² | weight |
|---|---|---|---|---|---|
| Alice | 3 | 0.8, 0.9, 0.7 | 2.4 | 5.76 | 0.798 |
| Bob | 2 | 0.6, 0.5 | 1.1 | 1.21 | 0.168 |
| Carol | 1 | 0.5 | 0.5 | 0.25 | 0.035 |
Alice has 2.18× Bob's raw reward but receives about 4.76× the weight (5.76 / 1.21).
Sybil resistance
Splitting identity is punished by the same superlinear curve. One
miner with total 5.0 earns 5²=25 units of weight. Two miners with
total 2.5 each earn 2 × 2.5² = 12.5 — half. The math actively
discourages identity fragmentation.
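The same arithmetic generalizes to any even split: with exponent 2, dividing a total T across k identities yields k · (T/k)² = T²/k powered units in aggregate. The helper below is a hypothetical illustration of that calculation. It works in pre-normalization units, so the final onchain share still depends on what every other miner earned.

```python
def powered_total(total, identities=1, exponent=2.0):
    """Combined powered score when `total` is split evenly across identities."""
    share = total / identities
    return identities * share ** exponent


single = powered_total(5.0)               # one identity holding 5.0
split = powered_total(5.0, identities=2)  # two identities holding 2.5 each
```

Here `single` is 25.0 and `split` is 12.5, matching the figures in the paragraph above.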
The nine stages (verifier pipeline)
Verdicts must clear:
1. schema: verdict envelope well-formed
2. tokens: every token in vocab
3. prompt: prompt matches window policy
4. proof: sketch receipt matches within tolerance
5. termination: completion ended cleanly
6. environment: the env's own evaluator accepts
7. reward: reward value matches declared formula
8. logprob: per-position log-prob drift < 0.15 on ≥ 51% of positions
9. distribution: median importance ratio in [0.85, 1.15]
Stages 1–7 are hard rejects. Stages 8–9 can soft-flag (accepted but observed).
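The hard/soft split can be sketched as below. Several details are assumptions rather than documented behavior: "drift" is taken to be the absolute difference between the miner's and the validator's per-position log-probs, the importance ratio is taken to be exp(validator − miner), and a failed soft stage always flags rather than rejects (the text only says it can). The function names and the shape of the result are hypothetical, and both helpers expect non-empty, equal-length sequences.

```python
import math
import statistics

HARD_STAGES = ["schema", "tokens", "prompt", "proof",
               "termination", "environment", "reward"]
SOFT_STAGES = ["logprob", "distribution"]


def logprob_stage_ok(miner_lp, validator_lp, max_drift=0.15, min_fraction=0.51):
    """Drift must stay under max_drift on at least min_fraction of positions."""
    within = sum(1 for a, b in zip(miner_lp, validator_lp) if abs(a - b) < max_drift)
    return within / len(miner_lp) >= min_fraction


def distribution_stage_ok(miner_lp, validator_lp, low=0.85, high=1.15):
    """Median importance ratio (assumed direction) must lie in [low, high]."""
    ratios = [math.exp(b - a) for a, b in zip(miner_lp, validator_lp)]
    return low <= statistics.median(ratios) <= high


def classify(results):
    """Map {stage: passed} to ("rejected", stage) or ("accepted", soft_flags)."""
    for stage in HARD_STAGES:
        if not results[stage]:
            return ("rejected", stage)
    return ("accepted", [s for s in SOFT_STAGES if not results[s]])
```

Checking the hard stages in order means a rejection reports the first stage that failed, which keeps the reason for a reject unambiguous in this sketch.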