Scoring + the incentive shift

verdicts to weight

Canonical source: reliquary-forge/docs/scoring.md. Synced into this site as apps/web/content/synced/scoring.md.

The shift: from "do as many rollouts as you can" → "find me the rollouts that I need to train on." Miners now compete on the best inferences, not the most. Two pricing knobs together collapse the quantity-farming exploit:

Per-rollout quality — each accepted, unique verdict adds its dense_reward (fraction of clauses satisfied) to the miner's total. Mass-producing low-reward completions earns less than a few high-reward ones.
Superlinear weighting — totals are squared before normalization. weight ∝ total². A miner with 2× the raw reward earns 4× the weight; 3× → 9×.

Pipeline

Rollouts → Verdicts → Uniqueness Dedup → Score Aggregation
        → Superlinear Weighting → Normalized Weights

Uniqueness dedup

Each rollout is keyed by SHA-256 of its comma-joined token ids. The first occurrence of (challenge_id, uniqueness_key) is marked is_unique=true; duplicates are excluded from scoring entirely. Submitting the same completion twice across a window earns nothing extra.
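The dedup rule above can be sketched in a few lines. This is a minimal illustration, not the validator's actual code: the dict fields `challenge_id` and `token_ids`, the UTF-8 encoding, the hex digest, and the absence of spaces in the joined string are all assumptions.

```python
import hashlib


def uniqueness_key(token_ids):
    """SHA-256 hex digest of the comma-joined token ids (encoding assumed UTF-8)."""
    joined = ",".join(str(t) for t in token_ids)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def mark_unique(rollouts):
    """Return a list of (rollout, is_unique) pairs in submission order.

    Each rollout is a dict with the hypothetical fields
    'challenge_id' and 'token_ids'.
    """
    seen = set()
    flagged = []
    for r in rollouts:
        key = (r["challenge_id"], uniqueness_key(r["token_ids"]))
        # Only the first occurrence of a (challenge_id, key) pair is unique.
        flagged.append((r, key not in seen))
        seen.add(key)
    return flagged
```

Joining with a separator matters: without the comma, token ids `[1, 23]` and `[12, 3]` would produce the same string and therefore the same key.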

Acceptance gate

A verdict is accepted when both:

proof_valid = true — the sketch proof passed verification
evaluation.accepted = true — the SAT evaluation completed successfully

Anything else (failed proof, malformed environment, copycat, log-prob drift) is rejected before scoring.

Score aggregation

For each accepted, unique verdict the miner's total accumulates:

totals[miner] = Σ dense_reward across all accepted, unique verdicts

Dense reward is the fraction of clauses satisfied — partial credit for partial solutions. Validators recompute it independently from the clause evaluation; miners cannot self-report.
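As a rough sketch, the acceptance gate and the aggregation step combine into one loop. The verdict layout below (a dict with a `miner` key and a nested `evaluation` dict) is an assumption for illustration; only the field names `proof_valid`, `evaluation.accepted`, `is_unique`, and `dense_reward` come from the text above.

```python
from collections import defaultdict


def aggregate_totals(verdicts):
    """Sum dense_reward per miner over accepted, unique verdicts."""
    totals = defaultdict(float)
    for v in verdicts:
        # Acceptance gate: both flags must be true.
        accepted = v["proof_valid"] and v["evaluation"]["accepted"]
        if accepted and v["is_unique"]:
            totals[v["miner"]] += v["dense_reward"]
    return dict(totals)
```

A miner whose verdicts are all rejected or duplicated never appears in the result, so it contributes nothing to the later normalization.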

Superlinear weighting

powered[miner] = totals[miner] ^ SUPERLINEAR_EXPONENT  // = 2.0
weight[miner]  = powered[miner] / Σ powered[*]

Weights sum to 1.0 across miners. With exponent = 2.0, an Alice-vs-Bob ratio of 2.18× in raw reward (2.4 vs 1.1) becomes about 4.76× in onchain weight (5.76 / 1.21).
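The two formulas above translate directly into code. This is a minimal sketch: the function name is hypothetical, and returning all-zero weights when every total is zero is an assumption, since the source does not say how that case is handled.

```python
SUPERLINEAR_EXPONENT = 2.0


def superlinear_weights(totals, exponent=SUPERLINEAR_EXPONENT):
    """Raise each miner's total to `exponent`, then normalize to sum to 1.0."""
    powered = {m: t ** exponent for m, t in totals.items()}
    denom = sum(powered.values())
    if denom == 0:
        # Assumed behavior: nobody earned reward, so nobody gets weight.
        return {m: 0.0 for m in totals}
    return {m: p / denom for m, p in powered.items()}
```

Calling it on totals of 2.4, 1.1, and 0.5 gives weights of roughly 0.798, 0.168, and 0.035.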

DAPO zone filter (the trainer-side dual of the incentive)

Per §9.1 of the paper, Forge's GRPO trainer only consumes rollout groups where σ ≥ σ_min — the within-group reward standard deviation crosses a threshold. Below that, the normalized advantage (rᵢ − μ)/σ collapses to zero and the group carries no gradient signal.

The dashboard surfaces this directly in the Training Signal · Zone Filter panel (zone-hit ratio per window, group σ histogram, mean reward inside zone). Miners whose rollouts land in the zone are the ones whose work actually trains the model, and they are the ones the incentive curve concentrates weight on.
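To make the filter concrete, here is a small sketch of the group test and the normalized advantage. The function names are hypothetical, the value of σ_min is not given here and is passed in as a parameter, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev


def in_zone(group_rewards, sigma_min):
    """True when the within-group reward std reaches sigma_min."""
    return len(group_rewards) > 1 and pstdev(group_rewards) >= sigma_min


def normalized_advantages(group_rewards):
    """(r_i - mu) / sigma per rollout; all zeros when sigma is 0."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    if sigma == 0:
        # Every reward equals the mean, so there is no gradient signal.
        return [0.0 for _ in group_rewards]
    return [(r - mu) / sigma for r in group_rewards]


def zone_hit_ratio(groups, sigma_min):
    """Fraction of groups that pass the zone filter."""
    if not groups:
        return 0.0
    return sum(in_zone(g, sigma_min) for g in groups) / len(groups)
```

A group where every rollout scores 0.5 fails the filter for any positive σ_min, while a group with rewards 0.0 and 1.0 has σ = 0.5 and advantages of -1.0 and 1.0.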

Worked example

Miner	Accepted + unique	dense_rewards	total	total²	weight
Alice	3	0.8, 0.9, 0.7	2.4	5.76	0.798
Bob	2	0.6, 0.5	1.1	1.21	0.168
Carol	1	0.5	0.5	0.25	0.035

Alice has 2.18× Bob's raw reward (2.4 / 1.1) but receives about 4.76× the weight (5.76 / 1.21).

Sybil resistance

Splitting identity is punished by the same superlinear curve. One miner with total 5.0 earns 5²=25 units of weight. Two miners with total 2.5 each earn 2 × 2.5² = 12.5 — half. The math actively discourages identity fragmentation.
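The arithmetic generalizes to any even split. The helper below is a hypothetical illustration that computes pre-normalization weight units only; the final share of weight also depends on what other miners earn.

```python
def split_weight_units(total, k, exponent=2.0):
    """Pre-normalization weight units when `total` is split evenly over k identities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Each identity earns (total / k) ** exponent; there are k of them.
    return k * (total / k) ** exponent
```

With exponent 2 this simplifies to total² / k, so splitting a total of 5.0 across two identities halves the units from 25.0 to 12.5, and each further split shrinks them again.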

The nine stages (verifier pipeline)

Verdicts must clear:

schema — verdict envelope well-formed
tokens — every token in vocab
prompt — prompt matches window policy
proof — sketch receipt matches within tolerance
termination — completion ended cleanly
environment — the env's own evaluator accepts
reward — reward value matches declared formula
logprob — per-position log-prob drift < 0.15 on ≥ 51 % positions
distribution — median importance ratio in [0.85, 1.15]

Stages 1–7 are hard rejects. Stages 8–9 can soft-flag (accepted but observed).
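The thresholds for the logprob and distribution stages can be sketched as below. The thresholds come from the list above; everything else is an assumption made for illustration: that drift means the absolute difference between the miner's and the validator's per-position log-probs, that the importance ratio is exp(validator − miner), and the function names themselves.

```python
import math
from statistics import median

DRIFT_TOL = 0.15              # per-position log-prob drift bound
MIN_POSITION_FRACTION = 0.51  # share of positions that must be within bound
RATIO_BAND = (0.85, 1.15)     # allowed median importance ratio


def logprob_stage_ok(miner_lp, validator_lp):
    """Drift below DRIFT_TOL on at least 51 % of positions (drift definition assumed)."""
    if not miner_lp or len(miner_lp) != len(validator_lp):
        return False
    within = sum(abs(m - v) < DRIFT_TOL for m, v in zip(miner_lp, validator_lp))
    return within / len(miner_lp) >= MIN_POSITION_FRACTION


def distribution_stage_ok(miner_lp, validator_lp):
    """Median importance ratio inside RATIO_BAND (ratio definition assumed)."""
    if not miner_lp or len(miner_lp) != len(validator_lp):
        return False
    ratios = [math.exp(v - m) for m, v in zip(miner_lp, validator_lp)]
    lo, hi = RATIO_BAND
    return lo <= median(ratios) <= hi
```

Because both checks are aggregate statistics over positions, a single outlier position does not fail a verdict on its own, which fits these stages being able to soft-flag rather than hard-reject.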
