forge · training

The training loop

Each window yields at most one GRPO step. The validator trains on the rollouts that survive verification, using a PPO-clipped objective with a KL penalty against a frozen reference, and publishes the updated checkpoint to Hugging Face every ten trained windows.
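To make the objective concrete, here is a minimal sketch of a group-relative advantage and a PPO-clipped, KL-penalized loss. It is not the validator's trainer code. The function names, the sequence-level log-probabilities, the `clip_eps` and `kl_coef` defaults, and the choice of KL estimator are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate minus a KL penalty, averaged over rollouts.

    Assumption: each log-probability is for a whole rollout. Real
    trainers usually work per token; this keeps the sketch short.
    """
    terms = []
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old,
                                           logp_ref, advantages):
        # Importance ratio between the current and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        # One common non-negative estimator of KL(new || reference).
        log_r = lp_ref - lp_new
        kl = math.exp(log_r) - log_r - 1.0
        # Negate because optimizers minimize.
        terms.append(-(surrogate - kl_coef * kl))
    return mean(terms)


advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
loss = grpo_loss([-1.0] * 4, [-1.0] * 4, [-1.0] * 4, advantages)
```

When the new, old, and reference policies agree, the ratio is 1 and the KL term is 0, so the loss reduces to the negated mean advantage. The clip bounds how far a single step can move the policy away from the one that sampled the rollouts, and the KL term anchors it to the frozen reference.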


FIG.01 · the GRPO loop · one window, one step
legend: in-validator step · published artifact

1 window · one possible GRPO step
10 windows · cadence to publish
frozen ref · KL-penalized policy
/checkpoint · signed custody chain

FIG.02 · what each stage does · the parts the loop can't draw

Market-selected batch. A window seals once enough valid distinct-prompt rollout groups land. The validator recomputes every reward itself, then assembles the GRPO batch from the survivors — never trusting a miner's reported score.
The GRPO step. A PPO-clipped surrogate loss with a KL penalty against the frozen reference, run on healthy full batches. Selected windows that look suspicious are quarantined from training rather than poisoning the gradient.
Published checkpoint. Updated weights push to Hugging Face (ReliquaryForge/qwen3.5-4b-reliquary) every ten trained windows. The signed /checkpoint manifest records the chain of custody from the base model through every step.
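The three stages above can be sketched as a small control loop. This is an illustration under stated assumptions, not the validator's implementation: `MIN_GROUPS`, `recompute_reward`, the `suspicious` flag, and the `Trainer` class are hypothetical, and the verification that decides which rollouts survive is omitted. Only the ten-window publish cadence and the rule that the miner's reported score is never trusted come from the text.

```python
from dataclasses import dataclass

PUBLISH_EVERY = 10  # from the text: publish every ten trained windows
MIN_GROUPS = 2      # hypothetical seal threshold; the real value is not stated


@dataclass
class Rollout:
    prompt_id: str
    completion: str
    reported_score: float  # the miner's claim; never read below


def recompute_reward(rollout):
    """Hypothetical stand-in for the validator's own reward function."""
    return 1.0 if rollout.completion.strip().endswith("42") else 0.0


def assemble_batch(rollouts, min_groups=MIN_GROUPS):
    """Group rollouts by prompt, scoring each with the validator's reward.

    Returns None when too few distinct-prompt groups landed to seal.
    """
    groups = {}
    for r in rollouts:
        groups.setdefault(r.prompt_id, []).append((r, recompute_reward(r)))
    if len(groups) < min_groups:
        return None
    return groups


class Trainer:
    def __init__(self):
        self.trained_windows = 0
        self.published = []  # window ids at which a checkpoint was pushed

    def on_window(self, window_id, rollouts, suspicious=False):
        batch = assemble_batch(rollouts)
        if batch is None or suspicious:
            # Unsealed or quarantined: no gradient step, no cadence tick.
            return "skipped"
        # The GRPO step would run here on `batch`.
        self.trained_windows += 1
        if self.trained_windows % PUBLISH_EVERY == 0:
            self.published.append(window_id)
            return "trained+published"
        return "trained"
```

Counting only trained windows toward the cadence means a quarantined window delays the next publish rather than consuming a slot. That is one reading of "every ten trained windows".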

▸ live cycles · /forge/cycles

outer-loop telemetry · team-only

The GRPO step emits per-step telemetry (loss_mean_window, grad_norm, advantage_variance, kl_vs_reference) to Weights & Biases. It renders on the team trainer dashboard, which requires credentials and is not a public surface. We draw no chart here on purpose: inventing training curves would defeat the protocol.

The public surface is the loop's output — the verified rollouts, the verdicts, and the signed checkpoint manifest. That's what FIG.02 below lets anyone re-run.

▸ team · live trainer dashboard

FIG.02 · verify the loop yourself · public artifacts only

01 · pull
Pull any recent window's sealed archive via /api/r2/window/<id> — the exact prompts + per-rollout commitments (merkle roots, GRAIL sketches) the trainer sealed.
02 · re-derive
Each entry carries checkpoint_hash + grail_sketch. Re-derive the sketch from the announced policy and confirm the rollout came from that checkpoint.
03 · walk the chain
Walk the signed /checkpoint manifest to trace every published checkpoint back to the base model — see /docs/scoring for how survivors turn into weight.
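Step 03 might look like the sketch below, assuming the manifest is a list of entries that each name their own hash and their parent's hash. The manifest schema is not documented here, so the `parent_hash` field, the `walk_chain` helper, and the sample hashes are hypothetical. Signature verification is left out because the signing scheme is not specified.

```python
def walk_chain(manifest, head_hash, base_hash):
    """Follow parent links from head_hash back to base_hash.

    Returns the hashes visited, head first. Raises ValueError if a
    link is missing or the chain loops.
    """
    by_hash = {entry["checkpoint_hash"]: entry for entry in manifest}
    path, seen = [], set()
    current = head_hash
    while current != base_hash:
        if current in seen:
            raise ValueError(f"cycle at {current!r}")
        entry = by_hash.get(current)
        if entry is None:
            raise ValueError(f"no manifest entry for {current!r}")
        seen.add(current)
        path.append(current)
        current = entry["parent_hash"]  # hypothetical field name
    path.append(base_hash)
    return path


# Made-up manifest: a base model followed by two published checkpoints.
manifest = [
    {"checkpoint_hash": "base", "parent_hash": None},
    {"checkpoint_hash": "ckpt-1", "parent_hash": "base"},
    {"checkpoint_hash": "ckpt-2", "parent_hash": "ckpt-1"},
]
lineage = walk_chain(manifest, "ckpt-2", "base")
```

A walk that ends in a ValueError means some published checkpoint cannot be traced to the base model. That is the failure this step is meant to surface.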
