The core training loop

Reliquary is a decentralized GRPO training market on Bittensor Subnet 81. A model learns fastest from prompts at its learning frontier — hard enough that rollouts disagree, easy enough that the gradient still carries signal. Reliquary turns finding those prompts into a competitive market.
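A minimal sketch of why rollout disagreement matters, using the standard GRPO group-relative advantage (each reward minus the group mean, divided by the group standard deviation). The function name `group_advantages` and the `eps` stabilizer are illustrative assumptions, not Reliquary's actual code, and the exact normalization Reliquary uses is not stated here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of a single prompt.

    Hypothetical helper: standard GRPO-style normalization, assumed
    rather than taken from the Reliquary codebase.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout succeeds: no disagreement, so every advantage is zero
# and the prompt contributes no gradient signal.
too_easy = group_advantages([1.0, 1.0, 1.0, 1.0])

# Rollouts disagree: advantages are non-zero, so the prompt is trainable.
frontier = group_advantages([1.0, 0.0, 1.0, 0.0])
```

A prompt that is too hard (all rewards 0.0) behaves the same way as one that is too easy: the group has zero variance, so nothing is learned from it.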

Miners aren't paid per rollout. They're paid for verified rollouts the trainer actually needs. A miner predicts which prompts sit in the learning zone, generates rollouts, and attaches a cryptographic proof that the work is real. The validator verifies every proof, runs a GRPO step on what survives, and publishes the updated checkpoint to Hugging Face.

The actors

Miners select prompts at the learning frontier, run the model, and submit rollouts together with a GRAIL proof.
The validator recomputes rewards, verifies proofs, runs the GRPO step, and publishes checkpoints.
The chain records selection and scoring so outside observers can audit later.

One window, end to end

The validator announces the active checkpoint and the per-window randomness.
Miners bet their own compute on prompts they predict sit in the trainable band, generate M rollouts per group, and attach a GRAIL sketch binding each rollout to the announced weights.
The window seals once enough valid distinct-prompt groups land; final ordering follows drand/canonical rules, not validator-side latency.
The validator recomputes every reward, verifies every sketch, and assembles a GRPO batch from the survivors. Fabricated work earns zero.
A PPO-clipped GRPO step runs; updated weights publish to Hugging Face every ten trained windows, with a signed manifest recording the chain of custody from the base model.
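The final step above can be illustrated with the usual PPO clipped surrogate applied to group-relative advantages. This is a sketch under assumptions: the function name `clipped_grpo_loss`, the per-sample (rather than per-token) granularity, and the clip range of 0.2 are illustrative choices, not values confirmed for Reliquary.

```python
import math


def clipped_grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean negative PPO-clipped surrogate over a batch of samples.

    logp_new / logp_old: log-probabilities under the current policy and
    under the policy that generated the rollouts (hypothetical inputs).
    clip_eps: assumed clip range; the real value is not given in the text.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)
```

The clip keeps a single step from moving the policy far from the weights that produced the rollouts, which matters when the rollouts were generated against an announced checkpoint.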

Why selection is the game

Only a small slice of prompts sit in the learning zone at any checkpoint, and the band narrows as the policy matures. A miner who picks well lands on winning prompts and earns emission; a miner who picks poorly burns its own compute on rejects like out_of_zone. This converts DAPO's reactive generate-then-discard filter into an ex-ante prediction market — and makes selection intelligence more valuable over time, not less.
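One way to picture the zone check is a band on a group's pass rate. The sketch below is hypothetical: the 0.2 and 0.8 bounds, the rule that any positive reward counts as a pass, and the `accepted` label are assumptions made for illustration. Only the `out_of_zone` reject reason comes from the text above.

```python
def classify_group(rewards, low=0.2, high=0.8):
    """Label one prompt's rollout group by whether it sits in the band.

    low / high are assumed thresholds, not Reliquary's real parameters.
    """
    passes = sum(1 for r in rewards if r > 0)
    pass_rate = passes / len(rewards)
    if low <= pass_rate <= high:
        return "accepted"  # hypothetical label for an in-zone group
    return "out_of_zone"


labels = [
    classify_group([1, 1, 1, 1]),  # too easy for the current checkpoint
    classify_group([1, 0, 1, 0]),  # rollouts disagree
    classify_group([0, 0, 0, 0]),  # too hard for the current checkpoint
]
```

Under this picture, a miner pays the generation cost for all three groups but only the middle one could earn emission, which is why predicting the band before generating is the valuable skill.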

Why the work can't be faked

Every rollout carries a GRAIL sketch — a fingerprint over hidden-state activations at sampled positions, bound to per-window randomness the miner can't predict in advance. The validator recomputes it and rejects anything outside tolerance. The full attack-class audit lives at /docs/scoring.
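The idea can be illustrated with a toy fingerprint. This is not the GRAIL construction: the hashing scheme, the random projection, the tolerance, and every function name below are assumptions chosen to show the shape of the check (positions derived from per-window randomness, values recomputed by the verifier and compared within a tolerance).

```python
import hashlib
import random


def _rng(window_randomness: bytes, label: bytes) -> random.Random:
    """Deterministic RNG derived from the window randomness (toy scheme)."""
    digest = hashlib.sha256(window_randomness + label).digest()
    return random.Random(int.from_bytes(digest, "big"))


def sketch(hidden_states, window_randomness: bytes, k: int = 4):
    """Toy fingerprint: project activations at k sampled positions.

    hidden_states is a list of per-token activation vectors. Positions and
    projection weights both depend on randomness the prover cannot know
    before the window opens.
    """
    positions = sorted(
        _rng(window_randomness, b"positions").sample(range(len(hidden_states)), k)
    )
    weights_rng = _rng(window_randomness, b"projection")
    weights = [weights_rng.uniform(-1.0, 1.0) for _ in hidden_states[0]]
    return [
        (p, sum(w * x for w, x in zip(weights, hidden_states[p])))
        for p in positions
    ]


def verify(claimed, hidden_states, window_randomness: bytes, tol=1e-3, k=4):
    """Recompute the fingerprint and accept only values within tolerance."""
    expected = sketch(hidden_states, window_randomness, k)
    if len(claimed) != len(expected):
        return False
    return all(
        pc == pe and abs(vc - ve) <= tol
        for (pc, vc), (pe, ve) in zip(claimed, expected)
    )
```

The tolerance exists because honest recomputation on different hardware will not reproduce floating-point activations bit for bit. A claimed value that drifts beyond it is treated as fabricated.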

What's different from today's subnets

Most Bittensor subnets pay miners for volume. Reliquary pays for the rollouts the trainer needs — and proves they're real before a single gradient step. The network's output is a continuously-trained model published to Hugging Face, not just a leaderboard.
