Train LLMs to bargain, form coalitions, and adapt under market shocks in a scarce-GPU economy. A live, dense-reward, multi-agent press.
Environment Stats
At a Glance
Interactive
Live Demo — Rule-Based Expert
Evaluation Results
Baseline Policy Performance · with Std-Dev Whiskers
Training Trajectory
Reward Progress · 180 Episodes
SFT Optimisation
SFT Training Loss · Real Run
Reward Improvement Evidence · SFT pass
Trained Llama vs Scripted Baselines · Mean Episode Reward
Same Llama-3.2-3B-Instruct backbone, evaluated on 5 seeds × 3 tasks against six scripted policies.
After supervised fine-tuning on the curated negotiation traces, the trained Llama flips two of three task means from negative to positive (market_round −0.051 → +0.189, coalition_market −0.281 → +0.416) and lifts the overall mean from −0.094 to +0.257. That places it third of the eight policies compared, behind only always-accept and the hand-authored rule expert.
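The aggregation behind figures like these can be sketched as follows. This is a minimal illustration, not the project's evaluation script. The helper names `summarize` and `delta` are hypothetical, and the equal weighting of tasks in the overall mean and the use of population standard deviation are assumptions. Only the task names and the before/after means are taken from the text above.

```python
from statistics import mean, pstdev


def summarize(rewards_by_task):
    """Return per-task (mean, std) plus the mean of the task means.

    rewards_by_task maps a task name to a list of episode rewards,
    one per seed. Assumption: every task counts equally in the
    overall mean, and spread is the population standard deviation.
    """
    per_task = {
        task: (mean(rewards), pstdev(rewards))
        for task, rewards in rewards_by_task.items()
    }
    overall = mean(m for m, _ in per_task.values())
    return per_task, overall


def delta(before, after):
    """Absolute change in mean episode reward, rounded to 3 decimals."""
    return round(after - before, 3)


# Means reported above: (base model, after SFT).
reported = {
    "market_round": (-0.051, 0.189),
    "coalition_market": (-0.281, 0.416),
    "overall": (-0.094, 0.257),
}
gains = {name: delta(before, after) for name, (before, after) in reported.items()}
```

With the reported means, `gains` gives the absolute improvement for each row, which makes the size of the sign flips easy to compare across tasks.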
Reward Improvement Evidence · GRPO pass
GRPO Reward Curve
Stage 2 of the pipeline: starting from the SFT checkpoint above, we run GRPO against the live environment reward, with no proxy and no learned reward model. Each step samples 4 completions per prompt and uses GpuBudgetNegotiationEnv.step(action).reward + format_bonus as the scalar return. Mean per-batch reward climbs from 0.031 at step 1 to 0.160 at step 300 (peak 0.233), so GRPO adds a further gain on top of the SFT results reported above.
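The scalar return described above, together with the group-relative normalisation that GRPO applies within each group of sampled completions, can be sketched as below. This is a hedged illustration, not the project's training code. `FORMAT_BONUS`, `format_bonus`, `scalar_return`, and `group_advantages` are hypothetical names. The bonus size and the rule that a completion must parse as a JSON object are assumptions. The environment reward is passed in as a plain float that would, in the real loop, come from GpuBudgetNegotiationEnv.step(action).reward.

```python
import json
from statistics import mean, pstdev

FORMAT_BONUS = 0.05  # hypothetical value; the source does not state the bonus size


def format_bonus(completion: str) -> float:
    """Grant the bonus only if the completion parses as a JSON object (assumed rule)."""
    try:
        parsed = json.loads(completion)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0.0
    return FORMAT_BONUS if isinstance(parsed, dict) else 0.0


def scalar_return(env_reward: float, completion: str) -> float:
    """Environment reward plus format bonus, as described in the text."""
    return env_reward + format_bonus(completion)


def group_advantages(returns, eps=1e-6):
    """Normalise returns within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Four completions for one prompt, with made-up environment rewards.
completions = ['{"action": "accept"}', '{"action": "counter"}', "not json", '{"action": "reject"}']
env_rewards = [0.10, 0.30, 0.20, -0.05]
returns = [scalar_return(r, c) for r, c in zip(env_rewards, completions)]
advantages = group_advantages(returns)
```

Because advantages are centred within each group of 4, a completion is reinforced only when it beats its siblings for the same prompt, which is why within-batch reward spread matters for the learning signal.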
Training Dynamics · One-Glance Overview
SFT → GRPO Training Dashboard
Both training stages on a single page. The top row tracks the live GRPO loop: pure environment reward, JSON format compliance, and within-batch reward spread. The bottom row tracks the SFT pass that warm-started GRPO: combined reward trajectory, training loss (log scale), and the resumed-from-checkpoint LR schedule (two cosine phases, joined at the step where training resumed after an earlier crash). All six panels are generated directly from artifacts/grpo_training_curve.json and artifacts/sft_training_curve.json, with no synthetic data.
Qualitative Evidence
Before vs After Training · Same Task & Seed
Judge Mode
Judged Negotiation · Round Pitches
Artifact
Demo Transcript — Coalition Market · Seed 5 · Rule-Based Expert
Reward Design
Dense Reward Signals
Environment
Action Space
SFT Dataset
One Training Sample · Chat Format
Artifacts & Downloads
Every File · Served from this Space
API
Endpoints