GPU Budget Negotiation Arena

Environment Stats

At a Glance

Task Types: 3

Action Types: 12

Reward Signals: 10

Baseline Policies: 6

Train Episodes: 180

Final Eval Reward: 0.45

API Status: …

Expert Reward (Hard): 0.81

Interactive

Live Demo — Rule-Based Expert

Evaluation Results

Baseline Policy Performance · with Std-Dev Whiskers

Training Trajectory

Reward Progress · 180 Episodes

SFT Optimisation

SFT Training Loss · Real Run

Reward Improvement Evidence · SFT pass

Trained Llama vs Scripted Baselines · Mean Episode Reward

Both the base and the trained model use the same Llama-3.2-3B-Instruct backbone and are evaluated on 5 seeds × 3 tasks against six scripted policies. After supervised fine-tuning on the curated negotiation traces, the trained Llama flips two of three task means from negative to positive (market_round −0.051 → +0.189, coalition_market −0.281 → +0.416) and lifts the overall mean from −0.094 to +0.257. That places it third out of eight policies, ahead of every rule-based scripted bot except always-accept and the hand-authored rule expert.

Reward Improvement Evidence · GRPO pass

GRPO Reward Curve

Stage 2 of the pipeline starts from the SFT checkpoint above and runs GRPO directly against the live environment reward, with no proxy and no learned reward model. Each step samples 4 completions per prompt and uses GpuBudgetNegotiationEnv.step(action).reward + format_bonus as the scalar return. Mean per-batch reward climbs from 0.031 at step 1 to 0.160 at step 300 (peak 0.233), giving an end-to-end SFT-then-GRPO improvement on top of the table above.
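The scalar return described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's training code: `StubEnv`, the `action_type` key, the `format_bonus` helper and its 0.1 value are hypothetical. Only the `env.step(action).reward + format_bonus` form and the group size of 4 come from the text. The normalisation shown is the usual GRPO group-relative advantage (return minus group mean, divided by group standard deviation).

```python
import json
import statistics
from dataclasses import dataclass

GROUP_SIZE = 4  # completions sampled per prompt, as stated in the text


@dataclass
class StepResult:
    reward: float


class StubEnv:
    """Hypothetical stand-in for GpuBudgetNegotiationEnv (not the real class)."""

    def step(self, action: dict) -> StepResult:
        # Toy rule for the demo only: accepting pays, anything else does not.
        accepted = action.get("action_type") == "accept_offer"
        return StepResult(reward=0.2 if accepted else 0.0)


def format_bonus(completion: str, bonus: float = 0.1) -> float:
    """Assumed bonus: paid when the completion parses as a JSON object."""
    try:
        return bonus if isinstance(json.loads(completion), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0


def scalar_return(env, completion: str) -> float:
    """Environment reward plus format bonus; unusable text earns neither."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict):
        return 0.0
    return env.step(action).reward + format_bonus(completion)


def group_advantages(returns, eps=1e-6):
    """Centre returns on the group mean and scale by the group std."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]


# One prompt, one group of sampled completions (made-up strings).
completions = [
    '{"action_type": "accept_offer"}',
    '{"action_type": "wait"}',
    "not json",
    '{"action_type": "wait"}',
]
assert len(completions) == GROUP_SIZE
returns = [scalar_return(StubEnv(), c) for c in completions]
advantages = group_advantages(returns)
print(list(zip(returns, advantages)))
```

Because advantages are computed within each group of 4, a completion is rewarded for beating its siblings on the same prompt rather than for its absolute return.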

Training Dynamics · One-Glance Overview

SFT → GRPO Training Dashboard

Both training stages appear on a single page. The top row tracks the live GRPO loop: pure environment reward, JSON format compliance, and within-batch reward spread. The bottom row tracks the SFT pass that warm-started GRPO: combined reward trajectory, training loss (log scale), and the resumed-from-checkpoint LR schedule (two cosine phases, joined at the point where training resumed after an earlier crash). All six panels are generated directly from artifacts/grpo_training_curve.json and artifacts/sft_training_curve.json, with no synthetic data.

Vector source · plots/training_dashboard.svg

Qualitative Evidence

Before vs After Training · Same Task & Seed

Judge Mode

Judged Negotiation · Round Pitches

Artifact

Demo Transcript — Coalition Market · Seed 5 · Rule-Based Expert

Reward Design

Dense Reward Signals

job_utility_score

deal_quality_score

coalition_reliability_score

budget_efficiency_score

negotiation_efficiency_score

market_adaptation_score

invalid_action_penalty

breach_penalty

Environment

Action Space

send_offer

Propose a GPU block trade to another lab.

accept_offer

Accept a pending offer, executing the transfer.

reject_offer

Decline an incoming offer with no penalty.

counter_offer

Respond with a modified price or block set.

reserve_capacity

Lock blocks for a future job deadline.

release_capacity

Free reserved blocks back to the market.

form_coalition

Invite a lab into a shared-capacity coalition.

commit_to_coalition

Bind yourself to coalition terms; breaking them incurs a penalty.

allocate_to_job

Assign GPU blocks to one of your pending jobs.

send_message

Free-text communication for belief modeling.

wait

Pass the turn; useful when watching market shocks.

finish

Signal episode end; triggers final settlement.
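The twelve action types above lend themselves to a simple validity check before a model's output is passed to the environment. The sketch below is illustrative only: the `action_type` key and the `parse_action` helper are hypothetical names, and the environment's real action schema may differ.

```python
import json

# The twelve action names listed above.
ACTION_TYPES = frozenset({
    "send_offer", "accept_offer", "reject_offer", "counter_offer",
    "reserve_capacity", "release_capacity", "form_coalition",
    "commit_to_coalition", "allocate_to_job", "send_message",
    "wait", "finish",
})


def parse_action(text: str) -> dict:
    """Parse model output into an action dict, raising ValueError if unusable."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    # "action_type" is an assumed field name, not confirmed by the source.
    kind = action.get("action_type")
    if not isinstance(kind, str) or kind not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {kind!r}")
    return action


if __name__ == "__main__":
    print(parse_action('{"action_type": "wait"}'))
```

How the environment itself treats a malformed action (for example, whether it applies invalid_action_penalty) is not covered by this sketch.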

SFT Dataset

One Training Sample · Chat Format

Artifacts & Downloads

Every File · Served from this Space

API

Endpoints

GET /health

Liveness check — returns benchmark_id and status.

GET /tasks

Lists all task types with difficulty and feature flags.

POST /reset

Start a new episode with a given task_type and seed.

POST /step

Submit an action; returns the next observation and reward.

GET /state

Returns the public market state. Pass include_private=true to include private state (debug only).
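A client for the endpoints above might look roughly like the following, using only the Python standard library. Several details are assumptions rather than documented behaviour: `BASE_URL` is a placeholder, sending `task_type` and `seed` as a JSON body to `/reset` is a guess at the request shape, the `/step` body is assumed to be the action object itself, and responses are assumed to be JSON. The helpers build requests separately from sending them so the request shapes can be inspected offline.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical; substitute the deployed host


def _request(method, path, body=None, query=None):
    """Build (but do not send) a urllib Request for one endpoint."""
    url = BASE_URL + path
    if query:
        url += "?" + urllib.parse.urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def health_request():
    return _request("GET", "/health")


def tasks_request():
    return _request("GET", "/tasks")


def reset_request(task_type, seed):
    # Assumed: task_type and seed travel in a JSON body.
    return _request("POST", "/reset", body={"task_type": task_type, "seed": seed})


def step_request(action):
    # Assumed: the action dict is posted as the JSON body.
    return _request("POST", "/step", body=action)


def state_request(include_private=False):
    # include_private=true is described in the text as debug only.
    query = {"include_private": "true"} if include_private else None
    return _request("GET", "/state", query=query)


def send(request, timeout=10.0):
    """Send a request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical episode would call `send(reset_request("coalition_market", 5))` once, then `send(step_request(action))` in a loop; the structure of the returned observation is not specified here.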