● OpenEnv · Multi-Agent Benchmark · MMXXVI

GPU Budget Negotiation Arena.

Train LLMs to bargain, form coalitions, and adapt under market shocks in a scarce-GPU economy. A live, dense-reward, multi-agent press.

Running · Theme #1 · Multi-Agent · OpenEnv Compatible · FastAPI · Docker

At a Glance

Task Types: 3
Action Types: 12
Reward Signals: 10
Baseline Policies: 6
Train Episodes: 180
Final Eval Reward: 0.45
Expert Reward (Hard): 0.81
API Status

Live Demo — Rule-Based Expert

Baseline Policy Performance · with Std-Dev Whiskers

Reward Progress · 180 Episodes

SFT Training Loss · Real Run

Trained Llama vs Scripted Baselines · Mean Episode Reward

The same Llama-3.2-3B-Instruct backbone is evaluated on 5 seeds × 3 tasks against six scripted policies. After supervised fine-tuning on the curated negotiation traces, the trained Llama flips two of three task means from negative to positive (market_round −0.051 → +0.189, coalition_market −0.281 → +0.416) and lifts the overall mean from −0.094 to +0.257. That places it third out of eight policies, ahead of every rule-based scripted bot except always-accept and the hand-authored rule expert.
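
The third task's means are not quoted, but they can be backed out from the overall figures. The sketch below assumes the overall mean is the unweighted average of the three per-task means; the page does not state the aggregation rule, so the derived values are estimates, and the helper name is hypothetical.

```python
# Assumption: overall mean = unweighted average of the three per-task means
# (each task is evaluated on the same 5 seeds). The aggregation rule is not
# stated on the page, so treat the derived numbers as estimates.

def implied_third_mean(overall: float, known: list[float], n_tasks: int = 3) -> float:
    """Solve overall = sum(task_means) / n_tasks for the one missing task mean."""
    return overall * n_tasks - sum(known)

# Quoted figures: market_round and coalition_market means, plus the overall mean.
before = implied_third_mean(-0.094, [-0.051, -0.281])
after = implied_third_mean(0.257, [0.189, 0.416])

print(f"implied third-task mean before SFT: {before:+.3f}")
print(f"implied third-task mean after SFT:  {after:+.3f}")
```

Under that assumption the third task was already slightly positive before fine-tuning (about +0.05) and rises to about +0.17, which is consistent with only two of the three task means changing sign. The quoted figures are rounded, so the last digit of each derived value is uncertain.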

trained_llm_vs_baselines.svg

GRPO Reward Curve

Stage 2 of the pipeline: starting from the SFT checkpoint above, we run GRPO against the live environment reward, with no proxy and no learned reward model. Each step samples 4 completions per prompt and uses GpuBudgetNegotiationEnv.step(action).reward + format_bonus as the scalar return. Mean per-batch reward climbs from 0.031 at step 1 to 0.160 at step 300 (peak 0.233), an end-to-end SFT-then-GRPO improvement on top of the SFT results reported above.
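
A minimal sketch of how the scalar return and GRPO's group-relative advantages might fit together for one prompt. This is not the project's training code. The bonus size, the rule that the bonus is granted for valid JSON, and both function names are assumptions for illustration; the reward numbers are made up.

```python
from statistics import mean, pstdev

FORMAT_BONUS = 0.05  # hypothetical value; the page does not state the bonus size

def scalar_return(env_reward: float, is_valid_json: bool) -> float:
    """Environment reward plus a bonus when the completion is well-formed.

    Assumption: the bonus is tied to JSON validity. The page only names
    `format_bonus` without defining it.
    """
    return env_reward + (FORMAT_BONUS if is_valid_json else 0.0)

def group_advantages(returns: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardise each return within its prompt group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

# Four completions sampled for one prompt (illustrative numbers, not real logs).
env_rewards = [0.10, 0.02, 0.18, 0.02]
valid_json = [True, False, True, True]

returns = [scalar_return(r, ok) for r, ok in zip(env_rewards, valid_json)]
advantages = group_advantages(returns)

for ret, adv in zip(returns, advantages):
    print(f"return={ret:.3f}  advantage={adv:+.3f}")
```

Because advantages are centred within each group of four, a completion is rewarded only for beating its siblings on the same prompt. Implementations differ on whether they use the population or sample standard deviation; the population form is used here.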

SFT → GRPO Training Dashboard

Both training stages on a single page. The top row tracks the live GRPO loop: pure environment reward, JSON format compliance, and within-batch reward spread. The bottom row tracks the SFT pass that warm-started GRPO: combined reward trajectory, training loss (log scale), and the resumed-from-checkpoint LR schedule (two cosine phases, joined where the run recovered from an earlier crash). All six panels are generated directly from artifacts/grpo_training_curve.json and artifacts/sft_training_curve.json, with no synthetic data.
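
To make the "two cosine phases" shape concrete, here is a sketch of one way a resumed schedule can produce it: the scheduler restarts a fresh cosine decay at the step where training resumed. The step counts, the peak learning rate, and the restart-from-peak behaviour are all assumptions; the page does not describe how the resume was configured.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Plain cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def resumed_lr(step: int, resume_step: int, phase1_total: int,
               phase2_total: int, peak_lr: float) -> float:
    """Two cosine phases joined at resume_step (hypothetical resume behaviour)."""
    if step < resume_step:
        return cosine_lr(step, phase1_total, peak_lr)
    # After the resume, a new cosine decay starts from the peak again.
    return cosine_lr(step - resume_step, phase2_total, peak_lr)

# Illustrative values only: a crash at step 60 of a planned 100-step decay,
# followed by a fresh 80-step decay.
schedule = [resumed_lr(s, resume_step=60, phase1_total=100,
                       phase2_total=80, peak_lr=2e-4) for s in range(140)]
```

Plotted, `schedule` shows a partial decay, a jump back to the peak at the resume step, and then a full second decay.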

Vector source · plots/training_dashboard.svg

Before vs After Training · Same Task & Seed

Judged Negotiation · Round Pitches

Demo Transcript — Coalition Market · Seed 5 · Rule-Based Expert

Dense Reward Signals

job_utility_score
deal_quality_score
coalition_reliability_score
budget_efficiency_score
negotiation_efficiency_score
market_adaptation_score
invalid_action_penalty
breach_penalty
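
The page names the signals but not how they are combined into the per-step reward. The sketch below holds the eight listed signals in one record and uses an unweighted "scores minus penalties" sum as a placeholder. The class, the sign convention for penalties, and the aggregation rule are all assumptions, not the environment's actual formula.

```python
from dataclasses import dataclass, fields

@dataclass
class RewardBreakdown:
    """One step's reward components, named after the signals listed above."""
    job_utility_score: float = 0.0
    deal_quality_score: float = 0.0
    coalition_reliability_score: float = 0.0
    budget_efficiency_score: float = 0.0
    negotiation_efficiency_score: float = 0.0
    market_adaptation_score: float = 0.0
    # Assumption: penalties are stored as non-negative magnitudes.
    invalid_action_penalty: float = 0.0
    breach_penalty: float = 0.0

    def total(self) -> float:
        """Hypothetical aggregation: unweighted sum of scores minus penalties."""
        scores = sum(getattr(self, f.name) for f in fields(self)
                     if f.name.endswith("_score"))
        penalties = sum(getattr(self, f.name) for f in fields(self)
                        if f.name.endswith("_penalty"))
        return scores - penalties

# Illustrative values, not taken from a real episode.
step_reward = RewardBreakdown(job_utility_score=0.2,
                              deal_quality_score=0.1,
                              breach_penalty=0.05)
print(f"{step_reward.total():.3f}")
```

Keeping the components separate until the final sum makes it easy to log each signal on its own, which is what a dense-reward dashboard needs.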

Action Space

send_offer
Propose a GPU block trade to another lab.
accept_offer
Accept a pending offer, executing the transfer.
reject_offer
Decline an incoming offer with no penalty.
counter_offer
Respond with a modified price or block set.
reserve_capacity
Lock blocks for a future job deadline.
release_capacity
Free reserved blocks back to the market.
form_coalition
Invite a lab into a shared-capacity coalition.
commit_to_coalition
Bind yourself to coalition terms; breaking them incurs a penalty.
allocate_to_job
Assign GPU blocks to one of your pending jobs.
send_message
Free-text communication for belief modeling.
wait
Pass the turn; useful when watching market shocks.
finish
Signal episode end; triggers final settlement.
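
One way a client could represent the twelve action types listed above is an enumeration plus a small parser that rejects anything else before it reaches the environment. The class and function names here are hypothetical, and the per-action payload fields are not modelled because the page does not list them.

```python
from enum import Enum

class ActionType(str, Enum):
    """The twelve action names from the Action Space list."""
    SEND_OFFER = "send_offer"
    ACCEPT_OFFER = "accept_offer"
    REJECT_OFFER = "reject_offer"
    COUNTER_OFFER = "counter_offer"
    RESERVE_CAPACITY = "reserve_capacity"
    RELEASE_CAPACITY = "release_capacity"
    FORM_COALITION = "form_coalition"
    COMMIT_TO_COALITION = "commit_to_coalition"
    ALLOCATE_TO_JOB = "allocate_to_job"
    SEND_MESSAGE = "send_message"
    WAIT = "wait"
    FINISH = "finish"

def parse_action_type(raw: str) -> ActionType:
    """Normalise a model-emitted string and map it to a known action type."""
    try:
        return ActionType(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unknown action type: {raw!r}") from None

print(parse_action_type(" Counter_Offer ").value)
```

Validating on the client side is a design choice for this sketch; it lets a training loop catch malformed model output locally rather than relying only on invalid_action_penalty.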

One Training Sample · Chat Format

Every File · Served from this Space

Endpoints

GET /health
Liveness check; returns benchmark_id and status.
GET /tasks
Lists all task types with difficulty and feature flags.
POST /reset
Start a new episode with a given task_type and seed.
POST /step
Submit an action; returns the next observation and reward.
GET /state
Public market state. Pass include_private=true (debug only).