Paper
Eval · Bilateral price negotiation

TERMS-Bench

A diagnostic benchmark for LLM negotiation agents.

Each agent plays multi-round alternating-offer episodes against a fixed, history-reactive counterpart across three regimes and six behavioral families. Metrics are computed programmatically from logged actions and reported along four orthogonal diagnostic axes — no human or LLM judge, no composite score.


How it works

TERMS-Bench is a simulator-based benchmark: the counterpart is a fixed stochastic policy — not a second language model — so every episode is reproducible from a seed and comparisons across agents are clean.

  1 · Game

    A single bilateral price negotiation for one item. Each side holds a private type t = (r, κ, η) — reservation price, urgency, stance — drawn from a regime-conditioned prior. Surplus is realized only if the agreed price lies inside the ZOPA [r_seller, r_buyer].

  2 · Protocol

    Alternating offers over up to K = 10 rounds. Each turn the agent receives the counterpart's price plus a sentiment and a stance cue, then returns Offer(price, message), Accept, or Reject. Price bounds, monotonic concession, and the turn budget are enforced by the environment.

  3 · Counterpart

    A stochastic policy parameterized by (α, β, γ, ρ, ξ, λ…). Its acceptance and concession rules are history-reactive: they penalize fast concessions (to exploit eager agents) and reward rigidity (to break deadlocks). Cues are emitted from stance- and proximity-conditioned distributions.

  4 · Metrics

    Four orthogonal diagnostic axes: Surplus (SE⁺, CSE⁺), Agreement calibration (AGR⁺, FAGR⁻), Opponent modelling (BEtype), and Procedural robustness (CritViol%). We report them separately — no composite score.

  5 · Difficulty grader

    Each episode is scored by a structural grader that combines ZOPA tightness, urgency pressure, stance compatibility, and deadline proximity into a single env_score ∈ [0, 1]. Episodes are bucketed into five equal-mass bins so the headline chart traces degradation as negotiations become structurally harder.

  6 · Commerce mode

    Each scenario can also carry unit economics — a resale value, a fulfillment cost, a margin floor — so the negotiated price maps directly to dollars of profit. Two perspectives are evaluated: Merchant (agent buys from a supplier) and Vendor (agent sells to a customer). The diagnostic axes above are unchanged; the dollar view is shown alongside in the next section.

  7 · Bankroll mode

    A stateful sibling to commerce. Instead of evaluating each negotiation independently, the agent runs 4 merchant-side sessions against a fixed pool of suppliers, each session starting with a $100 bankroll and running for 50 negotiation periods. Cash carries forward within a session; the chain terminates in hard ruin if the balance falls below the $0 bankruptcy threshold. The headline number is terminal balance after the 50-period chain.
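
The game, protocol, and counterpart described in steps 1–3 can be sketched as a minimal episode loop. This is a toy, not the benchmark's actual policy: the 1.5× opening ask, the 10% concession step, and the fast-concession penalty factor are all illustrative assumptions standing in for the real (α, β, γ, ρ, ξ, λ…) parameterization.

```python
def run_episode(agent_offer, r_seller, r_buyer, K=10):
    """One buyer-side alternating-offer episode against a toy seller
    counterpart. All numeric rules here are illustrative assumptions."""
    ask = 1.5 * r_seller                      # hypothetical opening ask
    last_bid, price = None, None
    for k in range(K):
        bid = agent_offer(k, ask)             # agent proposes a price
        if bid >= ask:                        # counterpart accepts the bid
            price = bid
            break
        step = 0.10 * (ask - r_seller)        # baseline concession
        if last_bid is not None and bid - last_bid > 0.2 * ask:
            step *= 0.5                       # history-reactive: penalize fast concessions
        ask = max(r_seller, ask - step)
        last_bid = bid
    if price is None:                         # turn budget exhausted: no deal
        return None
    # Surplus is realized only if the agreed price lies inside the ZOPA.
    return price if r_seller <= price <= r_buyer else None
```

A linearly conceding buyer eventually meets the ask and closes inside the ZOPA, while a rigid low-baller exhausts the turn budget and walks away with no deal.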

Surplus efficiency by counterpart family

Feasible surplus efficiency (SE⁺) across the six counterpart behavioral families, shown as a single profile overview: one polygon per agent, with the distance along each spoke encoding that agent's SE⁺ on that family. Large, evenly inflated shapes mark robust generalists; sharply asymmetric shapes reveal which families an agent's performance depends on (clean cues vs noise vs adversarial pressure). Hover a polygon for that agent's full profile read-out, or hover near a spoke for the family's leaderboard ranking; click anything to pin the focus.


Surplus efficiency by environment difficulty

Episodes are graded into five equal-mass difficulty bins by the paper-aligned structural grader, which combines ZOPA tightness, private pressure, and stance compatibility. Each row is one agent; cell brightness encodes SE⁺, so reading left→right shows each row fading as negotiations get structurally harder. The rightmost column summarizes the easy→hard drop in percent.
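
Equal-mass binning amounts to cutting the env_score distribution at its quantiles. A minimal sketch, assuming the grader has already produced one env_score per episode (`equal_mass_bins` is illustrative, not the benchmark's code):

```python
import statistics

def equal_mass_bins(env_scores, n_bins=5):
    """Assign each env_score to one of n_bins equal-mass difficulty bins
    (0 = easiest) using quantile cut points."""
    cuts = statistics.quantiles(env_scores, n=n_bins)  # n_bins - 1 cut points
    # An episode's bin is the number of cut points its score exceeds.
    return [sum(s > c for c in cuts) for s in env_scores]
```

With 100 uniformly spread scores this yields exactly 20 episodes per bin, which is what makes the easy→hard degradation chart comparable across agents.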


Current leaderboard

Per-agent paper-aligned metrics. Toggle regime to see regime-specific performance. FAGR⁻ and critical-violation rates are only reported on the latest schema; cells are empty when an agent pre-dates that instrumentation.

# Agent Provider SE⁺ AGR⁺ CSE⁺ FAGR⁻ BEtype CritViol% Ū

Data-grounded robustness

The leaderboard above is the canonical TERMS-Bench-v1 ranking. As a robustness check, we re-evaluate eleven of the agents on a data-grounded variant in which the ZOPA distribution and observable product context come from a real catalog (AmazonHistoryPrice) rather than the synthetic price geometry. The data-grounded ZOPA distribution shifts the absolute numbers, but rank order is largely preserved (Spearman ρ = 0.90, p < 10⁻³). The cross-suite shift is structured: stronger models tend to gain under data grounding, weaker models tend to lose, and the two structural penalties from the paper's Findings 2 and 3 (cue-use, latent-type inference) replicate.
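
The rank-agreement statistic is standard Spearman ρ. A tie-free sketch of the computation (simplified; a production check would handle tied ranks):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Identical orderings give ρ = 1 and fully reversed orderings give ρ = −1, so ρ = 0.90 across the eleven agents indicates the synthetic and data-grounded suites sort them almost identically.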

Synthetic → data-grounded SE⁺ shift

Each line connects one agent's synthetic SE⁺ (left) to its data-grounded SE⁺ (right). Lines are colored green when the agent gains, pink when it loses, and gray otherwise. Hover a line to spotlight one agent.

Data-grounded SE⁺ ranking

Per-model data-grounded SE⁺, sorted descending. Bars are colored by tier (frontier / open-weight / sub-frontier); the provider mark sits above each bar. Hover a bar to spotlight one agent.

Diagnostics

Four per-agent scatters across the four orthogonal axes. Each dot is one agent evaluated on the Overall slice of the selected run. Optimal corners are annotated on each panel.

Agreement vs. Surplus
AGR⁺ (x) × SE⁺ (y) — top-right is ideal.
Opponent modelling vs. Surplus
BEtype (x, lower is better) × SE⁺ (y).
Safety vs. Surplus
CritViol% (x, lower is better) × SE⁺ (y).
No-deal discipline vs. Surplus
FAGR⁻ (x, lower is better) × SE⁺ (y).

What the agent sees

Each agent is called via a single, deterministic JSON-in / JSON-out contract. The system prompt below is identical across models (buyer variant shown). Reasoning effort is set to the maximum the provider supports.
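
A hedged sketch of what one turn of that JSON-in / JSON-out exchange could look like. Every field name below is a hypothetical placeholder — the authoritative schema lives in the system prompt — but the three legal moves (Offer, Accept, Reject) come from the protocol above.

```python
import json

# Hypothetical observation shape for one turn (placeholder keys).
EXAMPLE_OBSERVATION = {
    "round": 3,                  # current round out of K = 10
    "counterpart_price": 72.0,   # counterpart's latest offer
    "sentiment": "neutral",      # emitted sentiment cue
    "stance_cue": "firm",        # emitted stance cue
}

def parse_action(raw):
    """Validate a JSON action string against the three legal moves:
    Offer(price, message), Accept, Reject."""
    action = json.loads(raw)
    kind = action.get("action")
    if kind == "offer":
        if not isinstance(action.get("price"), (int, float)):
            raise ValueError("offer requires a numeric price")
    elif kind not in ("accept", "reject"):
        raise ValueError(f"invalid action: {kind!r}")
    return action
```

Because the contract is deterministic and validated, a malformed reply can be logged as an invalid-action event rather than silently coerced, which is what feeds the CritViol% axis.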


Selected traces

Three hand-picked episodes — one illustrating surplus capture, one illustrating no-deal discipline, and one illustrating a diagnostic failure. Each trace is the actual logged interaction, unmodified.

Metrics

SE⁺ — Feasible surplus efficiency
Fraction of available bargaining surplus captured on episodes where a mutually rational deal exists. Higher is better.
AGR⁺ — Feasible agreement rate
Share of feasible episodes that terminate in an agreement. Higher is better; low values flag over-rigid agents.
CSE⁺ — Conditional feasible deal quality
SE⁺ conditioned on an agreement being reached. Isolates agents who agree often but settle for bad deals.
FAGR⁻ — No-deal false agreement rate
Share of infeasible episodes where the agent still agreed. Lower is better; flags over-eager / over-concessive agents.
BEtype — Belief error
Aggregate error on the agent's stated belief about counterpart reservation value, urgency, and stance. Lower is better.
CritViol% — Critical violations
Share of episodes with a price-bound, individual-rationality, or invalid-action violation. Lower is better.
Ū — Mean utility
Raw mean per-episode utility. Scale-dependent; kept as a sanity column alongside the normalized metrics.
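
Under one plausible reading of the glossary above (feasible episodes that end with no deal contribute zero captured surplus to SE⁺), the four surplus and agreement metrics can be computed directly from logged episodes. The episode dict schema here is hypothetical:

```python
def surplus_metrics(episodes):
    """Compute SE+, AGR+, CSE+ and FAGR- from logged episodes.

    Each episode is a dict (illustrative schema) with:
      feasible - bool, whether a ZOPA exists
      agreed   - bool, whether a deal was struck
      surplus  - fraction of available surplus captured (deals only)
    """
    feas = [e for e in episodes if e["feasible"]]
    infeas = [e for e in episodes if not e["feasible"]]
    deals = [e for e in feas if e["agreed"]]
    se = sum(e["surplus"] for e in deals) / len(feas) if feas else 0.0
    agr = len(deals) / len(feas) if feas else 0.0
    cse = sum(e["surplus"] for e in deals) / len(deals) if deals else 0.0
    fagr = sum(e["agreed"] for e in infeas) / len(infeas) if infeas else 0.0
    return {"SE+": se, "AGR+": agr, "CSE+": cse, "FAGR-": fagr}
```

Note how CSE⁺ divides by agreed feasible episodes while SE⁺ divides by all feasible episodes: an agent that agrees rarely but well will show high CSE⁺ with depressed SE⁺, which is exactly the separation the two columns are meant to expose.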