TERMS-Bench
A diagnostic benchmark for LLM negotiation agents.
Each agent plays multi-round alternating-offer episodes against a fixed, history-reactive counterpart across three regimes and six behavioral families. Metrics are computed programmatically from logged actions and reported along four orthogonal diagnostic axes — no human or LLM judge, no composite score.
How it works
TERMS-Bench is a simulator-based benchmark: the counterpart is a fixed stochastic policy — not a second language model — so every episode is reproducible from a seed and comparisons across agents are clean.
1 · Game
A single bilateral price negotiation for one item. Each side holds a private type t = (r, κ, η), that is, reservation price, urgency, and stance, drawn from a regime-conditioned prior. Surplus is realized only if the agreed price lies inside the ZOPA [r_seller, r_buyer].
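As a minimal sketch of the game description above, the private type and the ZOPA check could be written as follows. The class and function names, and the string encoding of stance, are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PrivateType:
    """Private type t = (r, kappa, eta). Field encodings are assumed."""
    reservation: float  # r: seller's floor or buyer's ceiling
    urgency: float      # kappa
    stance: str         # eta (hypothetical string label)


def zopa(seller: PrivateType, buyer: PrivateType) -> Optional[Tuple[float, float]]:
    """Return [r_seller, r_buyer] if the interval is non-empty, else None."""
    if seller.reservation <= buyer.reservation:
        return (seller.reservation, buyer.reservation)
    return None


def realized_surplus(price: float, seller: PrivateType, buyer: PrivateType) -> float:
    """Total surplus r_buyer - r_seller, realized only for a price inside the ZOPA."""
    zone = zopa(seller, buyer)
    if zone is None or not (zone[0] <= price <= zone[1]):
        return 0.0
    return zone[1] - zone[0]
```

An episode with no ZOPA is one where the correct outcome is no deal, which is what the agreement-calibration metrics later measure.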
2 · Protocol
Alternating offers over up to K = 10 rounds. Each turn the agent receives the counterpart's price plus a sentiment and a stance cue, then returns Offer(price, message), Accept, or Reject. Price bounds, monotonic concession, and the turn budget are enforced by the environment.
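The environment-side checks described above might look like the sketch below. It assumes that "monotonic concession" means a buyer never lowers its own offer and a seller never raises it, that rounds are 0-indexed, and that violations are reported as string labels; all three are illustrative assumptions, not the benchmark's documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

K = 10  # maximum number of rounds, as stated in the protocol


@dataclass(frozen=True)
class Offer:
    price: float
    message: str = ""


def check_offer(offer: Offer, role: str, prev_own_price: Optional[float],
                round_idx: int, lo: float, hi: float) -> Optional[str]:
    """Return a hypothetical violation label, or None if the offer is legal."""
    if round_idx >= K:
        return "turn_budget"
    if not (lo <= offer.price <= hi):
        return "price_bound"
    if prev_own_price is not None:
        # Assumed rule: each side may only move toward the other.
        if role == "buyer" and offer.price < prev_own_price:
            return "non_monotonic"
        if role == "seller" and offer.price > prev_own_price:
            return "non_monotonic"
    return None
```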
3 · Counterpart
A stochastic policy parameterised by (α, β, γ, ρ, ξ, λ…). Its acceptance and concession rules are history-reactive: they penalise fast concessions (to exploit eager agents) and reward rigidity (to break deadlocks). Cues are emitted from stance- and proximity-conditioned distributions.
4 · Metrics
Four orthogonal diagnostic axes: Surplus (SE⁺, CSE⁺), Agreement calibration (AGR⁺, FAGR⁻), Opponent modelling (BEtype), and Procedural robustness (CritViol%). We report them separately, with no composite score.
5 · Difficulty grader
Each episode is scored by a structural grader that combines ZOPA tightness, urgency pressure, stance compatibility, and deadline proximity into a single env_score ∈ [0, 1]. Episodes are bucketed into five equal-mass bins so the headline chart traces degradation as negotiations become structurally harder.
6 · Commerce mode
Each scenario can also carry unit economics — a resale value, a fulfillment cost, a margin floor — so the negotiated price maps directly to dollars of profit. Two perspectives are evaluated: Merchant (agent buys from a supplier) and Vendor (agent sells to a customer). The diagnostic axes above are unchanged; the dollar view is shown alongside in the next section.
7 · Bankroll mode
A stateful sibling to commerce. Instead of evaluating each negotiation independently, the agent runs 4 merchant-side sessions against a fixed pool of suppliers, each session starting with a $100 bankroll and running for 50 negotiation periods. Cash balance carries forward within a session; hard ruin ends a session early if cash falls below the $0 bankruptcy threshold. The headline is terminal balance after the 50-period chain.
TERMS-Commerce profit leaderboard
The same negotiation kernel as above, scored in dollars. Profit is a
deterministic post-hoc transformation of the agreed price using
per-episode unit economics.
Merchant: profit = units · (resale − price − fulfillment) − overhead.
Vendor: profit = units · (price − cogs − sales_overhead) − overhead.
A no-deal returns the outside option (default $0); a deal struck
past the agent's reservation surfaces as real money lost (negative
profit, never clipped).
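The two profit formulas above translate directly into code. In this sketch the function names and the convention of passing `None` for a no-deal are assumptions; the arithmetic follows the formulas as stated, including the unclipped negative case.

```python
from typing import Callable, Optional


def merchant_profit(price: float, units: float, resale: float,
                    fulfillment: float, overhead: float = 0.0) -> float:
    """Merchant: profit = units * (resale - price - fulfillment) - overhead."""
    return units * (resale - price - fulfillment) - overhead


def vendor_profit(price: float, units: float, cogs: float,
                  sales_overhead: float, overhead: float = 0.0) -> float:
    """Vendor: profit = units * (price - cogs - sales_overhead) - overhead."""
    return units * (price - cogs - sales_overhead) - overhead


def episode_profit(deal_price: Optional[float], profit_fn: Callable[..., float],
                   outside_option: float = 0.0, **economics: float) -> float:
    """No-deal (deal_price is None) returns the outside option; losses are not clipped."""
    if deal_price is None:
        return outside_option
    return profit_fn(deal_price, **economics)
```

For example, a merchant buying 10 units at $19 with a $20 resale value, $3 fulfillment cost, and $15 overhead ends the episode at 10 · (20 − 19 − 3) − 15 = −$35.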
Statistical scope. Each episode is an independent bilateral negotiation; total profit is a sum of i.i.d. realisations and admits standard SEM aggregation. Path-dependent dynamics — chained cash balance, inventory, ruin — are evaluated separately in the bankroll view below.
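Under the i.i.d. assumption stated above, the standard error of the mean is the sample standard deviation divided by √n, and the standard error of the total is n times that. A minimal sketch, with a hypothetical function name and return layout:

```python
import math
import statistics
from typing import Dict, Sequence


def profit_summary(profits: Sequence[float]) -> Dict[str, float]:
    """Total, mean, and standard errors for i.i.d. per-episode profits."""
    n = len(profits)
    if n < 2:
        raise ValueError("need at least two episodes to estimate a standard error")
    sem_mean = statistics.stdev(profits) / math.sqrt(n)  # sample std (n - 1)
    return {
        "total": float(sum(profits)),
        "mean": statistics.fmean(profits),
        "sem_mean": sem_mean,
        "sem_total": n * sem_mean,  # equals sqrt(n) * sample std
    }
```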
| # | Agent | Provider | Total profit | Avg / ep. | Avg margin | Neg. profit % | Walk-away % | Money left | Regret % | Episodes |
|---|---|---|---|---|---|---|---|---|---|---|
TERMS-Bankroll stateful procurement chains
Each agent plays 4 merchant-side sessions against a fixed pool of suppliers. Every session starts with a $100 bankroll and runs for 50 negotiation periods; cash balance carries forward within the session. Terminal balance after the 50-period chain is the headline. A session ends early only if cash falls below the $0 bankruptcy threshold (hard ruin); subsequent periods produce $0 profit and time-to-ruin is recorded.
This release uses the supplier-pool setup: each period the merchant draws a supplier from a fixed pool of counterparts, with no inventory carryover and a $0 bankruptcy threshold. The table compares the LLM panel on the same bankroll geometry.
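The session mechanics described above can be sketched as a simple cash chain. The function name and return shape are assumptions; the rules (start at $100, carry cash forward, freeze the balance once it falls below $0, record time-to-ruin) follow the text.

```python
from typing import List, Optional, Sequence, Tuple


def run_session(period_profits: Sequence[float], start: float = 100.0,
                ruin_threshold: float = 0.0) -> Tuple[List[float], Optional[int]]:
    """Return (cash balance after each period, 1-based ruin period or None)."""
    cash = start
    path: List[float] = []
    ruin_at: Optional[int] = None
    for period, profit in enumerate(period_profits, start=1):
        if ruin_at is None:
            cash += profit
            if cash < ruin_threshold:
                ruin_at = period  # hard ruin: later periods contribute $0
        path.append(cash)
    return path, ruin_at
```

The last element of the returned path is the terminal balance used as the headline number.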
Cash balance over time
Stateful cash balance per period across 4 sessions per agent. Each line is the mean (or median, in IQR mode) and the shaded ribbon is its uncertainty (mean ± SEM, or p25–p75). Hover or scrub to spotlight a single agent at a specific period; click to pin. Tap the play button for a race-replay and watch the right-edge ladder re-rank as cash accumulates. The dashed line marks the starting bankroll; the dotted line marks the bankruptcy threshold.
| # | Agent | Provider | Terminal $ | ± SEM | Avg / period | Survival | Ruin @ | Max DD | Memory premium | ± SEM | Sessions |
|---|---|---|---|---|---|---|---|---|---|---|---|
Surplus efficiency by counterpart family
Feasible surplus efficiency (SE⁺) across the six
counterpart behavioral families, shown as a single profile
overview: one polygon per agent, with the distance along each
spoke encoding that agent's SE⁺ on that family. Broadly inflated
shapes are robust generalists; sharply asymmetric shapes reveal
which families an agent leans on (clean cues vs noise vs
adversarial pressure). Hover a polygon for that agent's full
profile read-out, or hover near a spoke for the family's leaderboard
ranking; click anything to pin the focus.
Surplus efficiency by environment difficulty
Episodes are graded into five equal-mass difficulty bins by the paper-aligned structural grader, combining ZOPA tightness, private pressure, and stance compatibility. Each row is one agent; cell brightness encodes SE⁺, so reading left→right shows each row fading as negotiations get structurally harder. The rightmost column summarises the easy→hard drop in percent.
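Equal-mass binning and the easy→hard drop can be illustrated with a rank-based sketch. It assumes that a higher env_score means a harder episode and that the drop is measured relative to the easiest bin; both are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def equal_mass_bins(scores: Sequence[float], n_bins: int = 5) -> List[int]:
    """Assign each episode a bin 0..n_bins-1 so bins hold (nearly) equal counts."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(scores)
    return bins


def easy_to_hard_drop_pct(se_easy: float, se_hard: float) -> float:
    """Percent drop in SE+ from the easiest bin to the hardest bin."""
    return 100.0 * (se_easy - se_hard) / se_easy
```

Ranking rather than cutting at fixed score thresholds keeps every bin populated even when env_score values cluster.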
Current leaderboard
Per-agent paper-aligned metrics. Toggle regime to see regime-specific
performance. FAGR⁻ and critical-violation rates are only
reported on the latest schema; cells are empty when an agent
pre-dates that instrumentation.
| # | Agent | Provider | SE⁺ | AGR⁺ | CSE⁺ | FAGR⁻ | BEtype | CritViol% | Ū |
|---|---|---|---|---|---|---|---|---|---|
Data-grounded robustness
The leaderboard above is the canonical TERMS-Bench-v1 ranking. As a robustness check, we re-evaluate eleven of the agents on a data-grounded variant in which the ZOPA distribution and observable product context come from a real catalog (AmazonHistoryPrice) rather than the synthetic price geometry. The data-grounded ZOPA distribution shifts the absolute numbers, but rank order is largely preserved (Spearman ρ = 0.90, p < 10⁻³). The cross-suite shift is structured: stronger models tend to gain under data grounding, weaker models tend to lose, and the two structural penalties from the paper's Findings 2 and 3 (cue-use, latent-type inference) replicate.
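For readers who want to reproduce a rank-preservation check of this kind, Spearman's ρ is the Pearson correlation of the two rank vectors. The sketch below is a dependency-free version with average ranks for ties; it is not the benchmark's own analysis code, and the inputs in the usage note are toy values, not leaderboard data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

With toy inputs `[1, 2, 3, 4, 5]` and `[2, 1, 4, 3, 5]`, two adjacent pairs swap rank and ρ comes out at 0.8.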
Synthetic → data-grounded SE⁺ shift
Each line connects one agent's synthetic SE⁺ (left) to its data-grounded SE⁺ (right). Lines are coloured green when the agent gains, pink when it loses, and grey otherwise. Hover a line to spotlight one agent.
Data-grounded SE⁺ ranking
Per-model data-grounded SE⁺, sorted descending. Bars are coloured by tier (frontier / open-weight / sub-frontier); provider mark sits above each bar. Hover a bar to spotlight one agent.
Diagnostics
Four per-agent scatters across the four orthogonal axes. Each dot is one agent evaluated on the Overall slice of the selected run. Optimal corners are annotated on each panel.
What the agent sees
Each agent is called via a single, deterministic JSON-in / JSON-out contract. The system prompt below is identical across models (buyer variant shown). Reasoning effort is set to the maximum the provider supports.
Selected traces
Three hand-picked episodes — one illustrating surplus capture, one illustrating no-deal discipline, and one illustrating a diagnostic failure. Each trace is the actual logged interaction, unmodified.
Metrics
SE⁺ (Feasible surplus efficiency): Fraction of available bargaining surplus captured on episodes where a mutually rational deal exists. Higher is better.
AGR⁺ (Feasible agreement rate): Share of feasible episodes that terminate in an agreement. Higher is better; low values flag over-rigid agents.
CSE⁺ (Conditional feasible deal quality): SE⁺ conditioned on an agreement being reached. Isolates agents who agree often but settle for bad deals.
FAGR⁻ (No-deal false agreement rate): Share of infeasible episodes where the agent still agreed. Lower is better; flags over-eager / over-concessive agents.
BEtype (Belief error): Aggregate error on the agent's stated belief about counterpart reservation value, urgency, and stance. Lower is better.
CritViol% (Critical violations): Episodes with a price-bound, individual-rationality, or invalid-action violation. Lower is better.
Ū (Mean utility): Raw mean per-episode utility. Scale-dependent; kept as a sanity column alongside the normalized metrics.
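The surplus and agreement metrics above can be computed from per-episode logs along these lines. The record fields (`feasible`, `agreed`, `surplus_share`) are hypothetical, and the sketch assumes that a feasible episode ending without a deal contributes zero to SE⁺; the benchmark's exact definitions may differ.

```python
from typing import Dict, List, Optional


def ratio(numerator: float, denominator: int) -> Optional[float]:
    """Safe division: None when the slice is empty."""
    return numerator / denominator if denominator else None


def summarize(episodes: List[dict]) -> Dict[str, Optional[float]]:
    """Compute SE+, AGR+, CSE+, and FAGR- from logged episode records."""
    feasible = [e for e in episodes if e["feasible"]]
    infeasible = [e for e in episodes if not e["feasible"]]
    feasible_deals = [e for e in feasible if e["agreed"]]
    captured = sum(e["surplus_share"] for e in feasible_deals)
    return {
        "SE+": ratio(captured, len(feasible)),          # no-deal counts as 0 (assumed)
        "AGR+": ratio(len(feasible_deals), len(feasible)),
        "CSE+": ratio(captured, len(feasible_deals)),
        "FAGR-": ratio(sum(1 for e in infeasible if e["agreed"]), len(infeasible)),
    }
```

Under that zero-for-no-deal assumption, SE⁺ equals AGR⁺ × CSE⁺, which is one way to see why the two are reported side by side: a low SE⁺ can come from too few deals or from poor deals.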