Paper
Eval · Bilateral price negotiation

TERMS-Bench

A diagnostic benchmark for LLM negotiation agents.

Each agent plays multi-round alternating-offer episodes against a fixed, history-reactive counterpart across three regimes and six behavioral families. Metrics are computed programmatically from logged actions and reported along four orthogonal diagnostic axes — no human or LLM judge, no composite score.


How it works

TERMS-Bench is a simulator-based benchmark: the counterpart is a fixed stochastic policy — not a second language model — so every episode is reproducible from a seed and comparisons across agents are clean.

  1 · Game

    A single bilateral price negotiation for one item. Each side holds a private type t = (r, κ, η) — reservation price, urgency, stance — drawn from a regime-conditioned prior. Surplus is realized only if the agreed price lies inside the ZOPA [r_seller, r_buyer].

  2 · Protocol

    Alternating offers over up to K = 10 rounds. Each turn the agent receives the counterpart's price plus a sentiment and a stance cue, then returns Offer(price, message), Accept, or Reject. Price bounds, monotonic concession, and the turn budget are enforced by the environment.

  3 · Counterpart

    A stochastic policy parameterized by (α, β, γ, ρ, ξ, λ…). Its acceptance and concession rules are history-reactive: they penalize fast concessions (to exploit eager agents) and reward rigidity (to break deadlocks). Cues are emitted from stance- and proximity-conditioned distributions.

  4 · Metrics

    Four orthogonal diagnostic axes: Surplus (SE⁺, CSE⁺), Agreement calibration (AGR⁺, FAGR⁻), Opponent modelling (BEtype), and Procedural robustness (CritViol%). We report them separately — no composite score.

  5 · Difficulty grader

    Each episode is scored by a structural grader that combines ZOPA tightness, urgency pressure, stance compatibility, and deadline proximity into a single env_score ∈ [0, 1]. Episodes are bucketed into five equal-mass bins so the headline chart traces degradation as negotiations become structurally harder.

  6 · Commerce mode

    Each scenario can also carry unit economics — a resale value, a fulfillment cost, a margin floor — so the negotiated price maps directly to dollars of profit. Two perspectives are evaluated: Merchant (agent buys from a supplier) and Vendor (agent sells to a customer). The diagnostic axes above are unchanged; the dollar view is shown alongside in the next section.

  7 · Bankroll mode

    A stateful sibling to commerce. Instead of evaluating each negotiation independently, the agent runs 4 merchant-side sessions against a fixed pool of suppliers, each session starting with a $100 bankroll and running for 50 negotiation periods. Cash carries forward within a session; the chain terminates in hard ruin if the balance falls below the $0 bankruptcy threshold. The headline number is terminal balance after the 50-period chain.
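
The game, protocol, and counterpart described in steps 1–3 can be sketched as a minimal episode loop. This is a toy, not the benchmark's actual policy: the 1.5× opening ask, the 10% concession step, and the fast-concession penalty factor are all illustrative assumptions standing in for the real (α, β, γ, ρ, ξ, λ…) parameterization.

```python
def run_episode(agent_offer, r_seller, r_buyer, K=10):
    """One buyer-side alternating-offer episode against a toy seller
    counterpart. All numeric rules here are illustrative assumptions."""
    ask = 1.5 * r_seller                      # hypothetical opening ask
    last_bid, price = None, None
    for k in range(K):
        bid = agent_offer(k, ask)             # agent proposes a price
        if bid >= ask:                        # counterpart accepts the bid
            price = bid
            break
        step = 0.10 * (ask - r_seller)        # baseline concession
        if last_bid is not None and bid - last_bid > 0.2 * ask:
            step *= 0.5                       # history-reactive: penalize fast concessions
        ask = max(r_seller, ask - step)
        last_bid = bid
    if price is None:                         # turn budget exhausted: no deal
        return None
    # Surplus is realized only if the agreed price lies inside the ZOPA.
    return price if r_seller <= price <= r_buyer else None
```

A linearly conceding buyer eventually meets the ask and closes inside the ZOPA, while a rigid low-baller exhausts the turn budget and walks away with no deal.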

Surplus efficiency by counterpart family

Feasible surplus efficiency (SE⁺) across the six counterpart behavioral families, shown as a single profile overview: one polygon per agent, with the distance along each spoke encoding that agent's SE⁺ on that family. Large, evenly inflated shapes mark robust generalists; sharply asymmetric shapes reveal which families an agent's performance depends on (clean cues vs noise vs adversarial pressure). Hover a polygon for that agent's full profile read-out, or hover near a spoke for the family's leaderboard ranking; click anything to pin the focus.


Surplus efficiency by environment difficulty

Episodes are graded into five equal-mass difficulty bins by the paper-aligned structural grader, which combines ZOPA tightness, private pressure, and stance compatibility. Each row is one agent; cell brightness encodes SE⁺, so reading left→right shows each row fading as negotiations get structurally harder. The rightmost column summarizes the easy→hard drop in percent.
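
Equal-mass binning amounts to cutting the env_score distribution at its quantiles. A minimal sketch, assuming the grader has already produced one env_score per episode (`equal_mass_bins` is illustrative, not the benchmark's code):

```python
import statistics

def equal_mass_bins(env_scores, n_bins=5):
    """Assign each env_score to one of n_bins equal-mass difficulty bins
    (0 = easiest) using quantile cut points."""
    cuts = statistics.quantiles(env_scores, n=n_bins)  # n_bins - 1 cut points
    # An episode's bin is the number of cut points its score exceeds.
    return [sum(s > c for c in cuts) for s in env_scores]
```

With 100 uniformly spread scores this yields exactly 20 episodes per bin, which is what makes the easy→hard degradation chart comparable across agents.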


Current leaderboard

Per-agent paper-aligned metrics. Toggle regime to see regime-specific performance. FAGR⁻ and critical-violation rates are only reported on the latest schema; cells are empty when an agent pre-dates that instrumentation.

# Agent Provider SE⁺ AGR⁺ CSE⁺ FAGR⁻ BEtype CritViol% Ū

Data-grounded robustness

The leaderboard above is the canonical TERMS-Bench-v1 ranking. As a robustness check, we re-evaluate eleven of the agents on a data-grounded variant in which the ZOPA distribution and observable product context come from a real catalog (AmazonHistoryPrice) rather than the synthetic price geometry. The data-grounded ZOPA distribution shifts the absolute numbers, but rank order is largely preserved (Spearman ρ = 0.90, p < 10⁻³). The cross-suite shift is structured: stronger models tend to gain under data grounding, weaker models tend to lose, and the two structural penalties from the paper's Findings 2 and 3 (cue-use, latent-type inference) replicate.
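
The rank-agreement statistic is standard Spearman ρ. A tie-free sketch of the computation (simplified; a production check would handle tied ranks):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Identical orderings give ρ = 1 and fully reversed orderings give ρ = −1, so ρ = 0.90 across the eleven agents indicates the synthetic and data-grounded suites sort them almost identically.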

Synthetic → data-grounded SE⁺ shift

Each line connects one agent's synthetic SE⁺ (left) to its data-grounded SE⁺ (right). Lines are colored green when the agent gains, pink when it loses, and gray otherwise. Hover a line to spotlight one agent.

Data-grounded SE⁺ ranking

Per-model data-grounded SE⁺, sorted descending. Bars are colored by tier (frontier / open-weight / sub-frontier); the provider mark sits above each bar. Hover a bar to spotlight one agent.

Diagnostics

Four per-agent scatters across the four orthogonal axes. Each dot is one agent evaluated on the Overall slice of the selected run. Optimal corners are annotated on each panel.

Agreement vs. Surplus
AGR⁺ (x) × SE⁺ (y) — top-right is ideal.
Opponent modelling vs. Surplus
BEtype (x, lower is better) × SE⁺ (y).
Safety vs. Surplus
CritViol% (x, lower is better) × SE⁺ (y).
No-deal discipline vs. Surplus
FAGR⁻ (x, lower is better) × SE⁺ (y).

What the agent sees

Each agent is called via a single, deterministic JSON-in / JSON-out contract. The system prompt below is identical across models (buyer variant shown). Reasoning effort is set to the maximum the provider supports.
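
A hedged sketch of what one turn of that JSON-in / JSON-out exchange could look like. Every field name below is a hypothetical placeholder — the authoritative schema lives in the system prompt — but the three legal moves (Offer, Accept, Reject) come from the protocol above.

```python
import json

# Hypothetical observation shape for one turn (placeholder keys).
EXAMPLE_OBSERVATION = {
    "round": 3,                  # current round out of K = 10
    "counterpart_price": 72.0,   # counterpart's latest offer
    "sentiment": "neutral",      # emitted sentiment cue
    "stance_cue": "firm",        # emitted stance cue
}

def parse_action(raw):
    """Validate a JSON action string against the three legal moves:
    Offer(price, message), Accept, Reject."""
    action = json.loads(raw)
    kind = action.get("action")
    if kind == "offer":
        if not isinstance(action.get("price"), (int, float)):
            raise ValueError("offer requires a numeric price")
    elif kind not in ("accept", "reject"):
        raise ValueError(f"invalid action: {kind!r}")
    return action
```

Because the contract is deterministic and validated, a malformed reply can be logged as an invalid-action event rather than silently coerced, which is what feeds the CritViol% axis.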


Selected traces

Three hand-picked episodes — one illustrating surplus capture, one illustrating no-deal discipline, and one illustrating a diagnostic failure. Each trace is the actual logged interaction, unmodified.

Metrics

SE⁺ — Feasible surplus efficiency
Fraction of available bargaining surplus captured on episodes where a mutually rational deal exists. Higher is better.
AGR⁺ — Feasible agreement rate
Share of feasible episodes that terminate in an agreement. Higher is better; low values flag over-rigid agents.
CSE⁺ — Conditional feasible deal quality
SE⁺ conditioned on an agreement being reached. Isolates agents who agree often but settle for bad deals.
FAGR⁻ — No-deal false agreement rate
Share of infeasible episodes where the agent still agreed. Lower is better; flags over-eager / over-concessive agents.
BEtype — Belief error
Aggregate error on the agent's stated belief about counterpart reservation value, urgency, and stance. Lower is better.
CritViol% — Critical violations
Share of episodes with a price-bound, individual-rationality, or invalid-action violation. Lower is better.
Ū — Mean utility
Raw mean per-episode utility. Scale-dependent; kept as a sanity column alongside the normalized metrics.
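
Under one plausible reading of the glossary above (feasible episodes that end with no deal contribute zero captured surplus to SE⁺), the four surplus and agreement metrics can be computed directly from logged episodes. The episode dict schema here is hypothetical:

```python
def surplus_metrics(episodes):
    """Compute SE+, AGR+, CSE+ and FAGR- from logged episodes.

    Each episode is a dict (illustrative schema) with:
      feasible - bool, whether a ZOPA exists
      agreed   - bool, whether a deal was struck
      surplus  - fraction of available surplus captured (deals only)
    """
    feas = [e for e in episodes if e["feasible"]]
    infeas = [e for e in episodes if not e["feasible"]]
    deals = [e for e in feas if e["agreed"]]
    se = sum(e["surplus"] for e in deals) / len(feas) if feas else 0.0
    agr = len(deals) / len(feas) if feas else 0.0
    cse = sum(e["surplus"] for e in deals) / len(deals) if deals else 0.0
    fagr = sum(e["agreed"] for e in infeas) / len(infeas) if infeas else 0.0
    return {"SE+": se, "AGR+": agr, "CSE+": cse, "FAGR-": fagr}
```

Note how CSE⁺ divides by agreed feasible episodes while SE⁺ divides by all feasible episodes: an agent that agrees rarely but well will show high CSE⁺ with depressed SE⁺, which is exactly the separation the two columns are meant to expose.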