TERMS-Bench

Submit your model

Want your model on the TERMS-Bench leaderboard? You provide the API access; we run the full diagnostic suite on our backend and publish the results. No local infra, no eval harness setup, no GPU. Submissions typically run within 3-5 business days.

One submission, full diagnostic report

Each accepted submission is evaluated on the full paper suite: the 3 × 6 regime × family design matrix across the bilateral price negotiation protocol. You receive all four diagnostic axes (SE⁺, AGR⁺, FAGR⁻, BE_type, CritViol%), per-family stratification, per-regime breakdowns, and a bargaining fingerprint card. Cost is covered by your API key; we publish aggregate spend so you know what to expect.

Submissions are emailed to submissions@terms-bench.org. For the API key we recommend a one-time secret link (paste it in the notes field) so the key never sits in your sent-mail archive. We use the key only for the eval run, never log it, and recommend you create a dedicated, scope-limited key with a hard spend cap.

How submission works

  1. 1. Pick a provider

    Any OpenAI-compatible chat-completions endpoint works: OpenAI, Anthropic (via the dedicated route), OpenRouter, Together, Fireworks, Groq, or your own self-hosted vLLM / SGLang server with the OpenAI-compatible adapter.

  2. 2. Mint a dedicated API key

    We recommend creating a fresh key with a hard spend cap (typically $50-$200 covers the full paper suite, depending on model). The key is transported over TLS, used by our eval runner only, and never persisted to disk on our side.

  3. 3. Fill in the form and send the email

    Fill in model identifier (e.g. openai/gpt-5, anthropic/claude-opus-4-7, meta-llama/llama-4-70b-instruct), provider, your contact info, and any sampling notes. Hitting submit composes an email to submissions@terms-bench.org with everything pre-filled - review and send from your mail client. For the API key, paste a one-time secret link in the notes field (recommended) or include it directly if you accept the risk.

  4. 4. We run the eval, you get the report

    We run the full paper suite (single replicate, paper-aligned cell sizes). Within 3-5 business days, you receive a private report with all metrics, traces of representative episodes, and the leaderboard PR for your sign-off before publishing.

Submission

Shown on the leaderboard exactly as written.
The exact string passed to the provider's API.
Required only for custom / self-hosted endpoints.
We default to temperature=0.0 and the model's standard reasoning mode unless you say otherwise. Recommended API-key delivery: paste a one-time secret URL here (e.g. from onetimesecret.com) so the raw key never lives in your sent-mail archive.

Prefer your own email client? Just email submissions@terms-bench.org with the model identifier, provider, your contact, and a one-time secret link for the API key.

Supported providers

Any of these work out of the box. custom covers anything OpenAI-compatible (vLLM, SGLang, Ollama with the OpenAI adapter, self-hosted endpoints).

OpenAI

https://api.openai.com/v1 Header: Authorization: Bearer sk-...

Reasoning models (o1, o3, gpt-5 family) handled via the reasoning-effort param; pass reasoning_effort in notes.

Anthropic

https://api.anthropic.com/v1 Header: x-api-key: sk-ant-...

Native client; thinking models (claude-opus-4.x) run with extended thinking by default.

OpenRouter

https://openrouter.ai/api/v1 Header: Authorization: Bearer sk-or-...

Use the vendor/model identifier (e.g. meta-llama/llama-4-70b-instruct). Reasoning-capable routes auto-detected.

Together AI

https://api.together.xyz/v1 Header: Authorization: Bearer ...

OpenAI-compatible. Good for open-weight models.

Fireworks AI

https://api.fireworks.ai/inference/v1 Header: Authorization: Bearer ...

OpenAI-compatible. Fast open-weight inference.

Groq

https://api.groq.com/openai/v1 Header: Authorization: Bearer gsk_...

OpenAI-compatible. Low-latency inference.

Custom / self-hosted

Any OpenAI-compatible URL Header: Authorization: Bearer ... (or none)

vLLM, SGLang, Ollama (OpenAI adapter), TGI, llama.cpp HTTP server, etc. Provide a publicly reachable base URL.

Missing your provider? Pick custom and paste the base URL, or reach out at contact@terms-bench.org.

Fine print