TERMS-Bench: Submit your model

One submission, full diagnostic report

Each accepted submission is evaluated on the full paper suite: the 3 × 6 regime × family design matrix across the bilateral price negotiation protocol. You receive all four diagnostic axes (SE⁺, AGR⁺, FAGR⁻, BE_type, CritViol%), per-family stratification, per-regime breakdowns, and a bargaining fingerprint card. Cost is covered by your API key; we publish aggregate spend so you know what to expect.

Submissions are emailed to submissions@terms-bench.org. For the API key we recommend a one-time secret link (paste it in the notes field) so the key never sits in your sent-mail archive. We use the key only for the eval run, never log it, and recommend you create a dedicated, scope-limited key with a hard spend cap.

How submission works

1. Pick a provider

Any OpenAI-compatible chat-completions endpoint works: OpenAI, Anthropic (via the dedicated route), OpenRouter, Together, Fireworks, Groq, or your own self-hosted vLLM / SGLang server with the OpenAI-compatible adapter.
2. Mint a dedicated API key

We recommend creating a fresh key with a hard spend cap (typically $50-$200 covers the full paper suite, depending on model). The key is transported over TLS, used by our eval runner only, and never persisted to disk on our side.
3. Fill in the form and send the email

Fill in model identifier (e.g. openai/gpt-5, anthropic/claude-opus-4-7, meta-llama/llama-4-70b-instruct), provider, your contact info, and any sampling notes. Hitting submit composes an email to submissions@terms-bench.org with everything pre-filled - review and send from your mail client. For the API key, paste a one-time secret link in the notes field (recommended) or include it directly if you accept the risk.
4. We run the eval, you get the report

We run the full paper suite (single replicate, paper-aligned cell sizes). Within 3-5 business days, you receive a private report with all metrics, traces of representative episodes, and the leaderboard PR for your sign-off before publishing.

Submission

Your name*

Contact email*

Affiliation

Model display name* Shown on the leaderboard exactly as written.

Model identifier* The exact string passed to the provider's API.

Provider*

API base URL Required only for custom / self-hosted endpoints.

Notes — API key delivery, sampling, reasoning * We default to temperature=0.0 and the model's standard reasoning mode unless you say otherwise. Recommended API-key delivery: paste a one-time secret URL here (e.g. from onetimesecret.com) so the raw key never lives in your sent-mail archive.

Prefer your own email client? Just email submissions@terms-bench.org with the model identifier, provider, your contact, and a one-time secret link for the API key.

Supported providers

Any of these work out of the box. custom covers anything OpenAI-compatible (vLLM, SGLang, Ollama with the OpenAI adapter, self-hosted endpoints).

OpenAI

https://api.openai.com/v1 Header: Authorization: Bearer sk-...

Reasoning models (o1, o3, gpt-5 family) handled via the reasoning-effort param; pass reasoning_effort in notes.

Anthropic

https://api.anthropic.com/v1 Header: x-api-key: sk-ant-...

Native client; thinking models (claude-opus-4.x) run with extended thinking by default.

OpenRouter

https://openrouter.ai/api/v1 Header: Authorization: Bearer sk-or-...

Use the vendor/model identifier (e.g. meta-llama/llama-4-70b-instruct). Reasoning-capable routes auto-detected.

Together AI

https://api.together.xyz/v1 Header: Authorization: Bearer ...

OpenAI-compatible. Good for open-weight models.

Fireworks AI

https://api.fireworks.ai/inference/v1 Header: Authorization: Bearer ...

OpenAI-compatible. Fast open-weight inference.

Groq

https://api.groq.com/openai/v1 Header: Authorization: Bearer gsk_...

OpenAI-compatible. Low-latency inference.

Custom / self-hosted

Any OpenAI-compatible URL Header: Authorization: Bearer ... (or none)

vLLM, SGLang, Ollama (OpenAI adapter), TGI, llama.cpp HTTP server, etc. Provide a publicly reachable base URL.

Missing your provider? Pick custom and paste the base URL, or reach out at contact@terms-bench.org.

Fine print

Cost. The full paper suite is roughly 3 × 6 × 60 episodes × ~12 rounds × ~600 tokens/round ≈ ~7-8M tokens total. Frontier chat models land in the $50-$120 range; reasoning models with high effort can reach $200+.
Cadence. We batch submissions weekly. Expedited runs (e.g. for paper deadlines) can be arranged - email submissions@terms-bench.org.
Re-runs. Each model gets one re-run per six months (e.g. after a checkpoint update). Re-submit using the same display name to overwrite; pick a new display name to keep both versions on the leaderboard.
Embargo. If you need results held until a publication deadline, mention it in the notes; we publish on your timeline up to 60 days.
Reproducibility. Submissions ship with the exact termsbench/defaults.toml hash and the run manifest; everything is reproducible from the artifacts.
Get in touch. General queries, interest, collaboration, or anything else - email contact@terms-bench.org and we'll do our best to get back.