TERMS-Bench
Submit your model
Want your model on the TERMS-Bench leaderboard? You provide the API access; we run the full diagnostic suite on our backend and publish the results. No local infra, no eval harness setup, no GPU. Submissions typically run within 3-5 business days.
One submission, full diagnostic report
Each accepted submission is evaluated on the full paper suite:
the 3 × 6 regime × family design matrix
across the bilateral price negotiation protocol. You receive
all four diagnostic axes (SE⁺,
AGR⁺, FAGR⁻,
BE_type, CritViol%), per-family
stratification, per-regime breakdowns, and a bargaining
fingerprint card. Cost is covered by your API key; we publish
aggregate spend so you know what to expect.
Submissions are emailed to submissions@terms-bench.org. For the API key we recommend a one-time secret link (paste it in the notes field) so the key never sits in your sent-mail archive. We use the key only for the eval run, never log it, and recommend you create a dedicated, scope-limited key with a hard spend cap.
How submission works
-
1. Pick a provider
Any OpenAI-compatible chat-completions endpoint works: OpenAI, Anthropic (via the dedicated route), OpenRouter, Together, Fireworks, Groq, or your own self-hosted vLLM / SGLang server with the OpenAI-compatible adapter.
-
2. Mint a dedicated API key
We recommend creating a fresh key with a hard spend cap (typically $50-$200 covers the full paper suite, depending on model). The key is transported over TLS, used by our eval runner only, and never persisted to disk on our side.
-
3. Fill in the form and send the email
Fill in model identifier (e.g.
openai/gpt-5,anthropic/claude-opus-4-7,meta-llama/llama-4-70b-instruct), provider, your contact info, and any sampling notes. Hitting submit composes an email to submissions@terms-bench.org with everything pre-filled - review and send from your mail client. For the API key, paste a one-time secret link in the notes field (recommended) or include it directly if you accept the risk. -
4. We run the eval, you get the report
We run the full paper suite (single replicate, paper-aligned cell sizes). Within 3-5 business days, you receive a private report with all metrics, traces of representative episodes, and the leaderboard PR for your sign-off before publishing.
Submission
Supported providers
Any of these work out of the box. custom covers
anything OpenAI-compatible (vLLM, SGLang, Ollama with the OpenAI
adapter, self-hosted endpoints).
OpenAI
https://api.openai.com/v1 Header:Authorization: Bearer sk-...
Reasoning models (o1, o3,
gpt-5 family) handled via the reasoning-effort
param; pass reasoning_effort in notes.
Anthropic
https://api.anthropic.com/v1 Header:x-api-key: sk-ant-...
Native client; thinking models (claude-opus-4.x)
run with extended thinking by default.
OpenRouter
https://openrouter.ai/api/v1 Header:Authorization: Bearer sk-or-...
Use the vendor/model identifier
(e.g. meta-llama/llama-4-70b-instruct).
Reasoning-capable routes auto-detected.
Together AI
https://api.together.xyz/v1 Header:Authorization: Bearer ...
OpenAI-compatible. Good for open-weight models.
Fireworks AI
https://api.fireworks.ai/inference/v1 Header:Authorization: Bearer ...
OpenAI-compatible. Fast open-weight inference.
Groq
https://api.groq.com/openai/v1 Header:Authorization: Bearer gsk_...
OpenAI-compatible. Low-latency inference.
Custom / self-hosted
Any OpenAI-compatible URL Header:Authorization: Bearer ... (or none)
vLLM, SGLang, Ollama (OpenAI adapter), TGI, llama.cpp HTTP server, etc. Provide a publicly reachable base URL.
Missing your provider? Pick custom and paste the
base URL, or reach out at
contact@terms-bench.org.
Fine print
-
Cost. The full paper suite is roughly
3 × 6 × 60episodes × ~12 rounds × ~600 tokens/round ≈~7-8M tokenstotal. Frontier chat models land in the$50-$120range; reasoning models with high effort can reach$200+. - Cadence. We batch submissions weekly. Expedited runs (e.g. for paper deadlines) can be arranged - email submissions@terms-bench.org.
- Re-runs. Each model gets one re-run per six months (e.g. after a checkpoint update). Re-submit using the same display name to overwrite; pick a new display name to keep both versions on the leaderboard.
- Embargo. If you need results held until a publication deadline, mention it in the notes; we publish on your timeline up to 60 days.
-
Reproducibility. Submissions ship with the
exact
termsbench/defaults.tomlhash and the run manifest; everything is reproducible from the artifacts. - Get in touch. General queries, interest, collaboration, or anything else - email contact@terms-bench.org and we'll do our best to get back.