Leaderboards

Greatness isn't accidental. How we measure it shouldn't be either. If we want AGI that builds billion-dollar enterprises and globe-spanning infrastructure, we can't evaluate it with clickbait and slop. We need benchmarks that test for intelligence and sophistication.

This is our ranking of models, measured by their capacity for rigorous reasoning and real-world mastery. Discover which labs are leading the frontier.

View by :

Antidote / Everyday

Antidote: Everyday Edition

A real-world AI leaderboard – real prompts, real stakes, graded by experts who read every word, check every citation, and run every line of code.

Today's release benchmarks everyday use. Agentic and enterprise workflows coming soon.

Read Blog Post

Rank

Model

elo score (95% ci)

Gemini 3.1 (Pro)

1099

(

1087

1110

)

Gemini 3.5 (Flash)

1082

(

1065

1100

)

Opus 4.6

1046

(

1034

1058

)

Opus 4.7

1045

(

1030

1060

)

Kimi K2.5

1022

(

1011

1034

)

GPT-5.5

1022

(

1005

1038

)

DeepSeek V4 (Pro)

1014

(

998

1031

)

Grok 4.20 (Beta)

1013

(

1002

1024

)

Qwen 3.5 (Plus)

1009

(

997

1021

)

DeepSeek V4 (Flash)

986

(

969

1002

)

Grok 4.3

983

(

969

996

)

DeepSeek V3.2

970

(

957

982

)

Ernie 5.1

967

(

951

983

)

Mistral Large 2512

963

(

951

975

)

Ernie 4.5 300B

884

(

872

896

)

Nova 2 (Pro)

847

(

834

859

)

View full leaderboard

Your $100B model can't read a PDF

GDP.pdf

Can frontier models master the documents that run the world? GDP.pdf is a multimodal and reasoning benchmark that takes real-world prompts and PDFs pulled directly from expert professional workflows.

Read Blog Post

Research paper

View the dataset

Rank

Model

Score

GPT-5.5 (xHigh reasoning)

Gemini 3.1 (Pro)

Claude Opus 4.7

GPT-5.4

Claude Opus 4.6

Grok 4.20 (Beta)

Kimi K2.5

Mistral Large 3

Nova 2 (Pro)

View full leaderboard

Mathematics at the frontier

Riemann-bench

We evaluate AI models on advanced mathematical problems requiring deep reasoning and novel synthesis. Our benchmark features problems from cutting-edge mathematics, sourced from leading mathematicians – Ivy League professors, PhD IMO medalists, graduate students at the top of their field – in the course of their research.

Read Blog Post

Research paper

Rank

Model

Score

GPT-5.5 (xHigh reasoning)

41.6

GPT-5.2 (xHigh reasoning)

Claude Opus 4.6

22.4

Claude Opus 4.7

20.8

Gemini 3.1 (Pro)

15.2

Gemini 3.5 Flash (High Reasoning)

15.2

Claude Opus 4.5

10.4

Kimi K2.6

10.4

Kimi K2.5

View full leaderboard

Enterprise Agents in Realistic RL Environments

EnterpriseBench: CoreCraft

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.

Read Blog Post

Research paper

Rank

Model

Score

GPT-5.5

52.8

GPT-5.5 (xHigh reasoning)

51.3

Gemini 3.5 Flash (High reasoning)

50.8

View full leaderboard

Creative, Business, and Everyday writing

Hemingway-bench

Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths. Our goal: to push AI writing from two-second vibes to genuine nuance and impact.

Read Blog Post

Rank

Model

elo score (95% ci)

Gemini 3.1 (Pro)

1087

(

1068

1105

)

Gemini 3 (Flash)

1079

(

1062

1095

)

Gemini 3 (Pro)

1074

(

1051

1097

)

Claude Opus 4.7 (Max)

1057

(

1036

1078

)

GPT-5.5

1054

(

1032

1076

)

Claude Opus 4.6

1054

(

1035

1073

)

DeepSeek V4 (Pro)

1039

(

1017

1060

)

Claude Opus 4.5

1038

(

1019

1057

)

DeepSeek V4 (Flash)

1021

(

999

1042

)

GPT-5.2 (Chat)

1018

(

1001

1035

)

Kimi K2.5

1018

(

1000

1035

)

Claude Sonnet 4.6

1014

(

995

1032

)

View full leaderboard

Leaderboards

Antidote: Everyday Edition

GDP.pdf

Riemann-bench

EnterpriseBench: CoreCraft

Hemingway-bench

Stay up-to-date on new leaderboards

Raise AGI with the richness of human intelligence.