Leaderboards

Greatness isn't accidental. How we measure it shouldn't be either. If we want AGI that builds billion-dollar enterprises and globe-spanning infrastructure, we can't evaluate it with clickbait and slop. We need benchmarks that test for intelligence and sophistication.

This is our ranking of models, measured by their capacity for rigorous reasoning and real-world mastery. Discover which labs are leading the frontier.
Frontier Instruction Following

ComplexConstraints

A benchmark for entangled instruction following, where constraints depend on each other, fire conditionally, and must be inferred from context.
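
To make "fire conditionally" concrete, here is a minimal sketch of how a conditional constraint could be checked against a model response. This is not the benchmark's grader, which is not described on this page. The `Constraint` class, the two example constraints, and the all-or-nothing pass rule are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Constraint:
    """Hypothetical constraint: `check` only matters when `applies` is true."""
    name: str
    applies: Callable[[str], bool]  # does this constraint fire for the response?
    check: Callable[[str], bool]    # is the constraint satisfied?


def evaluate(response: str, constraints: List[Constraint]) -> Dict[str, bool]:
    """Return pass/fail for every constraint that fires on this response."""
    return {c.name: c.check(response) for c in constraints if c.applies(response)}


def passes(response: str, constraints: List[Constraint]) -> bool:
    """Assumed scoring rule: a response passes only if every fired constraint holds."""
    return all(evaluate(response, constraints).values())


# Illustrative constraints, not taken from the benchmark.
EXAMPLE_CONSTRAINTS = [
    # Always fires: keep the answer to 20 words or fewer.
    Constraint("word_limit", lambda r: True, lambda r: len(r.split()) <= 20),
    # Fires only if the response contains a digit: amounts must name the currency.
    Constraint(
        "currency_code",
        lambda r: any(ch.isdigit() for ch in r),
        lambda r: "USD" in r,
    ),
]
```

With these example constraints, `passes("The plan is free.", EXAMPLE_CONSTRAINTS)` only evaluates the word limit, because the currency rule never fires. A response that quotes a number is also held to the currency rule.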

Rank  Model              Score
1     Gemini 3.1 Pro     40.4%
2     GPT-5.5            38.7%
3     Gemini 3.5 Flash   36.9%
4     Qwen3.7 Max        36%
5     Claude Opus 4.8    34.9%
6     Kimi K2.6          34%
7     Claude Opus 4.7    33.6%
8     DeepSeek V4 Pro    26.7%
9     Kimi K2.5          18.7%
10    Grok 4.20 Beta     16.9%
11    DeepSeek V4 Flash  16.4%
12    Qwen3.5 Plus       16%
13    Ernie 5.1          15.2%
14    GPT 5.4            4.9%
15    DeepSeek v3.2      1.8%
16    Mistral Large      0.4%
17    Nova 2 Pro         0%
18    Ernie 4.5          0%
Antidote / Everyday

Antidote: Everyday Edition

A real-world AI leaderboard – real prompts, real stakes, graded by experts who read every word, check every citation, and run every line of code.


Today's release benchmarks everyday use. Agentic and enterprise workflows are coming soon.

Rank  Model              Elo score (95% CI)
1     Gemini 3.1 Pro     1100 (1090-1111)
2     Gemini 3.5 Flash   1085 (1070-1099)
3     Qwen3.7 Max        1067 (1055-1079)
4     Opus 4.7           1054 (1041-1067)
5     Kimi K2.6          1053 (1043-1062)
6     Opus 4.6           1053 (1042-1064)
7     Opus 4.8           1049 (1033-1065)
8     Sonnet 4.6         1036 (1021-1051)
9     GPT-5.5            1026 (1012-1041)
10    Kimi K2.5          1021 (1010-1031)
11    DeepSeek V4 Pro    1019 (1005-1034)
12    Grok 4.20 Beta     1015 (1005-1026)
13    Qwen3.5 Plus       1014 (1003-1025)
14    DeepSeek V4 Flash  991 (977-1004)
15    DeepSeek V3.2      971 (958-983)
16    Grok 4.3           969 (957-981)
17    Mistral Large 3    964 (953-975)
18    Haiku 4.5          961 (945-977)
19    Ernie 5.1          960 (946-974)
20    GPT-5.4 Mini       953 (937-969)
21    Gemma 3 12B        910 (894-926)
22    Ernie 4.5 300B     885 (872-897)
23    Nova 2 Pro         845 (833-856)
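
For readers unsure how to interpret Elo gaps and confidence intervals, the sketch below shows one common reading. It assumes the conventional Elo formula (base 10, 400-point scale). This page does not say which scale or interval method the leaderboard uses, so treat the numbers it produces as illustrative. The function names are our own.

```python
from typing import Tuple


def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    Assumption: the conventional base-10, 400-point scale applies.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a: Tuple[float, float], ci_b: Tuple[float, float]) -> bool:
    """True if two (low, high) intervals share at least one point.

    Overlap is a rough heuristic for "too close to call",
    not a formal significance test.
    """
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


if __name__ == "__main__":
    # Ratings and intervals copied from the table above.
    top, bottom = 1100, 845  # Gemini 3.1 Pro vs Nova 2 Pro
    print(f"P(top beats bottom) = {expected_win_probability(top, bottom):.2f}")

    # Gemini 3.1 Pro vs Gemini 3.5 Flash: do the intervals overlap?
    print(intervals_overlap((1090, 1111), (1070, 1099)))
```

Under these assumptions, equal ratings give a 50% expected win rate. Two models whose intervals overlap, such as the top two entries, are harder to separate than their point estimates alone suggest.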
Multimodal Reasoning

GDP.pdf

Can frontier models master the documents that run the world? GDP.pdf is a multimodal and reasoning benchmark that takes real-world prompts and PDFs pulled directly from expert professional workflows.

Rank  Model                           Score
1     GPT-5.5 (xHigh reasoning)       25%
2     Claude Opus 4.8 (Adaptive Max)  23%
3     Claude Opus 4.7 (Adaptive Max)  21%
4     Gemini 3.1 (Pro)                17%
5     Gemini 3.5 Flash                14%
6     Kimi K2.6                       12%
7     Gemini 3 Flash                  10%
8     Grok 4.3 (High)                 8%
9     Nova 2 (Pro)                    2%
10    NVIDIA Nemotron 3 Nano Omni     2%
11    Mistral Large 3                 2%
Mathematics at the Frontier

Riemann-bench

We evaluate AI models on advanced mathematical problems requiring deep reasoning and novel synthesis. Our benchmark features problems from cutting-edge mathematics, sourced from leading mathematicians – Ivy League professors, PhD IMO medalists, graduate students at the top of their field – in the course of their research.

Rank  Model                              Score
1     GPT-5.5 (xHigh reasoning)          41.6%
2     GPT-5.2 (xHigh reasoning)          32%
3     Claude Opus 4.6                    22.4%
4     Claude Opus 4.7                    20.8%
5     Gemini 3.1 (Pro)                   15.2%
6     Gemini 3.5 Flash (High Reasoning)  15.2%
7     Claude Opus 4.5                    10.4%
8     Kimi K2.6                          10.4%
9     Kimi K2.5                          8%
Enterprise Agents in Realistic RL Environments

EnterpriseBench: CoreCraft

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.

Rank  Model                              Score
1     GPT-5.5                            52.8%
2     GPT-5.5 (xHigh reasoning)          51.3%
3     Gemini 3.5 Flash (High reasoning)  50.8%
Creative, Business, and Everyday Writing

Hemingway-bench

Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths. Our goal: to push AI writing from two-second vibes to genuine nuance and impact.

Rank  Model                  Elo score (95% CI)
1     Gemini 3.1 (Pro)       1087 (1068-1105)
2     Gemini 3 (Flash)       1079 (1062-1095)
3     Gemini 3 (Pro)         1074 (1051-1097)
4     Claude Opus 4.7 (Max)  1057 (1036-1078)
5     GPT-5.5                1054 (1032-1076)
6     Claude Opus 4.6        1054 (1035-1073)
7     DeepSeek V4 (Pro)      1039 (1017-1060)
8     Claude Opus 4.5        1038 (1019-1057)
9     DeepSeek V4 (Flash)    1021 (999-1042)
10    GPT-5.2 (Chat)         1018 (1001-1035)
11    Kimi K2.5              1018 (1000-1035)
12    Claude Sonnet 4.6      1014 (995-1032)
