Benchmarks

Greatness isn't accidental. How we measure it shouldn't be either. If we want AGI that builds billion-dollar enterprises and globe-spanning infrastructure, we need benchmarks that test for intelligence and sophistication — not clickbait and slop.

This is our ranking of models, measured by their capacity for rigorous reasoning and real-world mastery.
Long-Context Agentic Instruction Following

HANDBOOK.md

Can an agent follow a 100-page company handbook inside an enterprise RL environment?

HANDBOOK.md is a benchmark for long-context agentic instruction following, modeled on how professionals follow corporate policy in their day-to-day work. Each task is a unique RL environment with internal tools and external MCP servers, across five enterprise domains.

Rank  Model                                 Score
1     Opus 4.8 (max reasoning)              21.9%
2     GPT-5.5 (xHigh reasoning)             21.5%
3     GPT-5.5                               21.5%
4     Opus 4.8                              18.9%
5     GLM 5.2                               12.7%
6     Gemini 3.5 Flash (high reasoning)     11.2%
7     Sonnet 4.6 (max reasoning)            10.4%
8     GLM 5.2 (xHigh reasoning)             10%
9     Gemini 3.1 Pro Preview                10%
10    Gemini 3.5 Flash                      9.2%
11    DeepSeek V4 Pro (xHigh reasoning)     9.2%
12    Qwen 3.7 Max                          8.5%
13    Sonnet 4.6                            7.7%
14    DeepSeek V4 Flash (xHigh reasoning)   7.3%
15    DeepSeek V4 Flash                     7.3%
16    Kimi K2.6                             6.9%
17    DeepSeek V4 Pro                       6.9%
18    Grok 4.3 (high reasoning)             1.9%
19    Nemotron 3 Ultra                      1.5%
20    Grok 4.3                              0.8%
Enterprise Instruction Following

ComplexConstraints

A benchmark for the kind of instruction following professional work demands — where constraints depend on each other, fire conditionally, and must be inferred from context.

Rank  Model               Score
1     Gemini 3.1 Pro      40.4%
2     GPT-5.5             38.7%
3     Gemini 3.5 Flash    36.9%
4     Qwen3.7 Max         36%
5     Claude Opus 4.8     34.9%
6     Kimi K2.6           34%
7     Claude Opus 4.7     33.6%
8     DeepSeek V4 Pro     26.7%
9     Kimi K2.5           18.7%
10    Grok 4.20 Beta      16.9%
11    DeepSeek V4 Flash   16.4%
12    Qwen3.5 Plus        16%
13    Ernie 5.1           15.2%
14    GPT 5.4             4.9%
15    DeepSeek v3.2       1.8%
16    Mistral Large       0.4%
17    Nova 2 Pro          0%
18    Ernie 4.5           0%
Antidote / Everyday

Antidote: Everyday Edition

A real-world AI leaderboard – real prompts, real stakes, graded by experts who read every word, check every citation, and run every line of code.


Today's release benchmarks everyday use. Agentic and enterprise workflows coming soon.

Rank  Model               Elo score (95% CI)
1     Gemini 3.1 Pro      1099 (1088-1109)
2     Gemini 3.5 Flash    1082 (1068-1096)
3     Qwen3.7 Max         1065 (1053-1077)
4     GLM 5.2             1060 (1044-1076)
5     Opus 4.7            1054 (1041-1067)
6     Kimi K2.6           1052 (1042-1061)
7     Opus 4.6            1050 (1039-1060)
8     Opus 4.8            1046 (1031-1062)
9     Sonnet 4.6          1028 (1014-1043)
10    GPT-5.5             1027 (1012-1041)
11    DeepSeek V4 Pro     1019 (1005-1032)
12    Kimi K2.5           1018 (1007-1029)
13    Grok 4.20 Beta      1013 (1002-1024)
14    Qwen3.5 Plus        1011 (1000-1023)
15    DeepSeek V4 Flash   990 (977-1003)
16    DeepSeek V3.2       968 (956-981)
17    Grok 4.3            965 (953-977)
18    Mistral Large 3     962 (950-973)
19    Haiku 4.5           957 (942-973)
20    Ernie 5.1           956 (943-970)
21    GPT-5.4 Mini        948 (933-964)
22    Gemma 3 12B         905 (889-920)
23    Ernie 4.5 300B      882 (870-894)
24    Nova 2 Pro          842 (830-853)
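
For readers less familiar with Elo tables, the sketch below shows one way to read the numbers above. It assumes the conventional logistic Elo model (base 10, scale 400); the leaderboard does not state which rating model or scale it uses, so the win probabilities are illustrative only. The function names `expected_score` and `intervals_overlap` are hypothetical helpers, and the ratings and intervals are copied from the Antidote table.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional logistic Elo model.

    ASSUMPTION: base-10 logistic with scale 400; the benchmark's actual
    rating model is not documented here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a: tuple, ci_b: tuple) -> bool:
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Ratings and 95% intervals as listed in the table.
gemini_31_pro = (1099, (1088, 1109))
gemini_35_flash = (1082, (1068, 1096))
nova_2_pro = (842, (830, 853))

# A 17-point gap corresponds to only a slight edge under this model.
p_top_two = expected_score(gemini_31_pro[0], gemini_35_flash[0])
print(f"Gemini 3.1 Pro vs Gemini 3.5 Flash: {p_top_two:.3f}")

# Overlapping intervals suggest the ordering of two models may not be settled.
print(intervals_overlap(gemini_31_pro[1], gemini_35_flash[1]))
print(intervals_overlap(gemini_31_pro[1], nova_2_pro[1]))
```

Comparing intervals this way is a rough heuristic rather than a significance test: two models whose intervals overlap may still differ, but neighbouring ranks with heavily overlapping intervals should be read as close to tied.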
Professional Multimodal Reasoning

GDP.pdf

Can frontier models master the documents that run the world? GDP.pdf is a multimodal and reasoning benchmark that takes real-world prompts and PDFs pulled directly from expert professional workflows.

Rank  Model                            Score
1     Claude Fable 5 / Mythos 5        30%
2     GPT-5.5 (xHigh reasoning)        25%
3     Claude Opus 4.8 (Adaptive Max)   23%
4     Claude Opus 4.7 (Adaptive Max)   21%
5     Gemini 3.1 (Pro)                 17%
6     Gemini 3.5 Flash                 14%
7     Kimi K2.6                        12%
8     Gemini 3 Flash                   10%
9     Grok 4.3 (High)                  8%
10    Nova 2 (Pro)                     2%
11    NVIDIA Nemotron 3 Nano Omni      2%
12    Mistral Large 3                  2%
Mathematics at the frontier

Riemann-bench

We evaluate AI models on advanced mathematical problems requiring deep reasoning and novel synthesis. Our benchmark features problems from cutting-edge mathematics, sourced from leading mathematicians – Ivy League professors, PhD IMO medalists, graduate students at the top of their field – in the course of their research.

Rank  Model                               Score
1     Claude Fable 5 / Mythos 5           55%
2     GPT-5.5 (xHigh reasoning)           41.6%
3     GPT-5.2 (xHigh reasoning)           32%
4     Claude Opus 4.8                     25.6%
5     Claude Opus 4.6                     22.4%
6     Claude Opus 4.7                     20.8%
7     Gemini 3.1 (Pro)                    15.2%
8     Gemini 3.5 Flash (High Reasoning)   15.2%
9     Claude Opus 4.5                     10.4%
10    Kimi K2.6                           10.4%
11    Kimi K2.5                           8%
12    DeepSeek V4 (Flash)                 5.6%
13    DeepSeek v3.2 (Thinking)            4.8%
14    Qwen 3.7 (Max)                      4.8%
15    DeepSeek V4 (Pro)                   2.4%
Enterprise Agents in Realistic RL Environments

EnterpriseBench: CoreCraft

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.

Rank  Model                             Score
1     Fable 5 (Max reasoning)           70.3%
2     GPT-5.5                           52.8%
3     Claude Opus 4.8 (Max reasoning)   52.3%
Creative, Business, and Everyday writing

Hemingway-bench

Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths. Our goal: to push AI writing from two-second vibes to genuine nuance and impact.

Rank  Model                Elo score (95% CI)
1     Gemini 3.1 Pro       1079 (1061-1096)
2     Gemini 3.5 Flash     1078 (1057-1098)
3     Gemini 3 Flash       1067 (1052-1083)
4     Gemini 3 Pro         1064 (1041-1088)
5     Opus 4.8 (Default)   1051 (1029-1072)
6     Opus 4.6             1050 (1033-1068)
7     Opus 4.7             1049 (1030-1068)
8     GPT-5.5              1048 (1029-1068)
9     GLM-5.2              1046 (1024-1067)
10    Kimi K2.6            1036 (1015-1057)
11    Opus 4.5             1027 (1007-1047)
12    DeepSeek V4 Pro      1014 (995-1033)
