`Benchmarks`

Greatness isn't accidental. How we measure it shouldn't be either. If we want AGI that builds billion-dollar enterprises and globe-spanning infrastructure, we need benchmarks that test for intelligence and sophistication — not clickbait and slop.

This is our ranking of models, measured by their capacity for rigorous reasoning and real-world mastery.

Professional Graphical Reasoning

Chartography

Chartography is our benchmark for professional chart understanding. It tests whether frontier models can read the Kaplan-Meier curves, candlestick charts, contour maps, Sankey diagrams, Bode plots, and other specialized graphics that professionals use to make real decisions every day. It evaluates visual perception, domain-aware interpretation, and multi-step graphical reasoning.

| Model | Score |
| --- | --- |
| GPT 5.6 Sol (Max) | |
| GPT 5.6 Sol | 39.5 |
| Gemini 3.5 Flash | 35.9 |
| Claude Fable 5 (Adaptive/Max) | 34.8 |
| GPT 5.6 Terra (Max) | |
| GPT 5.6 Luna (Max) | 31.4 |
| GPT 5.5 (xHigh) | |
| GPT 5.4 (xHigh) | 29.6 |
| Claude Fable 5 | 29.5 |
| GPT 5.5 | 28.5 |
| Kimi K3 | 26.6 |
| GPT 5.6 Terra | 26.6 |
| Gemini 3.1 Pro | 26.1 |
| Muse Spark 1.1 (xHigh) | 24.4 |
| Muse Spark 1.1 | 23.6 |
| GPT 5.6 Luna | 21.4 |
| Grok 4.5 | 17.3 |
| Grok 4.5 (High) | 16.7 |
| Claude Sonnet 5 (Adaptive/Max) | 16.6 |
| Claude Opus 4.7 (Adaptive/Max) | 16.5 |
| Qwen 3.5 Plus | 15.9 |
| Claude Opus 4.8 (Adaptive/Max) | 15.9 |
| Qwen 3.7 Plus | 15.8 |
| GPT 5.4 | 14.1 |
| Claude Opus 4.7 | 13.6 |
| Grok 4.3 (High) | 12.9 |
| Kimi K2.5 | 12.6 |
| Kimi K2.6 | 12.2 |
| Claude Sonnet 5 | 12.2 |
| Grok 4.3 | 11.7 |
| Claude Opus 4.8 | 11.3 |
| Mistral Large 3 | |

Long-Context Agentic Instruction Following

HANDBOOK.md Agents

Can an agent follow a 100-page company handbook inside an enterprise RL environment?

HANDBOOK.md is a benchmark for long-context agentic instruction following, modeled on how professionals follow corporate policy in their day-to-day work. Each task is a unique RL environment with internal tools and external MCP servers, across five enterprise domains.

| Model | Score |
| --- | --- |
| Fable 5 (max reasoning) | 36.2 |
| Fable 5 | 34.2 |
| Opus 4.8 (max reasoning) | 21.9 |
| GPT-5.5 (xHigh reasoning) | 21.5 |
| GPT-5.5 | 21.5 |
| Opus 4.8 | 18.9 |
| GLM 5.2 | 12.7 |
| Gemini 3.5 Flash (high reasoning) | 11.2 |
| Sonnet 4.6 (max reasoning) | 10.4 |
| GLM 5.2 (xHigh reasoning) | |
| Gemini 3.1 Pro Preview | |
| Gemini 3.5 Flash | 9.2 |
| Deepseek V4 Pro (xHigh reasoning) | 9.2 |
| Qwen 3.7 Max | 8.5 |
| Sonnet 4.6 | 7.7 |
| Deepseek V4 Flash (xHigh reasoning) | 7.3 |
| Deepseek V4 Flash | 7.3 |
| Kimi K2.6 | 6.9 |
| Deepseek V4 Pro | 6.9 |
| Grok 4.3 (high reasoning) | 1.9 |
| Nemotron 3 Ultra | 1.5 |
| Grok 4.3 | 0.8 |

Antidote / Everyday

Antidote: Everyday Edition

A real-world AI leaderboard – real prompts, real stakes, graded by experts who read every word, check every citation, and run every line of code.

Today's release benchmarks everyday use. Agentic and enterprise workflows coming soon.

| Model | Elo score | 95% CI |
| --- | --- | --- |
| Gemini 3.1 Pro | 1099 | [1088, 1109] |
| Gemini 3.5 Flash | 1082 | [1068, 1096] |
| Qwen3.7 Max | 1065 | [1053, 1077] |
| GLM 5.2 | 1060 | [1044, 1076] |
| Opus 4.7 | 1054 | [1041, 1067] |
| Kimi K2.6 | 1052 | [1042, 1061] |
| Opus 4.6 | 1050 | [1039, 1060] |
| Opus 4.8 | 1046 | [1031, 1062] |
| Sonnet 4.6 | 1028 | [1014, 1043] |
| GPT-5.5 | 1027 | [1012, 1041] |
| DeepSeek V4 Pro | 1019 | [1005, 1032] |
| Kimi K2.5 | 1018 | [1007, 1029] |
| Grok 4.20 Beta | 1013 | [1002, 1024] |
| Qwen3.5 Plus | 1011 | [1000, 1023] |
| DeepSeek V4 Flash | 990 | [977, 1003] |
| DeepSeek V3.2 | 968 | [956, 981] |
| Grok 4.3 | 965 | [953, 977] |
| Mistral Large 3 | 962 | [950, 973] |
| Haiku 4.5 | 957 | [942, 973] |
| Ernie 5.1 | 956 | [943, 970] |
| GPT-5.4 Mini | 948 | [933, 964] |
| Gemma 3 12B | 905 | [889, 920] |
| Ernie 4.5 300B | 882 | [870, 894] |
| Nova 2 Pro | 842 | [830, 853] |
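One way to read the gaps in the table above is to convert a rating difference into an expected head-to-head win rate and to check whether two confidence intervals overlap. The page does not say how its Elo scores are fitted, so the sketch below assumes the conventional logistic Elo model with a 400-point scale; the function names `elo_expected_score` and `intervals_overlap` are illustrative and not part of any published tooling.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional logistic Elo model.

    The 400-point scale is an assumption; the leaderboard does not state it.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Figures copied from the top two rows of the Antidote table.
gemini_31_pro = {"elo": 1099, "ci": (1088, 1109)}
gemini_35_flash = {"elo": 1082, "ci": (1068, 1096)}

p = elo_expected_score(gemini_31_pro["elo"], gemini_35_flash["elo"])
overlap = intervals_overlap(gemini_31_pro["ci"], gemini_35_flash["ci"])
print(f"expected win rate: {p:.3f}, intervals overlap: {overlap}")
```

Under these assumptions, a 17-point lead corresponds to an expected win rate only slightly above one half, and the two intervals overlap, so adjacent ranks near the top are best read as close rather than decisive.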

Creative, Business, and Everyday writing

Hemingway-bench

Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths. Our goal: to push AI writing from two-second vibes to genuine nuance and impact.

| Model | Elo score | 95% CI |
| --- | --- | --- |
| Gemini 3.1 Pro | 1079 | [1061, 1096] |
| Gemini 3.5 Flash | 1078 | [1057, 1098] |
| Gemini 3 Flash | 1067 | [1052, 1083] |
| Gemini 3 Pro | 1064 | [1041, 1088] |
| Opus 4.8 (Default) | 1051 | [1029, 1072] |
| Opus 4.6 | 1050 | [1033, 1068] |
| Opus 4.7 | 1049 | [1030, 1068] |
| GPT-5.5 | 1048 | [1029, 1068] |
| GLM-5.2 | 1046 | [1024, 1067] |
| Kimi K2.6 | 1036 | [1015, 1057] |
| Opus 4.5 | 1027 | [1007, 1047] |
| DeepSeek V4 Pro | 1014 | [995, 1033] |
| Kimi K2.5 | 1007 | [990, 1023] |
| GPT-5.2 Chat | 1006 | [989, 1024] |
| Sonnet 4.6 | 1004 | [987, 1021] |
| Qwen3.5 Plus | 997 | [980, 1013] |
| DeepSeek V4 Flash | 996 | [977, 1016] |
| GPT-5.4 | 991 | [973, 1009] |
| GPT-5.2 | 967 | [942, 991] |
| Qwen3 Max | 960 | [934, 986] |
| Grok 4.1 Fast Reasoning | 932 | [916, 948] |
| K2 Instruct | 899 | [874, 924] |
| Llama 4 Maverick | 822 | [802, 843] |
| Nova 2 Pro | 769 | [740, 798] |

Enterprise Instruction Following

ComplexConstraints

A benchmark for the kind of instruction following professional work demands — where constraints depend on each other, fire conditionally, and must be inferred from context.

Rank

Model

Score

GPT-5.5 (xHigh reasoning)

GPT-5.5 (High reasoning)

GPT-5.4 (xHigh reasoning)

Gemini 3.1 Pro

GPT-5.4 (High reasoning)

GPT-5.5

Claude Fable 5 (Max reasoning)

Gemini 3.5 Flash

Claude Opus 4.6 (High reasoning)

Qwen3.7 Max

Claude Opus 4.8 (High reasoning)

Claude Fable 5 (High reasoning)

Claude Opus 4.8

Claude Opus 4.7

Claude Opus 4.7 (High reasoning)

Kimi K2.6

Claude Sonnet 4.6 (High reasoning)

GLM 5.2 (Max reasoning)

DeepSeek v4 (Pro)

GLM 5.2

Kimi K2.5

NVIDIA Nemotron 3 Ultra

Claude Opus 4.6

Grok 4.20 Beta

DeepSeek v4 (Flash)

Qwen3.5 Plus

Ernie 5.1

Claude Sonnet 4.6

GPT 5.4

DeepSeek v3.2

Mistral Large

Nova 2 Pro

Ernie 4.5

Professional Multimodal Reasoning

GDP.pdf

Can frontier models master the documents that run the world? GDP.pdf is a multimodal and reasoning benchmark that takes real-world prompts and PDFs pulled directly from expert professional workflows.

| Model | Score |
| --- | --- |
| GPT-5.6 Sol | 30.7 |
| Claude Fable 5 (Adaptive Max) | 29.8 |
| GPT-5.5 (xHigh reasoning) | |
| GPT-5.6 Terra | 24.7 |
| Claude Opus 4.8 (Adaptive Max) | |
| GPT-5.6 Luna | 22.7 |
| Claude Opus 4.7 (Adaptive Max) | |
| Sonnet 4.6 (Adaptive Max) | |
| Gemini 3.1 Pro | |
| Grok 4.5 (High) | |
| Gemini 3.5 Flash | |
| Kimi K2.6 | |
| Gemini 3 Flash | |
| Grok 4.3 (High) | |
| Nova 2 (Pro) | |
| NVIDIA Nemotron 3 Nano Omni | |
| Mistral Large 3 | |

Enterprise Agents in Realistic RL Environments

EnterpriseBench: CoreCraft Agents

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks. Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.

| Model | Score |
| --- | --- |
| Fable 5 (Max reasoning) | 70.3 |
| GPT-5.5 | 52.8 |
| Claude Opus 4.8 (Max reasoning) | 52.3 |

Mathematics at the frontier

Riemann-bench

We evaluate AI models on advanced mathematical problems requiring deep reasoning and novel synthesis. Our benchmark features problems from cutting-edge mathematics, sourced from leading mathematicians – Ivy League professors, PhD IMO medalists, graduate students at the top of their field – in the course of their research.

| Model | Score |
| --- | --- |
| GPT-5.6 Sol (Max) | 74.4 |
| Claude Fable 5 / Mythos 5 | |
| GPT-5.5 (xHigh reasoning) | 55.2 |
| Claude Opus 4.8 | 47.2 |
| GPT-5.4 (xHigh reasoning) | 41.6 |
| Grok 4.5 | 38.4 |
| Kimi K3 | 37.6 |
| GPT-5.2 (xHigh reasoning) | 37.6 |
| Gemini 3.5 Flash (High reasoning) | 36.8 |
| Gemini 3.1 Pro | 33.6 |
| Claude Opus 4.7 | 32.8 |
| Claude Opus 4.6 | 27.2 |
| Muse Spark 1.1 | 20.8 |
| Qwen 3.7 Max | 15.2 |
| Kimi K2.5 | |
| Claude Opus 4.5 | 11.2 |
| GLM 5.2 | 10.4 |
| DeepSeek V4 Flash | 10.4 |
| DeepSeek v3.2 (Thinking) | |
| Kimi K2.6 | 7.2 |
| DeepSeek V4 Pro | 5.6 |
