
Leaderboards

Greatness isn't accidental. How we measure it shouldn't be either. If we want AGI that builds billion-dollar enterprises and globe-spanning infrastructure, we can't evaluate it with clickbait and synthetic slop. We need benchmarks that test for intelligence and sophistication.

This is our definitive ranking of models, measured by their capacity for rigorous reasoning and real-world mastery. Discover which labs are leading the frontier.
View by: Creative, Business, and Everyday writing

Hemingway-bench

Stop rewarding slop.

We take real-world writing tasks and put them in front of master wordsmiths.

Our goal: to push AI writing from two-second vibes to genuine nuance and impact.

Rank  Model              Elo score (95% CI)
1     Gemini 3.1 Pro     1093 (1072-1114)
2     Gemini 3 Flash     1088 (1070-1106)
3     Gemini 3 Pro       1082 (1060-1105)
4     Claude Opus 4.6    1064 (1043-1086)
5     Claude Opus 4.5    1047 (1028-1066)
6     Claude Sonnet 4.6  1032 (1012-1053)
7     GPT-5.2 Chat       1029 (1010-1048)
8     Kimi K2.5          1029 (1010-1049)
9     GPT-5.4            1022 (999-1044)
10    Qwen3.5 Plus       1008 (988-1028)
11    GPT-5.2            990 (967-1013)
12    Qwen3 Max          983 (958-1008)
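Elo leaderboards like this one are typically built from head-to-head comparisons: a judge picks the better of two model outputs, and each result nudges the winner's rating up and the loser's down. As a rough illustration (the model names and battle log below are hypothetical, not Hemingway-bench data, and the exact K-factor and CI procedure used here are not specified on this page), a minimal Elo update loop looks like this:

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32):
    """Apply one pairwise comparison result to the rating table."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Hypothetical battle log: (winner, loser) pairs from human judgments.
battles = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for w, l in battles:
    update(ratings, w, l)
```

Confidence intervals like the 95% CI column are commonly obtained by bootstrap resampling of the battle log and recomputing ratings on each resample, though the specific method used for this leaderboard is an assumption here.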
Enterprise Agents in Realistic RL Environments

EnterpriseBench: CoreCraft

Stop testing models in tiny, self-contained environments.

We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks.

Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.

Rank  Model                                                        Score
1     GPT-5.2 (xHigh reasoning)                                    42.6%
2     GPT-5.4 (xHigh reasoning)                                    36.4%
3     Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)   30.8%
Mathematics at the frontier

Riemann-bench

We evaluate AI models on advanced mathematical problems that demand deep reasoning and novel synthesis. Our benchmark features problems from cutting-edge mathematics, posed by leading mathematicians – Ivy League professors, PhD IMO medalists, graduate students at the top of their field – in the course of their own research.

Rank  Model             Score
1     Gemini 3.1 Pro    6%
2     Claude Opus 4.6   6%
3     Gemini 3 Pro      4%
4     Kimi K2.5         4%
5     DeepSeek v3.2     3%
6     Claude Opus 4.5   2%
7     GPT-5.2           2%
