Leaderboards
Moving AI evaluation beyond the lab

1  Open AI    3
2  Google     2
3  Anthropic  1
Creative, Business, and Everyday writing
Hemingway-bench
Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths.
Rank  Model           Elo score (95% CI)  Organization
1     Gemini 3.1 Pro  1093 (1072-1114)    Google
2     Gemini 3 Flash  1088 (1070-1106)    Google
3     Gemini 3 Pro    1082 (1060-1105)    Google
4     Opus 4.6        1064 (1043-1086)    Anthropic
5     Opus 4.5        1047 (1028-1066)    Anthropic
6     Sonnet 4.6      1032 (1012-1053)    Anthropic
7     Kimi K2.5       1029 (1010-1049)    Moonshot AI
8     GPT-5.2 Chat    1029 (1010-1048)    Open AI
9     GPT-5.4         1022 (999-1044)     Open AI
10    Qwen3.5 Plus    1008 (988-1028)     Alibaba Cloud
11    GPT-5.2         990 (967-1013)      Open AI
12    Qwen3 Max       983 (958-1008)      Alibaba Cloud
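
To help read the Elo column, here is a minimal Python sketch of two common checks: converting a rating gap into an expected head-to-head win rate, and testing whether two 95% confidence intervals overlap. It assumes the conventional logistic Elo formula with a 400-point scale; the page does not say which rating convention Hemingway-bench uses, so treat the win rates as illustrative. The names `RATINGS`, `expected_win_rate`, and `intervals_overlap` are hypothetical helpers, not part of any leaderboard API.

```python
# Elo point estimates and 95% CI bounds, copied from the Hemingway-bench table:
# model -> (elo, ci_low, ci_high)
RATINGS = {
    "Gemini 3.1 Pro": (1093, 1072, 1114),
    "Gemini 3 Flash": (1088, 1070, 1106),
    "Gemini 3 Pro": (1082, 1060, 1105),
    "Opus 4.6": (1064, 1043, 1086),
    "Opus 4.5": (1047, 1028, 1066),
    "Sonnet 4.6": (1032, 1012, 1053),
    "Kimi K2.5": (1029, 1010, 1049),
    "GPT-5.2 Chat": (1029, 1010, 1048),
    "GPT-5.4": (1022, 999, 1044),
    "Qwen3.5 Plus": (1008, 988, 1028),
    "GPT-5.2": (990, 967, 1013),
    "Qwen3 Max": (983, 958, 1008),
}


def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard Elo formula.

    Assumption: the leaderboard follows the usual base-10, 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def intervals_overlap(model_a, model_b):
    """Return True if the 95% CIs of the two models share at least one point."""
    _, low_a, high_a = RATINGS[model_a]
    _, low_b, high_b = RATINGS[model_b]
    return low_a <= high_b and low_b <= high_a


top, bottom = "Gemini 3.1 Pro", "Qwen3 Max"
p = expected_win_rate(RATINGS[top][0], RATINGS[bottom][0])
print(f"{top} vs {bottom}: expected win rate {p:.2f}")
print("Overlaps Opus 4.6:", intervals_overlap(top, "Opus 4.6"))
print("Overlaps Opus 4.5:", intervals_overlap(top, "Opus 4.5"))
```

Under these assumptions, the 110-point gap between the first and last rows corresponds to an expected win rate of roughly 0.65 for Gemini 3.1 Pro. Its interval overlaps that of Opus 4.6 but not that of Opus 4.5. Overlapping intervals are only a rough signal that two adjacent ranks may not be reliably separated, not a formal significance test.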
Enterprise Agents in Realistic RL Environments
EnterpriseBench: CoreCraft
Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks.
Rank  Model                                                       Score  Organization
1     GPT-5.2 (xHigh reasoning)                                   42.6%  Open AI
2     GPT-5.4 (xHigh reasoning)                                   36.4%  Open AI
3     Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)  30.8%  Anthropic