Leaderboards

Moving AI evaluation beyond the lab
Creative, Business, and Everyday writing

Hemingway-bench

Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths.
Rank  Model            Elo score (95% CI)
1     Gemini 3.1 Pro   1093 (1072-1114)
2     Gemini 3 Flash   1088 (1070-1106)
3     Gemini 3 Pro     1082 (1060-1105)
4     Opus 4.6         1064 (1043-1086)
5     Opus 4.5         1047 (1028-1066)
6     Sonnet 4.6       1032 (1012-1053)
7     Kimi K2.5        1029 (1010-1049)
8     GPT-5.2 Chat     1029 (1010-1048)
9     GPT-5.4          1022 (999-1044)
10    Qwen3.5 Plus     1008 (988-1028)
11    GPT-5.2          990 (967-1013)
12    Qwen3 Max        983 (958-1008)
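To read the Elo scores and 95% confidence intervals listed above, it can help to turn a score gap into an expected head-to-head win rate and to check whether two intervals overlap. The sketch below assumes the conventional Elo formula (base 10, scale 400); the page does not say which rating model or scale Hemingway-bench uses, so treat the numbers as illustrative only. The function names are hypothetical, and interval overlap is only a rough heuristic, not a formal significance test.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under the
    conventional Elo formula (assumed here, not confirmed by the source)."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Scores and intervals copied from the leaderboard above.
gemini_31_pro = (1093, (1072, 1114))
opus_46 = (1064, (1043, 1086))
qwen3_max = (983, (958, 1008))

if __name__ == "__main__":
    # A 29-point gap corresponds to a modest edge under the assumed formula.
    print(round(expected_win_rate(gemini_31_pro[0], opus_46[0]), 3))
    # Overlapping intervals: the ordering of these two is less certain.
    print(intervals_overlap(gemini_31_pro[1], opus_46[1]))
    # Disjoint intervals: the gap is larger than the reported uncertainty.
    print(intervals_overlap(gemini_31_pro[1], qwen3_max[1]))
```

Under these assumptions, neighbouring models with overlapping intervals should be read as roughly tied, while a gap wider than both intervals is more likely to reflect a real difference.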
Enterprise Agents in Realistic RL Environments

EnterpriseBench: CoreCraft

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks.
Rank  Model                                                        Score
1     GPT-5.2 (xHigh reasoning)                                    42.6%
2     GPT-5.4 (xHigh reasoning)                                    36.4%
3     Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)   30.8%