Leaderboards
Moving AI evaluation beyond the lab

1
OpenAI
3

2
Google
2

3
Anthropic
1
Creative, Business, and Everyday writing
Hemingway-bench
Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths.
Our goal: to push AI writing from two-second vibes to genuine nuance and impact.
Rank  Model                    Elo score (95% CI)  Developer
1     Gemini 3.1 Pro           1093 (1072-1115)    Google
2     Gemini 3 Flash           1089 (1070-1107)    Google
3     Gemini 3 Pro             1080 (1056-1103)    Google
4     Claude Opus 4.6          1062 (1040-1083)    Anthropic
5     Claude Opus 4.5          1051 (1032-1070)    Anthropic
6     Claude Sonnet 4.6        1034 (1012-1055)    Anthropic
7     GPT 5.2 Chat             1031 (1012-1050)    OpenAI
8     Kimi K2.5                1029 (1009-1049)    Moonshot AI
9     Qwen 3.5 Plus            1009 (988-1029)     Alibaba Cloud
10    GPT 5.2                  995 (971-1019)      OpenAI
11    Qwen 3 Max               984 (958-1011)      Alibaba Cloud
12    Grok 4.1 Fast Reasoning  958 (938-977)       xAI
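To make the Elo gaps and confidence intervals above easier to read, here is a minimal sketch in Python. It assumes the conventional Elo logistic curve with a 400-point scale; the leaderboard does not state which rating model or scale it uses, so treat the win rates as illustrative only. The function names `expected_win_rate` and `intervals_overlap` are hypothetical helpers, not part of any Hemingway-bench tooling.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo curve.

    ASSUMPTION: 400-point logistic scale (classic Elo). The leaderboard's
    actual rating model is not documented in the source.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True if two (low, high) confidence intervals share any point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Ratings and 95% CIs copied from the table above.
gemini_31_pro = (1093, (1072, 1115))
gemini_3_flash = (1089, (1070, 1107))
grok_41_fast = (958, (938, 977))

# A 4-point gap: close to a coin flip, and the intervals overlap.
near_tie = expected_win_rate(gemini_31_pro[0], gemini_3_flash[0])
near_tie_overlap = intervals_overlap(gemini_31_pro[1], gemini_3_flash[1])

# A 135-point gap: a clear preference, and the intervals are disjoint.
clear_gap = expected_win_rate(gemini_31_pro[0], grok_41_fast[0])
clear_gap_overlap = intervals_overlap(gemini_31_pro[1], grok_41_fast[1])

print(f"Gemini 3.1 Pro vs Gemini 3 Flash: {near_tie:.3f}, CIs overlap: {near_tie_overlap}")
print(f"Gemini 3.1 Pro vs Grok 4.1 Fast:  {clear_gap:.3f}, CIs overlap: {clear_gap_overlap}")
```

Under these assumptions, the top two models are nearly indistinguishable, while the first and twelfth entries are separated by a gap that corresponds to roughly a two-in-three preference rate. Overlapping intervals are only a rough signal that a ranking difference may not be meaningful; they are not a formal significance test.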
Creative, Business, and Everyday writing
EnterpriseBench: CoreCraft
Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks.
Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.
Rank  Model                                                       Score  Developer
1     GPT-5.2 (xHigh reasoning)                                   42.6%  OpenAI
2     GPT-5.4 (xHigh reasoning)                                   36.4%
3     Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)  30.8%  Anthropic