Leaderboards

Moving AI evaluation beyond the lab
1
OpenAI
3
2
Google
2
3
Anthropic
1
Creative, Business, and Everyday writing

Hemingway-bench

Stop rewarding slop. We take real-world writing tasks and put them in front of master wordsmiths.
Our goal: to push AI writing from two-second vibes to genuine nuance and impact.
Rank  Model                    Elo score  95% CI
1     Gemini 3.1 Pro           1093       1072-1115
2     Gemini 3 Flash           1089       1070-1107
3     Gemini 3 Pro             1080       1056-1103
4     Claude Opus 4.6          1062       1040-1083
5     Claude Opus 4.5          1051       1032-1070
6     Claude Sonnet 4.6        1034       1012-1055
7     GPT 5.2 Chat             1031       1012-1050
8     Kimi K2.5                1029       1009-1049
9     Qwen 3.5 Plus            1009       988-1029
10    GPT 5.2                  995        971-1019
11    Qwen 3 Max               984        958-1011
12    Grok 4.1 Fast Reasoning  958        938-977
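
Reading the numbers: the page does not say how the Elo scores are computed, so the sketch below is an assumption, not the benchmark's method. It uses the conventional Elo formula (logistic curve, 400-point scale) to turn a rating gap into an expected head-to-head win rate, and checks whether two 95% confidence intervals overlap. The function names are illustrative; the ratings and intervals come from the table above.

```python
def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win rate of A over B under the conventional Elo model.

    Assumption: standard 400-point logistic scale. The leaderboard does not
    state its exact rating method, so treat the result as a rough guide.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """True when two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Ratings and 95% CIs copied from the Hemingway-bench table.
gemini_31_pro = (1093, (1072, 1115))
gemini_3_flash = (1089, (1070, 1107))
grok_41_fast = (958, (938, 977))

# Top of the table versus the bottom: a 135-point gap.
p_top_vs_bottom = expected_win_probability(gemini_31_pro[0], grok_41_fast[0])
print(f"Gemini 3.1 Pro over Grok 4.1 Fast Reasoning: {p_top_vs_bottom:.3f}")

# First versus second place: a 4-point gap with overlapping intervals.
p_first_vs_second = expected_win_probability(gemini_31_pro[0], gemini_3_flash[0])
print(f"Gemini 3.1 Pro over Gemini 3 Flash: {p_first_vs_second:.3f}")
print("Intervals overlap:", intervals_overlap(gemini_31_pro[1], gemini_3_flash[1]))
```

Under that assumption, a gap of a few points between neighboring rows is close to a coin flip, and overlapping intervals mean the ordering of those rows may not be statistically distinguishable.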
Creative, Business, and Everyday writing

EnterpriseBench: CoreCraft

Stop testing models in tiny, self-contained environments. We built CoreCraft, a large-scale startup world, and deployed AI agents to solve real tasks.
Our goal: to move agents beyond the cleanliness of the lab and into the chaos of enterprise reality.
Rank  Model                                                       Score
1     GPT-5.2 (xHigh reasoning)                                   42.6%
2     GPT-5.4 (xHigh reasoning)                                   36.4%
3     Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)  30.8%