When Microsoft AI released MAI-Thinking-1, they wanted to know something benchmarks alone can't tell you: not just whether the model scores well, but whether people actually prefer what it produces. To measure that, they ran blind human evaluations with Surge.

Microsoft used our pool of generalist and professional expert raters to compare models in side-by-side, blind preference tests across 1,276 tasks, spanning a wide range of real use cases in both single-turn and multi-turn conversations. The focus was on the things people actually care about: whether a model understands the task, follows instructions, uses the right level of detail, writes clearly, and advances the user's goal. In Microsoft's evaluation and technical report, raters preferred MAI-Thinking-1 over Claude Sonnet 4.6.

We're sharing this because it points at something we think the field underuses: rigorous human preference evaluation as a complement to benchmarks.

Benchmarks tell you part of the story

Automated benchmarks are essential, but they measure a narrow slice of what makes a model good to use. A model can post strong benchmark numbers and still be subtly frustrating in practice: over-long when you wanted a direct answer, brittle when the task is phrased unusually, technically correct but missing the point. Those qualities are exactly what determine whether someone is glad they used the model, and they're hard to capture in an automated score.

That's what blind human evaluation is for. By putting two responses side by side and asking qualified people which one actually served the user better, you get a direct read on the experience, not a proxy for it. As Microsoft put it, human preference data is how you tell whether benchmark improvements translate into better experiences for real users.

Human signal, alongside the benchmarks

We build hard, expert-graded benchmarks because the field needs to know where models stand on difficult, well-defined capabilities. We build human preference evaluations for the same underlying reason: the goal isn't a model that wins on paper but one that people are actually glad to use, and measuring that takes real human judgment.

Work with us

If you're building a model and want rigorous human preference evaluation (blind side-by-sides, expert raters, and evaluations designed around the qualities your users actually care about), reach out to benchmark@surgehq.ai. You can find more about Surge's evaluation work on our Leaderboards & Benchmarks page.
