Microsoft used Surge human evaluations to benchmark MAI-Thinking-1

Table of contents

When Microsoft AI released MAI-Thinking-1, they wanted to establish something benchmarks alone can't show: whether people actually prefer what the model produces. To measure it, and to publish it, they ran blind human preference evaluations with Surge.

Microsoft used our pool of generalist and professional expert raters to compare models in side-by-side, blind preference tests across 1,276 tasks, spanning a wide range of real use cases in both single-turn and multi-turn conversations. The evaluation focused on what people actually care about: whether a model understands the task, follows instructions, uses the right level of detail, writes clearly, and advances the user's goal. On the strength of that evaluation, Microsoft reported in both their launch and their technical report that raters preferred MAI-Thinking-1 over Claude Sonnet 4.6.

When frontier labs want to show how their model stacks up against the best, they use Surge human evaluations to prove it.

Why labs stake claims on human preference

Automated benchmarks are essential, but they measure a narrow slice of what makes a model good to use. A model can post strong benchmark numbers and still be subtly frustrating in practice: over-long when you wanted a direct answer, brittle when a task is phrased unusually, technically correct but missing the point. Those qualities decide whether someone is glad they used the model, and they're hard to capture in an automated score.

Blind human evaluation measures them directly. By putting two responses side by side and asking qualified raters which one actually served the user better, you get a read on the experience itself, not a proxy for it. That's why a result like this carries weight in a launch: it reflects how real people respond to the model, not how it performs on a fixed test.

The evaluation is only as good as the judgment behind it

A preference result is only as trustworthy as the judgment that produced it. Two responses can both look fine at a glance; knowing which one actually served the user better takes raters who can tell the difference, and an evaluation designed to surface it. That's the core of what Surge does: we field generalist and professional expert raters, design blind side-by-sides that isolate the qualities users actually care about, and produce evaluations a lab can stand behind publicly. It's how labs measure themselves against the best models in the world.

Learn more about Surge's evaluation work on our Leaderboards & Benchmarks page.

/surge-ai

@hellosurgeai