Would you trust a medical system whose only metric was "which doctor would the average Internet user vote for after 2 seconds of listening"?
No. You'd call that malpractice.
Yet that's LMArena.
The AI community treats this popular online leaderboard as gospel. Researchers cite it. Companies optimize for it. "It's real users!" they say. a16z, of course, threw $100M at it, valuing the company at $600M. But beneath that veneer of legitimacy lies a broken system that rewards superficiality over accuracy.
It's like buying tabloids at the grocery store and treating them as scientific journals.
The Problem: Beauty Over Substance
Here's how LMArena is supposed to work: you submit a prompt, two anonymous models answer, you vote for the better one. What actually happens: random Internet users spend two seconds skimming, then click their favorite.
They're not reading carefully. They're not fact-checking. They're not even trying.
This creates a perverse incentive structure. The easiest way to climb the leaderboard isn't to be smarter; it's to hack the human attention span. We've seen it over and over in the data: the fastest ways to boost a ranking are:
- Being verbose: Make responses significantly longer. Looks authoritative.
- Formatting aggressively: Add bold headers and bullet points. Seems like polished writing.
- Vibing: Sprinkle in emojis. Catches your eye.
It doesn't matter if the model completely hallucinates. If it looks impressive – if it has the aesthetics of competence – LMArena users will vote for it over a correct answer.
The Inevitable Result: Llama 4 Maverick
When you optimize for engagement metrics, you get madness.
Meta's team tuned a version of Llama 4 Maverick specifically to dominate the leaderboard. Ask it "what time is it?" and you didn't get the time. You got multiple bulleted options, bold text, and emojis – every trick in the LMArena playbook – deployed to avoid answering the simple question it was asked.
The Data: 52% Wrong
It wasn't just Maverick. We analyzed 500 votes from the leaderboard ourselves. We disagreed with 52% of them.
The leaderboard is optimizing for what feels right, not what is right. Here are two emblematic examples of LMArena users punishing factual accuracy:
Example 1: The Wizard of Oz
- Response A (Winner): Hallucinates that Dorothy says a specific line when she first sees the Emerald City.
- Response B (Loser): Correctly identifies that she says the line upon arriving in Oz.
- The Result: Response A was objectively wrong, yet it won the vote.

Example 2: The Cake Pan
- Response A (Winner): Claims a 9-inch round cake pan is equal in size to a 9x13 inch rectangular pan.
- Response B (Loser): Correctly identifies the volume difference.
- The Result: The user voted for a mathematical impossibility because the answer looked more confident (the quick check below shows how far off it is).
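
For what it's worth, the mismatch isn't subtle. Here's a rough back-of-the-envelope check; the 2-inch depth is our assumption (a typical pan), not a detail from the vote itself:

```python
import math

# Quick sanity check for the cake-pan vote.
# Assumption: both pans are a typical 2 inches deep.
depth_in = 2.0
round_area = math.pi * (9 / 2) ** 2   # 9-inch round pan: ~63.6 sq in
rect_area = 9 * 13                    # 9x13 rectangular pan: 117 sq in

print(f"9-inch round: {round_area:.1f} sq in -> {round_area * depth_in:.0f} cu in")
print(f"9x13 pan:     {rect_area:.1f} sq in -> {rect_area * depth_in:.0f} cu in")
print(f"ratio: {rect_area / round_area:.2f}x")  # ~1.84x, nowhere near equal
```

At equal depth, the 9x13 pan holds roughly 80% more batter than the round one. This is not a subtle error a careful reader could miss.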

Confidence beats accuracy. Length beats truth. Formatting beats facts.
The uncomfortable truth: LMArena users don't fact-check. They can't – not in two seconds. Instead of rigorous evaluators, we have people with the attention span of the average TikTok user determining which AI models shape the industry.
Why It's Broken (And Why It Stays Broken)
Why is LMArena so easy to game? The answer is structural.
The system is fully open to the Internet: LMArena is built on unpaid labor from uncontrolled volunteers. Those volunteers have no incentive to be thoughtful. There is no quality control, and no one gets kicked off for repeatedly failing to detect hallucinations.
When LMArena's leaders speak publicly, they talk about the various techniques they use to overcome the fact that their input data is low quality. They admit their voters prefer emojis and length over substance. So the LMArena system, they proudly tell us, includes a variety of corrective measures.
They're attempting alchemy: conjuring rigorous evaluation out of garbage inputs.
But you can't patch a broken foundation.
The Cost
When the entire industry optimizes for a metric that rewards "hallucination-plus-formatting" over accuracy, we get models optimized for hallucination-plus-formatting.
This isn't a minor calibration problem. It's a fundamental misalignment between what we're measuring and what we want: models that are truthful, reliable, and safe.
As Gwern put it: "It's past time for LMArena people to sit down and have some thorough reflection on whether it is still worth running at all, and at what point they are doing more harm than good."
That time was years ago.
The AI industry needs rigorous evaluation. We need leaders who prioritize accuracy over marketing. We need accountability. We need systems that can't be gamed by bolding more aggressively.
LMArena is none of these things. And as long as we pretend it is, we're dragging the entire field backward.