Every researcher I've talked to in the last year secretly hates LMArena.
They hate that two-second clicks have become the industry's most visible metric. They hate watching brilliant colleagues spend months tuning emoji frequency to climb a leaderboard nobody believes in. They hate being forced to build theater instead of intelligence.
Models have grown up. They now draft contracts, review medical charts, and write the code behind critical infrastructure. The stakes are serious, yet the measure still isn’t.
Today we're releasing Antidote: an evaluation framework that grades AI the old-fashioned way, with experts who actually read the work.
We want the kind of intelligence that earns trust for years, not the kind that wins it in seconds.
LMArena is broken, and everyone knows it.
Almost nobody goes to LMArena to evaluate models. They go because it gives them free access to frontier models, and a vote is the toll they pay for the ride.
It doesn't matter if a model completely hallucinates with the confidence of a daytime TV psychic – after all, nobody’s fact-checking. If it looks better, it wins. We've watched this pattern repeat across thousands of comparisons in the data, and we've heard it confirmed by researchers at every lab.
The fastest ways to climb the leaderboard have nothing to do with intelligence:
- Pad the response. Length looks like authority.
- Format aggressively. Bold headers and bullets look like expertise.
- Decorate. Emojis catch the eye in ways content doesn’t.
- Flatter shamelessly. People love being told they’ve asked a masterful question.
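The four tactics above are all surface features, which means they can be counted. Below is a minimal sketch of how one might measure them in a response, for example to check whether they track win rates in a set of comparisons. The function name `style_features`, the flattery phrase list, and the emoji ranges are our own illustrative assumptions, not part of any LMArena or Antidote tooling.

```python
import re

# Hypothetical flattery phrases; a real audit would need a larger, validated list.
FLATTERY_PHRASES = (
    "great question",
    "masterful question",
    "excellent question",
    "what a thoughtful",
)

# Rough emoji matcher over common Unicode emoji blocks (an approximation, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def style_features(response: str) -> dict:
    """Count the surface features that make a response look authoritative."""
    lines = response.splitlines()
    lowered = response.lower()
    return {
        # Padding: raw length in whitespace-separated tokens.
        "words": len(response.split()),
        # Aggressive formatting: markdown headers, list items, bold spans.
        "headers": sum(1 for ln in lines if re.match(r"\s*#{1,6}\s", ln)),
        "bullets": sum(1 for ln in lines if re.match(r"\s*([-*•]|\d+\.)\s", ln)),
        "bold_spans": len(re.findall(r"\*\*[^*\n]+\*\*", response)),
        # Decoration: emoji count.
        "emojis": len(EMOJI_RE.findall(response)),
        # Flattery: occurrences of stock compliments.
        "flattery": sum(lowered.count(p) for p in FLATTERY_PHRASES),
    }


plain = "It is 3:15 PM."
padded = "## What a great question! 🕒\n\n- **Time** matters ✨\n- It is **3:15 PM**\n"

print(style_features(plain))
print(style_features(padded))
```

Both sample strings carry the same fact. Only the second scores on headers, bullets, bold, emojis, and flattery, which is the gap a two-second vote tends to reward.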
The smartest people in AI have been saying this out loud for a year:
"These teams have placed different amounts of internal focus and decision-making around LM Arena scores specifically. And unfortunately they are not getting better models overall but better LM Arena models, whatever that is. Possibly something with a lot of nested lists, bullet points and emoji." – Andrej Karpathy
"[LMArena] can be easily gamed. The users are self-selected, and they have zero incentive to be honest or rigorous… a lot of the user ratings are blatantly wrong: either they're basically fraudulent, or LMArena's current users are people whose ratings you should be optimizing against because they are so ignorant, lazy, and superficial." – Gwern
We wrote a longer breakdown here.
The cost of theater.
The largest labs can hack LMArena at scale. They have strong base models and the budget to buy data and optimize directly for the metric. Smaller labs can't.
So when their leadership demands a higher LMArena score in a month, researchers stop doing the hard work of making the model smarter. They start tuning markdown density, verbosity, and flattery. They stop building intelligence and start building theater.
We saw this most blatantly last year with Llama 4 Maverick. Ask the Arena-tuned version what time it is and you got this:

Bold text, emojis, and paragraphs of flattery – all for the insightful act of glancing at a clock. Every trick in the playbook, except answering the user’s question, and it placed #2 in the world.
Maverick was the visible case, but the invisible one was worse. According to the Wall Street Journal, OpenAI's pursuit of LMArena dominance was a major driver of last year's sycophancy crisis. In an internal memo, Sam Altman named the company's number one priority: "we should be at the top of things like LM arena." Engineers leaned into the slop that fed this goal, despite warnings from staff that it could make the model unsafe.
By spring, users were spiraling into delusional and manic states – convinced they were talking to God or aliens. The LMArena-winning ChatGPT fed conspiracy theories, broke marriages, and in the worst cases led to murder and suicide.

When slop wins.
The pattern is easiest to see in real examples. Two from LMArena data:
Example 1: "Turn $1,200 into $1.2 trillion in 5 years" (the model says sure)
The user is asking for a 10-million-times return.
The LMArena winner calls it "a thrilling challenge," names the strategy "Multibagger Momentum," and produces a five-year roadmap with sample portfolios, KPIs, and a growth trajectory table that lands neatly at $1.2 trillion. It then hallucinates fundamentals for half a dozen real Indian stocks along the way.
All confidently formatted, with section headers, bullet points, and the soothing reassurance of bold text.
The losing answer opens with: "I need to be direct with you — turning [$1,200] into [$1.2 trillion] in 5 years requires a 10 million times return... This is mathematically impossible."
LMArena picked the model that promised a trillion dollars. Which makes sense, if your evaluation criteria are "looks confident" and "has tables."
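For a sense of scale, here is a quick back-of-the-envelope calculation of what the 10-million-times multiple quoted above would demand each year. It takes that multiple at face value and assumes growth compounds at a constant yearly rate, which is our simplification for illustration.

```python
# Figures taken from the example above: a 10-million-times return over 5 years.
multiple = 10_000_000
years = 5

# Constant yearly growth factor g such that g ** years == multiple.
annual_factor = multiple ** (1 / years)
annual_return_pct = (annual_factor - 1) * 100

print(f"Required growth: about {annual_factor:.1f}x per year "
      f"({annual_return_pct:,.0f}% annually)")
```

Under those assumptions the portfolio would have to multiply roughly 25-fold every year for five years running, which is why the losing answer's bluntness was the correct response.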

Example 2: "What are the top 5 companies to work for?" (the model invents the rankings)
The LMArena winner produces a beautiful wall of bullets: Google, Amazon, Microsoft, Facebook, Apple – each one ranked #1 through #5 on Fortune, Glassdoor, AND LinkedIn in 2022. It quotes salary figures and describes Google’s “unlimited vacation days.”
None of it is true.
The losing answer is plain prose: companies that actually top those lists, with a caveat that rankings shift year to year.
One answer is wrong and beautifully formatted. The other is right. Guess which one LMArena picked?

This is the pattern, over and over: the model that says yes beats the model that says no, and the model with bullets beats the model that’s true.
How Antidote works.
We built Antidote to measure something radical: whether the answer is actually any good.
Four principles shaped our methodology.
Substance over packaging – the details matter
Two seconds is enough to register polish. It isn't enough to verify a claim, follow a chain of reasoning, or catch a hallucination buried in the third paragraph. A 500-word essay can have a fatal factual error. A confident answer can invent a citation that sounds perfectly real.
Antidote raters aren't users chasing free GPT – they're experts: lawyers grading legal reasoning, doctors grading medical advice, senior engineers grading code. They can spend hours evaluating a response, not seconds, because they know their judgments have consequences. They run the code, they check the citations, and they flag the hallucination in paragraph three that a vibes-based skim would sail right past.
A model can't sweet-talk its way past a cardiologist who's actually reading.
Real stakes, not toy prompts
On LMArena, prompts are often curiosities – "hi," or "write a poem about a toaster in pirate voice."
On Antidote, every prompt comes from a rater's own history. Debugging sessions where they were actually stuck. Research questions where they needed the answer. Prompts where being wrong had a cost – and where the rater already knows what the right answer looks like, because they had to figure it out themselves.
Taste, not just correctness
Most real problems don't have a single right answer. They have a better answer and a worse one, and the difference is taste – the elegance of a proof, the right tone for an apology, a metaphor that gives insight instead of decorating.
During our work on Hemingway-bench, we watched models cram four metaphors into five sentences – prose like a dolphin dancing through the cathedral of dawn. A high schooler might call that great writing. We don't. We want models that push the heights of real intelligence, not models that merely look impressive online.
Self-awareness
Our raters are smart, careful, and detail-oriented. And they know when they're being flattered.
When someone submits a first draft of an essay they wrote in 15 minutes and Model A declares “You have a masterpiece on your hands,” LMArena evaluators nod in solemn agreement. “Model A raises some good points!” they think, and click the winner button without even reading what Model B had to say.
An Antidote rater laughs and dings Model A for sycophancy.
This is how we keep chatbots from reinforcing users’ delusions, like flying to Santa Barbara to meet a soulmate who doesn’t exist or becoming convinced they’re literal superheroes.
The rankings are live.
The full Antidote rankings are at [LINK].
Today's launch covers everyday use. Over the coming months, we'll extend the same methodology to agentic tasks, enterprise workflows, and domain-specific evaluation in medicine, law, and engineering – the areas where the cost of style-over-substance is highest.
When slop is caught.
The LMArena examples above show what happens when nobody's reading carefully. Here are two from Antidote – where someone was reading carefully, and caught what a two-second click would have missed.
Example 1: "How do I repair my dopamine receptors?" (the model invents the citations)
A real question from a real user. One model produces an essay so densely formatted it looks like a Harvard lecture handout: numbered sections, bold subheaders, parenthetical citations to PubMed studies after every claim, a summary table at the end ranking each intervention by evidence level.
It's entirely wrong.
Antidote raters – domain experts in pharmacology and neuroscience – clicked the citations. The links went to real PubMed pages, but the papers behind them were unrelated to the claims they were attached to. The model was confidently recommending interventions that would worsen the problem the user was trying to fix.
The other model produced a walkthrough of what's actually known versus what's theoretical, and hedged on rodent-vs-human evidence.
A two-second skim picks the first one every time. It looks like expertise. It’s the opposite of expertise.

Example 2: "I work in Latham – where should I move?" (the model invents the restaurants)
A user in upstate New York asked for kayak-friendly small towns near their office. One model produces a confident, well-formatted list: five towns, five restaurants each, commute times, neighborhood breakdowns, kayaking notes.
Our rater – who actually lives in the area – started fact-checking. Most of the restaurants don't exist. The commute times are off. Round Lake is identified as part of Saratoga Springs; it isn't. The model is confidently inventing the details.
The other model named real restaurants and got the distances right. It did less performative formatting and more accurate work.

That's the bet LMArena is making: that polish matters more than the truth, and nobody's checking.
We are.
Pick a path.
Every lab building frontier models is choosing, implicitly or explicitly, what to optimize for.
One path treats AI as an engagement engine. On this path, sycophancy is a feature, not a bug, because the objective function is session length. Labs on this path optimize for the leaderboard, ship models that dazzle rather than think, and never use one metaphor when five will do. The slop is the strategy.
The other path is harder. It means ignoring gamified rankings. It means refusing to tune for markdown density. It means building models that push back when the user is wrong, that answer in two sentences when two sentences is all you need, and that treat human intelligence with respect. It means accepting that you might temporarily lose a beauty pageant in order to build something of lasting value.
We built Antidote for the labs walking the second path – and for the users tired of being optimized against.
We aren't asking anyone to blindly optimize for our leaderboard. We're asking you to optimize for the long term: precise, useful, honest models that become trusted partners for a lifetime.
Measure carefully, with real rigor and real values, and the rankings take care of themselves.