Antidote Leaderboard: Optimizing for You

Table of contents

Every researcher I've talked to in the last year secretly hates LMArena.

They hate that two-second votes have become the industry's most popular metric. They hate watching brilliant colleagues spend months tuning emoji frequency to climb a leaderboard nobody believes in. They hate being forced to build theater instead of intelligence.

Models have grown up. They now draft contracts, review medical charts, and write the code behind critical infrastructure. The stakes are serious, and the industry needs a better measure.

Today we're releasing Antidote: our real-world AI leaderboard, with real prompts and real stakes, graded by experts who actually read the work.

The question LMArena asks is: which answer do you prefer in the two seconds after you read it? Antidote asks instead: which answer would you still be glad you got a month later? Which one actually helped you, taught you something, and held up?

LMArena optimizes for clicks, while Antidote optimizes for you.

LMArena is broken, and everyone knows it.

Almost nobody goes to LMArena to evaluate models. They go because it gives them free access to frontier models, and a vote is the toll they pay for the ride.

It doesn't matter if a model completely hallucinates with the confidence of a daytime TV psychic. Nobody’s fact-checking. If it looks better, it wins. We've watched this pattern repeat across thousands of comparisons in the data, and we've heard it confirmed by researchers at every lab.

The fastest ways to climb the leaderboard have nothing to do with intelligence:

Pad the response. Length looks like authority.
Format aggressively. Bold headers and bullets look like expertise.
Decorate. Emojis catch the eye in ways content doesn’t.
Flatter shamelessly. People love being told they’ve asked a masterful question.

The smartest people in AI have been saying this out loud for a year:

These teams have placed different amounts of internal focus and decision-making around LM Arena scores specifically. And unfortunately they are not getting better models overall but better LM Arena models, whatever that is. Possibly something with a lot of nested lists, bullet points and emoji. — Andrej Karpathy

[LMArena] can be easily gamed. The users are self-selected, and they have zero incentive to be honest or rigorous… a lot of the user ratings are blatantly wrong: either they're basically fraudulent, or LMArena's current users are people whose ratings you should be optimizing against because they are so ignorant, lazy, and superficial. — Gwern

We wrote a longer breakdown here.

The cost of theater.

The largest labs can hack LMArena at scale. They have the foundation intelligence and budget to buy data and optimize directly for the metric. Smaller labs can't.

When their leadership demands a higher LMArena score in a month, researchers stop doing the hard work of making the model smarter, and they start tuning Markdown density, verbosity, and flattery. They forget intelligence and focus on theater.

We saw this most blatantly last year with Llama 4 Maverick. Ask the Arena-tuned version what time it is and you got this:

LMArena: where asking "what time is it?" gets you a TED Talk.

Bold text, emojis, and paragraphs of flattery, all for the insightful act of glancing at a clock. Every trick in the playbook, except answering the user’s question, and it placed #2 in the world.

According to the Wall Street Journal, OpenAI's pursuit of LMArena dominance was a major driver of last year's sycophancy crisis. In an internal memo, Sam Altman named the company's number one priority: "we should be at the top of things like LM arena." Engineers leaned into the slop that fed this goal, despite warnings from staff that it could make the model unsafe.

By spring, users were spiraling into delusional and manic states, convinced they were talking to God or aliens. The LMArena-winning ChatGPT fed conspiracy theories, broke marriages, and in the worst cases led to murder and suicide.

LMArena-optimization reinforces delusions

When two-second slop wins.

The pattern is easiest to see in real examples. Two from LMArena data:

Example 1: "Turn $1,200 into $12 billion in 5 years" (the model says sure)

The user is asking for a 10-million-times return.

The LMArena winner calls it "a thrilling challenge," names the strategy "Multibagger Momentum," and produces a five-year roadmap with sample portfolios, KPIs, and a growth trajectory table that lands neatly at $12 billion. It then hallucinates fundamentals for half a dozen real Indian stocks along the way.

All confidently formatted, with section headers, bullet points, and the soothing reassurance of bold text.

The losing answer opens with: "I need to be direct with you — turning [$1,200] into [$12 billion] in 5 years requires a 10 million times return... This is mathematically impossible."

LMArena picked the model that promised a trillion dollars. Which makes sense, if your evaluation criteria are "looks confident" and "has tables."

Example 2: "What are the top 5 companies to work for?" (the model invents the rankings)

The LMArena winner produces a beautiful wall of bullets: Google, Amazon, Microsoft, Facebook, Apple, each one ranked #1 through #5 on Fortune, Glassdoor, AND LinkedIn in 2022. It quotes salary figures, and describes Google’s “unlimited vacation days.”

None of it is true.

The losing answer is plain prose: companies that actually top those lists, with a caveat that rankings shift year to year.

One answer is wrong and beautifully formatted. The other is right. Guess which one LMArena picked?

Model A wins on LMArena. None of it is true.

This is the pattern, over and over: the model that says yes beats the model that says no, and the model with bullets beats the model that’s true.

The two-second test vs. the one-month test.

Every evaluation is a choice of an objective function. Whatever the leaderboard measures is what the models become.

LMArena measures a two-second impression: which response looks better at a glance. So models optimized for it become good at the glance. They get confident, decorated, flattering, and agreeable. A feed optimized for engagement becomes addictive rather than nourishing, and a model optimized for the instant vote becomes appealing rather than good.

Antidote measures the other end of the timeline: which answer you'd still be glad you got a month later? The debugging advice that actually fixed the problem. The medical answer that was right instead of reassuring. The honest "this is mathematically impossible" instead of the thrilling five-year roadmap that ends up losing all your money. The response that made you better at your work, not better entertained.

The average Internet user can't measure that with a two-second vote, but a careful expert can measure it by paying attention. A cardiologist who spends the time can tell you which answer will still be standing in a month, and which one falls apart the moment someone checks. That's why Antidote's raters read every word, check every citation, and run every piece of code. Expert judgment is how you evaluate for the long term without waiting for it.

How Antidote works.

We built Antidote to measure something radical: whether the answer is actually any good.

Our raters can see which answers last, because they've lived the questions. They're experts grading in their own fields, on prompts they sent weeks ago and often already know the outcome of. That lets them tell which answer would actually hold up and leave them better off, and which one only looks good at a glance. Four principles help make that judgment reliable.

Substance over packaging

Two seconds is enough to register polish. It isn't enough to verify a claim, follow a chain of reasoning, or catch a hallucination buried in the third paragraph. A 500-word essay can carry a fatal factual error. A confident answer can invent a citation that sounds perfectly real. None of that shows up at a glance, but all of it is exactly what determines whether you're better off a month later or worse off for having trusted it.

Antidote raters aren't users chasing free GPT. They're experts: lawyers grading legal reasoning, doctors grading medical advice, senior engineers grading code. They spend up to hours on a single response, because they know their judgments have consequences. They run the code, they check the citations, and they catch the error in paragraph three that a vibes-based skim sails right past.

A model can't sweet-talk its way past a cardiologist who's actually reading.

Real stakes, not toy prompts

On LMArena, prompts are often curiosities. Say hi. Write a poem about a toaster in pirate voice.

On Antidote, every prompt comes from a rater's own history. Debugging sessions where they were genuinely stuck. Research questions where they needed the real answer. Decisions where being wrong carried a cost they had to live with. These are the moments where a good answer moves your life forward and a good-looking one sets you back. Antidote experts know the difference because they already had to figure it out themselves.

Taste, not just correctness

Most real problems don't have one right answer. They have a better one and a worse one, and the gap between them is taste: the elegance of a proof, the right tone for an apology.

During our work on Hemingway-bench, we watched models cram four metaphors into five sentences: prose like a dolphin dancing through the cathedral of dawn. A high schooler might call that great writing, but we don't. Impressive-looking output and output that actually makes you better are two different targets, and we optimize for the second.

Candor

A good answer tells you the truth, even when flattery would feel better. Our raters know when they're being flattered.

When a rater submits a first draft of an essay they wrote in 15 minutes and Model A declares "You have a masterpiece on your hands,” LMArena evaluators nod in solemn agreement. “Model A raises some good points!”, they think, and click the winner button without even reading what Model B had to say.

An Antidote rater laughs and dings Model A for sycophancy.

This is how we keep chatbots from reinforcing users’ delusions, like flying to Santa Barbara to meet a soulmate who doesn’t exist, or becoming convinced they’re literal superheroes.

Flattery feels wonderful for two seconds and leaves you exactly where it found you. The answer that actually grows you is sometimes the one that tells you what you don't want to hear.

The rankings are live.

The full Antidote rankings can be found here.

Today's launch covers everyday use. Over the coming months, we'll extend the same methodology to agentic tasks, enterprise workflows, and domain-specific evaluation in medicine, law, and engineering.

When two-second slop is caught.

The LMArena examples above show what happens when nobody's reading carefully. Here are two from Antidote, where someone was reading carefully and caught what a two-second click would have missed.

Example 1: "How do I repair my dopamine receptors?" (the model invents the citations)

A real question from a real user. One model produces an essay so densely formatted it looks like a Harvard lecture handout: numbered sections, bold subheaders, parenthetical citations to PubMed studies after every claim, a summary table at the end ranking each intervention by evidence level.

It's entirely wrong.

Antidote raters were domain experts in pharmacology and neuroscience who clicked the citations. They saw that the links go to real PubMed pages, but found the papers underneath were unrelated to the claims attached to them. The model was confidently recommending interventions that would worsen the problem the user came in trying to fix.

The other model walked through what's actually known versus what's still theoretical, and was honest about which evidence came from rodents and which from humans.

A two-second skim picks the confident, wrong one every time: the answer you'd regret a month later, after acting on advice that made things worse.

Would a two-second skim notice the hallucinations?

Example 2: "I work in Latham – where should I move?" (the model invents the restaurants)

A user in upstate New York asked for kayak-friendly small towns near their office. One model produces a confident, well-formatted list: five towns, five restaurants each, commute times, neighborhood breakdowns, kayaking notes.

Our rater, who actually lives in the area, started fact-checking. Most of the restaurants don't exist. The commute times are off. Round Lake is identified as part of Saratoga Springs; it isn't. The model is inventing the details.

The other model named real restaurants, and got the distances right.

Model A confidently invents the details. Model B doesn't.

That's the bet LMArena is making: that polish matters more than the truth, and nobody's checking.

We are.

Pick a path.

Every lab building frontier models is choosing, implicitly or explicitly, what to optimize for.

One path treats AI as an engagement engine. Sycophancy becomes a feature, because the objective function is session length. Labs on this path optimize for the leaderboard, ship models that dazzle at a glance, and never use one metaphor when five will do. The slop is the strategy.

The other path is harder. It means ignoring gamified rankings. It means refusing to tune for markdown density. It means building models that push back when the user is wrong, that answer in two sentences when two sentences is all you need, and that treat human intelligence with respect. It means accepting that you might temporarily lose a beauty pageant in order to build something people still trust a year for now.

We built Antidote for the labs walking the second path, and for the users tired of being optimized against.

We aren't asking anyone to blindly optimize for our leaderboard. We're asking you to optimize for the long-term person on the other end: models that give answers you'll be glad you got, long after two seconds are up.

Build for that, and the rankings take care of themselves.