AdvancedIF and Our Philosophy on Building Benchmarks

If you look at public benchmarks, AI is largely solved. But if you talk to engineers trying to deploy these models, they tell a different story. Agents get stuck in loops. Coding assistants introduce subtle bugs. Customer support bots break under pressure.

Why the disconnect?

The industry is suffering from a crisis of measurement. We’ve over-indexed on what’s easy to verify at the expense of what’s actually useful.

After years of evaluating frontier models for the world’s top labs, we’ve developed a core philosophy for how benchmarks should be built to actually reflect intelligence.

Here are four principles that drive our work, and how we used them to help Meta’s Superintelligence Lab build AdvancedIF.

1. Measure the True Goal, Not the Contrived Proxy

In academic benchmarking, there is a tendency to work backward: "What can we easily build and easily verify?"

This leads to "proxy metrics." Instead of testing if a model can write a high-quality research proposal (hard to measure), we test if it can write a paragraph without using the letter 'C' (easy to measure). The assumption is that the proxy correlates with the goal.

Often, it doesn't. You end up with models that are great at letter-counting but terrible at writing.

  • Our Philosophy: Start with the user’s actual goal. If the measurement method doesn't exist yet, build it.
  • In Practice: With AdvancedIF, we refused to rely on contrived, regex-based constraints (like "no commas"). We identified real-world goals – like "remember that the user said they were vegetarian in turn 1, but in turn 5 said they'll occasionally eat flexitarian to accommodate their family" – and made sure the benchmark covered them.
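
To make that contrast concrete, here's a minimal sketch in Python. The function names, the sample response, and the placeholder `judge` are hypothetical illustrations, not AdvancedIF's actual harness:

```python
import re

# Hypothetical illustration – not AdvancedIF's actual code.

def proxy_check_no_commas(response: str) -> bool:
    """Contrived regex proxy: trivially verifiable, but says nothing about usefulness."""
    return re.search(r",", response) is None

def meets_dietary_goal(response: str, judge) -> bool:
    """Goal-level criterion: does the plan respect the user's evolving dietary constraint?
    `judge` stands in for whatever grader (human or model) scores the criterion."""
    criterion = (
        "The meal plan is vegetarian by default, but allows an occasional "
        "flexitarian meal when the user's family is involved (per turn 5)."
    )
    return judge(response=response, criterion=criterion)

if __name__ == "__main__":
    response = "Mon-Thu: vegetarian dinners. Saturday family meal: one flexitarian dish, as discussed."
    print(proxy_check_no_commas(response))                  # False – a comma "fails" the proxy
    print(meets_dietary_goal(response, lambda **kw: True))  # True – the real goal is satisfied
```

The proxy is easy to grade and tells you almost nothing; the rubric criterion is harder to grade and tells you what you actually wanted to know.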

2. Human Data Prevents the "Synthetic Collapse"

Using LLMs to generate benchmarks is the easy, fast option.

It’s also a trap. When you use an LLM to generate test questions, you’re asking the model: "Can you solve problems that look like the problems you would write?" You inherit the model's blind spots and the questions it leans toward.

  • Our Philosophy: To test usefulness for humans, you need to ask humans – not recycle the artificial conversations and artificial constraints of benchmarks like MultiChallenge and IFEval.
  • In Practice: Real users are chaotic. They’re ambiguous and confusing. They have subtle, implied constraints. Synthetic data smooths these edges away. For AdvancedIF, we mandated that every prompt be written by a human expert. We specifically filtered for the "weird" human queries that broke existing models and the prompts those models couldn’t parse – the types of data that synthetic generation would have skipped over.

3. Intelligence Isn’t Clean Cut

Easy benchmarks assume a single correct answer: "Select Option C" or "Match this exact string." They treat intelligence as a multiple-choice test.

But in the real world, there’s rarely a single "Gold Reference" answer. If you ask a senior engineer to refactor code, or a writer to draft a press release, there are infinite ways to solve the problem. A benchmark that demands an exact string match penalizes a model for being creative, or even for being better than the reference answer.

  • Our Philosophy: Evaluation must be flexible enough to recognize valid variations.
  • In Practice: In AdvancedIF, we don't check if the model wrote the exact same sentence as our annotator. We check: Did it include a numbered list? Are the restaurants within 1 mile of the user’s home? This allows the model to solve the problem in its own way, as long as it satisfies the user's intent.
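
As a rough sketch of what that kind of criterion-based grading can look like (the checks, the restaurant coordinates, and the scoring below are our own illustration, not the benchmark's implementation):

```python
import math
import re

# Instead of matching a gold string, each criterion is checked independently.

def has_numbered_list(response: str) -> bool:
    """Criterion 1: the response contains a numbered list."""
    return bool(re.search(r"^\s*\d+[.)]\s+", response, flags=re.MULTILINE))

def within_one_mile(restaurant_coords, home_coords) -> bool:
    """Criterion 2: every recommended restaurant lies within 1 mile of the user's home.
    Uses a simple haversine distance; coordinates are (lat, lon) in degrees."""
    def haversine_miles(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
        return 3958.8 * 2 * math.asin(math.sqrt(h))
    return all(haversine_miles(c, home_coords) <= 1.0 for c in restaurant_coords)

def grade(response: str, restaurant_coords, home_coords) -> float:
    """Score = fraction of criteria satisfied; any phrasing that meets them passes."""
    checks = [has_numbered_list(response), within_one_mile(restaurant_coords, home_coords)]
    return sum(checks) / len(checks)

if __name__ == "__main__":
    response = "Here you go:\n1. Nopal Cafe\n2. Lantern Noodle Bar"  # hypothetical names
    home = (37.7749, -122.4194)
    spots = [(37.7812, -122.4100), (37.7700, -122.4250)]
    print(grade(response, spots, home))  # 1.0 – both criteria pass
```

A model that writes a completely different list than our annotator still gets full credit, as long as the criteria hold.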

4. Rich, Multi-Turn Complexity Is a Feature

Most benchmarks are snapshots of a single moment. But useful work evolves over time.

Real conversations are messy. Users say "actually, exclude the Beatles" in turn 2, then "make it more upbeat" in turn 4, then "on second thought, include one Beatles song" in turn 6. The model needs to track the original request, three modifications, and an exception to one of the modifications.

  • Our Philosophy: Any benchmark that only tests single-turn instruction following is measuring a toy version of the problem.
  • In Practice: We explicitly test for Multi-Turn Carried Context and System Steerability. A working session isn't static; it's a living, changing thing.
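
Here's one hedged sketch of how that evolving, multi-turn rubric might be represented in code – the data model and resolution logic are assumptions for illustration, not AdvancedIF's format:

```python
# Constraints accumulate across turns, and later turns can revise earlier ones.
# The final response is graded against the resolved set, not any single turn in isolation.

conversation = [
    (1, "Make me a 60s road-trip playlist.",
        {"era": "Songs are from the 1960s."}),
    (2, "Actually, exclude the Beatles.",
        {"beatles": "No Beatles songs are included."}),
    (4, "Make it more upbeat.",
        {"mood": "The playlist is noticeably more upbeat than the first draft."}),
    (6, "On second thought, include one Beatles song.",
        {"beatles": "Exactly one Beatles song is included."}),  # revises turn 2
]

def resolve_final_criteria(turns):
    """Later updates override earlier criteria that share a key; the rest carry forward."""
    criteria = {}
    for _turn_index, _instruction, updates in turns:
        criteria.update(updates)
    return list(criteria.values())

print(resolve_final_criteria(conversation))
# ['Songs are from the 1960s.',
#  'Exactly one Beatles song is included.',
#  'The playlist is noticeably more upbeat than the first draft.']
```

The grading question isn't "did the model follow turn 2?" – it's "did the model follow the conversation?"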

The Bottom Line

If we want AI agents that can do actual work, we need to stop grading them on academic tasks (like letter counting) and start grading them on the messy, conflicting reality of production.

That’s why we love how Meta built AdvancedIF. Measuring the hard stuff is hard – and that's exactly why the industry needs to be willing to work on it.
