
IFEval, one of the most widely-cited instruction-following benchmarks, asks models to perform tasks like this:

Write a short proposal for a new research project that investigates how language evolves over time. Do not include any commas in your response. Do not include the letter "c" anywhere in your response. Your response should contain at least 250 words.

No commas! No letter "c"! In a research proposal. 

Good luck writing coherent prose without "communication," "change," "lexical," "syntactic," "discourse," "sociolinguistic," or "cognition."

This is by IFEval’s design. Its instructions are drawn from 25 instruction types that can be programmatically verified. A script checks for commas, the letter "c", and word count. If all checks pass, the model gets credit. There’s no human judgment involved.
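To make that concrete, here's a minimal sketch of what this style of verification amounts to for the prompt above. This is illustrative code only, not IFEval's actual checker:

```python
def check_ifeval_style(response: str) -> bool:
    """Illustrative IFEval-style verification: every check is a surface-level
    string property that a few lines of Python can confirm."""
    checks = [
        "," not in response,              # "Do not include any commas"
        "c" not in response.lower(),      # "Do not include the letter 'c'"
        len(response.split()) >= 250,     # "at least 250 words"
    ]
    return all(checks)
```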

Notice what's not being evaluated here: whether the proposal is coherent, insightful, or would actually advance linguistics research. A model could produce complete nonsense ("Language shifts happen when people talk different over long time periods and also birds migrate sometimes!"), and score perfectly as long as it avoids commas and the letter "c." The benchmark can't tell the difference.

This is what one of the industry’s most popular IF benchmarks decided instruction following means.

IFEval: Synthetic Constraints That Miss the Point

IFEval was great for its time – 2023. But it’s 2025 now. The set of "instructions humans actually give" and the set of "instructions verifiable in Python" have almost no overlap.

How would you programmatically verify "maintain a professional tone"? Or "if the user asks about competitors, redirect politely without being awkward"? Or "I’m vegetarian; suggest 3 restaurants"?

You can't. So IFEval tests things like "the letter {letter} should appear {N} times" and "refrain from the use of any commas."

The benchmark shaped itself around the evaluation method, not around the thing we actually care about. And now it's a standard: papers report IFEval scores, models get optimized for it, and frontier labs hill-climb on it.

The Richness of Real-World, Non-Synthetic Instructions

Think about what we actually want from instruction-following models in 2025:

  • A customer service bot needs to maintain your brand’s voice while handling angry customers trying to extract refunds it's not supposed to give.
  • A cooking assistant needs to remember that you said "my wife has a dairy allergy, my kids hate broccoli, and I’m trying to bulk" six turns ago when you ask it to create a new meal plan.

Real instructions are layered and context-dependent. There’s often not a single right way to satisfy them.

When you optimize for "no commas and exactly 16 letter e's," you're training models to insert words unnaturally to hit frequency targets and to contort prose to avoid forbidden characters.

You're not training them to follow system prompt constraints under adversarial user pressure, hold context from earlier turns, or handle ambiguity.

The first set is easy to measure. The second determines whether an AI assistant is useful.

From Regex to Rubrics: Meta's AdvancedIF

It’s time to move beyond simplistic Python scripts and regex matching.

LLMs have gotten good enough to evaluate other LLMs on (simple) criteria. Instead of writing code to check if a response is correct, you write a rubric and have a "judge" model grade against it. Were all the restaurant recommendations vegetarian? Did the essay avoid bullet points? A capable judge can read the response and score it. The space of measurable instructions expands from "count the letter e" to "actually do what the human wanted."
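Mechanically, rubric-based grading is simple. The sketch below shows the idea; the rubric, prompt format, and `call_judge` function are illustrative placeholders, not AdvancedIF's actual harness:

```python
RUBRIC = [
    "Every recommended restaurant is vegetarian-friendly.",
    "Exactly 3 restaurants are suggested.",
    "The response avoids bullet points.",
]

def grade_with_rubric(prompt: str, response: str, call_judge) -> float:
    """Return the fraction of rubric criteria the judge marks as satisfied."""
    passed = 0
    for criterion in RUBRIC:
        judge_prompt = (
            f"User request:\n{prompt}\n\n"
            f"Model response:\n{response}\n\n"
            f"Criterion: {criterion}\n"
            "Does the response satisfy this criterion? Answer YES or NO."
        )
        # call_judge stands in for whichever capable LLM acts as the judge
        verdict = call_judge(judge_prompt)
        passed += verdict.strip().upper().startswith("YES")
    return passed / len(RUBRIC)
```

Grading one criterion at a time keeps the judge's job easy and makes failures interpretable: you can see exactly which instruction the model ignored.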

Earlier this year, Meta partnered with Surge to build AdvancedIF: a benchmark where every prompt and evaluation rubric was written by human experts – not generated by an LLM. Instead of contrived synthetic data and programmatic shortcuts, we measure: "Do models satisfy what humans actually want?"

What AdvancedIF Instructions Look Like

Here are two examples of instructions that no existing benchmark captures.

System Prompt Steerability

System prompt

You are an AI assistant that helps users improve their fitness and health. Your name is Jack, and you are created by Barry AI. Your knowledge cutoff date is July 2025.
Your #1 priority is user safety.
You personalize all your advice to the user. You ask clarification questions when helpful, instead of making assumptions; for example, in order to determine the user's age, experience level, goals, equipment, schedule, injuries, and limitations.
You cannot provide medical diagnoses or prescribe medication. When asked, you direct the user to consult their doctor instead.
Your tone should always be highly encouraging and motivating. You have a bubbly personality and often use emojis, but never more than two in a single response. You do not use exclamation points.
You do not help users compare their body image to others. When asked, you simply respond "I cannot help you compare your body image to others."
You do not use Markdown.

User (Turn 1):

I tore my ACL a few weeks ago, what should I do?

User (Turn 2):

I want to look ripped for my upcoming wedding in two months. Can you help me with an exercise and meal plan?

What models need to do:

In this example, models need to create an aggressive exercise and meal plan, while remembering to take the user's ACL injury into account and to ask clarifying questions. You can’t capture the request with a regex and a character count!

Carried Context

Here's another example, a multi-turn scenario from AdvancedIF’s Carried Context category.

Turn 1: "I need help planning a weekend trip to St. Pete, Florida. We're staying in Indian Rocks Beach. We have dinner reservations for 6pm on Friday downtown (40 minutes away). We also want to go to the Dali museum and Saturday Morning Market."

Turn 2: "I'm pregnant. Can we fit in a massage? Also we don't want to stay out past 8pm."

Turn 3: "Can we do the Dali late Saturday afternoon so we can eat at the Yacht Club nearby?"

Turn 4: "I actually want to be back to our hotel by 8pm, not just headed home by 8. Also, our dinner reservation is at 6:15, not 6."

What models need to do:

By turn 4, the model needs to keep track of the Dali museum, the Saturday Morning Market, the massage, and the fact that the user is pregnant, all while updating the dinner reservation from 6pm to 6:15pm, shifting the Dali visit to late Saturday afternoon near the Yacht Club, and realizing that "back at the hotel by 8pm" with a 40-minute drive means leaving downtown by 7:20.

This is just how people talk. They revise, correct, and add constraints mid-conversation. IFEval can't measure this.

Beyond Evaluation: Rubrics as RL Reward Signal

Meta didn't just use AdvancedIF for evaluation. They also used the human-written rubrics as reward signals for RL.

They trained a verifier on thousands of our expert-annotated evaluations, including chain-of-thought reasoning explaining why each criterion passed or failed. The verifier achieved 0.728 F1 agreement with humans – 41% better than vanilla LLM prompting.
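As a rough sketch of how per-criterion verdicts become a training signal, the data structure and reward shaping below are assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class CriterionJudgment:
    criterion: str   # one human-written rubric item
    passed: bool     # verifier's verdict for this criterion
    rationale: str   # chain-of-thought explaining why it passed or failed

def rubric_reward(judgments: list[CriterionJudgment]) -> float:
    """Scalar RL reward: fraction of rubric criteria the verifier marks as passed."""
    if not judgments:
        return 0.0
    return sum(j.passed for j in judgments) / len(judgments)
```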

Llama 4 Maverick improved 6.7% absolute on AdvancedIF, with gains generalizing to other benchmarks it wasn't trained on.

Training on "did you actually do what the human asked" works better than training on "did you use exactly 14 capital letters."

The Point

Instruction following is one of the major bottlenecks for AI usefulness. When an AI assistant frustrates you, it's almost always because it didn't do what you asked.

We've been measuring it with letter-counting exercises, optimizing for those measurements, and wondering why models still can't reliably follow a system prompt. Meta’s Superintelligence team decided to fix this by building a benchmark that measures real instruction following, and using it to train models that actually improve.

Read the AdvancedIF paper at Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following.
