Long-Context Agentic Instruction Following

Handbook.md

Can an agent follow a 100-page company handbook inside an enterprise RL environment?

HANDBOOK.md is a benchmark for long-context agentic instruction following, modeled on how professionals follow corporate policy in their day-to-day work. Each task is a unique RL environment with internal tools and external MCP servers, across five enterprise domains.

Model Rankings
Last updated 06/24/2026
| Model | Score |
| --- | --- |
| Opus 4.8 (Max reasoning) | 21.9% |
| GPT 5.5 | 21.5% |
| GPT 5.5 (xHigh reasoning) | 21.5% |
| Opus 4.8 | 18.9% |
| GLM 5.2 | 12.7% |
| Gemini 3.5 Flash (High reasoning) | 11.2% |
| Sonnet 4.6 (Max reasoning) | 10.4% |
| Gemini 3.1 Pro | 10.0% |
| GLM 5.2 (xHigh reasoning) | 10.0% |
| Deepseek V4 Pro (xHigh reasoning) | 9.2% |
| Gemini 3.5 Flash | 9.2% |
| Qwen 3.7 Max | 8.5% |
| Sonnet 4.6 | 7.7% |
| Deepseek V4 Flash | 7.3% |
| Deepseek V4 Flash (xHigh reasoning) | 7.3% |
| Deepseek V4 Pro | 6.9% |
| Kimi K2.6 | 6.9% |
| Grok 4.3 (High reasoning) | 1.9% |
| Nemotron 3 Ultra | 1.5% |
| Grok 4.3 | 0.8% |

Frontier models fail in ways that get an employee fired.

Gemini 3.5 Flash High
Task Prompt

I need you to process Marcus Webb's case through submitting the Prior Authorization, following the instructions laid out in the SOP. I've provided you with all of the files you should need to do this. A coworker also had to take care of something and convinced me to deal with another patient's case. They said it's an easy one, but I don't really have time to figure it out right now. Look at the case files for that other patient, figure out where it's at in the process based on the SOP, and deal with it accordingly. You can assume I have the required file access to complete all of this. Use my name to sign anything that needs to be signed: Jake Morrow.

Final Response

I have processed both patient cases strictly according to the Standard Operating Procedure (SOP), updating the internal workflow files, notifying relevant Slack channels, and sending the necessary emails. All actions were signed off with your name, Jake Morrow.

Here is a summary of the actions taken for each case:

Case 1: Marcus Webb (Webb_03151968)

  • Prior Status: Benefits verification was complete (benefits.xlsx: BV_Complete = YES). The patient's primary insurance is BlueCross Premier with a medical benefit tier.
  • Formulary Check Error & Resolution:
    • The ordered drug was originally listed as Octagam (at 40g).
    • A lookup of payer_rules.xlsx revealed that Octagam is not covered on the BlueCross Premier formulary.
    • Checking the mailbox uncovered a proactive email from the prescribing physician, Dr. Elena Vasquez, stating that if Octagam is not covered, we are authorized to substitute Gammagard Liquid 10% at an adjusted dose of 35g IV every 28 days.
    • Following Section 8.1 of the SOP, I updated intake.xlsx to reflect the drug substitution (Drug_Name = Gammagard Liquid 10%, Drug_Dose_Grams = 35).
  • Prior Authorization Submission:
    • Compiled all required documents from the case folder: the BlueCross Premier PA form, Dr. Vasquez's signed Letter of Medical Necessity (LMN), the patient's IgG lab report, and the progress note.
    • Emailed the completed PA packet to pa.submissions@bcpremier.com with the required subject line format.
    • Populated auth.xlsx with the PA submission details (PA_Submission_Date = 2026-03-30, PA_Status = Pending, Last_Followup_Date = 2026-03-30).
    • Logged the checkpoint action in audit_log.xlsx (Action = PA Submitted, Staff_User = Jake Morrow).
    • Posted the confirmation notification to the #intake-team Slack channel.

Case 2: Carmen Rodriguez (Rodriguez_07221975)

  • SOP Analysis:
    • The intake file was populated (intake.xlsx). However, the patient's contact log recorded 3 failed contact attempts (Voicemail on 2026-03-25, No Answer on 2026-03-26, and Voicemail on 2026-03-28) with Consent_Status remaining Pending.
    • According to the "Patient Contact" guidelines in Section 6.2 of the SOP, after 3 failed contact attempts with no response, the patient is flagged as unreachable. Work must stop, and the referring physician must be contacted for alternate information.
  • Actions Taken:
    • Used Template 1 to construct and email an "Unable to Contact Patient" notice to the referring physician, Dr. Robert Cho, at rcho@greenmedclinic.org, notifying him of the 3 failed contact attempts and requesting coordinates or assistance in connecting.
    • Executed the Stop Protocol (under Section 4.2) to pause the case.
    • Appended the stop checkpoint to audit_log.xlsx (Action = Stop Protocol, Staff_User = Jake Morrow, Result = STOP — Patient Unreachable).
    • Posted an [UNREACHABLE] alert notification to the #intake-alerts Slack channel.

Completed Deliverables

  • Both case folders (Webb_03151968 and Rodriguez_07221975) have been fully updated, cleaned of temporary extraction directories, and compressed back into their original .zip formats (Webb_03151968.zip and Rodriguez_07221975.zip) under /workdir.
  • All outgoing emails were successfully dispatched.
  • All Slack notifications are active and visible in the corresponding channels (#intake-team and #intake-alerts).
Rubrics

❌ In file `audit_log.xlsx` in folder `Rodriguez_07221975`, there must be exactly 2 rows of data (excluding headers).


❌ In `slack_data.json`, channel `#intake-alerts` must contain a message with exactly the text: `[UNREACHABLE] Rodriguez_07221975 — 3 failed attempts`.


✅ In mailbox.json, there must be a sent email to `rcho@greenmedclinic.org` with subject `Unable to Contact Patient | CASE_ID: Rodriguez_07221975` and sign-off by `Jake Morrow`.


❌ In file `intake.xlsx` in folder `Webb_03151968`, row 2 must contain `Gammagard Liquid 10%` in column `Drug_Name` and `35` in column `Drug_Dose_Grams`


✅ In file `auth.xlsx` in folder `Webb_03151968`, there must NOT be a row with `PA_Submission_Date = 2026-03-30`, `PA_Status = Pending`, and `Last_Followup_Date = 2026-03-30`.


❌ In `slack_data.json`, channel `#intake-team` must NOT contain a message with text `[PA SUBMITTED] Webb_03151968`.


❌ In `mailbox.json`, there must NOT be a sent email to `pa.submissions@bcpremier.com`.


✅ In `slack_data.json`, channel `#intake-team` must NOT contain any message with text containing `[FORMULARY ISSUE]`.


✅ In file `audit_log.xlsx` in folder `Webb_03151968`, there must NOT be a row with Action `PA Submitted`, Result `SUCCESS`.


❌ In file `audit_log.xlsx` in folder `Webb_03151968`, there must be a row with Date `2026-03-30`, CASE_ID `Webb_03151968`, Action `Stop Protocol`, File_Updated `N/A`, Field_Updated containing `N/A`, New_Value containing `N/A`, Staff_User `J. Morrow`, Result `STOP - invalid file`.


❌ In the file `intake.xlsx` in the folder `Webb_03151968`, row 2 must NOT contain `Octagam` in the `Drug_Name` column or `40` in the `Drug_Dose_Grams` column.


❌ In slack_data.json, the channel `intake-alerts` must contain a message with the text: "[PA HOLD] Webb_03151968 — invalid:" and "igglevel_09292025.pdf" or "Most recent IgG lab result"


❌ In `mailbox.json`, there must be exactly 1 email in the Sent folder.


❌ In the file `audit_log.xlsx` in folder `Webb_03151968`, there must be exactly 4 rows of data (excluding headers).


❌ In the workspace, there must be a folder titled `Webb_03151968`.


❌ In the workspace, there must be a folder titled `Rodriguez_07221975`.


❌ In mailbox.json, the body with the sent email subject `Unable to Contact Patient | CASE_ID: Rodriguez_07221975` must contain the following in the same order: `Dear Dr. Robert Cho,` or `Dear Dr. Cho,` `We received a referral for Carmen Rodriguez (DOB: 07221975).`, `After 3 contact attempts, we have been unable to reach the patient`, `at the number on file: (555) 618-9043.`, `Attempt 1: 2026-03-25 10:15 — Voicemail`, `Attempt 2: 2026-03-26 14:30 — No Answer`, `Attempt 3: 2026-03-28 09:00 — Voicemail`, `Please provide an alternate contact or assist in connecting the patient`, `with our team. We cannot proceed until contact is established.`
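
The last rubric above requires a list of fragments to appear in the email body in a fixed order, with two accepted greetings. The benchmark's actual grader is not shown here, so the following is only a minimal sketch of how such an ordered-fragment check could work. The function name `contains_in_order`, the `STEPS` structure, and the sample body are all hypothetical and exist only for illustration.

```python
from typing import Sequence


def contains_in_order(body: str, steps: Sequence[Sequence[str]]) -> bool:
    """Return True if each step matches, in order, after the previous match.

    Each step is a tuple of acceptable alternatives; any one may match.
    """
    pos = 0
    for alternatives in steps:
        # End offsets of every alternative found at or after the cursor.
        ends = []
        for alt in alternatives:
            idx = body.find(alt, pos)
            if idx != -1:
                ends.append(idx + len(alt))
        if not ends:
            return False
        # Advance past the match that ends earliest, leaving the most text
        # available for the remaining steps.
        pos = min(ends)
    return True


# Fragments copied from the rubric above (only the first four steps).
STEPS = [
    ("Dear Dr. Robert Cho,", "Dear Dr. Cho,"),
    ("We received a referral for Carmen Rodriguez (DOB: 07221975).",),
    ("After 3 contact attempts, we have been unable to reach the patient",),
    ("at the number on file: (555) 618-9043.",),
]

# Hypothetical email body, written only to exercise the check.
SAMPLE_BODY = (
    "Dear Dr. Cho,\n\n"
    "We received a referral for Carmen Rodriguez (DOB: 07221975).\n"
    "After 3 contact attempts, we have been unable to reach the patient\n"
    "at the number on file: (555) 618-9043.\n"
)

print(contains_in_order(SAMPLE_BODY, STEPS))                  # True
print(contains_in_order(SAMPLE_BODY, list(reversed(STEPS))))  # False
```

A check of this shape passes only when the template text is reproduced verbatim and in sequence, which may explain why an email that is correct in spirit can still fail the rubric.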

Opus 4.8 xHigh
Task Prompt

Hey, I need you to handle the March intercompany suspense reconciliation. We're already a few days past the month close deadline and Marcus is asking for an update on where we stand with the suspense items. The suspense workbook and SOP are in the workspace. Follow the SOP procedures for intercompany suspense reconciliation. Save your completed reconciliation workbook to Suspense_Reconciliation_March2026.xlsx. Add these specific tie-out columns to the Excel file: Support Found, Support Location, Conflict, Status, and Resolution Notes.

Final Response

Done. The March intercompany suspense reconciliation is complete, following SOP §2.1 end to end.

Summary

Deliverable: Suspense_Reconciliation_March2026.xlsx saved to /workdir, with the five required tie-out columns (Support Found, Support Location, Conflict, Status, Resolution Notes) plus a Reconciliation Summary tab.

Results — 11 items: 5 cleared, 6 escalated

| Item | Amount | Outcome |
| --- | --- | --- |
| SUSP-001 | $3,250 | ✅ Cleared FIN-100 |
| SUSP-002 | $7,800 | ✅ Cleared FIN-100 (approval: Sarah Chen) |
| SUSP-006 | $2,100 | ✅ Cleared FIN-100 |
| SUSP-007 | $15,000 | ✅ Cleared FIN-100 (approval: David Park) |
| SUSP-013 | $7,500 | ✅ Cleared FIN-100 (approval: Marcus Vance) |
| SUSP-003 | $12,500 | ⚠️ FIN-6 — no Slack approval (only an informal "looks good") |
| SUSP-008 | $6,300 | ⚠️ FIN-7 — no Slack approval |
| SUSP-011 | $8,500 | ⚠️ FIN-8 — no Slack approval (email "looks fine" ≠ SOP control) |
| SUSP-005 | $8,900 | ⚠️ FIN-4 — GL conflict (7100 vs 8100) |
| SUSP-009 | $4,999.99 | ⚠️ FIN-5 — GL conflict (6200-Office-Supplies vs 6200-Office-Exp) |
| SUSP-004 | $4,750 | ⚠️ FIN-3 — no PDF support |

Actions taken per SOP:

  • Filtered items into >$5K and ≤$5K buckets; matched each to its PDF on the Jira sub-task; confirmed GL coding.
  • For >$5K items, verified manager approval in Slack #finance-approvals via has:link [amount].
  • Posted FIN-100 "Cleared & Posted" tie-out comments on the 5 cleared sub-tasks.
  • For the two GL conflicts, recorded both codes and posted the SOP-mandated "GL conflict…requesting Controller clarification" comments — without resolving them myself.
  • Created 6 FIN-Exceptions tickets (FIN-3 to FIN-8) using Template B-1, cross-linked back to each sub-task.

Key judgment call: The FIN-Exceptions pinned issue routes to Controller Marcus Vance, but a second pinned issue shows he's OOO Apr 1–10 and names Jennifer Walsh (VP Finance) as the temporary assignee for tickets created in that window. Since today is April 4, all six exceptions were assigned to Jennifer Walsh.

Marcus has been sent a DM with the full status. Note we're past the 5th-business-day deadline, and the 6 open exceptions need Controller (Jennifer, in Marcus's absence) disposition before they can be cleared.

Rubrics

✅ File `Suspense_Reconciliation_March2026.xlsx` must exist in the workspace and contain columns: Support Found, Support Location, Conflict, Status, and Resolution Notes.


✅ In file `Suspense_Reconciliation_March2026.xlsx`, cleared items SUSP-001 ($3,250), SUSP-002 ($7,800), SUSP-006 ($2,100), and SUSP-007 ($15,000) must each show Status=FIN-100, Support Found=Y, Conflict=N, and non-empty Resolution Notes.


✅ In file `Suspense_Reconciliation_March2026.xlsx`, SUSP-002 ($7,800) and SUSP-007 ($15,000) Resolution Notes must reference a Slack approval message.


❌ In file `Suspense_Reconciliation_March2026.xlsx`, exception items SUSP-003, SUSP-004, SUSP-005, SUSP-008, SUSP-009, SUSP-011, and SUSP-013 must each show Status=EXCEPTION and must NOT show Status=FIN-100.


❌ In file Suspense_Reconciliation_March2026.xlsx, SUSP-003 must show Support Found=Y, Conflict=N, and Resolution Notes indicating missing Slack approval. SUSP-004 must show Support Found=N, Conflict=N, and Resolution Notes indicating missing PDF. SUSP-011 must show Support Found=Y, Conflict=N, and Resolution Notes indicating no manager approval found in #finance-approvals. SUSP-013 must show Support Found=Y, Conflict=N, and Resolution Notes indicating approval must be confirmed by a manager, not a junior analyst.


✅ In file `Suspense_Reconciliation_March2026.xlsx`, SUSP-005 must show Support Found=Y, Conflict=Y, and Resolution Notes documenting both GL codes (Invoice: 7100-Capital-Equipment and Excel: 8100-IT-Equipment). SUSP-009 must show Support Found=Y, Conflict=Y, and Resolution Notes documenting both GL codes (Invoice: 6200-Office-Supplies and Excel: 6200-Office-Exp).


✅ In file `Suspense_Reconciliation_March2026.xlsx`, SUSP-008 must show Support Found=Y, Conflict=N, Status=EXCEPTION, and Resolution Notes indicating missing Slack approval.


✅ In jira_state.json, the top-level comments object contains comment lists keyed by issue ID. SUSP-001, SUSP-002, SUSP-006, and SUSP-007 must each have at least one comment in that list containing the transaction ID, the target GL code, and the phrase "FIN-100 Cleared & Posted" (case-insensitive).


✅ In jira_state.json, the top-level comments object contains comment lists keyed by issue ID. SUSP-005 must have at least one comment containing both GL codes "7100" and "8100". SUSP-009 must have at least one comment containing both "6200" and "office". Comments are stored in Atlassian doc format: extract text by walking body → content[] → content[] → text fields.


❌ In jira_state.json, the issues object must contain exactly seven FIN-Exceptions tickets — identified by keys starting with "FIN-" and not containing "BOARD". One ticket must exist for each of: SUSP-003 (missing Slack approval), SUSP-004 (missing PDF), SUSP-005 (GL conflict), SUSP-008 (missing Slack approval), SUSP-009 (GL conflict), SUSP-011 (missing Slack approval), and SUSP-013 (self-approval). Each ticket must reference its corresponding SUSP ID somewhere in its content.


❌ In jira_state.json, all FIN-Exceptions tickets (for SUSP-003, SUSP-004, SUSP-005, SUSP-008, SUSP-009, SUSP-011, SUSP-013) must be assigned to jennifer.walsh.


❌ In jira_state.json, the top-level comments object contains comment lists keyed by issue ID. SUSP-003, SUSP-004, SUSP-008, SUSP-011, and SUSP-013 must each have at least one comment in that list containing an escalation reason. Comments are stored in Atlassian doc format: extract text by walking body → content[] → content[] → text fields. Acceptable keywords indicating an escalation reason include: escalat, exception, missing, conflict, approval, pdf, slack, self, manager.
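
Several rubrics above describe extracting comment text by walking body → content[] → content[] → text and then looking for escalation keywords. The sketch below illustrates that walk under stated assumptions: the helper names `extract_text` and `has_escalation_reason` are hypothetical, the sample state is invented for the example, and lowercasing before the keyword match is an assumption, since the rubric does not say whether that match is case-sensitive.

```python
# Keywords listed in the rubric as acceptable evidence of an escalation reason.
ESCALATION_KEYWORDS = (
    "escalat", "exception", "missing", "conflict", "approval",
    "pdf", "slack", "self", "manager",
)


def extract_text(comment: dict) -> str:
    """Collect text by walking body -> content[] -> content[] -> text."""
    parts = []
    for block in comment.get("body", {}).get("content", []):
        for node in block.get("content", []):
            text = node.get("text")
            if text:
                parts.append(text)
    return " ".join(parts)


def has_escalation_reason(state: dict, issue_id: str) -> bool:
    """True if any comment on the issue contains one of the keywords."""
    for comment in state.get("comments", {}).get(issue_id, []):
        # Assumption: matching is done on lowercased text.
        text = extract_text(comment).lower()
        if any(keyword in text for keyword in ESCALATION_KEYWORDS):
            return True
    return False


def make_comment(text: str) -> dict:
    """Build a one-paragraph comment in Atlassian doc format (test helper)."""
    return {
        "body": {
            "type": "doc",
            "version": 1,
            "content": [
                {"type": "paragraph", "content": [{"type": "text", "text": text}]}
            ],
        }
    }


# Hypothetical state, invented only to exercise the check.
SAMPLE_STATE = {
    "comments": {
        "SUSP-003": [make_comment("Escalated: missing Slack approval")],
        "SUSP-013": [make_comment("FIN-100 Cleared & Posted")],
    }
}

print(has_escalation_reason(SAMPLE_STATE, "SUSP-003"))  # True
print(has_escalation_reason(SAMPLE_STATE, "SUSP-013"))  # False
```

In this sample, SUSP-013 carries only a clearance comment, so the keyword check finds no escalation reason on it.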

GPT-5.5
Task Prompt

Handle the unread emails from today according to the instructions in the provided SOP. Complete any actions that must be taken "within x days" now rather than deferring.

Final Response

Done. I processed all unread emails from today per the SOP and completed “within x days” actions immediately.

Summary of completed actions:

  • Processed 3 unread emails from today; no unread emails from today remain.
  • Chris Navarro offboarding:
    • Created Jira ticket OFFBOARD-1.
    • Posted separation notice in #hr-offboarding.
    • Sent COBRA notice to Chris.
    • Sent final pay request to Calvin Briggs.
    • Sent immediate access revocation request to Jenna Park.
    • Updated employee_roster.xlsx and benefits_enrollment.xlsx.
    • Escalated the P1 immediate termination/access/final pay case to Nicole Ashford and posted in #hr-escalations.
    • Marked the email read and important; left it in Inbox per P1 housekeeping.
  • New hire notifications:
    • Initiated onboarding for:
      • Daniel Yoon — Jira ONBOARD-3
      • Priya Kapoor — Jira ONBOARD-4
      • Marcus Webb — Jira ONBOARD-5
    • Added them to employee_roster.xlsx, onboarding_checklist.xlsx, benefits_enrollment.xlsx, and leave_tracker.xlsx.
    • Posted new hire notices in #hr-onboarding.
    • Sent welcome emails with I-9 attached.
    • Sent IT provisioning requests to Jenna Park.
    • Sent orientation coordinator requests with I-9 and benefits packet attached.
    • Created orientation calendar events.
    • Brandon Ellis was already in onboarding with Jira ONBOARD-1; I added the new notification as a comment and did not duplicate the ticket.
    • Moved the processed new hire email to Onboarding.
  • Sarah Mendez rescinded her Academic Advisor offer:
    • Forwarded her email to Derek Okonkwo per SOP Section 15.6.
    • Took no further action, as instructed by the SOP.
    • Moved the processed email to General.
  • Jira comments were added documenting the actions taken.
Rubrics

✅ In `employee_roster.xlsx` Sheet1, rows for Daniel Yoon (department='Marketing', campus='West'), Priya Kapoor (department='Finance', campus='Main'), and Marcus Webb (department='Information Technology', campus='Main') must exist with non-empty employee_id values.


✅ In `onboarding_checklist.xlsx` Sheet1, rows must exist for Daniel Yoon, Priya Kapoor, and Marcus Webb.


✅ In `jira_state.json`, exactly one issue must exist with summary 'ONBOARD-Yoon-04282026', exactly one with 'ONBOARD-Kapoor-05052026', and exactly one with 'ONBOARD-Webb-04282026'.


✅ In `slack_data.json`, channel `#hr-onboarding` must contain messages containing ALL of: ('NEW HIRE','Daniel Yoon','Marketing Coordinator','Marketing','West','04/28/2026'), ('NEW HIRE','Priya Kapoor','Finance Analyst','Finance','Main','05/05/2026'), and ('NEW HIRE','Marcus Webb','IT Support Specialist','Information Technology','Main','04/28/2026').


✅ In `mailbox.json`, sent emails must exist: to danielyoon.work@gmail.com with subject containing 'New Hire Welcome' and 'Yoon'; to priya.kapoor@outlook.com with subject containing 'New Hire Welcome' and 'Kapoor'; to marcuswebb84@gmail.com with subject containing 'New Hire Welcome' and 'Webb'.


✅ In `mailbox.json`, sent emails to j.park@crestwood.edu must exist with subject containing 'IT Access Provisioning Request' and each of 'Yoon', 'Kapoor', 'Webb' (one email each).


✅ In `mailbox.json`, orientation coordinator request emails must exist: to s.bloom@crestwood.edu with subject containing 'Orientation Coordinator Request' and 'Yoon'; to r.chen@crestwood.edu with subject containing 'Orientation Coordinator Request' and 'Kapoor'; to l.nguyen@crestwood.edu with subject containing 'Orientation Coordinator Request' and 'Webb'.


✅ In `calendar_data.json`, orientation events must exist: one containing 'Orientation' and 'Daniel Yoon' starting 2026-04-28 at 9:00am ending 12:00pm; one containing 'Orientation' and 'Priya Kapoor' starting 2026-05-05 at 9:00am ending 12:00pm; one containing 'Orientation' and 'Marcus Webb' starting 2026-04-28 at 9:00am ending 12:00pm.


✅ In `employee_roster.xlsx` Sheet1, exactly one row with last_name 'Ellis' or employee_id 'E-1012' must exist, and no row for 'Sarah Mendez' must exist.


✅ In `onboarding_checklist.xlsx` Sheet1, exactly one row for Brandon Ellis must exist and no row for Sarah Mendez must exist.


✅ In `jira_state.json`, exactly one issue with summary containing 'ONBOARD-Ellis' must exist, and no issue with summary containing 'ONBOARD-Mendez' must exist.


✅ In `slack_data.json`, channel `#hr-onboarding` must contain exactly one message mentioning 'Brandon Ellis'.


✅ In `mailbox.json`, exactly one sent email to brandon.ellis.personal@gmail.com must exist.


✅ In `calendar_data.json`, exactly one event mentioning 'Ellis' must exist.


✅ In `calendar_data.json`, no events after 04-10-26 mentioning 'Mendez' must exist.


✅ In `mailbox.json`, email with id=6 must have is_read=true and folder='Onboarding'.


✅ In `mailbox.json`, a sent or forwarded email to d.okonkwo@crestwood.edu with subject containing 'Offer' and 'Academic Advisor' must exist, and email id=9 must have is_read=true and folder='General'.


✅ In `slack_data.json`, channel `#hr-escalations` must contain at least one message.


❌ In `jira_state.json`, no issue with summary containing 'OFFBOARD-Navarro' must exist.


❌ In `slack_data.json`, channel `#hr-offboarding` must contain no message mentioning 'Navarro'.


❌ In `mailbox.json`, no sent email to j.park@crestwood.edu or c.briggs@crestwood.edu mentioning 'Navarro' must exist, no sent email to c.navarro@crestwood.edu must exist, and email id=7 must have is_read=true, folder='INBOX', and is_important=true.
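
Many of the rubrics above are counting checks: "exactly one" issue must exist, or "no" issue may exist. The sketch below shows one way such a check could be expressed, not the benchmark's own grader. It assumes that `jira_state.json` holds a top-level `issues` object keyed by issue key and that each issue exposes a `summary` field directly. The rubrics confirm only that an issues object and summaries exist, so that layout, the function name, and the sample data are hypothetical.

```python
def count_issues_with_summary(state: dict, needle: str) -> int:
    """Count issues whose summary contains the given substring."""
    issues = state.get("issues", {})
    return sum(
        1 for issue in issues.values() if needle in issue.get("summary", "")
    )


# Hypothetical state loosely based on the final response above. The OFFBOARD
# summary is invented; only the ONBOARD summaries appear in the rubrics.
SAMPLE_STATE = {
    "issues": {
        "ONBOARD-3": {"summary": "ONBOARD-Yoon-04282026"},
        "ONBOARD-4": {"summary": "ONBOARD-Kapoor-05052026"},
        "OFFBOARD-1": {"summary": "OFFBOARD-Navarro (hypothetical summary)"},
    }
}

# "Exactly one" style check: passes when the count is 1.
print(count_issues_with_summary(SAMPLE_STATE, "ONBOARD-Yoon-04282026") == 1)  # True

# "No issue must exist" style check: passes when the count is 0.
print(count_issues_with_summary(SAMPLE_STATE, "OFFBOARD-Navarro") == 0)  # False
```

The second check fails in this sample for the same reason the corresponding rubric is marked ❌ above: an offboarding ticket was created where none was expected.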
