The Unglamorous Lifeblood of the Economy

Parsing PDFs isn't the sexiest area of AI research. It doesn't produce viral videos, code flashy apps, or generate splashy headlines. But PDFs are the unglamorous lifeblood of the global economy – capturing every medical record, earnings report, contract, and invoice.

They’re also the lifeblood of AI agents. If we expect autonomous agents to genuinely transform day-to-day work, they have to natively master these formats: reading them, organizing them, cross-referencing dense data, and accurately filling them out.

When models fail at this level, the consequences are serious:

  • Finance — A model transposes two numbers from a quarterly earnings table, and a fabricated margin profile circulates in a buy-side memo.
  • Legal — A model hallucinates the location of a liability cap in a commercial lease, leading to catastrophic legal advice.
  • Healthcare — A model pulls the wrong row from a drug interaction chart, creating a life-threatening patient safety hazard.

These all happened in our testing.

Measuring the Essential: GDP.pdf

To measure the unsexy-but-essential work that keeps the economy moving, we built GDP.pdf. We're releasing its public set today on Hugging Face.

GDP.pdf is an expert multimodal and reasoning benchmark. It consists of 100 real-world prompts and PDFs pulled directly from professional workflows across ten domains: Finance, Healthcare, Legal, STEM/Research, Engineering, Construction, Manufacturing/Supply Chain, Insurance, Real Estate, and HR.

Every task requires parsing, understanding, and synthesizing complex PDFs: interpreting a multi-page dosage table, isolating an indemnification clause buried in nested exhibits, or reconciling revenue figures across quarterly filings.

The result: Every frontier model scored under 30%.

GDP.pdf Benchmark Scores
| Rank | Model | Score (100% rubrics) | Mean criteria pass rate |
|---|---|---|---|
| 1 | GPT-5.5 (xHigh reasoning) | 25% | 76.76% |
| 2 | Claude Opus 4.8 (Adaptive Max) | 23% | 76.94% |
| 3 | Claude Opus 4.7 (Adaptive Max) | 21% | 76.24% |
| 4 | Gemini 3.1 (Pro) | 17% | 73.07% |
| 5 | Gemini 3.5 Flash | 14% | 72.95% |
| 6 | Kimi K2.6 | 12% | 69.10% |
| 7 | Gemini 3 Flash | 10% | 64.83% |
| 8 | Grok 4.3 (High) | 8% | 57.03% |
| 9 | Mistral Large 3 | 2% | 50.98% |
| 9 | NVIDIA Nemotron 3 Nano Omni | 2% | 42.25% |
| 9 | Nova 2 (Pro) | 2% | 40.03% |
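
The gap between the two score columns is easier to see with a small worked example. The post does not define either metric, so the sketch below assumes the most natural reading of the column names: "Score (100% rubrics)" is the share of tasks where a model passes every rubric criterion, and "Mean criteria pass rate" is the per-task fraction of criteria passed, averaged over tasks. The function names and the four-task data set are hypothetical, not taken from the benchmark.

```python
from statistics import mean

# Assumed definitions (not confirmed by the source):
# each task is a list of booleans, one per rubric criterion.

def strict_score(results):
    """Fraction of tasks where every rubric criterion passed."""
    return mean(1.0 if all(task) else 0.0 for task in results)

def mean_criteria_pass_rate(results):
    """Average, over tasks, of the fraction of criteria passed."""
    return mean(sum(task) / len(task) for task in results)

# Hypothetical outcomes for four tasks with four criteria each.
results = [
    [True, True, True, True],    # fully correct
    [True, True, True, False],   # one miss
    [True, False, True, True],   # one miss
    [True, True, False, False],  # two misses
]

print(f"strict score: {strict_score(results):.0%}")                       # 25%
print(f"mean criteria pass rate: {mean_criteria_pass_rate(results):.2%}")  # 75.00%
```

Under these assumptions, a model can satisfy three quarters of all criteria and still fully complete only one task in four, which is the same shape as the top rows of the table: a single missed criterion is enough to fail a task on the strict score.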

Into the Real World

At Surge, we often build benchmarks for the ceiling. Hemingway-bench measures models' progress toward the Booker Prize. CoreCraft measures their ability to run a chaotic startup. Riemann-bench tests whether they can solve moonshot mathematics. We care about what's possible at the frontier.

We built GDP.pdf because real-world economic utility matters just as much. A model that can theorize about the Riemann hypothesis but gets lost in the fine print of a commercial lease is simply an intelligent liability.

Before we trust AI agents to manage the high-stakes workflows that drive the economy, they need to be able to master the complex paperwork that sustains it.

The full GDP.pdf benchmark results and failure examples are available on our site, and the public set can be found on Hugging Face.
