GDP.pdf Benchmark: Can Frontier Models Master the Documents that Run the World?

The unglamorous lifeblood of the economy

Parsing PDFs isn't the sexiest area of AI research. It doesn't produce viral videos, code flashy apps, or generate splashy headlines. But PDFs are the unglamorous lifeblood of the global economy, capturing every medical record, earnings report, contract, and invoice.

They’re also the lifeblood of AI agents. If we expect autonomous agents to genuinely transform day-to-day work, they have to natively master these formats: reading them, organizing them, cross-referencing dense data, and accurately filling them out.

When models fail at this level, the consequences are serious:

Finance. A model transposes two numbers from a quarterly earnings table, and a fabricated margin profile circulates in a buy-side memo.
Legal. A model hallucinates the location of a liability cap in a commercial lease, leading to catastrophic legal advice.
Healthcare. A model pulls the wrong row from a drug interaction chart, creating a life-threatening patient safety hazard.

Every one of these happened in our testing.

Measuring the essential: GDP.pdf

To measure the unsexy-but-essential work that keeps the economy moving, we built GDP.pdf. Its public set is available on Hugging Face here.

GDP.pdf is an expert multimodal and reasoning benchmark. It consists of 100 real-world prompts and PDFs pulled directly from professional workflows across ten domains: Finance, Healthcare, Legal, STEM/Research, Engineering, Construction, Manufacturing/Supply Chain, Insurance, Real Estate, and HR.

Every task requires parsing, understanding, and synthesizing complex PDFs: interpreting a multi-page dosage table, isolating an indemnification clause buried in nested exhibits, reconciling revenue figures across quarterly filings.

The result: Every frontier model scored under 30%.

GDP.pdf Benchmark Scores
Rank	Model	Score (100% rubrics)	Mean criteria pass rate
1	GPT-5.5 (xHigh reasoning)	25%	76.76%
2	Claude Opus 4.8 (Adaptive Max)	23%	76.94%
3	Claude Opus 4.7 (Adaptive Max)	21%	76.24%
4	Gemini 3.1 (Pro)	17%	73.07%
5	Gemini 3.5 Flash	14%	72.95%
6	Kimi K2.6	12%	69.10%
7	Gemini 3 Flash	10%	64.83%
8	Grok 4.3 (High)	8%	57.03%
9	Mistral Large 3	2%	50.98%
9	NVIDIA Nemotron 3 Nano Omni	2%	42.25%
9	Nova 2 (Pro)	2%	40.03%
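
The two score columns reward different things. Here is a minimal sketch of how they could be computed, assuming "Score (100% rubrics)" counts a task only when every rubric criterion passes, and "Mean criteria pass rate" averages the per-task fraction of criteria passed. Both definitions are our reading of the column headers, and the `TaskResult` type, task names, and sample data are hypothetical, not taken from the actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical rubric outcome for one task: one pass/fail flag per criterion."""
    task_id: str
    criteria: list[bool]  # assumed non-empty


def strict_score(results: list[TaskResult]) -> float:
    """Fraction of tasks where every rubric criterion passed (assumed definition)."""
    return sum(all(r.criteria) for r in results) / len(results)


def mean_criteria_pass_rate(results: list[TaskResult]) -> float:
    """Average over tasks of the fraction of criteria passed (assumed definition)."""
    per_task = [sum(r.criteria) / len(r.criteria) for r in results]
    return sum(per_task) / len(per_task)


# Made-up results for four illustrative tasks, four criteria each.
results = [
    TaskResult("lease-review", [True, True, True, True]),
    TaskResult("dosage-table", [True, True, True, False]),
    TaskResult("revenue-recon", [True, False, True, True]),
    TaskResult("invoice-fill", [True, True, False, False]),
]

print(f"strict score: {strict_score(results):.0%}")                        # 25%
print(f"mean criteria pass rate: {mean_criteria_pass_rate(results):.2%}")  # 75.00%
```

In this toy data, only one task in four is fully correct even though three quarters of the criteria pass. The table above has the same shape: the top model passes about 77% of criteria but fully completes only 25% of tasks.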

Into the real world

At Surge, we often build benchmarks for the ceiling. Hemingway-bench measures models' progress toward the Booker Prize. CoreCraft measures their ability to run a chaotic startup. Riemann-bench tests whether they can solve moonshot mathematics. We care about what's possible at the frontier.

We built GDP.pdf because real-world economic utility matters just as much. A model that can theorize about the Riemann hypothesis but gets lost in the fine print of a commercial lease is simply an intelligent liability.

Before we trust AI agents to manage the high-stakes workflows that drive the economy, they need to be able to master the complex paperwork that sustains it.

View the full GDP.pdf benchmark results and failure examples here. The public set can be found on Hugging Face here. The GitHub repository with the evaluation harness can be found here. Reach out to benchmarks@surgehq.ai if you have any questions!