Blog
How we teach AI, and what we learn along the way.
October 10, 2025
A Product Take on Sonnet 4.5
After 100+ hours with Opus 4.1 and 20+ hours in the first week of Sonnet 4.5's launch, Nick Heiner, our VP of Product, shares his first impressions.
Read Post
October 8, 2025
Is Sonnet 4.5 the best coding model in the world?
On Surge AI’s agentic coding benchmark, Claude Sonnet 4.5 outperformed GPT-5-Codex on accuracy, while GPT-5-Codex was more cost-efficient. Despite similar overall scores, the two models failed on distinct sets of tasks. In a refactoring case study, Claude succeeded after persistent debugging, while GPT-5-Codex failed by ending the task early without explanation. Both stayed focused and avoided hallucinations even when they ran into difficulties.
Read Post
September 29, 2025
The Human/AI Frontier: A Conversation with Bogdan Grechuk
At Surge AI, we work with the world’s sharpest minds to push the limits of AI. Bogdan Grechuk—an IMO gold medalist and Associate Professor at the University of Leicester—is one of them. We interviewed him about his work training SOTA models to perform frontier research.
Read Post
September 22, 2025
Unsexy AI Failures: Still Confidently Hallucinating Image Text
A core problem with today’s AI systems isn’t simply that they make mistakes – it’s that they make mistakes confidently. They’ll insist they can do something, describe exactly how they’ll do it, and then deliver something completely wrong. We saw this in our last Unsexy Failures post, where a SOTA model confidently described generating a Word document – even though this was a completely fabricated capability! – and provided a link to nowhere.
Read Post
September 15, 2025
SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations
When coding models spiral into self-reinforcing hallucinations, small mistakes compound into catastrophic failure. In SWE-bench, we saw SOTA models invent whole classes, methods, and terminal outputs—never realizing they had lost touch with the real codebase. In this case study, we’ll look at how three frontier coding agents tried to solve one particular SWE-bench problem: one spiraled into hallucinations and failed entirely, one spiraled but recovered, and one avoided hallucinations altogether. Our goal: to illustrate how dissecting real-world problems can steer models towards human-ready AGI.
Read Post
September 7, 2025
Benchmarks are broken
Academic benchmarks make great headlines, and terrible AI.
Read Post
August 25, 2025
Unsexy AI Failures: The PDF That Broke ChatGPT
The AI world loves climbing leaderboards. Companies race to hit #1 on LMSYS, chase perfect scores on academic benchmarks, and demo SVGs of pelicans on bicycles. These achievements make for great headlines and impressive presentations – even when these metrics are easily hacked.
Read Post
August 15, 2025
Bringing light to the GPT-4o vs. GPT-5 personality controversy
GPT-5 was released on Aug 7, 2025. The swift removal of all legacy models from the ChatGPT UI was met with an even swifter backlash: some people online felt that GPT-4o was more personable, human, and engaging, whereas GPT-5 was stiff and robotic. One viral meme encapsulated the faction’s thesis.
Read Post
August 1, 2024
DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
An update on Astral Codex Ten's Image Generation Bet: close, but no dice. DALL·E 3 and Midjourney fail.
Read Post
March 9, 2023
How Anthropic uses Surge AI to Train and Evaluate Claude
Learn how Anthropic partnered with Surge AI to gather high-quality human feedback at scale using our RLHF platform, resulting in one of the safest and most advanced large language models on the planet.
Read Post
December 21, 2022
We Evaluated ChatGPT vs. Google on 500 Search Queries
We measured ChatGPT vs. Google on 500 search queries, and found that ChatGPT crushes Google on coding and ties it on general information — despite not being optimized for a search experience at all. Dive into this post to learn more about OpenAI’s existential threat to Google.
Read Post
December 12, 2022
AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
How do you make large language models safer and adversarially robust to counterattacks? Learn about AI red teams of creative data labelers who interactively try to penetrate AI defenses in order to strengthen them.
Read Post
December 4, 2022
HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors
We analyzed HellaSwag, a popular LLM benchmark, and found errors in 36% of its rows.
Read Post
October 25, 2022
How TikTok is Evolving the Next Generation of Search
TikTok has been taking over the world — and now, your Google Search results too. But when are they actually helpful? We ran a large-scale personalized human evaluation, asking Surgers to rate hundreds of <query, TikTok> pairs to find out.
Read Post
September 29, 2022
Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
Has Astral Codex Ten's bet on AI progress really been won? We asked Surgers to evaluate DALL·E and Imagen on Scott's 5 compositionality prompts!
Read Post
August 31, 2022
Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
Why can't Meta A/B test its way back to greatness? To move Instagram beyond short-term engagement metrics, we ran a personalized human evaluation asking 100 users to compare TikTok vs. Instagram Reels. Learn why Gen Z considers Reels the place where TikToks go to die, and what Instagram should do about it.
Read Post
August 15, 2022
The $250K Inverse Scaling Prize and Human-AI Alignment
Surge AI is partnering with NYU and the Fund for Alignment Research on the Inverse Scaling Prize. If you've found a task with LLM inverse scaling properties, and need help creating a dataset of 300-500+ examples, reach out. We’re a human alignment platform with deep expertise in training large language models on human feedback, and we’re here to help – including $500 of free data labeling credits to kickstart your submission.
Read Post
July 29, 2022
Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
Search quality measurement is one of the trickiest, but most important parts of building Search. Read how Neeva uses human evaluation of search quality to build a state-of-the-art search engine challenging Google.
Read Post
July 19, 2022
Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?
Hugging Face's BLOOM is a new 176B parameter multilingual large language model. How does it compare to other state-of-the-art LLMs? We ran a human evaluation across 7 real-world categories to evaluate its performance.
Read Post
July 11, 2022
30% of Google's Emotions Dataset is Mislabeled
Last year, Google released their “GoEmotions” dataset: a human-labeled dataset of 58K Reddit comments categorized according to 27 emotions. The problem? A whopping 30% of the dataset is mislabeled! Check out some of the egregious errors, and learn how to build better datasets.
Read Post
June 28, 2022
AI Red Teams and Adversarial Data Labeling with Redwood Research
Our mission at Surge AI is to inject human values and intelligence into AI. We want to build a world where AI…
Read Post
June 22, 2022
Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
Gary Marcus has collected several examples of AI mistakes. But are they really failures, or a sign of creativity? We gave the same prompts behind GPT-3’s “mistakes” to 15 Surgers to see how humans would perform instead.
Read Post
June 13, 2022
How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems
We built a dataset of 8,500 Grade School Math Problems for OpenAI. The goal of the dataset: to train language models like GPT-3 to solve natural language math problems and measure their reasoning ability. Learn about our process in this blog post!
Read Post
May 12, 2022
We asked 100 humans to draw the DALL·E prompts
Where do human artists fit in a world of rich, creative AI? We asked 100 Surgers to draw the DALL·E prompts.
Read Post
April 29, 2022
The average number of ads on a Google Search recipe? 8.7
We ran a large-scale human evaluation to count the average number of ads on recipe pages surfaced by Google Search.
Read Post
April 12, 2022
Google Search is Falling Behind
Google Search is falling behind. We analyzed three areas – programming queries, sports queries, and cooking queries – to understand where Google Search lags behind its competitors.
Read Post
February 10, 2022
Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
Social media platforms optimize for clicks and engagement — but those same short-term optimizations drive clickbait, toxic content, and misinformation. How can we align their ML systems to human values instead? This post describes a data-driven approach with Facebook.
Read Post
January 22, 2022
Holy $#!t: Are popular toxicity models simply profanity detectors?
Are popular toxicity models simply profanity detectors? We show how toxicity models overweight profanity, and make mistakes when profanity is used in a positive way.
Read Post
January 10, 2022
Is Google Search Deteriorating? Measuring Google's Search Quality in 2022
Has Google's Search Quality deteriorated in recent years? This post measures Google Search using human evaluation.
Read Post
November 19, 2021
5 Examples of the Importance of Context-Sensitivity in Data-Centric AI
Data-centric AI requires radically rethinking the data that goes into your models. Surge AI provides data labelers with the skills needed to deliver context-sensitive labels.
Read Post
August 2, 2021
The AI Bottleneck: High-Quality, Human-Powered Data
In theory, AI has blown past our wildest dreams; in practice, Siri can’t even tell us the weather. The problem? Creating high-quality datasets to train and measure our models is still incredibly difficult. We should be able to gather 20,000 labels for training a Reddit classifier in a single…
Read Post