Blog
Bringing light to the GPT-4o vs. GPT-5 personality controversy
How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems
How Anthropic uses Surge AI to Train and Evaluate Claude
Holy $#!t: Are popular toxicity models simply profanity detectors?
We asked 100 humans to draw the DALL·E prompts
The Human/AI Frontier: A Conversation with Bogdan Grechuk
Unsexy AI Failures: Still Confidently Hallucinating Image Text
SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations
Benchmarks are broken
Unsexy AI Failures: The PDF That Broke ChatGPT
DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet
We Evaluated ChatGPT vs. Google on 500 Search Queries
AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust
HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors
How TikTok is Evolving the Next Generation of Search
Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?
Why Instagram is Losing Gen Z: We Asked 100 Users to Compare TikTok vs. Reels
The $250K Inverse Scaling Prize and Human-AI Alignment
The AI Bottleneck: High-Quality, Human-Powered Data
Is Google Search Deteriorating? Measuring Google's Search Quality in 2022
Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality
The average number of ads on a Google Search recipe? 8.7
Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?
AI Red Teams and Adversarial Data Labeling with Redwood Research
Google Search is Falling Behind
Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values
30% of Google's Emotions Dataset is Mislabeled
10 Egregious Failures in Gmail Spam Detection
5 Examples of the Importance of Context-Sensitivity in Data-Centric AI
Human Evaluation of Large Language Models: How Good is Hugging Face’s BLOOM?