Surge AI | Human Infrastructure for NLP

Edwin Chen

DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet

How Anthropic uses Surge AI’s RLHF platform to train their LLM Assistant on Human Feedback

Large Language Models

How RLHF Shifts LLMs from Autocompletion to Conversational Understanding

Introduction to Reinforcement Learning with Human Feedback

Large Language Models

2022 Blog Recap: Trends in AI, Language, & Data

We Evaluated ChatGPT vs. Google on 500 Search Queries

Large Language Models

AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs Adversarially Robust

Large Language Models

HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors

Large Language Models

How TikTok is Evolving the Next Generation of Search

Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI Progress?

The $250K Inverse Scaling Prize and Human-AI Alignment

Large Language Models

Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?

Large Language Models

30% of Google's Emotions Dataset is Mislabeled

Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality

Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI failure actually a failure?

How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems

10 Egregious Failures in Gmail Spam Detection

Content Moderation

We asked 100 humans to draw the DALL·E prompts

Google Search is Falling Behind

Holy $#!t: Are popular toxicity models simply profanity detectors?

Content Moderation

Is Google Search Deteriorating? Measuring Google's Search Quality in 2022

Human Evaluation

The AI Bottleneck: High-Quality, Human-Powered Data

Welcome to the world's largest RLHF platform

The latest in AI, language, and RLHF
