The newest batch of large language models – like Anthropic’s Claude and OpenAI’s ChatGPT – is a seismic step change over the previous generation of LLMs, including GPT-3.
What exactly makes them so different?
At a high level, the previous generation of LLMs was designed to predict the next token in a sequence. The new generation of models, in contrast, is trained – via reinforcement learning from human feedback (RLHF) – to satisfy human requests.
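Concretely, “predicting the next token” just means continuing whatever text the model is given. Here’s a minimal sketch of that behavior, using GPT-2 – a small, open-source base model – as a stand-in for the original davinci:

```python
# A base (non-RLHF) model simply extends its input with the
# statistically likeliest continuation. It has no notion of
# "instructions" at all.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write an email to my mother wishing her Merry Christmas"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Nothing in this loop treats the prompt as a request; the model just picks likely next tokens – which is exactly why the completions below go off the rails.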
To understand what this means, let’s look at a few examples.
Examples of RLHF vs. non-RLHF
To drive home the difference between non-RLHF large language models (not trained on human feedback; essentially “next token” predictors) and RLHF large language models (trained on human feedback to satisfy your requests, not merely predict your next word), let’s walk through a few examples.
Example #1: Writing an email
Prompt: Write an email to my mother wishing her Merry Christmas, and ask how dad is doing.
Non-RLHF output
This is what a non-RLHF model (the original davinci) produces:
🤖 Read the book I got for Christmas: The Screwtape Letters by C. S. Lewis
Finish my Christmas shopping
Watch the Grinch Who Stole Christmas (good one)
Finish Christmas cards.
Go to church
Notice that the non-RLHF model interprets the command as a bullet in a list of todos, not as an instruction!
In contrast, when training RLHF models, Surgers (our data labelers and AI teachers) provide feedback to the LLM that this isn’t a good response.
RLHF output
Here’s what you get when you provide the same prompt to an RLHF model trained on that human feedback.
🤖 Hi Mom,
Merry Christmas! I hope you're doing well. I'm wondering how Dad is doing. I hope he's doing well. I hope to talk to you soon.
Take care!
This time, the LLM succeeds!
If we want to get the non-RLHF model to work, we need to contort the prompt instead.
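One standard contortion is to frame the request as the opening of the email itself, so that plain next-token continuation produces what we want. Here’s a sketch of the idea, using the legacy (pre-1.0) openai-python completions API – the exact prompt is illustrative:

```python
# Frame the instruction so that simply continuing the text yields
# the email. Assumes OPENAI_API_KEY is set in the environment.
import openai

prompt = (
    "Write an email to my mother wishing her Merry Christmas, "
    "and ask how dad is doing.\n\n"
    "Dear Mom,\n"
)
completion = openai.Completion.create(
    model="davinci",   # the original base model
    prompt=prompt,
    max_tokens=100,    # with no reliable stop signal, we just cap the length
    temperature=0.7,
)
print("Dear Mom,\n" + completion.choices[0].text)
```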
But even then, it doesn’t reliably know when to stop – the completion reads like what you’d find on a random Internet webpage about things to do on Christmas.
Example #2: Solving math problems
Prompt: What is 2 + 2?
Non-RLHF output
🤖 What is 5 + 1?”
The result was the same. He couldn’t remember the answer to either question.
Not only does the non-RLHF model fail to answer the question, it thinks the question is part of a story dialogue!
RLHF output
The RLHF model correctly interprets the prompt as a question to answer, and responds with 4.
(Sidenote: to train RLHF models to perform math, we build teams of Surgers with advanced mathematical backgrounds – math PhDs to help LLMs learn to conduct mathematical research, and competitive mathletes to train them to solve IMO problems – along with data generation tools to support them, like embedding-based assistants that aid diversity and verifiers that aid correctness.)
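To give a flavor of what those verifiers do, here’s a toy sketch for simple arithmetic prompts – it checks a model’s answer against independently computed ground truth. (Real verifiers for research-level math are far more involved; this is just the shape of the idea.)

```python
# Toy verifier for prompts of the form "What is A + B?" (hypothetical
# example). It accepts the first integer in the model's answer and
# compares it against the independently computed sum.
import re

def verify_arithmetic(prompt: str, model_answer: str) -> bool:
    """Return True if the model's answer matches the true sum."""
    match = re.search(r"What is (\d+) \+ (\d+)\?", prompt)
    if not match:
        raise ValueError("Prompt is not simple addition")
    truth = int(match.group(1)) + int(match.group(2))
    answered = re.search(r"-?\d+", model_answer)
    return answered is not None and int(answered.group()) == truth

print(verify_arithmetic("What is 2 + 2?", "4"))                # True
print(verify_arithmetic("What is 2 + 2?", "The answer is 5"))  # False
```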
Example #3: Generating code
Prompt
🧑 Help me write a Python function that outputs whether a number is even.
Non-RLHF output
🤖 My code so far
```python
def isEven(n):
    if n > 0:
        if n % 2 == 0:
            return True
        else:
            return False
```
I know this will not work.
Here's the full code:
Note that the non-RLHF model interprets the prompt not as a command, but rather as the beginning of a StackOverflow post!
Even though it essentially gets the code correct – it “knows” how to write code, in some buried, latent sense – it surrounds the code with extraneous StackOverflow-like text.
Unearthing those latent abilities is part of the magic of RLHF, and why we do a lot of advanced, Copilot-style RLHF training for customers as well.
RLHF output
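The RLHF model, by contrast, treats the prompt as a request and replies with a clean, working function – something along these lines:

```python
def is_even(n):
    """Return True if n is even, False otherwise."""
    return n % 2 == 0
```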
Want to learn more about the process, tools, and quality control technology needed to build rich RLHF datasets? Want to train your own ChatGPT competitor? We work with large language model companies around the world, on applications like:
- Training LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
- Training them to code
- Training them to solve advanced math and science problems
Low-quality human datasets from body shops just don’t cut it anymore. Reach out to rlhf@surgehq.ai, and check out our LLM blog posts and research papers in the meantime.