Building a No-Code Machine Learning Model by Chatting with GitHub Copilot

Last year, GitHub and OpenAI released Copilot, an AI-powered coding assistant built on top of Codex. It's incredible – even the Harvard and Google engineers on our team find it speeds up a lot of their work.

But as helpful as it is for coders, what if it enabled non-engineers to program too – by merely talking to an AI about their goals? Let’s walk through an example.

Building a Toxicity Classifier

So imagine you want to build a machine learning model to classify toxic speech. You understand ML concepts at a high level, but you’re not a Python expert. We’ll show how Copilot can help!

In this example, we’re using the Copilot extension for Visual Studio Code, and a free toxicity dataset that we built; you can follow along by downloading the Jupyter notebook here.

Each time we want to issue Copilot a command, we’ll write our directive as a comment (in green); Copilot's actions are the subsequent lines of code.

So let’s create a Jupyter notebook and start by telling Copilot to import all the programming libraries we need to train a toxicity classifier. We write this comment in green:

Its first response is to "import numpy as np"!

We let it continue until it starts repeating itself, leading to the following:

All 28 imports were generated by Copilot. Next, we ask it to read in our dataset.

Remember: our only contribution will be the green #comments – Copilot created the toxicity = pd.read_csv("toxicity.csv") line itself.

What does this dataset look like? We ask it to display the file...

...and it successfully responds by showing the first 5 rows.

How many examples are in each class?

Copilot responds, unfortunately, with a line of code that doesn’t quite work – there’s no 'target' column or variable in our dataset.

It would be easy to fix this by replacing 'target' with 'is_toxic', but where’s the fun in that? Does it help if we make our request more explicit?

It does!

We can even ask it to create a bar chart:

Great, but it’s visually bland. Can Copilot pretty it up?

Let’s ask it to flip the axes, so that the bars are horizontal.

That didn’t work, but let’s modify our comment and try again:

👌.

Can we also change the color of the bars?

Let’s give the bar chart a title too.

Who needs highly paid data scientists now?

Again, Copilot created all of this code simply through a “conversation” with it! As someone who doesn’t generate many data visualizations in Python, I couldn't have easily done this myself, without a lot of Googling and reading documentation.

Imagine, in the future, Copilot creating this bar chart (and other insights) from a higher-level command: “analyze the dataset and produce useful insights”. In the same way that OpenAI’s GPT-3 can understand important aspects of text, what if Copilot could automatically detect the important columns in a data frame, and predict that we might want to see the class balance or a precision-recall curve?

To train AI systems to perform these kinds of advanced science and engineering tasks – whether it’s training AI to create data visualizations, solve math problems, or summarize research papers – we often partner with companies developing large language models and build them advanced “labeling” teams of programmers, mathematicians, and STEM graduates. For example, some of our customers upload thousands of dense scientific papers, and our scientific labeling teams write short “outlines” of these papers to train outline-writing summarization models. Other customers send natural language commands (e.g., “compute the average revenue per week over this dataset”), and our programming labeling teams write code to train code generation models that solve them. As AI gets more advanced, we need these richer types of “data labeling” systems to feed them.

Moving on, let’s now ask Copilot to build a toxicity ML model using the dataset.

It produced the following as a result – both the subsequent comments ("# using a Naive Bayes classifier" onwards) as well as the pipeline code!

(Of course, Naive Bayes classifiers aren't very good: they can't capture the context that's important for language applications, and it's likely they'll simply produce a profanity detector as a result. But this is simply the first line that Copilot generated, given my very general ask — just imagine what it will do once human-generated examples train it to produce state-of-the-art models instead!)

I don’t know what a Pipeline is, but let’s ask Copilot to use it to classify regardless.

When I don’t know what to do next (what do I use a pipeline for?), and Copilot doesn’t automatically create additional continuations, I simply create a new cell, write #, and let Copilot respond – letting it predict what I might want to ask!

In this case, it completes the comment with “predict the toxicity of a new text”...

…which it then continues with the following.

Our classifier predicts it correctly, ensuring a good shelf life for paradoxes.

Can we also get Copilot to create some test cases for us to try?

It misunderstands our original ask...

…so we’ll give it some examples. When generating the array, it even creates the ideal variable name and escapes the quotations.

Let’s test it out. Even though I had a typo in my instructions, it gets them all right!

Let’s test it on non-toxic comments too.

Of course, we should measure how the model performs on a holdout test set, so we’ll ask Copilot to split the original dataset into training and test.

It continues with the following (after we create new cells and begin them with an empty comment)!

Let’s ask Copilot how well the model performs.

It even automatically suggests printing a classification report next!

Can we ask Copilot to plot a precision-recall curve?

Unfortunately, its first attempt fails:

‍

In the meantime, it seems like the error has something to do with the fact that y_test and y_prod are arrays of strings (“Toxic”, “Not Toxic”), so let’s try binarizing them.

Our first command leads to buggy code:

But once again, after we make our instruction more detailed, it gets much better!

Let’s also regenerate our predictions:

And ask Copilot to create a precision-recall curve.

‍

Once again, let’s try prettying it up…

Et voilà. Tufte couldn't have done it better himself.

Andrew Mauboussin

Andrew oversees Surge AI's Engineering and Machine Learning teams. He previously led Twitter's Spam and Integrity efforts, and studied Computer Science at Harvard.

Data Labeling 2.0 for Rich, Creative AI

Superintelligent AI, meet your human teachers. Our data labeling platform is designed from the ground up to train the next generation of AI — whether it’s systems that can code in Python, summarize poetry, or detect the subtleties of toxic speech. Use our powerful data labeling workforce and tools to build the rich, human-powered datasets you need today.