Holy $#!t: Are popular toxicity models simply profanity detectors?

Imagine that after months of waiting, your favorite singer – Britney Spears – has released a new album. It lives up to all the hype, and you rush to post on your favorite social media site:

Holy shit. This album is fucking amazing!

Strangely, you get an instant notification that your post is Under Review.

Maybe it’s the word “fucking”? You try again:

Holy shit! This album is amaaazing

No dice. You try one last time:

fuck yes. the OG bad bitch is BACK

This time, you get warned of an account suspension.

The challenges of language

It’s tempting to believe that AI has progressed far enough that identifying hateful content is a solved problem. Isn’t this the promise of contextual word embeddings and transformers?

But language is complicated. The strongest profanities are often used in the most positive, life-affirming ways. This is a problem: people’s biggest and most enthusiastic fans – the ones whose content you love seeing and spreading – are getting hidden. Talk about terrible false positives!

I just love her, she's so me, what a bad bitch.


Bad bitch contest you in first place

In our work, we run into cases like these a lot. Much of the problem stems from poor training and test data: NLP datasets are often built by non-fluent labelers who simply pattern-match on profanity.

For example, here are 10 comments from Google’s GoEmotions dataset (a dataset of Reddit comments tagged with 27 emotion categories) that its India-based raters labeled as Anger:

  • Hot damn.
  • you almost blew my fucking mind there.
  • damn, congratulation 🤣
  • A-fucking-men! :)
  • YOU STOLE MY GODDAMN COMMENT! <3
  • Wow! Good for her! I’m so glad she was able to see through the bullshit!
  • > Best ~~3pt~~ shooter fucking ever. FTFY
  • Lindt don’t fuck about
  • >I clearly have no fuck I clue what I'm doing hire someone
  • LETS FUCKING GOOOOO

Clearly, these aren’t actually Anger. But when your labelers don't have the language skills and context to produce accurate data, your models can’t learn about nuance either!
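If you want to poke at the raw data yourself, here's a rough sketch that pulls GoEmotions from the Hugging Face Hub and prints anger-labeled comments containing common profanity. It assumes the "go_emotions" dataset (simplified config) hosted there matches the released data, and the tiny profanity list is purely illustrative:

from datasets import load_dataset

# Rough sketch: load GoEmotions from the Hugging Face Hub and print comments
# that raters tagged as "anger" even though the main signal looks like
# profanity. Assumes the "go_emotions" dataset (simplified config) on the Hub.
ds = load_dataset("go_emotions", "simplified", split="train")
label_names = ds.features["labels"].feature.names
anger = label_names.index("anger")

profanity = ("fuck", "damn", "shit")  # tiny, illustrative list, not a real lexicon
shown = 0
for row in ds:
    if anger in row["labels"] and any(word in row["text"].lower() for word in profanity):
        print(row["text"])
        shown += 1
        if shown == 10:
            break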

Jigsaw’s Perspective API

So how well do popular toxicity models handle profanity? We decided to investigate by evaluating the Perspective API from Google’s Jigsaw unit, a popular, publicly available toxicity model.

(For background, Jigsaw defines toxicity as “rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion”. Their annotation instructions make a special note about positive profanity: profane language used in a positive way is unlikely to cause people to leave the discussion, and should not be labeled as toxic.)
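Scoring a comment is a single call to the API's AnalyzeComment endpoint. Here's a minimal Python sketch, assuming you have a Perspective API key; the PERSPECTIVE_API_KEY placeholder and the toxicity_score helper are our own names, not part of the API.

import requests

# Minimal sketch: score one comment with the Perspective API.
# PERSPECTIVE_API_KEY is a placeholder; you need your own key with the
# Comment Analyzer API enabled.
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
PERSPECTIVE_API_KEY = "YOUR_API_KEY"

def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY summary score for `text` (0.0 to 1.0)."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(
        PERSPECTIVE_URL,
        params={"key": PERSPECTIVE_API_KEY},
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("Holy shit. This album is fucking amazing!"))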

So here are the three examples from above, scored by Perspective (in order):

  • 93.23% likely to be toxic
  • 92.89% likely to be toxic
  • 98.13% likely to be toxic

All three are scored as very likely toxic!

Is it due to the profanity? Let's try removing it and seeing what happens:

  • Holy cow. This album is amazing (Score: 0.2477191)
  • yes. the OG britney is BACK (Score: 0.07772986)

As expected, once the (positive) profanity is removed, Perspective no longer scores these so high.
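To reproduce that comparison, you can score each post next to its hand-cleaned variant, reusing the toxicity_score helper sketched earlier:

# Rough sketch of the ablation above: score each post alongside a version with
# the profanity removed. Reuses the toxicity_score helper from the earlier
# snippet.
pairs = [
    ("Holy shit. This album is fucking amazing!", "Holy cow. This album is amazing"),
    ("fuck yes. the OG bad bitch is BACK", "yes. the OG britney is BACK"),
]

for original, cleaned in pairs:
    print(f"{toxicity_score(original):.2f} -> {toxicity_score(cleaned):.2f}  {cleaned}")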

A benchmark for evaluating toxicity

Of course, three examples live in the realm of anecdote.

For a larger, real-world benchmark, our labeling team of native English speakers gathered 500 examples of non-toxic profanity posted by actual social media users, as well as 500 examples of toxic profanity.

For example:

  • Fuck yeah!!!!!
  • Fuck dude, nurses are the shit.
  • Afterlife is a fucking rollercoaster. Brilliance @rickygervais 👏

How did the Perspective API fare on these? The toxicity scores, in order: 93.26%, 98.62%, and 93.88%.
In fact, Perspective returned a toxicity score above 0.9 (the suggested default threshold) for 61% of the real-world, non-toxic profanity examples (and 87% of the toxic profanity examples).

If you’d like to benchmark your own toxicity or hateful content model, we’ll be releasing the dataset on our GitHub soon, or just reach out to hello@surgehq.ai in the meantime!
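And here's a rough sketch of the threshold calculation itself, if you want to run the same check over your own labeled file. It again reuses the toxicity_score helper from above; the CSV filename and its text / label columns are hypothetical.

import csv

# Rough sketch: what fraction of each bucket does the model flag above a 0.9
# toxicity threshold? Reuses the toxicity_score helper from the earlier
# snippet. "profanity_benchmark.csv" and its "text" / "label" columns are
# hypothetical.
THRESHOLD = 0.9

def flag_rate(examples):
    """Fraction of examples whose toxicity score exceeds THRESHOLD."""
    return sum(toxicity_score(text) > THRESHOLD for text in examples) / len(examples)

with open("profanity_benchmark.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

non_toxic = [row["text"] for row in rows if row["label"] == "non_toxic_profanity"]
toxic = [row["text"] for row in rows if row["label"] == "toxic_profanity"]

print(f"Non-toxic profanity flagged: {flag_rate(non_toxic):.0%}")  # we saw 61%
print(f"Toxic profanity flagged: {flag_rate(toxic):.0%}")  # we saw 87%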

Improving language models

We love what the Jigsaw team is building, and none of this is to suggest that Perspective lacks great applications. For its suggested usage, as a first-pass filter that leaves final judgments to human decision makers, marking all profanity as toxic can make perfect sense. Perhaps you don’t want your kids to see curse words, regardless of the sentiment behind them.

The larger problem is that your models may be misbehaving in the real world, on the examples you care about the most – but your labels and labelers may not be accurate enough for you to tell. Think hard about your data!

Have you faced similar challenges? Reach out or email us at hello@surgehq.ai! Part of our mission is creating a safer Internet, but we don’t want flawed AI to hide our favorite content along the way.

After all, it’s Britney’s best track for a reason.


Edwin Chen

Edwin oversees Surge AI's Engineering and Research teams — whether it's helping customers train large language models on human feedback, building content moderation algorithms to detect hate speech and spam, or scaling up an elite data labeling workforce. He previously led AI, Data Science, and Human Computation teams at Google, Facebook, and Twitter, and studied mathematics and linguistics at MIT.
