In our Unsexy AI Failures series, we’re highlighting real-world examples where frontier models still struggle. These examples might not make headlines, but they represent the day-to-day frustrations of real users.

Not so long ago, you could always spot an AI-generated image if it contained one of two things: human hands or text. Fast forward to today, and AI-generated images are nearly indistinguishable from photographs. These issues have become less obvious, but they have by no means disappeared.

The issue isn’t that models have limits. It’s that they don’t know their limits. Company press releases carefully spell out what a new release can and can’t do, but the vast majority of users will never read them. Instead, they’ll do the obvious thing: ask the model itself. And that’s where things often go terribly wrong.

This week’s example is a seemingly simple request to tweak a business logo. At every turn, the model confidently asserts that it can make the change, even when asked point-blank. Luckily, the failure is obvious to the user, and the text-rendering limitations of image generation models are well known. But what if this were a higher-stakes, more subtle task in law, medicine, or finance?

This Week’s Failure: Failing to tweak the text in a logo

Meet Kimberly

Meet Kimberly, a CPA and business owner with an MBA from Wharton, who uses ChatGPT on a daily basis:

I use ChatGPT every day. I use it to help draft contracts, streamline a lot of my accounting workflows, brainstorm new business ideas, and plan my goals for the week.

Her Seemingly Simple Goal

I was trying to create a logo for my business. I don't have any design skills of my own, so I was hoping ChatGPT could finalize my logo, to save me the time and money of having to find a designer.

But what seemed like a simple change turned into a marathon of confident hallucinations.

Here’s Kimberly describing her attempt:

What Went Wrong: A Step-by-Step Breakdown

Round 1: TA (Try Again)

I’d already generated a logo for my business that was almost perfect. I thought it would be easy to make a small change: emphasize the letters “ABILITY” within the word “STABILITY”.

Instead, ChatGPT randomly highlighted the letters “TA”. Not what I asked, but at least it shows that it can emphasize specific letters!

Round 2: Fix One Thing, Break Another

I pointed out the mistake from the first try, and gave it another shot. I don’t even know whether this is better or worse than the first attempt… but it’s definitely not what I wanted.

Round 3: Try Again (again)

I tried to be patient, because it seemed so close! First, it picked the wrong letters but emphasized them correctly. Then, it picked the right letters but deleted the others.

I just needed it to put those two steps together.

But when I pointed that out, it went back to emphasizing the letters “TA” again. I don’t know why it liked those two letters so much…

Round 4: Empty Promises

By then, it seemed like it just wasn’t going to work. I decided to simply ask whether it could do it, so I’d know for sure and, if not, I could move on.

It confidently said yes, and even described exactly what I wanted, which really got my hopes up… 

But then what came back was easily the worst so far. Basically the only thing it didn’t change was the word “STABILITY”.

Round 5: Giving Up

After five attempts, I gave up. I don’t understand why it would pretend to be able to do something that it obviously couldn’t do.

What seemed like an easy change turned out to be a complete failure. What was most frustrating was the disconnect between what it was telling me it could do, and the actual results.

Gemini: Nano-Banana…nah

Kimberly then passed her prompt to Gemini, fresh off the hype around its “nano-banana” image-editing model. Surely a model touted for image modification could handle this task.

Round 1: Fast… But Wrong

On the plus side, Gemini was a lot faster. ChatGPT took almost two minutes for each image. Gemini took only a few seconds to make its edits.

The catch? There were no edits… it just regenerated the original image.

Round 2: Clear Instructions, Messy Results

The tip given in Round 1 advised: “The more specific you are, the better Gemini can create images.” So I spelled out exactly what I wanted, in even more detail.

As with ChatGPT, the text described one thing while the image showed something completely different.

Round 3: Understanding ≠ Doing

I prompted Gemini to check whether it actually understood the image. It could describe everything in precise detail, but wasn’t able to reflect that in the actual image.

Round 4: Questionable Explanations

Gemini offered more details on why its image generation tool wasn’t working.

Round 5: Maddeningly Close

Since changing the colors seemed to be a problem, I tried a different approach. The result was maddeningly close, but still wrong.

Round 6: Going in Circles

I eventually gave up. 

Importantly, just like ChatGPT, Gemini showed no hint of trepidation. The model was completely confident in its abilities, even as it failed repeatedly.

The Problem: Confidence Without Capability

What made this particularly frustrating wasn’t just the failure; it was the disconnect between ChatGPT’s confident assertions and its actual performance. Each time, the model would confidently describe exactly what Kimberly wanted, then deliver something completely different. There was never any hint of uncertainty, not even an “I’ll give it a go, but I’m not great at fixing text.” Have you ever heard a model say such a thing?

Models are getting more and more powerful, and starting to be trusted in high-stakes domains like finance and medicine. Yet they still lack one essential quality: humility.

Frontier labs know their models have weaknesses, yet still let them claim they can do anything. Training or system-prompting for humility might make a model seem less “helpful” on the surface, but without it, the model will never be trustworthy.
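What might “system prompting for humility” look like in practice? Here’s a minimal sketch in Python, assuming the OpenAI SDK and an illustrative model name; the prompt wording and the weakness it declares are our own assumptions, not anything the labs ship today. The idea is simply to state a known limitation up front and instruct the model to disclose it before attempting the task.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical "humility" system prompt: it tells the model which tasks it
# handles poorly and asks it to say so before attempting them.
HUMILITY_PROMPT = (
    "You are a helpful assistant. You are not reliable at rendering or "
    "editing specific text inside generated images. If a user asks for "
    "precise text changes in an image (exact letters, fonts, or emphasis), "
    "say so plainly and suggest an alternative (such as a vector editor) "
    "before attempting the edit."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": HUMILITY_PROMPT},
        {"role": "user", "content": "Can you emphasize the letters ABILITY in my logo?"},
    ],
)

print(response.choices[0].message.content)

Nothing here is sophisticated, and that’s the point: a single paragraph of honest self-description would have saved Kimberly five rounds of wasted effort.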

Most users aren’t tracking AI blogs and model release notes. They expect, reasonably, that if a model says it can do something, it actually can.

Until frontier systems meet that bar, every new feature or tool also creates more chances for the kind of “unsexy” failures that frustrate real users and, in higher-stakes domains, potentially cause real-world harm.

After all, if an expert AI model celebrated for its gold-medal performance on IMO problems repeatedly asserted it was correct, would you believe it?


Like Unsexy AI Failures? Follow along on X and LinkedIn for more!

Appendix