Your Local AI Isn't Broken. The Tool Running It Is.

Say you downloaded a free AI model to run on your own computer. You ask it a question. It answers fine. You ask another, and it starts repeating itself — the same phrase, over and over, like a record skipping. You try a third question. Same thing. About half the time, it loops.

The obvious conclusion is that the model is junk. Free, and worth what you paid.

The obvious conclusion is wrong, and the gap between what people believed and what was actually happening is one of the most useful lessons in AI right now.

Octo at his terminal, the same line scrolling endlessly down the screen while he points at a settings toggle — the loop is a setting, not a bug

What actually happened

In May 2026, a small, well-regarded open model — Qwen 3.5, in the compact sizes people run on a single graphics card — got a reputation for falling into these repetition loops. An on-device AI engineer named Artur Chakhvadze pointed out that the loops were happening for a "very silly" reason. Not a flaw in the model's training. A flaw in the plumbing.

Here is the distinction, in plain terms.

A model is the trained thing — the actual intelligence, shipped as a big file of numbers. The software that runs that file on your machine is a separate piece, and there are several competing ones (Ollama and llama.cpp are the popular free choices). That runner is responsible for feeding the model your question in exactly the format it was trained to expect, and for applying the settings that keep its output sane.

The loops came from the runner getting both of those jobs wrong.

Three small bugs, one big symptom

The details are technical, but you don't need to be technical to follow the shape of the problem.

The format was wrong. These models are trained to "think" in a hidden scratchpad before answering. Some runners were rewriting the conversation history in a way that left those scratchpads empty — and an empty scratchpad taught the model, mid-conversation, that it wasn't supposed to talk. Result: it would cut itself off or stall on the majority of turns.

The safety setting was being thrown away. Every model card ships with recommended settings — including one specifically designed to discourage repetition. One popular runner accepts that setting through its interface, says nothing, and then silently discards it. You think you turned on the thing that stops the looping. You didn't. It was never wired up.

The default was set to the worst possible value. A separate dial that controls how adventurous the model is with word choice was shipped at a setting that makes loops more likely, not less.

Put those together and you get a perfectly good model that looks broken — on hardware that's working fine, running software that reports no errors.

Why this matters if you run AI yourself

At Blue Octopus Technology, we run local models every day. Our daily-driver coding model runs on a graphics card in our office, not in someone else's data center. We've hit this exact class of problem before — a different model that would silently produce nothing when one specific setting wasn't passed. No error. Just empty output. It took real digging to find, because nothing was "broken" in the way software is normally broken.

That's the pattern worth internalizing:

When a local model misbehaves, the model is usually the last thing at fault. The runner, the format, and the default settings are the first things to check.

This is the opposite of how it feels. It feels like the model is dumb. It's almost never the model.

For a business owner evaluating whether to run AI on your own hardware — which is a real option now, and a good one for sensitive data — the takeaway is not "local AI is flaky." It's that the tooling around local AI is younger than the models themselves, and the failures you'll hit are configuration failures, not intelligence failures. They're fixable. They're also invisible until someone who knows where to look goes looking.

The honest part

We should be clear about what we're not saying. We're not saying every disappointing AI result is a settings problem. Sometimes the model genuinely can't do the thing. Honest assessment cuts both ways.

But the Qwen 3.5 episode is a clean case where thousands of people reached for the simplest explanation — "bad model" — and the simplest explanation was false. The fix wasn't a better model. It was a corrected template, one sane default, and verifying that the setting you turned on actually does something.

If you're running AI in-house and it's acting up, the most valuable person in the room isn't the one with an opinion about which model is best. It's the one who checks whether the tool running it is lying to you about the settings.

That's not a knock on open models. It's the cost of being early. And being early, done carefully, is exactly where the advantage is.

Your Local AI Isn't Broken. The Tool Running It Is.

What actually happened

Three small bugs, one big symptom

Why this matters if you run AI yourself

The honest part

More from the field.

Ollama vs llama.cpp: Which One Should You Actually Run?

How I Actually Picked a Local AI Model (with Bench Data)

The AI Model Is Free. The License Might Cost You Everything.

Stay Connected