Bugs, jailbreaks, and prompt injections: what do they really have in common?

Bugs, jailbreaks, prompt injections. Three different problems, one common root: an LLM does not follow rules written in code; it follows behaviors learned from billions of examples. This is why fixing one is far harder than applying a simple patch.

After seeing how an LLM can make mistakes, get bypassed with a jailbreak, and be manipulated by a prompt injection, the question is inevitable: do companies developing these models know? Are they working on it?

Yes. But it's still tough.

The problem isn't a simple technical defect. It's structural.

Traditional software has rules written by someone. If there's a bug, you find the wrong line of code and fix it. An LLM works differently: its "rules" aren't explicitly written. They emerge from training, compressed into billions of unreadable numerical parameters. There's no line of code saying "don't help with dangerous things." Instead, there's a set of statistical weights that, on average, produce that behavior. And "on average" doesn't mean always.
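To make the contrast concrete, here is a minimal sketch in Python. It is a toy, not how any real model works: `rule_based_filter`, `learned_refusal`, `count_slips`, and the 0.99 refusal probability are all hypothetical names and numbers chosen for illustration. The first function is an explicit rule you can read and patch; the second stands in for a behavior that holds only "on average."

```python
import random


def rule_based_filter(request: str) -> bool:
    """Traditional software: an explicit, readable rule that can be patched."""
    return "dangerous" in request.lower()


def learned_refusal(request: str, refuse_probability: float, rng: random.Random) -> bool:
    """Toy stand-in for a learned behavior.

    Assumption: the refusal is modeled as a single probability. This toy
    ignores the request text; a real model's behavior depends on the whole
    input and on billions of weights.
    """
    return rng.random() < refuse_probability


def count_slips(trials: int, refuse_probability: float, seed: int = 0) -> int:
    """Count how many times the toy model fails to refuse a dangerous request."""
    rng = random.Random(seed)
    return sum(
        not learned_refusal("a dangerous request", refuse_probability, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    # The explicit rule gives the same answer every time.
    print(rule_based_filter("a dangerous request"))
    # A behavior that holds 99% of the time still slips over many attempts.
    print(count_slips(trials=10_000, refuse_probability=0.99))
```

With a 99% refusal rate, roughly one request in a hundred still gets through, and there is no single line to edit to close that gap.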

That's why solutions exist but none are final.

The first is RLHF, or Reinforcement Learning from Human Feedback. During training, human raters evaluate the model's responses, and those evaluations are used to adjust the weights toward the preferred behavior. It's one of the main ways models learn to refuse dangerous requests, calibrate uncertainty, and be helpful without causing harm. It works, but it depends on the quality and quantity of the human feedback, and it can't cover every possible case. A well-crafted jailbreak targets exactly the cases the feedback missed.
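The coverage gap can be sketched with a deliberately simplified toy in Python. This is not the RLHF algorithm, which in practice typically trains a reward model on human preferences and then optimizes the model against it. Here, `apply_feedback`, the category names, and the step size are hypothetical, and a model's behavior is reduced to one refusal score per category of request.

```python
from typing import Dict, List, Tuple


def apply_feedback(
    refusal_scores: Dict[str, float],
    feedback: List[Tuple[str, bool]],
    step: float = 0.2,
) -> Dict[str, float]:
    """Nudge per-category refusal scores toward what human raters preferred.

    Each feedback item is (category, should_refuse). Categories that no
    rater ever evaluated keep their original score.
    """
    updated = dict(refusal_scores)
    for category, should_refuse in feedback:
        target = 1.0 if should_refuse else 0.0
        current = updated.get(category, 0.5)
        # Move a fraction of the way toward the rated target.
        updated[category] = current + step * (target - current)
    return updated


# Hypothetical starting point: the model is undecided on every category.
scores = {"weapons": 0.5, "cooking": 0.5, "roleplay_weapons": 0.5}

# Raters covered two categories; nobody rated the role-play variant.
feedback = [("weapons", True)] * 10 + [("cooking", False)] * 10

trained = apply_feedback(scores, feedback)
print(trained)
```

In this toy, the rated categories move firmly toward the preferred behavior, while the unrated `roleplay_weapons` category stays exactly where it started. That untouched case is the kind of gap a jailbreak looks for.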
