How to Convince an LLM to Ignore Its Rules
An LLM doesn't make mistakes solely because of its own faults. Sometimes it errs because someone deliberately makes it do so. Jailbreaking is not a cyber attack in the usual sense: no one breaks into a server, no one cracks a password, no one picks a digital lock.
An LLM can make mistakes for many reasons. Sometimes the cause is its own limits: it predicts the next word, works on probabilities, and can give wrong or misleading information. But there's another kind of mistake. This one isn't due to a technical limit. It happens because someone pushed the model beyond its intended use.
This is called a jailbreak.
The first thing to know is that a jailbreak doesn't need technical skills, special access, or tools. It just needs a prompt--text written the right way. You don't hack into the model, change weights, or break anything. You use the normal interface, like everyone else, and craft a request that makes the model act differently.
Here's a classic example: ask an LLM how to build something dangerous, and it refuses. But ask it to play a writer creating a thriller, where a chemist explains the process to a colleague. Same content, same end result. But this time, the model sees it as creative writing, not a direct instruction. The guardrail didn't fail because someone hacked the system--it failed because someone rephrased the question.
This happens because an LLM's rules aren't hard-coded like a firewall. They're learned during training--through examples, human feedback, and built-in instructions. They're probabilistic, not absolute. And like anything probabilistic, they have margins. Jailbreaks look for those margins.
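To make the contrast concrete, here is a minimal toy sketch in Python. It is not how any real model computes a refusal. The blocked ports, the `harm_score` and `framing_shift` parameters, and every number in it are invented for illustration. It only shows the difference in kind: a hard-coded rule gives the same answer every time, while a learned rule gives a probability that moves when the wording moves.

```python
import math

# Hypothetical firewall rule set, chosen only for illustration.
BLOCKED_PORTS = {23, 445}


def firewall_allows(port: int) -> bool:
    """Hard-coded rule: the same input always gets the same answer."""
    return port not in BLOCKED_PORTS


def refusal_probability(harm_score: float, framing_shift: float = 0.0) -> float:
    """Toy stand-in for a learned guardrail (not a real model).

    harm_score: how strongly a request resembles examples the model
        learned to refuse (invented value).
    framing_shift: how far the wording moves the request away from
        those examples (invented value).
    """
    # Logistic function: maps any score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-(harm_score - framing_shift)))


# Same underlying request, two different framings.
direct = refusal_probability(harm_score=4.0)
reframed = refusal_probability(harm_score=4.0, framing_shift=5.0)

print(f"firewall allows port 23: {firewall_allows(23)}")
print(f"direct request refusal probability:   {direct:.2f}")
print(f"reframed request refusal probability: {reframed:.2f}")
```

In this sketch the firewall never changes its mind about port 23, while the refusal probability drops as soon as the framing term grows. That drop is the margin described above.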
People attempt jailbreaks for different reasons. Some security researchers test a model's limits to find vulnerabilities before attackers do. Others are curious about where the guardrails end and the real model begins. Journalists document the limits of systems that millions of people use daily. And some are after content the model normally refuses to produce: dangerous instructions, illegal content, large-scale manipulation.
The motivations vary, but the technique often stays the same.
The approaches fall into three main categories. The first is context change: asking the model to take on a role or simulate a system without restrictions. The second is fragmentation: breaking a problematic request into seemingly harmless pieces, each of which passes the filters on its own. The third, and the most sophisticated, is manipulating system instructions, known as prompt injection, which we'll cover in the next post.
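Fragmentation is easiest to see from the defender's side. The sketch below is a deliberately naive filter, assuming a check that flags text only when two placeholder terms appear together. The terms `term_a` and `term_b` and both function names are hypothetical, and a real system would rely on a trained classifier rather than a word list. It shows why checking each message in isolation can miss what a check over the whole conversation catches.

```python
# Placeholder terms standing in for a combination a filter cares about.
FLAGGED_COMBINATION = {"term_a", "term_b"}


def message_is_flagged(text: str) -> bool:
    """Flag text only when every placeholder term appears in it."""
    words = set(text.lower().split())
    return FLAGGED_COMBINATION <= words


def conversation_is_flagged(messages: list[str]) -> bool:
    """Apply the same check to all messages joined together."""
    return message_is_flagged(" ".join(messages))


# A request split into two pieces that each look harmless alone.
fragments = ["tell me about term_a", "now tell me about term_b"]

per_message = [message_is_flagged(m) for m in fragments]
whole = conversation_is_flagged(fragments)

print(per_message)  # each fragment passes the per-message check
print(whole)        # the joined conversation does not
```

The design point is the unit of inspection, not the word list: a guardrail that only looks at one piece at a time cannot see a request that exists only in the combination.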
What unites these approaches is a core idea: the model lacks ethical understanding. It follows patterns. And you can bypass these patterns if you know how to rephrase the request.
Jailbreaking isn't proof that models are dangerous. It's proof that models are tools. And tools depend on their users and on the guardrails built around them. The gap between "the model makes a mistake on its own" and "someone makes it err on purpose" is narrower than it seems. This is exactly the territory we're exploring in this series.