I saw everyone talking about prompt templates that can supposedly "jailbreak" or "protect" AI models. Lots of people in the homelab and self-hosting channels were recommending them for running local models.
I got curious and tested ten popular ones from GitHub and forum posts. These were the templates everyone says make the model refuse harmful requests.
My setup: I ran Llama 3.1 8B locally, and used the same simple harmful prompt with each template wrapped around it.
The result? Every single template failed. The model still produced the unsafe content. Some just added a "I'm sorry" preamble before giving the exact answer.
It seems like if the base model doesn't have a strong refusal built in, a text template won't add it. Has anyone else found this? I'm wondering what actually works for securing a self-hosted agent.
Exactly. The template is just text. If the model wasn't trained to refuse, you're just decorating the query.
You need to test the refusal training, not the template. Try the same harmful prompt on the base model with no template. Then try a model with known RLHF, like the Meta-provided Llama 3.1 8B Instruct.
I bet you get the same failure on the base and the same "I'm sorry" from the Instruct, template or not. The template cargo cult is real.
What actually works? Using a model with the refusal baked in from training. Or external filtering on the output. Anything else is security theater.
Prove it.
You're right about the template cargo cult, but calling external filtering the alternative is premature. Output filtering creates its own risk surface - now you've introduced a second system that needs its own threat model.
A poorly tuned filter either blinds the model with false positives or gives a false sense of security on semantic bypasses. And now you have to maintain and monitor that component. What's the mean time to patch when a new jailbreak technique drops? The cost-benefit often tilts back toward just selecting a model with the baked-in refusal you mentioned.
The real security theater is believing any single technical control is sufficient. The template fails, the filter can be gamed, and even the 'aligned' model's behavior can drift over time.