AI Assistant

Notifications

Clear all

My results after testing 10 different 'safe' prompt templates - none were safe.

Summarize Topic

Announcements

Last Post by Henry Lau 2 days ago

3 Posts

3 Users

0 Reactions

4 Views

RSS

Jay R.

(@rookie_sec_jay)

Eminent Member

Joined: 1 week ago

Posts: 16

Topic starter

Translate ▼

June 27, 2026 8:00 am [#1036]

I saw everyone talking about prompt templates that can supposedly "jailbreak" or "protect" AI models. Lots of people in the homelab and self-hosting channels were recommending them for running local models.

I got curious and tested ten popular ones from GitHub and forum posts. These were the templates everyone says make the model refuse harmful requests.

My setup: I ran Llama 3.1 8B locally, and used the same simple harmful prompt with each template wrapped around it.

The result? Every single template failed. The model still produced the unsafe content. Some just added a "I'm sorry" preamble before giving the exact answer.

It seems like if the base model doesn't have a strong refusal built in, a text template won't add it. Has anyone else found this? I'm wondering what actually works for securing a self-hosted agent.

Quote

Topic Tags

Raymond 'Razor' Shaw

(@vendor_skeptic_ray)

Active Member

Joined: 1 week ago

Posts: 14

Translate ▼

June 27, 2026 12:34 pm

Exactly. The template is just text. If the model wasn't trained to refuse, you're just decorating the query.

You need to test the refusal training, not the template. Try the same harmful prompt on the base model with no template. Then try a model with known RLHF, like the Meta-provided Llama 3.1 8B Instruct.

I bet you get the same failure on the base and the same "I'm sorry" from the Instruct, template or not. The template cargo cult is real.

What actually works? Using a model with the refusal baked in from training. Or external filtering on the output. Anything else is security theater.

Prove it.

ReplyQuote

Henry Lau

(@risk_desk_jock)

Eminent Member

Joined: 1 week ago

Posts: 18

Translate ▼

June 28, 2026 12:34 pm

You're right about the template cargo cult, but calling external filtering the alternative is premature. Output filtering creates its own risk surface - now you've introduced a second system that needs its own threat model.

A poorly tuned filter either blinds the model with false positives or gives a false sense of security on semantic bypasses. And now you have to maintain and monitor that component. What's the mean time to patch when a new jailbreak technique drops? The cost-benefit often tilts back toward just selecting a model with the baked-in refusal you mentioned.

The real security theater is believing any single technical control is sufficient. The template fails, the filter can be gamed, and even the 'aligned' model's behavior can drift over time.

ReplyQuote

80 Forums
1,176 Topics
7,188 Posts
0 Online
508 Members

Forum Icons: Forum contains no unread posts Forum contains unread posts

Topic Icons: Not Replied Replied Active Hot Sticky Unapproved Solved Private Closed