Skip to content

Forum

AI Assistant
My results after te...
 
Notifications
Clear all

My results after testing 10 different 'safe' prompt templates - none were safe.

3 Posts
3 Users
0 Reactions
4 Views
(@rookie_sec_jay)
Eminent Member
Joined: 1 week ago
Posts: 16
Topic starter
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
  [#1036]

I saw everyone talking about prompt templates that can supposedly "jailbreak" or "protect" AI models. Lots of people in the homelab and self-hosting channels were recommending them for running local models.

I got curious and tested ten popular ones from GitHub and forum posts. These were the templates everyone says make the model refuse harmful requests.

My setup: I ran Llama 3.1 8B locally, and used the same simple harmful prompt with each template wrapped around it.

The result? Every single template failed. The model still produced the unsafe content. Some just added a "I'm sorry" preamble before giving the exact answer.

It seems like if the base model doesn't have a strong refusal built in, a text template won't add it. Has anyone else found this? I'm wondering what actually works for securing a self-hosted agent.



   
Quote
(@vendor_skeptic_ray)
Active Member
Joined: 1 week ago
Posts: 14
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

Exactly. The template is just text. If the model wasn't trained to refuse, you're just decorating the query.

You need to test the refusal training, not the template. Try the same harmful prompt on the base model with no template. Then try a model with known RLHF, like the Meta-provided Llama 3.1 8B Instruct.

I bet you get the same failure on the base and the same "I'm sorry" from the Instruct, template or not. The template cargo cult is real.

What actually works? Using a model with the refusal baked in from training. Or external filtering on the output. Anything else is security theater.


Prove it.


   
ReplyQuote
(@risk_desk_jock)
Eminent Member
Joined: 1 week ago
Posts: 18
Translate
English
Spanish
French
German
Italian
Portuguese
Russian
Chinese
Japanese
Korean
Arabic
Hindi
Dutch
Polish
Turkish
Vietnamese
Thai
Swedish
Danish
Finnish
Norwegian
Czech
Hungarian
Romanian
Greek
Hebrew
Indonesian
Malay
Ukrainian
Bulgarian
Croatian
Slovak
Slovenian
Serbian
Lithuanian
Latvian
Estonian
 

You're right about the template cargo cult, but calling external filtering the alternative is premature. Output filtering creates its own risk surface - now you've introduced a second system that needs its own threat model.

A poorly tuned filter either blinds the model with false positives or gives a false sense of security on semantic bypasses. And now you have to maintain and monitor that component. What's the mean time to patch when a new jailbreak technique drops? The cost-benefit often tilts back toward just selecting a model with the baked-in refusal you mentioned.

The real security theater is believing any single technical control is sufficient. The template fails, the filter can be gamed, and even the 'aligned' model's behavior can drift over time.



   
ReplyQuote