Switched from output classifiers to input classifiers. My throughput halved. Worth it?

Injection Detection and Runtime Monitoring

Liam P.

(@newbie_with_questions)

June 29, 2026 12:00 am [#1116]

Hi everyone. Long-time lurker, first-time poster here. I’ve learned so much from this subforum over the last few months, so first of all, thank you for all the shared knowledge. 😊

I’ve been running a small internal tool for my team that uses an LLM to help summarize support ticket escalations. For the first few months, I followed the common pattern of using an **output classifier** to check the LLM's final response for signs of prompt injection or data exfiltration attempts. It was simple, ran after the generation, and seemed fine.

Recently, after reading some threads here, I decided to be more proactive and switched to an **input classifier** model. The idea was to vet the user's initial prompt *before* it ever reaches the LLM, rejecting anything suspicious upfront. I implemented a distilled model that runs in my FastAPI middleware, checking each request.

However, I’ve run into a pretty significant operational issue: my overall request **throughput has dropped by roughly half**. Per-request latency is up because each request now involves:
* Serializing the prompt for the classifier
* Running the (admittedly smaller) model inference
* Waiting for its verdict before the main LLM call can even begin
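
To make the sequential flow above concrete, here is a minimal sketch. It is not the real middleware: `classify_prompt`, `call_llm`, and `handle_request` are hypothetical names, the keyword check stands in for the distilled model, and the sleep durations are made-up placeholders rather than measurements.

```python
import asyncio
import time

# Assumed placeholder timings; they only simulate inference cost.
CLASSIFIER_SECONDS = 0.02
LLM_SECONDS = 0.05


async def classify_prompt(prompt: str) -> bool:
    """Hypothetical input classifier: True means the prompt looks suspicious."""
    await asyncio.sleep(CLASSIFIER_SECONDS)  # stands in for model inference
    return "ignore previous instructions" in prompt.lower()


async def call_llm(prompt: str) -> str:
    """Hypothetical main LLM call."""
    await asyncio.sleep(LLM_SECONDS)  # stands in for generation
    return f"summary of: {prompt[:40]}"


async def handle_request(prompt: str) -> dict:
    """Input-gated flow: the LLM call cannot start until the verdict is in."""
    start = time.perf_counter()
    if await classify_prompt(prompt):
        # Rejected prompts never reach the LLM, so they only pay classifier time.
        return {"status": "rejected", "elapsed": time.perf_counter() - start}
    summary = await call_llm(prompt)
    return {
        "status": "ok",
        "summary": summary,
        "elapsed": time.perf_counter() - start,
    }
```

Calling `asyncio.run(handle_request("Ticket 42: printer offline"))` shows the point: every accepted request pays the classifier time plus the LLM time back to back, while a rejected one pays only the classifier time.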

It feels like I’ve moved from a "fire-and-forget-then-check" model to a "wait-at-the-door-with-a-checklist" model. My setup is a homelab-style deployment, so my resources aren't endless:
* The app runs in Docker containers on a single host.
* The main LLM and the new input classifier are separate containers (different models).
* I'm using a Python backend with `transformers` for the classifier.

My core question for the community is: **Is this trade-off inherently worth it?** I know blocking a malicious prompt *before* it consumes expensive LLM tokens and context window feels logically better. But the performance hit is so tangible. I'm wondering:

* Is a 50% throughput drop typical for this kind of shift?
* Are there patterns to mitigate this without sacrificing too much safety?
* Do you find the *cost* of the input classifier (in performance and complexity) justified by the *benefit* of pre-emptive blocking, compared to a post-hoc output check?

I’m especially curious about the false-positive angle. I’ve already had to tune the classifier threshold because it was flagging some urgent but messily written tickets. An output classifier seemed more forgiving of strange-but-benign inputs.
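
One way to put numbers on the threshold tuning is to score a set of known-benign tickets and see what fraction would be blocked at each candidate threshold. The sketch below assumes the classifier emits a suspicion score between 0 and 1 and that a prompt is blocked when its score is at or above the threshold; both the function names and that blocking rule are assumptions for illustration.

```python
from typing import Dict, Iterable, Sequence


def false_positive_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of known-benign prompts that would be blocked at this threshold."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    # Assumed rule: a prompt is blocked when score >= threshold.
    flagged = sum(1 for score in benign_scores if score >= threshold)
    return flagged / len(benign_scores)


def sweep_thresholds(
    benign_scores: Sequence[float], thresholds: Iterable[float]
) -> Dict[float, float]:
    """Map each candidate threshold to its false-positive rate on benign data."""
    return {t: false_positive_rate(benign_scores, t) for t in thresholds}
```

Feeding it the scores of real, hand-checked benign tickets (including the urgent, messy ones) gives a table of threshold versus false-positive rate, which is easier to reason about than adjusting the threshold by feel.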

Any insights from your experiences would be immensely helpful. I want to do this right, but I also need the tool to remain usable for the team.

- Liam
