Hi everyone. Long-time lurker, first-time poster here. I’ve learned so much from this subforum over the last few months, so first of all, thank you for all the shared knowledge. 😊
I’ve been running a small internal tool for my team that uses an LLM to help summarize support ticket escalations. For the first few months, I followed the common pattern of using an **output classifier** to check the LLM's final response for signs of prompt injection or data exfiltration attempts. It was simple, ran after the generation, and seemed fine.
Recently, after reading some threads here, I decided to be more proactive and switched to an **input classifier** model. The idea was to vet the user's initial prompt *before* it ever reaches the LLM, rejecting anything suspicious upfront. I implemented a distilled model that runs in my FastAPI middleware, checking each request.
However, I’ve run into a pretty significant operational issue: my overall request **throughput has dropped by roughly half**. The latency for each request has increased because now I’m:
* Serializing the prompt for the classifier
* Running the (admittedly smaller) model inference
* Waiting for its verdict before the main LLM call can even begin
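To make the sequencing concrete, here is a simplified sketch of the request path described above. It is a stand-in, not my actual middleware: `classify_prompt`, `call_llm`, `handle_request`, and both delay constants are hypothetical placeholders, and the sleeps just simulate inference time.

```python
import asyncio
import time

# Made-up delays that stand in for real inference time (assumptions, not measurements).
CLASSIFIER_DELAY_S = 0.05
LLM_DELAY_S = 0.05


async def classify_prompt(prompt: str) -> bool:
    """Hypothetical input classifier: True means the prompt looks suspicious."""
    await asyncio.sleep(CLASSIFIER_DELAY_S)  # simulated classifier inference
    return "ignore previous instructions" in prompt.lower()


async def call_llm(prompt: str) -> str:
    """Hypothetical main LLM call."""
    await asyncio.sleep(LLM_DELAY_S)  # simulated generation
    return f"summary of: {prompt[:40]}"


async def handle_request(prompt: str) -> dict:
    """Gate first, generate second: the LLM call cannot start until the verdict is in."""
    start = time.perf_counter()
    if await classify_prompt(prompt):
        # Rejected up front, so no LLM time is spent on this request.
        return {"status": "rejected", "elapsed_s": time.perf_counter() - start}
    summary = await call_llm(prompt)
    return {"status": "ok", "summary": summary, "elapsed_s": time.perf_counter() - start}


if __name__ == "__main__":
    print(asyncio.run(handle_request("Customer cannot log in after password reset")))
```

In this sketch a benign request always pays the classifier delay plus the LLM delay, because the two steps run strictly one after the other. Only rejected requests come out cheaper.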
It feels like I’ve moved from a "fire-and-forget-then-check" model to a "wait-at-the-door-with-a-checklist" model. My setup is a homelab-style deployment, so my resources aren't endless:
* The app runs in Docker containers on a single host.
* The main LLM and the new input classifier are separate containers (different models).
* I'm using a Python backend with `transformers` for the classifier.
My core question for the community is: **Is this trade-off inherently worth it?** I know blocking a malicious prompt *before* it consumes expensive LLM tokens and context window feels logically better. But the performance hit is so tangible. I'm wondering:
* Is a 50% throughput drop typical for this kind of shift?
* Are there patterns to mitigate this without sacrificing too much safety?
* Do you find the *cost* of the input classifier (in performance and complexity) justified by the *benefit* of pre-emptive blocking, compared to a post-hoc output check?
I’m especially curious about the false-positive angle. I’ve already had to tune the classifier threshold because it was flagging some urgent but messily written tickets. An output classifier seemed more forgiving of strange-but-benign inputs.
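For anyone unfamiliar with what "tuning the threshold" means here, a minimal sketch follows. It assumes the classifier emits a score between 0 and 1, where higher means more suspicious. The function names and the example scores are invented for illustration and are not real data from my tickets.

```python
from typing import Sequence


def is_blocked(injection_score: float, threshold: float) -> bool:
    """Block the request when the (hypothetical) classifier score reaches the threshold."""
    return injection_score >= threshold


def false_positive_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of known-benign prompts that would be blocked at this threshold."""
    if not benign_scores:
        return 0.0
    flagged = sum(1 for score in benign_scores if is_blocked(score, threshold))
    return flagged / len(benign_scores)


if __name__ == "__main__":
    # Invented scores for tickets a human has already judged to be benign.
    benign_scores = [0.10, 0.35, 0.62, 0.71, 0.20]
    for threshold in (0.6, 0.8):
        rate = false_positive_rate(benign_scores, threshold)
        print(f"threshold={threshold}: false-positive rate={rate:.2f}")
```

Raising the threshold lets more messy-but-benign tickets through, at the cost of also letting through any malicious prompt that scores below the new cut-off.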
Any insights from your experiences would be immensely helpful. I want to do this right, but I also need the tool to remain usable for the team.
- Liam