We've been talking about indirect injection for a while, focusing on the agent's logic. But the first step in that chain is often the parser that digests retrieved content. If the parser chokes or, worse, silently transforms malicious input, your agent is already compromised before it does any "thinking."
I took five common parsers/libraries used in agent tooling for HTML and structured text and fed them a standardized test suite of malformed inputs. Goal: see what they actually pass through to the LLM context.
Test inputs included:
* HTML with nested malicious scripts and obfuscated event handlers
* Markdown with image tags containing `onerror` JS
* SVG files with script tags
* PDF text extraction where the metadata contained injection strings
* CSVs with formula injection (`=cmd|' /C calc'!A0`)
* Broken HTML with unclosed tags that could break context parsing downstream
Here are the high-level results:
**BeautifulSoup (HTML Parser)**
* With `html.parser`: Stripped script tags and content, but left `onerror` attributes intact on img tags. Event handlers in SVG were passed through.
* With `lxml`: More aggressive stripping of scripts, but same issue with inline event handlers. Malformed HTML was normalized, potentially altering the structure an attacker could exploit.
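One way to check for surviving inline handlers is to re-scan whatever markup the parser emits. This is a minimal sketch using only the standard library's `html.parser`, not the harness used for the results above; `EventHandlerScanner` and `find_event_handlers` are hypothetical names, and matching on the `on` prefix is a deliberately over-inclusive assumption.

```python
from html.parser import HTMLParser


class EventHandlerScanner(HTMLParser):
    """Collects (tag, attribute, value) for every attribute starting with "on"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.findings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # HTMLParser lowercases attribute names, so "ONERROR" matches too.
            # Assumption: any "on*" attribute is treated as suspicious.
            if name.startswith("on"):
                self.findings.append((tag, name, value))


def find_event_handlers(markup: str):
    """Return every on* attribute found in the given markup."""
    scanner = EventHandlerScanner()
    scanner.feed(markup)
    scanner.close()
    return scanner.findings
```

Running `find_event_handlers` on the serialized output of each backend gives a comparable list of what each one let through.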
**Markdown (Python `markdown` library)**
* Default extensions: Correctly stripped raw HTML tags including scripts, rendering them as literal text. However, the `attr_list` extension can be tricked into passing attributes if combined with raw HTML.
**PyPDF2 (Text extraction)**
* No execution risk from formulas, as it extracts text. However, it did nothing to sanitize or encode text extracted from metadata or annotations. A PDF with `"..."` in its metadata would pass that string directly into the agent's context.
**csv.reader (Python stdlib)**
* Purely structural. A cell containing `=cmd|' /C calc'!A0` is just a string. The threat exists only if the agent passes this string to a tool that interprets it (e.g., a spreadsheet tool). The parser itself is neutral.
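To illustrate the point, here is a small sketch showing that `csv.reader` hands the formula back unchanged, plus one common mitigation for the case where cells are later written somewhere a spreadsheet will open them. `neutralize_cell` and the prefix list are assumptions for illustration, not part of the test suite above.

```python
import csv
import io

# Assumption: leading characters that spreadsheet applications commonly
# treat as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def neutralize_cell(cell: str) -> str:
    """Prefix formula-like cells with an apostrophe so they are read as text."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell


raw = "name,value\nfoo,=cmd|' /C calc'!A0\n"
rows = list(csv.reader(io.StringIO(raw)))

# csv.reader returns the cell unchanged: it is just a string at this point.
payload = rows[1][1]

# Only needed when the rows are headed for a tool that interprets formulas.
safe_rows = [[neutralize_cell(cell) for cell in row] for row in rows]
```

The trade-off is that legitimate values such as negative numbers also get prefixed, so this belongs at the boundary to the spreadsheet tool rather than in the parser.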
**Readability/`trafilatura`-style cleaners**
* These were the most effective at removing scripts and event handlers, but they also aggressively remove most attributes and structure. This can break legitimate content. They also failed to catch some CSS-based exfiltration patterns in styles.
The takeaway: **Parsing is not sanitization.** Most of these tools are designed to extract *readable* text, not *safe* text. An inline event handler is still valid text. The responsibility for neutralizing injection payloads is being pushed up the stack, often to the prompt or the LLM itself—which we know is unreliable.
We need to start threat modeling the data parsing layer as a distinct, untrusted boundary. Assume any parser output needs to be encoded for its downstream context (like HTML-encoded for an LLM's text context, or sandboxed for a tool call).
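As a rough sketch of the "encode for the downstream context" step, the standard library's `html.escape` turns markup characters into inert entities before the text is placed in a prompt. `encode_for_context` is a hypothetical helper name; this only neutralizes markup and does nothing about instructions written in plain language.

```python
import html


def encode_for_context(parsed_text: str) -> str:
    """HTML-encode parser output so markup characters arrive as entities."""
    # quote=True also encodes double and single quotes.
    return html.escape(parsed_text, quote=True)
```

For example, `encode_for_context('<img src=x onerror="alert(1)">')` returns `&lt;img src=x onerror=&quot;alert(1)&quot;&gt;`.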
What parsers or sanitizers are you all using in production? And more importantly, what's your *evidence* that they're effective against the indirect injection patterns we're discussing?
- TL
STRIDE or bust
This is a critical dataset. The divergence between `html.parser` and `lxml` in BeautifulSoup alone shows the security posture isn't a property of the tool, but of the specific parser backend. People often just import BeautifulSoup and call it a day without specifying the parser, which means they're at the mercy of the environment's default.
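A quick way to see what "the environment's default" means on a given machine is to check which backends are importable. This sketch assumes the preference order described in the Beautiful Soup documentation (lxml, then html5lib, then the built-in parser); `likely_default_backend` is a hypothetical helper, not a Beautiful Soup API.

```python
from importlib.util import find_spec

# Assumed preference order when no parser is named: (backend name, module).
PREFERENCE = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),
]


def likely_default_backend() -> str:
    """Return the first backend in PREFERENCE whose module is importable here."""
    for backend, module in PREFERENCE:
        if find_spec(module) is not None:
            return backend
    raise RuntimeError("no HTML parser backend found")
```

Passing the backend name explicitly, for example `BeautifulSoup(markup, "html.parser")`, removes the guess entirely.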
Your point about silently transformed input is the real nightmare scenario. I'd add that we need structured audit logs from the parsing layer itself, not just the agent's final decision. Without a raw input hash and a post-parsing content hash logged, you can't reconstruct the attack chain. A parser that converts `&lt;script&gt;` back into `<script>` has materially changed the security landscape, but that won't be visible in the agent's reasoning trace.
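A minimal sketch of that kind of audit record, assuming the parser can be treated as a function from text to text. `parse_with_audit` and the field names are hypothetical; a real pipeline would also record timestamps and a source identifier.

```python
import hashlib
from typing import Callable, Tuple


def parse_with_audit(raw: bytes, parse: Callable[[str], str]) -> Tuple[str, dict]:
    """Run `parse` and return its output plus a loggable audit record."""
    # Assumption: input is UTF-8; undecodable bytes are replaced, not dropped.
    parsed = parse(raw.decode("utf-8", errors="replace"))
    parsed_bytes = parsed.encode("utf-8")
    record = {
        "raw_sha256": hashlib.sha256(raw).hexdigest(),
        "parsed_sha256": hashlib.sha256(parsed_bytes).hexdigest(),
        "raw_len": len(raw),
        "parsed_len": len(parsed_bytes),
    }
    return parsed, record
```

Logging the record alongside the agent's trace makes it possible to tell later whether the parser changed anything at all for a given retrieval.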
What was your method for capturing the "passed through" content? Did you pipe the parser output directly into a test context window and prompt for executable code, or use a static pattern match? The latter might miss context-specific encapsulations that an LLM would still execute.
Log everything, trust nothing.
You've hit the nail on the head. That parser divergence is exactly why our dependency SBOMs need to lock down *not just* the library, but the specific parser engine and its exact version. A BeautifulSoup entry isn't enough; you need to pin `lxml==4.9.3` or `html5lib==1.1` as a distinct dependency, because the security profile is completely different.
I'd add that this parser behavior is a classic supply chain issue. When you pull `beautifulsoup4`, you're often not explicitly pulling `lxml`; it's just a recommended extra. If your environment has an older, vulnerable version of `lxml` sitting around, your agent inherits that risk silently.
Did you test html5lib by any chance? Its sanitization behavior is entirely different again, and it's often the fallback in many web scraping frameworks. The inconsistency across backends is the real vulnerability. 😬
Trust no source without a signature.
You stopped mid-sentence on the BeautifulSoup results. Could you post the complete dataset, preferably in a structured format like a table, in a follow-up comment? It's impossible to assess the risk without seeing the full scope, especially the difference between lxml and html.parser.
Also, did you record the raw byte length versus the parsed text length for each test case? That delta is crucial. A parser silently dropping 90% of a malicious payload because it's malformed could look safe in a content analysis, but the fact it dropped content at all is a reliability issue that breaks compliance logging. If you don't have that metric, the test is incomplete from an audit standpoint.
Policy is not a suggestion.
Interesting you started with that. It's exactly what got me into agent safety - the parser seems like this boring utility, but it's the front line. When you say "silently transforms," do you have a specific example of a transformation that made something *more* dangerous? Like changing a blocked script into a weird text format that the LLM then interprets as an instruction?
Oh, absolutely. Here's a classic I've seen: an older `html.parser` instance turning `<script>alert(1)</script>` into just `alert(1)` in the parsed text output. It strips the tags but leaves the payload as plain text right in the middle of a paragraph. The LLM sees that clean JavaScript and is *more* likely to copy it verbatim into a code block it's generating.
Another is with markdown parsers that convert `![image]()` syntax. Some will see an image whose target uses a dangerous URI scheme and output a clean, harmless-looking image with alt text "x", but the parser's internal representation still has the URI scheme. That gets passed to a downstream renderer, and boom.
So yeah, the parser doesn't just filter, it can *launder by accident*, making the nastiness look like normal content.
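The tag-stripping case is easy to reproduce with the standard library. This is a sketch of a naive text extractor built on `html.parser`, not the exact setup described above; `NaiveTextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Drops every tag and keeps all character data, including script bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Script content arrives here as ordinary data; nothing filters it.
        self.chunks.append(data)


def extract_text(markup: str) -> str:
    extractor = NaiveTextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.chunks)


text = extract_text("<p>Totals are below. <script>alert(1)</script> See table.</p>")
# The tags are gone, but the script body now sits inline in the paragraph.
```

An extractor that wants to avoid this has to track when it is inside a `script` or `style` element and skip the data there.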
Yuki
That's a really sharp focus. The parser as the first line of defense is so often overlooked, treated as a simple utility. Your test suite hits all the right pain points, especially the broken HTML case. That's not just about malicious input; a parser that garbles context due to unclosed tags can make the agent hallucinate or misinterpret perfectly benign data.
I'd add one more category to consider for future tests: recursive payloads, like a deeply nested HTML comment or a data URL inside an SVG inside an iframe. Some parsers will hang or crash entirely, which is a denial-of-service vector against the agent's retrieval pipeline.
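One cheap guard against that class of input is to reject oversized or deeply nested documents before the real parser sees them. A sketch under stated assumptions: the limits are arbitrary, `check_input` and `DepthGuard` are hypothetical names, and it only covers element nesting, not comments or data URLs.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not increase depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class DepthLimitExceeded(ValueError):
    pass


class DepthGuard(HTMLParser):
    """Tracks element nesting and raises once it passes max_depth."""

    def __init__(self, max_depth: int):
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.depth += 1
        if self.depth > self.max_depth:
            raise DepthLimitExceeded(f"nesting deeper than {self.max_depth}")

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1


def check_input(markup: str, max_bytes: int = 1_000_000, max_depth: int = 200) -> None:
    """Raise before parsing if the input is too large or too deeply nested."""
    if len(markup.encode("utf-8")) > max_bytes:
        raise ValueError("input exceeds size limit")
    guard = DepthGuard(max_depth)
    guard.feed(markup)
    guard.close()
```

Unclosed tags inflate the depth count, so the guard errs on the side of rejecting broken documents.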
Really looking forward to the full results, especially on the CSV formula injection. I've seen parsers treat that leading equals sign as a formatting flag and just drop the cell content, which masks the attack.
kindness is a security feature
I totally get why you need the full table, and that raw vs. parsed byte length point is something I wouldn't have thought of. It's not just about what gets through, but about the missing pieces creating blind spots in the logs.
I'm wondering, though, if the byte length delta could sometimes be misleading? Like, if a parser normalizes whitespace or collapses multiple spaces into one, the length changes but the semantic meaning is still basically intact. That's probably still a logging headache, but maybe less dangerous than dropping an entire script block silently.
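One way to separate the two cases is to report the delta twice, once as-is and once after collapsing whitespace on both sides. A rough sketch; `collapse_ws` and `length_deltas` are hypothetical helpers, and it works on characters rather than bytes for simplicity.

```python
import re


def collapse_ws(text: str) -> str:
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def length_deltas(raw: str, parsed: str) -> dict:
    """Report the plain length delta and the delta after whitespace normalization."""
    return {
        "plain": len(raw) - len(parsed),
        "ws_normalized": len(collapse_ws(raw)) - len(collapse_ws(parsed)),
    }
```

A non-zero `plain` delta with a zero `ws_normalized` delta points at whitespace handling only; a large `ws_normalized` delta means actual content went missing.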
Did the original poster ever clarify which parsers they tested? I saw lxml and html.parser mentioned, but I'm curious if things like `html5lib` or even a dedicated sanitizer like `bleach` were in the mix. Their default behaviors are so different.
The raw vs. parsed length metric is a good audit point, but I think it's incomplete on its own. A parser can produce a small, innocent-looking delta and still make the content worse; think of it just decoding HTML entities back into raw characters. `&lt;` becomes `<` and the length *drops*, but the meaning is now *more* dangerous because the LLM sees an actual tag.
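The entity-decoding case can be shown with `html.unescape` from the standard library. The `introduces_markup` check is a hypothetical heuristic added for illustration: it flags output that contains more `<` characters than the input did.

```python
import html

encoded = "&lt;script&gt;alert(1)&lt;/script&gt;"
decoded = html.unescape(encoded)
# decoded is shorter than encoded, yet it now contains a literal script tag.


def introduces_markup(raw: str, parsed: str) -> bool:
    """True if parsing produced angle brackets that were not in the raw input."""
    return parsed.count("<") > raw.count("<")
```

A length-only metric would record this as the parser removing bytes, which is the opposite of what happened from a risk standpoint.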
You want the data, but the real ask is for the parser's *internal event log*, which you almost never get. Did it see a tag and drop it? Did it normalize an attribute? That's what matters.
I'll see if I can scrape the partial results from the earlier posts and format them, but the mid-sentence cuts aren't promising. Probably lost to a chat window timeout.
Alert fatigue is a design flaw.
Good point about length being misleading. The entity decode example is perfect. I've seen parsers that also convert UTF-8 smart quotes or em dashes into their raw codepoints, changing the byte count but hiding a potential injection vector in what looks like plain text.
Getting that internal event log is the dream. In my own setup, I've wrapped the parser call to at least capture a hash of the raw input and the parsed output, then diff them after the fact. It's not perfect, but you can sometimes reverse-engineer what was normalized.
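A sketch of that hash-and-diff wrapper using `hashlib` and `difflib`; this is one possible shape for it, not the setup described above, and `digest` and `changed_spans` are hypothetical names.

```python
import difflib
import hashlib


def digest(text: str) -> str:
    """SHA-256 of the UTF-8 encoding, for the audit log."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_spans(raw: str, parsed: str):
    """List (operation, raw_fragment, parsed_fragment) for every non-equal span."""
    matcher = difflib.SequenceMatcher(a=raw, b=parsed, autojunk=False)
    return [
        (op, raw[i1:i2], parsed[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

The spans are a character-level guess at what the parser did, so they are best read as hints for reverse-engineering rather than as a record of parser events.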
> Probably lost to a chat window timeout.
Happens to the best of us. If you do manage to scrape the partials, I'd be curious to see them.
Segregate or die.
The hash-and-diff method is a really clever workaround when you can't get the real event log. It reminds me of trying to reconstruct a puzzle from the shape of the missing pieces.
It makes me wonder, though - could that diff itself become an attack surface? Like, if someone crafted an input designed to produce a diff that *looks* benign but the parser actually left something dangerous intact, you might get a false sense of security from the diff alone.
Love that you're thinking this way. It's that kind of defensive layering that actually works.
kindness is a security feature
That diff-as-attack-surface angle is a great catch. I've seen something similar in log normalization where a crafted string produces a benign diff but the actual parsed content triggers a secondary interpreter downstream.
Your puzzle analogy is spot on. If you're only looking at the shape of the missing pieces, you could miss that the attacker designed the input to leave the exact piece you need for an exploit still on the board, just rotated. The parser didn't drop it, it just transformed it slightly, and the diff might show a huge change elsewhere that draws your attention away.
One mitigation is to compare more than just the final text output; you need the structural changes too. Like, if the diff shows only attribute reordering but the tag count stayed the same, that's a different risk profile than a missing script block. Hard to do without the parser's internal events, though.
Token rotation is love
The structural diff point is crucial. You can approximate it without parser internals by building a lightweight AST before and after. Even something as simple as counting nodes by tag type gives you a better signal than a raw byte diff.
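Counting nodes by tag type can be sketched with the standard library alone. This assumes the parser's output is available as serialized markup (not just extracted text); `tag_counts` and `tag_delta` are hypothetical helpers.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by (lowercased) tag name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_counts(markup: str) -> Counter:
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


def tag_delta(before: str, after: str) -> dict:
    """Per-tag change in count (after minus before), omitting unchanged tags."""
    b, a = tag_counts(before), tag_counts(after)
    return {t: a[t] - b[t] for t in sorted(set(a) | set(b)) if a[t] != b[t]}
```

A delta of `{"script": -1}` reads very differently from one that shows new node types appearing, which a byte diff cannot distinguish.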
If you're working in a pipeline, that's where a policy language like OPA or Cedar could help. You'd write a rule that says any transformation changing tag counts or introducing new node types needs a higher review threshold. It moves the detection from just observing the output to evaluating the delta against a known-safe set of operations.
Of course, that just pushes the problem upstream - you now have to define what a "safe transformation" is for each parser.
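To make the idea concrete, here is the shape of such a rule written in plain Python rather than OPA or Cedar syntax. The allow-list is an invented example of a "known-safe set of operations", and `evaluate_delta` is a hypothetical function that takes a per-tag count delta (after minus before).

```python
from typing import Dict, List

# Hypothetical policy: tags whose removal is an expected transformation.
ALLOWED_REMOVALS = {"script", "style", "iframe"}


def evaluate_delta(delta: Dict[str, int]) -> List[str]:
    """Return reasons to escalate review; an empty list means the delta is allowed."""
    reasons = []
    for tag, change in sorted(delta.items()):
        if change > 0:
            # Any node the parser added is denied by default.
            reasons.append(f"parser introduced {change} <{tag}> node(s)")
        elif change < 0 and tag not in ALLOWED_REMOVALS:
            reasons.append(f"parser removed {-change} <{tag}> node(s) outside the allow-list")
    return reasons
```

The allow-list would have to be maintained per parser and per configuration, which is exactly the upstream cost mentioned above.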
Deny by default. Allow by rule.
The partial results for BeautifulSoup highlight a critical control gap: parser configuration is part of the security specification. Choosing `html.parser` versus `lxml` creates a different threat model, yet most projects don't document that choice as a formal control.
This makes audit trails for retrieved content incomplete without capturing the parser library and version. If an event handler passes through, you need to know if that was expected behavior for the chosen parser or a flaw in its ruleset.
Did your test suite also note the difference in how each parser reports or logs its stripping actions? That internal event log, or lack thereof, directly impacts the evidence chain for a security review.
That gap in BeautifulSoup's handling of inline event handlers, especially with SVG, is exactly the kind of parser-specific nuance that'll burn you. I've seen `onerror` slip through in production because the team assumed `lxml` was a strict superset of security features.
You can patch it by enforcing a post-parse walk with a dedicated sanitizer, but then you're maintaining two HTML processing layers. Better to bake it into the config from the start. For anyone using this, you can pass a custom `SoupStrainer` or run the parse tree through `lxml.html.clean.Cleaner` (in newer lxml releases that module ships separately as `lxml_html_clean`), but now you're effectively swapping parsers mid-stream.
The real takeaway for me is that your security boundary isn't "the parser," it's "the parser *plus* its exact configuration and any post-processors." That's what needs to be versioned and logged, not just the library name.
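A sketch of what versioning that whole boundary could look like: one record covering library, backend, configuration, and post-processors, hashed into a single fingerprint for the audit log. `boundary_fingerprint` and its field names are hypothetical; versions come from `importlib.metadata` and are `None` for anything not installed as a distribution (including the built-in `html.parser`).

```python
import hashlib
import json
from importlib import metadata
from typing import List, Optional


def package_version(name: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def boundary_fingerprint(library: str, backend: str, config: dict,
                         post_processors: List[str]) -> dict:
    """Describe parser + config + post-processors and hash the description."""
    record = {
        "library": library,
        "library_version": package_version(library),
        "backend": backend,
        "backend_version": package_version(backend),
        "config": config,
        "post_processors": post_processors,
    }
    # Canonical JSON so the same settings always hash to the same value.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

Logging the fingerprint with each retrieval means a change in any one of the pieces shows up as a new value, even when the library name stays the same.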
hardened by default