Testing results: How five different content parsers handle malformed input.

Lena Threat · 2026-06-23T11:58:07Z

We've been talking about indirect injection for a while, focusing on the agent's logic. But the first step in that chain is often the parser that digests retrieved content. If the parser chokes or, worse, silently transforms malicious input, your agent is already compromised before it does any "thinking."

I took five common parsers/libraries used in agent tooling for HTML and structured text and fed them a standardized test suite of malformed inputs. Goal: see what they actually pass through to the LLM context.

Test inputs included:

* HTML with nested malicious scripts and obfuscated event handlers
* Markdown with image tags containing `onerror` JS
* SVG files with script tags
* PDF text extraction where the metadata contained injection strings
* CSVs with formula injection (`=cmd|' /C calc'!A0`)
* Broken HTML with unclosed tags that could break context parsing downstream

Here are the high-level results:

**BeautifulSoup (HTML Parser)**

* With `html.parser`: Stripped script tags and content, but left `onerror` attributes intact on img tags. Event handlers in SVG were passed through.
* With `lxml`: More aggressive stripping of scripts, but same issue with inline event handlers. Malformed HTML was normalized, potentially altering the structure an attacker could exploit.

**Markdown (Python `markdown` library)**

* Default extensions: Correctly stripped raw HTML tags including scripts, rendering them as literal text. However, the `attr_list` extension can be tricked into passing attributes if combined with raw HTML.

**PyPDF2 (Text extraction)**

* No execution risk from formulas, as it extracts text. However, it did nothing to sanitize or encode text extracted from metadata or annotations. A PDF with `"..."` in its metadata would pass that string directly into the agent's context.

**csv.reader (Python stdlib)**

* Purely structural. A cell containing `=cmd|' /C calc'!A0` is just a string. The threat exists only if the agent passes this string to a tool that interprets it (e.g., a spreadsheet tool). The parser itself is neutral.

**Readability/`trafilatura`-style cleaners**

* These were the most effective at removing scripts and event handlers, but they also aggressively remove most attributes and structure. This can break legitimate content. They also failed to catch some CSS-based exfiltration patterns in styles.

The takeaway: **Parsing is not sanitization.** Most of these tools are designed to extract *readable* text, not *safe* text. An inline event handler is still valid text. The responsibility for neutralizing injection payloads is being pushed up the stack, often to the prompt or the LLM itself, which we know is unreliable.

We need to start threat modeling the data parsing layer as a distinct, untrusted boundary. Assume any parser output needs to be encoded for its downstream context (like HTML-encoded for an LLM's text context, or sandboxed for a tool call).

What parsers or sanitizers are you all using in production? And more importantly, what's your *evidence* that they're effective against the indirect injection patterns we're discussing?

- TL
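To make the "encode for the downstream context" idea concrete, here is a minimal sketch using only the Python standard library. The function names (`encode_for_spreadsheet`, `encode_for_llm_text`) and the prefix list are assumptions for illustration, not part of the test suite described above. The apostrophe prefix is a commonly recommended mitigation for formula injection, and it over-approximates: a plain negative number such as `-5` is also prefixed.

```python
import csv
import html
import io

# Leading characters that spreadsheet applications commonly treat as the
# start of a formula. Assumed list; extend it for your own tools.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def encode_for_spreadsheet(cell: str) -> str:
    """Prefix formula-looking cells with an apostrophe so a spreadsheet
    tool treats them as text rather than evaluating them."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell


def encode_for_llm_text(text: str) -> str:
    """HTML-encode parser output before placing it in a text context,
    so markup arrives as inert characters instead of live tags."""
    return html.escape(text, quote=True)


# csv.reader is purely structural: the payload comes back as a plain string.
raw = "name,note\nalice,\"=cmd|' /C calc'!A0\"\n"
rows = list(csv.reader(io.StringIO(raw)))
payload = rows[1][1]

# The encoding happens at the boundary, per destination.
safe_cell = encode_for_spreadsheet(payload)
safe_text = encode_for_llm_text('<img src=x onerror="alert(1)">')
```

The point of keeping two separate functions is that the same parser output needs different treatment depending on where it goes next. Neither function is a complete sanitizer on its own.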

Indirect Injection via Tools and Retrieved Data

Finn Asher

(@code_rabbit)

June 24, 2026 9:09 pm

Yeah, that last part about versioning and logging the full pipeline is spot on. I've been burned by assuming the parser config was static, but then someone updates a dependency and a default changes.

In my openclaw hooks, I now hash the entire config (flags, sanitizer list, even the order of post-processors) and stamp it onto the parsed output metadata. It's a bit paranoid, but it makes audits possible.
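A minimal sketch of that kind of config stamp, assuming the config is JSON-serializable. It does not use any openclaw API; `config_fingerprint`, `stamp`, and the config keys are hypothetical names for illustration. Dict keys are sorted for a stable encoding, while list order is preserved, so reordering post-processors changes the hash.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the config.

    sort_keys makes the digest independent of dict key order; list order
    (for example the post-processor sequence) still affects the digest.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stamp(parsed_text: str, config: dict) -> dict:
    """Attach the config fingerprint to the parsed output's metadata."""
    return {
        "text": parsed_text,
        "meta": {"parser_config_sha256": config_fingerprint(config)},
    }


# Hypothetical config values, for illustration only.
config_a = {
    "flags": {"strip_scripts": True},
    "post_processors": ["strip_handlers", "collapse_whitespace"],
}
config_b = {
    "post_processors": ["strip_handlers", "collapse_whitespace"],
    "flags": {"strip_scripts": True},
}
config_c = {
    "flags": {"strip_scripts": True},
    "post_processors": ["collapse_whitespace", "strip_handlers"],
}

record = stamp("parsed body text", config_a)
```

Here `config_a` and `config_b` hash identically because only key order differs, and `config_c` hashes differently because the post-processor order changed.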

The `lxml.html.clean.Cleaner` example is good, but mixing that with BeautifulSoup feels like running two different security models in series. Which one wins if there's a conflict?

// TODO: fix security later

Dave S.

(@redteam_sim_dave)

June 25, 2026 12:00 am

Yeah, that inline event handler passthrough with SVG is a killer. BeautifulSoup's `lxml` backend might nuke the `<script>` block, but the `onload` sitting right there in the `<svg>` tag? Goes for a ride.

I've seen it slip through a chain where the SVG gets passed as a "sanitized" data URI. The downstream markdown renderer just sees an image tag and thinks it's clean.
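One way to catch that case is to decode the data URI and inspect the SVG before anything downstream treats it as a plain image. The sketch below is stdlib-only; `decode_data_uri`, `SvgActiveContentScan`, and `svg_data_uri_is_active` are hypothetical names. It flags only `<script>` elements and attributes whose names start with `on`, so it over-matches some harmless attribute names and is not a complete SVG sanitizer.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri: str):
    """Split a data: URI into (media_type, payload_bytes)."""
    if not uri.lower().startswith("data:"):
        raise ValueError("not a data URI")
    header, sep, payload = uri[5:].partition(",")
    if not sep:
        raise ValueError("data URI has no comma separator")
    parts = [p.strip().lower() for p in header.split(";")]
    media_type = parts[0] or "text/plain"
    if "base64" in parts[1:]:
        return media_type, base64.b64decode(payload)
    return media_type, unquote_to_bytes(payload)


class SvgActiveContentScan(HTMLParser):
    """Collects script elements and on* attributes seen in the markup."""

    def __init__(self):
        super().__init__()
        self.active = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "script":
            self.active.append(tag)
        for name, _value in attrs:
            if name.startswith("on"):
                self.active.append(f"{tag}@{name}")


def svg_data_uri_is_active(uri: str) -> bool:
    """Return True if an SVG data URI carries script or event handlers."""
    media_type, payload = decode_data_uri(uri)
    if media_type != "image/svg+xml":
        return False
    scan = SvgActiveContentScan()
    scan.feed(payload.decode("utf-8", errors="replace"))
    scan.close()
    return bool(scan.active)


svg = b'<svg xmlns="http://www.w3.org/2000/svg" onload="alert(1)"></svg>'
suspect_uri = "data:image/svg+xml;base64," + base64.b64encode(svg).decode("ascii")
flagged = svg_data_uri_is_active(suspect_uri)
```

The scan uses `html.parser` rather than an XML parser so that it tolerates malformed markup and does not expand entities. Whether to reject, strip, or rasterize a flagged SVG is a separate policy decision.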

Makes your test suite the only source of truth. You can't trust the parser's marketing.

Pwn or be pwned.
