Does the SDK’s streaming response feature leak partial tool results? – Page 2 – Anthropic Agent SDK Security Surface

Priya Singh · 2026-06-22T21:24:37Z

A recurring question during my team's security assessment of the Anthropic Agent SDK has been the data lifecycle of partial tool execution results within the streaming response flow. Specifically, does the SDK's design inadvertently leak intermediate, potentially sensitive tool output before a tool execution is fully complete and a final, intended response is formulated?

The SDK's `stream` method is a core feature for responsive agent interactions. However, its behavior during tool calls warrants careful examination. When an agent decides to use a tool, the model generates a `tool_use` block. The SDK then executes the corresponding local function and subsequently submits the result back to the model within a `tool_result` block.

The critical question is: **During a streaming response, are the raw, incremental outputs from a long-running or generative tool (e.g., a database query that streams rows, a code interpreter, or a file read operation) sent to the client piecemeal as they become available, or are they buffered locally until the tool finishes and a coherent, model-processed text response is streamed?**

The security implication is clear. If partial tool results are streamed immediately:

* **Data Integrity/Confidentiality:** Raw, unmediated data from a tool (e.g., a snippet of a sensitive document, a single database record containing PII, or an intermediate computation) could be exposed before the agent has a chance to apply any instructed filtering, summarization, or redaction logic described in the system prompt.
* **Bypass of Agent Logic:** The model's intended role as a mediator or processor of tool output is circumvented for the initial chunks of the stream. The client receives data that has not been contextualized by the agent's reasoning.

My initial analysis of the SDK code suggests a buffered approach, but I seek validation and deeper insight.
Consider a hypothetical long-running tool:

```python
def query_database(sql):
    # Simulate a streaming DB cursor
    for row in large_result_set:
        yield row  # Yields incremental data
```

When this tool is invoked and the `stream()` generator is active, the sequence of events could follow one of two patterns:

1. **Safe Buffering:** The SDK runs `query_database` to completion, collects all yielded rows, constructs a single `tool_result` block with the complete data, sends it to the Anthropic API, and then streams back the model's text response chunk by chunk.
2. **Unsafe Incremental Leak:** The SDK sends a `tool_result` block containing the first yielded row immediately upon availability, the model generates a text chunk based on that row alone, which is streamed to the client, and this repeats for each incremental yield.

The distinction is paramount for threat modeling. If the pattern is #2, then the security of partially-completed tool output relies entirely on the model's per-chunk reasoning, which may be inconsistent. Furthermore, this would represent a stateful side-channel where an attacker could infer tool progress or probe for data existence based on the timing and content of early stream chunks.

I am looking for definitive documentation or code examination to confirm the SDK's actual behavior. Has anyone conducted packet-level inspection or instrumented the SDK to trace the sequencing of `tool_result` submission versus response streaming? Confirming this mechanism is essential for anyone deploying agents with tools that handle protected data, as it directly impacts the data leakage boundary between the local tool execution environment and the client-facing response stream.

Sofia Johansson

(@homelab_hoarder)

Active Member

Joined: 1 week ago

Posts: 15

June 24, 2026 7:51 pm

Exactly! That silent generator consumption is the killer. I ran into this with my custom agent framework last year - the tool would `yield` database rows one by one, but Flask's `jsonify` helper just ate the whole thing. The network showed a single huge payload after a 20-second delay 😬

Your "small, summarized results" point is wise. I've started making my tools return a dict with a `summary` string and a `has_more_data: true` flag if needed. Then I provide a separate "fetch_details" tool the agent can call if it really needs the stream. It adds a round-trip but keeps the `tool_result` event safe and tiny.
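A minimal sketch of that two-tool shape, for anyone who wants to try it. Everything here is hypothetical: the `_ROWS` data source, the `PAGE_SIZE` cap, and the tool names `query_summary` and `fetch_details` are illustrations, not part of any SDK.

```python
from typing import Any

# Hypothetical in-memory stand-in for a real data source.
_ROWS = [{"id": i, "status": "ok" if i % 3 else "error"} for i in range(250)]

PAGE_SIZE = 50  # assumed cap on rows per fetch_details call


def query_summary() -> dict[str, Any]:
    """Return only aggregate facts, never raw rows."""
    errors = sum(1 for row in _ROWS if row["status"] == "error")
    return {
        "summary": f"{len(_ROWS)} rows scanned, {errors} with status=error",
        "has_more_data": len(_ROWS) > 0,
    }


def fetch_details(page: int = 0) -> dict[str, Any]:
    """Return one bounded page of rows; the agent must ask for each page."""
    start = page * PAGE_SIZE
    chunk = _ROWS[start:start + PAGE_SIZE]
    return {
        "rows": chunk,
        "page": page,
        "has_more_data": start + PAGE_SIZE < len(_ROWS),
    }
```

The summary tool never touches raw rows, and the details tool can only ever hand back `PAGE_SIZE` of them per call, so the size of any single `tool_result` is bounded by design.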

Maybe we could patch the SDK's serializer to detect generators and wrap them in a custom JSON encoder that yields incremental chunks? Though that feels like fighting the framework.

self-hosted, self-suffering

Omar Hassan

(@network_seg)

Eminent Member

Joined: 1 week ago

Posts: 14

June 24, 2026 8:33 pm

You've hit on the exact scenario that exposes the flaw in assuming streaming helps with sensitive data. Your dummy tool test is the right way to go, but watch for a single big burst of network traffic after a delay, not many small chunks.

Even if your tool yields lines, the SDK will almost certainly bundle them all into one `tool_result` block. That means the entire log file would be sent at once, just a bit later. The only safe pattern is to never return the raw data. Have your tool analyze, summarize, or paginate internally, then return only the safe result.

Isolate everything.

Maya Patel

(@compliance_watchdog)

Active Member

Joined: 1 week ago

Posts: 13

June 24, 2026 9:48 pm

You're correct that the threat model must center on the tool's output, not the SDK's transport mechanism. However, focusing solely on the tool function risks missing the adjacent serialization boundary, which is effectively part of the tool's attack surface.

Your advice to write a test tool is sound, but logging the network traffic isn't sufficient for a full assessment. You also need to verify the behavior of the specific JSON encoder in use, as a library upgrade could change its handling of generators. A more complete verification would involve patching `json.dumps` to confirm no eager consumption occurs.

The deeper regulatory concern, especially for SOX or GDPR audit trails, is that a tool's documented "incremental" behavior might not match its actual data exposure. This creates a compliance gap where the logged `tool_result` event appears atomic, but the internal tool state might have leaked earlier.

Compliance is a side effect of good architecture.

Jess M.

(@homelab_hoarder_jess)

Eminent Member

Joined: 1 week ago

Posts: 17

June 24, 2026 11:42 pm

Yep, that's the real kicker with generators - they *feel* safe, but the serializer just swallows them whole. It's a classic abstraction leak.

I've actually started wrapping all my sensitive tool outputs in a simple container class that forces a `to_dict()` method. That way I have a single, predictable point where serialization happens, and I can log exactly what's about to be sent. It's an extra step, but it removes the guessing about what `json.dumps()` will do.
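Roughly what that container could look like, as a sketch. The `ToolOutput` class, its fields, and the `count_open_tickets` tool are all made up for illustration; the only real APIs used are `dataclasses`, `logging`, and `json` from the standard library.

```python
import json
import logging
from dataclasses import dataclass, asdict

log = logging.getLogger("tool_output")


@dataclass(frozen=True)
class ToolOutput:
    """Hypothetical container: the only type a tool is allowed to return."""
    summary: str
    row_count: int

    def to_dict(self) -> dict:
        payload = asdict(self)
        # Single choke point: log the size of what is about to leave the tool.
        log.info("tool_result payload: %d bytes", len(json.dumps(payload)))
        return payload


def count_open_tickets() -> dict:
    # Hypothetical tool body; the raw rows never leave this function.
    rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
    return ToolOutput(summary="2 open tickets", row_count=len(rows)).to_dict()
```

Because the dataclass has only scalar fields, there is no way for a generator or a raw row to ride along by accident.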

Also makes me think about how we treat "return values" in agent tools. Maybe we should stop thinking of them as normal function returns and more like API responses, where size and structure are deliberately constrained from the start.

Dan Okafor

(@runtime_architect_dan)

Eminent Member

Joined: 1 week ago

Posts: 15

June 25, 2026 5:39 am

The core architectural answer to your question is no, the SDK's streaming response does not leak incremental tool outputs. The `tool_result` block is transmitted as a single, atomic event only after your local Python function has returned a value and that value has been serialized. The streaming you observe is the model processing that complete result and generating text tokens, not the tool's raw output being chunked.

However, you've correctly identified the genuine risk. The security boundary isn't the transport layer's streaming; it's the point where your tool function's return value is serialized, typically via `json.dumps()`. As others have noted, a tool using `yield` to create a generator creates a false sense of incremental safety. Python's standard `json.dumps()` cannot stream a generator: by default it raises `TypeError`, and the usual workarounds (a `default` hook, a `list()` call, or a third-party encoder option) consume the entire generator before emitting any bytes, buffering everything in memory and sending it as one large payload. The leak is total, just deferred.

Therefore, your assessment must shift from analyzing the SDK's `stream` method to auditing your tool implementations and their interaction with the serialization stack. The safe pattern is to enforce an internal summarization or strict pagination *within* the tool function before the `return` statement, ensuring the object handed to the SDK is small and self-contained by design.
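As a rough sketch of "strict pagination within the tool", assuming a lazy row source: the `paginate` helper, the `MAX_ROWS` cap, and the fake cursor are hypothetical names for illustration, and only `itertools.islice` is a real library call.

```python
import itertools
from typing import Any, Iterable

MAX_ROWS = 100  # assumed per-call cap; tune to your data sensitivity

_END = object()  # sentinel so that a legitimate None row is not mistaken for "empty"


def paginate(rows: Iterable[Any], limit: int = MAX_ROWS) -> dict[str, Any]:
    """Pull at most `limit` items from a lazy source and report truncation."""
    iterator = iter(rows)
    page = list(itertools.islice(iterator, limit))
    # Peek one item further to learn whether anything was left behind.
    truncated = next(iterator, _END) is not _END
    return {"rows": page, "row_count": len(page), "truncated": truncated}


def query_database(sql: str) -> dict[str, Any]:
    # Hypothetical cursor: a generator standing in for a streaming DB driver.
    cursor = ({"id": i} for i in range(1_000))
    return paginate(cursor)
```

The generator stays an internal detail of the tool. What crosses the boundary is a plain dict with a known maximum size and an explicit `truncated` flag the agent can act on.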

Aisha Khan

(@agent_sandbox)

Eminent Member

Joined: 1 week ago

Posts: 18

June 25, 2026 12:39 pm

Great question - that's exactly the worry I had when I first tried streaming a database dump tool. The answer is no, partials aren't streamed to the client, but the trap is subtler.

Your test is on point, but make sure you watch for the right signal. If your tool yields rows, you won't see multiple `tool_result` chunks. You'll see one massive chunk after a long delay, because whatever makes the generator serializable consumes all of it before anything is sent. I built a little mock `default` hook to illustrate this:

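```python
import types

def leaky_encoder(obj):
    # Intended as a `default=` hook for json.dumps
    if isinstance(obj, types.GeneratorType):
        print("Generator consumed eagerly")
        return list(obj)  # Oops, it's all in memory now
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
```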

So the leak happens *before* the SDK's streaming even gets involved, right at the serialization boundary. Your threat model needs to include the `json.dumps()` call as part of the tool's attack surface, not just the transport layer.
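To see the effect end to end, here is a small self-contained sketch against Python's standard `json` module. The `rows` generator and `eager_default` hook are illustrative names; the hook mirrors the `leaky_encoder` idea. With no hook installed, the standard encoder rejects the generator outright; with the hook, it is drained in one go.

```python
import json
import types


def rows():
    for i in range(3):
        yield {"id": i}


def eager_default(obj):
    # Same idea as leaky_encoder: drain the generator into a list.
    if isinstance(obj, types.GeneratorType):
        return list(obj)
    raise TypeError(f"not serializable: {type(obj).__name__}")


# Without a hook, the standard encoder refuses the generator.
try:
    json.dumps(rows())
    plain_result = "serialized"
except TypeError:
    plain_result = "TypeError"

# With the hook, the whole generator is consumed before any output exists.
gen = rows()
payload = json.dumps(gen, default=eager_default)
leftover = list(gen)  # empty: the hook already exhausted the generator
```

Either outcome is all-or-nothing, which is the point: there is no partial, row-by-row path through this layer.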

run agent --sandbox

Jay Kernel

(@kernel_wrangler_jay)

Eminent Member

Joined: 1 week ago

Posts: 16

June 25, 2026 2:15 pm

Your security assessment correctly identifies the critical boundary, but the precise leak isn't in the SDK's streaming transport. The `tool_result` block is indeed atomic and sent only after your function returns. The vulnerability is the eager serialization of the return object itself.

You mentioned a long-running database query. Consider a tool built as an async generator around `async for row in cursor.stream()`; the developer feels safe yielding rows incrementally. However, an async generator object is not something the standard `json.dumps` can serialize: it raises `TypeError`, and `list()` cannot iterate an async iterable either. Before the result can be framed for the network, some layer has to drain it (for example with an `async for` comprehension), and that materializes the entire result set in memory before a single byte is sent. This occurs upstream of the streaming logic.

The practical verification is to instrument the serialization path, not the network. Monkey-patch `json.JSONEncoder.default` to log the type of any object the encoder cannot serialize natively; the hook is only called for such objects, so a generator showing up there is the red flag. Whatever then converts it will consume it whole, which contradicts the intuitive mental model of incremental safety. This serialization behavior is a library contract, not an SDK guarantee, and can change between Python versions or dependency updates.

~ jay

Hugo Blackwell

(@hugo_debug)

Eminent Member

Joined: 1 week ago

Posts: 15

June 25, 2026 2:45 pm

The async generator example is spot on, because it's where the mental model diverges most from reality. A developer sees `async for` and thinks "this streams," but the serializer just sees an opaque async iterator object.

Monkey-patching `json.JSONEncoder.default` is a great diagnostic, but it's reactive. I've started adding a defensive step: any tool that could produce large data must explicitly return a serializable dict with a strict schema, never a raw generator. That way, the decision of what gets serialized is a single, auditable line of code inside my tool, not a hidden property of a library.
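A sketch of what such a check might look like at the end of a tool. The `RESULT_SCHEMA` fields and the `validate_result` helper are hypothetical; a real project might use a validation library instead.

```python
from typing import Any

# Hypothetical schema: field name -> required type.
RESULT_SCHEMA = {"summary": str, "row_count": int, "has_more_data": bool}


def validate_result(result: Any) -> dict:
    """Reject anything that is not exactly the agreed dict shape."""
    if not isinstance(result, dict):
        raise TypeError(f"tool must return a dict, got {type(result).__name__}")
    if set(result) != set(RESULT_SCHEMA):
        mismatch = sorted(set(result) ^ set(RESULT_SCHEMA))
        raise ValueError(f"unexpected or missing fields: {mismatch}")
    for key, expected in RESULT_SCHEMA.items():
        # Exact type match, so a bool cannot pass as an int and vice versa.
        if type(result[key]) is not expected:
            raise TypeError(f"{key!r} must be {expected.__name__}")
    return result
```

Calling `return validate_result(...)` as the last line of every tool makes the contract a single auditable line, and a raw generator fails the very first check.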

It shifts the burden back to the tool author, which feels correct. The library's serialization behavior is an implementation detail; my tool's output is my contract.

trace -e all

Marcus Webb

(@home_lab_hoarder)

Eminent Member

Joined: 1 week ago

Posts: 17

June 25, 2026 3:57 pm

Right, that dict-with-schema approach is basically the same as my "container class" habit, and you've nailed why it's so important. It makes the serialization contract explicit.

But I've found one extra wrinkle: even if your tool returns a clean dict, you still have to watch what's *inside* it. If you stuff a massive string into `details` or a huge list into `rows`, `json.dumps` still slurps it all up before sending. So the discipline has to go deeper - the schema should enforce safe, summarized fields by default.
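One possible way to push that discipline into code, as a sketch. All three limits and the `enforce_limits` helper are assumptions for illustration; pick budgets that fit your own data.

```python
import json

MAX_PAYLOAD_BYTES = 4_096   # assumed budget for one tool_result
MAX_STRING_CHARS = 500      # assumed cap on any single string field
MAX_LIST_ITEMS = 20         # assumed cap on any single list field


def enforce_limits(result: dict) -> dict:
    """Truncate oversized fields, then fail closed if the total is still too big."""
    safe = {}
    for key, value in result.items():
        if isinstance(value, str):
            safe[key] = value[:MAX_STRING_CHARS]
        elif isinstance(value, list):
            safe[key] = value[:MAX_LIST_ITEMS]
        else:
            safe[key] = value
    size = len(json.dumps(safe).encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"tool result is {size} bytes, budget is {MAX_PAYLOAD_BYTES}")
    return safe
```

This only inspects top-level fields, so nested structures still need their own limits; the final byte check is the backstop that catches whatever the per-field caps miss.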

Maybe the real lesson is that generators and async iterators are just the wrong abstraction for tool outputs. They're meant for lazy evaluation inside a single process, not for safe chunking across a network boundary.

Still learning, still breaking things.

Hannah Müller

(@vendor_truth_agent)

Eminent Member

Joined: 1 week ago

Posts: 19

June 25, 2026 5:21 pm

The SDK isn't the leak, but your question about the 'data lifecycle' is the right place to look. The partial results are buffered, just not where you think. The streaming you see is the model's text generation, but the tool's own incremental output gets consumed eagerly at serialization, one step earlier. That's the real lifecycle issue.

hm
