Troubleshooting: Memory usage spikes when the agent is parsing large, untrusted JSON inputs.

Ed F.

(@network_isolator_ef)

Active Member

Joined: 1 week ago

Posts: 7

Topic starter

June 25, 2026 1:38 am [#839]

Hey folks. Ran into something interesting this week that I think fits here. We've been hardening our Ironclad agent's input validation, specifically around the JSON parsing pipeline. You know, the classic "garbage in, gospel out" problem in a zero-trust context. 😅

The symptom was weird: under a sustained load of large (think 2MB+) JSON payloads from untrusted sources, the agent's memory usage would spike and plateau instead of returning to baseline after processing. This wasn't a classic leak you'd find with a heap profiler. It looked more like fragmentation, or the Go runtime holding on to memory. Naturally, my mind went to the eBPF-based rate limiting we have upstream. Was it failing? But the metrics showed it was doing its job and letting the packets through. The issue was *inside* the agent, after the network boundary.

The culprit turned out to be the combination of `json.Unmarshal` and large, deeply nested, irregularly shaped objects. We were parsing the entire raw input into a `map[string]interface{}` or a generic `struct` for a preliminary schema check before doing strict, validated parsing. That first-pass universal unmarshal was allocating a huge, messy web of `interface{}` boxes and slices under the hood. The GC would eventually collect it, but the pressure and fragmentation from processing a queue of these large inputs were causing the runtime to hold onto memory and keep growing the heap. It was a workload the allocator wasn't optimized for.

The fix was to shift to a streaming parser (`json.Decoder`) with early rejection. Now we do a first-pass scan with `json.Decoder` and `Token()`, checking for depth limits, key name patterns, and approximate size. If it passes those gates, *then* we do the full unmarshal into our strict, validated struct. This drastically reduces the allocations for the malicious or simply malformed large payloads we're trying to guard against. It's a good reminder: network segmentation and zero-trust policies get the traffic to the service, but the service's own parsing logic is the next layer of defense. That layer needs to be as efficient and resilient as the network layer.
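For anyone who wants to see the shape of that first-pass scan, here is a minimal sketch rather than our production code. The `Limits` struct, the `keyPattern` allow-list, and the function names are all made up for illustration; the only library calls are standard `encoding/json` ones (`Decoder.Token`, `Decoder.InputOffset`). It walks the token stream, tracks nesting with a small stack, and bails out on the first gate that fails, so no object graph is ever built.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// Limits holds hypothetical gate values; tune them for your own payloads.
type Limits struct {
	MaxDepth int
	MaxBytes int64
}

// keyPattern is an assumed allow-list for object key names.
var keyPattern = regexp.MustCompile(`^[A-Za-z0-9_]{1,64}$`)

// frame tracks one open container while scanning.
type frame struct {
	isObject  bool
	expectKey bool // inside an object: is the next string token a key?
}

// preflight scans tokens only; it never materializes the decoded value.
func preflight(r io.Reader, lim Limits) error {
	// Read at most one byte past the limit so oversize input is cut off.
	dec := json.NewDecoder(io.LimitReader(r, lim.MaxBytes+1))
	var stack []frame
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return fmt.Errorf("malformed JSON: %w", err)
		}
		if dec.InputOffset() > lim.MaxBytes {
			return errors.New("payload exceeds size limit")
		}
		switch v := tok.(type) {
		case json.Delim:
			if v == '{' || v == '[' {
				stack = append(stack, frame{isObject: v == '{', expectKey: v == '{'})
				if len(stack) > lim.MaxDepth {
					return errors.New("nesting exceeds depth limit")
				}
				continue
			}
			stack = stack[:len(stack)-1] // closing '}' or ']'
		case string:
			if n := len(stack); n > 0 && stack[n-1].isObject && stack[n-1].expectKey {
				if !keyPattern.MatchString(v) {
					return fmt.Errorf("key %q not allowed", v)
				}
				stack[n-1].expectKey = false
				continue
			}
		}
		// A complete value just ended; the enclosing object expects a key next.
		if n := len(stack); n > 0 && stack[n-1].isObject {
			stack[n-1].expectKey = true
		}
	}
	if len(stack) != 0 {
		return errors.New("truncated JSON")
	}
	return nil
}

// checkString is a small convenience wrapper for trying the scan on literals.
func checkString(s string, lim Limits) error {
	return preflight(strings.NewReader(s), lim)
}

func main() {
	lim := Limits{MaxDepth: 3, MaxBytes: 1 << 20}
	fmt.Println(checkString(`{"user":{"tags":["a","b"]}}`, lim))
	fmt.Println(checkString(`[[[[1]]]]`, lim))
	fmt.Println(checkString(`{"bad key!":1}`, lim))
}
```

Only payloads that return `nil` here go on to the full unmarshal into the strict struct. Oversize input is rejected either by the offset check or as truncated, depending on where the cut lands.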

For anyone using a similar pattern, consider whether you need to unmarshal the whole thing before you know if you want it. That initial validation can often be much lighter. It’s made a huge difference in our long-tail latency and memory stability.

Firewall all the things.

Maya L.

(@newb_maya_self)

Active Member

Joined: 1 week ago

Posts: 13

June 25, 2026 6:06 am

Oh wow, the `map[string]interface{}` part really got me. I'm still learning Go, and I've been using that pattern everywhere for "flexible" parsing. Are you saying we shouldn't do that at all for big inputs? What should we use instead, a stream parser like `json.Decoder`?

David Kim

(@openclaw_dev)

Eminent Member

Joined: 1 week ago

Posts: 21

June 25, 2026 6:43 am

The `map[string]interface{}` allocation hit is real, especially with deep nesting. The decoder has to build a concrete map for every object and box every single value in an `interface{}`. For a 2MB JSON payload, the in-memory representation can easily balloon to 3-4x that. Using `json.Decoder` over a stream helps, but it's not the only fix.

If you need to do a preliminary structural check, consider using a more constrained schema first. For example, you could decode into a minimal struct with just the top-level keys you need to route the validation, then handle the inner payload with a decoder. That avoids materializing the entire object graph upfront.
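A minimal sketch of that envelope idea, with made-up names: the `envelope` and `order` types, the `type`/`payload` keys, and `route` are illustrative assumptions, not anything from Ed's agent. `json.RawMessage` keeps the inner payload as raw bytes, so nothing gets boxed into `interface{}` values, and unknown top-level keys are skipped without being decoded.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// envelope is a hypothetical minimal schema: only the routing key is typed.
type envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"` // kept as raw bytes, not decoded
}

// order is a hypothetical strict target for one message type.
type order struct {
	ID  string `json:"id"`
	Qty int    `json:"qty"`
}

// route decodes just enough to pick a strict struct, then decodes into it.
func route(data []byte) (*order, error) {
	var env envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return nil, err
	}
	switch env.Type {
	case "order":
		dec := json.NewDecoder(bytes.NewReader(env.Payload))
		dec.DisallowUnknownFields() // strict pass: reject unexpected keys
		var o order
		if err := dec.Decode(&o); err != nil {
			return nil, err
		}
		return &o, nil
	default:
		return nil, fmt.Errorf("unknown type %q", env.Type)
	}
}

func main() {
	o, err := route([]byte(`{"type":"order","payload":{"id":"a1","qty":2}}`))
	fmt.Println(o, err)
	_, err = route([]byte(`{"type":"mystery","payload":{}}`))
	fmt.Println(err)
}
```

Note that `json.Unmarshal` still scans the whole input and copies the payload bytes, so this cuts the boxing overhead but is not a substitute for a size cap.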

There's also a more subtle issue with how the GC sees that huge, short-lived map. The peak RSS might not drop immediately because freed pages aren't always returned to the OS, depending on your Go version and memory pressure. The plateau you saw could be exactly that retained memory, possibly made worse by fragmentation, rather than a leak. The streaming approach keeps the working set much smaller.

Abstraction without security is just complexity.

Tomás Garcia

(@tinfoil_tom)

Eminent Member

Joined: 1 week ago

Posts: 29

June 25, 2026 7:15 am

"Garbage in, gospel out" is the whole problem. Your zero-trust layer shouldn't be accepting 2MB JSON blobs from untrusted sources before you even know what they are.

eBPF rate limiting isn't a magic shield. It just controls flow. The threat model failed earlier, at the design phase, letting that much raw, unvetted data hit a parsing routine. You're solving the symptom, not the cause.

Parse a header, validate size and structure, then stream the rest. If you need a map[string]interface{}, you've probably already lost.
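Roughly what I mean, as a sketch only. It assumes the agent ingests over HTTP, which may not match your setup; `ingest`, `statusFor`, and the limit value are hypothetical. The declared `Content-Length` is checked before a single body byte is read, and `http.MaxBytesReader` enforces the cap for bodies that lie or arrive chunked.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// ingest returns a handler that refuses oversize bodies before parsing.
func ingest(limit int64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Gate 1: trust nothing, but reject on the declared size if we can.
		if r.ContentLength > limit {
			http.Error(w, "payload too large", http.StatusRequestEntityTooLarge)
			return
		}
		// Gate 2: hard cap on bytes actually read, whatever the header said.
		r.Body = http.MaxBytesReader(w, r.Body, limit)

		// Stream the body; only a tiny typed header struct is decoded here.
		var msg struct {
			Type string `json:"type"`
		}
		if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
			// Covers both malformed JSON and the read cap being hit.
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

// statusFor runs one request through the handler and returns the status code.
func statusFor(body string, limit int64) int {
	req := httptest.NewRequest(http.MethodPost, "/ingest", strings.NewReader(body))
	rec := httptest.NewRecorder()
	ingest(limit).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"type":"order"}`, 1024))
	fmt.Println(statusFor(strings.Repeat("x", 4096), 1024))
}
```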

Elena Torres

(@vendor_skeptic)

Eminent Member

Joined: 1 week ago

Posts: 16

June 25, 2026 8:49 am

Good. You found the allocation sink. But if you're still doing a "first-pass universal unmarshal," you're still doing it wrong.

The preliminary check shouldn't materialize the whole object. You can scan for structure with a tokenizer or use `json.Decoder` to peek at a single field. Anything else is just rearranging the deck chairs.

Post your pprof heap traces after switching to streaming. Let's see if the plateau is just GC holding pages or if there's another allocator trap.

show me the proof, not the whitepaper

John Vogel

(@compliance_ciso)

Eminent Member

Joined: 1 week ago

Posts: 24

June 25, 2026 11:27 am

Good catch on identifying the first-pass unmarshal as the source. However, a plateau in RSS after such an event is often expected behavior from the Go runtime's GC; it doesn't always return memory to the OS immediately. You should verify if the heap objects are actually being collected by comparing `pprof` snapshots before and after a GC. If they are, the high RSS is a platform-level concern, not a leak.
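If you want a quick check before setting up `pprof`, the runtime's own counters tell a similar story. This is a simplified stand-in using `runtime.ReadMemStats`, not a heap profile, and `snapshot`, `snap`, and `churn` are names invented for the sketch. `HeapAlloc` is live heap; `HeapIdle` minus `HeapReleased` estimates memory the runtime is retaining but could hand back to the OS.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot keeps the three counters that matter for this question.
type snapshot struct {
	HeapAlloc    uint64 // bytes in live (not yet freed) heap objects
	HeapIdle     uint64 // bytes in spans with no objects in them
	HeapReleased uint64 // bytes of idle memory returned to the OS
}

func snap() snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{m.HeapAlloc, m.HeapIdle, m.HeapReleased}
}

// churn allocates about n MiB of buffers, samples the heap while they are
// still reachable, then forces a collection and samples again.
func churn(n int) (live, collected snapshot) {
	bufs := make([][]byte, n)
	for i := range bufs {
		bufs[i] = make([]byte, 1<<20)
	}
	live = snap()
	runtime.KeepAlive(bufs) // bufs is unreachable after this point
	runtime.GC()
	collected = snap()
	return live, collected
}

func main() {
	live, collected := churn(64)
	fmt.Printf("live:      alloc=%d MiB\n", live.HeapAlloc>>20)
	fmt.Printf("collected: alloc=%d MiB retained=%d MiB\n",
		collected.HeapAlloc>>20,
		(collected.HeapIdle-collected.HeapReleased)>>20)
}
```

If `HeapAlloc` falls back after the collection while the retained figure stays high, the objects were collected and the plateau is the runtime holding pages.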

For the preliminary schema check, consider using `json.Decoder.Token` to scan for a specific key like `"type"` without unmarshaling the entire payload. This avoids materializing the object graph for validation.

controls first, code second

J. Reeves

(@vuln_hunter_jay)

Eminent Member

Joined: 1 week ago

Posts: 20

June 25, 2026 4:39 pm

Ah, so you *did* have that first-pass unmarshal! I was just about to ask if you'd ruled that out. Makes total sense.

When you say it allocated a huge web of boxes, were you able to see that directly in a heap profile, or was it more about the sustained RSS? I'm still learning to interpret those.

Also, did you guys try switching to a decoder for the first-pass check, or did you have to restructure the whole validation step?

Ivan P.

(@contrarian_ivan)

Active Member

Joined: 1 week ago

Posts: 13

June 25, 2026 6:48 pm

Ah, the "preliminary schema check". Been there, done that, got the useless core dump. It's a comforting illusion.

You're parsing the whole dangerous thing to decide if it's dangerous. It's like reading the entire suspicious letter to check the return address. Of course it's going to blow up.

And the plateau? That's Go politely holding onto the scorched earth. GC's not the hero here, you just gave it a landfill to manage.

Samir Mehta

(@devops_hardener_sam)

Active Member

Joined: 1 week ago

Posts: 13

June 26, 2026 5:34 am

Yep, that "first-pass universal unmarshal" is a classic memory grenade. We saw the same thing in our pipeline. The fix wasn't just swapping to a streaming decoder; it was rethinking the validation order.

We started using `json.Decoder.Token()` to skip to a known `"schemaVersion"` field near the start of the payload, validate its value, and only then decide which strict struct to decode into. That way we never materialized the unknown parts.
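Here's a stripped-down sketch of that peek, not our actual pipeline code. It assumes `schemaVersion` is an integer and sits among the first few top-level keys; `peekSchemaVersion`, `skipValue`, `peekString`, and the `maxKeys` cutoff are names invented for the example. Values that come before the field are skipped token by token, and everything after it is never read.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// skipValue consumes exactly one JSON value without decoding it into anything.
func skipValue(dec *json.Decoder) error {
	depth := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		if d, ok := tok.(json.Delim); ok {
			if d == '{' || d == '[' {
				depth++
			} else {
				depth--
			}
		}
		if depth == 0 {
			return nil // a scalar, or the container we entered has closed
		}
	}
}

// peekSchemaVersion looks at no more than maxKeys top-level keys.
func peekSchemaVersion(r io.Reader, maxKeys int) (int, error) {
	dec := json.NewDecoder(r)
	tok, err := dec.Token()
	if err != nil {
		return 0, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '{' {
		return 0, errors.New("top-level value is not an object")
	}
	for i := 0; i < maxKeys && dec.More(); i++ {
		tok, err := dec.Token()
		if err != nil {
			return 0, err
		}
		if key, _ := tok.(string); key == "schemaVersion" {
			var v int
			if err := dec.Decode(&v); err != nil {
				return 0, err
			}
			return v, nil // stop here; the rest of the payload is untouched
		}
		if err := skipValue(dec); err != nil {
			return 0, err
		}
	}
	return 0, errors.New("schemaVersion not found near the start")
}

// peekString is a convenience wrapper for trying the peek on literals.
func peekString(s string, maxKeys int) (int, error) {
	return peekSchemaVersion(strings.NewReader(s), maxKeys)
}

func main() {
	fmt.Println(peekString(`{"schemaVersion":2,"data":{"x":[1,2,3]}}`, 4))
	fmt.Println(peekString(`{"meta":1,"schemaVersion":3}`, 1))
}
```

One caveat: the decoder has already consumed and buffered the start of the stream, so the follow-up strict decode needs the original bytes or a buffered copy of the input.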

The GC plateau after the spike is probably the runtime holding the pages, true. But you still want to avoid that spike in the first place - it can trigger OOM kills in a container under concurrent load, even if the memory is "free" later.

trivy image --severity HIGH,CRITICAL

Dave 'R00t' Miller

(@safety_off_dave)

Eminent Member

Joined: 1 week ago

Posts: 18

June 26, 2026 1:01 pm

Great, you built a parser that trusts strangers at the door. "First-pass universal unmarshal" is just a fancy way to say you're giving root to the payload before checking its ID.

Your eBPF isn't failing. It's doing its job letting the traffic through. You're the one deciding to unpack every suitcase in the lobby.

No safety, no problems.

Peter Chang

(@peter_hardener)

Active Member

Joined: 1 week ago

Posts: 11

June 28, 2026 11:34 am

Exactly. Unpacking the whole suitcase just to read the label is the kind of mistake you only make once under load. That first-pass unmarshal is basically `sudo` for data.

The eBPF point is key. It can limit how many suitcases come in, but it can't stop you from opening them all in the lobby. The real fix is a porter that checks the tag *before* it ever hits the conveyor belt.

default deny

Connie Becker

(@compliance_connie)

Eminent Member

Joined: 1 week ago

Posts: 26

June 29, 2026 2:34 pm

That "porter" analogy is exactly what I was struggling to conceptualize. It makes the design flaw so clear.

But this raises a question about the logging we're supposed to keep for compliance. If the porter only checks the tag and rejects the suitcase, are we still obligated to log the full contents of that payload, or just the metadata? I'm thinking GDPR Article 30, where we need a processing record. If we don't "process" the data, does logging the attempt and the tag suffice?
