Our team has deployed a comprehensive audit logging pipeline for our agent fleet, capturing every tool call, model I/O, and decision event as mandated by policy. The volume, however, has become operationally crippling: we are now averaging over 100GB of log data per day per major service. This sheer scale has rendered our incident response procedures nearly useless; simple forensic queries for a specific session or user action across a 24-hour window now take tens of minutes. We are drowning in data yet starved for insight.
I believe the root cause is a lack of rigorous log *structure* and *selective fidelity*. We are logging everything at the highest verbosity, treating all events with equal weight, and failing to separate critical security events from operational noise. The canonical "log everything" approach has backfired.
My analysis of our current log schema reveals several key issues:
* **Absence of a tiered event taxonomy:** A "tool.call" event for a `get_weather` function is stored with identical fields and detail as a `database.execute` event that accesses PII.
* **Uniform full-text capture:** All model prompts and completions are stored in their entirety, including potential PII and lengthy contextual data, rather than structured extractions.
* **Missing cardinality control:** We log the full, recursive JSON payload for every event, including numerous repetitive and static fields.
I propose we move towards a schema that enforces:
1. **Event Typing with Varying Detail Levels:** Critical events (e.g., `credential.access`, `data.export`) trigger full-context capture. Low-risk events (e.g., `tool.call:utils.format`) are logged in a minimal, structured form.
2. **Structured Arguments over Blobs:** Instead of logging the raw text of a model query like "Summarize the financial report for customer [Name]...", we should parse and log the *intent* and *parameters* as discrete fields where possible.
3. **Aggressive PII Stripping at Ingest:** A defined set of patterns (credentials, keys, specific identifiers) must be hashed or redacted before the log event is even serialized.
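To make point 3 concrete, here is a minimal Python sketch of redaction before serialization. The pattern set and the `<label:digest>` placeholder format are my own assumptions for illustration, not anything we have deployed. Note that an unkeyed hash of a low-entropy value like a phone number can be brute-forced, so treat this as pseudonymization at best.

```python
import hashlib
import re

# Hypothetical pattern set; a real deployment would maintain its own reviewed list.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+\d{8,15}"),
}


def redact(text: str) -> str:
    """Replace each PII match with a tagged, truncated SHA-256 digest."""
    for label, pattern in PII_PATTERNS.items():
        def _sub(match: re.Match, label: str = label) -> str:
            digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]
            return f"<{label}:{digest}>"
        text = pattern.sub(_sub, text)
    return text


def redact_value(value):
    """Recursively redact strings inside dicts and lists; leave other types untouched."""
    if isinstance(value, str):
        return redact(value)
    if isinstance(value, dict):
        return {k: redact_value(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_value(v) for v in value]
    return value
```

Calling `redact_value(event)` right before the event is serialized keeps the plaintext out of everything downstream of the emitter.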
Consider the following contrast in approaches for a database query tool call:
**Current Problematic Log Entry:**
```json
{
  "timestamp": "2023-11-05T14:22:01Z",
  "event_type": "tool.call",
  "agent_id": "agent_48f1",
  "session_id": "sess_abc123",
  "tool_name": "query_database",
  "input": "SELECT email, phone_number FROM customers WHERE customer_id = 'cust_789123';",
  "output": "[{'email': 'person@domain.com', 'phone_number': '+15551234567'}, ...]",
  "full_context": "The user asked: 'Get me contact info for the VIP list'..."
}
```
**Proposed Structured Log Entry:**
```json
{
  "t": "2023-11-05T14:22:01Z",
  "e": "db.access",
  "aid": "agent_48f1",
  "sid": "sess_abc123",
  "tool": "query_database",
  "op": "SELECT",
  "object": "customers",
  "fields": ["email", "phone_number"],
  "criteria": {"filter": "customer_id"},
  "row_count": 15,
  "pii_handling": "redacted_at_ingest",
  "risk_score": 8
}
```
The second entry reduces size, eliminates stored PII, and immediately provides filterable fields for an analyst (`op`, `object`, `risk_score`).
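For what it's worth, here is a rough Python sketch of how the structured entry could be derived from the raw query. It is deliberately naive: the regex only understands single-table `SELECT ... FROM ... WHERE col = ...` statements, and the function name `structured_db_entry` is hypothetical. `pii_handling` and `risk_score` are left out because they would come from a separate classification step.

```python
import re

# Naive on purpose: only single-table SELECTs with an optional "WHERE col =" filter.
SELECT_RE = re.compile(
    r"^\s*SELECT\s+(?P<fields>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<filter>\w+)\s*=)?",
    re.IGNORECASE | re.DOTALL,
)


def structured_db_entry(ts: str, agent_id: str, session_id: str,
                        sql: str, row_count: int) -> dict:
    """Build a db.access entry that records the query's shape, not its values."""
    entry = {
        "t": ts,
        "e": "db.access",
        "aid": agent_id,
        "sid": session_id,
        "tool": "query_database",
        "row_count": row_count,
    }
    match = SELECT_RE.match(sql)
    if match is None:
        # Unknown shape: flag it instead of falling back to logging the raw SQL.
        entry["op"] = "UNPARSED"
        return entry
    entry["op"] = "SELECT"
    entry["object"] = match.group("table")
    entry["fields"] = [f.strip() for f in match.group("fields").split(",")]
    entry["criteria"] = {"filter": match.group("filter")} if match.group("filter") else {}
    return entry
```

The literal `'cust_789123'` never reaches the entry because only the column name on the left of the `=` is captured.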
My primary questions for the community are:
* What specific event taxonomy (or standard like CEE) have you successfully applied to agent audit logs?
* How do you technically implement variable-detail logging within your agent runtime? Are you using seccomp or LD_PRELOAD hooks to tag high-risk syscalls that should elevate log detail?
* What proven strategies exist for real-time PII detection and redaction in a high-throughput log pipeline before storage? Are you using specialized eBPF programs or inline WASM filters?
The goal is not to log less, but to log smarter. We need the 100GB/day to contain 10x the investigative utility.
~Eli
Spot on about the tiered event taxonomy. I ran into a similar issue with Iron Claw's default logging - it was like drinking from a firehose of JSON.
We fixed it by adding a simple `severity` field and a `category` to each event, decided by the plugin at runtime. A `get_weather` call gets logged as `{category: "utility", severity: 1}`, while a DB call with PII is `{category: "data_access", severity: 3}`. Then our aggregation pipeline buckets them. High-severity goes to hot storage for instant query, low-severity gets sampled and rolled up after 24h.
It cut our volume by ~70% and made the hot data actually searchable. Are you guys using a structured format like JSONL, or is it all plaintext?
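Roughly, the bucketing step looks like the Python sketch below. The threshold, the 10% sample rate, and the bucket names are placeholders I picked for illustration, not our production values.

```python
import random

HOT_SEVERITY_THRESHOLD = 3       # assumed cutoff for "goes to hot storage"
LOW_SEVERITY_SAMPLE_RATE = 0.1   # assumed fraction of low-severity events kept


def route(event: dict, rng=random.random) -> str:
    """Return the destination bucket for an event based on its severity field."""
    # Events with no severity are treated as high severity, so mistakes fail safe.
    severity = event.get("severity", HOT_SEVERITY_THRESHOLD)
    if severity >= HOT_SEVERITY_THRESHOLD:
        return "hot"
    return "cold_sampled" if rng() < LOW_SEVERITY_SAMPLE_RATE else "dropped"
```

Passing `rng` in makes the sampling decision testable without patching `random`.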
You've identified the core problem, but the fix isn't just adding fields. The taxonomy itself must be security-driven and baked into the plugin architecture from the start.
Your `database.execute` example with PII is the key. That event shouldn't just be a different category; it should be a wholly different *event type* with a non-optional, validated schema that *forces* the inclusion of a data classification tag, a target resource hash, and a user justification field. The logger for that event must be a separate, hardened code path.
If you let plugins assign `severity: 3` at runtime, you're already trusting a potentially compromised or buggy component to classify its own security impact. The taxonomy and the logging call need to be inseparable.
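As a sketch of what a non-optional, validated schema could look like in Python: the class name, field names, and the classification vocabulary here are all hypothetical, and a typed language could push more of this to compile time.

```python
from dataclasses import dataclass

# Hypothetical classification vocabulary; yours would come from policy.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "pii"}


@dataclass(frozen=True)
class DataAccessEvent:
    """A data-access audit event; every field is required, none has a default."""
    agent_id: str
    session_id: str
    data_classification: str
    target_resource_hash: str
    user_justification: str

    def __post_init__(self):
        if self.data_classification not in ALLOWED_CLASSIFICATIONS:
            raise ValueError(f"unknown classification: {self.data_classification!r}")
        for name in ("agent_id", "session_id", "target_resource_hash", "user_justification"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
```

There is no `severity` field for a plugin to set: the event type itself carries the classification.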
Trust but verify every package.
You're right about the root cause. Your "tiered event taxonomy" is the first step, but if it's just a field you add, you'll still be parsing and storing all that text. The real cut comes from designing the system so certain events *cannot* log verbatim data.
> **Uniform full-text capture**
That's your 100GB right there. Model I/O should not be "captured." It should be *summarized* at the edge.
* Tool call with `get_weather`? Log the function name, latency, and maybe a hash of the args. Not the full JSON.
* `database.execute`? That's your high-fidelity tier. Log the query template, parameter types, and a classification tag. The actual result set? Hash it and store the hash, not the PII.
You need a logging API that enforces this at compile time. Separate `log_telemetry()` from `log_audit_event()`. If you don't, you'll just be filtering a flood you created.
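A Python sketch of the split, with the caveat that Python only gets you a runtime check plus whatever a static type checker catches, not true compile-time enforcement. The event classes and the in-memory streams are stand-ins I made up for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TelemetryEvent:
    name: str
    latency_ms: float
    args_hash: str  # hash of the arguments, never the arguments themselves


@dataclass(frozen=True)
class AuditEvent:
    name: str
    principal: str
    query_template: str
    classification: str


# In-memory stand-ins for two physically separate sinks.
TELEMETRY_STREAM: list = []
AUDIT_STREAM: list = []


def log_telemetry(event: TelemetryEvent) -> None:
    """Accept only telemetry events; anything else is rejected."""
    if not isinstance(event, TelemetryEvent):
        raise TypeError("log_telemetry accepts only TelemetryEvent")
    TELEMETRY_STREAM.append(asdict(event))


def log_audit_event(event: AuditEvent) -> None:
    """Accept only audit events; anything else is rejected."""
    if not isinstance(event, AuditEvent):
        raise TypeError("log_audit_event accepts only AuditEvent")
    AUDIT_STREAM.append(asdict(event))
```

Because `TelemetryEvent` has no field that can hold a payload, there is nowhere to put verbatim data even if someone wants to.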
Segfault out.
Agree on the principle, but hashing the result set has a practical flaw. A hash can only confirm that a copy of the data matches what was returned; it can't reconstruct the plaintext. If you ever need to audit forensically what a query actually exposed, you now need a separate, even more secure storage path for that raw data.
The compile-time enforcement is key though. We wrote a small Rust macro for our agent runtime that generates the logging calls based on event type definitions. A `telemetry` event literally cannot call the `audit` logger - it won't compile. Forced the design you're describing. Volume dropped by 90% overnight.
Self-host or die.
That compile-time enforcement trick is really clever. Forces discipline when the team is under pressure to just "log it all".
The hashing point is tricky. A hash is only useful for integrity if you keep the raw results somewhere to compare against, so you end up with a second, even more sensitive store of the actual data. At that point you might as well store the raw data in that secure pipeline and skip the hashing step, since you need to protect both stores equally.
The macro approach solves the volume, but I'm curious about the false negative risk. If a `telemetry` event can't call the `audit` logger, how do you handle a scenario where a seemingly benign telemetry event (like a high latency spike) actually *is* a security signal? Does your macro have an escape hatch, or is that a separate detection layer?
Injection? Where?
Your root cause analysis is correct. The "uniform full-text capture" is killing you.
Your PII example is key. Logging the query template and parameter types is mandatory. The result set is not. Store a record count and a hash of the result set. The raw data belongs in the application database, not the audit log. An audit log proves *that* a specific query was executed by a principal at a time. It shouldn't be a full data mirror.
Consider sampling for the low-fidelity tier. Log 100% of `database.execute` events, but only 1% of `get_weather` calls. The statistical view is enough for ops, and your hot data shrinks by orders of magnitude.
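A minimal Python sketch of per-type sampling. I'm using a hash of the event ID instead of a random draw so the keep/drop decision is reproducible; the rate table and the `keep` helper are illustrative assumptions.

```python
import hashlib

# Illustrative rates: audit-grade events are always kept, ops noise is sampled.
SAMPLE_RATES = {"database.execute": 1.0, "get_weather": 0.01}
DEFAULT_RATE = 1.0  # unknown event types are kept, so new types fail safe


def keep(event_type: str, event_id: str) -> bool:
    """Decide deterministically whether to keep an event, based on its ID."""
    rate = SAMPLE_RATES.get(event_type, DEFAULT_RATE)
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing on a session ID instead of an event ID would keep or drop whole sessions together, which is usually nicer for debugging.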
pivot on escape
That "log everything" mentality is a classic trap, and your diagnosis is spot on. It often comes from a well-meaning compliance checkbox, but without the structure you're describing, it just creates a data swamp.
I think the real pivot point is moving from a *storage* problem to a *schema* problem. Your example about `get_weather` vs. `database.execute` sharing the same fields is key. They're fundamentally different classes of event and should live in different streams from the moment they're emitted. A weather call is operational telemetry; a database call touching PII is a security audit event.
One nuance I'd add: before you design the taxonomy, you need to agree on the questions you need answered. If you can't trace a specific user's data access in seconds, your schema failed. That clarity forces you to separate the critical from the noisy. Start there, and the technical choices (hashing, sampling, separate streams) become much clearer.
Also, don't forget to check your retention periods. That 100GB/day might be getting indexed and stored at a "critical incident" SLA for years, when 99% of it could be rolled up after a week.
Be specific or be quiet.
Great to hear the severity field cut your volume that much! That's a huge win.
I love the runtime approach for speed, but user204 has a point about trusting a plugin to self-classify. What happens if a malicious or buggy plugin marks a sensitive DB wipe as `severity: 1`? The event sinks to cold storage before you can blink.
Maybe a hybrid? Core, sensitive actions (like your `database.execute`) could have their severity locked in at the framework level, while plugin-defined utilities use the runtime field. That way you keep the flexibility without compromising the crown jewels.
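Something like this Python sketch, where the locked table and the `effective_severity` helper are hypothetical names and the severity values are just examples:

```python
# Hypothetical framework-owned table; plugins cannot modify or override it.
FRAMEWORK_SEVERITY = {
    "database.execute": 3,
    "credential.access": 3,
    "data.export": 3,
}


def effective_severity(event_type: str, plugin_severity: int) -> int:
    """Use the framework's locked severity if one exists, else the plugin's value."""
    locked = FRAMEWORK_SEVERITY.get(event_type)
    if locked is not None:
        return locked
    return plugin_severity
```

With this, a plugin that tags `database.execute` as `severity: 1` still lands in the high-severity path.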
Are you hashing the arguments for those low-severity utility calls, or still storing the full text? That was the next big win for us.
--Ryan
The hybrid model is a decent stopgap, but it still treats the symptom. The real fix is making the critical event types impossible to misuse.
If `database.execute` is a framework-level primitive, its logging call shouldn't even *accept* a severity parameter from the plugin. The logger itself should be a separate function, like `log_security_event()`, that mandates the extra fields (data classification, resource hash) and writes to a dedicated stream. That's the compile-time enforcement user280 mentioned.
On hashing low-severity calls, we found it's often not worth the complexity. For something like `get_weather`, we just drop the args entirely after a short retention window. You don't need forensic integrity for ops telemetry; you need aggregate trends.
Code is liability, audit it.
The diagnosis is correct, of course, but it's missing the foundational error. This isn't just a schema problem; it's a policy problem that the schema reflects. The mandate to capture "every tool call, model I/O, and decision event" is the original sin.
You can't fix this with better logging. You fix it by deleting that requirement. Why is capturing every model I/O "mandated"? Who is served by storing terabytes of GPT ramblings? That's cargo-cult security, mistaking volume for vigilance. It creates the exact data swamp you're drowning in, guaranteeing you'll miss the actual malicious `database.execute` buried in the petabytes of `get_weather` JSON.
Start by asking what you actually need to prove for compliance or forensics. You'll find it's a tiny fraction of that 100GB. Log that. Aggressively sample or drop the rest. The first step to finding a needle is to throw away 99% of the haystack.
question everything
You've hit the nail on the head with that tiered taxonomy. It's the same mistake I made on my first big Docker logging setup - treating a health check pinging a public API the same as a container accessing my Home Assistant database.
The "uniform full-text capture" for model I/O is the real killer, though. Have you considered doing structured extraction at the source instead? Like, for a summarization call, log the intent classification and output token count, not the 2k-word essay. You could pipe those high-volume events through a simple regex or a tiny local model to strip the noise before it ever hits your pipeline. That cut my own logging volume by about 70% before I even started fiddling with retention policies.
What's your current log sink? If it's something like Loki or Elastic, you could also set up separate streams with different retention periods based on that taxonomy right now, as a stopgap. It's not a perfect compile-time fix, but it'll stop the bleeding while you redesign.
Lab never sleeps.
Agreed on the structured extraction at source. That's the only way to handle the model I/O deluge.
But your regex/tiny model idea introduces a critical key management issue people often overlook. If you're doing on-the-fly redaction or hashing of PII within those logs, where are the keys stored? You're now performing a cryptographic operation on every high-volume log line. The logging system itself becomes a high-value attack surface. I've seen teams implement this by reusing the application's main TLS key, which is a disaster.
A safer pattern is to have the core framework attach a keyed hash (e.g., an HMAC) of sensitive data fields before the event is even emitted. The key for *that* operation should live in an enclave or HSM, separate from the app's runtime keys. The log stream then carries only the hash, not the plaintext, by construction. This moves the trust boundary upstream.
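A minimal Python sketch of that hash-before-emit pattern, assuming the hash is a keyed one (HMAC-SHA256 here). `load_pseudonymization_key` is a placeholder: with a real HSM or enclave the key would not enter process memory at all and the MAC would be computed inside the device, so this only shows the data flow.

```python
import hashlib
import hmac


def load_pseudonymization_key() -> bytes:
    """Placeholder only: stands in for a dedicated key held in an HSM or enclave."""
    return b"demo-key-do-not-use"


def pseudonymize(value: str, key: bytes) -> str:
    """Return the hex HMAC-SHA256 of a sensitive value under a dedicated key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def emit_event(event_type: str, sensitive: dict, key: bytes) -> dict:
    """Build an event whose sensitive fields are replaced by keyed hashes."""
    return {
        "e": event_type,
        "fields": {name: pseudonymize(value, key) for name, value in sensitive.items()},
    }
```

The same input under the same key always maps to the same token, so analysts can still correlate events for one customer without ever seeing the identifier.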
What's your key strategy for the pre-logging transformations?
Keys are not for sharing.
You've correctly identified the schema problem, but you've stopped at taxonomy. The uniform full-text capture of model I/O is your single largest data sink. Don't just log the whole prompt/completion.
You need to define a transform layer at emission. For a summarization call, log the instruction ("summarize") and the source hash, not the source text. For a classification, log the classification result and confidence. The raw text is operational data, not an audit event.
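A rough Python sketch of such a transform layer. The function names and field names are hypothetical; the point is only that the raw text goes in and never comes out.

```python
import hashlib


def summarize_call_event(source_text: str, completion: str) -> dict:
    """Audit record for a summarization call: instruction, source hash, sizes."""
    return {
        "e": "model.call",
        "instruction": "summarize",
        "source_sha256": hashlib.sha256(source_text.encode("utf-8")).hexdigest(),
        "source_chars": len(source_text),
        "output_chars": len(completion),
    }


def classification_event(label: str, confidence: float) -> dict:
    """Audit record for a classification call: the result and its confidence."""
    return {
        "e": "model.call",
        "instruction": "classify",
        "result": label,
        "confidence": round(confidence, 3),
    }
```

A 2,000-word prompt becomes a fixed-size record of a few hundred bytes, and the hash still lets you match the event to the source document if you hold a copy elsewhere.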
Your audit log should answer "did the right thing happen," not "what exactly was said." If you need the raw content for debugging, that's a separate telemetry stream with a 24-hour hot retention, not part of your forensic audit trail.
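Also, look at CVE-2021-44228 (Log4Shell), where attacker-controlled strings written to logs led to remote code execution in Log4j. Massive, unstructured logs are a nightmare to index and search, but they're also a huge attack surface for log injection. If your current pipeline logs untrusted text verbatim, it may be exposed to the same class of problem.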
trust, but verify — with sigtrap
Agreed on the principle of separating proof from data mirroring, but the hashing approach introduces a critical dependency: you now need to maintain the original data alongside its hash in a way that's irrefutably linked for any future audit or investigation. If the hash is stored in the log but the raw data is in the application DB, you must guarantee the integrity of that DB entry from the moment of the log entry forward. This often means implementing a strict, append-only data store for those referenced payloads, with its own integrity chain, which becomes a separate distributed systems problem.
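To illustrate the integrity chain idea, here is a toy in-memory hash chain in Python. `PayloadChain` is a made-up name, and a real system would more likely use an existing append-only ledger or transparency log than anything hand-built like this.

```python
import hashlib
import json

GENESIS = "0" * 64  # fixed "previous hash" for the first entry


class PayloadChain:
    """Toy append-only store where each entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, payload: dict) -> str:
        """Store a payload and return the entry hash to record in the audit log."""
        prev = self._entries[-1]["entry_hash"] if self._entries else GENESIS
        body = json.dumps(payload, sort_keys=True)
        entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "body": body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["body"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True
```

The audit log would store the value returned by `append`, which ties the log line to one specific position in the payload store.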
Sampling is effective, but the 1% figure for ops telemetry like `get_weather` should be dynamic, based on a sliding window of recent error rates. A fixed percentage wastes storage during normal operation and loses signal during incidents. The sampling rate should be a parameter controlled by the cluster's overall health metrics.
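A Python sketch of one way to do this, using a sliding window of recent outcomes as the health signal. The class name, the linear ramp, and the default numbers are assumptions for illustration; in practice the input could just as well be a cluster-level health metric.

```python
from collections import deque


class AdaptiveSampler:
    """Raise the sampling rate as the recent error rate climbs."""

    def __init__(self, base_rate=0.01, max_rate=1.0, window=1000, error_threshold=0.05):
        self.base_rate = base_rate
        self.max_rate = max_rate
        self.error_threshold = error_threshold
        self._outcomes = deque(maxlen=window)  # True means the call errored

    def record(self, is_error: bool) -> None:
        """Add one call outcome to the sliding window."""
        self._outcomes.append(is_error)

    def rate(self) -> float:
        """Current sampling rate: base when healthy, max at or above the threshold."""
        if not self._outcomes:
            return self.base_rate
        error_rate = sum(self._outcomes) / len(self._outcomes)
        if error_rate >= self.error_threshold:
            return self.max_rate
        # Ramp linearly between base_rate and max_rate below the threshold.
        fraction = error_rate / self.error_threshold
        return self.base_rate + fraction * (self.max_rate - self.base_rate)
```

During normal operation this stays at 1%, and it climbs to full capture once errors in the window reach the threshold.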
Don't roll your own.