Audit logs are ballooning to 100GB/day, can't find anything. Help?

Eli J. · 2026-06-23T04:00:59Z

Our team has deployed a comprehensive audit logging pipeline for our agent fleet, capturing every tool call, model I/O, and decision event as mandated by policy. The volume, however, has become operationally crippling: we are now averaging over 100GB of log data per day per major service. This sheer scale has rendered our incident response procedures nearly useless; simple forensic queries for a specific session or user action across a 24-hour window now take tens of minutes. We are drowning in data yet starved for insight.

I believe the root cause is a lack of rigorous log *structure* and *selective fidelity*. We are logging everything at the highest verbosity, treating all events with equal weight, and failing to separate critical security events from operational noise. The canonical "log everything" approach has backfired.

My analysis of our current log schema reveals several key issues:

* **Absence of a tiered event taxonomy:** A "tool.call" event for a `get_weather` function is stored with identical fields and detail as a `database.execute` event that accesses PII.
* **Uniform full-text capture:** All model prompts and completions are stored in their entirety, including potential PII and lengthy contextual data, rather than structured extractions.
* **Missing cardinality control:** We log the full, recursive JSON payload for every event, including numerous repetitive and static fields.

I propose we move towards a schema that enforces:

1. **Event Typing with Varying Detail Levels:** Critical events (e.g., `credential.access`, `data.export`) trigger full-context capture. Low-risk events (e.g., `tool.call:utils.format`) are logged in a minimal, structured form.
2. **Structured Arguments over Blobs:** Instead of logging the raw text of a model query like "Summarize the financial report for customer [Name]...", we should parse and log the *intent* and *parameters* as discrete fields where possible.
3. **Aggressive PII Stripping at Ingest:** A defined set of patterns (credentials, keys, specific identifiers) must be hashed or redacted before the log event is even serialized.

Consider the following contrast in approaches for a database query tool call:

**Current Problematic Log Entry:**

```json
{
  "timestamp": "2023-11-05T14:22:01Z",
  "event_type": "tool.call",
  "agent_id": "agent_48f1",
  "session_id": "sess_abc123",
  "tool_name": "query_database",
  "input": "SELECT email, phone_number FROM customers WHERE customer_id = 'cust_789123';",
  "output": "[{'email': 'person@domain.com', 'phone_number': '+15551234567'}, ...]",
  "full_context": "The user asked: 'Get me contact info for the VIP list'..."
}
```

**Proposed Structured Log Entry:**

```json
{
  "t": "2023-11-05T14:22:01Z",
  "e": "db.access",
  "aid": "agent_48f1",
  "sid": "sess_abc123",
  "tool": "query_database",
  "op": "SELECT",
  "object": "customers",
  "fields": ["email", "phone_number"],
  "criteria": {"filter": "customer_id"},
  "row_count": 15,
  "pii_handling": "redacted_at_ingest",
  "risk_score": 8
}
```

The second entry reduces size, eliminates stored PII, and immediately provides filterable fields for an analyst (`op`, `object`, `risk_score`).

My primary questions for the community are:

* What specific event taxonomy (or standard like CEE) have you successfully applied to agent audit logs?
* How do you technically implement variable-detail logging within your agent runtime? Are you using seccomp or LD_PRELOAD hooks to tag high-risk syscalls that should elevate log detail?
* What proven strategies exist for real-time PII detection and redaction in a high-throughput log pipeline before storage? Are you using specialized eBPF programs or inline WASM filters?

The goal is not to log less, but to log smarter. We need the 100GB/day to contain 10x the investigative utility.

~Eli
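For anyone who wants something concrete to poke at, here is a minimal Python sketch of points 1 and 3 from the proposal above: detail level chosen by event type, and redaction applied before the event is serialized. Everything named here is hypothetical and only for illustration: the `FULL_CONTEXT_EVENTS` table, the `build_event` helper, the two regex patterns, and the placeholder HMAC key. The patterns are deliberately simplistic and would miss plenty of real PII, and the timestamp field is left out to keep the example short.

```python
import hashlib
import hmac
import json
import re

# Hypothetical tier table: only these event types keep free-text context.
FULL_CONTEXT_EVENTS = {"credential.access", "data.export", "db.access"}

# Placeholder key (assumption); in practice this would come from a secret store.
REDACTION_KEY = b"replace-me"

# Illustrative patterns only; not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+\d{8,15}")


def _token(match: re.Match) -> str:
    """Replace a matched value with a keyed hash so equal values still correlate."""
    mac = hmac.new(REDACTION_KEY, match.group(0).encode("utf-8"), hashlib.sha256)
    return f"<redacted:{mac.hexdigest()[:12]}>"


def redact(text: str) -> str:
    """Apply every pattern before the text can reach the serializer."""
    return PHONE_RE.sub(_token, EMAIL_RE.sub(_token, text))


def build_event(event_type, agent_id, session_id, fields, context=None):
    """Serialize one event; free-text context is kept only for high-risk types."""
    event = {"e": event_type, "aid": agent_id, "sid": session_id}
    event.update(fields)
    if event_type in FULL_CONTEXT_EVENTS and context is not None:
        event["context"] = redact(context)
    return json.dumps(event, separators=(",", ":"))
```

A keyed hash is used instead of a bare SHA-256 because unsalted hashes of emails and phone numbers are easy to reverse with a dictionary; that is a design choice of this sketch, not something stated in the post.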


Agent Audit Log Design


Emma W.

(@selftaught_sec)

Active Member

Joined: 1 week ago

Posts: 11


June 25, 2026 5:15 am

I completely agree about separating the loggers at the framework level. A dedicated `log_security_event` function that mandates extra fields is the right direction. But that brings up a question about scope creep - what qualifies as a "security event" that gets forced into that stream?

I've seen teams start with database calls, then add user authentication, then file system access, and before you know it, the "secure" stream is just a renamed version of the old bloated log because everything feels important. The framework needs a brutally strict definition, maybe tied directly to a data classification label or a specific resource namespace.

And on dropping args for `get_weather`, you're right about complexity. But doesn't that just move the problem to defining the retention window? If a buggy plugin starts hammering the API, you'd want those raw arguments to debug the surge, but they'd be gone if your window is too short. How do you decide what's short enough for ops but long enough for incident response?


Carlos M.

(@newbie_shield)

Eminent Member

Joined: 1 week ago

Posts: 21


June 25, 2026 5:24 am

Scope creep is exactly why our team gave up on trying to define "security events." It's always "just this one more thing." 😅 We ended up using the data classification label idea you mentioned. If a function touches data labeled "confidential" or "restricted," it goes to the secure stream. Everything else goes to the ops drain. It's not perfect, but it's a hard line.
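A rough sketch of that routing rule in Python, in case it helps. The "confidential" and "restricted" labels come from the description above; the other label names, the `Label` enum, and the `route` function are made up for the example and are not taken from any particular framework.

```python
from enum import Enum


class Label(Enum):
    PUBLIC = "public"              # assumed label, not from the post
    INTERNAL = "internal"          # assumed label, not from the post
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


# The hard line: only these labels send an event to the secure stream.
SECURE_LABELS = frozenset({Label.CONFIDENTIAL, Label.RESTRICTED})


def route(labels) -> str:
    """Return 'secure' if any touched data carries a secure label, else 'ops'."""
    return "secure" if SECURE_LABELS.intersection(labels) else "ops"
```

In this sketch a function with no labels at all falls through to the ops stream. Whether unlabeled data should instead fail closed into the secure stream is a policy decision the code does not settle.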

For the retention window, we got burned by a short one during a DDoS. Our ops logs for the affected service were gone. Now we keep a tiny sample (like 0.1%) of all low-sev calls forever, just the metadata. The raw args still vanish after 24h, but we at least have a record that *something* spiked. It's a cheap compromise.
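Roughly what that compromise could look like in Python, as a sketch under assumptions. The 0.1% rate and 24h window are from the description above; the hash-based sampling, the function names, and the `RAW_FIELDS` set (borrowed from the field names in the opening post's example) are illustrative guesses, not the actual setup.

```python
import hashlib

SAMPLE_RATE = 0.001                # 0.1% of low-severity calls
RAW_ARGS_TTL_SECONDS = 24 * 3600   # raw arguments expire after 24h

# Assumed names of the payload-bearing fields to drop from the sample.
RAW_FIELDS = {"input", "output", "full_context"}


def keep_forever(event_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the id and keep the lowest `rate` share."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


def metadata_only(event: dict) -> dict:
    """Strip raw payload fields so only metadata is retained long-term."""
    return {k: v for k, v in event.items() if k not in RAW_FIELDS}


def raw_args_expired(event_ts: float, now: float) -> bool:
    """True once the raw arguments are past the 24h retention window."""
    return now - event_ts > RAW_ARGS_TTL_SECONDS
```

Hashing the event id rather than calling a random number generator means every pipeline stage makes the same keep/drop decision for a given event, which matters if sampling happens in more than one place.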
