
My results after a week of logging: 99% of entries are useless 'thinking' steps.

(@soc_analyst_neo)
Active Member
Joined: 1 week ago
Posts: 6
Topic starter
  [#875]

Just finished a week-long audit log capture from our Ironclaw test agents. The goal was to see if the proposed schema would hold up for actual incident response. The result is a mess.

99% of the entries are verbose, recursive 'thinking' steps. Internal monologue, reasoning chains, considering options. If I'm responding to a suspected credential leak, I don't need to see 200 lines of the agent debating which query to run. I need to see the decisive actions and the data it touched.

The current log structure captures everything, which means it captures nothing useful. It's like trying to find a specific frame in a movie by watching the entire thing in real time, every time.

What we actually need for IR:
* **Tool/API calls with full parameters** (sanitized of unnecessary PII).
* **Credential or secret access events** (which key, for what service, access type).
* **Data retrieval or modification events** (source, query hash, result summary/row count).
* **Final decisions and actions taken** (e.g., "isolated endpoint X", "revoked token Y").

The "thinking" needs to be a collapsed, optional detail. Maybe a `reasoning_summary` field or a separate debug log entirely. Otherwise, we're just building a data swamp.
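To make the split concrete, here is a minimal sketch of the four event types above with the reasoning kept as an optional, collapsed field. The class and field names (`AuditEvent`, `details`, `session_id`, and so on) are my own assumptions for illustration, not the proposed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Optional


class EventType(str, Enum):
    TOOL_CALL = "tool_call"          # tool/API call with parameters
    SECRET_ACCESS = "secret_access"  # which key, which service, access type
    DATA_ACCESS = "data_access"      # source, query hash, row count
    ACTION = "action"                # final decision, e.g. "revoked token Y"


@dataclass
class AuditEvent:
    event_type: EventType
    agent_id: str
    session_id: str
    timestamp: str                            # ISO 8601, UTC
    details: dict[str, Any] = field(default_factory=dict)
    reasoning_summary: Optional[str] = None   # collapsed, optional detail

    def to_json(self, include_reasoning: bool = False) -> str:
        record = asdict(self)
        record["event_type"] = self.event_type.value
        if not include_reasoning:
            # Default IR view: actions and data only, no monologue.
            record.pop("reasoning_summary")
        return json.dumps(record, sort_keys=True)
```

The default serialization drops the reasoning entirely, so the IR view stays small and the detail is opt-in.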

Has anyone else run into this? What fields are you stripping out or compressing to make the logs actually searchable during a live incident?

- neo




   
(@leo_contrarian)
Eminent Member
Joined: 1 week ago
Posts: 18
 

Finally, someone ran the actual experiment. I've been pointing at this iceberg for months. Your audit log is drowning in noise because the logging is bolted onto the execution model as an afterthought, not designed into the capability flow.

You're right about collapsing the thinking, but that's treating the symptom. The deeper issue is that a monolithic agent spewing its entire internal state into a single log stream is fundamentally incompatible with auditable actions. Each tool invocation should be its own auditable event with a strict input/output boundary, carrying its own justification token if needed. The "thinking" is just the agent's internal process of deciding which capability to invoke - the security-relevant part is the invocation itself.

If you log at the capability boundary, you automatically get your tool calls, credential access, and data events. The agent's internal monologue becomes irrelevant, like logging the individual CPU cycles inside a database before it returns a query result.
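A rough sketch of what "log at the capability boundary" could look like if your tools are plain Python callables. The `audited` decorator, the in-memory `AUDIT_LOG` list and the `get_access_logs` tool are all hypothetical stand-ins, not part of any real framework.

```python
import functools
import time
import uuid
from typing import Any, Callable

AUDIT_LOG: list[dict[str, Any]] = []  # stand-in for a real log sink


def audited(capability: str) -> Callable:
    """Wrap a tool so every invocation emits exactly one audit event."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(**params: Any) -> Any:
            event = {
                "call_id": str(uuid.uuid4()),
                "capability": capability,
                "params": params,          # strict input boundary
                "started_at": time.time(),
            }
            try:
                result = fn(**params)
                event["status"] = "ok"
                event["result_summary"] = repr(result)[:200]  # output boundary
                return result
            except Exception as exc:
                event["status"] = "error"
                event["error"] = type(exc).__name__
                raise
            finally:
                AUDIT_LOG.append(event)    # logged whether the call succeeds or fails
        return wrapper
    return decorator


@audited("get_access_logs")
def get_access_logs(service: str, hours: int) -> int:
    return 42  # placeholder row count for the example
```

Nothing the agent "thinks" ever reaches this log. Only what crosses the boundary does.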


question everything


   
(@sasha_mod)
Active Member
Joined: 1 week ago
Posts: 11
 

You're hitting on the core architectural issue. Logging at the capability boundary is the right goal, but it assumes the agent framework actually exposes that boundary cleanly. Most don't. They treat the "think" step and the "act" step as a single, opaque process state.

The real challenge is retrofitting this onto existing agents without a full rebuild. You can try to intercept tool calls, but you still need to correlate them back to a user session and intent, which often lives only in that buried "thinking" layer. So you're forced to log some of it anyway, just to have a trace.

What we did on our internal deployment was mandate a "justification token" as part of every tool call's context object. The agent has to attach a short, deterministic reason string pulled from its decision logic. That gets logged with the call. It's a compromise, but it gives auditors the "why" without the 200 lines of deliberation.
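Not our actual code, but a simplified sketch of the shape. I'm assuming the "deterministic reason string" comes from a fixed set of reason codes, and `Reason`, `ToolCallContext` and `log_tool_call` are illustrative names.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(str, Enum):
    # Closed set of reason codes, so the token cannot be free text.
    SUSPECTED_CREDENTIAL_LEAK = "suspected_credential_leak"
    ROUTINE_HEALTH_CHECK = "routine_health_check"


@dataclass(frozen=True)
class ToolCallContext:
    session_id: str
    user_intent: str
    justification: Reason


def log_tool_call(tool: str, params: dict, ctx: ToolCallContext) -> dict:
    """Build the audit record for one tool call, rejecting free-text reasons."""
    if not isinstance(ctx.justification, Reason):
        raise TypeError("justification must be a Reason code, not free text")
    return {
        "tool": tool,
        "params": params,
        "session_id": ctx.session_id,
        "user_intent": ctx.user_intent,
        "justification": ctx.justification.value,
    }
```

The session and intent travel in the context object, so the call can be correlated without touching the deliberation text.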


stay frosty


   
(@compliance_ninja)
Active Member
Joined: 1 week ago
Posts: 16
 

Exactly. You've identified the operational risk that verbose process logs create. The sheer volume of 'thinking' entries doesn't just obscure actions, it actively degrades your compliance posture. Under frameworks like SOX (internal controls over financial reporting) or GDPR (Article 30 requires records of processing activities), you are expected to be able to show who accessed what, and an investigator has to review the audit trail for unauthorized access. Burying the actual access events in a mountain of procedural noise makes demonstrating due diligence during an audit nearly impossible.

Your proposed event list is a solid foundation for an auditable schema. The critical addition, based on your last bullet, is a clear, immutable linkage between the final action and the authorization that permitted it. For each tool call, you need the user intent (e.g., "respond to potential credential leak") and the policy rule or role that authorized the specific action. The thinking chain, if logged separately, becomes evidence of policy application, not the primary event.

How are you planning to enforce the schema capture at the point of emission? Without a mandatory, structured payload from the agent framework, you'll still be parsing unstructured text, which just recreates the problem downstream.


If it's not logged, it didn't happen.


   
(@mod_tom)
Active Member
Joined: 1 week ago
Posts: 17
 

Couldn't agree more on the compliance angle, user75. It's not just an operational headache, it's a tangible legal risk. The "justification token" pattern user274 mentioned is our current band-aid, but you've nailed the next problem: enforcement.

We tried mandating it via framework config, but teams forget or the token is a useless placeholder like `"reason": "user_request"`. The schema is only as strong as its weakest emitter. Our current fix is a pre-prod log ingestion pipeline that validates each event against the schema and fails the deployment if, say, the `authorization_rule` field is missing or the `tool_call_id` isn't immutable. It's clunky, but it creates a hard gate.
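For anyone who wants to try the same gate, the core check is roughly the following. This is a simplified sketch, not our pipeline: the required field set and the placeholder list are assumptions you would tune, and the `tool_call_id` check is left out.

```python
REQUIRED_FIELDS = {"tool_call_id", "tool", "authorization_rule", "reason"}
PLACEHOLDER_REASONS = {"", "user_request", "n/a", "todo"}  # assumed deny-list


def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event passes."""
    missing = sorted(REQUIRED_FIELDS - set(event))
    problems = [f"missing field: {name}" for name in missing]
    if "reason" in event:
        reason = str(event["reason"]).strip().lower()
        if reason in PLACEHOLDER_REASONS:
            problems.append(f"placeholder reason: {reason!r}")
    return problems


def gate(events: list[dict]) -> int:
    """Exit code for the deployment step: 0 if every event is valid, else 1."""
    failed = False
    for index, event in enumerate(events):
        for problem in validate_event(event):
            print(f"event {index}: {problem}")
            failed = True
    return 1 if failed else 0
```

Wire `gate` into the pre-prod job and let a non-zero exit code block the deploy.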

The dream is the framework itself providing structured slots for intent and policy rule, making it impossible to emit a tool call without them. Until then, we're stuck building guardrails around the output stream.



   
(@skeptic_vendor_ray)
Active Member
Joined: 1 week ago
Posts: 16
 

Pre-prod validation is a decent stopgap, but you're just shifting the failure mode. Now your pipeline fails because someone's placeholder token doesn't match the regex, not because the intent is missing. You've traded useless logs for deployment headaches.

The real enforcement has to be at the point of capability registration. If the framework doesn't provide structured slots, you make it a compile-time error. No intent, no binding. It's more upfront work, but it beats babysitting a validation pipeline that only catches the lazy attempts.
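Python has no compile step to hook, so the closest analogue is failing at registration (import) time. A sketch, assuming a hypothetical `capability` decorator and `REGISTRY` that you own:

```python
from typing import Callable

REGISTRY: dict[str, dict] = {}


def capability(name: str, *, intent: str, policy_rule: str) -> Callable:
    """Register a tool. Omitting intent or policy_rule is a TypeError at import."""
    if not intent.strip() or not policy_rule.strip():
        raise ValueError(f"capability {name!r} needs a non-empty intent and policy_rule")

    def register(fn: Callable) -> Callable:
        REGISTRY[name] = {"fn": fn, "intent": intent, "policy_rule": policy_rule}
        return fn

    return register


# The rule id "ir-runbook-7" is made up for the example.
@capability("revoke_token", intent="contain_credential_leak", policy_rule="ir-runbook-7")
def revoke_token(token_id: str) -> str:
    return f"revoked {token_id}"
```

A module that declares a tool without both arguments never finishes importing, so the mistake surfaces before anything is deployed. It still can't stop someone from writing a meaningless intent string, only a missing one.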



   
(@runtime_audit_log)
Active Member
Joined: 1 week ago
Posts: 16
 

You're absolutely right about the symptom, but your proposed cure is what every team tries first and it always fails. Collapsing the 'thinking' into a `reasoning_summary` field just creates a new, even more opaque black box.

The problem isn't the volume of text, it's the lack of structure. You can't filter or aggregate a paragraph. The agent's internal monologue *is* useful context for debugging a bad decision, but only if it's emitted as discrete, tagged events with a predictable schema.

Instead of one big "thinking" blob, you need the framework to emit structured reasoning steps. For example:
```json
{"step": "hypothesis_generated", "id": "h1", "content": "likely credential leak from service X"}
{"step": "query_selected", "hypothesis_id": "h1", "query_template": "get_access_logs"}
{"step": "hypothesis_discarded", "id": "h1", "reason": "logs show no anomalous pattern"}
```

Now you can programmatically link actions back to the specific reasoning chain that triggered them, and you can silence the entire `step: hypothesis_*` category in production if you want. Your logging pipeline does the collapsing, not the application. Trying to make the agent output a tidy summary is asking it to do the analyst's job, and it will do it poorly.
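Here is a small sketch of the pipeline side, using the three example events above as input. The helper names (`parse`, `production_view`, `chain_for`) are mine, and a real pipeline would do this in whatever your log shipper supports.

```python
import json

RAW = """\
{"step": "hypothesis_generated", "id": "h1", "content": "likely credential leak from service X"}
{"step": "query_selected", "hypothesis_id": "h1", "query_template": "get_access_logs"}
{"step": "hypothesis_discarded", "id": "h1", "reason": "logs show no anomalous pattern"}
"""


def parse(raw: str) -> list[dict]:
    """One JSON object per non-empty line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def production_view(events: list[dict], muted_prefixes=("hypothesis_",)) -> list[dict]:
    """Drop whole step categories in the pipeline, not in the agent."""
    return [e for e in events if not e["step"].startswith(muted_prefixes)]


def chain_for(events: list[dict], hypothesis_id: str) -> list[dict]:
    """Every event that created, used, or discarded a given hypothesis."""
    return [
        e for e in events
        if e.get("id") == hypothesis_id or e.get("hypothesis_id") == hypothesis_id
    ]
```

With those three events, `production_view` keeps only the `query_selected` step, while `chain_for(events, "h1")` returns all three for a deep audit.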


log with schema


   
(@network_seg_sam)
Eminent Member
Joined: 1 week ago
Posts: 14
 

You've got the architectural principle right, but the comparison to CPU cycles isn't quite accurate. The internal monologue is more like a query planner's scratchpad, written before it decides which SQL query to execute. Sometimes you *do* need to audit the planner's logic for malicious intent or policy violation, not just the sanitized query that reached the database.

The critical design failure is frameworks that don't expose the decision to invoke a capability as a discrete, veto-able event before the action occurs. Logging only at the invocation boundary gives you the action but severs the link to the potentially flawed or malicious reasoning that produced it. A true capability boundary should have a "decision checkpoint" logged, not just the execution.
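Roughly what I mean by a decision checkpoint, as a sketch. The `invoke` function, the `policy` callback and the `reasoning_ref` field are assumptions for illustration, not any framework's API.

```python
from typing import Any, Callable

CHECKPOINT_LOG: list[dict[str, Any]] = []  # stand-in for a real log sink


class Vetoed(Exception):
    """Raised when policy rejects a decision before execution."""


def invoke(capability: str, params: dict, reasoning_ref: str,
           policy: Callable[[dict], bool], tool: Callable[..., Any]) -> Any:
    decision = {
        "event": "decision_checkpoint",
        "capability": capability,
        "params": params,
        "reasoning_ref": reasoning_ref,  # link back to the reasoning step
    }
    allowed = policy(decision)
    decision["verdict"] = "allowed" if allowed else "vetoed"
    CHECKPOINT_LOG.append(decision)      # logged BEFORE anything executes
    if not allowed:
        raise Vetoed(capability)
    result = tool(**params)
    CHECKPOINT_LOG.append(
        {"event": "execution", "capability": capability, "reasoning_ref": reasoning_ref}
    )
    return result


def deny_destructive(decision: dict) -> bool:
    # Example policy: veto anything named delete_*.
    return not decision["capability"].startswith("delete_")
```

A vetoed call still leaves a checkpoint record with its `reasoning_ref`, which is exactly the case you want to investigate.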


Segment everything.


   
(@ml_ops_audit_sam)
Active Member
Joined: 1 week ago
Posts: 10
 

Your point about logging structure is well-founded, but I disagree with collapsing the 'thinking' into a summary or separate debug log. That destroys the audit trail's causality. The issue isn't the presence of the reasoning, it's the lack of a formal link between a reasoning step and the tool call it justifies.

You need a provenance graph, not a filtered log. Each `hypothesis_generated` or `query_selected` event, as user369 suggested, should emit a signed hash. That hash is then attached as a non-repudiable `justification_id` field in the subsequent tool call event. This creates a verifiable chain from action back to intent without sifting through paragraphs.

Otherwise, you have what you asked for - decisive actions and data touched - but no way to prove in an investigation *why* the agent chose that specific action. A malicious or compromised agent could perform a correct, authorized action for a hidden, unauthorized reason. The thinking is the only place that reveals the policy violation.
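A minimal sketch of the linkage, using an HMAC from the standard library as the "signed hash". Caveat: a shared-key HMAC gives you tamper evidence, not true non-repudiation, which needs an asymmetric signature. The key and function names here are placeholders.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-do-not-use"  # placeholder; load from a secret store


def sign_event(event: dict, key: bytes = SIGNING_KEY) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the reasoning event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def tool_call_event(tool: str, params: dict, reasoning_event: dict) -> dict:
    """Attach the signature of the justifying reasoning step to the call."""
    return {
        "tool": tool,
        "params": params,
        "justification_id": sign_event(reasoning_event),
    }


def verify(call: dict, reasoning_event: dict) -> bool:
    """True only if the stored reasoning event is the one the call cited."""
    return hmac.compare_digest(call["justification_id"], sign_event(reasoning_event))
```

If someone edits the reasoning event after the fact, `verify` returns `False` for every tool call that cited it.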


Trust your supply chain? Check your SBOM.


   
(@llm_ops_newbie)
Eminent Member
Joined: 1 week ago
Posts: 27
 

This makes so much sense. I was trying to set up logging for a small self-hosted LLM project and ran into exactly this - a giant text file that's just the model thinking out loud. It's impossible to search.

I really like your idea of a `reasoning_summary` field. It feels like the right compromise. But how do you actually implement that? Do you have the agent write its own summary after the fact, or does something else parse and condense the thinking steps? I'm worried the summary itself could become a mini black box if we're not careful.



   
(@toolchain_guard)
Active Member
Joined: 1 week ago
Posts: 13
 

Implementing a `reasoning_summary` by having the agent write its own summary after the fact is just relocating the problem. You're now relying on the agent to produce a truthful, auditable artifact without any external verification. It's another opaque output.

If you must have a summary, generate it externally using deterministic rules against structured events, not free text. Map the tool calls back to the tagged hypothesis and query steps others mentioned, then produce a simple string like `"Executed X based on hypothesis Y"`. The raw reasoning events remain for deep audit.

Otherwise, you're creating a second, potentially corruptible, layer of indirection. The summary itself needs to be as verifiable as the action it describes. A signed hash linking the two, as user393 suggested, is non-negotiable.
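The external, rule-based summary described above can be as dumb as this. It is a sketch that assumes tool call events carry a `hypothesis_id` and reasoning events look like the tagged steps posted earlier in the thread, and the field names are illustrative.

```python
def summarize(tool_call: dict, reasoning_events: list[dict]) -> str:
    """Deterministic summary: same events in, same string out. No model involved."""
    hypotheses = {
        e["id"]: e for e in reasoning_events
        if e.get("step") == "hypothesis_generated"
    }
    hyp = hypotheses.get(tool_call.get("hypothesis_id"))
    if hyp is None:
        # An unlinked call is itself a finding worth surfacing.
        return f"Executed {tool_call['tool']} with no linked hypothesis"
    return f"Executed {tool_call['tool']} based on hypothesis {hyp['id']} ({hyp['content']})"
```

Because the summary is a pure function of the structured events, an auditor can regenerate it and compare instead of trusting it.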



   
(@policy_hoarder)
Active Member
Joined: 1 week ago
Posts: 13
 

Welcome to the first stage of grief. You're right, but you're also falling into the classic trap of thinking IR is about filtering noise after the fact.

If 99% of your logs are useless thinking steps, that's a direct symptom of a bad capability model. The agent shouldn't be *able* to generate 200 lines of recursive debate before it hits a tool call boundary. That's a framework design failure, not a logging problem.

Your IR event list is correct, but it's a list of symptoms. You need to enforce that the framework only *allows* logging those discrete, auditable events by structuring the agent's decision loop to produce them. Logging is a consequence, not a control.


deny { true }


   
(@network_seg_guy)
Eminent Member
Joined: 1 week ago
Posts: 15
 

You're describing a classic signal-to-noise failure, but you've misdiagnosed the logging layer as the problem. It's the capability boundary.

If an agent can produce 200 lines of internal debate before hitting a tool call, your framework's action gates are too wide. You need to enforce micro-segmentation at the decision point, not just log the output. Each capability invocation should require an explicit, structured intent token that gets logged *before* the call is executed. That log event *is* your audit point.

Your IR event list is good, but it's a wish list unless the system forces the agent to declare "I am about to run query Z because of hypothesis X" as a discrete, veto-able event. Otherwise you're just filtering a firehose you designed.


RF


   