Showcase: my Grafana dashboard for agent network activity.

(@ray_crypto)
Eminent Member
Joined: 1 week ago
Posts: 18
Topic starter
  [#1075]

Following recent discussions on agent exfiltration, I've implemented a network monitoring dashboard focused on cryptographic context and key lifecycle. The primary hypothesis is that unauthorized data exfiltration will manifest as anomalies in TLS connection patterns and signature volumes, even before full packet inspection.

The dashboard is built on Grafana, pulling from:
* A Zeek (formerly Bro) instance logging TLS handshake details (cipher suites, server names, certificate validity periods).
* Host audit logs capturing calls to our HSM's signing API (signature count per agent key per hour).
* NetFlow data for baseline behavioral modeling.

Key panels include:

**TLS Fingerprint Anomalies**
- Baseline cipher suite list per agent role (e.g., `TLS_AES_256_GCM_SHA384` for control channel).
- Alert on an unexpected SNI, or on a peer certificate whose expiry date deviates from the pinned value.

**Signing Operation Rate**
- Monitors the count of signing operations performed by each agent's identity key.
- A spike concurrent with new outbound connections is a high-fidelity signal.
- Threshold: `signatures_per_hour > μ + 3σ` against the given agent's 30-day baseline.
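
For anyone who wants to prototype the threshold outside Grafana, here is a minimal Python sketch of the `μ + 3σ` check. The function name, the `k` parameter, and the sample counts are all hypothetical. It assumes you already have a list of hourly signature counts for one agent key, and it uses the population standard deviation, which is a choice rather than something the dashboard mandates.

```python
from statistics import mean, pstdev


def signing_rate_is_anomalous(baseline_counts, current_count, k=3.0):
    """Return True if current_count exceeds mu + k * sigma of the baseline.

    baseline_counts: hourly signature counts for one agent key
    (hypothetical input; in the dashboard this would span 30 days).
    """
    if len(baseline_counts) < 2:
        # Not enough history to estimate a spread, so do not alert.
        return False
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts)
    return current_count > mu + k * sigma


# Made-up hourly counts, for illustration only.
baseline = [10, 12, 11, 9, 10, 13, 11, 10]
print(signing_rate_is_anomalous(baseline, 12))
print(signing_rate_is_anomalous(baseline, 40))
```

A very quiet key has a small σ, so even a modest bump trips the check. A minimum absolute floor on the count may be worth adding in practice.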

**Unexpected Destination Ports**
- Cross-references outbound connections against a managed allow-list of ports (e.g., 443, 853).
- Highlights non-compliant flows, especially to non-standard ports using TLS.
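
The allow-list cross-reference can be sketched in a few lines of Python. The `Flow` record and its field names are hypothetical stand-ins for whatever your flow export provides, and the port set just reuses the two example ports above. TLS flows to non-listed ports are sorted first, since those are the ones the panel highlights.

```python
from dataclasses import dataclass

# Example ports from the post; a real allow-list would be managed elsewhere.
ALLOWED_PORTS = frozenset({443, 853})


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record; field names are illustrative."""
    agent_id: str
    dest_ip: str
    dest_port: int
    is_tls: bool


def non_compliant_flows(flows, allowed_ports=ALLOWED_PORTS):
    """Return flows to ports outside the allow-list, TLS flows first."""
    violations = [f for f in flows if f.dest_port not in allowed_ports]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(violations, key=lambda f: not f.is_tls)


flows = [
    Flow("agent-1", "203.0.113.5", 443, True),
    Flow("agent-1", "203.0.113.9", 8080, False),
    Flow("agent-2", "198.51.100.7", 8443, True),
]
for flow in non_compliant_flows(flows):
    print(flow.agent_id, flow.dest_port, "TLS" if flow.is_tls else "non-TLS")
```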

The accompanying alert rules are implemented in Prometheus. The critical rule for signature exfiltration detection:

```promql
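# Alert if an agent key's signature count spikes alongside external connection activity.
# increase() yields a count over the window; rate() would yield a per-second average.
(
  increase(hsm_sign_operations_total{operation="sign"}[1h])
  * on(instance) group_left(agent_id)
  agent_info
) > 10
# Match both sides on agent_id only; assumes the Zeek metric carries that label.
and on(agent_id)
(
  rate(zeek_conn_external_total[5m]) > 0
)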
```

The visualization links these three data streams on a per-`agent_id` dimension. This allows correlating, for example, a new outbound flow to a previously unseen domain on port 443 with a simultaneous 200% increase in ECDSA-P256 signing operations for that agent's attestation key.

I am particularly interested in the community's thoughts on key rotation as a detection mechanism. If an agent's key is used from a new network endpoint shortly after rotation, should this be considered a higher severity event? How are others integrating TPM-based attestation logs into their network anomaly views?


Don't roll your own crypto. Unless you have a spec.


   
(@runtime_audit_li)
Active Member
Joined: 1 week ago
Posts: 15
 

This is a solid foundation, particularly the correlation between signature spikes and new connections. However, I'm concerned about the fidelity of your baseline for anomaly detection. A 30-day rolling baseline for `signatures_per_hour > μ + 3σ` is susceptible to poisoning if an agent has already been compromised in a low-and-slow campaign during that window. The mean and standard deviation would simply adjust to include the malicious activity.

You should consider a dual-baseline system: one rolling (30-day) and one seasonal (e.g., the same hour on the same day of the week over the last 8 weeks). A divergence between these baselines can itself be an alert. Also, are your host audit logs capturing failed signing attempts or just successful calls? A failed-attempt counter, especially from a process not on the HSM module's allow-list, is often a more immediate indicator of key misuse than a spike in successful operations.
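
To make the dual-baseline idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the `max_ratio` cut-off, and the input shape, which is a dict mapping hour-aligned `datetime` values to signature counts for a single agent key. It compares the 30-day mean with the mean of the same hour on the same weekday over the previous 8 weeks and flags a large ratio between the two.

```python
from datetime import datetime, timedelta
from statistics import mean


def rolling_baseline(history, now, days=30):
    """Mean hourly count over the trailing `days` days, or None if empty."""
    cutoff = now - timedelta(days=days)
    values = [count for ts, count in history.items() if cutoff <= ts < now]
    return mean(values) if values else None


def seasonal_baseline(history, now, weeks=8):
    """Mean count for the same hour and weekday over the last `weeks` weeks."""
    slots = [now - timedelta(weeks=w) for w in range(1, weeks + 1)]
    values = [history[ts] for ts in slots if ts in history]
    return mean(values) if values else None


def baselines_diverge(history, now, max_ratio=2.0):
    """True if one baseline exceeds the other by more than max_ratio."""
    rolling = rolling_baseline(history, now)
    seasonal = seasonal_baseline(history, now)
    if not rolling or not seasonal:
        # Missing data or a zero baseline: no ratio can be computed.
        return False
    return max(rolling, seasonal) / min(rolling, seasonal) > max_ratio
```

The weekly samples skip the most recent seven days entirely, so a burst that has already inflated the 30-day mean shows up as a gap between the two numbers. The burst is not silently absorbed into a single threshold.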

Regarding the TLS fingerprint panel, do your Zeek logs capture the TLS extensions in the ClientHello? Anomalies in the ordered list of extensions or their internal values are frequently a more reliable fingerprint than cipher suite alone, which can be updated by normal software patches.
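
As an illustration of what an extension-aware fingerprint looks like, here is a small Python sketch in the style of JA3: it joins the ClientHello version, cipher suites, extensions, curves, and point formats into one string and hashes it with MD5, skipping GREASE values (RFC 8701) so that randomized placeholders do not change the result. The function names and plain-integer inputs are assumptions. How you pull these fields out of Zeek depends on your logging setup and is not shown.

```python
import hashlib

# GREASE values from RFC 8701: 0x0a0a, 0x1a1a, ..., 0xfafa.
GREASE = frozenset((b << 8) | b for b in range(0x0A, 0x100, 0x10))


def _join_field(values):
    """Join values in their original order with '-', dropping GREASE entries."""
    return "-".join(str(v) for v in values if v not in GREASE)


def ja3_style_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Return a JA3-style MD5 hex digest for ClientHello parameters.

    All arguments are plain integers or lists of integers as they appear
    on the wire; extension order is preserved, so reordering changes the hash.
    """
    raw = ",".join([
        str(version),
        _join_field(ciphers),
        _join_field(extensions),
        _join_field(curves),
        _join_field(point_formats),
    ])
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Same cipher suites, different extension order: the fingerprints differ.
a = ja3_style_fingerprint(771, [4865, 4866], [0, 10, 11, 13], [29, 23], [0])
b = ja3_style_fingerprint(771, [4865, 4866], [0, 13, 10, 11], [29, 23], [0])
print(a != b)
```

MD5 here serves only as a compact identifier, not as a security control. Some TLS clients randomize extension order, so check whether your agents' TLS stack does before alerting on order alone.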


Log everything, trust nothing


   
(@selfhost_raj)
Eminent Member
Joined: 1 week ago
Posts: 21
 

Nice setup! Correlating TLS data with HSM signatures is a clever angle. I'm doing something similar, but I had to add a separate panel for our *internal* service mesh traffic (mTLS). Found that agents under heavy load from a legitimate orchestration task can trigger false positives on your "spike with new connection" rule because they're signing health checks like crazy.

Have you pinned your Zeek instance's own certs? Early on, my alerts got flooded because Zeek was logging its own outbound update checks. 🙃


Selfhosted since 2004


   