
Monitoring Agents with Grafana: My Tried-and-True Approach

📖 5 min read · 874 words · Updated Mar 16, 2026

The first time one of my agents silently stopped working, I didn’t notice for three days. Three days of missed scheduled reports. Three days of unanswered automated messages. Three days of a monitoring job that wasn’t monitoring anything.

My client noticed before I did. That was embarrassing.

So I set up Grafana to watch my agents the way my agents watch everything else. Now I know within minutes when something goes wrong, and I usually know why before I even open a terminal.

What to Monitor (And What Not To)

When I first set up monitoring, I tracked everything. Response times per request, token counts per message, model confidence scores, memory usage by session, error rates by type, latency histograms… 47 metrics on 12 panels.

I looked at that dashboard for a week. Then I realized I was only ever looking at 4 things:

Is it running? Simple up/down check. Green dot = process is alive and responding. This catches crashes, hangs, and infrastructure failures.

Is it slow? Average response time over the last 5 minutes. Normally 2-3 seconds. When it creeps past 8 seconds, something’s wrong — usually context bloat or API provider issues.

Is it failing? Error rate as a percentage of total requests. Below 2% is normal (occasional API timeouts). Above 5% means systematic problems.

Is it expensive? Running cost for the current day compared to the daily average. A 2x spike means something is generating unexpectedly long or frequent requests.

I stripped my dashboard down to these four metrics. One row, four big numbers with color coding. That’s what I look at 10 times a day. Everything else is on a “details” page I visit only when debugging.

The Setup

Data collection: I wrote a small script that parses OpenClaw logs and exposes metrics in Prometheus format. It runs as a sidecar process and re-reads the log file every 30 seconds. About 50 lines of code. Nothing fancy.

The metrics it exposes:
openclaw_requests_total (counter, labeled by type)
openclaw_response_seconds (histogram)
openclaw_errors_total (counter, labeled by error type)
openclaw_tokens_used (counter, labeled by direction: input/output)
openclaw_process_up (gauge, 1 or 0)
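A minimal sketch of what such a sidecar parser can look like, using only the standard library. The log line format, regex, and label names here are assumptions for illustration — OpenClaw's actual log schema will differ, and the real script needs to match it:

```python
import re
from collections import Counter

# Hypothetical log line format -- adjust the regex to your actual logs,
# e.g. "2026-03-16T08:00:01Z request type=cron status=ok duration=2.4"
LINE_RE = re.compile(
    r"request type=(?P<type>\w+) status=(?P<status>\w+) duration=(?P<dur>[\d.]+)"
)

def parse_metrics(log_lines):
    """Aggregate log lines into Prometheus text exposition format."""
    requests = Counter()   # request count per type
    errors = Counter()     # error count per type
    total_seconds = 0.0
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        requests[m["type"]] += 1
        total_seconds += float(m["dur"])
        if m["status"] != "ok":
            errors[m["type"]] += 1

    out = ["# TYPE openclaw_requests_total counter"]
    for t, n in sorted(requests.items()):
        out.append(f'openclaw_requests_total{{type="{t}"}} {n}')
    out.append("# TYPE openclaw_errors_total counter")
    for t, n in sorted(errors.items()):
        out.append(f'openclaw_errors_total{{error_type="{t}"}} {n}')
    out.append("# TYPE openclaw_process_up gauge")
    out.append(f"openclaw_process_up {1 if requests else 0}")
    return "\n".join(out) + "\n"
```

In practice the script also serves this text over HTTP on a port Prometheus can reach; the response-time histogram needs per-bucket bookkeeping I've omitted here for brevity.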

Prometheus scrapes these metrics every 15 seconds. Default retention is 15 days, which is enough for my needs. Prometheus runs on the same server as OpenClaw — it uses about 100MB of RAM for this small workload.
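The scrape side is a few lines of Prometheus config. The port here is an assumption — use whatever your sidecar actually listens on:

```yaml
# prometheus.yml -- scrape the sidecar exporter every 15s
scrape_configs:
  - job_name: "openclaw"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9101"]
```

Retention is set at Prometheus startup with `--storage.tsdb.retention.time`; 15 days is the default, so no flag is needed for this setup.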

Grafana visualizes the metrics. I use Grafana Cloud’s free tier (10,000 metrics, which is plenty). You can also self-host Grafana on the same server — it’s lightweight.
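Assuming the exporter publishes a standard Prometheus histogram (which adds `_sum`, `_count`, and `_bucket` series), the four big numbers can be driven by queries along these lines — a sketch, not my exact panels:

```promql
# Is it running?  (stat panel, 1 = up)
openclaw_process_up

# Is it slow?  Mean response time over the last 5 minutes
rate(openclaw_response_seconds_sum[5m])
  / rate(openclaw_response_seconds_count[5m])

# Is it failing?  Errors as a fraction of all requests
sum(rate(openclaw_errors_total[5m]))
  / sum(rate(openclaw_requests_total[5m]))

# Is it expensive?  Tokens consumed today (multiply by your per-token price)
sum(increase(openclaw_tokens_used[24h]))
```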

Total setup time: about 2 hours the first time. Most of that was writing the log parser.

The Alerts That Work

I have four alerts. I’ve tuned their thresholds over three months to minimize false positives:

Process down for > 2 minutes. Fires if the up/down check fails for two consecutive minutes. Two minutes gives enough buffer for restarts and brief network blips. This sends a push notification to my phone.

Response time p95 > 15 seconds for 5 minutes. A single slow response doesn’t matter. Five minutes of consistently slow responses means something is systematically wrong. This posts to my Slack alerts channel.

Error rate > 10% for 3 minutes. I set this higher than you might expect (10% instead of 5%) because brief API timeout bursts are normal during provider maintenance. Three minutes of sustained high errors means it’s not a blip. Phone notification.

Daily cost > 3x rolling average. Checked hourly. Catches runaway loops and unexpected usage patterns before they get expensive. Slack alert only — this is informational, not urgent.
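The first three translate directly into Prometheus alerting rules. The rule names and `severity` labels below are my own invention (routing to phone vs. Slack happens in Alertmanager); the cost alert needs a derived cost metric or an external check, so it isn't shown:

```yaml
# alerts.yml -- a sketch of the first three alerts as Prometheus rules
groups:
  - name: openclaw
    rules:
      - alert: OpenClawDown
        expr: openclaw_process_up == 0
        for: 2m
        labels:
          severity: page       # routed to a phone push in Alertmanager
      - alert: OpenClawSlow
        expr: >
          histogram_quantile(0.95,
            sum(rate(openclaw_response_seconds_bucket[5m])) by (le)) > 15
        for: 5m
        labels:
          severity: slack
      - alert: OpenClawErrorRate
        expr: >
          sum(rate(openclaw_errors_total[3m]))
            / sum(rate(openclaw_requests_total[3m])) > 0.10
        for: 3m
        labels:
          severity: page
```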

I removed two alerts that were too noisy: “any single request > 30 seconds” (happened too often during complex agent tasks) and “memory usage > 80%” (irrelevant — Node.js manages its own garbage collection and brief spikes are normal).

The Dashboard That Caught Real Problems

February: Gradual context bloat. Response times crept from 2.5s to 7s over two weeks. The trend line was obvious on the dashboard — individual requests looked fine, but the daily average was climbing. Root cause: conversation contexts were growing because compaction wasn’t triggering correctly. A config fix brought response times back to normal within an hour.

March: Cost spike from a loop. A cron job had a retry mechanism that, due to a bug, kept retrying indefinitely when the API returned a specific error code. The daily cost alert fired at 2x the average. I caught it within 2 hours. Without the alert, it would have run until the API key hit its spending limit.

March: Silent cron failure. My daily report job stopped running. No error — it just didn’t execute. The dashboard showed the expected daily spike in activity at 8 AM was missing. The cron scheduler had crashed after an update. Restarting it fixed the issue, and I added the process-down alert for the scheduler specifically.

What I’d Tell Past Me

Start with the four basic metrics. Add complexity only when you have a specific debugging need. Most monitoring dashboards fail because they’re too complex — you build 20 panels, get overwhelmed by data, and stop looking at the dashboard entirely.

The dashboard you actually use beats the dashboard that tracks everything. Make it glanceable, make the alerts actionable, and review the thresholds monthly. That’s the whole strategy.

🕒 Last updated: March 16, 2026 · Originally published: January 8, 2026

🤖 Written by Jake Chen

AI automation specialist with 5+ years building AI agents. Previously at a Y Combinator startup. Runs OpenClaw deployments for 200+ users.
