PID checks, port scans, and CPU dashboards all say 'healthy' while your agent hasn't done useful work in hours. Here's what actually catches the failure.
Your monitoring dashboard shows green across the board. Process running. Port responding. CPU normal. Memory stable.
But your AI agent hasn't done anything useful in four hours.
Traditional health checks answer one question: "Is the process alive?" For web servers, that's usually enough. If Nginx is running and responding on port 80, it's probably serving pages.
AI agents are different. An agent can be alive without being productive. The process is running, but the main work loop is stuck on a hung HTTP call, waiting on a deadlocked mutex, or spinning in a retry loop that will never succeed.
systemctl status my-agent says "active (running)". But the agent's main loop has been blocked on requests.get() for three hours because an upstream API rotated its TLS certificate and the connection is hanging without a timeout.
The health check thread runs independently and reports "I'm fine" every 30 seconds.
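The direct cure for that specific hang is to pass timeout= to requests.get. More generally, any blocking call can be wrapped in a deadline so the work loop always regains control. A stdlib-only sketch, assuming nothing beyond the standard library (call_with_deadline and hung_fetch are illustrative names, not an established API):

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s):
    """Run a blocking call in a worker thread and give up after timeout_s.

    The worker thread itself can't be killed, but the work loop regains
    control, keeps heartbeating, and can raise an alert.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)

def hung_fetch():
    # Stands in for a requests.get() with no timeout against an endpoint
    # that accepts the connection but never sends a response.
    time.sleep(2)
    return "data"

try:
    call_with_deadline(hung_fetch, timeout_s=0.5)
    status = "ok"
except concurrent.futures.TimeoutError:
    status = "timed out"

print(status)  # → timed out
```

The loop survives the hang; the three-hour freeze becomes a half-second blip.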
Many agents expose an HTTP health endpoint. A load balancer pings /health, gets 200 OK, and assumes everything is fine.
But the /health handler runs on a different thread from the agent's work loop. The work loop is dead. The health endpoint is alive. Two completely different things.
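One way to close that gap is to make the /health handler consult a timestamp that only the work loop updates, so the endpoint returns 503 when the loop goes quiet. A minimal sketch using the standard library (last_beat, MAX_SILENCE_S, and the handler name are illustrative, and a real agent would update last_beat from its work loop):

```python
import http.server
import json
import time

last_beat = time.monotonic()  # the work loop refreshes this after each cycle
MAX_SILENCE_S = 120           # illustrative threshold

class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Health is defined by the work loop's recency, not by the fact
        # that this handler thread happens to be alive.
        silence = time.monotonic() - last_beat
        healthy = silence < MAX_SILENCE_S
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        body = {"healthy": healthy, "seconds_since_heartbeat": round(silence)}
        self.wfile.write(json.dumps(body).encode())

    def log_message(self, *args):
        pass  # keep health-check noise out of the logs
```

Now the load balancer's 200 OK actually means "the work loop did work recently," not just "a thread answered the phone."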
Your error tracking shows zero exceptions. Must be healthy, right?
Except the agent is caught in a logic loop: parse response → ask LLM to fix → get same malformed response → repeat. Every request succeeds. Every response is valid. The agent just isn't making progress, and it's burning through API credits at 200x the normal rate.
The fix is a positive heartbeat from inside the work loop:
```python
while True:
    data = fetch_data()     # If this hangs...
    result = process(data)
    heartbeat()             # ...this never fires
    sleep(interval)
```
If heartbeat() doesn't fire within the expected interval, something is wrong. You don't need to know what — you need to know when.
The key insight: the heartbeat must come from inside the loop that does the actual work, not from a separate health-check thread or sidecar.
For LLM-backed agents, there's a third dimension: cost per cycle. A runaway loop doesn't spike CPU (LLM calls are I/O-bound). But it does spike token usage.
Track tokens per heartbeat cycle. If it jumps 10-100x above baseline, you have a loop — even if every other metric says "healthy."
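A rolling-baseline check is enough to catch that. A sketch under stated assumptions: TokenSpikeDetector is an illustrative name, the window size and 10x factor are placeholders you'd tune, and token counts come from whatever your LLM client reports per call:

```python
from collections import deque

class TokenSpikeDetector:
    """Track tokens per heartbeat cycle; flag cycles far above baseline."""

    def __init__(self, window=50, spike_factor=10.0, min_samples=5):
        self.history = deque(maxlen=window)  # recent normal cycles
        self.spike_factor = spike_factor
        self.min_samples = min_samples

    def record(self, tokens_this_cycle):
        """Return True if this cycle looks like a runaway loop."""
        if len(self.history) >= self.min_samples:
            baseline = sum(self.history) / len(self.history)
            if baseline > 0 and tokens_this_cycle > self.spike_factor * baseline:
                return True  # spike: don't pollute the baseline with it
        self.history.append(tokens_this_cycle)
        return False

detector = TokenSpikeDetector()
for t in [900, 1100, 1000, 950, 1050]:
    detector.record(t)          # normal cycles build the baseline
print(detector.record(120_000))  # → True
```

Emit the token count alongside each heartbeat and this check rides the same pipe: one signal for silence, one for runaway spend.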
The minimum version is simple: put a heartbeat inside your main loop, include token count, alert on silence and cost spikes. That catches 90% of AI agent failures that traditional monitoring misses.