PID checks, port scans, and CPU dashboards all say 'healthy' while your agent hasn't done useful work in hours. Here's what actually catches the failure.
Your monitoring dashboard shows green across the board. Process running. Port responding. CPU normal. Memory stable.
But your AI agent hasn't done anything useful in four hours.
Traditional health checks answer one question: "Is the process alive?" For web servers, that's usually enough. If Nginx is running and responding on port 80, it's probably serving pages.
AI agents are different. An agent can be alive without being productive. The process is running, but the main work loop is stuck on a hung HTTP call, waiting on a deadlocked mutex, or spinning in a retry loop that will never succeed.
systemctl status my-agent says "active (running)". But the agent's main loop has been blocked on requests.get() for three hours because an upstream API rotated its TLS certificate and the connection is hanging without a timeout.
The health check thread runs independently and reports "I'm fine" every 30 seconds.
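The direct cure for that specific hang is to pass timeout= to requests.get. More generally, any blocking call can be wrapped in a deadline so the work loop always regains control. A stdlib-only sketch, assuming nothing beyond the standard library (call_with_deadline and hung_fetch are illustrative names, not an established API):

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s):
    """Run a blocking call in a worker thread and give up after timeout_s.

    The worker thread itself can't be killed, but the work loop regains
    control, keeps heartbeating, and can raise an alert.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)

def hung_fetch():
    # Stands in for a requests.get() with no timeout against an endpoint
    # that accepts the connection but never sends a response.
    time.sleep(2)
    return "data"

try:
    call_with_deadline(hung_fetch, timeout_s=0.5)
    status = "ok"
except concurrent.futures.TimeoutError:
    status = "timed out"

print(status)  # → timed out
```

The loop survives the hang; the three-hour freeze becomes a half-second blip.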
Many agents expose an HTTP health endpoint. A load balancer pings /health, gets 200 OK, and assumes everything is fine.
But the /health handler runs on a different thread from the agent's work loop. The work loop is dead. The health endpoint is alive. Two completely different things.
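One way to close that gap is to make the /health handler consult a timestamp that only the work loop updates, so the endpoint returns 503 when the loop goes quiet. A minimal sketch using the standard library (last_beat, MAX_SILENCE_S, and the handler name are illustrative, and a real agent would update last_beat from its work loop):

```python
import http.server
import json
import time

last_beat = time.monotonic()  # the work loop refreshes this after each cycle
MAX_SILENCE_S = 120           # illustrative threshold

class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Health is defined by the work loop's recency, not by the fact
        # that this handler thread happens to be alive.
        silence = time.monotonic() - last_beat
        healthy = silence < MAX_SILENCE_S
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        body = {"healthy": healthy, "seconds_since_heartbeat": round(silence)}
        self.wfile.write(json.dumps(body).encode())

    def log_message(self, *args):
        pass  # keep health-check noise out of the logs
```

Now the load balancer's 200 OK actually means "the work loop did work recently," not just "a thread answered the phone."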
Your error tracking shows zero exceptions. Must be healthy, right?
Except the agent is caught in a logic loop: parse response → ask LLM to fix → get same malformed response → repeat. Every request succeeds. Every response is valid. The agent just isn't making progress, and it's burning through API credits at 200x the normal rate.
The fix is a positive heartbeat from inside the work loop:
```python
while True:
    data = fetch_data()     # If this hangs...
    result = process(data)
    heartbeat()             # ...this never fires
    sleep(interval)
```
If heartbeat() doesn't fire within the expected interval, something is wrong. You don't need to know what — you need to know when.
The key insight: the heartbeat must come from inside the loop that does the actual work, not from a separate health-check thread or sidecar.
For LLM-backed agents, there's a third dimension: cost per cycle. A runaway loop doesn't spike CPU (LLM calls are I/O-bound). But it does spike token usage.
Track tokens per heartbeat cycle. If it jumps 10-100x above baseline, you have a loop — even if every other metric says "healthy."
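A rolling-baseline check is enough to catch that. A sketch under stated assumptions: TokenSpikeDetector is an illustrative name, the window size and 10x factor are placeholders you'd tune, and token counts come from whatever your LLM client reports per call:

```python
from collections import deque

class TokenSpikeDetector:
    """Track tokens per heartbeat cycle; flag cycles far above baseline."""

    def __init__(self, window=50, spike_factor=10.0, min_samples=5):
        self.history = deque(maxlen=window)  # recent normal cycles
        self.spike_factor = spike_factor
        self.min_samples = min_samples

    def record(self, tokens_this_cycle):
        """Return True if this cycle looks like a runaway loop."""
        if len(self.history) >= self.min_samples:
            baseline = sum(self.history) / len(self.history)
            if baseline > 0 and tokens_this_cycle > self.spike_factor * baseline:
                return True  # spike: don't pollute the baseline with it
        self.history.append(tokens_this_cycle)
        return False

detector = TokenSpikeDetector()
for t in [900, 1100, 1000, 950, 1050]:
    detector.record(t)          # normal cycles build the baseline
print(detector.record(120_000))  # → True
```

Emit the token count alongside each heartbeat and this check rides the same pipe: one signal for silence, one for runaway spend.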
The minimum version is simple: put a heartbeat inside your main loop, include token count, alert on silence and cost spikes. That catches 90% of AI agent failures that traditional monitoring misses.