Silent crashes, zombie processes, and runaway token loops — three production failures that process checks, log watchers, and CPU dashboards completely miss.
I run several AI agents in production — trading bots, data scrapers, monitoring agents. They run 24/7, unattended. Over the past few months, I've hit three failure modes that my existing monitoring (process checks, log watchers, CPU/memory alerts) completely missed.
These aren't exotic edge cases. If you're running any long-lived AI agent, you'll probably hit all three eventually.
One of my agents stopped at 3 AM without a trace. No traceback. No error log. No crash dump. The Python process was simply gone. My log monitoring saw nothing because there was nothing to log.
I found out six hours later when I noticed the bot hadn't posted since 3 AM.
The OS killed the process for using too much memory. The agent was slowly leaking: a library was caching LLM responses in memory without any eviction policy. RSS grew from 200 MB to 4 GB over a few days. The OOM killer sent SIGKILL, which a process cannot catch or handle, so Python never got the chance to write a traceback.
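The root cause here is a cache with no upper bound. As a minimal sketch of what an eviction policy looks like (this is not the library from the incident, which isn't named), the standard library's `functools.lru_cache` drops the least recently used entry once `maxsize` is reached. `call_llm` and `cached_llm` are hypothetical names used only for illustration.

```python
from functools import lru_cache


# Hypothetical stand-in for an expensive LLM call; not from the original post.
def call_llm(prompt: str) -> str:
    return prompt.upper()


# Bounded cache: once 1024 distinct prompts are stored, the least recently
# used entry is evicted, so the cache cannot grow without limit.
@lru_cache(maxsize=1024)
def cached_llm(prompt: str) -> str:
    return call_llm(prompt)
```

Calling `cached_llm.cache_info()` reports the current size and hit rate, which is handy to log alongside RSS when you suspect a leak.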
Positive heartbeat. Instead of monitoring for bad signals (errors, crashes), monitor for the *absence* of a good signal. The agent must actively report "I'm alive" every N seconds. If the heartbeat stops for any reason — clean exit, OOM, segfault, kernel panic — you know immediately.
# Inside your agent's main loop
while True:
    result = do_work()
    heartbeat()  # This is the line that matters
    sleep(interval)
If heartbeat() doesn't fire, something is wrong. You don't need to know *what* — you need to know *when*.
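One way to sketch both halves of a positive heartbeat, assuming a simple file-based signal: the agent writes a timestamp, and a separate watcher treats a missing or old timestamp as a failure. The file path, the function names, and the 60-second threshold are all illustrative assumptions, not part of the original setup.

```python
import time
from pathlib import Path
from typing import Optional

# Hypothetical location; any path the watcher can read will do.
HEARTBEAT_FILE = Path("/tmp/agent.heartbeat")


def heartbeat(path: Path = HEARTBEAT_FILE) -> None:
    """Agent side: record the current time as the latest sign of life."""
    path.write_text(str(time.time()))


def is_stale(path: Path = HEARTBEAT_FILE, max_age_s: float = 60.0,
             now: Optional[float] = None) -> bool:
    """Watcher side: True if no heartbeat arrived within max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        last = float(path.read_text())
    except (FileNotFoundError, ValueError):
        # No heartbeat yet, or an unreadable one, counts as stale.
        return True
    return now - last > max_age_s
```

The watcher has to run outside the agent process (a cron job or another host, for example). Otherwise whatever kills the agent can take the watcher down with it, and a kernel panic can only be noticed from a different machine.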
This one is more insidious. The process was running. CPU usage normal. Memory stable. Every health check said "healthy."
But the agent hadn't done useful work in four hours.
The agent was stuck on an HTTP request. An upstream API had rotated its TLS certificate, and the request was hanging: the socket was open and the connection was established, but the TLS handshake never completed. No timeout was set on the request (a classic oversight), so the call blocked indefinitely.
From the outside, the process was "running." From the inside, the main loop was blocked on line 47 of api_client.py, and it would stay blocked forever.
The health check thread was fine. The *work* thread was dead.
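The missing timeout is cheap to add. The HTTP client involved isn't specified, so this sketch uses the standard library's `urllib.request`; `fetch_or_none` is a hypothetical helper name. Note that the timeout applies to each blocking socket operation (connecting, handshaking, reading) and is not a cap on the total request time.

```python
import urllib.request
from typing import Optional


def fetch_or_none(url: str, timeout_s: float = 10.0) -> Optional[bytes]:
    """Fetch a URL, giving up if a blocking socket operation exceeds timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # Covers URLError, refused connections, and socket timeouts.
        return None
```

A timeout turns a silent hang into an ordinary error the loop can log and retry. It complements the heartbeat rather than replacing it, since code can block in places other than a socket.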
Application-level heartbeat. The heartbeat must come from *inside the work loop*, not from a separate health-check thread or sidecar process.
# Bad: heartbeat from a separate thread
def heartbeat_forever():
    while True:
        heartbeat()
        sleep(30)

threading.Thread(target=heartbeat_forever).start()

# Good: heartbeat from the actual work loop
while True:
    data = fetch_from_api()  # If this hangs...
    process(data)
    heartbeat()  # ...this never fires
    sleep(interval)
The difference is critical. If your heartbeat runs independently from your work loop, it's measuring "is the process alive?" not "is the agent working?" These are two very different questions.
This is the scariest failure mode because the agent looks *great*. It's running. It's doing work. It's calling the LLM API, getting responses, processing them, and calling again. Every metric says "healthy."
Except your bill is exploding.
The agent received a malformed response from an API. It asked the LLM to parse it. The LLM returned a structured output that triggered the same code path again. The agent asked the LLM to re-parse. Same result. Repeat.
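The loop described above has no exit condition: every failed parse triggers another LLM call. Separate from monitoring, a hard cap on attempts is one possible guard. This is a sketch under assumptions, with `parse_with_cap` and `parse_once` as hypothetical names; `parse_once` stands for whatever function asks the LLM to parse a payload and returns `None` on failure.

```python
from typing import Callable, Optional


def parse_with_cap(parse_once: Callable[[str], Optional[dict]],
                   payload: str, max_attempts: int = 3) -> Optional[dict]:
    """Call parse_once up to max_attempts times; return None if it never succeeds."""
    for _ in range(max_attempts):
        parsed = parse_once(payload)
        if parsed is not None:
            return parsed
    # Give up and let the caller skip or log the payload
    # instead of re-asking the LLM forever.
    return None
```

A cap bounds the damage from one bad payload, but it only protects the code paths you remembered to wrap, which is why a cost signal is still worth having.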
Token usage went from 200/min (normal) to 40,000/min. In 40 minutes, it burned through about $50 of API credits. Not catastrophic for a single incident, but imagine this happening overnight with a larger model.
Cost as a health metric. Track token usage (or API cost) per heartbeat cycle. If it spikes 10-100x above baseline, flag it.
while True:
    start_tokens = get_token_count()
    result = do_llm_work()
    end_tokens = get_token_count()

    heartbeat(
        tokens_used=end_tokens - start_tokens,
        cost_estimate=calculate_cost(end_tokens - start_tokens),
    )
    sleep(interval)
This is the one metric that's unique to LLM-backed agents. Traditional services don't have a per-request cost that can spike 200x. AI agents do.
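On the receiving side, flagging a spike needs a baseline to compare against. A minimal sketch, assuming the baseline is the median of recent per-cycle token counts and the threshold is 10x (the low end of the 10-100x range mentioned above). The class name, window size, and minimum sample count are illustrative choices.

```python
from collections import deque
from statistics import median


class TokenSpikeDetector:
    """Flags a heartbeat cycle whose token count far exceeds the recent baseline."""

    def __init__(self, window: int = 30, factor: float = 10.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # recent per-cycle token counts
        self.factor = factor
        self.min_samples = min_samples

    def observe(self, tokens_used: int) -> bool:
        """Record one cycle; return True if it looks like a spike."""
        spike = False
        if len(self.history) >= self.min_samples:
            baseline = median(self.history)
            spike = baseline > 0 and tokens_used > self.factor * baseline
        if not spike:
            # Keep spikes out of the baseline so a runaway loop stays flagged.
            self.history.append(tokens_used)
        return spike
```

With a baseline of 200 tokens per cycle, a jump to 40,000 is 200 times normal and is flagged on the first cycle it appears. Excluding flagged cycles from the history keeps a long-running loop from quietly becoming the new normal.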
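After dealing with all three failures, I realized the monitoring requirements for AI agents are fundamentally different from those for web services. Process checks, log watchers, and CPU/memory alerts answer "is the process alive?" An agent also needs answers to "is it doing useful work?" and "what is it costing?"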
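The minimum viable version of this is surprisingly simple: send a heartbeat from inside the work loop, attach token usage to each heartbeat, and alert when a heartbeat is overdue or token usage spikes far above baseline.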
That alone would have caught all three of my failures within 60 seconds instead of hours.
After reimplementing this pattern across multiple agents, I packaged it into ClevAgent — an open monitoring service for AI agents. Two lines of code to add heartbeat + cost tracking:
import os
import clevagent

clevagent.init(api_key=os.environ["CLEVAGENT_API_KEY"], agent="my-bot")

while True:
    result = do_work()
    clevagent.heartbeat(tokens=result.tokens_used)
It handles the alerting, auto-restart, loop detection, and daily reports. Free for up to 3 agents.
But honestly, the pattern matters more than the tool. Even if you roll your own with a simple webhook + PagerDuty, the three signals — heartbeat, application-level liveness, and cost tracking — will save you from 90% of production AI agent failures.