ClevAgent
2026-03-29 · monitoring · production · failure-modes · debugging

Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch

Silent crashes, zombie processes, and runaway token loops — three production failures that process checks, log watchers, and CPU dashboards completely miss.

I run several AI agents in production — trading bots, data scrapers, monitoring agents. They run 24/7, unattended. Over the past few months, I've hit three failure modes that my existing monitoring (process checks, log watchers, CPU/memory alerts) completely missed.

These aren't exotic edge cases. If you're running any long-lived AI agent, you'll probably hit all three eventually.

Failure #1: The Silent Exit

One of my agents exited cleanly at 3 AM. No traceback. No error log. No crash dump. The Python process simply stopped. My log monitoring saw nothing because there was nothing to log.

I found out six hours later when I noticed the bot hadn't posted since 3 AM.

What happened

The OS killed the process for memory. The agent was slowly leaking — a library was caching LLM responses in memory without any eviction policy. RSS grew from 200MB to 4GB over a few days. The OOM killer sent SIGKILL, which leaves no Python traceback.

Why traditional monitoring missed it

  • Process monitoring (systemd, supervisor): Saw the exit code, but by the time you check the alert, the damage is done
  • Log monitoring (Datadog, CloudWatch): Nothing to see — the OOM kill happens below the application layer
  • CPU/memory dashboards: Would have caught it *if* someone was watching. Nobody watches dashboards at 3 AM.

The pattern that catches this

Positive heartbeat. Instead of monitoring for bad signals (errors, crashes), monitor for the *absence* of a good signal. The agent must actively report "I'm alive" every N seconds. If the heartbeat stops for any reason — clean exit, OOM, segfault, kernel panic — you know immediately.

Inside your agent's main loop:

    while True:
        result = do_work()
        heartbeat()  # This is the line that matters
        sleep(interval)

If heartbeat() doesn't fire, something is wrong. You don't need to know *what* — you need to know *when*.
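The receiving side of this pattern can be a plain dead-man's switch: the agent stamps a timestamp, and an independent checker alerts when the stamp goes stale. A minimal sketch in Python — the file path and silence threshold are illustrative assumptions, and in production the checker should run in a separate process (ideally on another host):

```python
import time
from pathlib import Path

# Hypothetical path and threshold -- tune MAX_SILENCE to ~2-3x your loop interval.
HEARTBEAT_FILE = Path("/tmp/my-agent.heartbeat")
MAX_SILENCE = 120  # seconds

def heartbeat():
    """Called from inside the agent's work loop."""
    HEARTBEAT_FILE.write_text(str(time.time()))

def agent_is_alive():
    """Run from a separate process (cron job, another host).
    False means the heartbeat is stale or missing -- time to page someone."""
    try:
        last_beat = float(HEARTBEAT_FILE.read_text())
    except (FileNotFoundError, ValueError):
        return False
    return (time.time() - last_beat) < MAX_SILENCE
```

A clean exit, an OOM kill, and a kernel panic all look identical to the checker: the timestamp simply stops advancing.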

Failure #2: The Zombie Agent

This one is more insidious. The process was running. CPU usage normal. Memory stable. Every health check said "healthy."

But the agent hadn't done useful work in four hours.

What happened

The agent was stuck on an HTTP request. An upstream API had rotated its TLS certificate, and the request was hanging — the socket was open, the connection was established, but the TLS handshake was deadlocked. No timeout was set on the request (a classic oversight).

From the outside, the process was "running." From the inside, the main loop was blocked on line 47 of api_client.py, and it would stay blocked forever.
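The oversight itself is a one-line fix: set a timeout, so a stalled handshake raises a loud, loggable exception instead of blocking silently. A sketch using the standard library — the URL, timeout value, and function name are illustrative, not from the incident:

```python
import socket
import urllib.request

def fetch_from_api(url, timeout=10):
    """With a timeout set, a hung connection raises within seconds
    instead of blocking the work loop forever."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (TimeoutError, socket.timeout, urllib.error.URLError):
        return None  # log it and skip this cycle; the loop keeps moving
```

The key design choice is that a slow or broken upstream degrades one cycle rather than killing the agent's liveness.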

Why traditional monitoring missed it

  • PID checks: Process exists ✓
  • Port checks: Agent's HTTP server responds ✓ (the health endpoint runs on a separate thread)
  • CPU/memory: Normal ✓

The health check thread was fine. The *work* thread was dead.

The pattern that catches this

Application-level heartbeat. The heartbeat must come from *inside the work loop*, not from a separate health-check thread or sidecar process.

Bad — heartbeat from a separate thread:

    def beat_forever():
        while True:
            heartbeat()
            sleep(30)

    threading.Thread(target=beat_forever, daemon=True).start()

Good — heartbeat from the actual work loop:

    while True:
        data = fetch_from_api()  # If this hangs...
        process(data)
        heartbeat()              # ...this never fires
        sleep(interval)

The difference is critical. If your heartbeat runs independently from your work loop, it's measuring "is the process alive?" not "is the agent working?" These are two very different questions.

Failure #3: The Runaway Loop

This is the scariest failure mode because the agent looks *great*. It's running. It's doing work. It's calling the LLM API, getting responses, processing them, and calling again. Every metric says "healthy."

Except your bill is exploding.

What happened

The agent received a malformed response from an API. It asked the LLM to parse it. The LLM returned a structured output that triggered the same code path again. The agent asked the LLM to re-parse. Same result. Repeat.

Token usage went from 200/min (normal) to 40,000/min. In 40 minutes, it burned through about $50 of API credits. Not catastrophic for a single incident, but imagine this happening overnight with a larger model.
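A cheap circuit breaker for this specific loop is to cap how many times the agent will send the same payload back to the LLM for re-parsing. A sketch — the attempt limit and the hash-keyed counter are my assumptions, not details from the incident:

```python
import hashlib

MAX_ATTEMPTS = 3  # assumption: give up on a payload after a few tries
_attempts = {}    # payload hash -> number of parse attempts so far

def should_retry_parse(payload: str) -> bool:
    """Return False once the same payload has been sent for parsing
    MAX_ATTEMPTS times, breaking the re-parse loop before it burns credits."""
    key = hashlib.sha256(payload.encode()).hexdigest()
    _attempts[key] = _attempts.get(key, 0) + 1
    return _attempts[key] <= MAX_ATTEMPTS
```

This stops the loop at the source; the cost metric below catches the cases you didn't anticipate.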

Why traditional monitoring missed it

  • Process health: Running ✓
  • Heartbeat: Firing normally ✓ (the loop is *running*, just wastefully)
  • Error rate: Zero ✓ (no errors — the LLM is responding successfully every time)
  • CPU/memory: Normal ✓ (LLM calls are I/O-bound, not compute-bound)

The pattern that catches this

Cost as a health metric. Track token usage (or API cost) per heartbeat cycle. If it spikes 10-100x above baseline, flag it.

    while True:
        start_tokens = get_token_count()
        result = do_llm_work()
        end_tokens = get_token_count()

        heartbeat(
            tokens_used=end_tokens - start_tokens,
            cost_estimate=calculate_cost(end_tokens - start_tokens),
        )
        sleep(interval)

This is the one metric that's unique to LLM-backed agents. Traditional services don't have a per-request cost that can spike 200x. AI agents do.
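The "flag a 10-100x spike" check can be a rolling baseline over recent cycles. A minimal sketch — the window size, spike factor, and noise floor are assumptions to tune for your own agent:

```python
from collections import deque

class CostSpikeDetector:
    """Flags a heartbeat cycle whose token usage far exceeds the recent
    average. Illustrative sketch; thresholds are assumptions."""

    def __init__(self, window=30, spike_factor=10, min_baseline=50):
        self.history = deque(maxlen=window)  # recent per-cycle token counts
        self.spike_factor = spike_factor     # alert at N-x the baseline
        self.min_baseline = min_baseline     # ignore noise at very low usage

    def record(self, tokens_used):
        """Record one cycle's token usage; return True if it's a spike."""
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(tokens_used)
        if baseline is None:
            return False  # no baseline yet
        return tokens_used > max(baseline, self.min_baseline) * self.spike_factor
```

With the numbers from this incident — a 200 tokens/cycle baseline jumping to 40,000 — the first runaway cycle trips the detector.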

The Monitoring Stack for AI Agents

After dealing with all three failures, I realized the monitoring requirements for AI agents are fundamentally different from web services:

| What to monitor | Web service     | AI agent                                                |
|-----------------|-----------------|---------------------------------------------------------|
| Is it alive?    | Process check   | Positive heartbeat (agent must prove it's alive)        |
| Is it working?  | Request latency | Application-level heartbeat (from inside the work loop) |
| Is it healthy?  | Error rate      | Cost per cycle (token usage as health signal)           |

The minimum viable version of this is surprisingly simple:

  • Put a heartbeat call inside your main loop (not in a health-check thread)
  • Include token/cost data in each heartbeat
  • Alert on silence (missed heartbeat) and on cost spikes

That alone would have caught all three of my failures within 60 seconds instead of hours.

What I Built

After reimplementing this pattern across multiple agents, I packaged it into ClevAgent — an open monitoring service for AI agents. Two lines of code to add heartbeat + cost tracking:

    import os
    import clevagent

    clevagent.init(api_key=os.environ["CLEVAGENT_API_KEY"], agent="my-bot")

    while True:
        result = do_work()
        clevagent.heartbeat(tokens=result.tokens_used)

It handles the alerting, auto-restart, loop detection, and daily reports. Free for up to 3 agents.

But honestly, the pattern matters more than the tool. Even if you roll your own with a simple webhook + PagerDuty, the three signals — heartbeat, application-level liveness, and cost tracking — will save you from 90% of production AI agent failures.

Start monitoring free →

3 agents free · No credit card · Setup in 30 seconds