ClevAgent
2026-04-01 · monitoring · docker · tutorial · production

How to Monitor AI Agents in Docker Compose

Step-by-step guide to adding heartbeat monitoring, crash detection, and auto-restart to your AI agents running in Docker Compose.

You deploy your AI agent in Docker Compose. It runs fine for two days. Then, at 2 AM on a Saturday, it crashes. No alert. No restart. You find out Monday morning when a client asks why the data pipeline has been dead for 40 hours.

Docker's built-in health checks don't help here. They can tell you whether a container responds to curl localhost:8080/health, but your AI agent doesn't serve HTTP requests — it runs a loop, calls LLMs, and processes results. A traditional health check has nothing to ping.

This guide shows how to add heartbeat monitoring to an AI agent running in Docker Compose, so crashes are detected in under 60 seconds and the container restarts automatically.

The setup

We'll work with a simple example: a Python AI agent that runs in a loop, calls an LLM, and processes the response. It runs as a Docker Compose service alongside whatever else your stack needs (database, Redis, API server, etc.).

Here's the agent code before monitoring:

agent.py

import time

import openai

client = openai.OpenAI()

def run_agent():
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Analyze latest market data"}],
        )
        result = response.choices[0].message.content
        process_result(result)
        time.sleep(60)

def process_result(result):
    # Your business logic here
    print(f"Processed: {result[:80]}...")

if __name__ == "__main__":
    run_agent()

And the Compose file:

docker-compose.yml

services:
  agent:
    build: .
    env_file: .env
    restart: unless-stopped

This works until it doesn't. The restart: unless-stopped policy restarts on crashes — but only Docker-level crashes. If the Python process hangs (deadlocked socket, stuck LLM call, zombie thread), Docker thinks the container is healthy because the PID is still running.
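To see the class of failure Docker misses, here's a minimal sketch (not the agent above) of a process that hangs on a blocking call. The queue stands in for a deadlocked socket or a stuck LLM call: the PID stays alive, so `restart: unless-stopped` never fires.

```python
import queue
import threading
import time

def stuck_agent_loop(q: queue.Queue) -> None:
    # Blocks forever because nothing is ever put on the queue,
    # the same shape as a deadlocked socket or a hung API call.
    q.get()  # no timeout: this call never returns

q = queue.Queue()
worker = threading.Thread(target=stuck_agent_loop, args=(q,), daemon=True)
worker.start()

time.sleep(2)
# The process is "up" as far as Docker is concerned, but no work happens.
print(f"worker alive: {worker.is_alive()}")  # → worker alive: True
```

From outside the container, `docker ps` still shows the service as running; only something watching for missing heartbeats notices.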


Try it now — zero-code monitoring in 1 command:

pip install clevagent-runner
clevagent-runner start --api-key cv_xxx --watch "python my_agent.py"

No code changes. Free for 3 agents. Get your API key →


Adding heartbeat monitoring

Two changes. First, install the SDK:

Dockerfile

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "agent.py"]

requirements.txt

openai>=1.0
clevagent>=0.3

Second, add two lines to your agent code:

agent.py

import os
import time

import openai

import clevagent

# Initialize monitoring — starts a background heartbeat every 60s
clevagent.init(
    api_key=os.environ["CLEVAGENT_API_KEY"],
    agent="market-analyzer",
)

client = openai.OpenAI()

def run_agent():
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Analyze latest market data"}],
        )
        result = response.choices[0].message.content
        process_result(result)
        time.sleep(60)

def process_result(result):
    print(f"Processed: {result[:80]}...")

if __name__ == "__main__":
    run_agent()

That's it for the code. clevagent.init() spawns a lightweight background thread that sends a heartbeat ping every 60 seconds. If the process crashes, the pings stop, and ClevAgent detects the silence within 47 seconds on average (based on the default 120-second timeout window).
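The SDK's internals aren't shown in this post, but the mechanism is easy to picture. Here's a rough stdlib-only sketch of a background heartbeat thread; the `send_ping` callback and the interval handling are illustrative, not ClevAgent's actual implementation:

```python
import threading
import time

def start_heartbeat(send_ping, interval: float = 60.0) -> threading.Event:
    """Spawn a daemon thread that calls send_ping() every `interval` seconds.

    Returns an Event you can set to stop the thread. Because the thread is a
    daemon, it dies with the process: when the agent crashes, the pings stop,
    and the silence is the signal.
    """
    stop = threading.Event()

    def loop():
        # Event.wait doubles as the sleep; returns True once stop is set.
        while not stop.wait(interval):
            try:
                send_ping()
            except Exception:
                # Never let a failed ping kill the heartbeat thread itself.
                pass

    threading.Thread(target=loop, daemon=True).start()
    return stop

# Demo with a fast interval and a counter instead of a real HTTP call:
pings = []
stop = start_heartbeat(lambda: pings.append(time.monotonic()), interval=0.1)
time.sleep(0.55)
stop.set()
print(f"several pings sent: {len(pings) >= 3}")
```

The key design choice is the daemon flag: nothing has to detect the crash from inside the process, because a dead process simply stops pinging.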

The complete Docker Compose setup

Here's a production-ready docker-compose.yml with monitoring wired in:

docker-compose.yml

services:
  agent:
    build: .
    env_file: .env
    restart: unless-stopped
    environment:
      - CLEVAGENT_API_KEY=${CLEVAGENT_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    deploy:
      resources:
        limits:
          memory: 2G
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

  # If you run multiple agents, each gets its own service
  scraper-agent:
    build: ./scraper
    env_file: .env
    restart: unless-stopped
    environment:
      - CLEVAGENT_API_KEY=${CLEVAGENT_API_KEY}

And your .env:

CLEVAGENT_API_KEY=cv_your_api_key_here
OPENAI_API_KEY=sk-your_openai_key_here

Each service that calls clevagent.init() with a unique agent name shows up as a separate agent on the dashboard. One API key, multiple agents.

What happens when it crashes

Let's walk through the timeline of a real crash:

02:00:00  Agent heartbeat ✓
02:01:00  Agent heartbeat ✓
02:01:34  OOM killer sends SIGKILL — process dies instantly
02:01:34  No more heartbeats
02:02:00  Expected heartbeat... missing
02:02:21  ClevAgent marks agent as "unresponsive" (47s after last heartbeat)
02:02:21  Alert sent → Telegram, Slack, or webhook
02:02:21  Auto-restart triggered → Docker container restarted
02:02:24  Container back up, agent.py starts
02:02:25  First heartbeat from new process ✓
02:02:25  ClevAgent marks agent as "healthy"

Total downtime: 51 seconds. Compare that to finding out Monday morning.

The detection works the same way regardless of *how* the agent dies:

| Failure mode | Docker sees it? | ClevAgent detects? | Time to detect |
|--------------|-----------------|--------------------|----------------|
| OOM kill (SIGKILL) | Yes (exit code 137) | Yes | ~47s |
| Unhandled exception | Yes (exit code 1) | Yes | ~47s |
| Hung API call / deadlock | No (PID alive) | Yes | ~47s |
| Infinite loop (no crash) | No | Yes (via loop detection) | ~2 min |
| Network partition | No | Yes | ~47s |

Docker's restart: unless-stopped handles the first two cases on its own. ClevAgent covers the last three — the ones where Docker thinks everything is fine.

Configuring auto-restart

By default, ClevAgent sends alerts but doesn't restart anything. To enable auto-restart for Docker Compose agents, go to the ClevAgent dashboard:

  • Select your agent
  • Under Recovery, set "On crash" to Restart container
  • Set the restart method to Docker and provide the container name (or let ClevAgent auto-detect it from the heartbeat metadata)

You can also set a maximum restart count (e.g., 3 restarts within 10 minutes) to prevent restart loops when the crash is caused by a persistent issue like a bad config or an expired API key.

For agents that run scheduled work (batch jobs, daily reports), you can set the recovery action to Alert only — so you get notified but the container isn't restarted until the next scheduled run.
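Server-side, a cap like "3 restarts within 10 minutes" is just a sliding-window counter. A hedged sketch of the idea (not ClevAgent's code; class and parameter names are made up for illustration):

```python
import time
from collections import deque

class RestartBudget:
    """Allow at most `max_restarts` within a sliding `window` of seconds."""

    def __init__(self, max_restarts: int = 3, window: float = 600.0):
        self.max_restarts = max_restarts
        self.window = window
        self.history: deque = deque()  # timestamps of granted restarts

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop restarts that have aged out of the window.
        while self.history and now - self.history[0] > self.window:
            self.history.popleft()
        if len(self.history) >= self.max_restarts:
            return False  # budget exhausted: alert instead of restarting
        self.history.append(now)
        return True

budget = RestartBudget(max_restarts=3, window=600.0)
print([budget.allow(now=t) for t in (0, 60, 120, 180, 700)])
# → [True, True, True, False, True]
# Three restarts pass, the fourth is blocked, and once the early restarts
# age out of the 10-minute window a new one is allowed again.
```

The sliding window matters: a plain counter would block a legitimately flaky agent forever, while the window lets the budget refill once the persistent issue is fixed.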

Healthier containers

One useful pattern is combining Docker's native health check with the ClevAgent heartbeat. Docker handles container-level restarts, ClevAgent handles application-level detection:

services:
  agent:
    build: .
    env_file: .env
    restart: unless-stopped
    environment:
      - CLEVAGENT_API_KEY=${CLEVAGENT_API_KEY}
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9191/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

The ClevAgent SDK exposes a local health endpoint on port 9191 by default. Docker's health check queries it, and if three consecutive checks fail, Docker restarts the container — even for zombie processes where the main loop is stuck but the PID is alive.
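What such a local endpoint amounts to, sketched with the stdlib. The liveness check here is a placeholder lambda, and the demo binds an ephemeral port rather than the SDK's 9191; the real SDK ties the check to its heartbeat state:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_handler(is_alive):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 200 while the agent looks alive, 503 otherwise; Docker's
            # healthcheck only cares whether the request succeeds.
            if self.path == "/health" and is_alive():
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(503)
                self.end_headers()

        def log_message(self, *args):  # keep the demo quiet
            pass

    return HealthHandler

# Port 0 asks the OS for a free port; the SDK would use 9191.
server = HTTPServer(("127.0.0.1", 0), make_handler(lambda: True))
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/health"
with urllib.request.urlopen(url) as resp:
    status = resp.status
print(f"status: {status}")  # → status: 200
server.shutdown()
```

If the main loop is wedged, a check wired to heartbeat freshness (rather than `lambda: True`) starts returning 503, and after three failed probes Docker restarts the container locally with no network dependency.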

This gives you two layers:

  • Docker health check → restarts the container locally (fast, no network dependency)
  • ClevAgent heartbeat → alerts you remotely, tracks history, detects patterns across agents

Going further

Once heartbeats are flowing, you get more than crash detection:

  • Token cost tracking: Pass tokens to clevagent.heartbeat() in your loop to track spend per agent per day
  • Multi-agent overview: See all your agents across all Docker hosts in one dashboard
  • Uptime history: SLA-style uptime tracking without building your own status page

The core pattern is simple: your agent proves it's alive every 60 seconds. Everything else — alerts, restarts, dashboards — is built on top of that signal.


Free for 3 agents — start monitoring →

Related reading:

  • How to Monitor AI Agents in Production
  • Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch
  • How to Add Runtime Monitoring to Your FastAPI AI Agent

Get your API key in 30s →

3 agents free · No credit card · Setup in 30 seconds