Step-by-step guide to adding heartbeat monitoring, crash detection, and auto-restart to your AI agents running in Docker Compose.
You deploy your AI agent in Docker Compose. It runs fine for two days. Then, at 2 AM on a Saturday, it crashes. No alert. No restart. You find out Monday morning when a client asks why the data pipeline has been dead for 40 hours.
Docker's built-in health checks don't help here. They can tell you whether a container responds to curl localhost:8080/health, but your AI agent doesn't serve HTTP requests — it runs a loop, calls LLMs, and processes results. A traditional health check has nothing to ping.
This guide shows how to add heartbeat monitoring to an AI agent running in Docker Compose, so crashes are detected in under 60 seconds and the container restarts automatically.
We'll work with a simple example: a Python AI agent that runs in a loop, calls an LLM, and processes the response. It runs as a Docker Compose service alongside whatever else your stack needs (database, Redis, API server, etc.).
Here's the agent code before monitoring:
agent.py
import time
import openai

client = openai.OpenAI()

def run_agent():
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Analyze latest market data"}],
        )
        result = response.choices[0].message.content
        process_result(result)
        time.sleep(60)

def process_result(result):
    # Your business logic here
    print(f"Processed: {result[:80]}...")

if __name__ == "__main__":
    run_agent()
And the Compose file:
docker-compose.yml
services:
  agent:
    build: .
    env_file: .env
    restart: unless-stopped
This works until it doesn't. The restart: unless-stopped policy restarts the container when its main process exits, but only then. If the Python process hangs (deadlocked socket, stuck LLM call, zombie thread), Docker still sees a running PID and treats the container as fine.
Try it now — zero-code monitoring in 1 command:
pip install clevagent-runner
clevagent-runner start --api-key cv_xxx --watch "python my_agent.py"
No code changes. Free for 3 agents.
Two changes. First, install the SDK:
Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "agent.py"]
requirements.txt
openai>=1.0
clevagent>=0.3
Second, add two lines to your agent code:
agent.py
import os
import time
import openai
import clevagent

# Initialize monitoring: starts a background heartbeat every 60s
clevagent.init(
    api_key=os.environ["CLEVAGENT_API_KEY"],
    agent="market-analyzer",
)

client = openai.OpenAI()

def run_agent():
    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Analyze latest market data"}],
        )
        result = response.choices[0].message.content
        process_result(result)
        time.sleep(60)

def process_result(result):
    print(f"Processed: {result[:80]}...")

if __name__ == "__main__":
    run_agent()
That's it for the code. clevagent.init() spawns a lightweight background thread that sends a heartbeat ping every 60 seconds. If the process crashes, the pings stop, and ClevAgent detects the silence within 47 seconds on average (based on the default 120-second timeout window).
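To make the mechanism concrete, here is a minimal sketch of a background heartbeat thread. It is not ClevAgent's actual implementation: `start_heartbeat` and the `send` callback are hypothetical names used only for illustration, and a real SDK would presumably make a network call where this sketch invokes `send`.

```python
import threading
import time


def start_heartbeat(send, interval=60.0):
    """Call send() every `interval` seconds on a daemon thread.

    Hypothetical helper for illustration; not part of the clevagent SDK.
    Returns an Event; setting it stops the heartbeat.
    """
    stop = threading.Event()

    def beat():
        # wait() returns False on timeout (keep beating) and True once stop is set.
        while not stop.wait(interval):
            try:
                send()
            except Exception:
                # A failed ping must never take down the agent itself.
                pass

    # daemon=True: the thread dies with the process, so a crash silences the pings.
    threading.Thread(target=beat, daemon=True, name="heartbeat").start()
    return stop


if __name__ == "__main__":
    stop = start_heartbeat(lambda: print("heartbeat", time.strftime("%H:%M:%S")), interval=1.0)
    time.sleep(3.5)  # stand-in for the agent's main loop
    stop.set()
```

In this sketch the heartbeat runs independently of the main loop, so it signals that the process is alive. Detecting a loop that is stuck while the process survives would additionally require the loop itself to report progress.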
Here's a production-ready docker-compose.yml with monitoring wired in:
docker-compose.yml
services:
  agent:
    build: .
    env_file: .env
    restart: unless-stopped
    environment:
      - CLEVAGENT_API_KEY=${CLEVAGENT_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    deploy:
      resources:
        limits:
          memory: 2G
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

  # If you run multiple agents, each gets its own service
  scraper-agent:
    build: ./scraper
    env_file: .env
    restart: unless-stopped
    environment:
      - CLEVAGENT_API_KEY=${CLEVAGENT_API_KEY}
And your .env:
CLEVAGENT_API_KEY=cv_your_api_key_here
OPENAI_API_KEY=sk-your_openai_key_here
Each service that calls clevagent.init() with a unique agent name shows up as a separate agent on the dashboard. One API key, multiple agents.
Let's walk through the timeline of a real crash:
02:00:00 Agent heartbeat ✓
02:01:00 Agent heartbeat ✓
02:01:34 OOM killer sends SIGKILL — process dies instantly
02:01:34 No more heartbeats
02:02:00 Expected heartbeat... missing
02:02:21 ClevAgent marks agent as "unresponsive" (47s after the crash)
02:02:21 Alert sent → Telegram, Slack, or webhook
02:02:21 Auto-restart triggered → Docker container restarted
02:02:24 Container back up, agent.py starts
02:02:25 First heartbeat from new process ✓
02:02:25 ClevAgent marks agent as "healthy"
Total downtime: 51 seconds. Compare that to finding out Monday morning.
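The figures in this timeline follow directly from its timestamps. A quick check in Python, using only the times listed above (the helper name `seconds_between` is just for this example):

```python
from datetime import datetime


def seconds_between(start, end):
    """Whole seconds from `start` to `end`, both given as HH:MM:SS on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


last_heartbeat = "02:01:00"  # last successful ping
crash = "02:01:34"           # OOM killer sends SIGKILL
detected = "02:02:21"        # agent marked "unresponsive"
recovered = "02:02:25"       # first heartbeat from the new process

print(seconds_between(crash, detected))           # 47: crash to detection
print(seconds_between(last_heartbeat, detected))  # 81: last heartbeat to detection
print(seconds_between(crash, recovered))          # 51: total downtime
```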
The detection works the same way regardless of *how* the agent dies:
Docker's restart: unless-stopped handles the first two cases on its own. ClevAgent covers the last three — the ones where Docker thinks everything is fine.
By default, ClevAgent sends alerts but doesn't restart anything. To enable auto-restart for Docker Compose agents, go to the ClevAgent dashboard:
You can also set a maximum restart count (e.g., 3 restarts within 10 minutes) to prevent restart loops if the crash is caused by a persistent issue like a bad config or expired API key.
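A cap like "3 restarts within 10 minutes" is typically a sliding-window counter. The sketch below shows one way such a limit could work. It is an illustration only, not ClevAgent's implementation, and `RestartLimiter` is a hypothetical name.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most `max_restarts` restarts within any `window_seconds` span."""

    def __init__(self, max_restarts=3, window_seconds=600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._clock = clock        # injectable so the limiter can be tested without waiting
        self._restarts = deque()   # timestamps of restarts still inside the window

    def allow_restart(self):
        now = self._clock()
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # likely a persistent fault: stop restarting and alert instead
        self._restarts.append(now)
        return True
```

With the defaults, a fourth crash inside ten minutes is refused, which is the point at which a bad config or expired API key should be escalated to a human rather than retried.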
For agents that run scheduled work (batch jobs, daily reports), you can set the recovery action to Alert only — so you get notified but the container isn't restarted until the next scheduled run.
One useful pattern is combining Docker's native health check with the ClevAgent heartbeat. Docker handles container-level restarts; ClevAgent handles application-level detection:
services:
  agent:
    build: .
    env_file: .env
    restart: unless-stopped
    environment:
      - CLEVAGENT_API_KEY=${CLEVAGENT_API_KEY}
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9191/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
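The ClevAgent SDK exposes a local health endpoint on port 9191 by default. Docker's health check queries it, and after three consecutive failures Docker marks the container unhealthy, even for zombie processes where the main loop is stuck but the PID is alive. Note that under plain Docker Compose, restart: unless-stopped reacts only to the main process exiting; an unhealthy status does not restart the container by itself. The restart has to come from something that acts on the failure, such as the auto-restart described above or an orchestrator like Docker Swarm.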
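For a health endpoint to catch a stuck main loop, it has to reflect loop progress rather than mere process liveness. The sketch below shows one way to build such an endpoint with the standard library. It is an assumption-laden illustration, not the SDK's implementation: `LoopHealth`, `serve_health`, and the 180-second `max_silence` default are hypothetical, and only the port (9191) and path (`/health`) come from the Compose file above.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoopHealth:
    """Records when the main loop last made progress (hypothetical helper)."""

    def __init__(self, max_silence=180.0):
        self.max_silence = max_silence
        self._last_tick = time.monotonic()
        self._lock = threading.Lock()

    def tick(self):
        """Call once per completed loop iteration."""
        with self._lock:
            self._last_tick = time.monotonic()

    def healthy(self):
        with self._lock:
            return time.monotonic() - self._last_tick < self.max_silence


def serve_health(health, port=9191):
    """Serve GET /health on a daemon thread: 200 if the loop is ticking, else 503."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            ok = health.healthy()
            body = b"ok" if ok else b"stalled"
            self.send_response(200 if ok else 503)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep health probes out of the agent's logs

    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In the agent, this would mean calling `serve_health(health)` once before the loop and `health.tick()` after each `process_result(result)`. Because `urllib.request.urlopen` raises on a 503 response, the healthcheck command in the Compose file exits non-zero when the loop stalls. If the SDK's own endpoint is already bound to 9191, a sketch like this would need a different port.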
This gives you two layers:
Once heartbeats are flowing, you get more than crash detection:
tokens to clevagent.heartbeat() in your loop to track spend per agent per day
The core pattern is simple: your agent proves it's alive every 60 seconds. Everything else (alerts, restarts, dashboards) is built on top of that signal.