ClevAgent
2026-04-09 · monitoring · systemd · ubuntu · production

How to Monitor and Auto-Restart AI Agents with systemd on Ubuntu

A practical guide to running AI agents under systemd with heartbeat monitoring, alerts, cost tracking, and safe auto-restart on Ubuntu.

At 2:13 AM, systemctl status research-agent can still say active (running) while your queue has been flat for 47 minutes.

That is the trap with long-running AI agents on a single Ubuntu VM. systemd will restart a dead process, but a lot of the expensive failures happen while the process is still alive.

The Python worker is running. The service never exited. But the agent is doing no useful work. It might be stuck waiting on an API that never returns. It might be looping on the same tool call. It might be burning tokens without making progress.

That is the gap between process supervision and runtime monitoring.

This guide covers a production pattern that works well for long-running agents on Ubuntu:

  • Use systemd to keep the process alive.
  • Use ClevAgent heartbeats to confirm the agent is actually making progress.
  • Track iteration count, tool calls, and token cost so a silent loop does not become a billing surprise.
  • Add Runner if you want ClevAgent-managed restart actions for a systemd service.

If you want the broader overview first, read How to Monitor AI Agents in Production. If your agents run in containers instead of VM services, How to Monitor AI Agents in Docker Compose is the closer match.

    Why systemd alone misses real agent failures

    systemd is excellent at detecting that a process exited and starting it again.

    It does not know whether your agent is healthy between process start and process exit.

    For AI agents, the costly failures usually live in that gap:

  • Hung work loop: the agent blocks on an LLM call or external API and never reaches the next task.
  • Runaway reasoning loop: the process stays alive but keeps repeating the same step.
  • Cost spike without crash: the agent keeps calling the model, so CPU and RAM look normal while your bill climbs.
  • Partial downstream outage: the agent runs, but one dependency is timing out and throughput drops to zero.

    From systemd's perspective, none of those are failures. From an operator's perspective, they are exactly the failures that matter.
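To make the gap concrete, here is a minimal Python sketch. The `Health` and `Snapshot` names and the 90-second silence threshold are illustrative assumptions, not part of systemd or ClevAgent. It only shows that process state alone cannot separate a stalled agent from a healthy one, while process state combined with time since the last completed unit of work can.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    DEAD = "dead"        # process exited: a restart policy can handle this
    STALLED = "stalled"  # process alive, but no recent completed work
    HEALTHY = "healthy"  # process alive and recently completed work


@dataclass(frozen=True)
class Snapshot:
    process_alive: bool
    seconds_since_progress: float


def classify(snapshot: Snapshot, max_silence_s: float = 90.0) -> Health:
    """Combine process state with progress freshness.

    The default threshold is an assumption chosen for illustration.
    """
    if not snapshot.process_alive:
        return Health.DEAD
    if snapshot.seconds_since_progress > max_silence_s:
        return Health.STALLED
    return Health.HEALTHY
```

With the opening scenario, `classify(Snapshot(process_alive=True, seconds_since_progress=47 * 60))` yields `Health.STALLED`, a state that a check based only on `process_alive` would report as fine.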

    The production pattern

    Use two layers.

  • systemd handles boot, restart policy, and logs.
  • ClevAgent handles heartbeat freshness, cost visibility, loop signals, and alerts through Telegram, Slack, Discord, or email.

    That gives you a clean split:

  • If the process dies, systemd restarts it.
  • If the process stays alive but stops making progress, ClevAgent alerts you.
  • If the agent starts spinning, ClevAgent shows the abnormal iteration and cost pattern before you find out from billing.

    Step 1: create a systemd unit for the agent

    A simple service unit is enough for most agents. Keep the runtime explicit and set a restart policy.

    /etc/systemd/system/research-agent.service

    [Unit]
    Description=Research Agent
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=simple
    User=ubuntu
    WorkingDirectory=/opt/research-agent
    EnvironmentFile=/etc/research-agent.env
    ExecStart=/opt/research-agent/.venv/bin/python main.py
    Restart=on-failure
    RestartSec=10
    TimeoutStopSec=30

    [Install]
    WantedBy=multi-user.target

    Then enable it:

    sudo systemctl daemon-reload
    sudo systemctl enable --now research-agent.service
    sudo systemctl status research-agent.service
    

    This covers the blunt failure case: your process crashes and needs to come back.
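If you want to check the unit from a script rather than by eye, one option is to read a few properties from `systemctl show`. The sketch below assumes `systemctl` is on the PATH and a systemd version that exposes the `NRestarts` property; the helper names `parse_show_output` and `read_unit_state` are hypothetical.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines that have no "="
            props[key.strip()] = value.strip()
    return props


def read_unit_state(unit: str) -> dict[str, str]:
    """Ask systemd for a few properties of one unit.

    Requires systemctl; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState,NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_show_output(result.stdout)
```

Calling `read_unit_state("research-agent.service")` on the VM would return a dictionary with those three keys. A rising `NRestarts` suggests the crash-recovery layer is firing. An `ActiveState` of `active` still says nothing about whether the agent is making progress.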

    Step 2: instrument the work loop, not just process startup

    The most useful heartbeat is the one sent from the part of the code that proves the agent is still moving.

    That usually means after one iteration completes, after one tool call finishes, or after one scheduled job succeeds. Do not treat process start as proof of health.

    Here is a minimal Python example:

    import os
    import signal
    import time

    import clevagent

    running = True

    def handle_shutdown(signum, frame):
        del signum, frame
        global running
        running = False


    signal.signal(signal.SIGTERM, handle_shutdown)
    signal.signal(signal.SIGINT, handle_shutdown)


    def run_once() -> dict[str, object]:
        """Replace this stub with one unit of real agent work."""
        time.sleep(2)
        return {
            "tokens": 834,
            "cost_usd": 0.0124,
            "model": "gpt-4.1-mini",
            "tool_name": "search_docs",
        }


    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="research-agent-prod",
        interval=30,
        agent_type="openai",
        cost_limit_usd=25.0,
        on_loop="alert_only",
    )

    iteration = 0

    try:
        while running:
            iteration += 1
            result = run_once()

            clevagent.log_iteration(iteration)
            clevagent.log_tool_call(result["tool_name"], "query=release notes")
            clevagent.log_cost(
                tokens=int(result["tokens"]),
                cost_usd=float(result["cost_usd"]),
                model=str(result["model"]),
            )
            clevagent.ping(status="ok", message="iteration complete")

            time.sleep(15)
    finally:
        clevagent.shutdown()

    The key detail is placement. If you only initialize at startup and never emit progress from the work loop, you will miss the failures that make AI agents painful to operate.

    Step 3: add Runner when you want monitored restart control

    For many teams, systemd restart policy plus SDK heartbeats is enough.

    If you want restart actions wired through ClevAgent itself, add Runner on the same machine and watch the service directly. Runner currently supports Docker, systemd, and launchd restart targets.

    export CLEVAGENT_API_KEY=cv_your_project_key
    clevagent-runner start \
      --watch "systemd:research-agent.service"
    

    This is also the cleanest option when you cannot modify agent code and need heartbeat proxying from outside the process.

    A few operating rules that hold up in practice

    Separate crash recovery from progress detection

    Do not force one tool to do both jobs. systemd is your local supervisor. ClevAgent is your runtime watchdog. That separation makes incidents easier to reason about.

    Heartbeat on completed work, not on a timer alone

    A timer thread tells you the process exists. A loop-complete ping tells you the agent is still useful. For most operators, the second signal is the one that matters.
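A small sketch of that rule, using only the standard library. The `run_loop` function and its `work` and `ping` parameters are hypothetical stand-ins: `ping` plays the role of a heartbeat call such as the `clevagent.ping(...)` line in the earlier example, and a raised exception stands in for an iteration that never completes.

```python
from typing import Callable


def run_loop(
    work: Callable[[int], None],
    ping: Callable[[int], None],
    max_iterations: int,
) -> int:
    """Run `work` repeatedly and call `ping` only after an iteration completes.

    Returns the number of heartbeats emitted.
    """
    pings = 0
    for iteration in range(1, max_iterations + 1):
        try:
            work(iteration)
        except Exception:
            # A failed iteration emits no heartbeat, so the gap stays visible
            # to whatever is watching heartbeat freshness.
            continue
        ping(iteration)
        pings += 1
    return pings
```

If `work` starts failing on the third iteration, only two heartbeats are sent even though the loop and the process keep running. A heartbeat sent from a separate timer thread would have kept reporting as if nothing had changed.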

    Log cost and iteration count from the same code path

    A stuck agent rarely looks expensive in CPU graphs. It looks expensive in repeated iterations and model spend. That is why cost tracking belongs next to the work loop.
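As an illustration of why those two signals belong together, here is a rough local sketch. It is not how ClevAgent detects loops; the `LoopAndCostTracker` class, its thresholds, and the signal names are assumptions made up for this example.

```python
from collections import deque


class LoopAndCostTracker:
    """Flag repeated identical tool calls and cumulative spend over a limit."""

    def __init__(self, repeat_threshold: int = 5, cost_limit_usd: float = 25.0) -> None:
        self.repeat_threshold = repeat_threshold
        self.cost_limit_usd = cost_limit_usd
        self.total_cost_usd = 0.0
        # Keep only the most recent calls; older ones fall off automatically.
        self._recent = deque(maxlen=repeat_threshold)

    def record(self, tool_name: str, arguments: str, cost_usd: float) -> list[str]:
        """Record one iteration and return any warning signals it triggers."""
        self._recent.append((tool_name, arguments))
        self.total_cost_usd += cost_usd

        signals: list[str] = []
        window_full = len(self._recent) == self.repeat_threshold
        if window_full and len(set(self._recent)) == 1:
            # Every call in the window is identical: likely a loop.
            signals.append("repeated_tool_call")
        if self.total_cost_usd >= self.cost_limit_usd:
            signals.append("cost_limit_reached")
        return signals
```

Because both checks run in `record`, a single call per iteration from the work loop updates the loop signal and the spend total together, which is the same placement the rule above recommends for the real cost and iteration logging.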

    What this looks like during a real incident

    Imagine the agent starts timing out on an external retrieval API at 2:13 AM.

  • The process does not exit, so systemd does nothing.
  • Iterations stop completing, so heartbeats go stale.
  • ClevAgent sends an alert through your chosen channel.
  • If you also wired Runner, restart actions can target the watched systemd service directly.

    That is the difference between reading an incident in the morning and catching it while the blast radius is still small.

    Bottom line

    If your AI agent runs on Ubuntu, systemd should absolutely be part of the stack. Just do not mistake process supervision for runtime monitoring.

    Use systemd to keep the service alive. Use ClevAgent to confirm the agent is still doing real work, catch loop and cost anomalies, and notify you when a healthy-looking process has actually stopped being useful.

    Start with SDK heartbeats inside the work loop. Add Runner when you want monitored restart control for the systemd service itself.

    Related reading

  • How to Monitor AI Agents in Production
  • How to Monitor AI Agents in Docker Compose
  • Your AI Agent Health Check Is Probably Lying

    If systemd is already keeping the process alive, the next missing signal is usually whether useful work is still happening. ClevAgent adds heartbeat freshness, loop and cost visibility, alerts, and optional Runner-based restart control for Docker, systemd, and launchd. It is free for 3 agents.
