At 2:13 AM, systemctl status research-agent can still say active (running) while your queue has been flat for 47 minutes.

That is the trap with long-running AI agents on a single Ubuntu VM. systemd will restart a dead process, but a lot of the expensive failures happen while the process is still alive.

The Python worker is running. The service never exited. But the agent is doing no useful work. It might be stuck waiting on an API that never returns. It might be looping on the same tool call. It might be burning tokens without making progress.

That is the gap between process supervision and runtime monitoring.

This guide covers a production pattern that works well for long-running agents on Ubuntu:

•Use systemd to keep the process alive.

•Use ClevAgent heartbeats to confirm the agent is actually making progress.

•Track iteration count, tool calls, and token cost so a silent loop does not become a billing surprise.

•Add Runner if you want ClevAgent-managed restart actions for a systemd service.

If you want the broader overview first, read "How to Monitor AI Agents in Production." If your agents run in containers instead of VM services, "How to Monitor AI Agents in Docker Compose" is the closer match.

Why systemd alone misses real agent failures

systemd is excellent at detecting that a process exited and starting it again.

It does not know whether your agent is healthy between process start and process exit.

For AI agents, the costly failures usually live in that gap:

•Hung work loop: the agent blocks on an LLM call or external API and never reaches the next task.

•Runaway reasoning loop: the process stays alive but keeps repeating the same step.

•Cost spike without crash: the agent keeps calling the model, so CPU and RAM look normal while your bill climbs.

•Partial downstream outage: the agent runs, but one dependency is timing out and throughput drops to zero.

From systemd's perspective, none of those are failures. From an operator's perspective, they are exactly the failures that matter.
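
To make the runaway-loop case concrete, here is a minimal sketch of a local check that flags when the same tool call repeats several times in a row. The class name RepeatDetector and the threshold are illustrative assumptions, not part of systemd or ClevAgent, and a real agent may need a fuzzier notion of "the same step".

```python
from collections import deque


class RepeatDetector:
    """Hypothetical helper: flags when the last `threshold` tool calls are identical."""

    def __init__(self, threshold: int = 5) -> None:
        if threshold < 2:
            raise ValueError("threshold must be at least 2")
        # Fixed-size window: older calls fall off automatically.
        self._recent = deque(maxlen=threshold)

    def record(self, tool_name: str, arguments: str) -> bool:
        """Record one call; return True once the window holds only this call."""
        self._recent.append((tool_name, arguments))
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and len(set(self._recent)) == 1


detector = RepeatDetector(threshold=3)
for _ in range(3):
    looping = detector.record("search_docs", "query=release notes")
print("looping:", looping)
```

The process running this loop looks perfectly healthy to systemd the whole time, which is why a signal like this has to come from inside the agent.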

The production pattern

Use two layers.

•systemd handles boot, restart policy, and logs.

•ClevAgent handles heartbeat freshness, cost visibility, loop signals, and alerts through Telegram, Slack, Discord, or email.

That gives you a clean split:

•If the process dies, systemd restarts it.

•If the process stays alive but stops making progress, ClevAgent alerts you.

•If the agent starts spinning, ClevAgent shows the abnormal iteration and cost pattern before you find out from billing.
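
The split above can be sketched as a small decision function. This is an illustration of the reasoning only: the names Verdict and classify, and the rule "stale after two missed intervals", are assumptions made for the example, not ClevAgent's actual staleness logic.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    SUPERVISOR_RESTART = "process exited, systemd restarts it"
    STALE_ALERT = "process alive but heartbeat stale, alert the operator"


def classify(
    process_alive: bool,
    seconds_since_heartbeat: float,
    interval: float = 30.0,
    grace_factor: float = 2.0,  # assumed tolerance, tune for your agent
) -> Verdict:
    """Decide which layer is responsible for the current state."""
    if not process_alive:
        return Verdict.SUPERVISOR_RESTART
    if seconds_since_heartbeat > interval * grace_factor:
        return Verdict.STALE_ALERT
    return Verdict.HEALTHY


# The opening scenario: service is active, queue flat for 47 minutes.
print(classify(process_alive=True, seconds_since_heartbeat=47 * 60))
```

Only the middle branch needs a second layer: it is the one state where the process check says "fine" and the operator would disagree.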

Step 1: create a systemd unit for the agent

A simple service unit is enough for most agents. Keep the runtime explicit and set a restart policy.

/etc/systemd/system/research-agent.service
[Unit]
Description=Research Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/research-agent
EnvironmentFile=/etc/research-agent.env
ExecStart=/opt/research-agent/.venv/bin/python main.py
Restart=on-failure
RestartSec=10
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

Then enable it:

sudo systemctl daemon-reload
sudo systemctl enable --now research-agent.service
sudo systemctl status research-agent.service

This covers the blunt failure case: your process crashes and needs to come back.
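
If you want to read the supervisor's view from a script, a minimal sketch is to parse the KEY=VALUE output of systemctl show. The helper names below are made up for the example; the properties ActiveState, SubState, and NRestarts are standard systemd unit properties (NRestarts needs a reasonably recent systemd). The parser runs on a sample string here so the example works on any machine.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an "="
            props[key.strip()] = value.strip()
    return props


def read_unit_state(unit: str) -> dict[str, str]:
    """Query systemd for a unit's state. Requires a host running systemd."""
    completed = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState,SubState,NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_show_output(completed.stdout)


# Sample output, so the parser can be exercised without systemd.
sample = "ActiveState=active\nSubState=running\nNRestarts=3\n"
state = parse_show_output(sample)
print(state["ActiveState"], state["SubState"], state["NRestarts"])
```

On the VM itself you would call read_unit_state("research-agent.service"). A climbing NRestarts points at a crash loop, but "active" and "running" still say nothing about whether the agent is making progress.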

Step 2: instrument the work loop, not just process startup

The most useful heartbeat is the one sent from the part of the code that proves the agent is still moving.

That usually means after one iteration completes, after one tool call finishes, or after one scheduled job succeeds. Do not treat process start as proof of health.

Here is a minimal Python example:

import os
import signal
import time

import clevagent

# Flipped to False by the signal handlers so the loop can exit cleanly.
running = True


def handle_shutdown(signum, frame):
    """Ask the work loop to stop after the current iteration."""
    del signum, frame  # unused, but required by the handler signature
    global running
    running = False


# systemd sends SIGTERM on stop; SIGINT covers Ctrl+C during local runs.
signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)


def run_once() -> dict[str, object]:
    """Replace this stub with one unit of real agent work."""
    time.sleep(2)
    return {
        "tokens": 834,
        "cost_usd": 0.0124,
        "model": "gpt-4.1-mini",
        "tool_name": "search_docs",
    }


clevagent.init(
    api_key=os.environ["CLEVAGENT_API_KEY"],
    agent="research-agent-prod",
    interval=30,
    agent_type="openai",
    cost_limit_usd=25.0,
    on_loop="alert_only",
)

iteration = 0
try:
    while running:
        iteration += 1
        result = run_once()

        # Report progress only after the unit of work has actually finished.
        clevagent.log_iteration(iteration)
        clevagent.log_tool_call(result["tool_name"], "query=release notes")
        clevagent.log_cost(
            tokens=int(result["tokens"]),
            cost_usd=float(result["cost_usd"]),
            model=str(result["model"]),
        )
        clevagent.ping(status="ok", message="iteration complete")

        time.sleep(15)
finally:
    # Runs on clean shutdown and on unhandled exceptions alike.
    clevagent.shutdown()

The key detail is placement. If you only initialize at startup and never emit progress from the work loop, you will miss the failures that make AI agents painful to operate.

Step 3: add Runner when you want monitored restart control

For many teams, systemd restart policy plus SDK heartbeats is enough.

If you want restart actions wired through ClevAgent itself, add Runner on the same machine and watch the service directly. Runner currently supports Docker, systemd, and launchd restart targets.

export CLEVAGENT_API_KEY=cv_your_project_key
clevagent-runner start \
  --watch "systemd:research-agent.service"

This is also the cleanest option when you cannot modify agent code and need heartbeat proxying from outside the process.

A few operating rules that hold up in practice

Separate crash recovery from progress detection

Do not force one tool to do both jobs. systemd is your local supervisor. ClevAgent is your runtime watchdog. That separation makes incidents easier to reason about.

Heartbeat on completed work, not on a timer alone

A timer thread tells you the process exists. A loop-complete ping tells you the agent is still useful. For most operators, the second signal is the one that matters.
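
If you do keep a timer-driven ping, one way to make it honest is to gate it on recent progress. The sketch below is a hypothetical helper (ProgressGate is not a ClevAgent API): the work loop calls mark_progress() after each completed iteration, and the timer only sends its ping while should_ping() is true. The clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable


class ProgressGate:
    """Hypothetical helper: allows timer pings only while work keeps completing."""

    def __init__(
        self,
        max_idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._last_progress = clock()

    def mark_progress(self) -> None:
        """Call from the work loop after each completed iteration."""
        self._last_progress = self._clock()

    def should_ping(self) -> bool:
        """True only if the work loop completed something recently enough."""
        return (self._clock() - self._last_progress) <= self._max_idle


# Simulated clock: the worker hangs and the gate closes after 60 idle seconds.
fake_now = [0.0]
gate = ProgressGate(max_idle_seconds=60.0, clock=lambda: fake_now[0])
fake_now[0] = 45.0
print("ping at 45s idle:", gate.should_ping())
fake_now[0] = 90.0
print("ping at 90s idle:", gate.should_ping())
```

Once the gate closes, the timer stops pinging, the heartbeat goes stale, and the hang becomes visible instead of being masked by a thread that is merely alive.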

Log cost and iteration count from the same code path

A stuck agent rarely looks expensive in CPU graphs. It looks expensive in repeated iterations and model spend. That is why cost tracking belongs next to the work loop.
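
As a rough illustration of keeping spend and iteration count together, here is a small local tracker that could sit beside the reporting calls in the work loop. SpendTracker is a made-up name for this sketch, it does not replace any server-side cost limit, and the figures reuse the stub values from the example above.

```python
from dataclasses import dataclass


@dataclass
class SpendTracker:
    """Hypothetical local tally of iterations, tokens, and model spend."""

    limit_usd: float
    iterations: int = 0
    total_tokens: int = 0
    total_cost_usd: float = 0.0

    def record(self, tokens: int, cost_usd: float) -> None:
        """Call once per completed iteration, next to the heartbeat."""
        self.iterations += 1
        self.total_tokens += tokens
        self.total_cost_usd += cost_usd

    @property
    def cost_per_iteration(self) -> float:
        if self.iterations == 0:
            return 0.0
        return self.total_cost_usd / self.iterations

    @property
    def over_limit(self) -> bool:
        return self.total_cost_usd >= self.limit_usd


tracker = SpendTracker(limit_usd=25.0)
for _ in range(3):
    tracker.record(tokens=834, cost_usd=0.0124)
print(tracker.iterations, tracker.total_tokens, tracker.over_limit)
```

Because iterations and cost are updated in the same call, a loop that spins quickly shows up as both numbers climbing together, even while CPU and memory stay flat.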

What this looks like during a real incident

Imagine the agent starts timing out on an external retrieval API at 2:13 AM.

•The process does not exit, so systemd does nothing.

•Iterations stop completing, so heartbeats go stale.

•ClevAgent sends an alert through your chosen channel.

•If you also wired Runner, restart actions can target the watched systemd service directly.

That is the difference between reading an incident in the morning and catching it while the blast radius is still small.

Bottom line

If your AI agent runs on Ubuntu, systemd should absolutely be part of the stack. Just do not mistake process supervision for runtime monitoring.

Use systemd to keep the service alive. Use ClevAgent to confirm the agent is still doing real work, catch loop and cost anomalies, and notify you when a healthy-looking process has actually stopped being useful.

Start with SDK heartbeats inside the work loop. Add Runner when you want monitored restart control for the systemd service itself.

How to Monitor and Auto-Restart AI Agents with systemd on Ubuntu