A practical guide to running AI agents under systemd with heartbeat monitoring, alerts, cost tracking, and safe auto-restart on Ubuntu.
At 2:13 AM, `systemctl status research-agent` can still say `active (running)` while your queue has been flat for 47 minutes.
That is the trap with long-running AI agents on a single Ubuntu VM. systemd will restart a dead process, but a lot of the expensive failures happen while the process is still alive.
The Python worker is running. The service never exited. But the agent is doing no useful work. It might be stuck waiting on an API that never returns. It might be looping on the same tool call. It might be burning tokens without making progress.
That is the gap between process supervision and runtime monitoring.
This guide covers a production pattern that works well for long-running agents on Ubuntu:
- systemd to keep the process alive.
- ClevAgent SDK heartbeats to prove the agent is still doing useful work.
- Optional Runner-based restart control for the systemd service.

If you want the broader overview first, read How to Monitor AI Agents in Production. If your agents run in containers instead of VM services, How to Monitor AI Agents in Docker Compose is the closer match.
systemd is excellent at detecting that a process exited and starting it again.
It does not know whether your agent is healthy between process start and process exit.
For AI agents, the costly failures usually live in that gap:

- The agent hangs on an API call that never returns.
- It loops on the same tool call without making progress.
- It keeps burning tokens without producing anything useful.

From systemd's perspective, none of those are failures. From an operator's perspective, they are exactly the failures that matter.
Use two layers.
That gives you a clean split: when the process dies, systemd restarts it; when the process is alive but the agent has stalled, ClevAgent catches it.

A simple service unit is enough for most agents. Keep the runtime explicit and set a restart policy.
`/etc/systemd/system/research-agent.service`:

```ini
[Unit]
Description=Research Agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/research-agent
EnvironmentFile=/etc/research-agent.env
ExecStart=/opt/research-agent/.venv/bin/python main.py
Restart=on-failure
RestartSec=10
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target
```
Then enable it:
```bash
sudo systemctl daemon-reload
sudo systemctl enable --now research-agent.service
sudo systemctl status research-agent.service
```
This covers the blunt failure case: your process crashes and needs to come back.
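If the VM runs anything besides the agent, it is also worth tightening the unit with standard systemd sandboxing directives. The set below is a starting point, not a prescription; relax anything your agent genuinely needs, such as write access outside its working directory:

```ini
# Optional additions to the [Service] section (stock systemd directives).
NoNewPrivileges=true                 # block privilege escalation via setuid binaries
ProtectSystem=strict                 # mount /usr, /etc, and friends read-only
ProtectHome=true                     # hide other users' home directories
ReadWritePaths=/opt/research-agent   # re-allow writes where the agent needs them
PrivateTmp=true                      # give the service its own private /tmp
```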
The most useful heartbeat is the one sent from the part of the code that proves the agent is still moving.
That usually means after one iteration completes, after one tool call finishes, or after one scheduled job succeeds. Do not treat process start as proof of health.
Here is a minimal Python example:
```python
import os
import signal
import time

import clevagent

running = True

def handle_shutdown(signum, frame):
    del signum, frame
    global running
    running = False

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)

def run_once() -> dict[str, object]:
    """Replace this stub with one unit of real agent work."""
    time.sleep(2)
    return {
        "tokens": 834,
        "cost_usd": 0.0124,
        "model": "gpt-4.1-mini",
        "tool_name": "search_docs",
    }

clevagent.init(
    api_key=os.environ["CLEVAGENT_API_KEY"],
    agent="research-agent-prod",
    interval=30,
    agent_type="openai",
    cost_limit_usd=25.0,
    on_loop="alert_only",
)

iteration = 0
try:
    while running:
        iteration += 1
        result = run_once()
        clevagent.log_iteration(iteration)
        clevagent.log_tool_call(result["tool_name"], "query=release notes")
        clevagent.log_cost(
            tokens=int(result["tokens"]),
            cost_usd=float(result["cost_usd"]),
            model=str(result["model"]),
        )
        clevagent.ping(status="ok", message="iteration complete")
        time.sleep(15)
finally:
    clevagent.shutdown()
```
The key detail is placement. If you only initialize at startup and never emit progress from the work loop, you will miss the failures that make AI agents painful to operate.
For many teams, systemd restart policy plus SDK heartbeats is enough.
If you want restart actions wired through ClevAgent itself, add Runner on the same machine and watch the service directly. Runner currently supports Docker, systemd, and launchd restart targets.
```bash
export CLEVAGENT_API_KEY=cv_your_project_key

clevagent-runner start \
  --watch "systemd:research-agent.service"
```
This is also the cleanest option when you cannot modify agent code and need heartbeat proxying from outside the process.
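Runner itself is just a long-running process, so on a single VM it is natural to supervise it the same way. A minimal unit might look like the sketch below; the unit name and environment file path are assumptions, and the `ExecStart` line reuses the command shown above. Note that restarting systemd units requires `systemctl` privileges, so the service either runs as root or needs equivalent polkit rights:

```ini
# /etc/systemd/system/clevagent-runner.service (hypothetical unit name)
[Unit]
Description=ClevAgent Runner
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Runs as root by default so it can restart systemd units;
# alternatively, grant a dedicated user systemctl rights via polkit.
EnvironmentFile=/etc/clevagent-runner.env   # holds CLEVAGENT_API_KEY (assumed path)
ExecStart=/usr/local/bin/clevagent-runner start --watch "systemd:research-agent.service"
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```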
Do not force one tool to do both jobs. systemd is your local supervisor. ClevAgent is your runtime watchdog. That separation makes incidents easier to reason about.
A timer thread tells you the process exists. A loop-complete ping tells you the agent is still useful. For most operators, the second signal is the one that matters.
A stuck agent rarely looks expensive in CPU graphs. It looks expensive in repeated iterations and model spend. That is why cost tracking belongs next to the work loop.
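One cheap way to act on that locally is a budget guard beside the loop: accumulate spend per iteration and trip once a limit is crossed, rather than discovering the overrun on the invoice. This is an illustrative sketch, not ClevAgent's `cost_limit_usd` mechanism:

```python
class CostBudget:
    """Tracks cumulative model spend and trips when a budget is exceeded."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def add(self, cost_usd: float) -> bool:
        """Record one iteration's cost; return True while still under budget."""
        self.spent_usd += cost_usd
        return self.spent_usd <= self.limit_usd

budget = CostBudget(limit_usd=25.0)
for _ in range(3):
    if not budget.add(0.0124):  # cost taken from the iteration result
        break                   # stop (or alert) instead of burning on
```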
Imagine the agent starts timing out on an external retrieval API at 2:13 AM.
systemd does nothing, because the process is still alive. ClevAgent notices the heartbeat has gone stale, fires an alert, and, if Runner is installed, can restart the systemd service directly.

That is the difference between reading an incident in the morning and catching it while the blast radius is still small.
If your AI agent runs on Ubuntu, systemd should absolutely be part of the stack. Just do not mistake process supervision for runtime monitoring.
Use systemd to keep the service alive. Use ClevAgent to confirm the agent is still doing real work, catch loop and cost anomalies, and notify you when a healthy-looking process has actually stopped being useful.
Start with SDK heartbeats inside the work loop. Add Runner when you want monitored restart control for the systemd service itself.
*If systemd is already keeping the process alive, the next missing signal is usually whether useful work is still happening. ClevAgent adds heartbeat freshness, loop and cost visibility, alerts, and optional Runner-based restart control for Docker, systemd, and launchd. Free for 3 agents — start monitoring.*
3 agents free · No credit card · Setup in 30 seconds