ClevAgent
← All posts
2026-04-07 · autogen · monitoring · tutorial · ai-agents · production

How to Monitor AutoGen Agents in Production

Add heartbeat monitoring, loop detection, and cost tracking to your AutoGen multi-agent systems. Works with both AutoGen 0.2 and AgentChat 0.4.

You built a multi-agent AutoGen system. An AssistantAgent drafts the plan, a UserProxyAgent executes the code, and they pass messages back and forth until the task is done. Works fine in your terminal.

Then you productionize it. The job runs overnight, processing a queue of incoming requests. On night two, the assistant agent generates code that fails silently — wrong output format, no exception raised. The executor tries to parse the result, sends confused feedback, and the assistant tries again. And again. Twelve iterations. Forty minutes. $18 in API calls. Your queue processes exactly zero items.

AutoGen's conversation loop is powerful precisely because it persists until an agent says it's done. That same property makes it dangerous in production: the loop never stops unless something terminates it or you're watching.

Why AutoGen needs dedicated production monitoring

AutoGen's multi-agent architecture has specific failure modes that generic infrastructure monitoring won't catch:

  • Conversation loops: Two agents disagree on the solution and keep exchanging messages. AutoGen's max_turns is a safety net, not a monitor — it terminates the run but doesn't alert you.
  • Code execution hangs: UserProxyAgent executes generated code. If that code hangs (waiting on a socket, stuck in a loop), the conversation stalls indefinitely.
  • Token cost explosions: Long back-and-forth conversations pass the full conversation history as context on every turn. A 20-message conversation sends the accumulated context roughly 20 times over.
  • Silent task failure: The agents reach a conclusion, the conversation ends normally, but the output is wrong. No exception, no error — just bad results silently committed to your database.

Process-level monitoring (Docker health checks, systemd watchdog) only knows your Python process is running. It has no idea whether the agents are making progress.
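The token-cost bullet hides a quadratic: because each turn resends the full history, total prompt tokens grow with the square of the turn count, not linearly. A back-of-envelope sketch (the ~500 tokens/message figure is an illustrative assumption, not a measurement):

```python
def total_prompt_tokens(turns: int, tokens_per_message: int) -> int:
    """Estimate total prompt tokens sent over a conversation that
    resends the full history each turn. Turn k resends messages 1..k,
    so the total is tokens_per_message * (1 + 2 + ... + turns)."""
    return tokens_per_message * turns * (turns + 1) // 2

# A 5-turn run at ~500 tokens/message sends ~7,500 prompt tokens in total.
# A 20-turn run sends ~105,000 -- 14x the cost for 4x the turns.
print(total_prompt_tokens(5, 500))   # 7500
print(total_prompt_tokens(20, 500))  # 105000
```

This is why a stuck conversation's cost curve bends upward: the last few turns of a long run are far more expensive than the first few.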


    Monitor your AutoGen agents now:

    pip install clevagent
    

    import clevagent
    clevagent.init(api_key="cv_xxx", agent="my-autogen-agent")
    

    Free tier: 3 agents, no credit card. Get your API key →


    Add ClevAgent to your AutoGen system

    ClevAgent monitors AutoGen conversations at the job level — tracking heartbeats, iteration counts, and token costs so you can catch runaway loops and budget spikes before they get out of hand.

    Step 1: Install

    pip install clevagent
    

    Step 2: Initialize before the conversation starts

    import os
    import autogen
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    user_proxy.initiate_chat(assistant, message="Analyze Q1 sales data and output summary.json")

    clevagent.shutdown()

    ClevAgent starts sending heartbeats from init(). If the process dies or hangs without calling shutdown(), you get alerted within 120 seconds.
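    One caveat with the layout above: if initiate_chat() raises, shutdown() never runs, and the run looks hung rather than crashed. Wrapping the conversation in try/finally (or a small context manager) guarantees the shutdown signal fires either way. A minimal sketch, using stand-in callables in place of the clevagent calls:

```python
from contextlib import contextmanager

@contextmanager
def monitored_run(init_fn, shutdown_fn):
    """Ensure shutdown_fn runs whether the wrapped block succeeds or raises."""
    init_fn()
    try:
        yield
    finally:
        shutdown_fn()  # runs on success *and* on exception

# Demonstrate with recording stubs in place of clevagent.init/shutdown:
calls = []
try:
    with monitored_run(lambda: calls.append("init"), lambda: calls.append("shutdown")):
        calls.append("chat")
        raise RuntimeError("initiate_chat blew up")
except RuntimeError:
    pass

print(calls)  # ['init', 'chat', 'shutdown']
```

    In the real script, the with-block body would be the initiate_chat() call, so a crash surfaces as a clean shutdown followed by your alerting, not a heartbeat timeout.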

    Step 3: Track individual conversation turns

    AutoGen fires callbacks you can hook. Use the reply_func mechanism to emit a ping per turn — this is the signal that tells ClevAgent the conversation is actively progressing rather than just alive at the process level:

    import os
    import autogen
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    turn_count = 0

    def track_turn(recipient, messages, sender, config):
        global turn_count
        turn_count += 1
        last_msg = messages[-1] if messages else {}
        clevagent.ping(
            status="turn_complete",
            meta={
                "turn": turn_count,
                "sender": sender.name,
                "msg_length": len(str(last_msg.get("content", ""))),
            },
        )
        return False, None  # continue normal processing

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    assistant.register_reply(
        trigger=autogen.ConversableAgent,
        reply_func=track_turn,
        position=0,
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    user_proxy.initiate_chat(assistant, message="Analyze Q1 sales data and output summary.json")

    clevagent.shutdown()

    With per-turn pings, ClevAgent can detect if the conversation stalls mid-run — not just if the process dies.

    Step 4: Log token costs per conversation

    AutoGen 0.2 tracks token usage per conversation, but it doesn't report costs out of band. The simplest approach is to log costs at the end of each initiate_chat() call using the cost summary on the returned ChatResult:

    import os
    import autogen
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    chat_result = user_proxy.initiate_chat(
        assistant,
        message="Analyze Q1 sales data and output summary.json",
    )

    # AutoGen 0.2 returns cost info in chat_result.cost
    if hasattr(chat_result, "cost") and chat_result.cost:
        usage = chat_result.cost.get("usage_including_cached_inference", {})
        total_tokens = usage.get("total_tokens", 0)
        total_cost = usage.get("total_cost", 0.0)
        clevagent.log_cost(
            tokens=total_tokens,
            cost_usd=total_cost,
            model="gpt-4o",
        )

    clevagent.shutdown()

    ClevAgent will alert you if cost per run exceeds your configured threshold — useful for catching the "20-message death spiral" before it happens three nights in a row.
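    A server-side alert tells you the budget blew; a client-side guard can also stop the conversation locally before it does. A rough sketch of such a guard (the per-1K price and the 4-characters-per-token estimate are illustrative assumptions, not official rates):

```python
PRICE_PER_1K_TOKENS = 0.01  # illustrative blended rate, not an official price

class BudgetGuard:
    """Accumulate an estimated cost per turn; signal when a budget is hit."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record_turn(self, message_text: str) -> bool:
        """Add this turn's estimated cost; return True once over budget."""
        est_tokens = len(message_text) / 4  # crude chars -> tokens estimate
        self.spent_usd += est_tokens / 1000 * PRICE_PER_1K_TOKENS
        return self.spent_usd >= self.budget_usd

# Each 400-char message estimates to ~100 tokens, i.e. ~$0.001/turn,
# so a $0.005 budget trips on the 5th turn:
guard = BudgetGuard(budget_usd=0.005)
turns_taken = 0
for _ in range(50):
    turns_taken += 1
    if guard.record_turn("x" * 400):
        break

print(turns_taken)  # 5
```

    Wired into a reply_func like track_turn above, `record_turn` returning True is the point where you'd end the conversation (and ping a "budget_stop" status) instead of letting it run to max_consecutive_auto_reply.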

    Using the native ClevAgent AutoGen integration

    If you prefer a wrapper approach over manual callbacks, the clevagent.integrations.autogen module provides a drop-in mixin:

    import os
    import autogen
    from clevagent.integrations.autogen import MonitoredAssistantAgent
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    # Drop-in replacement for autogen.AssistantAgent
    assistant = MonitoredAssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    user_proxy.initiate_chat(assistant, message="Analyze Q1 sales data and output summary.json")

    clevagent.shutdown()

    MonitoredAssistantAgent wraps generate_reply() to emit pings and log token usage automatically. You get per-turn tracking without registering callbacks manually.

    What to alert on

    Once you have monitoring in place, here's what's worth alerting on for AutoGen jobs:

    Heartbeat timeout — Set it to 2x your expected max conversation duration. If a typical run takes 5 minutes, alert at 10 minutes. AutoGen conversations can legitimately take longer than expected, so give enough headroom.

    Iteration count — max_consecutive_auto_reply is a hard stop, but you want to know before you hit it. If your normal conversations finish in 4–6 turns and you're seeing 9–10, that's a signal the agents are stuck.

    Cost per run — Establish a baseline over a week of normal runs. Alert if any single run exceeds 3x the median. One-off spikes happen; consistent spikes mean something changed.
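    The 3x-median rule is a one-liner once you have a week of cost history. A sketch (the sample costs are made up for illustration):

```python
from statistics import median

def cost_anomaly(history_usd: list[float], run_usd: float, factor: float = 3.0) -> bool:
    """Flag a run whose cost exceeds `factor` times the median of past runs."""
    return run_usd > factor * median(history_usd)

# A week of normal nightly runs, median $0.41 -> alert threshold $1.23:
weekly = [0.42, 0.38, 0.45, 0.41, 0.39, 0.44, 0.40]
print(cost_anomaly(weekly, 0.47))  # False -- ordinary variation
print(cost_anomaly(weekly, 1.80))  # True  -- over 3x the median
```

    The median (rather than the mean) keeps one past spike from quietly raising your threshold for every future run.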

    Silent completion with no output file — This requires application-level logic, but it's worth adding: after shutdown(), check that the expected output artifact exists. A "successful" AutoGen run that produced no output is often worse than one that crashed.
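    The artifact check itself is a few lines. A minimal sketch, using the summary.json path from the example task (the min_bytes cutoff is an assumption; clevagent.ping is what you'd call in place of the return value):

```python
from pathlib import Path

def output_artifact_ok(path: str, min_bytes: int = 2) -> bool:
    """Return True if the expected output file exists and is non-trivial."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes

# e.g. after the conversation, before shutdown():
# if not output_artifact_ok("workspace/summary.json"):
#     clevagent.ping(status="failed", meta={"reason": "missing output artifact"})
```

    Checking size as well as existence catches the case where the generated code opened the file, wrote nothing, and exited "successfully".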

    AgentChat 0.4 compatibility

    Microsoft renamed and restructured AutoGen as AgentChat in the 0.4.x series. The core monitoring pattern is identical — clevagent.init() before the conversation, clevagent.shutdown() after, and clevagent.ping() in message handlers. The specific callback registration API differs:

    AgentChat 0.4 pattern

    import os

    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.teams import RoundRobinGroupChat
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="agentchat-job",
    )

    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    agent = AssistantAgent(name="assistant", model_client=model_client)
    team = RoundRobinGroupChat([agent], max_turns=10)

    import asyncio

    async def run():
        result = await team.run(task="Analyze Q1 sales data")
        clevagent.ping(status="complete", meta={"turns": len(result.messages)})
        clevagent.shutdown()

    asyncio.run(run())

    The heartbeat and alerting behavior is identical regardless of which AutoGen version you're on.

    The actual failure this catches

    The scenario at the top of this post — assistant and executor stuck in a disagreement loop for 40 minutes — looks like this on ClevAgent's dashboard:

  • Heartbeats arrive every turn for the first 4 minutes (normal cadence)
  • Then heartbeats keep arriving, but per-iteration pings slow down: 30 seconds between turns instead of 8
  • Cost log shows $0.40 logged after 5 minutes, $1.20 after 10, $4.60 after 20
  • Alert fires at minute 12: "Cost threshold exceeded ($3.00/run limit)"

    Without monitoring, you find out at 7 AM when you check the results. With monitoring, you get a Slack message at minute 12 and can kill the job before it burns through $18.


    Related reading

  • How to Monitor CrewAI Agents in Production — same pattern for CrewAI multi-agent crews
  • How to Monitor LangChain Agents in Production — covers LangChain agents and LangGraph workflows
  • Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch — the broader failure taxonomy that motivates agent-specific monitoring

  • Stop finding out about AutoGen failures at 7 AM. ClevAgent monitors your conversations in real time — heartbeats, loop detection, cost alerts, and daily reports. Free for up to 3 agents.

    Start monitoring for free →

    Get your API key in 30s →

    3 agents free · No credit card · Setup in 30 seconds