ClevAgent
← All posts
2026-04-07 · autogen · monitoring · tutorial · ai-agents · production

How to Monitor AutoGen Agents in Production

Add heartbeat monitoring, loop detection, and cost tracking to your AutoGen multi-agent systems. Works with both AutoGen 0.2 and AgentChat 0.4.

You built a multi-agent AutoGen system. An AssistantAgent drafts the plan, a UserProxyAgent executes the code, and they pass messages back and forth until the task is done. Works fine in your terminal.

Then you productionize it. The job runs overnight, processing a queue of incoming requests. On night two, the assistant agent generates code that fails silently — wrong output format, no exception raised. The executor tries to parse the result, sends confused feedback, and the assistant tries again. And again. Twelve iterations. Forty minutes. $18 in API calls. Your queue processes exactly zero items.

AutoGen's conversation loop is powerful precisely because it persists until an agent says it's done. That same property makes it dangerous in production: the loop never stops unless something terminates it or you're watching.

Why AutoGen needs dedicated production monitoring

AutoGen's multi-agent architecture has specific failure modes that generic infrastructure monitoring won't catch:

  • Conversation loops: Two agents disagree on the solution and keep exchanging messages. AutoGen's max_turns is a safety net, not a monitor — it terminates the run but doesn't alert you.
  • Code execution hangs: UserProxyAgent executes generated code. If that code hangs (waiting on a socket, stuck in a loop), the conversation stalls indefinitely.
  • Token cost explosions: Long back-and-forth conversations pass the full conversation history as context on every turn. A 20-message conversation sends the accumulated context roughly 20 times over.
  • Silent task failure: The agents reach a conclusion, the conversation ends normally, but the output is wrong. No exception, no error — just bad results silently committed to your database.

Process-level monitoring (Docker health checks, systemd watchdog) only knows your Python process is running. It has no idea whether the agents are making progress.
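The token-cost bullet hides a quadratic: because each turn resends the full history, total prompt tokens grow with the square of the turn count, not linearly. A back-of-envelope sketch (the ~500 tokens/message figure is an illustrative assumption, not a measurement):

```python
def total_prompt_tokens(turns: int, tokens_per_message: int) -> int:
    """Estimate total prompt tokens sent over a conversation that
    resends the full history each turn. Turn k resends messages 1..k,
    so the total is tokens_per_message * (1 + 2 + ... + turns)."""
    return tokens_per_message * turns * (turns + 1) // 2

# A 5-turn run at ~500 tokens/message sends ~7,500 prompt tokens in total.
# A 20-turn run sends ~105,000 -- 14x the cost for 4x the turns.
print(total_prompt_tokens(5, 500))   # 7500
print(total_prompt_tokens(20, 500))  # 105000
```

This is why a stuck conversation's cost curve bends upward: the last few turns of a long run are far more expensive than the first few.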


    Monitor your AutoGen agents now:

    pip install clevagent
    

    import clevagent
    clevagent.init(api_key="cv_xxx", agent="my-autogen-agent")
    

    Free tier: 3 agents, no credit card. Get your API key →


    Add ClevAgent to your AutoGen system

    ClevAgent monitors AutoGen conversations at the job level — tracking heartbeats, iteration counts, and token costs so you can catch runaway loops and budget spikes before they get out of hand.

    Step 1: Install

    pip install clevagent
    

    Step 2: Initialize before the conversation starts

    import os
    import autogen
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    user_proxy.initiate_chat(assistant, message="Analyze Q1 sales data and output summary.json")

    clevagent.shutdown()

    ClevAgent starts sending heartbeats from init(). If the process dies or hangs without calling shutdown(), you get alerted within 120 seconds.
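    One caveat with the layout above: if initiate_chat() raises, shutdown() never runs, and the run looks hung rather than crashed. Wrapping the conversation in try/finally (or a small context manager) guarantees the shutdown signal fires either way. A minimal sketch, using stand-in callables in place of the clevagent calls:

```python
from contextlib import contextmanager

@contextmanager
def monitored_run(init_fn, shutdown_fn):
    """Ensure shutdown_fn runs whether the wrapped block succeeds or raises."""
    init_fn()
    try:
        yield
    finally:
        shutdown_fn()  # runs on success *and* on exception

# Demonstrate with recording stubs in place of clevagent.init/shutdown:
calls = []
try:
    with monitored_run(lambda: calls.append("init"), lambda: calls.append("shutdown")):
        calls.append("chat")
        raise RuntimeError("initiate_chat blew up")
except RuntimeError:
    pass

print(calls)  # ['init', 'chat', 'shutdown']
```

    In the real script, the with-block body would be the initiate_chat() call, so a crash surfaces as a clean shutdown followed by your alerting, not a heartbeat timeout.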

    Step 3: Track individual conversation turns

    AutoGen fires callbacks you can hook. Use the reply_func mechanism to emit a ping per turn — this is the signal that tells ClevAgent the conversation is actively progressing rather than just alive at the process level:

    import os
    import autogen
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    turn_count = 0

    def track_turn(recipient, messages, sender, config):
        global turn_count
        turn_count += 1
        last_msg = messages[-1] if messages else {}
        clevagent.ping(
            status="turn_complete",
            meta={
                "turn": turn_count,
                "sender": sender.name,
                "msg_length": len(str(last_msg.get("content", ""))),
            },
        )
        return False, None  # continue normal processing

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    assistant.register_reply(
        trigger=autogen.ConversableAgent,
        reply_func=track_turn,
        position=0,
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    user_proxy.initiate_chat(assistant, message="Analyze Q1 sales data and output summary.json")

    clevagent.shutdown()

    With per-turn pings, ClevAgent can detect if the conversation stalls mid-run — not just if the process dies.

    Step 4: Log token costs per conversation

    AutoGen 0.2 tracks token usage per conversation, but it doesn't report costs out of band. The simplest approach is to log costs at the end of each initiate_chat() call using the cost summary on the returned ChatResult:

    import os
    import autogen
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    assistant = autogen.AssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    chat_result = user_proxy.initiate_chat(
        assistant,
        message="Analyze Q1 sales data and output summary.json",
    )

    # AutoGen 0.2 returns cost info in chat_result.cost
    if hasattr(chat_result, "cost") and chat_result.cost:
        usage = chat_result.cost.get("usage_including_cached_inference", {})
        total_tokens = usage.get("total_tokens", 0)
        total_cost = usage.get("total_cost", 0.0)
        clevagent.log_cost(
            tokens=total_tokens,
            cost_usd=total_cost,
            model="gpt-4o",
        )

    clevagent.shutdown()

    ClevAgent will alert you if cost per run exceeds your configured threshold — useful for catching the "20-message death spiral" before it happens three nights in a row.
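    A server-side alert tells you the budget blew; a client-side guard can also stop the conversation locally before it does. A rough sketch of such a guard (the per-1K price and the 4-characters-per-token estimate are illustrative assumptions, not official rates):

```python
PRICE_PER_1K_TOKENS = 0.01  # illustrative blended rate, not an official price

class BudgetGuard:
    """Accumulate an estimated cost per turn; signal when a budget is hit."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record_turn(self, message_text: str) -> bool:
        """Add this turn's estimated cost; return True once over budget."""
        est_tokens = len(message_text) / 4  # crude chars -> tokens estimate
        self.spent_usd += est_tokens / 1000 * PRICE_PER_1K_TOKENS
        return self.spent_usd >= self.budget_usd

# Each 400-char message estimates to ~100 tokens, i.e. ~$0.001/turn,
# so a $0.005 budget trips on the 5th turn:
guard = BudgetGuard(budget_usd=0.005)
turns_taken = 0
for _ in range(50):
    turns_taken += 1
    if guard.record_turn("x" * 400):
        break

print(turns_taken)  # 5
```

    Wired into a reply_func like track_turn above, `record_turn` returning True is the point where you'd end the conversation (and ping a "budget_stop" status) instead of letting it run to max_consecutive_auto_reply.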

    Using the native ClevAgent AutoGen integration

    If you prefer a wrapper approach over manual callbacks, the clevagent.integrations.autogen module provides a drop-in mixin:

    import os
    import autogen
    from clevagent.integrations.autogen import MonitoredAssistantAgent
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="nightly-analysis-job",
    )

    # Drop-in replacement for autogen.AssistantAgent
    assistant = MonitoredAssistantAgent(
        name="assistant",
        llm_config={"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]},
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=10,
        code_execution_config={"work_dir": "workspace"},
    )

    user_proxy.initiate_chat(assistant, message="Analyze Q1 sales data and output summary.json")

    clevagent.shutdown()

    MonitoredAssistantAgent wraps generate_reply() to emit pings and log token usage automatically. You get per-turn tracking without registering callbacks manually.

    What to alert on

    Once you have monitoring in place, here's what's worth alerting on for AutoGen jobs:

    Heartbeat timeout — Set it to 2x your expected max conversation duration. If a typical run takes 5 minutes, alert at 10 minutes. AutoGen conversations can legitimately take longer than expected, so give enough headroom.

    Iteration count — max_consecutive_auto_reply is a hard stop, but you want to know before you hit it. If your normal conversations finish in 4–6 turns and you're seeing 9–10, that's a signal the agents are stuck.

    Cost per run — Establish a baseline over a week of normal runs. Alert if any single run exceeds 3x the median. One-off spikes happen; consistent spikes mean something changed.
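    The 3x-median rule is a one-liner once you have a week of cost history. A sketch (the sample costs are made up for illustration):

```python
from statistics import median

def cost_anomaly(history_usd: list[float], run_usd: float, factor: float = 3.0) -> bool:
    """Flag a run whose cost exceeds `factor` times the median of past runs."""
    return run_usd > factor * median(history_usd)

# A week of normal nightly runs, median $0.41 -> alert threshold $1.23:
weekly = [0.42, 0.38, 0.45, 0.41, 0.39, 0.44, 0.40]
print(cost_anomaly(weekly, 0.47))  # False -- ordinary variation
print(cost_anomaly(weekly, 1.80))  # True  -- over 3x the median
```

    The median (rather than the mean) keeps one past spike from quietly raising your threshold for every future run.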

    Silent completion with no output file — This requires application-level logic, but it's worth adding: after shutdown(), check that the expected output artifact exists. A "successful" AutoGen run that produced no output is often worse than one that crashed.
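    The artifact check itself is a few lines. A minimal sketch, using the summary.json path from the example task (the min_bytes cutoff is an assumption; clevagent.ping is what you'd call in place of the return value):

```python
from pathlib import Path

def output_artifact_ok(path: str, min_bytes: int = 2) -> bool:
    """Return True if the expected output file exists and is non-trivial."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes

# e.g. after the conversation, before shutdown():
# if not output_artifact_ok("workspace/summary.json"):
#     clevagent.ping(status="failed", meta={"reason": "missing output artifact"})
```

    Checking size as well as existence catches the case where the generated code opened the file, wrote nothing, and exited "successfully".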

    AgentChat 0.4 compatibility

    Microsoft renamed and restructured AutoGen as AgentChat in the 0.4.x series. The core monitoring pattern is identical — clevagent.init() before the conversation, clevagent.shutdown() after, and clevagent.ping() in message handlers. The specific callback registration API differs:

    AgentChat 0.4 pattern

    import os

    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.teams import RoundRobinGroupChat
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    import clevagent

    clevagent.init(
        api_key=os.environ["CLEVAGENT_API_KEY"],
        agent="agentchat-job",
    )

    model_client = OpenAIChatCompletionClient(model="gpt-4o")
    agent = AssistantAgent(name="assistant", model_client=model_client)
    team = RoundRobinGroupChat([agent], max_turns=10)

    import asyncio

    async def run():
        result = await team.run(task="Analyze Q1 sales data")
        clevagent.ping(status="complete", meta={"turns": len(result.messages)})
        clevagent.shutdown()

    asyncio.run(run())

    The heartbeat and alerting behavior is identical regardless of which AutoGen version you're on.

    The actual failure this catches

    The scenario at the top of this post — assistant and executor stuck in a disagreement loop for 40 minutes — looks like this on ClevAgent's dashboard:

  • Heartbeats arrive every turn for the first 4 minutes (normal cadence)
  • Then heartbeats keep arriving, but per-iteration pings slow down: 30 seconds between turns instead of 8
  • Cost log shows $0.40 logged after 5 minutes, $1.20 after 10, $4.60 after 20
  • Alert fires at minute 12: "Cost threshold exceeded ($3.00/run limit)"

    Without monitoring, you find out at 7 AM when you check the results. With monitoring, you get a Slack message at minute 12 and can kill the job before it burns through $18.


    Related reading

  • How to Monitor CrewAI Agents in Production — same pattern for CrewAI multi-agent crews
  • How to Monitor LangChain Agents in Production — covers LangChain agents and LangGraph workflows
  • Three AI Agent Failure Modes That Traditional Monitoring Will Never Catch — the broader failure taxonomy that motivates agent-specific monitoring

  • Stop finding out about AutoGen failures at 7 AM. ClevAgent monitors your conversations in real time — heartbeats, loop detection, cost alerts, and daily reports. Free for up to 3 agents.

    Start monitoring for free →

    Get your API key in 30s →

    3 agents free · No credit card · Setup in 30 seconds