You've got an influx of agents on the way into your business. Ready?
Gartner predicts that 40% of enterprise applications will include AI agents this year, up from less than 5% just last year. But there's a growing gap between ambition and production reality. A few months earlier, Gartner also predicted that over 40% of agentic AI projects will be canceled by 2027, not because the technology doesn't work, but because of escalating costs, unclear business value, and inadequate controls.
The question for anyone building production agent systems isn't just "Can we make it work?" — it's "Can we see what it's doing, understand why, and trust the outcome?"
The Black Box Problem at Scale
Multi-agent systems are fundamentally different from traditional software. A single user request might trigger a chain of agent conversations, dozens of LLM calls across different models, tool executions, code generation, and autonomous decision-making. The execution path isn't predetermined; it emerges at runtime. Traditional application monitoring was built for request-response cycles with predictable flows. Agent workflows are dynamic, branching, and often surprising. The numbers bear this out: Dynatrace found that 63% of organizations report their agents need more human supervision than anticipated; IBM reports that 45% of executives cite lack of visibility as a barrier to adoption; and KPMG found that 75% of leaders prioritize security, compliance, and auditability as the most critical requirements for deployment.
This is why AG2 is open source — and why our approach to telemetry is too. When the systems themselves are non-deterministic, transparency in both the framework and its instrumentation isn't a nice-to-have, it's essential. You need to be able to see what your agents are doing at a high level — which agents are active, how conversations flow, where time and tokens are spent — and you need to be able to dive deep into any individual trace to understand exactly what happened and why.
Why Observability is Non-Negotiable
Deloitte recently published an article about AI agent observability covering five key dimensions: Cost, Speed, Productivity, Quality, and Trust. What's notable is that trust wasn't treated as a separate concern; it was a measurable outcome of the other four. If you can see what your agents are doing, how fast they're doing it, what it costs, and whether the results are good, trust follows.
The cost dimension alone is enough to justify serious investment in observability. Token costs continue to drop significantly, but that reduction has also enabled more ambitious agent architectures — longer conversations, more tool calls, larger context windows — which means the total bill can still be substantial. Some enterprises report monthly AI costs in the tens of millions of dollars. Without granular visibility into token usage per agent, per conversation, and per model, cost optimization is a guessing game.
And then there's the governance question. Only 21% of companies planning to deploy agentic AI within two years report having a mature model for agent governance, according to Deloitte's State of AI 2026 report. Forrester predicts that three out of four firms attempting to build agentic architectures on their own will fail. The organizations that succeed will be the ones that treat observability not as an afterthought but as infrastructure — as fundamental as the agents themselves.
OpenTelemetry, GenAI Conventions, and AG2's Approach
At AG2, we want to open the black box, and we recently implemented OpenTelemetry integration. OpenTelemetry is an open, vendor-neutral standard for observability, already firmly established across traditional software engineering. What makes it particularly important for agentic AI is the Generative AI Semantic Conventions being developed by the OpenTelemetry GenAI Special Interest Group. These conventions define a common language for AI-specific telemetry: how to represent LLM calls, tool executions, and agent invocations. Without a shared standard, every framework invents its own tracing format, locking you into a single vendor. The GenAI conventions define standard span types and attributes — like invoke_agent — that any compliant tool can understand. The conventions are still evolving, as they should be given how fast the field moves, but the direction is clear.
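To make the shared vocabulary concrete, here is a rough sketch of the kind of attributes an invoke_agent span might carry. The attribute keys follow the published GenAI semantic conventions; the values and the agent name are invented for the example.

```python
# Illustrative only: the keys follow the OpenTelemetry GenAI semantic
# conventions; the values are made up for this example.
invoke_agent_span_attributes = {
    "gen_ai.operation.name": "invoke_agent",  # standard operation type
    "gen_ai.agent.name": "researcher",        # which agent acted (hypothetical)
    "gen_ai.request.model": "gpt-4o",         # model requested for the turn
    "gen_ai.usage.input_tokens": 1842,        # prompt tokens consumed
    "gen_ai.usage.output_tokens": 310,        # completion tokens produced
}

# Because the keys are standardized, any compliant backend can aggregate
# on them without framework-specific parsing, e.g. total tokens per turn:
total_tokens = (
    invoke_agent_span_attributes["gen_ai.usage.input_tokens"]
    + invoke_agent_span_attributes["gen_ai.usage.output_tokens"]
)
print(total_tokens)  # 2152
```

The point is the shared keys, not the values: a dashboard built against `gen_ai.usage.*` works across any framework that emits these conventions.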
AG2's tracing is built on OpenTelemetry natively, so when you instrument an agent or a group chat pattern, you get a hierarchical trace that mirrors the actual structure of the conversation:
- Conversation spans wrap the entire interaction, from start to finish
- Agent spans show each turn — which agent acted, when, and for how long
- LLM spans capture the model used, token counts, and cost for every inference
- Tool spans record what tools were called, with what arguments, and what they returned
- Agent selection spans reveal how the system decided which agent should respond next
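Conceptually, these spans nest to mirror the structure of the conversation itself. A minimal pure-Python sketch of that hierarchy (the span names below are illustrative, not AG2's actual span names):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy stand-in for an OpenTelemetry span, just to show the nesting."""
    name: str
    children: list = field(default_factory=list)

    def child(self, name: str) -> "Span":
        s = Span(name)
        self.children.append(s)
        return s

# Shape of a simple exchange (names are illustrative).
conversation = Span("conversation")
turn = conversation.child("agent turn: researcher")     # one agent's turn
turn.child("llm call: model, tokens, cost")             # inference for the turn
turn.child("tool call: web_search(args) -> results")    # tool execution
conversation.child("agent selection: next speaker")     # orchestration decision

def depth(span: Span) -> int:
    """Maximum nesting depth of the trace tree."""
    return 1 + max((depth(c) for c in span.children), default=0)

print(depth(conversation))  # 3: conversation -> agent turn -> llm/tool
```

Real OpenTelemetry spans carry timing, attributes, and parent/child links automatically; the sketch only shows why the trace reads as a tree rather than a flat log.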
At the top level, you can see the shape of a conversation — how many agents participated, how long it took, what it cost in total. When something looks off, you drill down into a specific agent turn, see exactly what prompt was sent to the LLM, how many tokens it consumed, and what tools it called.
For distributed systems, AG2 propagates tracing across remote A2A (Agent2Agent Protocol) agent interactions, maintaining a single connected trace across service boundaries. And because it's OpenTelemetry, the traces work with whatever backend you already use: Grafana, Jaeger, Datadog, Honeycomb, or any OTLP-compatible system. For the full technical walkthrough, see our detailed tracing guide on docs.ag2.ai.
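Because the traces are standard OTLP, routing them to a backend is ordinary OpenTelemetry SDK configuration rather than anything AG2-specific. A sketch, assuming the `opentelemetry-sdk` and OTLP exporter packages are installed; the endpoint and service name are placeholders, and the framework-side setup is covered in the AG2 tracing guide:

```python
# Generic OpenTelemetry SDK setup (not an AG2-specific API): send traces to
# any OTLP-compatible collector, e.g. Grafana, Jaeger, or a Datadog agent.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "my-agent-app"})  # placeholder name
)
# Point the exporter at your collector's OTLP gRPC endpoint (placeholder URL).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```

Swapping backends means changing the endpoint (or exporter), not re-instrumenting the agents.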
What's Next
Agentic AI observability is a young discipline. The observability conventions are still being refined, best practices for agent governance are still being written, and the industry is figuring this out together. AG2 is evolving with it. What we do know is that observability is a key part of closing the gap between experimental agent systems and production-grade ones. Gartner projects that by 2029, 70% of enterprises will deploy agentic AI as part of IT infrastructure operations. The organizations that get there will be the ones that invested early in understanding what their agents are actually doing.
Find out more about AG2's OpenTelemetry tracing.
