Error Logs Won't Save You: Why LLM Applications Need Tracing

Introduction

Observability is the capability to understand the internal state of a system purely from its outputs. Rather than instrumenting every internal component or requiring deep knowledge of how a system is built, a well-observable system lets you monitor, reason about, and understand what’s happening inside just by examining what comes out. Think of it as the difference between opening up an engine to diagnose a problem versus reading the dashboard.

What is Observability for LLM Applications?

In practice, observability rests on three pillars: logging records discrete events as they happen; metrics track aggregated measurements over time; and tracing follows the flow of a request as it moves through your system. Used together, they give you a complete picture of system behavior in production.

For LLM applications, tracing is a fundamental part of any observability stack, and in this article we take a look at why it is so important.

Tracing

A trace is a set of related operations linked together by a unique identifier, the trace ID, that records the flow of a request through your system while preserving the causal relationships between each step. This gives you complete visibility into what happened during a request: not just whether it succeeded or failed, but exactly what occurred, in what order, and how long each part took.

For LLM applications, this means capturing the full context of every request and response. A single chatbot turn is a good mental model: the user sends a message, your app retrieves relevant context, calls the LLM, and returns a response. All of those steps belong to one trace. Each individual operation within that trace is called a span. The retrieval step is a span, the LLM call is a span, and so on. Spans nest inside each other to reflect the parent-child relationships between operations, producing a tree that shows both the structure and the sequence of everything that happened.

Each span carries a standard set of attributes:

  • Start time: when the operation began
  • End time: when it completed
  • Duration: how long it took
  • Status: whether it succeeded, errored, or produced some other outcome
  • Inputs and outputs: what went in and what came out

Traces are a key tool for debugging, performance optimization, and quality monitoring in LLM applications. They work alongside metrics and logs rather than replacing them.
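To make the trace and span vocabulary concrete, here is a minimal sketch of the data model using only the Python standard library. The Span and Tracer classes are hypothetical names invented for illustration, not the API of OpenTelemetry or any tracing library; a real SDK handles this bookkeeping for you.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Span:
    """One operation inside a trace (hypothetical, simplified)."""
    name: str
    trace_id: str               # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]    # None for the root span
    inputs: Any = None
    outputs: Any = None
    status: str = "unset"
    start_time: float = 0.0
    end_time: float = 0.0

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


class Tracer:
    """Collects the spans of a single trace and tracks nesting with a stack."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list = []
        self._stack: list = []

    @contextmanager
    def span(self, name: str, inputs: Any = None) -> Iterator[Span]:
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id,
                       inputs=inputs, start_time=time.perf_counter())
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
            current.status = "ok"
        except Exception:
            current.status = "error"
            raise
        finally:
            current.end_time = time.perf_counter()
            self._stack.pop()


# One chatbot turn: retrieval and the LLM call are children of the turn.
tracer = Tracer()
with tracer.span("chat_turn", inputs="What is tracing?") as turn:
    with tracer.span("retrieve_context", inputs="What is tracing?") as retrieval:
        retrieval.outputs = ["doc-1", "doc-2"]
    with tracer.span("call_llm", inputs="prompt + retrieved context") as llm:
        llm.outputs = "Tracing follows a request through your system."
    turn.outputs = llm.outputs

by_name = {s.name: s for s in tracer.spans}
```

The parent_id links are what turn a flat list of spans into the tree a trace viewer renders.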

Illustrative example trace

- task (8m 25s)
    - task.action_plan (3m 50s)
        - task.milestone (3m 50s)
            - PrepWorkflow.run (3m 50s)
                - PrepWorkflow.analyze_step (3m 50s)
                    - ReActAgent.run (30.4s)
                        - BaseWorkflowAgent.init_run (2ms)
                        - BaseWorkflowAgent.setup_agent (0ms)
                        - BaseWorkflowAgent.run_agent_step (3.8s)
                            - OpenAI.astream_chat (8,814 tokens / 4ms)
                            - ReActOutputParser.parse (8ms)
                        - BaseWorkflowAgent.parse_agent_output (0ms)
                        - BaseWorkflowAgent.call_tool (12ms)
                            - FunctionTool.acall (12ms)

The trace above is a short excerpt from a much larger real trace of a LlamaIndex agent workflow, captured using Phoenix. Sibling spans and branches have been omitted for clarity, but it illustrates the kind of visibility tracing provides in practice.

At the top level, the entire task workflow took 8 minutes and 25 seconds to complete. Below it, the nesting forms a single chain: task.action_plan triggered task.milestone, which triggered PrepWorkflow.run, which triggered PrepWorkflow.analyze_step. Each level is a child of the one above it, and task.action_plan, task.milestone, and PrepWorkflow.run each spend essentially all of their time in their one child (visible in the 3m 50s duration they all share). This tells you that the bulk of the work happened deep in the chain, not at the top level, something you’d never infer from the root span alone.

Deeper still, PrepWorkflow.analyze_step spun up a ReActAgent, which is where the workflow’s actual reasoning happened. The agent itself took 30.4 seconds, but notice how much faster the visible child spans are: initialization and setup are essentially instantaneous, and the tool call completed in 12ms. Among the spans shown, run_agent_step at 3.8 seconds is where most of the visible time lies (the remainder of the 30.4 seconds is in spans omitted from this excerpt).
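The arithmetic behind that remainder can be sketched as follows. The durations are copied from the excerpt (and are therefore rounded as displayed), and the calculation assumes the children ran one after another rather than in parallel, which the excerpt alone does not confirm.

```python
# Durations in milliseconds, copied from the excerpt above.
PARENT_MS = 30_400  # ReActAgent.run
VISIBLE_CHILDREN_MS = {
    "BaseWorkflowAgent.init_run": 2,
    "BaseWorkflowAgent.setup_agent": 0,
    "BaseWorkflowAgent.run_agent_step": 3_800,
    "BaseWorkflowAgent.parse_agent_output": 0,
    "BaseWorkflowAgent.call_tool": 12,
}

visible_ms = sum(VISIBLE_CHILDREN_MS.values())
unaccounted_ms = PARENT_MS - visible_ms  # time spent in spans omitted from the excerpt
slowest_visible = max(VISIBLE_CHILDREN_MS, key=VISIBLE_CHILDREN_MS.get)

print(f"visible children: {visible_ms} ms")
print(f"unaccounted:      {unaccounted_ms} ms")
print(f"slowest visible:  {slowest_visible}")
```

Under those assumptions, roughly 26.6 of the 30.4 seconds sit in spans that are not shown here.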

At the deepest level, run_agent_step contains two revealing child spans: OpenAI.astream_chat and ReActOutputParser.parse. The astream_chat span records 8,814 tokens consumed by that specific LLM call, with a stream initiation time of 4ms (not the full generation time). Because astream_chat returns a stream object almost immediately, the 4ms only captures how long it took to open the stream; the actual 3.8 seconds of generation latency is consumed in the parent run_agent_step as the stream is read. This is a common gotcha: without this context, the LLM span can look deceptively fast.
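The effect is easy to reproduce without any LLM at all. In the sketch below, open_stream is a hypothetical stand-in for a streaming client call: it returns a generator immediately and only does its (simulated) work as the caller reads from it.

```python
import time
from typing import Iterator


def open_stream(chunks: int = 5, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call: returns immediately."""
    def _generate() -> Iterator[str]:
        for i in range(chunks):
            time.sleep(delay_s)  # simulated per-chunk generation latency
            yield f"tok{i} "
    return _generate()


step_start = time.perf_counter()          # parent span: the whole agent step

call_start = time.perf_counter()          # child span: only the call that opens the stream
stream = open_stream()
call_ms = (time.perf_counter() - call_start) * 1000

text = "".join(stream)                    # the latency is actually paid here
step_ms = (time.perf_counter() - step_start) * 1000

print(f"stream-open span: {call_ms:.2f} ms, enclosing step: {step_ms:.2f} ms")
```

If you need the LLM span itself to reflect generation time, the span has to stay open until the stream is exhausted; whether an instrumentation library does that depends on the library.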

This short excerpt shows where cost and performance accountability become concrete. You can see exactly which operation consumed tokens, how many, and where the real latency lived, pinpointed to a single span nested seven levels below the root.

Tracing Standards

Before OpenTelemetry, observability was fragmented. Every vendor had their own format for traces, metrics, and logs, which meant that instrumenting your application for one backend locked you into that vendor’s ecosystem. Switching tools meant re-instrumenting everything from scratch.

This problem is especially acute for LLM applications, where observability and evaluation tooling come from a wide variety of vendors and frameworks. Without a shared standard for the shape of telemetry data, every tool speaks a slightly different language and your instrumentation code becomes tightly coupled to whichever platform you happen to be using today.

OpenTelemetry

OpenTelemetry (OTel) solves this by providing a single, vendor-neutral standard for generating, collecting, and exporting telemetry data. It is an open-source framework that covers all three pillars of observability (traces, metrics, and logs) across any language, infrastructure, or runtime environment. The key design principle is that instrumentation is decoupled from the backend: you instrument your application once using the OTel API and SDK, and where that data gets sent is configured separately. Storage and visualization are intentionally left to other tools.
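The decoupling idea can be illustrated with a toy sketch. Everything here is hypothetical and deliberately simplified: the Exporter protocol and both exporter classes are invented for illustration and are not OTel classes, and real OTel wires exporters up through SDK configuration rather than passing them into application functions.

```python
from typing import Protocol


class Exporter(Protocol):
    """Hypothetical interface: the only thing application code depends on."""
    def export(self, span: dict) -> None: ...


class InMemoryExporter:
    """Keeps spans in a list, e.g. for tests."""
    def __init__(self) -> None:
        self.spans: list = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class ConsoleExporter:
    """Prints spans, e.g. for local development."""
    def export(self, span: dict) -> None:
        print(span)


def answer_question(question: str, exporter: Exporter) -> str:
    """Instrumented once; unaware of where the telemetry ends up."""
    answer = question.upper()  # placeholder for the real work
    exporter.export({"name": "answer_question", "input": question, "output": answer})
    return answer


# Swapping the destination requires no change to answer_question.
memory = InMemoryExporter()
answer_question("what is otel?", memory)
answer_question("what is otel?", ConsoleExporter())
```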

The main components you’ll encounter are:

  • APIs and SDKs: the APIs define the interfaces for creating telemetry data; the SDKs implement those interfaces and handle the practical work of gathering, processing, and exporting it.
  • OTLP (OpenTelemetry Protocol): the specification that defines how telemetry data is encoded and transported between your application, any intermediaries, and the backend.
  • The Collector: a standalone service that receives telemetry from your application, processes it, and forwards it to one or more backends via exporters.
  • Semantic conventions: standardized attribute names and values that give telemetry data consistent meaning across languages and services. For example, when a service emits an attribute called gen_ai.request.model, any backend that understands the GenAI conventions knows it names the model the request was sent to and can display or query it accordingly.

This last point (semantic conventions) is what makes cross-service correlation possible. A trace from a Python service, metrics from a Go service, and logs from a Java service can all be understood and queried together in the same backend because they follow the same naming rules.
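As a small illustration of why shared names matter, the sketch below groups token usage by model across spans emitted by three different services. The attribute keys follow the OTel GenAI semantic conventions as published at the time of writing (they are still evolving, so check the current spec); the service names, model names, and numbers are made up.

```python
# Spans as plain dicts; in practice these would come from your backend's query API.
spans = [
    {"service": "py-chat-api", "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "model-a",
        "gen_ai.usage.input_tokens": 1200,
        "gen_ai.usage.output_tokens": 300,
    }},
    {"service": "go-summarizer", "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "model-a",
        "gen_ai.usage.input_tokens": 800,
        "gen_ai.usage.output_tokens": 100,
    }},
    {"service": "java-batch", "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "model-b",
        "gen_ai.usage.input_tokens": 500,
        "gen_ai.usage.output_tokens": 50,
    }},
]


def tokens_by_model(spans: list) -> dict:
    """One query works for every service because the attribute names agree."""
    totals: dict = {}
    for span in spans:
        attrs = span["attributes"]
        used = attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
        model = attrs["gen_ai.request.model"]
        totals[model] = totals.get(model, 0) + used
    return totals


totals = tokens_by_model(spans)
```

Had each service invented its own key for the model name, this function would need one branch per service.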

Purpose-built LLM tooling

A number of open-source projects have also emerged that build on top of OTel specifically for LLM workloads, adding instrumentation for model calls, vector database queries, and agent frameworks without introducing a separate, incompatible format.

At the framework level, LlamaIndex, for example, provides native OTel support via its llama-index-observability-otel package, making it straightforward to get structured traces out of LlamaIndex-based pipelines without manual instrumentation.

This is increasingly the norm across the ecosystem: major LLM observability tools such as Langfuse, LangSmith, and Phoenix ship with OTel support, which means your instrumentation is largely portable across backends.

Why Tracing Matters

LLM applications introduce a new class of problems that traditional monitoring tools simply weren’t built to handle, and that’s where the stakes get higher.

The limits of traditional monitoring

Error logs tell you what broke. They don’t tell you when your model started hallucinating, when it deviated from its intended behavior, or why a response that looks syntactically correct is subtly wrong. Hallucinations, inconsistent output quality, and unexpected behavior under different inputs aren’t errors in the traditional sense (they won’t trigger an exception or a 500 status code). Without structured tracing, debugging these issues means piecing together scattered logs, which is doable but slow and error-prone compared to having the full causal chain laid out in a single trace.

LLM applications are also fundamentally more complex than conventional software at runtime. A single user request might fan out into multiple LLM calls, retrieval steps, and tool executions. Traditional request/response monitoring doesn’t capture that causal chain, so when something goes wrong mid-workflow, you have no reliable way to pinpoint where.

What tracing gives you

Debugging and root cause analysis. Tracing provides causality for multi-step applications. When a workflow produces a bad output or fails partway through, the trace shows you the exact sequence of operations, their inputs and outputs, and where things went sideways, rather than leaving you to reconstruct events from scattered logs.

Performance optimization. Latency in LLM applications is highly variable and input-dependent. Tracing lets you measure response times at each step so you can identify which part of your pipeline is the bottleneck, whether that’s a slow retrieval step, a large prompt, or a particular model configuration.

Cost and usage tracking. Token usage and API call frequency translate directly into operational costs. Tracing gives you the visibility to break these down by model, by user, by feature, or by workflow so you can optimize spend rather than flying blind until the invoice arrives.
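A breakdown like this is a simple aggregation once token counts are on the spans. The sketch below assumes each LLM span has already been flattened into a dict with model, feature, and token fields; the field names, model names, and prices are all hypothetical, so substitute your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens.
PRICE_PER_1M = {
    "model-a": {"input": 1.0, "output": 2.0},
    "model-b": {"input": 10.0, "output": 20.0},
}


def span_cost(span: dict) -> float:
    """Cost of one LLM call, derived from its recorded token usage."""
    price = PRICE_PER_1M[span["model"]]
    return (span["input_tokens"] * price["input"]
            + span["output_tokens"] * price["output"]) / 1_000_000


def cost_by(spans: list, key: str) -> dict:
    """Sum cost grouped by any span field, e.g. 'feature', 'model', or 'user'."""
    totals: dict = defaultdict(float)
    for span in spans:
        totals[span[key]] += span_cost(span)
    return dict(totals)


llm_spans = [
    {"feature": "search", "model": "model-a", "input_tokens": 1000, "output_tokens": 200},
    {"feature": "search", "model": "model-a", "input_tokens": 2000, "output_tokens": 400},
    {"feature": "summarize", "model": "model-b", "input_tokens": 500, "output_tokens": 100},
]

by_feature = cost_by(llm_spans, "feature")
by_model = cost_by(llm_spans, "model")
```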

Rate limit management. External LLM APIs impose rate limits, and hitting them can quietly degrade or break your application. With tracing in place, you can monitor call frequency, spot patterns that are pushing against limits, and build smarter throttling before it becomes an incident.
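One way to spot pressure on a rate limit is to compute the peak number of LLM calls in any sliding window from span start times. This is a minimal sketch: the timestamps (seconds) and the 60-second window are illustrative assumptions, and your provider's limits may be defined differently, for example per token rather than per request.

```python
from collections import deque


def peak_calls_per_window(start_times: list, window_s: float = 60.0) -> int:
    """Largest number of calls whose start times fall within any window_s span."""
    window: deque = deque()
    peak = 0
    for t in sorted(start_times):
        window.append(t)
        # Drop calls that started a full window or more before the current one.
        while t - window[0] >= window_s:
            window.popleft()
        peak = max(peak, len(window))
    return peak


# Start times (in seconds) of LLM call spans, as they might be read from trace data.
llm_call_starts = [0.0, 10.0, 20.0, 59.0, 61.0, 130.0]
peak = peak_calls_per_window(llm_call_starts)
```

Comparing that peak against the provider's documented limit shows how much headroom remains before throttling is needed.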

Output quality and evaluation. Because traces record the exact prompt and completion for every LLM call, they become the raw material for evaluation. You can run automated checks against trace data to measure accuracy, coherence, or policy compliance, and you can tie user feedback directly to the specific trace that produced the output it’s rating.
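Here is a minimal sketch of both halves: an automated check run over stored completions, and user feedback joined back to its trace by ID. The check itself (non-empty and within a length budget) is a deliberately trivial, hypothetical stand-in for a real evaluator, and the data is invented.

```python
def passes_basic_checks(completion: str, max_chars: int = 500) -> bool:
    """Hypothetical automated check: non-empty and within a length budget."""
    return 0 < len(completion.strip()) <= max_chars


# Trace data keyed by trace ID, as it might be exported from a tracing backend.
traces = {
    "trace-1": {"prompt": "Capital of France?", "completion": "Paris."},
    "trace-2": {"prompt": "Capital of Peru?", "completion": "   "},
}

# Feedback events carry the ID of the trace that produced the rated output.
user_feedback = [
    {"trace_id": "trace-1", "rating": "up"},
    {"trace_id": "trace-2", "rating": "down"},
]


def evaluate(traces: dict, feedback: list) -> dict:
    """Combine automated results and user ratings per trace."""
    report = {
        trace_id: {"passed": passes_basic_checks(t["completion"]), "ratings": []}
        for trace_id, t in traces.items()
    }
    for item in feedback:
        if item["trace_id"] in report:
            report[item["trace_id"]]["ratings"].append(item["rating"])
    return report


report = evaluate(traces, user_feedback)
```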

Continuous improvement with domain experts. The harder challenge beyond automated evaluation is getting subject matter experts involved in the feedback loop. Traces make this possible: domain experts can review specific runs, annotate outputs with quality ratings and contextual notes, and that feedback can be turned into evaluation datasets that engineering teams can act on systematically.

AI safety and compliance. GenAI telemetry, combined with appropriate evaluators, provides a mechanism for monitoring ethical use: detecting bias, flagging policy violations, maintaining privacy compliance, and ensuring data security. Without trace-level visibility, these concerns are difficult to audit in any meaningful way. Note that because traces capture full prompts and completions, they can themselves contain sensitive data; see the Production Considerations section for how to handle PII redaction and trace volume at scale.

Version control and rollback. Tracking model and prompt versions within trace data lets you monitor the performance impact of changes over time. If a new model version or prompt change degrades output quality, you have the data to detect it quickly and roll back with confidence.

What to Instrument

Instrumentation is the process of adding code to your application so it records its own behavior. When instrumentation is in place, tracing tools (via OpenTelemetry) automatically capture these events and structure them into traces and spans. The output is only as useful as what you choose to record, so it’s worth being deliberate about what goes in and what stays out.

The core signals

For most LLM applications, there are a handful of operations that should always be instrumented:

LLM calls are the most obvious. Every call to a language model should produce a span that captures the model name and version, the full prompt sent, the response received, token usage (both input and output), latency, and cost. These attributes are the foundation for debugging, performance monitoring, and cost tracking alike.

Tool and function calls matter just as much in agentic workflows. When your application calls an external tool, such as a web search or a database lookup, that call should be its own span with its own inputs and outputs. This is what makes multi-step agent behavior inspectable rather than opaque.
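A decorator is one common way to give every tool call its own span. The sketch below uses only the standard library and records spans into a module-level list; the traced decorator, the RECORDED_SPANS list, and the lookup_order tool are hypothetical names for illustration, not part of any tracing SDK.

```python
import functools
import time

RECORDED_SPANS: list = []  # stand-in for a real span exporter


def traced(name: str):
    """Wrap a tool function so each call records inputs, outputs, status, and duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"name": name, "inputs": {"args": args, "kwargs": kwargs}, "status": "ok"}
            start = time.perf_counter()
            try:
                span["outputs"] = fn(*args, **kwargs)
                return span["outputs"]
            except Exception as exc:
                span["status"] = "error"
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed calls are recorded too.
                span["duration_s"] = time.perf_counter() - start
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


@traced("lookup_order")
def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: a database lookup."""
    return {"order_id": order_id, "status": "shipped"}


result = lookup_order("A-123")
```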

Retrieval steps in RAG pipelines deserve explicit instrumentation. Record the query sent to the retrieval system and the documents that came back. This lets you diagnose retrieval quality independently from generation quality (two very different failure modes that are easy to conflate without the right visibility).

Prompt versions should be linked to traces. Knowing which prompt version was active for a given trace lets you track how output quality changes across prompt iterations and answer questions like: is the new version actually better? Are we regressing on edge cases?
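Once a version label is on every trace, comparing versions is a group-by. The sketch below assumes each trace record carries a prompt_version field and a numeric evaluation score; both field names and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_prompt_version(traces: list) -> dict:
    """Average evaluation score for each prompt version seen in the traces."""
    scores = defaultdict(list)
    for trace in traces:
        scores[trace["prompt_version"]].append(trace["score"])
    return {version: mean(values) for version, values in scores.items()}


traces = [
    {"trace_id": "t1", "prompt_version": "v1", "score": 0.5},
    {"trace_id": "t2", "prompt_version": "v1", "score": 0.75},
    {"trace_id": "t3", "prompt_version": "v2", "score": 0.75},
    {"trace_id": "t4", "prompt_version": "v2", "score": 1.0},
]

averages = mean_score_by_prompt_version(traces)
```

With real data you would also want enough traces per version, and a look at the per-example scores, before concluding that one version is better.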

Naming and metadata

For application-level spans you instrument yourself, clear naming makes the trace tree scannable at a glance. Name each span after what it does, such as retrieve_context, call_llm and format_response, rather than after implementation details like function names or class names. Infrastructure and framework spans are a different story: those are typically auto-instrumented using their own conventions (OTel semantic conventions, library defaults) and domain naming doesn’t apply there.

Each operation should carry an explicit input and output wherever possible. For a chatbot, that’s the user message and the assistant response. For a RAG pipeline, it’s the user query and the generated answer. These become the raw material for evaluation and debugging.

Metadata is a flexible key-value store you can attach to any span. Use it to capture context that doesn’t fit standard attributes: which user or session the trace belongs to, which feature or workflow triggered it, evaluation scores, or annotation flags. This flexibility is what makes metadata useful for filtering traces, grouping by feature, and surfacing runs for domain expert review.
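Filtering on metadata can be sketched in a few lines. The span shape and the metadata keys (user_id, feature, needs_review) are illustrative assumptions; real backends expose this kind of filter through their own query interfaces.

```python
def filter_spans(spans: list, **wanted) -> list:
    """Return spans whose metadata matches every requested key/value pair."""
    return [
        span for span in spans
        if all(span.get("metadata", {}).get(key) == value for key, value in wanted.items())
    ]


spans = [
    {"name": "call_llm", "metadata": {"user_id": "u1", "feature": "search", "needs_review": True}},
    {"name": "call_llm", "metadata": {"user_id": "u2", "feature": "search", "needs_review": False}},
    {"name": "call_llm", "metadata": {"user_id": "u1", "feature": "summarize", "needs_review": True}},
]

search_spans = filter_spans(spans, feature="search")
review_queue = filter_spans(spans, user_id="u1", needs_review=True)
```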

What not to instrument

More instrumentation is not always better. Framework internals and generic HTTP spans often add noise to the trace tree without providing meaningful insight into LLM behavior. That said, context matters: in an application where database query or retrieval performance directly impacts revenue or quality, those spans are worth keeping. For LLM applications where database queries aren’t a meaningful part of the workflow, they’re often not worth the noise.

The rule of thumb is to keep spans that map to application-level operations and filter out the ones that don’t, like low-level http.get calls or ORM internals that are many layers below the behavior you’re trying to understand. A clean trace tree where every span represents something meaningful is far more useful than an exhaustive one where the signal is buried.
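One way to apply that rule after the fact is to prune an exported trace tree: drop spans whose names match a noise list and lift their children up to the nearest kept ancestor so the causal structure survives. This is a sketch over plain nested dicts; the noise prefixes and span names are hypothetical, and dropping spans inside a live tracing pipeline works differently and may not reparent children for you.

```python
NOISE_PREFIXES = ("http.", "orm.")  # hypothetical naming scheme for low-level spans


def prune(span: dict) -> list:
    """Return the span with noisy descendants removed.

    Children of a removed span are lifted to its parent, so the result is a
    list: one element if the span is kept, or its kept children if it is not.
    """
    kept_children: list = []
    for child in span.get("children", []):
        kept_children.extend(prune(child))
    if span["name"].startswith(NOISE_PREFIXES):
        return kept_children
    return [{**span, "children": kept_children}]


trace = {
    "name": "answer_question",
    "children": [
        {"name": "http.get", "children": [
            {"name": "retrieve_context", "children": []},
        ]},
        {"name": "call_llm", "children": [
            {"name": "http.post", "children": []},
        ]},
    ],
}

(clean_root,) = prune(trace)
```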

The questions good instrumentation should answer

The right instrumentation isn’t just about collecting data; it’s about being able to answer the questions that drive continuous improvement:

  • Is this new prompt version performing better than the previous one?
  • Which examples should we add or remove from our few-shot prompts?
  • Are users satisfied with the responses the model is producing?
  • How do we catch regressions before they reach production?

If your current instrumentation can’t answer these questions, that’s a useful signal about where the gaps are.

Where the Space is Heading

LLM observability is still a young field, and the tooling and standards around it are evolving quickly. A few trends are worth paying attention to as you build out your observability practice.

Standards maturing

OpenTelemetry’s GenAI semantic conventions are still relatively new, and the current spec leaves gaps, particularly around edge cases and the growing variety of AI agent frameworks. The direction of travel is clear though: the community is actively working toward more robust conventions that cover a wider range of patterns, and toward a unified semantic convention for AI agents that ensures interoperability across frameworks while still allowing room for vendor-specific extensions. As these conventions stabilize, the cost of instrumentation goes down and the portability of your telemetry data goes up.

Multi-agent tracing complexity

Single-agent tracing is a largely solved problem. The harder challenge ahead is tracing across systems where multiple agents collaborate, hand off work to each other, or run in parallel, often across different services, frameworks, and even organizations. Preserving causal relationships across these boundaries, without losing the context of what triggered what, is an open problem that the industry is only beginning to tackle seriously. Expect the tooling here to improve significantly over the next few years.
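The basic building block for crossing a service boundary already exists: the W3C Trace Context traceparent header, which carries the trace ID and the caller's span ID so the receiver can attach its spans to the same trace. The sketch below hand-rolls the header format (version 00) for illustration; the inject and extract helpers are hypothetical, and in practice an OTel propagator does this for you. It does not address the harder multi-agent questions raised above.

```python
import re
import secrets
from typing import Optional, Tuple

# version "00" - 32 hex trace-id - 16 hex parent span-id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Build the outgoing header for a call from one agent to another."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def extract(headers: dict) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from incoming headers, or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


# Agent A starts a trace and hands work to agent B.
trace_id = secrets.token_hex(16)      # 32 hex characters
agent_a_span = secrets.token_hex(8)   # 16 hex characters
headers = inject(trace_id, agent_a_span)

# Agent B continues the same trace, with A's span as the parent of its own span.
context = extract(headers)
agent_b_span = {
    "trace_id": context[0],
    "parent_id": context[1],
    "span_id": secrets.token_hex(8),
}
```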

Evaluation as a first-class citizen

Right now, evaluation and tracing tend to live in separate workflows: you collect traces over here, you run evals over there, and connecting the two requires manual effort. The trend is toward tighter integration, where evaluation becomes a native part of the trace itself: scores, annotations, and quality signals attached directly to the spans that produced the outputs being evaluated. This closes the feedback loop between production behavior and improvement, making continuous quality monitoring less of a bespoke engineering project and more of a built-in capability.

End-to-end visibility

As LLM applications grow more complex, the boundary between model observability and application observability is blurring. The goal the ecosystem is moving toward is genuine end-to-end visibility: a single, coherent view that runs from the user request all the way through to the model’s behavior, with latency, cost, quality, and safety signals in one place and correlated across every layer of the stack. We’re not fully there yet, but the combination of maturing OTel standards and purpose-built LLM tooling is closing the gap quickly.

Production Considerations: Privacy and Volume

Two things tend to surprise teams once LLM tracing moves from a prototype into production: what’s actually being stored, and how much of it there is.

Sensitive data in traces. By design, traces capture full prompts and completions, which means anything a user types or anything the model generates gets persisted, including PII, credentials, or other sensitive content that may pass through your application. This is easy to overlook early on, since a trace that’s useful for debugging during development can become a liability once it’s running against real user data. Before traces hit long-term storage, it’s worth building in a scrubbing or redaction step that strips known sensitive fields, masks patterns like emails or API keys, or excludes specific attributes entirely based on their content.
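A redaction step of that kind might look like the following sketch. The patterns are intentionally simple and will miss real-world cases; the sk- key format and the user.password attribute name are hypothetical examples, and production systems usually combine pattern matching with allow-lists or a dedicated PII detection service.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # hypothetical key format
DROPPED_ATTRIBUTES = {"user.password"}               # hypothetical attribute name


def redact_text(text: str) -> str:
    """Mask email addresses and API-key-like strings."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return API_KEY_RE.sub("[API_KEY]", text)


def scrub_span(attributes: dict) -> dict:
    """Return a copy of span attributes that is safe to persist."""
    clean = {}
    for key, value in attributes.items():
        if key in DROPPED_ATTRIBUTES:
            continue  # exclude the attribute entirely
        clean[key] = redact_text(value) if isinstance(value, str) else value
    return clean


scrubbed = scrub_span({
    "prompt": "Email me at jane@example.com, key sk-abcdefghijklmnop1234",
    "user.password": "hunter2",
    "tokens": 42,
})
```

Running the scrubber before export, rather than in the storage layer, keeps unredacted content from ever leaving the application process.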

Trace volume and cost. At any meaningful scale, capturing every trace for every request becomes expensive, both in storage costs and in the noise it adds when you’re trying to debug a specific issue. This is where sampling strategies help. You might trace 100% of requests at first, or in a low-traffic environment, and drop to a percentage-based sample as traffic grows, with the option to force full tracing for specific conditions like errors, slow requests, or a subset of users for ongoing quality review. The goal is to keep enough signal to debug effectively and track quality over time, without paying to store the full firehose of production traffic indefinitely.
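A sampling policy along those lines can be sketched as a single decision function. It assumes the decision is made after the trace has finished (so errors and duration are known, often called tail-based sampling) and that trace IDs are hex strings; the function name, parameters, and thresholds are illustrative.

```python
def keep_trace(trace_id: str, *, sample_rate: float, had_error: bool = False,
               duration_s: float = 0.0, slow_threshold_s: float = 10.0) -> bool:
    """Decide whether a finished trace should be stored."""
    # Always keep the traces most likely to matter for debugging.
    if had_error or duration_s >= slow_threshold_s:
        return True
    # Deterministic percentage sample: map the first 32 bits of the trace ID
    # onto [0, 1), so the same trace always gets the same decision.
    bucket = int(trace_id[:8], 16) / 0x1_0000_0000
    return bucket < sample_rate


low_id = "00000000" + "a" * 24   # bucket 0.0
high_id = "ffffffff" + "a" * 24  # bucket just under 1.0

decisions = {
    "low_sampled": keep_trace(low_id, sample_rate=0.1),
    "high_sampled": keep_trace(high_id, sample_rate=0.1),
    "high_error": keep_trace(high_id, sample_rate=0.1, had_error=True),
    "high_slow": keep_trace(high_id, sample_rate=0.1, duration_s=12.0),
}
```

Deriving the decision from the trace ID rather than a random draw means every service that sees the same trace reaches the same verdict, so traces are kept or dropped whole.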

Both of these are easy to defer until they cause problems, such as discovering that your tracing bill has scaled faster than your user base. No observability implementation plan is complete without specifying how you will protect sensitive data and manage the volume of data that tracing produces.

Conclusion

LLM applications introduce a class of problems that traditional monitoring was never designed to handle: non-deterministic outputs, multi-step agent workflows, prompt-sensitive behavior, and quality signals that don’t map to error rates or response codes. Tracing is what bridges that gap.

By capturing the full context of every request (the prompts, the responses, the retrieval steps, the tool calls, and the relationships between them) tracing gives you the visibility to debug with confidence, optimize with data, and improve with intention rather than guesswork.

The standards and tooling around observability for LLM applications are maturing fast, but the core principle is already clear: if you’re building LLM-powered applications and you can’t see inside them, you’re flying blind. Tracing is how you turn the lights on.

Are you ready to deploy your AI-powered system to production? Need help hardening existing applications? Let’s talk!