OPIK : AI Evaluation and Observability
This lecture, led by Abby Morgan, an AI Research Engineer, introduces AI evaluation as a systematic feedback loop for transitioning prototypes to production-ready systems. It outlines the four key components of a useful evaluation: a target capability, a test set, a scoring method, and decision rules. The session differentiates between general benchmarks and specific product evaluations, emphasizing the need for observability in agent evaluation. It demonstrates using OPIK, an open-source tool, to track, debug, and evaluate LLM agents through features like traces, spans, 'LM as a judge', and regression testing datasets.
Takeaways
- Introduction to Abby Morgan, an AI Research Engineer and Developer Advocate at Comet.
- Housekeeping rules for the hackathon: use the public Discord channel for specific questions to ensure fairness and help others.
- Recap of the previous session on AI evaluations: they turn prototypes into production-ready systems.
- Evaluation is the feedback loop that enables systematic improvement, turning guesswork into a scientific process.
- Evaluations help in making decisions (ship or roll back), debugging failures, and building trustworthy systems.
- An evaluation is defined as a structured, repeatable measurement of system behavior against specific criteria.
- Four key ingredients of a useful evaluation: a target capability, a test set reflecting the relevant world, a scoring method, and decision rules.
- Evaluation outputs are not just numbers; they can include concrete examples, categorical slices, and error taxonomies.
- Distinction between benchmarks and product evaluations: Benchmarks are for broad comparisons, while product evals are specific to your use case, tools, and workflows.
- The importance of observability in evaluating agents: the full trace (context, prompts, tool calls) is crucial as agents don't fail like traditional software.
Highlights
"Evaluation is essentially just the feedback loop that makes improvement thematic."-- Annie Morgan"Evals are what turn random iteration or guesswork into more of a scientific process that allows you to improve in a systematic way."-- Annie Morgan
Chapters & Topics
The Role and Definition of AI Evaluation
An evaluation is a structured, repeatable measurement of system behavior against criteria we care about. This structured approach is key to distinguishing a true evaluation from a mere product demo, as it must be consistently runnable over time to interpret changes.
- Keypoints
- Evaluation turns a prototype into a production-ready system.
- It provides a feedback loop for systematic improvement.
- It helps make decisions like shipping or rolling back features.
- An evaluation must be structured and repeatable.
- It's the difference between a product demo and a scientific process.
- It helps build systems that people can trust in the real world.
- Explanation The process of evaluation is what transforms a cool but unreliable prototype into something that can be confidently shipped and iterated upon in a real-world production environment. It provides the necessary feedback loop to make improvements systematic rather than random. It helps decide whether to ship new features, roll back changes, debug failures, and ultimately, build systems that users can trust. Without a structured evaluation process, it's difficult to know if a system is actually improving or if observed successes are just cherry-picked examples or irrelevant outliers.
Four Ingredients of a Useful Evaluation
Most useful evaluations are composed of four essential ingredients: a target capability, a test set, a scoring method, and decision rules.
- Keypoints
- A target capability (e.g., fluency, relevance, toxicity).
- A test set that reflects the specific world and edge cases you care about.
- A scoring method, which can be human annotation or an automated metric.
- Decision rules that dictate actions based on the evaluation scores (e.g., ship, roll back).
- Explanation To create a useful evaluation, you need four components. First, a 'target capability' which could be things like fluency, relevance, or toxicity. Second, a 'test set' that accurately reflects the real-world scenarios you care about, including edge cases. This is where product evals differ from general benchmarks. Third, a 'scoring method', which can be either human annotation or an automated metric. Finally, 'decision rules' which define what actions to take based on the evaluation scores, such as deploying a new feature or rolling it back.
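The four ingredients above can be sketched as a minimal evaluation harness. This is an illustrative sketch, not an OPIK API: the scoring heuristic, test set, and threshold are all placeholder assumptions.

```python
# Sketch of the four evaluation ingredients (all names and values hypothetical).
# Target capability: relevance of an answer to its question.

def score_relevance(question: str, answer: str) -> float:
    """Scoring method (a toy heuristic): fraction of question words echoed in the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words) if q_words else 0.0

# Test set: examples reflecting the world we care about, including edge cases.
test_set = [
    {"question": "how do I boil an egg", "answer": "Boil the egg in water for 7 minutes."},
    {"question": "what is tomato soup", "answer": "A soup made from tomatoes."},
]

scores = [score_relevance(ex["question"], ex["answer"]) for ex in test_set]
mean_score = sum(scores) / len(scores)

# Decision rule: ship only if the average score clears a threshold.
SHIP_THRESHOLD = 0.25
decision = "ship" if mean_score >= SHIP_THRESHOLD else "roll back"
print(decision)
```

In a real product eval, the toy heuristic would be replaced by human annotation or an automated metric, but the shape stays the same: capability, test set, scorer, decision rule.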
Benchmarks vs. Product Evals
Benchmarks are standardized evaluations for broad comparisons across a wide range of use cases, whereas product evaluations are tailored to a specific use case, including its unique tools and workflows. Benchmarks are helpful for general comparisons, but are not a substitute for product-specific evaluations.
- Keypoints
- Benchmarks are standardized and cover a wide range of use cases.
- Product evals are specific to your use case, tools, and workflows.
- Benchmarks are for broader comparisons.
- Product evals are for seriously evaluating your system for its specific purpose.
- Product evals should use a test set that reflects the particular world you care about, including edge cases.
- Explanation While the term 'benchmark' is often used in AI evals, it's important to distinguish it from a 'product eval'. Benchmarks are generalized and standardized, designed to test a wide range of examples and use cases. They are useful for broader comparisons. On the other hand, a product evaluation is specific to your product's use case. It should use a test set that reflects the world you care about, including specific edge cases, and incorporate your particular tools and workflows. For serious evaluation of your own system's performance for its intended purpose, a product-specific eval is necessary.
Observability as the Foundation for Agent Evaluation
For AI agents, observability is the foundation of evaluation. Since agents don't fail like traditional software, it's crucial to look beyond the final output and observe the full trace of their operation. This includes the context retrieved, prompts used, tools called, and all intermediate steps. This observation is the first step in a cycle of improvement: observe, understand, evaluate, and then improve.
- Keypoints
- The final output of an agent is never the full story.
- Observing the full trace is critical for evaluation.
- The trace includes retrieved context, prompts, tool calls, and intermediate steps.
- Agents do not fail like traditional software, so observation is key.
- The improvement cycle is: observe -> understand -> evaluate -> improve.
- OPIK is designed to make observability and evaluation practical and simple.
- Explanation Evaluating agents requires starting with observability because their final output alone doesn't tell the whole story. Unlike traditional software, an agent's failure or success is determined by a complex series of steps. Therefore, you must be able to see the full trace of its actions. This includes what context was retrieved, what prompts were generated and used, which tools were called, and all the intermediate results. Once you can observe this entire process, you can begin to understand what is happening. With understanding, you can then evaluate the agent's performance against your criteria. Finally, with these evaluations and measurements of success, you can systematically begin to improve the agent. OPIK is a tool specifically designed to facilitate this process by making observability and evaluation practical and easy.
Integrating OPIK into Your Code
OPIK is an observability tool that can be easily integrated into your code to track, monitor, and evaluate the performance of LLM agents. It requires minimal code, often just one to three lines, to start logging agent interactions. The integration method might vary slightly depending on whether you're using a direct integration like with OpenAI Agents or a more general method like the track decorator.
- Keypoints
- Import OPIK at the top of your file.
- Integration can be as simple as one to three lines of code.
- Direct integrations, like with OpenAI Agents, might have a specific syntax.
- The '@track' decorator is another common method for integration.
- Explanation To integrate OPIK, you first import it at the top of your script. Then, depending on the framework, you might use a specific call like the one shown for OpenAI Agents, or use a track decorator. For the demonstrated recipe generator agent, which uses OpenAI Agents, the integration was slightly different but still very simple. This minimal setup allows OPIK to automatically capture detailed information about each agent run.
- Examples
A basic agent acting as a recipe generator. The user provides a list of ingredients, and the agent performs two LLM calls. The first LLM call suggests a recipe based on the ingredients (e.g., creamy orange tomato soup from tomatoes, cream, and oranges). The second LLM call researches the steps to create that specific recipe.
- The user runs the script and is prompted for ingredients.
- The user enters 'tomatoes, cream, oranges'.
- The first LLM processes these ingredients and suggests 'creamy orange tomato soup'.
- This recipe name is then passed to the second LLM.
- The second LLM researches and outputs the detailed steps for making the soup.
- All of this activity is logged as a single trace in the OPIK dashboard.
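The recipe generator described above can be sketched as a plain two-step pipeline. The LLM calls are stubbed out here for illustration; in the real agent each function wraps an actual model call, and OPIK's tracking (e.g., the '@track' decorator) would log the whole run as one trace with each call as a span.

```python
# Sketch of the two-step recipe agent (LLM calls are stubbed for illustration).

def suggest_recipe(ingredients: list[str]) -> str:
    """First LLM call (stubbed): suggest a recipe name from the ingredients."""
    return "creamy orange tomato soup"  # stand-in for the model's suggestion

def research_steps(recipe_name: str) -> str:
    """Second LLM call (stubbed): research how to make the suggested recipe."""
    return f"Steps for {recipe_name}: chop, simmer, blend, season."

def run_agent(ingredients: list[str]) -> str:
    """One end-to-end run: this is what OPIK would record as a single trace."""
    recipe = suggest_recipe(ingredients)  # span 1
    return research_steps(recipe)         # span 2

print(run_agent(["tomatoes", "cream", "oranges"]))
```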
Understanding Traces and Spans
OPIK organizes logged data into traces and spans. A trace represents a single, complete end-to-end process or execution of your agent. A span is a smaller, individual step or operation within that trace. This hierarchical structure allows for both a high-level overview and a granular look at each component of the agent's execution, which is crucial for debugging and evaluation.
- Keypoints
- A trace is the entire end-to-end agentic call.
- A span is an individual step within the trace.
- The dashboard allows you to view the overall trace and drill down into individual spans.
- This structure helps isolate where failures or issues occur in a complex agent.
- Explanation When an agent runs, the entire operation is captured as one trace. Within this trace, you can see individual spans corresponding to specific actions, like LLM calls or tool usage. The OPIK dashboard clearly displays the input and output for the entire agent call (the trace) and for each individual step (the spans). This helps identify exactly where in the process a failure or unexpected behavior occurs. For example, if an agent fails, you can inspect the spans leading up to the failure to understand the root cause.
- Examples
In the recipe generator agent, the entire process from taking user ingredients to outputting a full recipe is one trace. Within that trace, the first LLM call (suggesting 'chicken parmesan') is one span, and the second LLM call (researching how to make it) is another span.
- User interacts with the agent, triggering a run. This entire run is a 'trace'.
- The agent calls the first LLM to generate a recipe idea. This call is a 'span'.
- The agent then calls the second LLM to get the recipe steps. This second call is another 'span'.
- The dashboard shows the full trace and allows clicking into it to see the individual spans, each with its own inputs, outputs, and metadata.
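One way to picture the trace/span relationship is as a simple parent/child structure. This is a conceptual sketch of the hierarchy, not OPIK's internal representation:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """An individual step (e.g., one LLM call or tool call) within a trace."""
    name: str
    input: str
    output: str

@dataclass
class Trace:
    """One complete end-to-end agent run, containing its spans in order."""
    name: str
    spans: list[Span] = field(default_factory=list)

trace = Trace(name="recipe-agent-run")
trace.spans.append(Span("recipe_suggester", "tomatoes, cream, oranges", "creamy orange tomato soup"))
trace.spans.append(Span("recipe_researcher", "creamy orange tomato soup", "Step 1: chop the tomatoes..."))

# Drilling down: the trace gives the overview, each span the granular detail.
for span in trace.spans:
    print(span.name, "->", span.output)
```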
Online Evaluations and 'LLM as a Judge'
OPIK allows for creating online evaluations that automatically score agent performance against predefined criteria every time an agent runs. A key feature is 'LLM as a judge,' where one LLM is used to evaluate the output of your agent's LLM based on a custom prompt and scoring scale you provide. This enables automated, qualitative assessment of agent outputs.
- Keypoints
- Online evaluations automatically score agent calls based on created rules.
- 'LLM as a judge' uses an external LLM to evaluate your agent's output.
- You must provide a detailed prompt to guide the judge LLM.
- Users can manually annotate traces with their own scores to calibrate the 'LLM as a judge' evaluator.
- You can compare human scores with LLM-generated scores to improve the evaluation prompt.
- Explanation To set up an online evaluation, you navigate to the 'Online Evaluation' section in the OPIK dashboard and create a new rule. You name the rule, provide API keys for the external judging LLM (e.g., from OpenAI, Anthropic, or OpenRouter), and select the model. You then write a detailed prompt that instructs the judge LLM on how to score the output, including the context, scoring scale, and criteria. Once the rule is created, every new trace will be automatically scored against this metric, and the scores will appear in your trace list.
- Examples
For the recipe agent, three 'LLM as a judge' evaluation metrics were created to assess the generated recipes. The user set up a rule by providing a prompt to a judge LLM, defining a scoring scale and the context for what constitutes a good recipe suggestion. After setup, every time the recipe agent ran, it was automatically scored by these metrics.
- Go to 'Online Evaluation' and click 'Create New Rule'.
- Name the rule (e.g., 'Recipe Coherence').
- Choose a provider and model for the judge LLM (e.g., OpenAI's GPT-4).
- Write a prompt for the judge: 'You are an evaluator. Rate the following recipe on a scale of 1-5 for coherence...'
- Save the rule. Now, all subsequent runs of the recipe agent will have a 'Recipe Coherence' score.
- You can then go into a trace and manually add a human score to compare against the 'LLM as a judge' score, which helps in refining the evaluation prompt.
- Considerations
- It's good practice to use a healthy mix of heuristic evaluation metrics and 'LLM as a judge' metrics.
- You may need to iterate on the prompt given to the judging LLM to ensure its evaluations align with your definition of success.
- Using one LLM to judge another can be problematic, so human oversight and comparison are valuable.
- Special Circumstances
- If you suspect the 'LLM as a judge' is providing inaccurate scores, you should go into the traces, manually score them yourself, and compare your scores to the AI's. Based on the discrepancies, you can tweak the prompt given to the judge LLM to make its ratings more closely aligned with human judgment.
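Stripped to its essentials, the 'LLM as a judge' pattern is: build a scoring prompt, send it to a judge model, and parse a numeric score out of the reply. Here is a minimal sketch with the judge call stubbed out; in practice `call_judge_llm` would hit a real model through your provider's SDK, and the prompt wording is only an example:

```python
import re

def build_judge_prompt(recipe: str) -> str:
    """Construct the instruction the judge LLM receives."""
    return (
        "You are an evaluator. Rate the following recipe on a scale of 1-5 "
        "for coherence. Reply with only the number.\n\n"
        f"Recipe:\n{recipe}"
    )

def call_judge_llm(prompt: str) -> str:
    """Stub for the judge model call (a real system would call an LLM API here)."""
    return "Score: 4"

def parse_score(reply: str) -> int:
    """Pull the first integer out of the judge's reply; fail loudly otherwise."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"Judge reply had no score: {reply!r}")
    return int(match.group())

score = parse_score(call_judge_llm(build_judge_prompt("Creamy orange tomato soup: chop, simmer, blend.")))
assert 1 <= score <= 5
print(score)
```

The parsing step is where judge pipelines often break in practice, which is one more reason to spot-check judge scores against human annotations.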
Managing and Testing with Problematic Samples
OPIK provides features to isolate problematic agent runs and use them for regression testing. When an agent produces an error or an undesirable output, you can select these 'problematic samples' from your traces and add them to a named dataset. This dataset can then be used to repeatedly test your agent after making code changes, ensuring that your fixes are effective and don't introduce new issues.
- Keypoints
- Isolate traces where the agent fails or performs poorly.
- Select these problematic samples in the dashboard.
- Add them to a new or existing dataset.
- Use this dataset to run regression tests on your agent after making code changes.
- This helps verify that improvements are effective and don't cause regressions.
- Explanation To do this, you would go through your list of traces in the OPIK dashboard. You can filter or sort by low evaluation scores or errors to find the problematic runs. Select the checkboxes next to these traces. Then, use the option to 'add to a dataset'. You can create a new dataset (e.g., 'problematic samples') or add to an existing one. Later, when you've modified your agent's code, you can run the agent specifically against the inputs from this dataset to see if the outputs have improved.
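Once you have the problematic inputs, the regression loop itself is simple: re-run the (hopefully fixed) agent on each stored input and check whether the outputs now pass. A sketch with the agent, samples, and pass criterion all stubbed as illustrative assumptions; in practice the inputs would come from the traces you added to an OPIK dataset:

```python
# Sketch of a regression test over a saved dataset of problematic samples.
problematic_samples = [
    {"input": "rocks, glue", "expected_behavior": "refuse inedible ingredients"},
    {"input": "", "expected_behavior": "ask for ingredients"},
]

def run_agent(user_input: str) -> str:
    """Stubbed agent after a fix: now handles empty and inedible inputs."""
    if not user_input.strip():
        return "Please provide some ingredients."
    if "rocks" in user_input or "glue" in user_input:
        return "Those ingredients are not edible, so I can't suggest a recipe."
    return f"Here is a recipe using {user_input}."

def passes(output: str) -> bool:
    """Toy pass criterion: the fixed agent should not produce a recipe for bad input."""
    return not output.startswith("Here is a recipe")

results = [passes(run_agent(s["input"])) for s in problematic_samples]
print(f"{sum(results)}/{len(results)} previously failing samples now pass")
```

Running this after every code change tells you whether the fix actually landed and whether old failures have crept back in.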
Getting Started with OPIK and Basic Tracing
OPIK is an open-source tool for LLM observability. It can be self-hosted locally for free, providing high customization, or used via a 100% free cloud tier for quick setup. The demonstration and initial setup guide focus on the free cloud tier.
- Keypoints
- OPIK is 100% open source and offers a 100% free cloud tier.
- Setup involves installing the package and configuring it with an API key and workspace.
- The '@track' decorator is the simplest way to add tracing to any Python function.
- Direct integrations for frameworks like OpenAI provide deeper observability by wrapping the client object.
- Explanation To get started with the free cloud tier, first install OPIK using 'pip install opik'. Then, configure your account by running 'opik configure'. This command will prompt you for your API key, which can be found in your Comet account settings. You will also confirm the workspace you want to use. You can also set these as environment variables to avoid entering them manually each time. A project name can be specified using a code snippet provided in the quick start guide.
- Examples
A simple 'Hello World' trace example can be created using the '@track' decorator from OPIK. You import 'track' from OPIK and place the decorator above the definition of the function you want to monitor, such as an LLM call. This allows you to track the function's execution without rewriting your code. Every time the decorated function is run, it will automatically log to Comet, and a link to view the trace will be provided in the output.
- Import the 'track' decorator and place it above your function definition:

```python
from opik import track

@track
def my_function():
    ...
```

- Run the function. It will automatically log its execution, inputs, and outputs to OPIK.
OPIK offers direct integrations with popular frameworks like OpenAI. Instead of using the '@track' decorator, you import the specific integration (e.g., 'track_openai' from 'opik.integrations.openai') and wrap your framework client with it. This provides deep observability into all system metrics and information from the LLM call with just one or two lines of code:

```python
from openai import OpenAI
from opik.integrations.openai import track_openai

# Instantiate your original client, then wrap it with OPIK's tracking function.
client = track_openai(OpenAI())

# All calls made using this client object are now automatically tracked.
```
Debugging and Evaluation with OPIK
OPIK provides visibility into an agent's execution flow, helping users debug errors and identify areas for improvement. It allows you to see exactly which step in a multi-step agent failed, avoiding the need to manually sift through code or error traces.
- Keypoints
- OPIK's trace visualization helps pinpoint the exact step where an error occurred in an agent.
- This reduces debugging time by narrowing the search space from the entire codebase to a specific component (e.g., one LLM call).
- Automatic evaluation metrics can be set up to flag issues like toxicity that are hard to catch manually.
- The UI allows filtering the entire dataset based on evaluation flags, aggregating all relevant examples for analysis.
- Explanation When an agent call fails, OPIK's trace view shows the sequence of operations (e.g., LLM calls, tool calls). If a step fails, the trace stops there. For example, if an agent fails at the 'recipe suggester' step, you know the problem lies within that specific LLM call. This narrows down the debugging scope significantly. For more subtle issues, you can create evaluation metrics for things like toxicity. These metrics can automatically flag problematic runs, which you can then filter and analyze in the OPIK UI. This makes it much easier to consume a lot of information quickly and distill it down to find where things are going wrong, why, and how to fix them.
Managing Evaluation Datasets and Metrics
To maintain a relevant dataset for evaluations as an application evolves, OPIK allows for dynamic management of evaluation datasets. It is also highly recommended to use a combination of evaluation methods rather than relying on a single one.
- Keypoints
- Evaluation datasets in OPIK can be dynamically updated by adding or removing traces.
- If a dataset becomes irrelevant due to application changes, you can edit it or create a new one from scratch.
- It is recommended to use multiple evaluation metrics, such as combining heuristic logic with an LLM-as-a-judge, for more robust evaluation.
- Users are not limited to one evaluation dataset and can maintain several for different testing scenarios.
- Explanation As your application changes (e.g., tool call names, parameters), some traces in your evaluation dataset may become obsolete. In the OPIK UI, you can easily remove irrelevant traces from an existing dataset. You can also add new, more relevant traces. If a dataset has largely lost its relevance, you can create a completely new one from scratch. Users are not limited to a single evaluation dataset and can have dozens for different purposes. For the evaluation itself, you can and should use multiple metrics. For example, you can evaluate a model using a combination of heuristic logic and an LLM-as-a-judge.
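Combining a heuristic check with a judge score can be as simple as aggregating the two numbers. The metric names, weights, and scales below are illustrative assumptions, not OPIK built-ins:

```python
def heuristic_length_ok(output: str, min_words: int = 5) -> float:
    """Heuristic metric: 1.0 if the output is at least min_words long, else 0.0."""
    return 1.0 if len(output.split()) >= min_words else 0.0

def judge_score_normalized(raw_score: int, scale_max: int = 5) -> float:
    """Normalize a 1-5 LLM-as-a-judge score to the 0-1 range."""
    return raw_score / scale_max

def combined_score(output: str, judge_score: int) -> float:
    """Weighted blend of the heuristic and the judge (weights are arbitrary)."""
    return 0.3 * heuristic_length_ok(output) + 0.7 * judge_score_normalized(judge_score)

print(combined_score("Simmer the tomatoes with cream and orange zest.", judge_score=4))
```

Keeping the cheap heuristic alongside the judge gives a sanity check when the judge LLM drifts or misreads an output.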
Assignments & Suggestions
- Include the source code file or basic setup steps in your final submission if you use AI tools (such as Cursor or n8n), so judges can confirm the project works as intended.
- For detailed or 'deep divey' questions about OPIK, post them on the Discord server for a more thorough answer.
- For questions about specific SDKs and the functional differences between various OPIK integration methods, post them on Discord.
