Evaluations
What are evaluations?
Evaluation is a crucial technique for checking that your AI workflow is reliable. It can be the difference between a flaky proof of concept and a solid production workflow. It’s important both in the building phase and after deploying to production.
The foundation of evaluation is running a test dataset through your workflow. This dataset contains multiple test cases. Each test case contains a sample input for your workflow, and often includes the expected output(s) too.
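As a rough sketch of what this looks like in practice (the field names below are illustrative, not a fixed schema), a test case is simply an input paired with an optional expected output, and a dataset is a list of them:

```typescript
// Illustrative shape of a test dataset; the field names are examples, not a required schema.
interface TestCase {
  input: string;            // sample input sent to the workflow
  expectedOutput?: string;  // the output you expect, when one is known
}

const dataset: TestCase[] = [
  {
    input: "Summarise: 'The meeting moved from Tuesday to Thursday at 3pm.'",
    expectedOutput: "Meeting rescheduled to Thursday, 3pm.",
  },
  {
    input: "Summarise: ''", // edge case: empty input
    expectedOutput: "Nothing to summarise.",
  },
];
```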
```mermaid
flowchart LR
    A[Test Dataset] --> B[AI Workflow]
    B --> C[Actual Output]
    D[Expected Output] --> E[Comparison]
    C --> E
    E --> F{Evaluation Results}
    F -->|Pass| G[Reliable Workflow]
    F -->|Fail| H[Needs Improvement]
    H --> I[Iterate & Fix]
    I --> B
```
Evaluation allows you to:
- Test your workflow over a range of inputs so you know how it performs on edge cases
- Make changes with confidence without inadvertently making things worse elsewhere
- Compare performance across different models or prompts
Why is evaluation needed?
AI models are fundamentally different from code. Code is deterministic, so you can reason about it; that’s difficult with LLMs, since they’re black boxes. Instead, you have to measure an LLM’s behaviour by running data through it and observing the output.
You can only build confidence that your workflow performs reliably after running it over a range of inputs that accurately reflect the edge cases it will have to handle in production.
Two types of evaluation
Light evaluation (pre-deployment)
Building a clean, comprehensive dataset is hard. In the initial building phase, it often makes sense to generate just a handful of examples. These can be enough to iterate the workflow to a releasable state (or a proof of concept). You can visually compare the results to get a sense of the workflow’s quality, without setting up formal metrics.
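As a minimal sketch of this kind of spot-checking (the `runWorkflow` helper below is a hypothetical stand-in for however you actually invoke your workflow), light evaluation can be as simple as printing actual and expected outputs side by side and eyeballing them:

```typescript
// Hypothetical stand-in for invoking your workflow (for example, an HTTP call to its endpoint).
async function runWorkflow(input: string): Promise<string> {
  return `(workflow output for: ${input})`; // replace with a real call to your workflow
}

// A handful of hand-picked cases is often enough while you iterate on the workflow.
const handPickedCases = [
  { input: "Translate 'good morning' to French", expected: "Bonjour" },
  { input: "Translate an empty string to French", expected: "" },
];

async function lightEvaluation(): Promise<void> {
  for (const testCase of handPickedCases) {
    const actual = await runWorkflow(testCase.input);
    // Print actual vs. expected side by side so quality can be judged visually.
    console.log(`Input:    ${testCase.input}`);
    console.log(`Expected: ${testCase.expected}`);
    console.log(`Actual:   ${actual}`);
    console.log("---");
  }
}

lightEvaluation();
```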
Metric-based evaluation (post-deployment)
Once you deploy your workflow, it’s easier to build a bigger, more representative dataset from production executions. When you discover a bug, you can add the input that caused it to the dataset. When fixing the bug, it’s important to run the whole dataset through the workflow again as a regression test to check that the fix hasn’t inadvertently made something else worse.
Since there are too many test cases to check individually, evaluations measure the quality of the outputs using a metric, a numeric value representing a particular characteristic. This also allows you to track quality changes between runs.
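To make the idea of a metric concrete, here is a generic sketch (not tied to any particular tool): it scores each test case with a simple exact-match metric and averages the scores, giving a single number you can track between runs:

```typescript
// One evaluated test case: the input, the expected output, and the actual workflow output.
interface EvalCase {
  input: string;
  expected: string;
  actual: string;
}

// A metric maps one test case to a number; exact match is the simplest possible example.
function exactMatch(testCase: EvalCase): number {
  return testCase.actual.trim().toLowerCase() === testCase.expected.trim().toLowerCase() ? 1 : 0;
}

// Average the per-case scores so quality can be compared between evaluation runs.
function averageScore(cases: EvalCase[], metric: (c: EvalCase) => number): number {
  if (cases.length === 0) return 0;
  return cases.reduce((sum, c) => sum + metric(c), 0) / cases.length;
}

const results: EvalCase[] = [
  { input: "What is 2 + 2?", expected: "4", actual: "4" },
  { input: "Capital of France?", expected: "Paris", actual: "Paris, France" },
];

console.log(`Exact-match score: ${averageScore(results, exactMatch)}`); // 0.5
```

In practice you would likely use richer metrics (string similarity, LLM-as-judge scores, and so on), but the shape stays the same: one number per test case, aggregated across the dataset.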
Comparison of evaluation types
```mermaid
graph TB
    subgraph "Light Evaluation (Pre-deployment)"
        A1["Small Dataset<br/>5-20 test cases"]
        A2[Hand-generated Examples]
        A3[Visual Inspection]
        A4[Large Performance Gains]
        A5[Optional Expected Outputs]
    end
    subgraph "Metric-based Evaluation (Post-deployment)"
        B1["Large Dataset<br/>100+ test cases"]
        B2[Production Data]
        B3[Automated Metrics]
        B4[Small Incremental Gains]
        B5[Required Expected Outputs]
    end
    C[Development Phase] --> A1
    D[Production Phase] --> B1
    style A1 fill:#e8f5e8
    style B1 fill:#fff3e0
```
| | Light evaluation (pre-deployment) | Metric-based evaluation (post-deployment) |
|---|---|---|
| Performance improvements with each iteration | Large | Small |
| Dataset size | Small | Large |
| Dataset sources | Hand-generated, AI-generated, other | Production executions, AI-generated, other |
| Actual outputs | Required | Required |
| Expected outputs | Optional | Required (usually) |
| Evaluation metric | Optional | Required |
Learn more
- Light evaluations: Perfect for evaluating your AI workflows against hand-selected test cases during development.
- Metric-based evaluations: Advanced evaluations to maintain performance and correctness in production by using scoring and metrics with large datasets.
- Tips and common issues: Learn how to set up specific evaluation use cases and work around common issues.