Evaluate your agent's performance with an LLM-as-a-judge
The Evaluator View is similar to the batch interface in that it lets you run a CSV file of inputs through your agent all at once. It lets you test your agent before a project goes live, and uses an LLM to evaluate your agent's outputs.
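Conceptually, each row in the view follows the pattern sketched below: the agent answers the row's input, then a judge model grades that answer against your evaluation prompt. This is only an illustration of what the Evaluator View automates for you; `run_agent`, the judge model name, and the CSV column name are assumptions, not part of the product.

```python
import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are an evaluator. Grade the agent's answer from 1 to 5 for accuracy "
    "and helpfulness, then justify the score in one sentence."
)

def run_agent(user_input: str) -> str:
    # Placeholder for your agent; the Evaluator View runs your real workflow here.
    return f"(agent response to: {user_input})"

def judge(user_input: str, agent_output: str) -> str:
    # One LLM-as-a-judge call per row; its reply is what an evaluator column shows.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Input:\n{user_input}\n\nAgent output:\n{agent_output}"},
        ],
    )
    return response.choices[0].message.content

with open("scenarios.csv", newline="") as f:  # one test case per row
    for row in csv.DictReader(f):
        output = run_agent(row["input"])
        print(judge(row["input"], output))
```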
There are two types of evaluation:
1. Grading outputs based on criteria
On the right-hand side, create an evaluator:
- Select the output to evaluate
- Add a system prompt - the evaluation logic (see the example prompt after this list)
- Give it a name
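For instance, an evaluator's system prompt might spell out a grading rubric like the sketch below. The criteria and scale are only an example; write whatever evaluation logic matters for your use case.

```python
# Example evaluator system prompt (illustrative only; adapt the criteria to your agent).
EVALUATOR_SYSTEM_PROMPT = """\
You are grading an AI agent's answer to a customer question.

Score the answer from 1 to 5, considering:
- Accuracy: does it answer the question without factual errors?
- Tone: is it polite and on-brand?
- Completeness: does it address every part of the question?

Reply with the score followed by a one-sentence justification.
"""
```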
Once the evaluator is created, a new column will appear in the table showing the evaluation results for each row.
Add as many evaluators as there are outputs in your workflow: each evaluator grades a different output. Give each evaluator's model its own system prompt and select which of your agent's outputs it should evaluate.
You can add rows to evaluate manually, or upload a CSV containing all your scenarios (click the 3 dots and then the upload CSV option).
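If you build the CSV programmatically, a minimal sketch is shown below. The column name and scenarios are placeholders; use one column per input field your agent expects.

```python
import csv

# Hypothetical test scenarios; one dict per row of the CSV.
scenarios = [
    {"input": "How do I reset my password?"},
    {"input": "What is your refund policy for annual plans?"},
    {"input": "Cancel my subscription but keep my data."},
]

with open("scenarios.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input"])
    writer.writeheader()
    writer.writerows(scenarios)
```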
2. Comparing outputs to a gold standard answer
Click 'Requires Expected Answer' to add a ground truth to your execution. This is the response you would expect from the AI model, and the evaluator will take it into account in its analysis.
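In practice, the expected answer is extra context for the judge. The sketch below shows one way such a comparison prompt could be assembled; the wording and function name are assumptions, not the platform's actual prompt.

```python
def build_judge_message(user_input: str, agent_output: str, expected_answer: str) -> str:
    # Assemble the user message for an LLM judge that compares the agent's output
    # to a gold-standard answer (illustrative wording only).
    return (
        f"Input:\n{user_input}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        f"Expected answer (ground truth):\n{expected_answer}\n\n"
        "Grade from 1 to 5 how closely the agent output matches the expected "
        "answer in meaning, and explain any important differences."
    )
```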