Evaluation
How to evaluate your AI workflow in Stack AI
The evaluation section helps you analyze the performance of your LLM workflow across several tests.
To run an evaluation, fill in the evaluation table with the inputs and parameters of your workflow. You can specify the following parameters:
Inputs (nodes with an id prefixed by in-)
URLs (nodes with an id prefixed by url-)
Images (nodes with an id prefixed by img2text-)
All of these parameters are set dynamically for each test.
Add up to 50 evaluations in parallel to your LLM workflow.
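As a rough illustration, each test can be thought of as a mapping from node ids to values. The ids in-0 and url-0 below are hypothetical, chosen only to match the prefixes listed above:

```python
# A minimal sketch of one test's parameters, assuming hypothetical
# node ids "in-0" (an Input node) and "url-0" (a URL node).
test_parameters = {
    "in-0": "Hello",            # value fed to the Input node
    "url-0": "www.google.com",  # value fed to the URL node
}
expected_output = "Hi :)"       # completion you expect for this test
```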
You can upload a CSV of evaluations containing the values of each parameter and the expected output. For instance, a CSV may have the following structure:
Input, URL, Output
Hello, www.google.com, Hi :)
How are you?, www.facebook.com, Doing well.
How is it going?, www.twitter.com, All good!
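As a quick illustration, here is a minimal Python sketch that writes an evaluation CSV like the one above. The column names and file name are assumptions for illustration, not a fixed Stack AI schema:

```python
import csv

# Build an evaluation CSV matching the example above.
# Column names ("in-0", "url-0", "output") are assumed, not required.
rows = [
    {"in-0": "Hello", "url-0": "www.google.com", "output": "Hi :)"},
    {"in-0": "How are you?", "url-0": "www.facebook.com", "output": "Doing well."},
    {"in-0": "How is it going?", "url-0": "www.twitter.com", "output": "All good!"},
]

with open("evaluations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["in-0", "url-0", "output"])
    writer.writeheader()
    writer.writerows(rows)
```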
Once your evaluations are done, you can download your results as a separate CSV.
You can grade a flow's performance by selecting an "Output to evaluate" and specifying a "Grade Criteria" in the input box.
Under the hood, grading runs a pipeline of LLMs that scores the completions produced by the other LLMs in your workflow.
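Conceptually, this "LLM as a judge" pattern looks like the sketch below. The call_llm helper and the prompt wording are assumptions for illustration, not Stack AI's internal implementation:

```python
# A conceptual sketch of LLM-as-judge grading, assuming a generic
# call_llm(prompt) helper; this is not Stack AI's internal code.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's API call.")

def grade(completion: str, criteria: str) -> int:
    """Ask a grader LLM to score a completion from 1 to 10."""
    prompt = (
        "You are grading the output of another LLM.\n"
        f"Grading criteria: {criteria}\n"
        f"Output to evaluate: {completion}\n"
        "Respond with a single integer score from 1 to 10."
    )
    return int(call_llm(prompt).strip())
```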
Evaluation criteria work best under the following guidelines (see the example criteria after this list):
Specify clear instructions: state exactly what the goal of the LLM workflow is and how you expect it to respond.
Enumerate how to grade: if possible, describe what corresponds to a 10-point score, what corresponds to a 5-point score, and what corresponds to a 1-point score.
Use plain English: avoid technical jargon that wouldn't be immediately understandable to an LLM.
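For example, a criteria that follows these guidelines (the task described here is hypothetical) might read: "The workflow answers customer-support questions about billing. Give a 10 if the answer is correct, polite, and cites the relevant policy; a 5 if it is correct but vague or missing the policy; a 1 if it is incorrect or off-topic."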
If the "ground truth" field is filled in, the model grader also evaluates how closely the LLM completion matches the ground truth.
(Coming soon)