This n8n template demonstrates how to calculate the evaluation metric "Correctness", which in this scenario compares and classifies the agent's response against a set of ground truths.
The scoring approach is adapted from the open-source evaluations project RAGAS; you can see the source here: https://github.com/explodinggradients/ragas/blob/main/ragas/src/ragas/metrics/_answer_correctness.py
How it works
- This evaluation works best where the agent's response is allowed to be more verbose and conversational.
- For our scoring, we classify the agent's response into 3 buckets: True Positive (in answer and ground truth), False Positive (in answer but not ground truth) and False Negative (not in answer but in ground truth).
- We also calculate an average similarity score of the agent's response against all ground truths.
- The classification buckets are reduced to a single score (an F1-style measure in the RAGAS implementation), which is then averaged with the similarity score to give the final score (see the sketch after this list).
- A high score indicates the agent is accurate whereas a low score could indicate the agent has incorrect training data or is not providing a comprehensive enough answer.
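Below is a minimal sketch of the scoring step described above, written in Python for illustration (an n8n Code node would use the same logic in JavaScript). The bucket counts and similarity values are assumed to come from earlier LLM and embedding steps in the workflow; the F1-style combination of the buckets follows the linked RAGAS _answer_correctness source, while the final simple average mirrors this template's description. Function names here are hypothetical, not part of n8n or RAGAS.

```python
def f1_from_buckets(tp: int, fp: int, fn: int) -> float:
    """Turn the True Positive / False Positive / False Negative counts
    into a single factuality score (F1-style, as in RAGAS)."""
    if tp == 0 and (fp > 0 or fn > 0):
        return 0.0
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator > 0 else 1.0


def correctness_score(tp: int, fp: int, fn: int, similarities: list[float]) -> float:
    """Average the classification (factuality) score with the mean
    similarity score to produce the final Correctness score."""
    factuality = f1_from_buckets(tp, fp, fn)
    avg_similarity = sum(similarities) / len(similarities) if similarities else 0.0
    return (factuality + avg_similarity) / 2


# Example: 3 answer statements matched the ground truths, 1 was unsupported,
# 1 ground-truth statement was missing, and similarity scores were ~0.8.
print(correctness_score(tp=3, fp=1, fn=1, similarities=[0.82, 0.78]))  # ~0.78
```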
Requirements