What is the Relationship Between Evaluation Grades, Scores, and Pass/Fail Status?
The evaluation result for each sample can be presented in one of the following three forms:
- Grade
- Score
- Pass/Fail
Which form the result takes depends on how the evaluation metric itself is configured. For example, a Match type evaluator can produce a score and a pass/fail status, whereas an evaluator based on large model grading may produce only a grade.
Grade to Score Conversion
A grade conversion table converts grades into scores. It is used primarily in custom large model grading, and its definition is compatible with the format of the choice_strings and choice_scores parameters in the OpenAI Evals model-graded eval template.
choice_strings:
The choices that we expect the model completion to contain given the evaluation prompt. For example, "ABCDE" or ["Yes", "No", "Unsure"]. Any other choices returned by the model are parsed into "invalid".
choice_scores (optional):
A mapping of each choice to its score, which is logged as a metric. For example, if a response of "Yes" (resp. "No") indicates that the model's original completion was good (resp. bad), we may assign this choice a score of 1 (resp. 0).
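The two parameters above can be illustrated with a minimal Python sketch. It is not the OpenAI Evals implementation: the function names parse_grade and grade_to_score are hypothetical, and matching the completion by exact comparison after trimming whitespace is an assumption made only to keep the example short.

```python
from typing import Dict, Optional, Sequence

INVALID = "invalid"  # label used for any choice outside choice_strings


def parse_grade(completion: str, choice_strings: Sequence[str]) -> str:
    """Return the matching choice, or "invalid" if none matches.

    Assumption: the grader's completion is compared by exact match
    after stripping surrounding whitespace.
    """
    text = completion.strip()
    for choice in choice_strings:
        if text == choice:
            return choice
    return INVALID


def grade_to_score(grade: str, choice_scores: Dict[str, float]) -> Optional[float]:
    """Look up the score for a grade; None if the grade has no score."""
    return choice_scores.get(grade)


# Hypothetical grade conversion table
choice_strings = ["Yes", "No", "Unsure"]
choice_scores = {"Yes": 1.0, "No": 0.0}

grade = parse_grade(" Yes\n", choice_strings)
score = grade_to_score(grade, choice_scores)
```

Because choice_scores is optional, a grade without an entry (here "Unsure" or "invalid") simply has no score in this sketch.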
For detailed instructions on the OpenAI Evals' evaluation template, see: https://github.com/openai/evals/blob/main/docs/eval-templates.md
Score to Pass/Fail Conversion
By setting a threshold and scoring direction for the evaluation metric, the score in the evaluation result can be converted to a pass/fail status, which can then be reported as an overall pass rate in the statistical report.
Some custom metric types (e.g., string distance) allow users to fill in a threshold value at creation.
In fully custom large model grading metrics, score to pass/fail conversion can be configured by setting the threshold and reverse_score parameters.
threshold (optional):
If a threshold is set, samples with scores above this threshold are considered as having passed the evaluation. If not set, no evaluation result including a score will be generated.
reverse_score (optional):
Defaults to 0, meaning scores above the threshold are considered as having passed the evaluation; if set to 1, scores below the threshold are considered as having passed. This needs to be used in conjunction with a threshold.
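The interaction between threshold and reverse_score, and the resulting overall pass rate, can be sketched in Python as follows. The function names score_to_pass and pass_rate are hypothetical. The sketch also makes two assumptions that the text above does not settle: a score exactly equal to the threshold does not pass in either direction, and an unset threshold yields no pass/fail status.

```python
from typing import List, Optional


def score_to_pass(
    score: float,
    threshold: Optional[float] = None,
    reverse_score: int = 0,
) -> Optional[bool]:
    """Convert a score to pass/fail; None when no threshold is set."""
    if threshold is None:
        # Assumption: without a threshold there is no pass/fail status.
        return None
    if reverse_score == 1:
        # Reversed direction: lower scores are better.
        return score < threshold
    # Default direction: higher scores are better.
    return score > threshold


def pass_rate(statuses: List[Optional[bool]]) -> Optional[float]:
    """Share of passed samples among those with a pass/fail status."""
    decided = [s for s in statuses if s is not None]
    if not decided:
        return None
    return sum(decided) / len(decided)


# Hypothetical per-sample scores, threshold 0.5, default direction
scores = [0.9, 0.4, 0.7, 0.8]
statuses = [score_to_pass(s, threshold=0.5) for s in scores]
overall = pass_rate(statuses)
```

With reverse_score set to 1, the same scores would be judged in the opposite direction, which suits metrics such as string distance where a smaller value indicates a closer match.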