What is the Relationship Between Evaluation Grades, Scores, and Pass/Fail Status?

The evaluation result for each sample can be presented in one of the following three forms:

Grade
Score
Pass/Fail

Which form a result takes depends on how the evaluation metric itself is configured. For example, a Match type evaluator can produce both a score and a pass/fail status, whereas a metric based on large model grading may produce only a grade.

Grade to Score Conversion

A grade conversion table converts grades into scores. It is used primarily in custom large model grading, and its definition is compatible with the format of the choice_strings and choice_scores parameters in the OpenAI Evals model-graded eval template.

choice_strings: The choices that we expect the model completion to contain given the evaluation prompt. For example, "ABCDE" or ["Yes", "No", "Unsure"]. Any other choices returned by the model are parsed into "invalid".
choice_scores (optional): A mapping of each choice to its score, which is logged as a metric. For example, if a response of "Yes" (resp. "No") indicates that the model's original completion was good (resp. bad), we may assign this choice a score of 1 (resp. 0).
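The conversion described by these two parameters can be illustrated with a minimal sketch. The helper names parse_grade and grade_to_score are hypothetical and not part of OpenAI Evals or of this product. The sketch also assumes the grade is the whole completion after trimming whitespace, whereas the real template supports other matching modes.

```python
from typing import Mapping, Optional, Sequence, Union

INVALID = "invalid"  # label used here for any unexpected choice (assumption)


def parse_grade(completion: str, choice_strings: Union[str, Sequence[str]]) -> str:
    """Map a model completion to one of the expected choices, or to 'invalid'.

    Assumption: the completion must equal a choice exactly after trimming
    surrounding whitespace.
    """
    # list() turns "ABCDE" into ["A", "B", "C", "D", "E"] and leaves a list
    # of strings such as ["Yes", "No", "Unsure"] unchanged.
    choices = list(choice_strings)
    answer = completion.strip()
    return answer if answer in choices else INVALID


def grade_to_score(grade: str, choice_scores: Mapping[str, float]) -> Optional[float]:
    """Look up the score for a grade; None when the grade has no score."""
    return choice_scores.get(grade)


if __name__ == "__main__":
    choice_strings = ["Yes", "No", "Unsure"]
    choice_scores = {"Yes": 1.0, "No": 0.0}
    for completion in ["Yes", " No ", "Maybe"]:
        grade = parse_grade(completion, choice_strings)
        print(completion.strip(), grade, grade_to_score(grade, choice_scores))
```

In this sketch a grade with no entry in choice_scores (including "invalid") has no score, so it cannot later be converted to a pass/fail status.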

For details on the OpenAI Evals model-graded eval template, see: https://github.com/openai/evals/blob/main/docs/eval-templates.md

Score to Pass/Fail Conversion

Setting a threshold and a scoring direction for an evaluation metric converts each sample's score into a pass/fail status. The statistical report can then aggregate these statuses into an overall pass rate.

Some custom metric types (e.g., string distance) allow users to fill in a threshold value at creation.

In fully custom large model grading metrics, score to pass/fail conversion is configured by setting the threshold and reverse_score parameters.

threshold (optional): If a threshold is set, samples with scores above the threshold are considered to have passed the evaluation. If not set, no evaluation result including a score will be generated.
reverse_score (optional): Defaults to 0, meaning that scores above the threshold are considered to have passed; if set to 1, scores below the threshold are considered to have passed. This parameter only takes effect when a threshold is also set.
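The interaction of threshold and reverse_score can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function names score_to_pass and pass_rate are hypothetical, comparisons are strict (a score exactly equal to the threshold does not pass in either direction), and the pass rate is computed only over samples that received a pass/fail status.

```python
from typing import Iterable, Optional


def score_to_pass(
    score: Optional[float],
    threshold: Optional[float] = None,
    reverse_score: int = 0,
) -> Optional[bool]:
    """Convert a score into a pass/fail status.

    Returns None when there is no score or no threshold, because the
    conversion is only defined when both are present.
    """
    if score is None or threshold is None:
        return None
    if reverse_score == 1:
        # Reversed direction: lower scores are better.
        return score < threshold
    # Default direction: higher scores are better.
    return score > threshold


def pass_rate(statuses: Iterable[Optional[bool]]) -> Optional[float]:
    """Share of passed samples among those with a pass/fail status (assumption)."""
    decided = [s for s in statuses if s is not None]
    if not decided:
        return None
    return sum(decided) / len(decided)


if __name__ == "__main__":
    scores = [0.9, 0.4, None, 0.7]
    statuses = [score_to_pass(s, threshold=0.5) for s in scores]
    print(statuses)
    print(pass_rate(statuses))
```

A metric where lower is better, such as a string distance, would typically use reverse_score set to 1 so that scores below the threshold pass.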
