How do I create an evaluator with a custom YAML configuration?

Creating an evaluator with the custom YAML configuration type gives you more control over both the prompts used during evaluation and the form the evaluation takes. Here is a fairly complete example configuration:

prompt: |-
  You are comparing a submitted answer to an expert answer on a given question. Here is the data:
  [BEGIN DATA]
  ************
  [Question]: {input}
  ************
  [Expert]: {ideal}
  ************
  [Submission]: {completion}
  ************
  [END DATA]

  Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
  The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
  (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
  (C) The submitted answer contains all the same details as the expert answer.
  (D) There is a disagreement between the submitted answer and the expert answer.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.
eval_type: cot_classify
choice_strings:
  - "A"
  - "B"
  - "C"
  - "D"
  - "E"
choice_scores:
  "A": 0.8
  "B": 0.8
  "C": 0.8
  "D": 0.0
  "E": 0.5
threshold: 0.5
reverse_score: 0
answer_prompt: ""

Supported properties include:

prompt: The dialog template. It may contain the placeholders listed at the end of this section.
eval_type: The reasoning method, one of cot_classify, classify_cot, or classify. Each method corresponds to one of the following reasoning prompts:

# e.g. "Yes"
"classify": "Answer the question by printing only a single choice from {choices} (without quotes or punctuation) corresponding to the correct answer with no other text."
# e.g. "Yes\n The reasons are: ..."
"classify_cot": "First, answer by printing a single choice from {choices} (without quotes or punctuation) corresponding to the correct answer. Then, from the next line, explain your reasonings step by step."
# e.g. "Let's think step by step. ...\nYes"
"cot_classify": """
First, write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Then print only a single choice from {choices} (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the answer by itself on a new line."""

Here, {choices} is a placeholder that will be replaced by the choice_strings in the YAML evaluation configuration.
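
The substitution of {choices} can be pictured with a minimal Python sketch. The helper names (format_choices, fill_choices) and the quoted, "or"-separated rendering of the options are assumptions made for illustration; the exact formatting the evaluator uses is not specified here.

```python
def format_choices(choice_strings):
    """Render the grade options as a quoted, 'or'-separated list (assumed format)."""
    return " or ".join(f'"{choice}"' for choice in choice_strings)


def fill_choices(template, choice_strings):
    """Substitute the {choices} placeholder in a reasoning prompt."""
    # str.replace is used instead of str.format so that any other braces
    # in the template are left untouched.
    return template.replace("{choices}", format_choices(choice_strings))


template = "Answer by printing only a single choice from {choices}."
print(fill_choices(template, ["A", "B", "C"]))
# Answer by printing only a single choice from "A" or "B" or "C".
```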

choice_strings: The grade options in which the evaluation result is expressed, given as a list of strings or a single string.
choice_scores: A dictionary that maps each grade to a score.
threshold: The cutoff used to turn a score into a pass/fail assertion. It works together with reverse_score: when reverse_score is 0, a sample passes if its score is greater than or equal to the threshold; when reverse_score is 1, a sample passes if its score is less than the threshold.
reverse_score: Whether to reverse the pass condition. The value is 0 or 1 and defaults to 0. It works together with threshold.
answer_prompt: A custom answer prompt that takes priority over eval_type. If it is set to a non-empty string, it replaces the reasoning prompt selected by eval_type.
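
The interplay of choice_scores, threshold, reverse_score, and answer_prompt described above can be sketched in Python. This is only an illustration of the stated rules; the function names is_pass and select_answer_prompt are hypothetical and not part of the product.

```python
def is_pass(choice, choice_scores, threshold=0.5, reverse_score=0):
    """Map a grade to its score, then apply the threshold rule."""
    score = choice_scores[choice]
    if reverse_score:
        # Reversed: scores below the threshold pass.
        return score < threshold
    # Default: scores at or above the threshold pass.
    return score >= threshold


def select_answer_prompt(eval_type_prompt, answer_prompt=""):
    """A non-empty answer_prompt overrides the prompt implied by eval_type."""
    return answer_prompt if answer_prompt else eval_type_prompt


# Scores taken from the example configuration above.
choice_scores = {"A": 0.8, "B": 0.8, "C": 0.8, "D": 0.0, "E": 0.5}

print(is_pass("E", choice_scores, threshold=0.5))                   # True  (0.5 >= 0.5)
print(is_pass("D", choice_scores, threshold=0.5))                   # False (0.0 < 0.5)
print(is_pass("D", choice_scores, threshold=0.5, reverse_score=1))  # True  (0.0 < 0.5)
```

With the example configuration, grades A, B, C, and E pass and only D fails, because E scores exactly at the threshold.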

When the final evaluation prompt is generated, the following placeholders in prompt are replaced with the corresponding fields of the evaluation sample:

{input}: The prompt
{ideal}: The ideal answer
{completion}: The generated answer
{context}: The background information
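
As a rough sketch of this replacement step, the Python snippet below fills the four placeholders from a sample dictionary. The function name render_prompt, the dictionary-shaped sample, and the choice to render a missing field as an empty string are all assumptions made for illustration.

```python
import re

# Only the four documented placeholders are recognised.
PLACEHOLDER = re.compile(r"\{(input|ideal|completion|context)\}")


def render_prompt(template, sample):
    """Fill the documented placeholders from an evaluation sample."""
    # A single regex pass ensures that braces inside the substituted text
    # (for example a completion containing "{ideal}") are not expanded again.
    # Missing fields become empty strings (an assumption of this sketch).
    return PLACEHOLDER.sub(lambda m: str(sample.get(m.group(1), "")), template)


template = "[Question]: {input}\n[Expert]: {ideal}\n[Submission]: {completion}"
sample = {
    "input": "What is 2 + 2?",
    "ideal": "4",
    "completion": "The answer is 4.",
}
print(render_prompt(template, sample))
```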

With a custom YAML configuration, users can build evaluation prompts for complex evaluation scenarios without any programming experience. In practice, you can start from the examples provided when creating the evaluator.