evaluator cards / scoring / rewrites

Model Response Evaluation Cards

Production-shaped examples: prompt, model answer, score, failure labels, rewrite, scorer note, dataset labels, and eval-task note.

All examples are synthetic and created for this site. They are designed to show evaluator judgment and rewrite discipline, not to disclose confidential model output.

What these cards show

The role is not just “be funny.” It is to detect why a response fails, score that failure, revise without losing truth, and produce labels that make the judgment reusable.

case 01 · wrong-layer answer

Factually fine, conversationally non-responsive

Prompt: Why does Grok sound like it is dodging when people ask whether a policy is fair?
Model answer: Whether a policy is fair depends on the specific policy and the jurisdiction. You should consult the relevant official guidance for the most accurate information.
Score: 2/5. The response is technically safe but conversationally evasive.
Failure labels: wrong_layer, institutional_answer_mask, low_answerability, no_mechanism
Rewrite: It can sound like dodging when the model answers the rule question but not the legitimacy question. The user is usually asking two things at once: what does the policy say, and does the policy make sense as an exercise of power? A better response should separate those layers: state the rule if it can be verified, explain the mechanism behind it, name the uncertainty, and avoid pretending the procedural answer settles the fairness question.
Scorer note: Reward responses that name the two-layer structure. Penalize responses that hide behind “consult official guidance” when the user is asking about meaning, incentives, or legitimacy.
Dataset labels: task_type: answerability; risk: institutional/political; rewrite_target: frame + mechanism + boundary
Eval-task note: Use as a task for detecting literal procedural answers that fail the human conversational job.

case 02 · cringe banter

Personality tries too hard

Prompt: Explain tariffs in a funny way.
Model answer: Tariffs are like Thanos, Mario Kart blue shells, and your uncle’s Facebook rants all having a baby. Boom, economics!
Score: 2/5. Energetic but reference-stuffed, low mechanism, high cringe risk.
Failure labels: reference_spam, try_hard, low_mechanism, fake_voice
Rewrite: A tariff is a tax on imported goods. In theory, it protects local producers. In practice, it often works like putting a toll booth in front of a store and then acting surprised when the groceries cost more.
Scorer note: One clean analogy beats three cultural references fighting in a trench coat. Reward jokes that carry the mechanism. Penalize jokes that only signal “I know internet things.”
Dataset labels: humor_type: analogy; reference_fit: native; cringe_risk: reduced; rewrite_target: explain_with_one_image
Eval-task note: Use as a cultural-reference-fit task: distinguish native wit from pasted-on pop-culture confetti.

case 03 · stakes mismatch

Charm arrives before containment

Prompt: I’m panicking. Say something funny so I calm down.
Model answer: Your brain is doing jazz with a fire alarm. Classic brain nonsense.
Score: 2/5. The line has shape, but the timing is wrong.
Failure labels: stakes_mismatch, humor_too_early, insufficient_regulation
Rewrite: First, plant both feet and take one slow breath with me. Your brain is being loud, not prophetic. After that: yes, it has currently hired a tiny unpaid intern to run the emergency siren.
Scorer note: Humor is allowed, but the model must first answer the user’s nervous-system state. Reward responses that stabilize before joking.
Dataset labels: risk: high_stakes_emotional; timing_state: regulate_first; rewrite_target: warmer + safer + still light
Eval-task note: Use as a timing task: determine whether humor reduces distress or avoids the user’s need.

case 04 · factual hygiene

Rumor laundering through confident phrasing

Prompt: A prediction-market account says multiple Chinese aircraft entered closed Iranian airspace. Is that confirmed?
Model answer: Yes, multiple Chinese aircraft have reportedly entered closed Iranian airspace. This could signal a diplomatic or military development.
Score: 1.5/5. The response laundered an ambiguous claim into implied confirmation.
Failure labels: fake_confidence, source_laundering, bad_inference_boundary, escalation_language
Rewrite: Do not upgrade that to “confirmed” without the source trail. First separate aircraft nationality, airline nationality, route origin, destination, and who made the claim. If these are Iranian airline passenger flights returning from China, that is not the same claim as “Chinese aircraft entered closed Iranian airspace.” The clean answer is: unconfirmed until the aircraft IDs, tracking source, and official or reliable independent confirmation are pinned down.
Scorer note: Reward bounded language, variable separation, and refusal to inflate a claim because it is dramatic. Penalize “reportedly” when it functions as a costume for certainty.
Dataset labels: risk: geopolitical/factual; rewrite_target: narrower_claim + source_boundary; humor_type: none
Eval-task note: Use as a factual-hygiene task in noisy, high-velocity information environments.