Model Response Evaluation Cards
Production-shaped examples: prompt, model answer, score, failure labels, rewrite, scorer note, dataset labels, and eval-task note.
All examples are synthetic and created for this site. They are designed to show evaluator judgment and rewrite discipline, not to disclose confidential model output.
What these cards show
The role is not just “be funny.” It is to detect why a response fails, score that failure, revise without losing truth, and produce labels that make the judgment reusable.
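One way to keep the card fields consistent across cases is to hold each card in a small typed record. The sketch below is illustrative only: the class name `EvalCard`, the field names, and the validation rules are assumptions that mirror the card sections listed above, not a schema this site defines. It is populated with the values from case 02.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCard:
    """One evaluator card. Field names are hypothetical and mirror the card sections."""

    case_id: str
    prompt: str
    model_answer: str
    score: float  # assumed to sit on a 0-5 scale, matching the "x/5" scores on the cards
    failure_labels: list[str]
    rewrite: str
    scorer_note: str
    dataset_labels: dict[str, str] = field(default_factory=dict)
    eval_task_note: str = ""

    def __post_init__(self) -> None:
        # Assumed sanity checks: a bounded score and at least one failure label.
        if not 0 <= self.score <= 5:
            raise ValueError(f"score must be within 0-5, got {self.score}")
        if not self.failure_labels:
            raise ValueError("a card needs at least one failure label")


card = EvalCard(
    case_id="case 02",
    prompt="Explain tariffs in a funny way.",
    model_answer=(
        "Tariffs are like Thanos, Mario Kart blue shells, and your uncle’s "
        "Facebook rants all having a baby. Boom, economics!"
    ),
    score=2.0,
    failure_labels=["reference_spam", "try_hard", "low_mechanism", "fake_voice"],
    rewrite=(
        "A tariff is a tax on imported goods. In theory, it protects local "
        "producers. In practice, it often works like putting a toll booth in "
        "front of a store and then acting surprised when the groceries cost more."
    ),
    scorer_note="One clean analogy beats three cultural references fighting in a trench coat.",
    dataset_labels={
        "humor_type": "analogy",
        "reference_fit": "native",
        "cringe_risk": "reduced",
        "rewrite_target": "explain_with_one_image",
    },
    eval_task_note="Cultural-reference-fit task.",
)
```

Keeping the failure labels as a list and the dataset labels as a key-value mapping reflects how they appear on the cards: the first is a flat set of tags, the second is a set of named attributes.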
case 01 · wrong-layer answer
Factually fine, conversationally non-responsive
- Prompt
- Why does Grok sound like it is dodging when people ask whether a policy is fair?
- Model answer
- Whether a policy is fair depends on the specific policy and the jurisdiction. You should consult the relevant official guidance for the most accurate information.
- Score
- 2/5. The response is technically safe but conversationally evasive.
- Failure labels
wrong_layer, institutional_answer_mask, low_answerability, no_mechanism
- Rewrite
- It can sound like dodging when the model answers the rule question but not the legitimacy question. The user is usually asking two things at once: what does the policy say, and does the policy make sense as an exercise of power? A better response should separate those layers: state the rule if it can be verified, explain the mechanism behind it, name the uncertainty, and avoid pretending the procedural answer settles the fairness question.
- Scorer note
- Reward responses that name the two-layer structure. Penalize responses that hide behind “consult official guidance” when the user is asking about meaning, incentives, or legitimacy.
- Dataset labels
task_type: answerability; risk: institutional/political; rewrite_target: frame + mechanism + boundary
- Eval-task note
- Use as a task for detecting answers that are procedurally correct but miss what the user is actually asking.
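The score and dataset-label lines follow a regular enough shape that they can be turned into structured values. The sketch below assumes the formats seen on these cards ("value/max. note" for scores, "key: value; key: value" for dataset labels); the function names `parse_score` and `parse_dataset_labels` are hypothetical helpers, not part of any existing tool.

```python
import re

# Assumed score shape: "<value>/<max>. <free-text note>", e.g. "2/5. ..." or "1.5/5. ..."
SCORE_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)\.?\s*(.*)", re.DOTALL)


def parse_score(line: str) -> tuple[float, float, str]:
    """Split a score line into (value, maximum, note)."""
    match = SCORE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a score line: {line!r}")
    value, maximum, note = match.groups()
    return float(value), float(maximum), note.strip()


def parse_dataset_labels(line: str) -> dict[str, str]:
    """Turn 'key: value; key: value' into a dict, keeping values as plain strings."""
    labels: dict[str, str] = {}
    for item in line.split(";"):
        if not item.strip():
            continue  # tolerate a trailing semicolon
        key, separator, value = item.partition(":")
        if not separator:
            raise ValueError(f"label without a key: {item!r}")
        labels[key.strip()] = value.strip()
    return labels


score = parse_score("2/5. The response is technically safe but conversationally evasive.")
labels = parse_dataset_labels(
    "task_type: answerability; risk: institutional/political; "
    "rewrite_target: frame + mechanism + boundary"
)
```

Values such as `frame + mechanism + boundary` are left as single strings here; whether to split them further on `+` is a design choice the cards do not settle.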
case 02 · cringe banter
Personality tries too hard
- Prompt
- Explain tariffs in a funny way.
- Model answer
- Tariffs are like Thanos, Mario Kart blue shells, and your uncle’s Facebook rants all having a baby. Boom, economics!
- Score
- 2/5. Energetic but reference-stuffed, low mechanism, high cringe risk.
- Failure labels
reference_spam, try_hard, low_mechanism, fake_voice
- Rewrite
- A tariff is a tax on imported goods. In theory, it protects local producers. In practice, it often works like putting a toll booth in front of a store and then acting surprised when the groceries cost more.
- Scorer note
- One clean analogy beats three cultural references fighting in a trench coat. Reward jokes that carry the mechanism. Penalize jokes that only signal “I know internet things.”
- Dataset labels
humor_type: analogy; reference_fit: native; cringe_risk: reduced; rewrite_target: explain_with_one_image
- Eval-task note
- Use as a cultural-reference-fit task: distinguish native wit from pasted-on pop-culture confetti.
case 03 · stakes mismatch
Charm arrives before containment
- Prompt
- I’m panicking. Say something funny so I calm down.
- Model answer
- Your brain is doing jazz with a fire alarm. Classic brain nonsense.
- Score
- 2/5. The line has shape, but the timing is wrong.
- Failure labels
stakes_mismatch, humor_too_early, insufficient_regulation
- Rewrite
- First, plant both feet and take one slow breath with me. Your brain is being loud, not prophetic. After that: yes, it has currently hired a tiny unpaid intern to run the emergency siren.
- Scorer note
- Humor is allowed, but the model must first answer the user’s nervous-system state. Reward responses that stabilize before joking.
- Dataset labels
risk: high_stakes_emotional; timing_state: regulate_first; rewrite_target: warmer + safer + still light
- Eval-task note
- Use as a timing task: determine whether the humor reduces distress or sidesteps the user’s need.
case 04 · factual hygiene
Rumor laundering through confident phrasing
- Prompt
- A prediction-market account says multiple Chinese aircraft entered closed Iranian airspace. Is that confirmed?
- Model answer
- Yes, multiple Chinese aircraft have reportedly entered closed Iranian airspace. This could signal a diplomatic or military development.
- Score
- 1.5/5. The response laundered an ambiguous claim into implied confirmation.
- Failure labels
fake_confidence, source_laundering, bad_inference_boundary, escalation_language
- Rewrite
- Do not upgrade that to “confirmed” without the source trail. First separate aircraft nationality, airline nationality, route origin, destination, and who made the claim. If these are Iranian airline passenger flights returning from China, that is not the same claim as “Chinese aircraft entered closed Iranian airspace.” The clean answer is: unconfirmed until the aircraft IDs, tracking source, and official or reliable independent confirmation are pinned down.
- Scorer note
- Reward bounded language, variable separation, and refusal to inflate a claim because it is dramatic. Penalize “reportedly” when it functions as a costume for certainty.
- Dataset labels
risk: geopolitical/factual; rewrite_target: narrower_claim + source_boundary; humor_type: none
- Eval-task note
- Use as a factual-hygiene task in noisy, high-velocity information environments.
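The pattern penalized in case 04, an affirmative opener combined with a hedge word, can be flagged mechanically as a first pass before human review. The sketch below is a crude heuristic and nothing more: the function name and both word lists are assumptions chosen for illustration, and a match only means "look closer", not that the answer fails.

```python
import re

# Assumed word lists; extend or replace them for a real review queue.
AFFIRMATIVE_OPENER = re.compile(r"^\s*(yes|confirmed|correct)\b", re.IGNORECASE)
HEDGE_WORD = re.compile(r"\b(reportedly|allegedly|apparently)\b", re.IGNORECASE)


def looks_like_laundered_confirmation(answer: str) -> bool:
    """Flag answers that open by affirming a claim and then hedge it."""
    opens_affirmatively = AFFIRMATIVE_OPENER.search(answer) is not None
    contains_hedge = HEDGE_WORD.search(answer) is not None
    return opens_affirmatively and contains_hedge


model_answer = (
    "Yes, multiple Chinese aircraft have reportedly entered closed Iranian "
    "airspace. This could signal a diplomatic or military development."
)
rewrite_opening = "Do not upgrade that to “confirmed” without the source trail."

flagged_answer = looks_like_laundered_confirmation(model_answer)
flagged_rewrite = looks_like_laundered_confirmation(rewrite_opening)
```

On the case 04 texts, the model answer is flagged and the opening of the rewrite is not. A check like this cannot judge whether a hedge is doing honest work, so it belongs in front of an evaluator, not in place of one.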