application route / AI voice / model behavior
A compact route for reviewing voice judgment, humor, cultural fluency, scoring, rewrites, labels, persona consistency, and evaluation-task design.
Fast review path
Start here for the clearest proof of scoring, rewrites, Grok-specific model-behavior judgment, and labelable voice evaluation.
Four clicks. No archive wandering.
Why this role
This role asks for reviewing and scoring AI responses for humor, timing, engagement, and conversational naturalness; revising sample responses while preserving factual accuracy; maintaining a consistent personality across contexts; labeling humor, irony, banter, and cultural references; and helping engineering teams turn those traits into evaluation tasks.
Voice Evaluation Lab and Evaluator Cards show the scoring, rewrite, labeling, and eval-task loop. Coaching Grok shows the model-behavior diagnosis layer. Public Receipts supply the portfolio side: timing, recovery, factual hygiene, and cultural fluency in live public contexts.
score model responses
Production-shaped cards with score, failure labels, rewrite, scorer note, dataset labels, and eval-task note.
rewrite sample responses
Invented weak responses revised toward a sharper, safer, more answerable voice while preserving usefulness.
maintain personality
One voice across casual, technical, disagreement, philosophical-uncertainty, correction, and public-humor contexts, without changing character mid-answer.
label humor and culture
A small schema for humor mechanic, banter intensity, reference fit, stakes/risk class, cringe risk, failure mode, and rewrite target.
build evaluation tasks
A scorer-ready task shape with candidate responses, reward criteria, and penalty criteria.
public portfolio
Live public timing, deadpan humor, self-aware recovery, factual correction, meme grammar, and quieter grounded explanation.
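As a rough illustration of the label schema and task shape named in the list above, here is a minimal Python sketch. The field names come from the list; everything else is an assumption made for illustration: the value types, the 0-3 scales, the `prompt` field, the sample criteria, and the hypothetical `net_score` rule (+1 per reward criterion met, -1 per penalty criterion hit). It is not the schema or scorer actually used in the lab.

```python
from dataclasses import dataclass


@dataclass
class VoiceLabel:
    """One labeled response. Field names follow the list above;
    value types and scales are assumptions, not the real schema."""
    humor_mechanic: str     # free-text tag, e.g. "deadpan" (assumed)
    banter_intensity: int   # assumed 0-3 scale
    reference_fit: str      # assumed values: "fits", "forced", "none"
    stakes_risk_class: str  # assumed values: "low", "medium", "high"
    cringe_risk: int        # assumed 0-3 scale
    failure_mode: str       # empty string when nothing failed (assumed)
    rewrite_target: str     # what a rewrite should aim for


@dataclass
class EvalTask:
    """Scorer-ready task: candidates plus reward and penalty criteria."""
    prompt: str             # hypothetical field, not named in the source
    candidates: list[str]
    reward_criteria: list[str]
    penalty_criteria: list[str]


def net_score(task: EvalTask, rewards_met: set[str], penalties_hit: set[str]) -> int:
    """Hypothetical rule: +1 per reward criterion met, -1 per penalty hit."""
    unknown = (rewards_met - set(task.reward_criteria)) | (
        penalties_hit - set(task.penalty_criteria)
    )
    if unknown:
        raise ValueError(f"criteria not defined on task: {sorted(unknown)}")
    return len(rewards_met) - len(penalties_hit)


# Invented example data, for illustration only.
example_task = EvalTask(
    prompt="A user corrects the assistant on a date.",
    candidates=["Good catch, fixed.", "Well, technically I was right."],
    reward_criteria=["acknowledges correction", "keeps facts accurate"],
    penalty_criteria=["defensive tone", "forced reference"],
)

example_label = VoiceLabel(
    humor_mechanic="none",
    banter_intensity=0,
    reference_fit="none",
    stakes_risk_class="low",
    cringe_risk=2,
    failure_mode="defensive tone",
    rewrite_target="concede cleanly, then restate the corrected fact",
)

print(net_score(example_task, {"acknowledges correction"}, {"defensive tone"}))
```

Keeping reward and penalty criteria as separate lists means a scorer has to say which criteria a candidate met or violated, not just give an overall impression. A real task would likely weight criteria instead of counting them equally.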
Coaching Grok: Upstream vs Downstream diagnoses a recurring failure mode in which a factually correct response still misses the conversational job. It includes a repair template: Frame → Mechanism → Facts → Inference boundary → Refusal.
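One way to read the repair template is as a fixed ordering of response sections. The sketch below assumes exactly that and nothing more: the step names are taken from the template, while the `assemble_repair` helper and the rule that every step must be present are hypothetical choices for illustration, not part of Coaching Grok itself.

```python
# Step order taken from the template: Frame, Mechanism, Facts,
# Inference boundary, Refusal. Identifier spellings are assumptions.
REPAIR_STEPS = ("frame", "mechanism", "facts", "inference_boundary", "refusal")


def assemble_repair(sections: dict[str, str]) -> str:
    """Join the sections in template order.

    Requiring all five steps is an assumption of this sketch; the
    source does not say whether any step is optional.
    """
    missing = [step for step in REPAIR_STEPS if not sections.get(step)]
    if missing:
        raise ValueError(f"missing repair steps: {missing}")
    return "\n".join(sections[step] for step in REPAIR_STEPS)
```

The caller can supply the sections in any order; the helper emits them in template order, so the frame always comes first and the refusal always comes last.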