Rubric Format Reference
The evaluation rubric defines how a skill is scored against gold standards. It’s a YAML file that specifies dimensions, weights, criteria, and scoring parameters.
Schema
name: <skill-name>-rubric   # identifier
dimensions:                 # 5-10 scored dimensions
  - name: <string>          # lowercase, underscores (e.g., content_accuracy)
    weight: <float>         # 0.0-1.0, must sum to 1.0 across all dimensions
    scale: "1-10"           # scoring scale
    criteria: <string>      # falsifiable statement of what this measures
target_score: <float>       # normalized 0.0-1.0 (e.g., 0.85 = 85%)
max_iterations: <int>       # experiment budget (default: 20)
plateau_window: <int>       # consecutive failures before plateau alert (default: 5)

Universal Dimensions
Every rubric includes these four dimensions. Adjust weights per domain (default total: 0.60).
| Dimension | Default Weight | Criteria |
|---|---|---|
| correctness | 0.20 | Instructions are technically accurate and executable. No hallucinated APIs, wrong syntax, or incorrect behavior. |
| completeness | 0.15 | All necessary sections present. Edge cases addressed. No critical workflow gaps. |
| clarity | 0.15 | A naive agent can follow without ambiguity. Consistent terminology. Concrete examples. |
| consistency | 0.10 | Aligns with existing codebase conventions, naming patterns, and integration points. |
Domain-Specific Dimensions
Add 3-6 domain-specific dimensions using the remaining weight (default: 0.40).
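The schema constraints and the weight budget described above can be checked mechanically before a rubric is used. The following is a minimal sketch, assuming the rubric has already been parsed from YAML into a plain Python dict. The function name `validate_rubric` and the exact name pattern are illustrative assumptions, not part of the rubric format.

```python
import math
import re

# The four universal dimensions every rubric is expected to include.
UNIVERSAL_DIMENSIONS = {"correctness", "completeness", "clarity", "consistency"}

# Assumed reading of "lowercase, underscores": a lowercase letter followed by
# lowercase letters, digits, or underscores.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems found; an empty list means the rubric passed."""
    problems = []
    dimensions = rubric.get("dimensions", [])
    names = [d.get("name", "") for d in dimensions]

    if not 5 <= len(dimensions) <= 10:
        problems.append(f"expected 5-10 dimensions, found {len(dimensions)}")

    for name in names:
        if not NAME_PATTERN.match(name):
            problems.append(f"dimension name {name!r} is not lowercase with underscores")

    missing = UNIVERSAL_DIMENSIONS - set(names)
    if missing:
        problems.append(f"missing universal dimensions: {sorted(missing)}")

    for d in dimensions:
        weight = d.get("weight", 0.0)
        if not 0.0 <= weight <= 1.0:
            problems.append(f"weight {weight} for {d.get('name')!r} is outside 0.0-1.0")

    # Compare with a tolerance because the weights are floats.
    total = sum(d.get("weight", 0.0) for d in dimensions)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")

    return problems
```

Returning a list of problems rather than raising on the first one lets a rubric author fix everything in a single pass.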
Writing Good Criteria
Criteria must be falsifiable — a judge can determine TRUE or FALSE by examining the artifact.
Good:
- “Every gold standard pattern appears in the skill output”
- “No org-specific names or hardcoded references appear in the body”
- “All file references resolve to existing files”
- “Output length is within 20% of reference length”
Bad:
- “The skill is well-written” (subjective)
- “Output is high quality” (vague)
- “Follows best practices” (which practices?)
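One informal test of falsifiability is whether the criterion could, at least in principle, be written as a function that returns True or False. As a rough illustration, the length criterion from the Good list might look like the sketch below. The function name and the use of character count as the length measure are assumptions made for this example.

```python
def within_length_tolerance(output: str, reference: str, tolerance: float = 0.20) -> bool:
    """True if the output length deviates from the reference length by at most `tolerance`."""
    if not reference:
        # With an empty reference, only an empty output can match.
        return not output
    deviation = abs(len(output) - len(reference)) / len(reference)
    return deviation <= tolerance
```

A criterion such as "The skill is well-written" has no comparable predicate, which is what makes it unsuitable for a rubric.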
Overall Score Computation
overall_score = sum(dimension_score / scale_max * weight) for each dimension

Example with a 1-10 scale:
- correctness: 8/10 * 0.20 = 0.16
- completeness: 7/10 * 0.15 = 0.105
- clarity: 9/10 * 0.15 = 0.135
- consistency: 8/10 * 0.10 = 0.08
- coverage: 7/10 * 0.20 = 0.14
- actionability: 8/10 * 0.20 = 0.16
- overall_score = 0.78
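The same computation can be expressed as a short function. This is a sketch that assumes scores and weights are held in dicts keyed by dimension name and that every dimension uses the same scale. The function and variable names are illustrative.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float],
                  scale_max: float = 10.0) -> float:
    """Weighted sum of per-dimension scores, normalized to the 0.0-1.0 range."""
    return sum(scores[name] / scale_max * weight for name, weight in weights.items())


# Values taken from the worked example above.
scores = {"correctness": 8, "completeness": 7, "clarity": 9,
          "consistency": 8, "coverage": 7, "actionability": 8}
weights = {"correctness": 0.20, "completeness": 0.15, "clarity": 0.15,
           "consistency": 0.10, "coverage": 0.20, "actionability": 0.20}

print(round(overall_score(scores, weights), 3))  # 0.78
```

Iterating over `weights` rather than `scores` means a dimension missing from the judge output raises a `KeyError` instead of being silently dropped.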
Example Rubric
name: code-behavior-rubric
dimensions:
  - name: correctness
    weight: 0.20
    scale: "1-10"
    criteria: "Rules are technically accurate. Code examples compile and run correctly."
  - name: completeness
    weight: 0.10
    scale: "1-10"
    criteria: "All relevant coding scenarios covered. Edge cases addressed."
  - name: clarity
    weight: 0.15
    scale: "1-10"
    criteria: "Rules are unambiguous. A naive agent follows them without misinterpretation."
  - name: consistency
    weight: 0.10
    scale: "1-10"
    criteria: "Matches existing project conventions and coding standards."
  - name: coverage
    weight: 0.15
    scale: "1-10"
    criteria: "Every pattern from gold standard source material appears in the skill."
  - name: actionability
    weight: 0.15
    scale: "1-10"
    criteria: "Rules are concrete and executable, not abstract platitudes."
  - name: portability
    weight: 0.15
    scale: "1-10"
    criteria: "Works across projects without hardcoded names or project-specific assumptions."
target_score: 0.85
max_iterations: 20
plateau_window: 5

See rubric-templates.md in the factory skill for additional domain templates (workflow skills, pipeline skills).
Judge Variance
Observations from case-study experiments run with multiple judge models:
- Typical judge variance: 0.2-0.3 points on a 1-10 scale
- Improvements must exceed judge variance to be meaningful
- Different judge models have different characteristics:
  - Lower variance models give more consistent but potentially less nuanced scores
  - Stricter models produce lower baselines but may detect subtler issues
- Multiple judges improve confidence but increase evaluation cost
When plateau detection triggers, check whether the score differences across the plateau fall within judge variance. If they do, the skill may already be at the measurement ceiling.
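Both checks can be sketched in a few lines. The example assumes scores on the 1-10 scale and uses 0.3, the upper end of the range reported above, as the default variance. The function names and the use of max-minus-min spread as the plateau test are assumptions for illustration, not a prescribed method.

```python
def exceeds_judge_variance(baseline: float, candidate: float,
                           judge_variance: float = 0.3) -> bool:
    """True if the candidate beats the baseline by more than judge variance."""
    return candidate - baseline > judge_variance


def plateau_within_variance(recent_scores: list[float],
                            judge_variance: float = 0.3) -> bool:
    """True if the spread of scores in the plateau window is no larger than judge variance."""
    return max(recent_scores) - min(recent_scores) <= judge_variance


# A 0.2-point gain is indistinguishable from judge noise; a 0.5-point gain is not.
print(exceeds_judge_variance(7.0, 7.2))  # False
print(exceeds_judge_variance(7.0, 7.5))  # True

# Five scores that differ by at most 0.2 points sit inside the noise band.
print(plateau_within_variance([8.1, 8.2, 8.0, 8.2, 8.1]))  # True
```

When `plateau_within_variance` returns True, further iterations are unlikely to produce a measurable gain with the same judge; averaging several judge runs is one option for narrowing the noise band, at higher evaluation cost.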