Skip to Content
DocumentationRubric Format

Rubric Format Reference

The evaluation rubric defines how a skill is scored against gold standards. It’s a YAML file that specifies dimensions, weights, criteria, and scoring parameters.

Schema

name: <skill-name>-rubric # identifier dimensions: # 5-10 scored dimensions - name: <string> # lowercase, underscores (e.g., content_accuracy) weight: <float> # 0.0-1.0, must sum to 1.0 across all dimensions scale: "1-10" # scoring scale criteria: <string> # falsifiable statement of what this measures target_score: <float> # normalized 0.0-1.0 (e.g., 0.85 = 85%) max_iterations: <int> # experiment budget (default: 20) plateau_window: <int> # consecutive failures before plateau alert (default: 5)

Universal Dimensions

Every rubric includes these four dimensions. Adjust weights per domain (default total: 0.60).

DimensionDefault WeightCriteria
correctness0.20Instructions are technically accurate and executable. No hallucinated APIs, wrong syntax, or incorrect behavior.
completeness0.15All necessary sections present. Edge cases addressed. No critical workflow gaps.
clarity0.15A naive agent can follow without ambiguity. Consistent terminology. Concrete examples.
consistency0.10Aligns with existing codebase conventions, naming patterns, and integration points.

Domain-Specific Dimensions

Add 3-6 domain-specific dimensions using the remaining weight (default: 0.40).

Writing Good Criteria

Criteria must be falsifiable — a judge can determine TRUE or FALSE by examining the artifact.

Good:

  • “Every gold standard pattern appears in the skill output”
  • “No org-specific names or hardcoded references appear in the body”
  • “All file references resolve to existing files”
  • “Output length is within 20% of reference length”

Bad:

  • “The skill is well-written” (subjective)
  • “Output is high quality” (vague)
  • “Follows best practices” (which practices?)

Overall Score Computation

overall_score = sum(dimension_score / scale_max * weight) for each dimension

Example with a 1-10 scale:

  • correctness: 8/10 * 0.20 = 0.16
  • completeness: 7/10 * 0.15 = 0.105
  • clarity: 9/10 * 0.15 = 0.135
  • consistency: 8/10 * 0.10 = 0.08
  • coverage: 7/10 * 0.20 = 0.14
  • actionability: 8/10 * 0.20 = 0.16
  • overall_score = 0.78

Example Rubric

name: code-behavior-rubric dimensions: - name: correctness weight: 0.20 scale: "1-10" criteria: "Rules are technically accurate. Code examples compile and run correctly." - name: completeness weight: 0.10 scale: "1-10" criteria: "All relevant coding scenarios covered. Edge cases addressed." - name: clarity weight: 0.15 scale: "1-10" criteria: "Rules are unambiguous. A naive agent follows them without misinterpretation." - name: consistency weight: 0.10 scale: "1-10" criteria: "Matches existing project conventions and coding standards." - name: coverage weight: 0.15 scale: "1-10" criteria: "Every pattern from gold standard source material appears in the skill." - name: actionability weight: 0.15 scale: "1-10" criteria: "Rules are concrete and executable, not abstract platitudes." - name: portability weight: 0.15 scale: "1-10" criteria: "Works across projects without hardcoded names or project-specific assumptions." target_score: 0.85 max_iterations: 20 plateau_window: 5

See rubric-templates.md  in the factory skill for additional domain templates (workflow skills, pipeline skills).

Judge Variance

From case study experience (extensive experiments with multiple judge models):

  • Typical judge variance: 0.2-0.3 points on a 1-10 scale
  • Improvements must exceed judge variance to be meaningful
  • Different judge models have different characteristics:
    • Lower variance models give more consistent but potentially less nuanced scores
    • Stricter models produce lower baselines but may detect subtler issues
  • Multiple judges improve confidence but increase evaluation cost

When plateau detection triggers, check whether the plateau falls within judge variance — the skill may already be at the measurement ceiling.