I/O Contract

The one page that tells you what to put in and what you get out when you build a skill with this harness. If you only read one reference, read this one.

A build is a single self-contained folder, builds/<skill-name>/, with three zones: input/ (you provide) → work/ (the factory maintains) → output/ (you receive). See workspace-layout.md for the full file-by-file layout. (self-test/ is the factory’s own regression test and uses a separate evaluation layout, not the three-zone build layout.)

INPUT — what you provide

Drop your materials into builds/<skill-name>/input/ in any structure that’s natural to you. You do not author an index by hand — the factory scans input/ during the interview and derives the manifest for you to confirm.

You provide	What it is	Required?
Gold standards	3+ exemplars of “what good looks like”: input/output pairs, reference artifacts, or previously solved tasks	Yes (min 3)
Study materials	Docs, code, transcripts, specs, style guides, an existing skill to upgrade	Recommended
A judge (env vars)	`JUDGE_MODEL`, `JUDGE_API_KEY`, `JUDGE_API_BASE` for an OpenAI-compatible endpoint	Optional — falls back to deterministic-only scoring

Gold standards are the benchmark; they are never modified during the build. Fewer than three is a risk the factory will warn you about.

CONTRACT — what the factory maintains in `work/`

These are generated and owned by the factory, but they define the measurable bar, so they are part of the contract. Inspect or correct them at the phase boundaries.

Artifact	Purpose	Spec
`work/manifest.yaml`	Factory-derived index of your gold standards, tagged train/validation/test	confirm during Phase 1
`work/evaluation/rubric.yaml`	Scored dimensions, weights, criteria, target score	rubric-format.md
`work/evaluation/evaluate.sh`	Runs the skill on a case and emits measurements	metric-protocol.md

The evaluation contract in one line: ./work/evaluation/evaluate.sh <case-id> prints METRIC <name>=<value> lines to stdout, including a normalized METRIC overall_score=<0.0–1.0> as the primary metric. That is the only interface the autoresearch loop needs.

OUTPUT — what you get

You receive	Where	Notes
The finished skill	`builds/<skill-name>/output/<skill-name>/`	`SKILL.md` (+ `references/`), in its own named dir — publish-ready
The autoresearch journal	`builds/<skill-name>/work/experiments/`	`results.tsv`, `autoresearch.jsonl`, `run.log` — inspectable record of every experiment
The verification record	`builds/<skill-name>/BENCHMARK.md`	Final panel pass/fail scores (Phase 5)

To publish the result, copy the named skill dir straight into a skills repo:


cp -r builds/<skill-name>/output/<skill-name> <your-skills-repo>/skills/

or install it directly with npx skills.

The flow at a glance


input/  ──►  [interview → research → draft → autoresearch → verify]  ──►  output/<skill-name>/
(you)              the factory works in work/                              (publish-ready)

The finished skill follows the official skill-authoring rules (see .agents/skills/create-skill-autoresearch/references/skill-authoring-best-practices.md): name ≤ 64 chars and free of reserved words, a third-person description stating what + when, body < 500 lines, references one level deep.

I/O Contract

INPUT — what you provide

CONTRACT — what the factory maintains in work/

OUTPUT — what you get

The flow at a glance

CONTRACT — what the factory maintains in `work/`