
I/O Contract

The one page that tells you what to put in and what you get out when you build a skill with this harness. If you only read one reference, read this one.

A build is a single self-contained folder, builds/<skill-name>/, with three zones: input/ (you provide) → work/ (the factory maintains) → output/ (you receive). See workspace-layout.md for the full file-by-file layout. (self-test/ is the factory’s own regression test and uses a separate evaluation layout, not the three-zone build layout.)


INPUT — what you provide

Drop your materials into builds/<skill-name>/input/ in any structure that’s natural to you. You do not author an index by hand — the factory scans input/ during the interview and derives the manifest for you to confirm.

| You provide | What it is | Required? |
| --- | --- | --- |
| Gold standards | 3+ exemplars of “what good looks like”: input/output pairs, reference artifacts, or previously solved tasks | Yes (min 3) |
| Study materials | Docs, code, transcripts, specs, style guides, an existing skill to upgrade | Recommended |
| A judge (env vars) | JUDGE_MODEL, JUDGE_API_KEY, JUDGE_API_BASE for an OpenAI-compatible endpoint | Optional; falls back to deterministic-only scoring |

Gold standards are the benchmark; they are never modified during the build. Fewer than three is a risk the factory will warn you about.
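The input requirements above can be checked before starting a build. The following is a minimal sketch, not part of the harness: the function names are hypothetical, and it assumes that a judge counts as configured only when all three variables are set and non-empty.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Variable names come from the table above; the helpers are illustrative only.
JUDGE_VARS = ("JUDGE_MODEL", "JUDGE_API_KEY", "JUDGE_API_BASE")
MIN_GOLD_STANDARDS = 3


def scoring_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "judge" if every judge variable is non-empty, else "deterministic-only".

    Assumption: a partially configured judge is treated like no judge at all.
    """
    if all(env.get(var) for var in JUDGE_VARS):
        return "judge"
    return "deterministic-only"


def gold_standard_warning(count: int) -> Optional[str]:
    """Return a warning message when fewer than three gold standards are present."""
    if count < MIN_GOLD_STANDARDS:
        return f"only {count} gold standard(s) found; at least {MIN_GOLD_STANDARDS} are expected"
    return None
```

Calling `scoring_mode()` with no arguments reads the current process environment, so it reports which scoring mode a build started from this shell would likely use under the assumption stated above.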


CONTRACT — what the factory maintains in work/

These are generated and owned by the factory, but they define the measurable bar, so they are part of the contract. Inspect or correct them at the phase boundaries.

| Artifact | Purpose | Spec |
| --- | --- | --- |
| work/manifest.yaml | Factory-derived index of your gold standards, tagged train/validation/test | confirm during Phase 1 |
| work/evaluation/rubric.yaml | Scored dimensions, weights, criteria, target score | rubric-format.md |
| work/evaluation/evaluate.sh | Runs the skill on a case and emits measurements | metric-protocol.md |

The evaluation contract in one line: ./work/evaluation/evaluate.sh <case-id> prints METRIC <name>=<value> lines to stdout, including a normalized METRIC overall_score=<0.0–1.0> as the primary metric. That is the only interface the autoresearch loop needs.
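To make the METRIC line format concrete, here is a minimal sketch of a consumer that reads the evaluator's stdout. It is not the harness's own parser: the function names are hypothetical, and it assumes that metric values are numeric and that non-METRIC lines (such as log output) can be ignored. metric-protocol.md remains the authoritative description.

```python
import re

# Matches lines of the form "METRIC <name>=<value>"; anything else is ignored.
_METRIC_LINE = re.compile(r"^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)\s*$")


def parse_metrics(stdout: str) -> dict[str, float]:
    """Collect METRIC lines from evaluate.sh output into a name -> value mapping."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match is None:
            continue  # not a METRIC line, e.g. log output
        name, raw_value = match.groups()
        try:
            metrics[name] = float(raw_value)
        except ValueError:
            continue  # assumption: non-numeric values are skipped
    return metrics


def primary_score(metrics: dict[str, float]) -> float:
    """Return overall_score, checking that it is present and within 0.0-1.0."""
    if "overall_score" not in metrics:
        raise KeyError("evaluate.sh output has no METRIC overall_score line")
    score = metrics["overall_score"]
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"overall_score {score} is outside 0.0-1.0")
    return score
```

The text passed to `parse_metrics` could be captured with `subprocess.run(["./work/evaluation/evaluate.sh", case_id], capture_output=True, text=True, check=True).stdout`, where `case_id` is whichever case you want to score.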


OUTPUT — what you get

| You receive | Where | Notes |
| --- | --- | --- |
| The finished skill | builds/<skill-name>/output/<skill-name>/ | SKILL.md (+ references/), in its own named dir, publish-ready |
| The autoresearch journal | builds/<skill-name>/work/experiments/ | results.tsv, autoresearch.jsonl, run.log: an inspectable record of every experiment |
| The verification record | builds/<skill-name>/BENCHMARK.md | Final panel pass/fail scores (Phase 5) |

To publish the result, copy the named skill dir straight into a skills repo:

cp -r builds/<skill-name>/output/<skill-name> <your-skills-repo>/skills/

or install it directly with npx skills.


The flow at a glance

input/  ──►  [interview → research → draft → autoresearch → verify]  ──►  output/<skill-name>/
(you)                      the factory works in work/                     (publish-ready)

The finished skill follows the official skill-authoring rules (see .agents/skills/create-skill-autoresearch/references/skill-authoring-best-practices.md): name ≤ 64 chars and free of reserved words, a third-person description stating what + when, body < 500 lines, references one level deep.
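The mechanically checkable parts of these rules (name length, reserved words, body length) can be sketched as a small validator. This is an illustration under stated assumptions, not the factory's verification step: `check_skill` is a hypothetical helper, the reserved-word list is not given on this page and so must be supplied by the caller, and matching reserved words as case-insensitive substrings is a guess. The description and reference-depth rules are not checked here.

```python
def check_skill(
    name: str,
    body: str,
    reserved_words: frozenset[str] = frozenset(),
) -> list[str]:
    """Return a list of rule violations; an empty list means no problems were found."""
    problems: list[str] = []

    # Rule: name is at most 64 characters.
    if len(name) > 64:
        problems.append(f"name is {len(name)} chars (limit 64)")

    # Rule: name is free of reserved words (list supplied by the caller).
    lowered = name.lower()
    for word in sorted(reserved_words):
        if word.lower() in lowered:
            problems.append(f"name contains reserved word {word!r}")

    # Rule: body is under 500 lines.
    line_count = len(body.splitlines())
    if line_count >= 500:
        problems.append(f"body is {line_count} lines (must be under 500)")

    return problems
```

Here `body` is the text of SKILL.md; whether any metadata at the top of the file counts toward the 500-line limit is not stated here, so pass in whichever portion the best-practices document defines as the body.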