Agent Skills Harness
Build production-grade agent skills. Benchmarked against gold standards, autonomously improved, and verified by an independent multi-agent panel.
Get started → · GitHub · Install the skill
The pipeline
The core skill, create-skill-autoresearch, runs the whole skill-creation lifecycle as five phases:
| Phase | What happens |
|---|---|
| 1 · Interview | Discover purpose, gold standards, and scope |
| 2 · Research | Study domain materials with parallel subagents |
| 3 · Draft | Design-first, following the official skill-authoring rules |
| 4 · Autoresearch | Iterate against an LLM-as-judge (or a real-world metric) until the target is hit |
| 5 · Verify | Independent panel with a devil’s advocate, reaching consensus before shipping |
It extends the official single-pass skill creators rather than replacing them — adding the research dossier, benchmarking, the improvement loop, and verification a one-shot generator can’t.
Why it’s different
- Gold-standard benchmarking — quality is a measured number, not a vibe.
- Autonomous improvement — an autoresearch loop edits, measures, keeps wins, reverts regressions.
- Independent verification — a fresh-context panel (Quality · Utility · Devil’s Advocate) signs off via evidence-based consensus.
In a blind end-to-end benchmark, a factory-built skill scored an adjudicated ≈ 0.84 against a human reference (target 0.80). See the benchmark.
Quick start
git clone https://github.com/a-tokyo/agent-skills-harness
cd agent-skills-harnessThen open the repo in your AI coding agent and ask:
“Build me a skill for <your domain> using the create-skill-autoresearch factory.”
Drop your examples into input/, and the factory researches, drafts, improves, and verifies a skill into
output/. Read the I/O Contract for exactly what goes in and comes out.
Install just the skill
The factory is also published standalone for npx:
npx skills add a-tokyo/agent-skills --skill create-skill-autoresearchThe harness is the batteries-included environment (companion skills, self-test, benchmark); the published skill is the portable on-ramp.