Agent Skills Harness

Build production-grade agent skills. Benchmarked against gold standards, autonomously improved, and verified by an independent multi-agent panel.

The pipeline

The core skill, create-skill-autoresearch, runs the whole skill-creation lifecycle as five phases:

| Phase | What happens |
| --- | --- |
| 1 · Interview | Discover purpose, gold standards, and scope |
| 2 · Research | Study domain materials with parallel subagents |
| 3 · Draft | Design-first, following the official skill-authoring rules |
| 4 · Autoresearch | Iterate against an LLM-as-judge (or a real-world metric) until the target is hit |
| 5 · Verify | Independent panel with a devil’s advocate, reaching consensus before shipping |

It extends the official single-pass skill creators rather than replacing them, adding the research dossier, benchmarking, improvement loop, and verification that a one-shot generator cannot provide.
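As a rough illustration of the improvement loop in phase 4, the sketch below proposes an edit, scores it, keeps it if the score improves, and otherwise discards it, stopping once the target is reached. It is a minimal sketch under stated assumptions, not the harness's actual implementation: `autoresearch_loop`, `toy_score`, `make_proposer`, and the checklist in `REQUIRED` are all hypothetical names, and the toy scorer stands in for an LLM-as-judge or real-world metric.

```python
from itertools import cycle

# Hypothetical checklist standing in for a gold-standard rubric.
REQUIRED = frozenset({"purpose", "scope", "examples", "edge cases", "verification"})


def toy_score(skill):
    """Toy metric: rubric coverage, minus a penalty for off-rubric sections."""
    coverage = len(skill & REQUIRED) / len(REQUIRED)
    noise = len(skill - REQUIRED)
    return coverage - 0.1 * noise


def make_proposer(ideas):
    """Return a function that proposes adding the next idea to the skill."""
    pool = cycle(ideas)

    def propose(skill):
        return skill | {next(pool)}

    return propose


def autoresearch_loop(draft, propose_edit, score, target, max_iters=50):
    """Keep edits that raise the score, drop the rest, stop at the target."""
    best, best_score = draft, score(draft)
    history = [best_score]
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = propose_edit(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # keep the win
        # Otherwise the candidate is discarded, which reverts the regression.
        history.append(best_score)
    return best, best_score, history


proposer = make_proposer(
    ["purpose", "filler", "scope", "examples", "padding", "edge cases", "verification"]
)
best, best_score, history = autoresearch_loop(
    frozenset(), proposer, toy_score, target=0.8
)
print(sorted(best), round(best_score, 2))
```

Because only strict improvements are kept, the recorded score never decreases from one iteration to the next, which is the property that lets the loop run unattended.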

Why it’s different

  • Gold-standard benchmarking: quality is a measured number, not a vibe.
  • Autonomous improvement: an autoresearch loop edits, measures, keeps wins, and reverts regressions.
  • Independent verification: a fresh-context panel (Quality · Utility · Devil’s Advocate) signs off via evidence-based consensus.
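One way to picture evidence-based consensus is a gate that passes only when every panel member approves and cites at least one piece of evidence. The sketch below assumes exactly that rule; the source does not specify the harness's real consensus procedure, and `Verdict` and `panel_consensus` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    reviewer: str   # e.g. "Quality", "Utility", "Devil's Advocate"
    approve: bool
    evidence: tuple  # citations backing the verdict; empty means unsupported


def panel_consensus(verdicts):
    """Assumed rule: ship only if all reviewers approve with evidence.

    Returns (ship, blockers), where blockers lists the reviewers who
    rejected the skill or gave an approval without evidence.
    """
    blockers = [v.reviewer for v in verdicts if not v.approve or not v.evidence]
    ship = bool(verdicts) and not blockers
    return ship, blockers


panel = [
    Verdict("Quality", True, ("matches rubric section 2",)),
    Verdict("Utility", True, ("solved the sample task",)),
    Verdict("Devil's Advocate", False, ("fails on empty input",)),
]
ship, blockers = panel_consensus(panel)
print(ship, blockers)
```

Treating an approval without evidence as a blocker keeps the sign-off tied to something checkable rather than to a bare vote.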

In a blind end-to-end benchmark, a factory-built skill scored an adjudicated ≈ 0.84 against a human reference (target 0.80). See the benchmark.

Quick start

git clone https://github.com/a-tokyo/agent-skills-harness
cd agent-skills-harness

Then open the repo in your AI coding agent and ask:

“Build me a skill for <your domain> using the create-skill-autoresearch factory.”

Drop your examples into input/, and the factory researches, drafts, improves, and verifies a skill, writing the result to output/. Read the I/O Contract for exactly what goes in and comes out.

Install just the skill

The factory is also published as a standalone skill that can be installed with npx:

npx skills add a-tokyo/agent-skills --skill create-skill-autoresearch

The harness is the batteries-included environment (companion skills, self-test, benchmark); the published skill is the portable on-ramp.
