Architecture
Design overview of the create-skill-autoresearch factory.
Pipeline
The factory runs a 5-phase pipeline. Every successful skill build in our case studies followed this exact pattern, regardless of domain.
Agent Roles
The factory orchestrates 4 distinct agent roles. The key architectural constraint is context isolation: the BUILDER and PANEL never share context, preventing bias.
| Role | What It Does | How It’s Spawned |
|---|---|---|
| ORCHESTRATOR | Manages phase transitions, spawns other roles, handles handoffs | The factory skill itself (the parent agent) |
| RESEARCHER | Studies domain materials, writes research notes | N parallel explore subagents |
| BUILDER | Drafts SKILL.md, runs autoresearch loop | Single subagent with autoresearch skill |
| PANEL | Independent verification and consensus | 3 parallel generalPurpose subagents |
Evaluation Architecture
The evaluation pipeline:
evaluate.shtakes a gold standard test case as input- Runs the skill on the test case input
- Uses an LLM-as-judge to compare output to the gold standard reference
- Scores each rubric dimension independently
- Emits
METRIC <dimension>=<score>lines for the autoresearch skill to parse
Consensus Protocol
The verification panel uses a structured protocol inspired by academic research on multi-agent deliberation.
Key design decisions:
- Per-criterion atomic scoring prevents halo effects (Autorubric research)
- Evidence-anchoring requires verbatim quotes for extreme scores (Rulers framework)
- Explicit adversarial assignment achieves 99.2% disagreement detection vs 48.3% for “think critically” (OpenReview research)
- Single synthesis round balances deliberation quality against token cost
Skill Integration
| Skill | Integration Type | When |
|---|---|---|
| autoresearch | Called as dependency | Phase 4: provides the experimentation loop |
| premortem | Invoked directly | Phase 5: before panel evaluation |
| handoff | Invoked directly | When context fatigues or session ends |
| production-grade | Conventions followed | Throughout (plan-of-plans, quality gates) |
| create-skill | Conventions followed | Phase 3: SKILL.md structure and format |
Design Decisions
All 18 locked design decisions are documented in thoughts/07-design-questions.md . Key ones:
- D4: 4-role agent topology (Orchestrator, Researcher, Builder, Panel)
- D5: Structured consensus with synthesis round and escalation
- D7: Two-tier loop budget (per-session + score threshold + plateau detection)
- D8: Shipped skill package vs process artifacts split
- D13: Adaptive data split (70/20/10 for 10+ cases, leave-one-out for fewer)
- D17: Enhance then call autoresearch skill (not embed)
- D18: Incremental build in 5 phases
Case Study Origins
Each factory component traces to a real-world case study:
| Component | Origin | Evidence |
|---|---|---|
| 5-phase pipeline | All case studies | Every build followed this pattern |
| METRIC protocol | Case studies + pi-autoresearch | Automated experiments proved it |
| Research dossier | tokyo skill | 19 research files, 10 parallel subagents |
| LLM-as-judge | Case studies | Deterministic scoring with rubrics |
| Panel consensus | Philosophy + research | Formalized from described approach |
| Handoff documents | tokyo skill | 2 cross-session handoffs preserved continuity |
| Craft-decisions ledger | tokyo v2 | 85+ DNN entries tracked every iteration |