Research & Roadmap — skills docs

Testing Coverage Matrix

Orchestra-skills covers a strong baseline across the testing pyramid. The matrix below identifies what's covered, what's partially addressed, and what's missing.

Test type	Status	Where it lives	Gap skill needed?
Unit / TDD	✅ Full	`practicing-tdd`	No
Mutation testing	✅ Full	`hardening-tests` → mutation-run.sh	No
Flake detection	✅ Full	`hardening-tests` → flake-hunt.sh	No
e2e (Playwright)	✅ Full	`authoring-tests`	No
API / HTTP contract	✅ Full	`authoring-tests`	No
Accessibility (a11y)	✅ Full	`authoring-tests` + `designing-ui-ux`	No
Exploratory QA	✅ Full	`exploring-quality`	No
Fuzz / property-based	⚠ Partial	`hardening-tests` mentions it, no workflow	fuzzing-inputs
Load / stress	⚠ Partial	bench_test.go only, no SLO enforcement	profiling-performance
Soak / long-duration	❌ Missing	—	stress-testing-resilience
Chaos / fault injection	❌ Missing	—	stress-testing-resilience
Visual regression	⚠ Partial	screenshot-states.sh (manual, no baseline diff)	Enhancement to authoring-tests
Consumer-driven contracts	❌ Future	—	When multi-service (Pact)

Gap: Fuzz Testing

fuzzing-inputs (High Priority)

Go has had first-class corpus-driven fuzz testing since 1.18 (go test -fuzz=FuzzXxx). The hardening-tests skill mentions property/fuzz cases in assertion-strength.md but provides no workflow or scripts for:

Identifying which functions are good fuzz targets (parsers, decoders, user-input handlers)
Scaffolding func FuzzXxx(f *testing.F) with seed corpus entries
Running time-boxed fuzzing in CI without blocking indefinitely
Triaging and minimizing crashing inputs
Storing corpus entries in testdata/fuzz/ (the standard Go location, auto-replayed by go test)
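
The scaffolding step above could look roughly like the sketch below. `ParseKV` is a hypothetical target invented for illustration and is not a function from this repo. The fuzz function is shown in the same file as `main` only to keep the example self-contained; in practice it would live in a `_test.go` file.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// ParseKV is a hypothetical fuzz-worthy function: it splits "key=value"
// input that may come from an untrusted source.
func ParseKV(s string) (key, value string, err error) {
	key, value, ok := strings.Cut(s, "=")
	if !ok || key == "" {
		return "", "", fmt.Errorf("malformed pair %q", s)
	}
	return key, value, nil
}

// FuzzParseKV seeds the corpus with a valid pair plus edge cases, then
// checks a round-trip property on every generated input.
func FuzzParseKV(f *testing.F) {
	for _, seed := range []string{"a=b", "=", "", "k=v=w"} {
		f.Add(seed)
	}
	f.Fuzz(func(t *testing.T, in string) {
		k, v, err := ParseKV(in)
		if err != nil {
			return // rejecting input is fine; panicking is not
		}
		// Property: a successful parse must reassemble to the input.
		if got := k + "=" + v; got != in {
			t.Errorf("round trip: got %q, want %q", got, in)
		}
	})
}

func main() {
	k, v, err := ParseKV("mode=fast")
	fmt.Println(k, v, err)
}
```

With the fuzz function in a test file, `go test -fuzz=FuzzParseKV -fuzztime=30s` bounds the run to 30 seconds, and any failing input is written under `testdata/fuzz/FuzzParseKV/`, where later plain `go test` runs replay it.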

Community context

MiniKao's 24-skill QA suite includes property-based-test-gen (Hypothesis/fast-check strategies to close coverage gaps) and mutation-testing (deliberate fault injection to find undertested logic). The ffuf-web-fuzzing community skill covers HTTP-level fuzzing. None of these are directly available here.

Proposed skill structure

skills/fuzzing-inputs/
├── SKILL.md
├── references/
│   ├── fuzz-target-patterns.md   ← what makes a good fuzz target
│   └── corpus-management.md      ← seeding, minimization, CI integration
└── scripts/
    ├── find-fuzz-targets.sh      ← grep for fuzz-worthy functions
    └── fuzz-timed.sh             ← run fuzz for N seconds, capture crashes

Gap: Load & Performance

profiling-performance (High Priority)

The authoring-tests skill references bench_test.go for hot-path benchmarks, but there's no guidance on when to run performance tests, what thresholds to enforce, or how to track regression across commits. Missing:

CPU and memory profiling with go tool pprof and flame graphs
Latency SLOs enforced in CI (fail if P95 > threshold)
Benchmark comparison against stored baselines (benchstat)
Soak test pattern: sustained load for configurable duration, reporting allocations
HTTP-layer throughput testing (requests/sec, P99 latency)
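
As a rough illustration of the "fail if P95 > threshold" item above, the sketch below computes a nearest-rank percentile over latency samples and returns an error when the SLO is exceeded. The function names, the sample values, and the 15 ms threshold are assumptions made for the example; a real gate would read samples from benchmark or load-test output.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// ms converts millisecond values into durations (a small demo helper).
func ms(values ...int) []time.Duration {
	out := make([]time.Duration, len(values))
	for i, v := range values {
		out[i] = time.Duration(v) * time.Millisecond
	}
	return out
}

// Percentile returns the p-th percentile (0 < p <= 100) of samples using
// the nearest-rank method. It returns 0 for an empty slice.
func Percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...) // leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// CheckSLO returns an error when the observed P95 exceeds threshold.
func CheckSLO(samples []time.Duration, threshold time.Duration) error {
	if p95 := Percentile(samples, 95); p95 > threshold {
		return fmt.Errorf("P95 %v exceeds SLO %v", p95, threshold)
	}
	return nil
}

func main() {
	samples := ms(12, 9, 14, 11, 30, 10, 13, 8, 12, 11)
	if err := CheckSLO(samples, 15*time.Millisecond); err != nil {
		fmt.Println("FAIL:", err) // a CI wrapper would exit non-zero here
		return
	}
	fmt.Println("ok")
}
```

Nearest-rank is chosen here because it always returns an observed sample; interpolating percentile definitions would give slightly different values on small sample sets.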

Community context

The alirezarezvani/claude-skills repo includes a performance-profiler skill covering Node/Python/Go profiling, bundle analysis, and load testing. The qaskills.sh ecosystem has k6-load-testing covering smoke / load / stress / soak test shapes.

Proposed skill structure

skills/profiling-performance/
├── SKILL.md
├── references/
│   ├── slo-thresholds.md     ← P95 latency, throughput targets
│   └── profiling-guide.md    ← pprof commands, flame graph reading
└── scripts/
    ├── bench-compare.sh      ← run bench, compare to stored baseline
    ├── update-baseline.sh    ← update after intentional improvement
    └── soak-run.sh           ← sustained load for N minutes, report allocs

Gap: Chaos & Resilience

stress-testing-resilience (Medium Priority)

No skill covers controlled failure injection — verifying the system degrades gracefully under network failures, resource exhaustion, and concurrent pressure. For this repo (Go server + SSE + browser client), relevant failure modes:

Network latency injection between components (toxiproxy)
Abrupt connection drops during SSE streaming
Concurrent request floods past handler capacity
Clock skew (time.Now() injection via interface)
Slow memory leak detection under sustained load
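
The clock-skew item above depends on code reading time through an interface rather than calling `time.Now()` directly. A minimal sketch of that pattern follows; the type names and the `TokenExpired` function are hypothetical and only illustrate how skew could be injected in a test.

```go
package main

import (
	"fmt"
	"time"
)

// Clock abstracts time.Now so tests can inject skew.
type Clock interface {
	Now() time.Time
}

// SystemClock is the production implementation.
type SystemClock struct{}

func (SystemClock) Now() time.Time { return time.Now() }

// FixedClock always returns the same instant, which keeps tests deterministic.
type FixedClock struct{ T time.Time }

func (c FixedClock) Now() time.Time { return c.T }

// SkewedClock wraps another Clock and shifts every reading by Offset.
type SkewedClock struct {
	Base   Clock
	Offset time.Duration
}

func (c SkewedClock) Now() time.Time { return c.Base.Now().Add(c.Offset) }

// TokenExpired is a hypothetical piece of time-dependent logic under test.
func TokenExpired(c Clock, issuedAt time.Time, ttl time.Duration) bool {
	return c.Now().Sub(issuedAt) > ttl
}

func main() {
	issued := time.Now()
	ttl := time.Minute

	fmt.Println("system clock:", TokenExpired(SystemClock{}, issued, ttl))

	// Simulate a node whose clock runs two minutes ahead.
	ahead := SkewedClock{Base: SystemClock{}, Offset: 2 * time.Minute}
	fmt.Println("skewed clock:", TokenExpired(ahead, issued, ttl))
}
```

Production code would receive a `Clock` through its constructor, and a resilience test would swap in `SkewedClock` to check that expiry, retry, and timeout logic still behaves sensibly.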

Community context

The jeffallan/chaos-engineer community skill is a well-designed reference. It designs experiments with Litmus Chaos (Kubernetes pod/node failure), toxiproxy (network latency/packet-loss), and Chaos Monkey (instance termination). Key safety guardrails it enforces: steady-state first, blast radius starts minimal, automated rollback within 30 seconds, single variable per experiment.

Proposed skill structure

skills/stress-testing-resilience/
├── SKILL.md
├── references/
│   ├── failure-modes.md       ← catalog of failure scenarios for this app
│   ├── steady-state.md        ← baseline metrics to assert recovery
│   └── safety-guardrails.md   ← blast-radius rules, rollback requirements
└── scripts/
    ├── latency-inject.sh      ← toxiproxy setup + teardown
    ├── connection-flood.sh    ← concurrent HTTP + SSE pressure
    └── goroutine-check.sh     ← assert goroutine count returns to baseline
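
The goroutine-count check named in the structure above could rest on logic like the following sketch. `Snapshot` and `WaitForBaseline` are hypothetical helpers, and the slack value of 2 is an assumption to absorb runtime-internal goroutines; the blocked goroutines stand in for SSE handlers that have not yet released.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Snapshot records the current goroutine count as the steady-state baseline.
func Snapshot() int { return runtime.NumGoroutine() }

// WaitForBaseline polls until the goroutine count is at most baseline+slack,
// or returns an error once timeout elapses.
func WaitForBaseline(baseline, slack int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		n := runtime.NumGoroutine()
		if n <= baseline+slack {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("goroutines: %d now, baseline %d (+%d slack)", n, baseline, slack)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	base := Snapshot()

	stop := make(chan struct{})
	for i := 0; i < 5; i++ {
		go func() { <-stop }() // simulated handlers that are still holding on
	}
	fmt.Println("during load:", WaitForBaseline(base, 2, 50*time.Millisecond))

	close(stop) // load ends; handlers should exit
	fmt.Println("after load: ", WaitForBaseline(base, 2, 2*time.Second))
}
```

Polling with a deadline, rather than a single comparison, avoids false alarms while handlers are still unwinding after the load stops.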

Gap: Visual Regression

authoring-tests enhancement (High Priority, low effort)

The designing-ui-ux skill runs screenshot-states.sh for before/after evidence, but there's no baseline + automated diff workflow that can fail CI on unexpected visual regression. Playwright has this built in — no external service needed:

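// e2e/tests/visual.spec.ts
import { test, expect } from '@playwright/test';

// Compares the rendered page against the stored baseline image.
test('dashboard looks correct', async ({ page }) => {
  await page.goto('/dashboard'); // relative URL: assumes baseURL is set in playwright.config.ts
  await expect(page).toHaveScreenshot('dashboard.png');
});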

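First run writes baseline screenshots to e2e/tests/visual.spec.ts-snapshots/ (Playwright's default, next to the test file; configurable via snapshotPathTemplate) and reports the missing baseline as a failure, so generate and commit baselines before relying on CI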
Subsequent runs diff pixel-by-pixel against baseline; CI fails on unexpected change
Update baselines: npx playwright test --update-snapshots
Per-state coverage: idle, streaming, error, responsive widths

Community context

MiniKao's QA suite includes visual-regression-gen. The qaskills.sh collection has visual-regression-percy-chromatic. Both confirm this is a standard QA layer that teams need — it just doesn't require an external service when using Playwright's built-in support.

Gap: Security

security-reviewing skill, currently missing (High Priority)

Orchestra-skills has no dedicated security skill. The built-in /security-review command is generic — it doesn't know the project's architecture, Go idioms, or the threat surface specific to this codebase. Four distinct security concerns are currently unaddressed:

Area	Current state	Gap
Secrets / credentials	Ad-hoc	No systematic scan before commit or in CI
Dependency vulnerabilities	None	No `govulncheck` / `go list -m` audit skill
SAST / code patterns	Partial	auditing-code-quality checks idioms, not security patterns (injection, SSRF, path traversal)
Threat modelling	None	No structured threat-model doc or review step tied to architecture changes

What a dedicated skill would do

Run govulncheck ./... and surface CVEs with fix guidance
Run gosec ./... (or staticcheck security rules) for SAST findings
Scan for hardcoded secrets / credentials using pattern rules
Walk the OWASP Top 10 checklist against the active codebase
Produce a ranked findings report to docs/SECURITY_REVIEW.md
Boundary: fixes belong to /code-review + TDD cycle — this skill only audits and reports
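
The "pattern rules" step in the list above could be sketched as a line-by-line regular-expression scan. The three rules below are illustrative assumptions, not a complete or tuned rule set, and `ScanSecrets` is a hypothetical name; a real scan would also need allowlists to keep false positives manageable.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Finding records which rule matched on which 1-based line.
type Finding struct {
	Line int
	Rule string
}

// rules is an illustrative, deliberately small set of secret patterns.
var rules = []struct {
	name string
	re   *regexp.Regexp
}{
	{"aws-access-key-id", regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)},
	{"private-key-header", regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----`)},
	{"hardcoded-credential", regexp.MustCompile(`(?i)(password|secret|api_key)\s*(:=|=|:)\s*"[^"]{8,}"`)},
}

// ScanSecrets returns one Finding per rule match per line of src.
func ScanSecrets(src string) []Finding {
	var out []Finding
	sc := bufio.NewScanner(strings.NewReader(src))
	for n := 1; sc.Scan(); n++ {
		for _, r := range rules {
			if r.re.MatchString(sc.Text()) {
				out = append(out, Finding{Line: n, Rule: r.name})
			}
		}
	}
	return out
}

func main() {
	src := "package cfg\n" +
		"var key = \"AKIAIOSFODNN7EXAMPLE\"\n" + // AWS's documented example key ID
		"var password = \"correct-horse-battery\"\n"
	for _, f := range ScanSecrets(src) {
		fmt.Printf("line %d: %s\n", f.Line, f.Rule)
	}
}
```

In keeping with the audit-only boundary stated above, a scan like this only reports line numbers and rule names and never rewrites the file.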

Community context

Security review skills are common in community collections. The hesreallyhim/awesome-claude-code list includes security-scanning workflows; the qaskills.sh collection has a security-audit skill covering OWASP, dependency CVEs, and secrets detection. This is a well-understood pattern — it's just missing here.

UI/UX Prototyping: Current State

The designing-ui-ux skill is a production-first design loop — it audits, designs, implements, and verifies in the real app. This is the right scope for a production htmx app.

Prototyping is different: it explores and communicates ideas before committing to production code. Goals are speed (3 layouts in an hour), isolation (test a component without the full app), and stakeholder communication (interactive mockups without backend).

What's missing: A rapid prototyping phase where Claude generates a static HTML mockup using the existing design tokens, serves it locally, and produces stakeholder-share screenshots — all before touching production templates. This should be added as a ## Rapid prototyping section to designing-ui-ux.

Proposed rapid prototyping workflow

Read docs/DESIGN.md and app.css for current tokens and palette
Generate standalone HTML mockup in docs/prototypes/YYYY-MM-DD-feature/index.html
Serve: python3 -m http.server 9999 --directory docs/prototypes/
Run screenshot-states.sh to capture stakeholder-share images
Iterate on feedback before touching production templates
When approved: port styles and structure to htmx templates

The Storybook + Claude Pattern

For JavaScript/React/Vue/Angular frontends (not this htmx repo specifically, but worth documenting for future stacks), the strongest 2026 pattern for UI prototyping is Storybook + Claude Code rather than external tools.

Why Storybook over Lovable / v0.dev / bolt.new

Prototypes live in the codebase — CI/CD catches regressions automatically
Uses real production components — no handoff gap between prototype and production
CLAUDE.md documents the design system; Claude always uses real tokens
@storybook/addon-mcp exposes list_components, get_component_props, get_component_source as MCP tools

The workflow

Install @storybook/addon-mcp — exposes component metadata to Claude as tools
Claude queries your component library: list_components, get_component_source
Claude writes a Story (ComponentName.stories.ts) with mock data and interaction tests
Visual regression testing runs automatically on each story via Playwright
Stakeholders view the deployed Storybook (Chromatic / self-hosted) for interactive review
When approved: Claude ports the Story's component usage directly to production routes

flight505/storybook-assistant provides a ready-made Claude Code plugin with 18 skills, 12 slash commands, and 3 agents for this workflow — supporting React, Vue, Angular, Next.js, Svelte, Solid, and Tauri.

SPA Module Isolation Pattern

For SPAs, prototyping a new route or module in isolation before wiring it into the router:

Create src/prototypes/feature-name/ with a standalone React root (or similar)
Wire up Mock Service Worker (MSW) to intercept API calls with realistic fixture data
Use an in-memory router for multi-step flows (wizard, tabs, breadcrumb navigation)
Import design tokens from the same source as production
When approved: promote to src/routes/feature/, remove MSW stubs

This lets Claude Code generate an entire interactive module prototype in a single session, with the user providing feedback on the live prototype before any production routing is touched.

Tools Comparison

Approach	Tools	Best for	Main tradeoff
External AI prototyping	v0.dev, Lovable, bolt.new	Greenfield exploration, throwaway mockups	Disconnected from real codebase; handoff gap
Storybook + Claude Code	@storybook/addon-mcp + Claude	Component-first teams with design systems	Setup overhead; excellent long-term
Design tool bridge	Figma → Code Connect → Claude	Teams with dedicated designers and Figma	Requires Figma; preserves design intent well
UXPin Merge + Claude	UXPin + Claude API	Established design systems, enterprise	Paid tool; best for mature DS
Static HTML prototypes (this repo)	Claude + app.css tokens	Server-rendered htmx apps	No interactivity; excellent for layout review

The 2026 consensus: prototypes should live in the codebase. External tools create a handoff gap that costs more engineering time than the initial speed gain.

Community Ecosystem

As of 2026, Claude Code skills follow the Agent Skills open standard, meaning skills written here also work (with minor adjustments) in Cursor, Gemini CLI, Codex CLI, and Antigravity IDE. Multiple "awesome" collections have emerged:

travisvn/awesome-claude-skills

Best curated starting point. Quality-filtered list of skills, resources, and tools for Claude Code workflows.

hesreallyhim/awesome-claude-code

Most comprehensive list: skills, agents, hooks, orchestrators, status lines, developer tooling, and all latest features. Currently being restructured.

rohitg00/awesome-claude-code-toolkit

135 agents, 35 curated skills, 42 commands, 176+ plugins, 20 hooks, 15 rules, 7 templates, 14 MCP configs, 26 companion apps. The most batteries-included toolkit.

alirezarezvani/claude-skills

337+ production-ready skills across 16 domains for 13 AI platforms. Strong on engineering, compliance (ISO/SOC 2/GDPR), C-suite personas, and security-first architecture.

VoltAgent/awesome-agent-skills

1,000+ community agent skills from official dev teams and community contributors, compatible across coding agents.

sickn33/antigravity-awesome-skills

1,500+ installable skills with a CLI installer, bundles, workflows, and official/community collections. Multi-platform.

Notable Community Skills

Skill	What it does	Source
obra/superpowers	20+ battle-tested utilities: TDD, debugging, collaboration. Commands like /brainstorm, /write-plan	travisvn list
Trail of Bits security	Static analysis, variant analysis, code auditing, vulnerability detection	travisvn list
chaos-engineer	Designs chaos experiments with Litmus Chaos, toxiproxy, Chaos Monkey; outputs manifests, runbooks, post-mortem templates	jeffallan
playwright-skill	Full Playwright browser automation framework with Claude Code integration	community
ffuf-web-fuzzing	HTTP-level penetration testing with authenticated request handling	community
storybook-assistant	18 skills, 12 slash commands, 3 agents for Storybook+Claude workflow; visual regression, a11y, design system integration	flight505
skill-creator	Anthropic meta-skill — builds new skills through interactive Q&A with eval-driven optimization and trigger testing	Anthropic
loki-mode	Orchestrates 37 AI agents across 6 swarms for autonomous multi-domain work	community

Reference: The 24-Skill QA Suite

MiniKao's open-source QA toolkit (2026) is the most complete community testing skill collection. Organized into 8 categories, with three operation modes (full-MCP, partial-MCP, markdown-only):

Category	Skills
Test Design	test-master, flutter-test-master, test-review, regression-test, speckit-to-tc, tc-version-diff, sheet-md-sync, smoke-test-analyzer
Automation	test-automation, flutter-test-automation, tc-to-pytest
Bug Management	bug-report (RIDER format, JIRA dup check, git blame root cause, Slack notify)
Quality Quantification	mutation-testing (mutmut), property-based-test-gen (Hypothesis/fast-check)
Reporting	publish-regression
Performance & Security	performance-test-gen, security-scan, api-contract-test
CI Health	visual-regression-gen, flaky-test-hunter
Quality Specialties	a11y-audit, localization-test, push-notification-test, test-data-factory

Bold entries are skills that address gaps not yet covered in orchestra-skills.

Authoring Best Practices

Concise is key

Claude is already smart. Only add context Claude doesn't already have. Challenge every paragraph: "Does Claude need this explanation?" The context window is shared with conversation history, other skills, and system prompts.

Degrees of freedom

High freedom (text instructions): multiple valid approaches. Medium freedom (pseudocode): preferred pattern with variation. Low freedom (exact script): fragile, exact sequence required. Match specificity to task fragility.

Evaluations first

Write 3 test scenarios BEFORE writing the skill body. This ensures you're solving real problems, not imagined ones. Baseline Claude's performance without the skill, then measure improvement.

Test with all models

Skills are model-dependent. Haiku needs more detail; Opus needs less verbosity. Test with Haiku (does it provide enough guidance?), Sonnet (is it clear and efficient?), and Opus (does it over-explain?).

Progressive disclosure

SKILL.md body under 500 lines. Split content into separate reference files when approaching this limit. Keep all reference links one level deep from SKILL.md — never chain references deeper.

Feedback loops

For validation-heavy skills: run validator → fix errors → repeat. The plan-validate-execute pattern catches errors before irreversible changes. Scripts should surface errors with specific messages, not generic failures.

Scripts execute, not load

Scripts in scripts/ are run via bash — their code never enters the context window. Only their output costs tokens. This makes scripts far more efficient than asking Claude to generate equivalent code.

Develop with Claude

Use Claude A to write the skill, Claude B (fresh instance) to test it on real tasks. Observe where Claude B struggles or misses rules, then return to Claude A with specific observations. The iterative cycle beats writing from assumptions.

Naming & Descriptions

Name field

Pattern	Example	Verdict
Gerund form (recommended)	processing-pdfs, hardening-tests	✅ Best
Noun phrase	pdf-processing, test-hardening	✅ Good
Action verb	process-pdfs, harden-tests	✅ Good
Too vague	helper, utils, tools	❌ Avoid
Reserved word	anthropic-helper, claude-tools	❌ Invalid
Uppercase / spaces	Process PDFs	❌ Invalid

Description field — the most critical part

Claude uses the description to pick the right skill from potentially 100+ available. It must answer: what does it do AND when should I use it. Formula: [Action verb] [what] — Use when [trigger conditions].

✅ Good:
description: >
  Extracts text and tables from PDF files, fills forms, merges documents.
  Use when working with PDF files or when the user mentions PDFs, forms,
  or document extraction.

❌ Bad:
description: Helps with documents

Always write in third person (injected into system prompt)
Max 1,024 characters; no XML tags; no reserved words
Include at least 2–3 specific trigger terms users might type
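
The naming and description rules in this section lend themselves to a mechanical check. The sketch below encodes only the rules stated here; `ValidateSkill` is a hypothetical helper, the exact name pattern (lowercase letters and digits joined by single hyphens) is an assumption, and the "Use when" check is a heuristic for the trigger clause in the formula above.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

const maxDescriptionLen = 1024

var (
	nameRE   = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`) // assumed shape of a valid name
	tagRE    = regexp.MustCompile(`<[^>]+>`)
	reserved = []string{"anthropic", "claude"}
	vague    = map[string]bool{"helper": true, "utils": true, "tools": true}
)

// ValidateSkill returns one error per violated rule; an empty slice means
// the name and description pass every check encoded here.
func ValidateSkill(name, description string) []error {
	var errs []error
	if !nameRE.MatchString(name) {
		errs = append(errs, errors.New("name: use lowercase words joined by hyphens"))
	}
	for _, w := range reserved {
		if strings.Contains(name, w) {
			errs = append(errs, fmt.Errorf("name: contains reserved word %q", w))
		}
	}
	if vague[name] {
		errs = append(errs, errors.New("name: too vague"))
	}
	if utf8.RuneCountInString(description) > maxDescriptionLen {
		errs = append(errs, fmt.Errorf("description: longer than %d characters", maxDescriptionLen))
	}
	if tagRE.MatchString(description) {
		errs = append(errs, errors.New("description: contains XML tags"))
	}
	if !strings.Contains(strings.ToLower(description), "use when") {
		errs = append(errs, errors.New(`description: missing a "Use when ..." trigger clause`))
	}
	return errs
}

func main() {
	for _, e := range ValidateSkill("Process PDFs", "Helps with documents") {
		fmt.Println(e)
	}
}
```

Returning every violation at once, instead of stopping at the first, matches the feedback-loop advice earlier in this document: the author sees specific messages for each problem in a single pass.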

Anti-patterns to Avoid

Anti-pattern	Problem	Fix
Too many options	Stalls Claude in analysis paralysis	Provide one default; name the escape hatch explicitly
Punting errors to Claude	Unreliable; Claude can't recover what a script swallowed	Handle error conditions explicitly in scripts
Windows-style paths (`\`)	Breaks on Unix — where most CI runs	Always use forward slashes
Voodoo constants	TIMEOUT=47 — nobody knows why	Brief inline justification for every non-obvious value
Deeply nested references	Claude may partially read chained files	All references link one level deep from SKILL.md
Time-sensitive information	"If before Aug 2025…" — rots immediately	Use "old patterns" sections with <details> collapse
Inconsistent terminology	Same concept named 3 ways → confuses Claude	Pick one term and use it throughout the skill
No evaluation before authoring	Skill solves imagined problems	Run Claude on representative tasks first; document failures

Skill Composition Patterns

Skills compose by having explicit boundary rules — each skill states what it does and what it defers to other skills. This prevents overlap and ensures Claude routes to the right specialist.

Orchestra-skills demonstrates this well:

auditing-code-quality boundaries:
  Bugs / correctness         → /code-review (built-in)
  Mechanical simplification  → /simplify (built-in)
  Module structure           → improving-architecture
  Test quality               → hardening-tests

hardening-tests boundaries:
  Creates tests              → practicing-tdd, authoring-tests
  Attacks & strengthens them → this skill
  Product bugs found         → file via tracking system, fix with TDD

exploring-quality boundaries:
  Finds problems             → this skill
  Locks in a11y/visual fixes → authoring-tests
  Schedules filed bugs       → tracking issues

Document boundaries explicitly. Every skill should answer "what do I NOT do?" as clearly as "what do I do?". This is what allows skills to be composed without stepping on each other.

Recommended New Skills

security-reviewing

Invocation: /security-reviewing

Systematic security audit tailored to this codebase: run govulncheck for CVEs, gosec for SAST findings, scan for hardcoded secrets, walk OWASP Top 10 against the active code, and produce a ranked findings report to docs/SECURITY_REVIEW.md. Boundaries: audits and reports only — fixes go through the normal /code-review + TDD cycle.

skills/security-reviewing/
├── SKILL.md
├── references/
│   ├── owasp-top10.md        ← Go-specific OWASP checklist
│   ├── secrets-patterns.md   ← regex patterns for credential detection
│   └── vuln-triage.md        ← CVE severity guide + fix priority rules
└── scripts/
    ├── govulncheck-run.sh    ← dependency CVE scan
    ├── gosec-run.sh          ← SAST security lint
    └── secrets-scan.sh       ← grep for hardcoded creds

profiling-performance

Invocation: /profiling-performance

Systematically profile Go performance, detect regressions against stored baselines, enforce P95 latency SLOs in CI, and run soak tests for leak detection. Fills the gap between "we have bench_test.go" and "we fail CI when P95 degrades."

skills/profiling-performance/
├── SKILL.md
├── references/
│   ├── slo-thresholds.md
│   └── profiling-guide.md
└── scripts/
    ├── bench-compare.sh      ← benchstat diff vs baseline
    ├── update-baseline.sh
    └── soak-run.sh

fuzzing-inputs

Invocation: /fuzzing-inputs

Scaffold and run Go corpus-driven fuzz tests for parsers, decoders, and user-input handlers. Identifies fuzz targets, writes func FuzzXxx(f *testing.F) with seed corpus, runs time-boxed fuzzing in CI, and stores crash-reproducing corpus entries.

skills/fuzzing-inputs/
├── SKILL.md
├── references/
│   ├── fuzz-target-patterns.md
│   └── corpus-management.md
└── scripts/
    ├── find-fuzz-targets.sh
    └── fuzz-timed.sh         ← run for N seconds, capture crashes

stress-testing-resilience

Invocation: /stress-testing-resilience

Verify the system degrades gracefully under failure and load. Maps failure modes from CODEBASE_MAP.md, runs controlled fault injection (toxiproxy latency, connection floods), asserts steady-state recovery within SLO, and produces a resilience report.

skills/stress-testing-resilience/
├── SKILL.md
├── references/
│   ├── failure-modes.md
│   ├── steady-state.md
│   └── safety-guardrails.md   ← blast-radius rules
└── scripts/
    ├── latency-inject.sh
    ├── connection-flood.sh
    └── goroutine-check.sh

Enhancement: authoring-tests → visual regression

Add ## Visual regression tests section to authoring-tests SKILL.md

Add Playwright's built-in screenshot comparison to the test layers — zero external dependencies, CI-enforceable, per-state coverage. Update the layer matrix to include visual as a fifth layer. No new skill needed; this is a targeted enhancement.

Enhancement: designing-ui-ux → rapid prototyping phase

Add ## Rapid prototyping section to designing-ui-ux SKILL.md

Add Phase 0 before the Audit → Design → Implement loop. Claude generates a standalone HTML mockup using existing design tokens, serves it locally, captures stakeholder-share screenshots, and iterates — all before touching production templates.

Phased Roadmap

Phase 1 — Q3 2026: Fill Critical Gaps

Item	Type	Effort	Impact
security-reviewing skill	New skill	Medium	High — no CVE scan, SAST, or secrets detection today
Visual regression in authoring-tests	Enhancement	Small	High — zero-dependency, immediate CI value
Rapid prototyping in designing-ui-ux	Enhancement	Small	High — faster design iteration
fuzzing-inputs skill	New skill	Medium	High — Go 1.18+ fuzz is first-class and currently untapped
profiling-performance skill	New skill	Medium	High — no current perf regression detection

Phase 2 — Q4 2026: Stability & Resilience

Item	Type	Effort	Impact
stress-testing-resilience skill	New skill	Large	Medium-High — important pre-release gate
Soak test pattern in profiling-performance	Enhancement	Small	Medium — catches slow memory leaks

Phase 3 — Q1 2027: Ecosystem Integration

Item	Type	Effort	Impact
Storybook MCP integration in designing-ui-ux	Enhancement	Medium	Medium — relevant if frontend stack changes
Consumer-contract testing in registering-contracts	New skill	Large	Low now, critical if multi-service
Publish to Agent Skills open standard registry	Distribution	Small	High — community discoverability

Skill count projection

Phase	Skills	Change
Today	13	—
Phase 1 complete	16	+3 new skills, 2 enhancements
Phase 2 complete	17	+1 new skill, 1 enhancement
Phase 3 complete	18	+1 new skill, 1 enhancement, 1 distribution item

External resources

Claude Code Skills Documentation — official reference
Agent Skills Best Practices — official authoring guide
Claude Code Commands Reference — full command table
travisvn/awesome-claude-skills — curated community list
hesreallyhim/awesome-claude-code — comprehensive list
rohitg00/awesome-claude-code-toolkit — 135 agents + toolkit
flight505/storybook-assistant — Storybook + Claude Code
jeffallan chaos-engineer skill — chaos engineering reference
agentskills.io — open standard for cross-platform skills