Testing Coverage Matrix
Orchestra-skills covers a strong baseline across the testing pyramid. The matrix below identifies what's covered, what's partially addressed, and what's missing.
| Test type | Status | Where it lives | Gap skill needed? |
|---|---|---|---|
| Unit / TDD | ✅ Full | practicing-tdd | No |
| Mutation testing | ✅ Full | hardening-tests → mutation-run.sh | No |
| Flake detection | ✅ Full | hardening-tests → flake-hunt.sh | No |
| e2e (Playwright) | ✅ Full | authoring-tests | No |
| API / HTTP contract | ✅ Full | authoring-tests | No |
| Accessibility (a11y) | ✅ Full | authoring-tests + designing-ui-ux | No |
| Exploratory QA | ✅ Full | exploring-quality | No |
| Fuzz / property-based | ⚠ Partial | hardening-tests mentions it, no workflow | fuzzing-inputs |
| Load / stress | ⚠ Partial | bench_test.go only, no SLO enforcement | profiling-performance |
| Soak / long-duration | ❌ Missing | — | stress-testing-resilience |
| Chaos / fault injection | ❌ Missing | — | stress-testing-resilience |
| Visual regression | ⚠ Partial | screenshot-states.sh (manual, no baseline diff) | Enhancement to authoring-tests |
| Consumer-driven contracts | ❌ Future | — | When multi-service (Pact) |
Gap: Fuzz Testing
Go has had first-class corpus-driven fuzz testing since 1.18 (go test -fuzz=FuzzXxx). The hardening-tests skill mentions property/fuzz cases in assertion-strength.md but provides no workflow or scripts for:
- Identifying which functions are good fuzz targets (parsers, decoders, user-input handlers)
- Scaffolding func FuzzXxx(f *testing.F) with seed corpus entries
- Running time-boxed fuzzing in CI without blocking indefinitely
- Triaging and minimizing crashing inputs
- Storing corpus entries in testdata/fuzz/ (the standard Go location, auto-replayed by go test)
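The scaffolding step in the list above can be sketched as follows. This is a minimal illustration, not the proposed skill's output: `parseKV` is a hypothetical fuzz target invented for the example, and the seed values are arbitrary. In a real repository the `FuzzParseKV` function would live in a `_test.go` file; it is shown in a single file here only to keep the sketch self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"testing"
)

// parseKV is a hypothetical fuzz target (not from this repo): it splits
// "key=value" input, the kind of user-input parser worth fuzzing.
func parseKV(s string) (string, string, error) {
	key, value, found := strings.Cut(s, "=")
	if !found || key == "" {
		return "", "", errors.New("parseKV: want a non-empty key followed by '='")
	}
	return key, value, nil
}

// FuzzParseKV seeds the corpus with f.Add, then checks a round-trip property:
// any input that parses must reassemble to exactly the original string.
func FuzzParseKV(f *testing.F) {
	for _, seed := range []string{"a=b", "=", "key=", "no-separator"} {
		f.Add(seed)
	}
	f.Fuzz(func(t *testing.T, in string) {
		key, value, err := parseKV(in)
		if err != nil {
			return // rejecting input is acceptable; panicking is not
		}
		if got := key + "=" + value; got != in {
			t.Errorf("round trip: got %q, want %q", got, in)
		}
	})
}

func main() {
	key, value, err := parseKV("mode=fast")
	fmt.Println(key, value, err)
}
```

Running `go test -fuzz=FuzzParseKV -fuzztime=30s` bounds the fuzzing run to 30 seconds, which is one way to keep a CI job from blocking indefinitely. Without `-fuzz`, plain `go test` replays the seeds and any saved entries under `testdata/fuzz/FuzzParseKV/` as ordinary test cases.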
Community context
MiniKao's 24-skill QA suite includes property-based-test-gen (Hypothesis/fast-check strategies to close coverage gaps) and mutation-testing (deliberate fault injection to find undertested logic). The ffuf-web-fuzzing community skill covers HTTP-level fuzzing. None of these are directly available here.
Proposed skill structure
Gap: Load & Performance
The authoring-tests skill references bench_test.go for hot-path benchmarks, but there's no guidance on when to run performance tests, what thresholds to enforce, or how to track regression across commits. Missing:
- CPU and memory profiling with go tool pprof and flame graphs
- Latency SLOs enforced in CI (fail if P95 > threshold)
- Benchmark comparison against stored baselines (benchstat)
- Soak test pattern: sustained load for configurable duration, reporting allocations
- HTTP-layer throughput testing (requests/sec, P99 latency)
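The "fail if P95 > threshold" item above could be enforced with a check along these lines. It is a sketch under stated assumptions: the function names (`percentile`, `checkSLO`, `millis`), the nearest-rank percentile method, and the sample latencies are all illustrative choices, not something the existing skills define.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// millis converts millisecond counts into durations; a helper for demo data.
func millis(ms ...int) []time.Duration {
	out := make([]time.Duration, len(ms))
	for i, m := range ms {
		out[i] = time.Duration(m) * time.Millisecond
	}
	return out
}

// percentile returns the p-th percentile (0 < p <= 100) using the nearest-rank
// method. It sorts a copy so the caller's slice is left untouched.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// checkSLO reports an error when the observed P95 exceeds the threshold.
func checkSLO(samples []time.Duration, threshold time.Duration) error {
	if p95 := percentile(samples, 95); p95 > threshold {
		return fmt.Errorf("P95 latency %v exceeds SLO %v", p95, threshold)
	}
	return nil
}

func main() {
	// Illustrative latencies only; a real run would collect these from a load test.
	samples := millis(12, 15, 11, 14, 90, 13, 16, 12, 15, 14)
	if err := checkSLO(samples, 100*time.Millisecond); err != nil {
		fmt.Println("SLO check failed:", err)
		return
	}
	fmt.Println("SLO check passed")
}
```

In CI, the caller would turn a non-nil error into a non-zero exit code. Nearest-rank is used here because it always returns an observed sample; an interpolating percentile would be an equally valid choice.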
Community context
The alirezarezvani/claude-skills repo includes a performance-profiler skill covering Node/Python/Go profiling, bundle analysis, and load testing. The qaskills.sh ecosystem has k6-load-testing covering smoke / load / stress / soak test shapes.
Proposed skill structure
Gap: Chaos & Resilience
No skill covers controlled failure injection — verifying the system degrades gracefully under network failures, resource exhaustion, and concurrent pressure. For this repo (Go server + SSE + browser client), relevant failure modes:
- Network latency injection between components (toxiproxy)
- Abrupt connection drops during SSE streaming
- Concurrent request floods past handler capacity
- Clock skew (time.Now() injection via interface)
- Slow memory leak detection under sustained load
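The clock-skew item above relies on injecting `time.Now()` through an interface. A minimal sketch of that pattern follows; the `Clock` interface, the test doubles, and `tokenExpired` are hypothetical names chosen for illustration and do not come from this repo.

```go
package main

import (
	"fmt"
	"time"
)

// Clock abstracts time.Now so tests can inject frozen or skewed time.
type Clock interface {
	Now() time.Time
}

// systemClock is the production implementation.
type systemClock struct{}

func (systemClock) Now() time.Time { return time.Now() }

// fixedClock is a test double that always returns the same instant.
type fixedClock struct{ t time.Time }

func (c fixedClock) Now() time.Time { return c.t }

// skewedClock is a test double that shifts another clock by a fixed offset.
type skewedClock struct {
	base   Clock
	offset time.Duration
}

func (c skewedClock) Now() time.Time { return c.base.Now().Add(c.offset) }

// tokenExpired is hypothetical logic under test: it depends on the injected
// clock instead of calling time.Now directly.
func tokenExpired(clock Clock, issuedAt time.Time, ttl time.Duration) bool {
	return clock.Now().After(issuedAt.Add(ttl))
}

func main() {
	issued := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	base := fixedClock{t: issued}
	ahead := skewedClock{base: base, offset: 10 * time.Minute}

	fmt.Println("no skew, expired:", tokenExpired(base, issued, 5*time.Minute))
	fmt.Println("10m ahead, expired:", tokenExpired(ahead, issued, 5*time.Minute))
	fmt.Println("system clock usable:", !systemClock{}.Now().IsZero())
}
```

Because production code only sees the `Clock` interface, a resilience test can wrap the real clock in `skewedClock` to simulate drift without touching the host's time settings.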
Community context
The jeffallan/chaos-engineer community skill is a well-designed reference. It designs experiments with Litmus Chaos (Kubernetes pod/node failure), toxiproxy (network latency/packet-loss), and Chaos Monkey (instance termination). Key safety guardrails it enforces: steady-state first, blast radius starts minimal, automated rollback within 30 seconds, single variable per experiment.
Proposed skill structure
Gap: Visual Regression
The designing-ui-ux skill runs screenshot-states.sh for before/after evidence, but there's no baseline + automated diff workflow that can fail CI on unexpected visual regression. Playwright has this built in — no external service needed:
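// e2e/tests/visual.spec.ts
import { test, expect } from '@playwright/test';

// Compares the rendered dashboard against the stored baseline screenshot.
test('dashboard looks correct', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page).toHaveScreenshot('dashboard.png');
});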
- First run creates baseline screenshots in e2e/tests/snapshots/
- Subsequent runs diff pixel-by-pixel against baseline; CI fails on unexpected change
- Update baselines: npx playwright test --update-snapshots
- Per-state coverage: idle, streaming, error, responsive widths
Community context
MiniKao's QA suite includes visual-regression-gen. The qaskills.sh collection has visual-regression-percy-chromatic. Both confirm this is a standard QA layer that teams need — it just doesn't require an external service when using Playwright's built-in support.
Gap: Security
Orchestra-skills has no dedicated security skill. The built-in /security-review command is generic — it doesn't know the project's architecture, Go idioms, or the threat surface specific to this codebase. Four distinct security concerns are currently unaddressed:
| Area | Current state | Gap |
|---|---|---|
| Secrets / credentials | Ad-hoc | No systematic scan before commit or in CI |
| Dependency vulnerabilities | None | No govulncheck / go list -m audit skill |
| SAST / code patterns | Partial | auditing-code-quality checks idioms, not security patterns (injection, SSRF, path traversal) |
| Threat modelling | None | No structured threat-model doc or review step tied to architecture changes |
What a dedicated skill would do
- Run govulncheck ./... and surface CVEs with fix guidance
- Run gosec ./... (or staticcheck security rules) for SAST findings
- Scan for hardcoded secrets / credentials using pattern rules
- Walk the OWASP Top 10 checklist against the active codebase
- Produce a ranked findings report to docs/SECURITY_REVIEW.md
- Boundary: fixes belong to /code-review + TDD cycle; this skill only audits and reports
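The "ranked findings report" step above could merge results from the different scanners into one ordering before writing the markdown file. The sketch below assumes a normalized `Finding` shape and a four-level severity scale; both are inventions for illustration, and neither is the native output format of govulncheck or gosec. The sample entries are placeholders, not real scan results.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Severity is a hypothetical four-level scale; the source does not define one.
type Severity int

const (
	Low Severity = iota
	Medium
	High
	Critical
)

func (s Severity) String() string {
	switch s {
	case Critical:
		return "critical"
	case High:
		return "high"
	case Medium:
		return "medium"
	default:
		return "low"
	}
}

// Finding is an assumed normalized shape for results from any scanner.
type Finding struct {
	Tool     string
	Severity Severity
	Location string
	Summary  string
}

// rankFindings returns a copy ordered from most to least severe.
// The stable sort keeps input order within a severity level.
func rankFindings(in []Finding) []Finding {
	out := append([]Finding(nil), in...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Severity > out[j].Severity })
	return out
}

// renderReport formats the ranked findings as a markdown table.
func renderReport(findings []Finding) string {
	var b strings.Builder
	b.WriteString("| Severity | Tool | Location | Summary |\n|---|---|---|---|\n")
	for _, f := range rankFindings(findings) {
		fmt.Fprintf(&b, "| %s | %s | %s | %s |\n", f.Severity, f.Tool, f.Location, f.Summary)
	}
	return b.String()
}

func main() {
	// Placeholder findings, not real scanner output.
	fmt.Print(renderReport([]Finding{
		{Tool: "secrets-scan", Severity: Medium, Location: "path/to/a.go:1", Summary: "placeholder"},
		{Tool: "govulncheck", Severity: Critical, Location: "go.mod", Summary: "placeholder"},
		{Tool: "gosec", Severity: High, Location: "path/to/b.go:1", Summary: "placeholder"},
	}))
}
```

Keeping ranking separate from rendering means the same ordering can feed both the report file and a CI gate (for example, failing only on critical findings), while the skill itself stays within its audit-and-report boundary.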
Community context
Security review skills are common in community collections. The hesreallyhim/awesome-claude-code list includes security-scanning workflows; the qaskills.sh collection has a security-audit skill covering OWASP, dependency CVEs, and secrets detection. This is a well-understood pattern — it's just missing here.
UI/UX Prototyping: Current State
The designing-ui-ux skill is a production-first design loop: it audits, designs, implements, and verifies in the real app. This is the right scope for a production htmx app.
Prototyping is different: it explores and communicates ideas before committing to production code. Goals are speed (3 layouts in an hour), isolation (test a component without the full app), and stakeholder communication (interactive mockups without backend).
Proposal: add a "Rapid prototyping" section (as a ## heading) to designing-ui-ux.
Proposed rapid prototyping workflow
- Read docs/DESIGN.md and app.css for current tokens and palette
- Generate standalone HTML mockup in docs/prototypes/YYYY-MM-DD-feature/index.html
- Serve: python3 -m http.server 9999 --directory docs/prototypes/
- Run screenshot-states.sh to capture images to share with stakeholders
- Iterate on feedback before touching production templates
- When approved: port styles and structure to htmx templates
The Storybook + Claude Pattern
For JavaScript/React/Vue/Angular frontends (not this htmx repo specifically, but worth documenting for future stacks), the strongest 2026 pattern for UI prototyping is Storybook + Claude Code rather than external tools.
Why Storybook over Lovable / v0.dev / bolt.new
- Prototypes live in the codebase — CI/CD catches regressions automatically
- Uses real production components — no handoff gap between prototype and production
- CLAUDE.md documents the design system; Claude always uses real tokens
- @storybook/addon-mcp exposes list_components, get_component_props, get_component_source as MCP tools
The workflow
- Install @storybook/addon-mcp, which exposes component metadata to Claude as tools
- Claude queries your component library: list_components, get_component_source
- Claude writes a Story (ComponentName.stories.ts) with mock data and interaction tests
- Visual regression testing runs automatically on each story via Playwright
- Stakeholders view the deployed Storybook (Chromatic / self-hosted) for interactive review
- When approved: Claude ports the Story's component usage directly to production routes
SPA Module Isolation Pattern
For SPAs, prototype a new route or module in isolation before wiring it into the router:
- Create src/prototypes/feature-name/ with a standalone React root (or similar)
- Wire up Mock Service Worker (MSW) to intercept API calls with realistic fixture data
- Use an in-memory router for multi-step flows (wizard, tabs, breadcrumb navigation)
- Import design tokens from the same source as production
- When approved: promote to src/routes/feature/, remove MSW stubs
This lets Claude Code generate an entire interactive module prototype in a single session, with the user providing feedback on the live prototype before any production routing is touched.
Tools Comparison
| Approach | Tools | Best for | Main tradeoff |
|---|---|---|---|
| External AI prototyping | v0.dev, Lovable, bolt.new | Greenfield exploration, throwaway mockups | Disconnected from real codebase; handoff gap |
| Storybook + Claude Code | @storybook/addon-mcp + Claude | Component-first teams with design systems | Setup overhead; excellent long-term |
| Design tool bridge | Figma → Code Connect → Claude | Teams with dedicated designers and Figma | Requires Figma; preserves design intent well |
| UXPin Merge + Claude | UXPin + Claude API | Established design systems, enterprise | Paid tool; best for mature DS |
| Static HTML prototypes (this repo) | Claude + app.css tokens | Server-rendered htmx apps | No interactivity; excellent for layout review |
The 2026 consensus: prototypes should live in the codebase. External tools create a handoff gap that costs more engineering time than the initial speed gain.
Community Ecosystem
As of 2026, Claude Code skills follow the Agent Skills open standard, meaning skills written here also work (with minor adjustments) in Cursor, Gemini CLI, Codex CLI, and Antigravity IDE. Multiple "awesome" collections have emerged:
Notable Community Skills
| Skill | What it does | Source |
|---|---|---|
| obra/superpowers | 20+ battle-tested utilities: TDD, debugging, collaboration. Commands like /brainstorm, /write-plan | travisvn list |
| Trail of Bits security | Static analysis, variant analysis, code auditing, vulnerability detection | travisvn list |
| chaos-engineer | Designs chaos experiments with Litmus Chaos, toxiproxy, Chaos Monkey; outputs manifests, runbooks, post-mortem templates | jeffallan |
| playwright-skill | Full Playwright browser automation framework with Claude Code integration | community |
| ffuf-web-fuzzing | HTTP-level penetration testing with authenticated request handling | community |
| storybook-assistant | 18 skills, 12 slash commands, 3 agents for Storybook+Claude workflow; visual regression, a11y, design system integration | flight505 |
| skill-creator | Anthropic meta-skill — builds new skills through interactive Q&A with eval-driven optimization and trigger testing | Anthropic |
| loki-mode | Orchestrates 37 AI agents across 6 swarms for autonomous multi-domain work | community |
Reference: The 24-Skill QA Suite
MiniKao's open-source QA toolkit (2026) is the most complete community testing skill collection. It is organized into 8 categories and supports three operation modes (full-MCP, partial-MCP, markdown-only):
| Category | Skills |
|---|---|
| Test Design | test-master, flutter-test-master, test-review, regression-test, speckit-to-tc, tc-version-diff, sheet-md-sync, smoke-test-analyzer |
| Automation | test-automation, flutter-test-automation, tc-to-pytest |
| Bug Management | bug-report (RIDER format, JIRA dup check, git blame root cause, Slack notify) |
| Quality Quantification | mutation-testing (mutmut), property-based-test-gen (Hypothesis/fast-check) |
| Reporting | publish-regression |
| Performance & Security | performance-test-gen, security-scan, api-contract-test |
| CI Health | visual-regression-gen, flaky-test-hunter |
| Quality Specialties | a11y-audit, localization-test, push-notification-test, test-data-factory |
Bold entries are skills that address gaps not yet covered in orchestra-skills.
Naming & Descriptions
Name field
| Pattern | Example | Verdict |
|---|---|---|
| Gerund form (recommended) | processing-pdfs, hardening-tests | ✅ Best |
| Noun phrase | pdf-processing, test-hardening | ✅ Good |
| Action verb | process-pdfs, harden-tests | ✅ Good |
| Too vague | helper, utils, tools | ❌ Avoid |
| Reserved word | anthropic-helper, claude-tools | ❌ Invalid |
| Uppercase / spaces | Process PDFs | ❌ Invalid |
Description field — the most critical part
Claude uses the description to pick the right skill from potentially 100+ available. It must answer: what does it do AND when should I use it. Formula: [Action verb] [what] — Use when [trigger conditions].
✅ Good:
description: >
  Extracts text and tables from PDF files, fills forms, merges documents.
  Use when working with PDF files or when the user mentions PDFs, forms,
  or document extraction.
❌ Bad:
description: Helps with documents
- Always write in third person (the description is injected into the system prompt)
- Max 1,024 characters; no XML tags; no reserved words
- Include at least 2–3 specific trigger terms users might type
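The mechanical rules from the name table and the bullets above can be checked before a skill is published. The sketch below covers only a subset: lowercase-hyphenated names, the reserved words shown in the table, the 1,024-character description limit, and the ban on XML tags. The exact regular expressions and function names are assumptions made for illustration, not an official validator, and judgment calls such as "too vague" are out of scope.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

const maxDescriptionChars = 1024

var (
	// Assumed pattern: lowercase letters and digits, joined by single hyphens.
	nameRE = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)
	// Assumed pattern: anything that looks like an opening or closing tag.
	xmlTagRE = regexp.MustCompile(`</?[A-Za-z][^>]*>`)
	// Reserved words taken from the name table above.
	reservedWords = []string{"anthropic", "claude"}
)

// validateName rejects uppercase, spaces, and reserved words in a skill name.
func validateName(name string) error {
	if !nameRE.MatchString(name) {
		return fmt.Errorf("name %q: use lowercase letters, digits, and hyphens only", name)
	}
	for _, w := range reservedWords {
		if strings.Contains(name, w) {
			return fmt.Errorf("name %q contains reserved word %q", name, w)
		}
	}
	return nil
}

// validateDescription enforces the length limit and the no-XML-tags rule.
func validateDescription(desc string) error {
	switch {
	case strings.TrimSpace(desc) == "":
		return errors.New("description is empty")
	case utf8.RuneCountInString(desc) > maxDescriptionChars:
		return fmt.Errorf("description exceeds %d characters", maxDescriptionChars)
	case xmlTagRE.MatchString(desc):
		return errors.New("description contains an XML tag")
	}
	return nil
}

func main() {
	for _, name := range []string{"hardening-tests", "Process PDFs", "claude-tools"} {
		fmt.Printf("%-16q -> %v\n", name, validateName(name))
	}
	fmt.Println(validateDescription("Helps with <docs>"))
}
```

A check like this could run as a pre-commit hook over each skill's frontmatter; whether a description names good trigger terms still needs a human or an eval run.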
Anti-patterns to Avoid
| Anti-pattern | Problem | Fix |
|---|---|---|
| Too many options | Leaves Claude stuck weighing alternatives instead of acting | Provide one default; name the escape hatch explicitly |
| Punting errors to Claude | Unreliable; Claude can't recover what a script swallowed | Handle error conditions explicitly in scripts |
| Windows-style paths (\) | Breaks on Unix, where most CI runs | Always use forward slashes |
| Voodoo constants | TIMEOUT=47 — nobody knows why | Brief inline justification for every non-obvious value |
| Deeply nested references | Claude may partially read chained files | All references link one level deep from SKILL.md |
| Time-sensitive information | "If before Aug 2025…" — rots immediately | Use "old patterns" sections with <details> collapse |
| Inconsistent terminology | Same concept named 3 ways → confuses Claude | Pick one term and use it throughout the skill |
| No evaluation before authoring | Skill solves imagined problems | Run Claude on representative tasks first; document failures |
Skill Composition Patterns
Skills compose by having explicit boundary rules — each skill states what it does and what it defers to other skills. This prevents overlap and ensures Claude routes to the right specialist.
Orchestra-skills demonstrates this well:
auditing-code-quality boundaries:
  Bugs / correctness → /code-review (built-in)
  Mechanical simplification → /simplify (built-in)
  Module structure → improving-architecture
  Test quality → hardening-tests
hardening-tests boundaries:
  Creates tests → practicing-tdd, authoring-tests
  Attacks & strengthens them → this skill
  Product bugs found → file via tracking system, fix with TDD
exploring-quality boundaries:
  Finds problems → this skill
  Locks in a11y/visual fixes → authoring-tests
  Schedules filed bugs → tracking issues
Recommended New Skills
- security-reviewing: runs govulncheck for CVEs and gosec for SAST findings, scans for hardcoded secrets, walks the OWASP Top 10 against the active code, and produces a ranked findings report to docs/SECURITY_REVIEW.md. Boundaries: audits and reports only; fixes go through the normal /code-review + TDD cycle.
- fuzzing-inputs: scaffolds func FuzzXxx(f *testing.F) with seed corpus, runs time-boxed fuzzing in CI, and stores crash-reproducing corpus entries.
Phased Roadmap
Phase 1
| Item | Type | Effort | Impact |
|---|---|---|---|
| security-reviewing skill | New skill | Medium | High — no CVE scan, SAST, or secrets detection today |
| Visual regression in authoring-tests | Enhancement | Small | High — zero-dependency, immediate CI value |
| Rapid prototyping in designing-ui-ux | Enhancement | Small | High — faster design iteration |
| fuzzing-inputs skill | New skill | Medium | High — Go 1.18+ fuzz is first-class and currently untapped |
| profiling-performance skill | New skill | Medium | High — no current perf regression detection |
Phase 2
| Item | Type | Effort | Impact |
|---|---|---|---|
| stress-testing-resilience skill | New skill | Large | Medium-High — important pre-release gate |
| Soak test pattern in profiling-performance | Enhancement | Small | Medium — catches slow memory leaks |
Phase 3
| Item | Type | Effort | Impact |
|---|---|---|---|
| Storybook MCP integration in designing-ui-ux | Enhancement | Medium | Medium — relevant if frontend stack changes |
| Consumer-contract testing in registering-contracts | New skill | Large | Low now, critical if multi-service |
| Publish to Agent Skills open standard registry | Distribution | Small | High — community discoverability |
Skill count projection
| Phase | Skills | Change |
|---|---|---|
| Today | 13 | — |
| Phase 1 complete | 16 | +3 new skills, 2 enhancements |
| Phase 2 complete | 17 | +1 new skill, 1 enhancement |
| Phase 3 complete | 19 | +2 integrations |
External resources
- Claude Code Skills Documentation — official reference
- Agent Skills Best Practices — official authoring guide
- Claude Code Commands Reference — full command table
- travisvn/awesome-claude-skills — curated community list
- hesreallyhim/awesome-claude-code — comprehensive list
- rohitg00/awesome-claude-code-toolkit — 135 agents + toolkit
- flight505/storybook-assistant — Storybook + Claude Code
- jeffallan chaos-engineer skill — chaos engineering reference
- agentskills.io — open standard for cross-platform skills