Research & Roadmap

Deep analysis of testing skill gaps, UI/UX prototyping patterns, community ecosystem survey, authoring best practices, and a concrete roadmap for new skills.

4 testing gaps · 5 new skill proposals · 7 community collections · 3-phase roadmap

Testing Coverage Matrix

Orchestra-skills covers a strong baseline across the testing pyramid. The matrix below identifies what's covered, what's partially addressed, and what's missing.

| Test type | Status | Where it lives | Gap skill needed? |
|---|---|---|---|
| Unit / TDD | ✅ Full | practicing-tdd | No |
| Mutation testing | ✅ Full | hardening-tests → mutation-run.sh | No |
| Flake detection | ✅ Full | hardening-tests → flake-hunt.sh | No |
| e2e (Playwright) | ✅ Full | authoring-tests | No |
| API / HTTP contract | ✅ Full | authoring-tests | No |
| Accessibility (a11y) | ✅ Full | authoring-tests + designing-ui-ux | No |
| Exploratory QA | ✅ Full | exploring-quality | No |
| Fuzz / property-based | ⚠ Partial | hardening-tests mentions it, no workflow | fuzzing-inputs |
| Load / stress | ⚠ Partial | bench_test.go only, no SLO enforcement | profiling-performance |
| Soak / long-duration | ❌ Missing | | stress-testing-resilience |
| Chaos / fault injection | ❌ Missing | | stress-testing-resilience |
| Visual regression | ⚠ Partial | screenshot-states.sh (manual, no baseline diff) | Enhancement to authoring-tests |
| Consumer-driven contracts | ❌ Future | | When multi-service (Pact) |

Gap: Fuzz Testing

fuzzing-inputs High Priority

Go has had first-class corpus-driven fuzz testing since 1.18 (go test -fuzz=FuzzXxx). The hardening-tests skill mentions property/fuzz cases in assertion-strength.md but provides no workflow or scripts for:

  • Identifying which functions are good fuzz targets (parsers, decoders, user-input handlers)
  • Scaffolding func FuzzXxx(f *testing.F) with seed corpus entries
  • Running time-boxed fuzzing in CI without blocking indefinitely
  • Triaging and minimizing crashing inputs
  • Storing corpus entries in testdata/fuzz/ (the standard Go location, auto-replayed by go test)
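As a minimal sketch of the scaffolding step above, the block below shows a fuzz target with seed corpus entries and a round-trip property. ParseKV and FormatKV are hypothetical stand-ins for a real parser in the repo, not existing code. In a real package the FuzzParseKV function would live in a _test.go file so that go test can discover it; it sits next to main here only to keep the sketch in one file.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// ParseKV is a hypothetical parser used only for illustration. It splits
// "key=value" on the first '=' and rejects inputs with an empty key.
func ParseKV(s string) (key, value string, err error) {
	i := strings.IndexByte(s, '=')
	if i <= 0 {
		return "", "", fmt.Errorf("parsekv: missing key or '=' in %q", s)
	}
	return s[:i], s[i+1:], nil
}

// FormatKV is the inverse of ParseKV.
func FormatKV(key, value string) string { return key + "=" + value }

// FuzzParseKV checks that every accepted input survives a parse/format
// round trip. Move this function into a _test.go file in a real package.
func FuzzParseKV(f *testing.F) {
	// Seed corpus: valid, edge-case, and invalid inputs.
	for _, seed := range []string{"a=b", "key=", "=v", "no-separator", "a=b=c"} {
		f.Add(seed)
	}
	f.Fuzz(func(t *testing.T, in string) {
		k, v, err := ParseKV(in)
		if err != nil {
			return // rejecting input is fine; panics and mismatches are the bugs
		}
		if got := FormatKV(k, v); got != in {
			t.Errorf("round trip mismatch: in=%q got=%q", in, got)
		}
	})
}

func main() {
	k, v, err := ParseKV("mode=stream")
	fmt.Println(k, v, err)
}
```

For time-boxed CI runs, go test accepts a -fuzztime flag alongside -fuzz, for example go test -fuzz=FuzzParseKV -fuzztime=30s. Failing inputs are written under testdata/fuzz/FuzzParseKV/ and replayed by later go test runs.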

Community context

MiniKao's 24-skill QA suite includes property-based-test-gen (Hypothesis/fast-check strategies to close coverage gaps) and mutation-testing (deliberate fault injection to find undertested logic). The ffuf-web-fuzzing community skill covers HTTP-level fuzzing. None of these are directly available here.

Proposed skill structure

skills/fuzzing-inputs/
├── SKILL.md
├── references/
│   ├── fuzz-target-patterns.md   ← what makes a good fuzz target
│   └── corpus-management.md      ← seeding, minimization, CI integration
└── scripts/
    ├── find-fuzz-targets.sh      ← grep for fuzz-worthy functions
    └── fuzz-timed.sh             ← run fuzz for N seconds, capture crashes

Gap: Load & Performance

profiling-performance High Priority

The authoring-tests skill references bench_test.go for hot-path benchmarks, but there's no guidance on when to run performance tests, what thresholds to enforce, or how to track regression across commits. Missing:

  • CPU and memory profiling with go tool pprof and flame graphs
  • Latency SLOs enforced in CI (fail if P95 > threshold)
  • Benchmark comparison against stored baselines (benchstat)
  • Soak test pattern: sustained load for configurable duration, reporting allocations
  • HTTP-layer throughput testing (requests/sec, P99 latency)

Community context

The alirezarezvani/claude-skills repo includes a performance-profiler skill covering Node/Python/Go profiling, bundle analysis, and load testing. The qaskills.sh ecosystem has k6-load-testing covering smoke / load / stress / soak test shapes.

Proposed skill structure

skills/profiling-performance/
├── SKILL.md
├── references/
│   ├── slo-thresholds.md     ← P95 latency, throughput targets
│   └── profiling-guide.md    ← pprof commands, flame graph reading
└── scripts/
    ├── bench-compare.sh      ← run bench, compare to stored baseline
    ├── update-baseline.sh    ← update after intentional improvement
    └── soak-run.sh           ← sustained load for N minutes, report allocs

Gap: Chaos & Resilience

stress-testing-resilience Medium Priority

No skill covers controlled failure injection — verifying the system degrades gracefully under network failures, resource exhaustion, and concurrent pressure. For this repo (Go server + SSE + browser client), relevant failure modes:

  • Network latency injection between components (toxiproxy)
  • Abrupt connection drops during SSE streaming
  • Concurrent request floods past handler capacity
  • Clock skew (time.Now() injection via interface)
  • Slow memory leak detection under sustained load

Community context

The jeffallan/chaos-engineer community skill is a well-designed reference. It designs experiments with Litmus Chaos (Kubernetes pod/node failure), toxiproxy (network latency/packet-loss), and Chaos Monkey (instance termination). Key safety guardrails it enforces: steady-state first, blast radius starts minimal, automated rollback within 30 seconds, single variable per experiment.

Proposed skill structure

skills/stress-testing-resilience/
├── SKILL.md
├── references/
│   ├── failure-modes.md       ← catalog of failure scenarios for this app
│   ├── steady-state.md        ← baseline metrics to assert recovery
│   └── safety-guardrails.md   ← blast-radius rules, rollback requirements
└── scripts/
    ├── latency-inject.sh      ← toxiproxy setup + teardown
    ├── connection-flood.sh    ← concurrent HTTP + SSE pressure
    └── goroutine-check.sh     ← assert goroutine count returns to baseline

Gap: Visual Regression

authoring-tests enhancement High Priority (low effort)

The designing-ui-ux skill runs screenshot-states.sh for before/after evidence, but there's no baseline + automated diff workflow that can fail CI on unexpected visual regression. Playwright has this built in — no external service needed:

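// e2e/tests/visual.spec.ts
import { test, expect } from '@playwright/test';

test('dashboard looks correct', async ({ page }) => {
  await page.goto('/dashboard');
  // Compares against the stored baseline and fails when the diff exceeds the configured tolerance.
  await expect(page).toHaveScreenshot('dashboard.png');
});

  • First run writes baseline screenshots next to the spec, by default in e2e/tests/visual.spec.ts-snapshots/ (configurable via snapshotPathTemplate)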
  • Subsequent runs diff pixel-by-pixel against baseline; CI fails on unexpected change
  • Update baselines: npx playwright test --update-snapshots
  • Per-state coverage: idle, streaming, error, responsive widths

Community context

MiniKao's QA suite includes visual-regression-gen. The qaskills.sh collection has visual-regression-percy-chromatic. Both confirm this is a standard QA layer that teams need — it just doesn't require an external service when using Playwright's built-in support.

Gap: Security

security-reviewing skill (missing) High Priority

Orchestra-skills has no dedicated security skill. The built-in /security-review command is generic — it doesn't know the project's architecture, Go idioms, or the threat surface specific to this codebase. Four distinct security concerns are currently unaddressed:

| Area | Current state | Gap |
|---|---|---|
| Secrets / credentials | Ad-hoc | No systematic scan before commit or in CI |
| Dependency vulnerabilities | None | No govulncheck / go list -m audit skill |
| SAST / code patterns | Partial | auditing-code-quality checks idioms, not security patterns (injection, SSRF, path traversal) |
| Threat modelling | None | No structured threat-model doc or review step tied to architecture changes |

What a dedicated skill would do

  • Run govulncheck ./... and surface CVEs with fix guidance
  • Run gosec ./... (or staticcheck security rules) for SAST findings
  • Scan for hardcoded secrets / credentials using pattern rules
  • Walk the OWASP Top 10 checklist against the active codebase
  • Produce a ranked findings report to docs/SECURITY_REVIEW.md
  • Boundary: fixes belong to /code-review + TDD cycle — this skill only audits and reports

Community context

Security review skills are common in community collections. The hesreallyhim/awesome-claude-code list includes security-scanning workflows; the qaskills.sh collection has a security-audit skill covering OWASP, dependency CVEs, and secrets detection. This is a well-understood pattern — it's just missing here.

UI/UX Prototyping: Current State

The designing-ui-ux skill is a production-first design loop — it audits, designs, implements, and verifies in the real app. This is the right scope for a production htmx app.

Prototyping is different: it explores and communicates ideas before committing to production code. Goals are speed (3 layouts in an hour), isolation (test a component without the full app), and stakeholder communication (interactive mockups without backend).

What's missing: A rapid prototyping phase where Claude generates a static HTML mockup using the existing design tokens, serves it locally, and produces stakeholder-share screenshots — all before touching production templates. This should be added as a ## Rapid prototyping section to designing-ui-ux.

Proposed rapid prototyping workflow

  1. Read docs/DESIGN.md and app.css for current tokens and palette
  2. Generate standalone HTML mockup in docs/prototypes/YYYY-MM-DD-feature/index.html
  3. Serve: python3 -m http.server 9999 --directory docs/prototypes/
  4. Run screenshot-states.sh to capture stakeholder-share images
  5. Iterate on feedback before touching production templates
  6. When approved: port styles and structure to htmx templates

The Storybook + Claude Pattern

For JavaScript/React/Vue/Angular frontends (not this htmx repo specifically, but worth documenting for future stacks), the strongest 2026 pattern for UI prototyping is Storybook + Claude Code rather than external tools.

Why Storybook over Lovable / v0.dev / bolt.new

  • Prototypes live in the codebase — CI/CD catches regressions automatically
  • Uses real production components — no handoff gap between prototype and production
  • CLAUDE.md documents the design system; Claude always uses real tokens
  • @storybook/addon-mcp exposes list_components, get_component_props, get_component_source as MCP tools

The workflow

  1. Install @storybook/addon-mcp — exposes component metadata to Claude as tools
  2. Claude queries your component library: list_components, get_component_source
  3. Claude writes a Story (ComponentName.stories.ts) with mock data and interaction tests
  4. Visual regression testing runs automatically on each story via Playwright
  5. Stakeholders view the deployed Storybook (Chromatic / self-hosted) for interactive review
  6. When approved: Claude ports the Story's component usage directly to production routes

flight505/storybook-assistant provides a ready-made Claude Code plugin with 18 skills, 12 slash commands, and 3 agents for this workflow, supporting React, Vue, Angular, Next.js, Svelte, Solid, and Tauri.

SPA Module Isolation Pattern

For SPAs, prototyping a new route or module in isolation before wiring it into the router:

  1. Create src/prototypes/feature-name/ with a standalone React root (or similar)
  2. Wire up Mock Service Worker (MSW) to intercept API calls with realistic fixture data
  3. Use an in-memory router for multi-step flows (wizard, tabs, breadcrumb navigation)
  4. Import design tokens from the same source as production
  5. When approved: promote to src/routes/feature/, remove MSW stubs

This lets Claude Code generate an entire interactive module prototype in a single session, with the user providing feedback on the live prototype before any production routing is touched.

Tools Comparison

| Approach | Tools | Best for | Main tradeoff |
|---|---|---|---|
| External AI prototyping | v0.dev, Lovable, bolt.new | Greenfield exploration, throwaway mockups | Disconnected from real codebase; handoff gap |
| Storybook + Claude Code | @storybook/addon-mcp + Claude | Component-first teams with design systems | Setup overhead; excellent long-term |
| Design tool bridge | Figma → Code Connect → Claude | Teams with dedicated designers and Figma | Requires Figma; preserves design intent well |
| UXPin Merge + Claude | UXPin + Claude API | Established design systems, enterprise | Paid tool; best for mature DS |
| Static HTML prototypes (this repo) | Claude + app.css tokens | Server-rendered htmx apps | No interactivity; excellent for layout review |

The 2026 consensus: prototypes should live in the codebase. External tools create a handoff gap that costs more engineering time than the initial speed gain.

Community Ecosystem

As of 2026, Claude Code skills follow the Agent Skills open standard, meaning skills written here also work (with minor adjustments) in Cursor, Gemini CLI, Codex CLI, and Antigravity IDE. Multiple "awesome" collections have emerged:

  • Best curated starting point. Quality-filtered list of skills, resources, and tools for Claude Code workflows.
  • Most comprehensive list: skills, agents, hooks, orchestrators, status lines, developer tooling, and all latest features. Currently being restructured.
  • 135 agents, 35 curated skills, 42 commands, 176+ plugins, 20 hooks, 15 rules, 7 templates, 14 MCP configs, 26 companion apps. The most batteries-included toolkit.
  • 337+ production-ready skills across 16 domains for 13 AI platforms. Strong on engineering, compliance (ISO/SOC 2/GDPR), C-suite personas, and security-first architecture.
  • 1 000+ community agent skills from official dev teams and community contributors, compatible across coding agents.
  • 1 500+ installable skills with a CLI installer, bundles, workflows, and official/community collections. Multi-platform.

Notable Community Skills

| Skill | What it does | Source |
|---|---|---|
| obra/superpowers | 20+ battle-tested utilities: TDD, debugging, collaboration. Commands like /brainstorm, /write-plan | travisvn list |
| Trail of Bits security | Static analysis, variant analysis, code auditing, vulnerability detection | travisvn list |
| chaos-engineer | Designs chaos experiments with Litmus Chaos, toxiproxy, Chaos Monkey; outputs manifests, runbooks, post-mortem templates | jeffallan |
| playwright-skill | Full Playwright browser automation framework with Claude Code integration | community |
| ffuf-web-fuzzing | HTTP-level penetration testing with authenticated request handling | community |
| storybook-assistant | 18 skills, 12 slash commands, 3 agents for Storybook+Claude workflow; visual regression, a11y, design system integration | flight505 |
| skill-creator | Anthropic meta-skill: builds new skills through interactive Q&A with eval-driven optimization and trigger testing | Anthropic |
| loki-mode | Orchestrates 37 AI agents across 6 swarms for autonomous multi-domain work | community |

Reference: The 24-Skill QA Suite

MiniKao's open-source QA toolkit (2026) is the most complete community testing skill collection. Organized into 8 categories, with three operation modes (full-MCP, partial-MCP, markdown-only):

| Category | Skills |
|---|---|
| Test Design | test-master, flutter-test-master, test-review, regression-test, speckit-to-tc, tc-version-diff, sheet-md-sync, smoke-test-analyzer |
| Automation | test-automation, flutter-test-automation, tc-to-pytest |
| Bug Management | bug-report (RIDER format, JIRA dup check, git blame root cause, Slack notify) |
| Quality Quantification | mutation-testing (mutmut), property-based-test-gen (Hypothesis/fast-check) |
| Reporting | publish-regression |
| Performance & Security | performance-test-gen, security-scan, api-contract-test |
| CI Health | visual-regression-gen, flaky-test-hunter |
| Quality Specialties | a11y-audit, localization-test, push-notification-test, test-data-factory |

Bold entries are skills that address gaps not yet covered in orchestra-skills.

Authoring Best Practices

Concise is key
Claude is already smart. Only add context Claude doesn't already have. Challenge every paragraph: "Does Claude need this explanation?" The context window is shared with conversation history, other skills, and system prompts.
Degrees of freedom
High freedom (text instructions): multiple valid approaches. Medium freedom (pseudocode): preferred pattern with variation. Low freedom (exact script): fragile, exact sequence required. Match specificity to task fragility.
Evaluations first
Write 3 test scenarios BEFORE writing the skill body. This ensures you're solving real problems, not imagined ones. Baseline Claude's performance without the skill, then measure improvement.
Test with all models
Skills are model-dependent. Haiku needs more detail; Opus needs less verbosity. Test with Haiku (does it provide enough guidance?), Sonnet (is it clear and efficient?), and Opus (does it over-explain?).
Progressive disclosure
SKILL.md body under 500 lines. Split content into separate reference files when approaching this limit. Keep all reference links one level deep from SKILL.md — never chain references deeper.
Feedback loops
For validation-heavy skills: run validator → fix errors → repeat. The plan-validate-execute pattern catches errors before irreversible changes. Scripts should surface errors with specific messages, not generic failures.
Scripts execute, not load
Scripts in scripts/ are run via bash — their code never enters the context window. Only their output costs tokens. This makes scripts far more efficient than asking Claude to generate equivalent code.
Develop with Claude
Use Claude A to write the skill, Claude B (fresh instance) to test it on real tasks. Observe where Claude B struggles or misses rules, then return to Claude A with specific observations. The iterative cycle beats writing from assumptions.

Naming & Descriptions

Name field

| Pattern | Example | Verdict |
|---|---|---|
| Gerund form (recommended) | processing-pdfs, hardening-tests | ✅ Best |
| Noun phrase | pdf-processing, test-hardening | ✅ Good |
| Action verb | process-pdfs, harden-tests | ✅ Good |
| Too vague | helper, utils, tools | ❌ Avoid |
| Reserved word | anthropic-helper, claude-tools | ❌ Invalid |
| Uppercase / spaces | Process PDFs | ❌ Invalid |

Description field — the most critical part

Claude uses the description to pick the right skill from potentially 100+ available. It must answer: what does it do AND when should I use it. Formula: [Action verb] [what] — Use when [trigger conditions].

✅ Good:
description: >
  Extracts text and tables from PDF files, fills forms, merges documents.
  Use when working with PDF files or when the user mentions PDFs, forms,
  or document extraction.

❌ Bad:
description: Helps with documents

  • Always write in third person (injected into system prompt)
  • Max 1 024 characters; no XML tags; no reserved words
  • Include at least 2–3 specific trigger terms users might type

Anti-patterns to Avoid

| Anti-pattern | Problem | Fix |
|---|---|---|
| Too many options | Causes analysis paralysis | Provide one default; name the escape hatch explicitly |
| Punting errors to Claude | Unreliable; Claude can't recover what a script swallowed | Handle error conditions explicitly in scripts |
| Windows-style paths (\) | Breaks on Unix, where most CI runs | Always use forward slashes |
| Voodoo constants | TIMEOUT=47: nobody knows why | Brief inline justification for every non-obvious value |
| Deeply nested references | Claude may partially read chained files | All references link one level deep from SKILL.md |
| Time-sensitive information | "If before Aug 2025…" rots immediately | Use "old patterns" sections with <details> collapse |
| Inconsistent terminology | Same concept named 3 ways, which confuses Claude | Pick one term and use it throughout the skill |
| No evaluation before authoring | Skill solves imagined problems | Run Claude on representative tasks first; document failures |

Skill Composition Patterns

Skills compose by having explicit boundary rules — each skill states what it does and what it defers to other skills. This prevents overlap and ensures Claude routes to the right specialist.

Orchestra-skills demonstrates this well:

auditing-code-quality boundaries:
  Bugs / correctness         → /code-review (built-in)
  Mechanical simplification  → /simplify (built-in)
  Module structure           → improving-architecture
  Test quality               → hardening-tests

hardening-tests boundaries:
  Creates tests              → practicing-tdd, authoring-tests
  Attacks & strengthens them → this skill
  Product bugs found         → file via tracking system, fix with TDD

exploring-quality boundaries:
  Finds problems             → this skill
  Locks in a11y/visual fixes → authoring-tests
  Schedules filed bugs       → tracking issues

Document boundaries explicitly. Every skill should answer "what do I NOT do?" as clearly as "what do I do?". This is what allows skills to be composed without stepping on each other.

Recommended New Skills

security-reviewing
Invocation: /security-reviewing
Systematic security audit tailored to this codebase: run govulncheck for CVEs, gosec for SAST findings, scan for hardcoded secrets, walk OWASP Top 10 against the active code, and produce a ranked findings report to docs/SECURITY_REVIEW.md. Boundaries: audits and reports only — fixes go through the normal /code-review + TDD cycle.
skills/security-reviewing/
├── SKILL.md
├── references/
│   ├── owasp-top10.md        ← Go-specific OWASP checklist
│   ├── secrets-patterns.md   ← regex patterns for credential detection
│   └── vuln-triage.md        ← CVE severity guide + fix priority rules
└── scripts/
    ├── govulncheck-run.sh    ← dependency CVE scan
    ├── gosec-run.sh          ← SAST security lint
    └── secrets-scan.sh       ← grep for hardcoded creds
profiling-performance
Invocation: /profiling-performance
Systematically profile Go performance, detect regressions against stored baselines, enforce P95 latency SLOs in CI, and run soak tests for leak detection. Fills the gap between "we have bench_test.go" and "we fail CI when P95 degrades."
skills/profiling-performance/
├── SKILL.md
├── references/
│   ├── slo-thresholds.md
│   └── profiling-guide.md
└── scripts/
    ├── bench-compare.sh      ← benchstat diff vs baseline
    ├── update-baseline.sh
    └── soak-run.sh
fuzzing-inputs
Invocation: /fuzzing-inputs
Scaffold and run Go corpus-driven fuzz tests for parsers, decoders, and user-input handlers. Identifies fuzz targets, writes func FuzzXxx(f *testing.F) with seed corpus, runs time-boxed fuzzing in CI, and stores crash-reproducing corpus entries.
skills/fuzzing-inputs/
├── SKILL.md
├── references/
│   ├── fuzz-target-patterns.md
│   └── corpus-management.md
└── scripts/
    ├── find-fuzz-targets.sh
    └── fuzz-timed.sh         ← run for N seconds, capture crashes
stress-testing-resilience
Invocation: /stress-testing-resilience
Verify the system degrades gracefully under failure and load. Maps failure modes from CODEBASE_MAP.md, runs controlled fault injection (toxiproxy latency, connection floods), asserts steady-state recovery within SLO, and produces a resilience report.
skills/stress-testing-resilience/
├── SKILL.md
├── references/
│   ├── failure-modes.md
│   ├── steady-state.md
│   └── safety-guardrails.md  ← blast-radius rules
└── scripts/
    ├── latency-inject.sh
    ├── connection-flood.sh
    └── goroutine-check.sh
Enhancement: authoring-tests → visual regression
Add ## Visual regression tests section to authoring-tests SKILL.md
Add Playwright's built-in screenshot comparison to the test layers — zero external dependencies, CI-enforceable, per-state coverage. Update the layer matrix to include visual as a fifth layer. No new skill needed; this is a targeted enhancement.
Enhancement: designing-ui-ux → rapid prototyping phase
Add ## Rapid prototyping section to designing-ui-ux SKILL.md
Add Phase 0 before the Audit → Design → Implement loop. Claude generates a standalone HTML mockup using existing design tokens, serves it locally, captures stakeholder-share screenshots, and iterates — all before touching production templates.

Phased Roadmap

Phase 1 — Q3 2026: Fill Critical Gaps
| Item | Type | Effort | Impact |
|---|---|---|---|
| security-reviewing skill | New skill | Medium | High: no CVE scan, SAST, or secrets detection today |
| Visual regression in authoring-tests | Enhancement | Small | High: zero-dependency, immediate CI value |
| Rapid prototyping in designing-ui-ux | Enhancement | Small | High: faster design iteration |
| fuzzing-inputs skill | New skill | Medium | High: Go 1.18+ fuzz is first-class and currently untapped |
| profiling-performance skill | New skill | Medium | High: no current perf regression detection |

Phase 2 — Q4 2026: Stability & Resilience
| Item | Type | Effort | Impact |
|---|---|---|---|
| stress-testing-resilience skill | New skill | Large | Medium-High: important pre-release gate |
| Soak test pattern in profiling-performance | Enhancement | Small | Medium: catches slow memory leaks |

Phase 3 — Q1 2027: Ecosystem Integration
| Item | Type | Effort | Impact |
|---|---|---|---|
| Storybook MCP integration in designing-ui-ux | Enhancement | Medium | Medium: relevant if frontend stack changes |
| Consumer-contract testing in registering-contracts | New skill | Large | Low now, critical if multi-service |
| Publish to Agent Skills open standard registry | Distribution | Small | High: community discoverability |

Skill count projection

| Phase | Skills | Change |
|---|---|---|
| Today | 13 | |
| Phase 1 complete | 16 | +3 new skills, 2 enhancements |
| Phase 2 complete | 17 | +1 new skill, 1 enhancement |
| Phase 3 complete | 19 | +2 integrations |

External resources