How I run 4-8 parallel coding agents with tmux and markdown specs
I’ve been running 4-8 parallel coding agents as my primary development setup since January 2025. No orchestrator, no custom infrastructure. Just tmux, markdown files, bash aliases, and six slash commands.
Each agent takes on a role:
| Agent | What it does |
|---|---|
| Planner | Explore code, design features, iterate on specs |
| Worker | Implement from a finished spec, atomic commits, verify |
| PM | Backlog grooming, prioritization, idea dumping |
These are not formal subagents. No special system prompts, no skill definitions, no subagent configs. Just a naming convention.
The core idea: every unit of work gets a written spec (I call them Feature Designs, or FDs) before any agent writes code. Agents pick up specs, execute them, and close them. This separation of planning and execution is what makes parallelism work. This has produced 300+ completed FDs (small-to-medium scoped changes; one spec, one implementation pass) across three projects.
/fd-init bootstraps the full system into any repo. This article walks through how it works.
Feature Design tracking
Each FD is:
- A numbered spec file (FD-001, FD-002, etc.) with the problem, solution, files to change, and verification plan
- Tracked in an index across all FDs
- Managed through slash commands for the full lifecycle
Each FD file lives in docs/features/ and moves through 8 stages:
| Stage | What it means |
|---|---|
| Planned | Identified, not yet designed |
| Design | Actively designing the solution |
| Open | Designed, ready for implementation |
| In Progress | Currently being implemented |
| Pending Verification | Code complete, awaiting runtime verification |
| Complete | Verified working, ready to archive |
| Deferred | Postponed indefinitely |
| Closed | Won’t fix |
Six slash commands handle the lifecycle:
| Command | What it does |
|---|---|
| /fd-new | Create a new FD from an idea dump |
| /fd-status | Show the index: what’s active, pending verification, and done |
| /fd-explore | Bootstrap a session: load architecture docs, dev guide, FD index |
| /fd-deep | Launch 4 parallel Opus agents to explore a hard design problem |
| /fd-verify | Proofread code, propose a verification plan, commit |
| /fd-close | Archive the FD, update the index, update the changelog |
Every commit ties back to its FD: `FD-049: Implement incremental index rebuild`. The changelog accumulates automatically as FDs complete.
A typical FD file looks like this:
FD-051: Multi-label document classification
Status: Open · Priority: Medium
Effort: Medium · Impact: Better recall for downstream filtering
## Problem
Incoming documents get a single category label, but many span
multiple topics. Downstream filters miss relevant docs because
the classifier forces a single best-fit.
## Solution
Replace single-label classification with multi-label:
1. Use an LLM to assign confidence scores per category.
2. Accept all labels above 0.90 confidence.
3. For ambiguous scores (0.50-0.90), run a second LLM pass with few-shot examples to confirm.
4. Store all labels with scores so downstream queries can threshold flexibly.
## Files to Modify
- src/classify/multi_label.py (new: LLM-based multi-label logic)
- src/classify/prompts.py (new: few-shot templates for ambiguous cases)
- sql/01_schema.sql (add document_labels table with scores)
- sql/06_classify_job.sql (new: scheduled classification after ingestion)
## Verification
1. Run classifier on staging document table
2. Verify no errors in operation log, run health checks
3. Spot-check: docs with known multi-topic content have expected labels
4. Run tests, confirm downstream filters respect confidence threshold
The FEATURE_INDEX.md tracks status across all FDs:
## Active Features
| FD | Title | Status | Effort | Priority |
|--------|-------------------------------------|----------------------|--------|----------|
| FD-051 | Multi-label document classification | Open | Medium | Medium |
| FD-052 | Streaming classification pipeline | In Progress | Large | High |
| FD-050 | Confidence-based routing | Pending Verification | Medium | High |
## Completed
| FD | Title | Completed | Notes |
|--------|-------------------------------------|------------|----------------|
| FD-049 | Incremental index rebuild | 2026-02-20 | 45 min → 2 min |
| FD-048 | LLM response caching | 2026-02-18 | |
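The index is plain markdown, so status checks are easy to script. As a hedged illustration (not the actual /fd-status implementation; the function name is mine), a few lines of Python can tally statuses from rows shaped like the ones above:

```python
import re

# Hypothetical sketch: tally FD statuses from FEATURE_INDEX.md rows.
# Assumes pipe-delimited table rows whose first cell is an FD-### id
# and whose third cell is the status, as in the index above.
def fd_status_counts(index_text: str) -> dict:
    counts = {}
    for line in index_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 3 and re.match(r"FD-\d+$", cells[0]):
            counts[cells[2]] = counts.get(cells[2], 0) + 1
    return counts

index = """\
| FD     | Title                               | Status               | Effort | Priority |
| FD-051 | Multi-label document classification | Open                 | Medium | Medium   |
| FD-052 | Streaming classification pipeline   | In Progress          | Large  | High     |
| FD-050 | Confidence-based routing            | Pending Verification | Medium | High     |
"""
print(fd_status_counts(index))
# → {'Open': 1, 'In Progress': 1, 'Pending Verification': 1}
```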
Portable: /fd-init
The original project’s FD system was built slowly over months. I wanted the same structure in every new project without repeating that work, so I packaged it as a slash command.
Run /fd-init in any repo and it:
- Infers project context from CLAUDE.md, package configs, and git log
- Creates the directory structure (`docs/features/`, `docs/features/archive/`)
- Generates a `FEATURE_INDEX.md` customized to the project
- Creates an FD template
- Installs the slash commands (`/fd-new`, `/fd-status`, `/fd-explore`, `/fd-deep`, `/fd-verify`, `/fd-close`)
- Appends FD lifecycle conventions to the project’s CLAUDE.md
* FD System Initialized
Files Created
- docs/features/FEATURE_INDEX.md — Feature index
- docs/features/TEMPLATE.md — FD file template
- docs/features/archive/ — Archive directory
- CHANGELOG.md — Changelog (Keep a Changelog format)
- CLAUDE.md — Project conventions with FD management section
- .claude/commands/fd-new.md — Create new FD
- .claude/commands/fd-explore.md — Project exploration
- .claude/commands/fd-deep.md — Deep parallel analysis
- .claude/commands/fd-status.md — Status and grooming
- .claude/commands/fd-verify.md — Verification workflow
- .claude/commands/fd-close.md — Close and archive FD with changelog update
Next Steps
1. Run /fd-new to create your first feature design
2. Run /fd-status to check the current state
The development loop
How I plan
I spend most of the time working with Planners. Each one starts with /fd-explore, which loads codebase context and past work so the agent doesn’t start from zero: architecture docs, dev guide, readmes, FD index. I customize it per project as it grows.
From there, I work through the FD design:
on fd14 - can we move the batch job to event-driven? what does the retry logic look like if the queue backs up?
In Boris Tane’s How I Use Claude Code, he describes how he uses inline annotations to give Claude feedback. I adapted this pattern for complex FDs where conversational back-and-forth can be imprecise. I edit the FD file directly in Cursor and add inline annotations prefixed with %%:
## Solution
Replace cron-based batch processing with an event-driven pipeline.
Consumer pulls from the queue, processes in micro-batches of 50.
%% what's the max queue depth before we start dropping? need backpressure math
Run both in parallel for 48h, compare outputs, then kill the cron job.
Failures go to the dead-letter queue.
%% what happens to in-flight items during cutover? need to confirm drain behavior
Then in Claude Code:
fd14 - check %% notes.
Claude revises the design, removes the annotations, and the cycle repeats until the design is solid.
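Because the annotations are just %%-prefixed lines in markdown, it's easy to list the ones still open before handing an FD back. A hypothetical sketch (directory layout as described above; the helper name is mine):

```python
from pathlib import Path

# Hypothetical sketch: list unresolved %% annotations across FD files
# in docs/features/, following the conventions described above.
def open_annotations(features_dir: str = "docs/features"):
    notes = []
    root = Path(features_dir)
    if not root.is_dir():
        return notes
    for fd_file in sorted(root.glob("FD-*.md")):
        for lineno, line in enumerate(fd_file.read_text().splitlines(), start=1):
            if line.lstrip().startswith("%%"):
                notes.append((fd_file.name, lineno, line.strip()))
    return notes

for name, lineno, note in open_annotations():
    print(f"{name}:{lineno}  {note}")
```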
For critical designs, I may do two things:
- Cross-check the plan in Cursor with GPT or Gemini to catch blind spots.
- Run /fd-deep, which launches 4 Opus agents in parallel to explore different angles:

are you sure the consumer service account has write access to the output table? use /fd-deep.
Each agent runs in read-only Explore mode with a specific angle to investigate (algorithmic, structural, incremental, environmental, or whatever fits the problem). Once they report back, the orchestrator verifies their factual claims (file paths, function signatures, behavioral assumptions), flags contradictions, and synthesizes a ranked recommendation with confidence levels, tradeoffs, and a concrete first step.
The pattern borrows from GPT-5 Pro’s parallel test-time compute¹, adapted for design questions where there’s no single correct answer.
Worker execution
Once an FD is marked Open, a Worker picks it up. I point it at the FD, turn on plan mode so Claude builds a line-level implementation plan, review it, then switch to accept edits and let it run. Most FDs are self-contained: one design, one implementation pass, working on a dev branch. When a feature needs isolation, I tell the agent to create a git worktree. Claude Code handles it natively. The finished FD contains all the files and details, so even after compaction the Worker stays on track.
Verification
When Workers finished, I kept typing the same things:
proofread your code end to end, must be airtight
check for edge cases again
commit now, then create a verification plan on live test deployment.
Agents consistently find more issues when prompted to review their own work. So I built /fd-verify. It does a proofread pass, proposes a verification plan, and commits.
Some projects go further with dedicated slash commands like /test-cli that run full verification against real deployments. These aren’t traditional test suites. There’s no test runner and no assert statements. The agent reads markdown instructions, executes commands against real infrastructure, reasons about whether the results are correct, and writes structured results: markdown files with tables, timestamps, and diagnostic notes.
When something fails, the agent can investigate on the spot rather than just flagging it. By the end, the result comes back diagnosed. For systems that are inherently async and run on real data, an LLM following markdown instructions is a more natural verification harness than pytest.
One cycle
Putting it all together:
PM window:
1. /fd-status ← What's active, what's pending, what's done
2. Pick an FD (or /fd-new) ← Groom the backlog or dump a new idea
Planner window (new agent session):
3. /fd-explore ← Load project context
4. Design the FD ← /fd-deep if stuck, cross-check in Cursor
5. FD status → Open ← Design is solid, ready for implementation
Worker window (fresh agent session):
6. /fd-explore ← Fresh context load
7. "Implement FD-XXX" (plan mode) ← Claude builds a line-level implementation plan
8. Implement with atomic commits ← FD-XXX: description
9. /fd-verify ← Proofread, verification plan
10. Test on real deployment ← Verification skills or manual
11. /fd-close ← Archive, update index, changelog
The Planner and Worker are separate sessions on purpose. Planning can burn through multiple context windows as the agent explores the codebase, and compaction tends to drop files the Planner still needs. I always start Workers fresh with just the FD, or with /fd-explore when they need broader project context.
Where the decisions live
FD files as decision traces
The development loop produces a trail of FD files. Each one captures more than the task itself: what was considered, what was chosen, what was rejected, and why. In practice, when a new agent picks up an FD, it may launch an Explore subagent that, unprompted, finds past FDs with related work. The agent arrives with context about prior decisions. The FD archive is institutional memory that accumulates with every completed feature.
The dev guide
Every project accumulates practical lessons. The dev guide (docs/dev_guide/) captures these as short entries. Agents read a summary on session start and can go deeper into any specific entry when it’s relevant to the task. Unlike the FD system (which bootstraps in seconds via /fd-init), the dev guide grows organically. Each lesson becomes a new entry as it comes up.
For example:
| Entry | What it covers |
|---|---|
| No silent fallback values | Config errors fail loudly instead of hiding behind defaults |
| DRY: extract helpers and utilities | Don’t rewrite the same parser or validation logic twice |
| No backwards compatibility | All deployments are test environments, no migration code necessary |
| Structured logging conventions | Uniform log format across all features |
| Embedding handling | Always normalize embeddings at ingestion, never trust raw format from the database driver |
| Deployment safety | Destructive ops must wait for running tasks to complete before deploying |
| LLM JSON parsing | Always parse with lenient mode and regex fallback, never raw json.loads() |
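The last entry can be made concrete. This is a hedged sketch of what "lenient mode with a regex fallback" might look like, not the project's actual helper:

```python
import json
import re

# Hypothetical sketch of the "lenient parse with regex fallback" rule:
# try strict json.loads first, then fall back to extracting the first
# {...} span embedded in surrounding LLM prose and retrying.
def parse_llm_json(text: str):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None

reply = 'Sure! Here is the result:\n{"labels": ["finance", "tech"], "confidence": 0.93}\nLet me know.'
print(parse_llm_json(reply))
# → {'labels': ['finance', 'tech'], 'confidence': 0.93}
```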
The dev guide is separate from CLAUDE.md on purpose. CLAUDE.md loads into every session, so it stays lean: commit style, tool preferences, hard guardrails. The dev guide entries are denser, often with inline code examples, and load on demand via /fd-explore when they’re relevant to the current task.
Two-tier CLAUDE.md
Claude Code loads a CLAUDE.md file at the start of every session. I split this into two tiers:
Global (~/.claude/CLAUDE.md) sets rules that apply everywhere: no AI attribution in commits, Python and SQL conventions, and never bypass denied commands.
Project-level (<repo>/CLAUDE.md) adds project conventions and FD lifecycle rules (written by /fd-init).
How it compounds
Past FDs act as decision traces. Early on in a project, one FD focused on performance work in a hot path and captured what we tried, what worked, and what didn’t. Weeks later, another agent touched that same path, found the FD, and asked whether we should benchmark before making changes. Without that record, I would have needed to remember and catch it myself, or the change would have gone in without the extra check.
The physical setup
┌────────────────────────┬────────────────────────┬────────────────────────┐
│ │ │ │
│ Cursor (IDE) │ Ghostty Terminal 1 │ Ghostty Terminal 2 │
│ │ tmux │ tmux │
│ │ │ │
│ Visual browsing │ Window 1: PM │ Window 1: Worker │
│ Hand edits │ Window 2: Planner │ Window 2: Worker │
│ Cross-model checks │ Window 3: Planner │ Window 3: Worker │
│ │ Window 4: Planner │ Window 4: bash │
│ │ │ │
└────────────────────────┴────────────────────────┴────────────────────────┘
Three panels across an ultrawide monitor:
- Cursor (left) for visual code browsing, hand edits, and cross-checking plans with other models like GPT or Gemini.
- Two Ghostty terminals (middle and right), each running a tmux session.
Two coding agents across the terminals:
- Claude Code is my daily driver for general-purpose coding.
- Cortex Code is Snowflake’s coding agent, built for end-to-end data workflows and Snowflake-aware out of the box (catalog, semantics, governance). It runs the latest Opus model and loads the same `CLAUDE.md` file.
I use mostly vanilla tmux to navigate: Ctrl-b n/p to cycle windows, Ctrl-b , to rename them (planner, worker-fd038, PM), Ctrl-b c to spin up a new agent, Ctrl-b s to browse sessions. A few custom additions: Shift-Left/Right to reorder windows, m to move a window between sessions, and renumber-windows on so closing a tab doesn’t leave gaps.
Every project gets a g* alias (“go to”) for instant navigation:
| Alias | Project |
|---|---|
| `gapi` | ~/workspace/api-service |
| `gpipeline` | ~/workspace/data-pipeline |
| `gdatakit` | ~/workspace/datakit |
| `gclaude` | ~/.claude |
Claude reads them too. I tell Claude:
run the eval in gpipeline
and it resolves the alias to the actual path.
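The g* aliases themselves are nothing special; a hypothetical ~/.zshrc (or ~/.bashrc) fragment might look like:

```shell
# Hypothetical shell config fragment: one "go to" alias per project.
alias gapi='cd ~/workspace/api-service'
alias gpipeline='cd ~/workspace/data-pipeline'
alias gdatakit='cd ~/workspace/datakit'
alias gclaude='cd ~/.claude'
```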
When an agent finishes, the tmux tab changes color. Two config layers make this work:
| Layer | File | What it does |
|---|---|---|
| Claude Code | `~/.claude/settings.json` | Notification hook (matcher: `idle_prompt`) sends a bell (`\a`) to the terminal |
| tmux | `~/.tmux.conf` | `monitor-bell on`, `bell-action any`, `window-status-bell-style reverse` |
Agent goes idle, Claude Code fires the hook, tmux catches the bell and inverts the tab color.
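Assuming standard tmux option syntax (monitor-bell and window-status-bell-style are window options, hence setw), the tmux side of that wiring is three lines:

```
# ~/.tmux.conf — sketch of the bell wiring described above
setw -g monitor-bell on                   # watch every window for a bell
set -g bell-action any                    # react to bells from any window
setw -g window-status-bell-style reverse  # invert the tab color on bell
```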
What’s hard
With 6+ agents running, there’s always something waiting for me, like a Planner with design questions or a Worker ready for verification. Managing that is where the system starts to strain.
Cognitive load is the real ceiling. Around 8 agents is my practical max. Past that, I lose track of what each one is doing and design decisions suffer.
Not everything parallelizes. Some features have sequential dependencies. Forcing parallelism on inherently serial work creates merge conflicts and wasted effort.
Context window limits. Planners burn through context windows fast. When compaction kicks in, it can drop files the agent needs to continue the design. I’ve learned to checkpoint FD progress before compaction hits.
Sandbox whack-a-mole. I deny destructive commands (`rm`, `git reset --hard`, `DROP`). The agent finds creative alternatives: `unlink`, `python -c "import os; os.remove()"`, `find ... -delete`. The permission system has evaluation-order quirks where blanket allows override specific denies. My CLAUDE.md now says “If a command is denied, that’s the answer. Ask the user to do it.”
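For reference, the denies live in the same settings.json as the notification hook. A hypothetical fragment (rule shape follows Claude Code's documented permission syntax; the exact patterns here are mine, and as noted above they don't catch every workaround):

```
{
  "permissions": {
    "deny": [
      "Bash(rm:*)",
      "Bash(git reset:*)",
      "Bash(unlink:*)"
    ]
  }
}
```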
Translating business context into FDs is still manual. Jira tickets, Slack threads, meeting notes, product decisions. I’m the bridge between all of that and a well-scoped FD. A dedicated subagent profile would close this gap.
1. OpenAI describes GPT-5 Pro as using "scaled but efficient parallel test-time compute." Nathan Lambert, on Lex Fridman #490, discusses the broader pattern of inference-time scaling: giving models more compute at generation time to explore multiple reasoning paths.
If you try this, I'd love to hear what you change.