How I run 4-8 parallel coding agents with tmux and markdown specs
I’ve been running 4-8 parallel coding agents as my primary development setup since January 2025. No orchestrator, no custom infrastructure. Just tmux, markdown files, bash aliases, and six slash commands.
Each agent takes on a role:
| Agent | What it does |
|---|---|
| Planner | Explore code, design features, iterate on specs |
| Worker | Implement from a finished spec, atomic commits, verify |
| PM | Backlog grooming, prioritization, idea dumping |
These are not formal subagents. No special system prompts, no skill definitions, no subagent configs. Just a naming convention.
The core idea: every unit of work gets a written spec (I call them Feature Designs, or FDs) before any agent writes code. Agents pick up specs, execute them, and close them. This separation of planning and execution is what makes parallelism work. This has produced 300+ completed FDs (small-to-medium scoped changes; one spec, one implementation pass) across three projects.
/fd-init bootstraps the full system into any repo. This article walks through how it works.
Feature Design tracking
Each FD is:
- A numbered spec file (FD-001, FD-002, etc.) with the problem, solution, files to change, and verification plan
- Tracked in an index across all FDs
- Managed through slash commands for the full lifecycle
Each FD file lives in docs/features/ and moves through 8 stages:
| Stage | What it means |
|---|---|
| Planned | Identified, not yet designed |
| Design | Actively designing the solution |
| Open | Designed, ready for implementation |
| In Progress | Currently being implemented |
| Pending Verification | Code complete, awaiting runtime verification |
| Complete | Verified working, ready to archive |
| Deferred | Postponed indefinitely |
| Closed | Won’t fix |
Six slash commands handle the lifecycle:
| Command | What it does |
|---|---|
| /fd-new | Create a new FD from an idea dump |
| /fd-status | Show the index: what’s active, pending verification, and done |
| /fd-explore | Bootstrap a session: load architecture docs, dev guide, FD index |
| /fd-deep | Launch 4 parallel Opus agents to explore a hard design problem |
| /fd-verify | Proofread code, propose a verification plan, commit |
| /fd-close | Archive the FD, update the index, update the changelog |
Every commit ties back to its FD: `FD-049: Implement incremental index rebuild`. The changelog accumulates automatically as FDs complete.
A typical FD file looks like this:
FD-051: Multi-label document classification
Status: Open · Priority: Medium
Effort: Medium · Impact: Better recall for downstream filtering
## Problem
Incoming documents get a single category label, but many span
multiple topics. Downstream filters miss relevant docs because
the classifier forces a single best-fit.
## Solution
Replace single-label classification with multi-label:
1. Use an LLM to assign confidence scores per category.
2. Accept all labels above 0.90 confidence.
3. For ambiguous scores (0.50-0.90), run a second LLM pass with few-shot examples to confirm.
4. Store all labels with scores so downstream queries can threshold flexibly.
## Files to Modify
- src/classify/multi_label.py (new: LLM-based multi-label logic)
- src/classify/prompts.py (new: few-shot templates for ambiguous cases)
- sql/01_schema.sql (add document_labels table with scores)
- sql/06_classify_job.sql (new: scheduled classification after ingestion)
## Verification
1. Run classifier on staging document table
2. Verify no errors in operation log, run health checks
3. Spot-check: docs with known multi-topic content have expected labels
4. Run tests, confirm downstream filters respect confidence threshold
The FEATURE_INDEX.md tracks status across all FDs:
## Active Features
| FD | Title | Status | Effort | Priority |
|--------|-------------------------------------|----------------------|--------|----------|
| FD-051 | Multi-label document classification | Open | Medium | Medium |
| FD-052 | Streaming classification pipeline | In Progress | Large | High |
| FD-050 | Confidence-based routing | Pending Verification | Medium | High |
## Completed
| FD | Title | Completed | Notes |
|--------|-------------------------------------|------------|----------------|
| FD-049 | Incremental index rebuild | 2026-02-20 | 45 min → 2 min |
| FD-048 | LLM response caching | 2026-02-18 | |
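The index is plain markdown, so status checks are easy to script. As a hedged illustration (not the actual /fd-status implementation; the function name is mine), a few lines of Python can tally statuses from rows shaped like the ones above:

```python
import re

# Hypothetical sketch: tally FD statuses from FEATURE_INDEX.md rows.
# Assumes pipe-delimited table rows whose first cell is an FD-### id
# and whose third cell is the status, as in the index above.
def fd_status_counts(index_text: str) -> dict:
    counts = {}
    for line in index_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 3 and re.match(r"FD-\d+$", cells[0]):
            counts[cells[2]] = counts.get(cells[2], 0) + 1
    return counts

index = """\
| FD     | Title                               | Status               | Effort | Priority |
| FD-051 | Multi-label document classification | Open                 | Medium | Medium   |
| FD-052 | Streaming classification pipeline   | In Progress          | Large  | High     |
| FD-050 | Confidence-based routing            | Pending Verification | Medium | High     |
"""
print(fd_status_counts(index))
# → {'Open': 1, 'In Progress': 1, 'Pending Verification': 1}
```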
Portable: /fd-init
The original project’s FD system was built slowly over months. I wanted the same structure in every new project without repeating that work, so I packaged it as a slash command.
Run /fd-init in any repo and it:
- Infers project context from CLAUDE.md, package configs, and git log
- Creates the directory structure (`docs/features/`, `docs/features/archive/`)
- Generates a `FEATURE_INDEX.md` customized to the project
- Creates an FD template
- Installs the slash commands (`/fd-new`, `/fd-status`, `/fd-explore`, `/fd-deep`, `/fd-verify`, `/fd-close`)
- Appends FD lifecycle conventions to the project’s CLAUDE.md
* FD System Initialized
Files Created
- docs/features/FEATURE_INDEX.md — Feature index
- docs/features/TEMPLATE.md — FD file template
- docs/features/archive/ — Archive directory
- CHANGELOG.md — Changelog (Keep a Changelog format)
- CLAUDE.md — Project conventions with FD management section
- .claude/commands/fd-new.md — Create new FD
- .claude/commands/fd-explore.md — Project exploration
- .claude/commands/fd-deep.md — Deep parallel analysis
- .claude/commands/fd-status.md — Status and grooming
- .claude/commands/fd-verify.md — Verification workflow
- .claude/commands/fd-close.md — Close and archive FD with changelog update
Next Steps
1. Run /fd-new to create your first feature design
2. Run /fd-status to check the current state
The development loop
How I plan
I spend most of the time working with Planners. Each one starts with /fd-explore, which loads codebase context and past work so the agent doesn’t start from zero: architecture docs, dev guide, readmes, FD index. I customize it per project as it grows.
From there, I work through the FD design:
on fd14 - can we move the batch job to event-driven? what does the retry logic look like if the queue backs up?
In Boris Tane’s How I Use Claude Code, he describes how he uses inline annotations to give Claude feedback. I adapted this pattern for complex FDs where conversational back-and-forth can be imprecise. I edit the FD file directly in Cursor and add inline annotations prefixed with %%:
## Solution
Replace cron-based batch processing with an event-driven pipeline.
Consumer pulls from the queue, processes in micro-batches of 50.
%% what's the max queue depth before we start dropping? need backpressure math
Run both in parallel for 48h, compare outputs, then kill the cron job.
Failures go to the dead-letter queue.
%% what happens to in-flight items during cutover? need to confirm drain behavior
Then in Claude Code:
fd14 - check %% notes.
Claude revises the design, removes the annotations, and the cycle repeats until the design is solid.
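Because the annotations are just %%-prefixed lines in markdown, it's easy to list the ones still open before handing an FD back. A hypothetical sketch (directory layout as described above; the helper name is mine):

```python
from pathlib import Path

# Hypothetical sketch: list unresolved %% annotations across FD files
# in docs/features/, following the conventions described above.
def open_annotations(features_dir: str = "docs/features"):
    notes = []
    root = Path(features_dir)
    if not root.is_dir():
        return notes
    for fd_file in sorted(root.glob("FD-*.md")):
        for lineno, line in enumerate(fd_file.read_text().splitlines(), start=1):
            if line.lstrip().startswith("%%"):
                notes.append((fd_file.name, lineno, line.strip()))
    return notes

for name, lineno, note in open_annotations():
    print(f"{name}:{lineno}  {note}")
```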
For critical designs, I may do two things:
- Cross-check the plan in Cursor with GPT or Gemini to catch blind spots.
- Run /fd-deep, which launches 4 Opus agents in parallel to explore different angles:

are you sure the consumer service account has write access to the output table? use /fd-deep.
Each agent runs in read-only Explore mode with a specific angle to investigate (algorithmic, structural, incremental, environmental, or whatever fits the problem). Once they report back, the orchestrator verifies their factual claims (file paths, function signatures, behavioral assumptions), flags contradictions, and synthesizes a ranked recommendation with confidence levels, tradeoffs, and a concrete first step.
The pattern borrows from GPT-5 Pro’s parallel test-time compute¹, adapted for design questions where there’s no single correct answer.
Worker execution
Once an FD is marked Open, a Worker picks it up. I point it at the FD, turn on plan mode so Claude builds a line-level implementation plan, review it, then switch to accept edits and let it run. Most FDs are self-contained: one design, one implementation pass, working on a dev branch. When a feature needs isolation, I tell the agent to create a git worktree. Claude Code handles it natively. The finished FD contains all the files and details, so even after compaction the Worker stays on track.
Verification
When Workers finished, I kept typing the same things:
proofread your code end to end, must be airtight
check for edge cases again
commit now, then create a verification plan on live test deployment.
Agents consistently find more issues when prompted to review their own work. So I built /fd-verify. It does a proofread pass, proposes a verification plan, and commits.
Some projects go further with dedicated slash commands like /test-cli that run full verification against real deployments. These aren’t traditional test suites. There’s no test runner and no assert statements. The agent reads markdown instructions, executes commands against real infrastructure, reasons about whether the results are correct, and writes structured results: markdown files with tables, timestamps, and diagnostic notes.
When something fails, the agent can investigate on the spot rather than just flagging it. By the end, the result comes back diagnosed. For systems that are inherently async and run on real data, an LLM following markdown instructions is a more natural verification harness than pytest.
One cycle
Putting it all together:
PM window:
1. /fd-status ← What's active, what's pending, what's done
2. Pick an FD (or /fd-new) ← Groom the backlog or dump a new idea
Planner window (new agent session):
3. /fd-explore ← Load project context
4. Design the FD ← /fd-deep if stuck, cross-check in Cursor
5. FD status → Open ← Design is solid, ready for implementation
Worker window (fresh agent session):
6. /fd-explore ← Fresh context load
7. "Implement FD-XXX" (plan mode) ← Claude builds a line-level implementation plan
8. Implement with atomic commits ← FD-XXX: description
9. /fd-verify ← Proofread, verification plan
10. Test on real deployment ← Verification skills or manual
11. /fd-close ← Archive, update index, changelog
The Planner and Worker are separate sessions on purpose. Planning can burn through multiple context windows as the agent explores the codebase, and compaction tends to drop files the Planner still needs. I always start Workers fresh with just the FD, or with /fd-explore when they need broader project context.
Where the decisions live
FD files as decision traces
The development loop produces a trail of FD files. Each one captures more than the task itself: what was considered, what was chosen, what was rejected, and why. In practice, when a new agent picks up an FD, it may launch an Explore subagent that, unprompted, finds past FDs with related work. The agent arrives with context about prior decisions. The FD archive is institutional memory that accumulates with every completed feature.
The dev guide
Every project accumulates practical lessons. The dev guide (docs/dev_guide/) captures these as short entries. Agents read a summary on session start and can go deeper into any specific entry when it’s relevant to the task. Unlike the FD system (which bootstraps in seconds via /fd-init), the dev guide grows organically. Each lesson becomes a new entry as it comes up.
For example:
| Entry | What it covers |
|---|---|
| No silent fallback values | Config errors fail loudly instead of hiding behind defaults |
| DRY: extract helpers and utilities | Don’t rewrite the same parser or validation logic twice |
| No backwards compatibility | All deployments are test environments, no migration code necessary |
| Structured logging conventions | Uniform log format across all features |
| Embedding handling | Always normalize embeddings at ingestion, never trust raw format from the database driver |
| Deployment safety | Destructive ops must wait for running tasks to complete before deploying |
| LLM JSON parsing | Always parse with lenient mode and regex fallback, never raw json.loads() |
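The last entry can be made concrete. This is a hedged sketch of what "lenient mode with a regex fallback" might look like, not the project's actual helper:

```python
import json
import re

# Hypothetical sketch of the "lenient parse with regex fallback" rule:
# try strict json.loads first, then fall back to extracting the first
# {...} span embedded in surrounding LLM prose and retrying.
def parse_llm_json(text: str):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None

reply = 'Sure! Here is the result:\n{"labels": ["finance", "tech"], "confidence": 0.93}\nLet me know.'
print(parse_llm_json(reply))
# → {'labels': ['finance', 'tech'], 'confidence': 0.93}
```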
The dev guide is separate from CLAUDE.md on purpose. CLAUDE.md loads into every session, so it stays lean: commit style, tool preferences, hard guardrails. The dev guide entries are denser, often with inline code examples, and load on demand via /fd-explore when they’re relevant to the current task.
Two-tier CLAUDE.md
Claude Code loads a CLAUDE.md file at the start of every session. I split this into two tiers:
Global (~/.claude/CLAUDE.md) sets rules that apply everywhere: no AI attribution in commits, Python and SQL conventions, and never bypass denied commands.
Project-level (<repo>/CLAUDE.md) adds project conventions and FD lifecycle rules (written by /fd-init).
How it compounds
Past FDs act as decision traces. Early on in a project, one FD focused on performance work in a hot path and captured what we tried, what worked, and what didn’t. Weeks later, another agent touched that same path, found the FD, and asked whether we should benchmark before making changes. Without that record, I would have needed to remember and catch it myself, or the change would have gone in without the extra check.
The physical setup
┌────────────────────────┬────────────────────────┬────────────────────────┐
│ │ │ │
│ Cursor (IDE) │ Ghostty Terminal 1 │ Ghostty Terminal 2 │
│ │ tmux │ tmux │
│ │ │ │
│ Visual browsing │ Window 1: PM │ Window 1: Worker │
│ Hand edits │ Window 2: Planner │ Window 2: Worker │
│ Cross-model checks │ Window 3: Planner │ Window 3: Worker │
│ │ Window 4: Planner │ Window 4: bash │
│ │ │ │
└────────────────────────┴────────────────────────┴────────────────────────┘
Three panels across an ultrawide monitor:
- Cursor (left) for visual code browsing, hand edits, and cross-checking plans with other models like GPT or Gemini.
- Two Ghostty terminals (middle and right), each running a tmux session.
Two coding agents across the terminals:
- Claude Code is my daily driver for general-purpose coding.
- Cortex Code is Snowflake’s coding agent, built for end-to-end data workflows and Snowflake-aware out of the box (catalog, semantics, governance). It runs the latest Opus model and loads the same `CLAUDE.md` file.
I use mostly vanilla tmux to navigate: Ctrl-b n/p to cycle windows, Ctrl-b , to rename them (planner, worker-fd038, PM), Ctrl-b c to spin up a new agent, Ctrl-b s to browse sessions. A few custom additions: Shift-Left/Right to reorder windows, m to move a window between sessions, and renumber-windows on so closing a tab doesn’t leave gaps.
Every project gets a g* alias (“go to”) for instant navigation:
| Alias | Project |
|---|---|
| `gapi` | ~/workspace/api-service |
| `gpipeline` | ~/workspace/data-pipeline |
| `gdatakit` | ~/workspace/datakit |
| `gclaude` | ~/.claude |
Claude reads them too. I tell Claude:
run the eval in gpipeline
and it resolves the alias to the actual path.
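The g* aliases themselves are nothing special; a hypothetical ~/.zshrc (or ~/.bashrc) fragment might look like:

```shell
# Hypothetical shell config fragment: one "go to" alias per project.
alias gapi='cd ~/workspace/api-service'
alias gpipeline='cd ~/workspace/data-pipeline'
alias gdatakit='cd ~/workspace/datakit'
alias gclaude='cd ~/.claude'
```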
When an agent finishes, the tmux tab changes color. Two config layers make this work:
| Layer | File | What it does |
|---|---|---|
| Claude Code | `~/.claude/settings.json` | Notification hook (matcher: `idle_prompt`) sends a bell (`\a`) to the terminal |
| tmux | `~/.tmux.conf` | `monitor-bell on`, `bell-action any`, `window-status-bell-style reverse` |
Agent goes idle, Claude Code fires the hook, tmux catches the bell and inverts the tab color.
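Assuming standard tmux option syntax (monitor-bell and window-status-bell-style are window options, hence setw), the tmux side of that wiring is three lines:

```
# ~/.tmux.conf — sketch of the bell wiring described above
setw -g monitor-bell on                   # watch every window for a bell
set -g bell-action any                    # react to bells from any window
setw -g window-status-bell-style reverse  # invert the tab color on bell
```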
What’s hard
With 6+ agents running, there’s always something waiting for me, like a Planner with design questions or a Worker ready for verification. Managing that is where the system starts to strain.
Cognitive load is the real ceiling. Around 8 agents is my practical max. Past that, I lose track of what each one is doing and design decisions suffer.
Not everything parallelizes. Some features have sequential dependencies. Forcing parallelism on inherently serial work creates merge conflicts and wasted effort.
Context window limits. Planners burn through context windows fast. When compaction kicks in, it can drop files the agent needs to continue the design. I’ve learned to checkpoint FD progress before compaction hits.
Sandbox whack-a-mole. I deny destructive commands (`rm`, `git reset --hard`, `DROP`). The agent finds creative alternatives: `unlink`, `python -c "import os; os.remove()"`, `find ... -delete`. The permission system has evaluation-order quirks where blanket allows override specific denies. My CLAUDE.md now says “If a command is denied, that’s the answer. Ask the user to do it.”
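For reference, the denies live in the same settings.json as the notification hook. A hypothetical fragment (rule shape follows Claude Code's documented permission syntax; the exact patterns here are mine, and as noted above they don't catch every workaround):

```
{
  "permissions": {
    "deny": [
      "Bash(rm:*)",
      "Bash(git reset:*)",
      "Bash(unlink:*)"
    ]
  }
}
```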
Translating business context into FDs is still manual. Jira tickets, Slack threads, meeting notes, product decisions. I’m the bridge between all of that and a well-scoped FD. A dedicated subagent profile would close this gap.
1. OpenAI describes GPT-5 Pro as using "scaled but efficient parallel test-time compute." Nathan Lambert, on Lex Fridman #490, discusses the broader pattern of inference-time scaling: giving models more compute at generation time to explore multiple reasoning paths.
If you try this, I'd love to hear what you change.