Orchestration Rules and State Machine

The orchestrator is the control layer that routes work between agents, enforces stage order, and handles failures. This document defines the rules that govern the orchestrator's behavior. It is mandatory per PRD-STD-009 REQ-009-05.

State Machine Definition

Every work item in the pipeline exists in exactly one state at any time. The orchestrator manages state transitions.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         ORCHESTRATION STATE MACHINE                         │
│                                                                             │
│  ┌──────────┐                                                               │
│  │  INTAKE  │                                                               │
│  └────┬─────┘                                                               │
│       │                                                                     │
│       v                                                                     │
│  ┌────────────┐  PASS   ┌──────────┐  PASS   ┌──────────────┐               │
│  │REQUIREMENTS│────────>│  DESIGN  │────────>│IMPLEMENTATION│               │
│  │ (Stage 1)  │         │(Stage 2) │         │  (Stage 3)   │               │
│  └─────┬──────┘         └────┬─────┘         └──────┬───────┘               │
│        │ FAIL                │ FAIL                 │ FAIL                  │
│        v                     v                      v                       │
│     REWORK                REWORK                 REWORK                     │
│    (Stage 1)             (Stage 2)              (Stage 3)                   │
│                                                     │ PASS                  │
│                                                     v                       │
│                           ┌──────────┐  PASS   ┌──────────┐                 │
│                           │ SECURITY │<────────│ TESTING  │                 │
│                           │(Stage 5) │         │(Stage 4) │                 │
│                           └────┬─────┘         └────┬─────┘                 │
│                                │ FAIL               │ FAIL                  │
│                                v                    v                       │
│                             REWORK               REWORK                     │
│                           (Stage 3/5)          (Stage 3/4)                  │
│                                │ PASS                                       │
│                                v                                            │
│                           ┌──────────┐  PASS   ┌──────────┐                 │
│                           │DEPLOYMENT│────────>│OPERATIONS│                 │
│                           │(Stage 6) │         │(Stage 7) │                 │
│                           └────┬─────┘         └────┬─────┘                 │
│                                │ FAIL               │ INCIDENT              │
│                                v                    v                       │
│                             REWORK              ROLLBACK                    │
│                            (Stage 6)            + REWORK                    │
│                                                     │                       │
│                                                 FEEDBACK                    │
│                                                 TO INTAKE                   │
│                                                                             │
│  Terminal states: DEPLOYED, ROLLED_BACK, CANCELLED                          │
└─────────────────────────────────────────────────────────────────────────────┘

State Definitions

| State | Description | Owner Agent | Valid Transitions |
|---|---|---|---|
| INTAKE | Work item received, not yet assigned | Orchestrator | REQUIREMENTS |
| REQUIREMENTS | Stage 1 active | product-agent, scrum-agent | DESIGN (pass), REWORK_REQ (fail) |
| DESIGN | Stage 2 active | architect-agent | IMPLEMENTATION (pass), REWORK_DESIGN (fail) |
| IMPLEMENTATION | Stage 3 active | developer-agent | TESTING (pass), REWORK_IMPL (fail) |
| TESTING | Stage 4 active | qa-agent, devmgr-agent | SECURITY (pass), REWORK_IMPL (code fix), REWORK_TEST (test fix) |
| SECURITY | Stage 5 active | security-agent, compliance-agent | DEPLOYMENT (pass), REWORK_IMPL (remediation), REWORK_SEC (scan config) |
| DEPLOYMENT | Stage 6 active | platform-agent | OPERATIONS (pass), REWORK_DEPLOY (config fix) |
| OPERATIONS | Stage 7 active | ops-agent, executive-agent | DEPLOYED (stable), ROLLBACK (incident) |
| REWORK_* | Rework in progress at specific stage | Stage-specific agent | Return to originating stage |
| DEPLOYED | Successfully in production (terminal) | - | INTAKE (new work via feedback) |
| ROLLED_BACK | Rolled back from production (terminal) | - | INTAKE (rework item created) |
| CANCELLED | Work item cancelled (terminal) | - | None |
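
As an illustration of how the table above might be enforced, the sketch below encodes the listed transitions as a lookup. `VALID_TRANSITIONS` and `can_transition` are hypothetical names, not part of PRD-STD-009. The REWORK_* states and ROLLBACK are left out on purpose, because the table does not spell out their outgoing transitions state by state.

```python
# Hypothetical encoding of the State Definitions table (names are illustrative).
# REWORK_* and ROLLBACK are omitted: the table does not list their exact targets.
VALID_TRANSITIONS = {
    "INTAKE": {"REQUIREMENTS"},
    "REQUIREMENTS": {"DESIGN", "REWORK_REQ"},
    "DESIGN": {"IMPLEMENTATION", "REWORK_DESIGN"},
    "IMPLEMENTATION": {"TESTING", "REWORK_IMPL"},
    "TESTING": {"SECURITY", "REWORK_IMPL", "REWORK_TEST"},
    "SECURITY": {"DEPLOYMENT", "REWORK_IMPL", "REWORK_SEC"},
    "DEPLOYMENT": {"OPERATIONS", "REWORK_DEPLOY"},
    "OPERATIONS": {"DEPLOYED", "ROLLBACK"},
    "DEPLOYED": {"INTAKE"},      # new work via feedback
    "ROLLED_BACK": {"INTAKE"},   # rework item created
    "CANCELLED": set(),          # terminal, no way out
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the table lists `target` as a valid next state."""
    return target in VALID_TRANSITIONS.get(current, set())
```

A table-driven check like this keeps the rules in one place, so a change to the state machine is a data change rather than a code change.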

Transition Rules

Forward Transitions (Happy Path)

Every forward transition requires:

  1. Gate criteria met — all checks for the current stage passed
  2. Handoff artifact produced — structured output per PRD-STD-009 REQ-009-06
  3. Target agent available — next agent has capacity and valid contract
  4. No blocking escalations — no unresolved escalation requests

# Example transition rule
transition:
  from: IMPLEMENTATION
  to: TESTING
  requires:
    - gate_3_passed: true
    - handoff_artifact: present
    - ai_metadata: [AI-Usage, Agent-IDs, AI-Prompt-Ref]
    - unit_tests: passing
    - lint: passing
  produces:
    - handoff: "HO-developer-agent-qa-agent-{timestamp}"
    - state_change: "IMPLEMENTATION → TESTING"
    - audit_record: true
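
The four forward-transition requirements can be checked together before the orchestrator advances a work item. The following is a minimal sketch under stated assumptions: `TransitionCheck` and `forward_blockers` are hypothetical names, and each requirement is reduced to a single pre-computed flag or counter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TransitionCheck:
    """Hypothetical snapshot of the four forward-transition requirements."""
    gate_passed: bool
    handoff_artifact_present: bool
    target_agent_available: bool
    open_escalations: int


def forward_blockers(check: TransitionCheck) -> List[str]:
    """List every unmet requirement; an empty list means the item may advance."""
    blockers = []
    if not check.gate_passed:
        blockers.append("gate criteria not met")
    if not check.handoff_artifact_present:
        blockers.append("handoff artifact missing")
    if not check.target_agent_available:
        blockers.append("target agent unavailable")
    if check.open_escalations > 0:
        blockers.append("unresolved escalations")
    return blockers
```

Returning all blockers at once, rather than stopping at the first, gives the audit record a complete reason for a refused transition.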

Failure Transitions (Rework Routing)

When a gate fails, the orchestrator must route the work item to the correct agent for rework. The routing depends on the failure type.

| Current Stage | Failure Type | Route To | Rework Agent | Max Rework Iterations |
|---|---|---|---|---|
| Testing (4) | Test failure (code bug) | REWORK_IMPL | developer-agent | 3 |
| Testing (4) | Test gap (missing coverage) | REWORK_TEST | qa-agent | 2 |
| Testing (4) | Acceptance criteria mismatch | REWORK_REQ | product-agent | 1 |
| Security (5) | Vulnerability found | REWORK_IMPL | developer-agent | 3 |
| Security (5) | License violation | REWORK_IMPL | developer-agent | 2 |
| Security (5) | Compliance evidence gap | REWORK_SEC | compliance-agent | 2 |
| Deployment (6) | Configuration error | REWORK_DEPLOY | platform-agent | 2 |
| Deployment (6) | Environment mismatch | REWORK_DEPLOY | platform-agent | 2 |
| Operations (7) | Health check failure | ROLLBACK | ops-agent + platform-agent | 1 (then escalate) |
| Operations (7) | Critical incident | ROLLBACK | ops-agent + human | Immediate |
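
One way to apply the routing table above is a lookup keyed by stage and failure type. This is a sketch, not a prescribed implementation: the failure-type keys (`code_bug`, `test_gap`, and so on) and the function name `route_failure` are invented for illustration, the Operations rows are omitted because they trigger a rollback rather than rework, and the limit is assumed to be checked against rework iterations already completed.

```python
from typing import Tuple

# (stage, failure type) -> (rework state, rework agent, max rework iterations)
# Failure-type keys are hypothetical labels for the rows of the routing table.
REWORK_ROUTES = {
    ("TESTING", "code_bug"): ("REWORK_IMPL", "developer-agent", 3),
    ("TESTING", "test_gap"): ("REWORK_TEST", "qa-agent", 2),
    ("TESTING", "acceptance_mismatch"): ("REWORK_REQ", "product-agent", 1),
    ("SECURITY", "vulnerability"): ("REWORK_IMPL", "developer-agent", 3),
    ("SECURITY", "license_violation"): ("REWORK_IMPL", "developer-agent", 2),
    ("SECURITY", "compliance_gap"): ("REWORK_SEC", "compliance-agent", 2),
    ("DEPLOYMENT", "config_error"): ("REWORK_DEPLOY", "platform-agent", 2),
    ("DEPLOYMENT", "environment_mismatch"): ("REWORK_DEPLOY", "platform-agent", 2),
}

ESCALATE = ("ESCALATE", "Development Manager")


def route_failure(stage: str, failure_type: str, completed_reworks: int) -> Tuple[str, str]:
    """Pick the rework state and agent, or escalate when no route applies."""
    route = REWORK_ROUTES.get((stage, failure_type))
    if route is None:
        return ESCALATE  # unclear failure: a human decides
    state, agent, max_iterations = route
    if completed_reworks >= max_iterations:
        return ESCALATE  # rework iteration limit reached
    return (state, agent)
```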

Escalation Transitions

When an agent cannot resolve an issue within its contract, it must escalate to a human.

| Trigger | Escalation Target | Response SLA | Action if SLA Breached |
|---|---|---|---|
| Architecture-impacting decision | Solution Architect | 4 hours | Block pipeline, notify CTO |
| Auth/crypto/PII change | Security Engineer | 2 hours | Block pipeline, notify Security lead |
| Rework iteration limit reached | Development Manager | 4 hours | Block pipeline, create incident |
| Agent contract violation | CTO | 1 hour | Suspend agent immediately |
| Cross-agent conflict (contradictory outputs) | Solution Architect | 4 hours | Block pipeline, convene review |
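
A simple SLA check over the escalation table could look like the sketch below. The trigger keys and the names `ESCALATION_SLA` and `sla_breached` are assumptions made for this example; only the targets and durations come from the table.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical trigger keys mapped to (escalation target, response SLA).
ESCALATION_SLA = {
    "architecture_decision": ("Solution Architect", timedelta(hours=4)),
    "auth_crypto_pii_change": ("Security Engineer", timedelta(hours=2)),
    "rework_limit_reached": ("Development Manager", timedelta(hours=4)),
    "agent_contract_violation": ("CTO", timedelta(hours=1)),
    "cross_agent_conflict": ("Solution Architect", timedelta(hours=4)),
}


def sla_breached(trigger: str, raised_at: datetime, now: datetime) -> bool:
    """Return True once the escalation has waited longer than its response SLA."""
    _target, sla = ESCALATION_SLA[trigger]
    return now - raised_at > sla


raised = datetime(2026, 2, 23, 14, 0, tzinfo=timezone.utc)
print(sla_breached("agent_contract_violation", raised, raised + timedelta(minutes=61)))
```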

Iteration Limits and Deadlock Prevention

Maximum Iteration Thresholds

Per PRD-STD-009 REQ-009-07, autonomous loops must enforce maximum iteration limits.

| Loop Type | Max Iterations | On Breach |
|---|---|---|
| Single agent rework (same stage) | 3 | Escalate to human owner of that stage |
| Cross-stage rework (bouncing between stages) | 5 total across all stages | Escalate to Development Manager |
| Full pipeline retry (Stage 1 restart) | 2 | Escalate to CTO, likely needs scope change |
| Deployment retry | 2 | Block deployment, human investigation required |
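
The first two thresholds can be checked from per-stage rework counters. The sketch below is one possible reading, with assumptions stated: a breach is taken to mean the counter has reached the limit when a new failure arrives, the cross-stage limit is checked first, and `rework_escalation` is a hypothetical name.

```python
from typing import Dict, Optional

SAME_STAGE_LIMIT = 3    # single agent rework (same stage)
CROSS_STAGE_LIMIT = 5   # total rework iterations across all stages


def rework_escalation(rework_by_stage: Dict[int, int]) -> Optional[str]:
    """Return the escalation target, or None while the item is within limits.

    `rework_by_stage` maps a stage number to completed rework iterations.
    """
    if sum(rework_by_stage.values()) >= CROSS_STAGE_LIMIT:
        return "Development Manager"
    if any(count >= SAME_STAGE_LIMIT for count in rework_by_stage.values()):
        return "human owner of the stage"
    return None
```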

Deadlock Detection

A deadlock occurs when two or more agents are waiting for each other's output. The orchestrator must detect and resolve these.

Detection rules:

  1. Circular wait: Agent A waits for Agent B, Agent B waits for Agent A
  2. Timeout: Any agent state unchanged for >2x its expected execution time
  3. Contradictory outputs: Two agents produce conflicting recommendations with no resolution path
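
The first two detection rules lend themselves to a short sketch. It assumes each blocked agent waits on exactly one other agent, which is a simplification, and the function names are illustrative. Detecting contradictory outputs needs domain knowledge and is not shown.

```python
from typing import Dict, List, Optional


def find_circular_wait(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the agents that form a wait cycle, or None if there is none.

    `waits_for` maps each blocked agent to the single agent it is waiting on.
    """
    for start in waits_for:
        path: List[str] = []
        current = start
        # Follow the wait chain until it ends or revisits an agent.
        while current in waits_for and current not in path:
            path.append(current)
            current = waits_for[current]
        if current in path:
            return path[path.index(current):]
    return None


def timed_out(elapsed_seconds: float, expected_seconds: float) -> bool:
    """Rule 2: state unchanged for more than 2x the expected execution time."""
    return elapsed_seconds > 2 * expected_seconds
```

For example, `find_circular_wait({"qa-agent": "developer-agent", "developer-agent": "qa-agent"})` reports both agents, while a chain that ends at an agent that is not itself waiting returns `None`.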

Resolution protocol:

1. Orchestrator detects deadlock condition
2. Orchestrator pauses all involved agents
3. Orchestrator notifies the Development Manager with:
   - Deadlock type (circular, timeout, contradictory)
   - Involved agents and their current states
   - Last handoff artifacts from each agent
   - Suggested resolution (human decision needed)
4. Development Manager resolves by:
   - Choosing one agent's output over the other
   - Providing additional context to break the tie
   - Escalating to Solution Architect for architecture decisions
   - Cancelling the work item if resolution is not feasible
5. Orchestrator resumes pipeline with resolution applied

Parallel Execution Rules

Some stages can run in parallel to reduce cycle time. The orchestrator manages parallelism.

Allowed Parallel Paths

Stage 3 (Implementation) completes
│
├──> Stage 4 (qa-agent) ────────────────┐
│                                       │
└──> Stage 5 (security-agent) ──────────┤ ──> Merge results ──> Gate 5
     Stage 5 (compliance-agent) ────────┘

Rules for parallel execution:

  1. qa-agent and security-agent MAY run in parallel after Gate 3
  2. compliance-agent MAY run in parallel with security-agent
  3. Both paths must complete and pass before Gate 5 is evaluated
  4. If one path fails, the other continues but the work item cannot advance
  5. devmgr-agent runs after both qa-agent and security-agent complete (needs both outputs)
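
Rules 1 through 4 can be sketched with a thread pool: every path runs to completion even if another fails, and Gate 5 is evaluated only when all results are in. The callables standing in for agents, and the names `run_parallel_paths` and `gate_5_ready`, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_parallel_paths(paths: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every path to completion, even when another path fails (rule 4)."""
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        futures = {name: pool.submit(run) for name, run in paths.items()}
        # result() blocks, so all paths have finished when this returns.
        return {name: future.result() for name, future in futures.items()}


def gate_5_ready(results: Dict[str, bool]) -> bool:
    """Rule 3: the work item advances only if every path completed and passed."""
    return bool(results) and all(results.values())


results = run_parallel_paths({
    "qa-agent": lambda: True,          # stand-ins for real agent runs
    "security-agent": lambda: False,
    "compliance-agent": lambda: True,
})
print(results, gate_5_ready(results))
```

Under rule 5, a devmgr-agent step would be invoked only after `run_parallel_paths` returns, since it needs both outputs.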

Forbidden Parallel Paths

These stages MUST run sequentially:

| Stage A | Stage B | Reason |
|---|---|---|
| Requirements (1) | Design (2) | Design depends on approved requirements |
| Design (2) | Implementation (3) | Code depends on approved design |
| Security (5) | Deployment (6) | Cannot deploy security-uncleared code |
| Deployment (6) | Operations (7) | Cannot monitor what is not deployed |

Orchestrator Configuration

Work Item Metadata

Every work item tracked by the orchestrator carries this metadata:

work_item:
  id: "WI-{project}-{sequence}"
  title: "{descriptive title}"
  risk_tier: 1|2|3|4
  data_classification: public|internal|confidential|restricted
  current_state: "{state from state machine}"
  current_agent: "{agent-id or null}"
  created_at: "{ISO 8601}"
  updated_at: "{ISO 8601}"
  stage_history:
    - stage: 1
      agent: "product-agent"
      entered_at: "{ISO 8601}"
      exited_at: "{ISO 8601}"
      result: "pass|fail|escalate"
      handoff_id: "HO-{id}"
      iteration: 1
  rework_count: 0
  total_iterations: 0
  escalation_history: []
  trust_levels:
    product-agent: 1
    architect-agent: 0
    developer-agent: 2
    qa-agent: 1
    security-agent: 1

Orchestrator Health Checks

The orchestrator itself must be monitored:

| Metric | Threshold | Action on Breach |
|---|---|---|
| Queue depth | >50 work items | Alert Development Manager, assess capacity |
| Average cycle time | >2x baseline | Investigate bottleneck stages |
| Deadlock rate | >1 per week | Review agent contracts for conflicts |
| Escalation rate | >10% of work items | Review trust levels and agent capabilities |
| Gate failure rate | >30% at any single gate | Investigate root cause, retrain agents |
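
The thresholds above translate directly into a periodic check. In this sketch the `OrchestratorMetrics` fields and the `breached_metrics` function are hypothetical; rates are assumed to be expressed as fractions, and cycle time as a ratio to the baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OrchestratorMetrics:
    """Hypothetical metrics snapshot; field names are illustrative."""
    queue_depth: int
    cycle_time_ratio: float        # average cycle time divided by baseline
    deadlocks_per_week: int
    escalation_rate: float         # fraction of work items escalated
    max_gate_failure_rate: float   # worst failure rate across all gates


def breached_metrics(m: OrchestratorMetrics) -> List[str]:
    """Return the name of every metric that exceeds its threshold."""
    checks = {
        "queue_depth": m.queue_depth > 50,
        "average_cycle_time": m.cycle_time_ratio > 2.0,
        "deadlock_rate": m.deadlocks_per_week > 1,
        "escalation_rate": m.escalation_rate > 0.10,
        "gate_failure_rate": m.max_gate_failure_rate > 0.30,
    }
    return [name for name, breached in checks.items() if breached]
```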

Event Log Format

Every state transition produces an event in the orchestration log:

{
  "event_id": "EVT-{uuid}",
  "timestamp": "2026-02-23T14:30:00Z",
  "work_item_id": "WI-myproject-042",
  "transition": {
    "from_state": "IMPLEMENTATION",
    "to_state": "TESTING",
    "trigger": "gate_3_passed"
  },
  "agent": {
    "source": "developer-agent",
    "target": "qa-agent"
  },
  "handoff_id": "HO-developer-agent-qa-agent-20260223T143000",
  "gate_results": {
    "lint": "pass",
    "unit_tests": "pass",
    "sast_basic": "pass",
    "ai_metadata": "pass"
  },
  "trust_level": 2,
  "human_approval": null,
  "duration_seconds": 3420
}

This log format satisfies PRD-STD-009 REQ-009-14 (auditable run records) and PRD-STD-005 (documentation requirements).
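
A helper that emits events in this shape might look like the sketch below. `build_transition_event` is a hypothetical function, not part of any standard or library; the only thing taken from the document is the set of keys in the example event.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, Optional


def build_transition_event(
    *,
    work_item_id: str,
    from_state: str,
    to_state: str,
    trigger: str,
    source_agent: str,
    target_agent: str,
    handoff_id: str,
    gate_results: Dict[str, str],
    trust_level: int,
    duration_seconds: int,
    human_approval: Optional[str] = None,
) -> dict:
    """Assemble one log event with the same keys as the example event."""
    return {
        "event_id": f"EVT-{uuid.uuid4()}",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "work_item_id": work_item_id,
        "transition": {"from_state": from_state, "to_state": to_state, "trigger": trigger},
        "agent": {"source": source_agent, "target": target_agent},
        "handoff_id": handoff_id,
        "gate_results": dict(gate_results),
        "trust_level": trust_level,
        "human_approval": human_approval,
        "duration_seconds": duration_seconds,
    }


event = build_transition_event(
    work_item_id="WI-myproject-042", from_state="IMPLEMENTATION", to_state="TESTING",
    trigger="gate_3_passed", source_agent="developer-agent", target_agent="qa-agent",
    handoff_id="HO-developer-agent-qa-agent-20260223T143000",
    gate_results={"lint": "pass", "unit_tests": "pass"}, trust_level=2, duration_seconds=3420,
)
print(json.dumps(event, indent=2))
```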


Quick Reference: Orchestration Decision Tree

Is the work item new?
├── YES → State: INTAKE → Route to product-agent (Stage 1)
└── NO → Is the current gate passed?
    ├── YES → Is there a next stage?
    │   ├── YES → Can parallel paths run?
    │   │   ├── YES → Launch parallel agents
    │   │   └── NO → Route to next stage's owner agent
    │   └── NO → State: DEPLOYED (terminal)
    └── NO → Is the rework limit reached?
        ├── YES → Escalate to human (Development Manager)
        └── NO → What type of failure?
            ├── Code bug → Route to developer-agent
            ├── Test gap → Route to qa-agent
            ├── Security finding → Route to developer-agent (remediate)
            ├── Compliance gap → Route to compliance-agent
            ├── Config error → Route to platform-agent
            └── Unclear → Escalate to human (Development Manager)
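
The decision tree can also be read as a single function. The sketch below follows the branches in order; the parameter names, the failure-type keys, and the returned action strings are all invented for illustration.

```python
from typing import Optional

# Hypothetical failure-type keys for the last branch of the tree.
FAILURE_ROUTES = {
    "code_bug": "developer-agent",
    "test_gap": "qa-agent",
    "security_finding": "developer-agent",
    "compliance_gap": "compliance-agent",
    "config_error": "platform-agent",
}


def next_action(
    is_new: bool,
    gate_passed: bool,
    has_next_stage: bool = True,
    can_parallelize: bool = False,
    rework_limit_reached: bool = False,
    failure_type: Optional[str] = None,
) -> str:
    """Walk the decision tree top to bottom and return one action string."""
    if is_new:
        return "route:product-agent"
    if gate_passed:
        if not has_next_stage:
            return "state:DEPLOYED"
        return "launch:parallel-agents" if can_parallelize else "route:next-stage-owner"
    if rework_limit_reached:
        return "escalate:Development Manager"
    agent = FAILURE_ROUTES.get(failure_type)
    # An unknown or missing failure type is the "Unclear" branch.
    return f"route:{agent}" if agent else "escalate:Development Manager"
```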