Incident Response Automation

git clone https://github.com/AEEF-AI/aeef-production.git

The Production tier includes automated incident response tooling intended to reduce mean time to resolution (MTTR) for AI governance incidents. It covers automated triage, rollback automation, structured alert routing, and generation of incident records that feed post-incident review.

For the normative incident response requirements, see PRD-STD-010: AI Product Safety & Trust and the Incident Response governance guide.

Automated Triage Scripts

The triage system classifies incidents by type and severity, then routes them to the appropriate response automation:

Triage Script

#!/usr/bin/env bash
# scripts/triage.sh
set -euo pipefail

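SEVERITY="${1:?Usage: triage.sh <P1|P2|P3|P4> <incident-type>}"
INCIDENT_TYPE="${2:-unknown}"

# Reject unknown severities before any response automation runs
case "$SEVERITY" in
  P1|P2|P3|P4) ;;
  *) echo "ERROR: invalid severity '$SEVERITY' (expected P1-P4)" >&2; exit 1 ;;
esac

echo "Triaging incident: severity=$SEVERITY type=$INCIDENT_TYPE"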

case "$INCIDENT_TYPE" in
  drift)
    echo "Configuration drift detected"
    ./scripts/assess-drift-impact.sh
    ;;
  security)
    echo "Security incident detected"
    ./scripts/isolate-affected-services.sh
    ;;
  agent-violation)
    echo "Agent trust boundary violation"
    ./scripts/suspend-agent.sh
    ;;
  data-breach)
    echo "Potential data breach"
    ./scripts/activate-breach-protocol.sh
    ;;
  *)
    echo "Unknown incident type, routing to on-call"
    ;;
esac

# Create incident record
./scripts/create-incident.sh "$SEVERITY" "$INCIDENT_TYPE"

# Route based on severity
case "$SEVERITY" in
  P1)
    echo "P1: Initiating immediate rollback and page on-call"
    ./scripts/rollback.sh --to-last-known-good
    ./scripts/page-oncall.sh --severity P1
    ;;
  P2)
    echo "P2: Alerting on-call team"
    ./scripts/page-oncall.sh --severity P2
    ;;
  P3)
    echo "P3: Creating ticket for next business day"
    ./scripts/create-ticket.sh --priority high
    ;;
  P4)
    echo "P4: Logging for review"
    ./scripts/log-incident.sh
    ;;
esac

Incident Type Classification

Type	Description	Auto-Response
`drift`	Configuration has deviated from baseline	Impact assessment, baseline comparison report
`security`	SAST/SCA finding in production, vulnerability detected	Service isolation, emergency patch workflow
`agent-violation`	Agent exceeded trust boundary or permission scope	Agent suspension, audit log extraction
`data-breach`	PII exposure or unauthorized data access	Breach protocol activation, regulatory notification prep
`quality-degradation`	Coverage/mutation score below threshold	Alert + investigation assignment
`deployment-failure`	Production deployment failed health checks	Automatic rollback to last known good
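
The triage script above handles four of the six types in this table. As a rough sketch, a lookup helper could cover all six and reject anything else. The function name `auto_response_for` is hypothetical and not part of the repository; the response strings are copied from the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original scripts): prints the documented
# auto-response for an incident type, or fails for an undocumented type.
auto_response_for() {
  case "$1" in
    drift)               echo "Impact assessment, baseline comparison report" ;;
    security)            echo "Service isolation, emergency patch workflow" ;;
    agent-violation)     echo "Agent suspension, audit log extraction" ;;
    data-breach)         echo "Breach protocol activation, regulatory notification prep" ;;
    quality-degradation) echo "Alert + investigation assignment" ;;
    deployment-failure)  echo "Automatic rollback to last known good" ;;
    *)                   return 1 ;;  # not a documented incident type
  esac
}
```

A caller might use the non-zero return to fall back to the on-call route, for example `auto_response_for "$INCIDENT_TYPE" || echo "routing to on-call"`.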

Rollback Automation

The rollback script reverts the deployment to the last known good state:

#!/usr/bin/env bash
# scripts/rollback.sh
set -euo pipefail

TARGET="${1:---to-last-known-good}"
DEPLOYMENT_LOG="deployments/history.json"

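case "$TARGET" in
  --to-last-known-good)
    # Assumes history.json lists deployments newest first
    ROLLBACK_SHA=$(jq -r 'first(.deployments[] | select(.status == "healthy") | .commit) // empty' \
      "$DEPLOYMENT_LOG")
    ;;
  --to-commit)
    ROLLBACK_SHA="${2:?Usage: rollback.sh --to-commit <sha>}"
    ;;
  *)
    echo "ERROR: Unknown target '$TARGET'" >&2
    exit 1
    ;;
esac

if [ -z "$ROLLBACK_SHA" ]; then
  echo "ERROR: No healthy deployment found in $DEPLOYMENT_LOG" >&2
  exit 1
fi

echo "Rolling back to commit: $ROLLBACK_SHA"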

# Verify the target commit passed all quality gates
PROVENANCE="provenance/${ROLLBACK_SHA}.json"
if [ ! -f "$PROVENANCE" ]; then
  echo "ERROR: No provenance record for $ROLLBACK_SHA"
  exit 1
fi

ALL_PASSED=$(jq -r '.stages | to_entries | all(.value.status == "pass")' "$PROVENANCE")
if [ "$ALL_PASSED" != "true" ]; then
  echo "ERROR: Target commit did not pass all quality gates"
  exit 1
fi

# Execute rollback
git checkout "$ROLLBACK_SHA"
docker compose build
docker compose up -d --force-recreate

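# Verify health
sleep 10
# "|| true" keeps set -e/pipefail from aborting before the error message below is printed
HEALTH=$(curl -sf http://localhost:8080/health | jq -r '.status' || true)
if [ "$HEALTH" != "ok" ]; then
  echo "ERROR: Rollback health check failed" >&2
  exit 1
fi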

echo "Rollback successful to $ROLLBACK_SHA"

# Record rollback event (write to a unique temp file, then replace the log)
TMP_LOG=$(mktemp)
jq --arg sha "$ROLLBACK_SHA" --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  '.rollbacks += [{"commit": $sha, "timestamp": $ts}]' \
  "$DEPLOYMENT_LOG" > "$TMP_LOG" && mv "$TMP_LOG" "$DEPLOYMENT_LOG"

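The rollback script waits a fixed 10 seconds before its single health check, which may be too short for slow services and longer than needed for fast ones. One possible alternative is to poll the endpoint until it reports `ok`. The sketch below assumes the same `/health` endpoint; the names `wait_for_healthy` and `probe_health` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original rollback.sh.

# wait_for_healthy PROBE_CMD [ATTEMPTS] [DELAY_SECONDS]
# Runs PROBE_CMD until it prints "ok" or the attempts run out.
wait_for_healthy() {
  local probe="$1" attempts="${2:-10}" delay="${3:-3}"
  local i status
  for ((i = 1; i <= attempts; i++)); do
    # A failing probe is treated as "not healthy yet" rather than a fatal error.
    status="$("$probe" 2>/dev/null || true)"
    if [ "$status" = "ok" ]; then
      return 0
    fi
    # Skip the delay after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Probe that mirrors the health check used in rollback.sh.
probe_health() {
  curl -sf http://localhost:8080/health | jq -r '.status'
}
```

In the rollback script, the `sleep 10` and one-shot check could then become `wait_for_healthy probe_health 10 3 || { echo "ERROR: Rollback health check failed"; exit 1; }`.
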
Alert Routing Configuration

Alerts are routed based on severity and incident type:

Severity	Response Time	Notification Channel	Escalation
P1	Immediate	PagerDuty + Slack #aeef-incidents + Phone	VP Engineering within 15 min
P2	30 minutes	PagerDuty + Slack #aeef-incidents	Team lead within 1 hour
P3	Next business day	Slack #aeef-warnings	Sprint backlog
P4	Best effort	Slack #aeef-info	Monthly review

Slack Integration

{
  "channels": {
    "critical": "#aeef-incidents",
    "warning": "#aeef-warnings",
    "info": "#aeef-info"
  },
  "templates": {
    "incident": {
      "blocks": [
        {
          "type": "header",
          "text": { "type": "plain_text", "text": "AEEF Incident: {{severity}} - {{type}}" }
        },
        {
          "type": "section",
          "text": { "type": "mrkdwn", "text": "*Description:* {{description}}\n*Detected:* {{timestamp}}\n*Auto-response:* {{action_taken}}" }
        }
      ]
    }
  }
}
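
The template uses `{{placeholder}}` markers but the page does not show how they are filled in. A minimal sketch of one way to do it with plain string substitution follows. The function name `render_template` and the `key=value` argument convention are assumptions, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replaces each {{key}} in a template string with a value.
# Usage: render_template TEMPLATE key=value [key=value ...]
# Values are inserted verbatim, so anything destined for a JSON payload
# must already be JSON-escaped by the caller.
render_template() {
  local text="$1" pair key value placeholder
  shift
  for pair in "$@"; do
    key="${pair%%=*}"      # text before the first "="
    value="${pair#*=}"     # text after the first "="
    placeholder="{{${key}}}"
    # Quoting the pattern and the replacement keeps both literal.
    text=${text//"$placeholder"/"$value"}
  done
  printf '%s\n' "$text"
}

render_template "AEEF Incident: {{severity}} - {{type}}" severity=P1 type=drift
# prints: AEEF Incident: P1 - drift
```

Placeholders with no matching argument are left untouched, which makes missing fields easy to spot in the rendered message.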

Incident Record Schema

Every incident generates a structured record for post-incident review:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["incidentId", "severity", "type", "detectedAt", "status"],
  "properties": {
    "incidentId": { "type": "string", "format": "uuid" },
    "severity": { "enum": ["P1", "P2", "P3", "P4"] },
    "type": { "type": "string" },
    "description": { "type": "string" },
    "detectedAt": { "type": "string", "format": "date-time" },
    "resolvedAt": { "type": "string", "format": "date-time" },
    "status": { "enum": ["open", "investigating", "mitigated", "resolved", "closed"] },
    "affectedServices": {
      "type": "array",
      "items": { "type": "string" }
    },
    "rootCause": { "type": "string" },
    "actionsTaken": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "timestamp": { "type": "string", "format": "date-time" },
          "action": { "type": "string" },
          "automated": { "type": "boolean" },
          "actor": { "type": "string" }
        }
      }
    },
    "postMortem": {
      "type": "object",
      "properties": {
        "timeline": { "type": "string" },
        "rootCauseAnalysis": { "type": "string" },
        "actionItems": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  }
}
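
The triage script calls `create-incident.sh`, which is not shown on this page. As a rough illustration of what producing a record with the five required fields might look like, the sketch below emits a minimal JSON object. The function name `new_incident_record` is hypothetical, and restricting `type` to lowercase letters and hyphens is an assumption made here to avoid JSON escaping; the schema itself allows any string.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the repository's create-incident.sh.
# Usage: new_incident_record SEVERITY TYPE
new_incident_record() {
  local severity="$1" type="$2" id detected_at
  case "$severity" in
    P1|P2|P3|P4) ;;
    *) echo "invalid severity: $severity" >&2; return 1 ;;
  esac
  # Assumption: types look like "drift" or "agent-violation", so no escaping is needed.
  if [[ ! "$type" =~ ^[a-z][a-z-]*$ ]]; then
    echo "invalid incident type: $type" >&2
    return 1
  fi
  if command -v uuidgen >/dev/null 2>&1; then
    id="$(uuidgen | tr 'A-Z' 'a-z')"
  else
    id="$(cat /proc/sys/kernel/random/uuid)"  # Linux fallback
  fi
  detected_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # New incidents start in the "open" status from the schema's enum.
  printf '{"incidentId":"%s","severity":"%s","type":"%s","detectedAt":"%s","status":"open"}\n' \
    "$id" "$severity" "$type" "$detected_at"
}
```

Optional fields such as `description` or `actionsTaken` contain free text, so a real implementation would more likely build the record with a JSON-aware tool such as `jq`.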

Integration with Incident Management Platforms

The incident response system integrates with external platforms:

Jira Integration

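# scripts/create-ticket.sh
# JIRA_TOKEN must hold the base64 encoding of "email:api_token" for Basic auth.
# map_severity_to_priority is not shown here and is assumed to be defined elsewhere.
# The payload is built with jq so quotes in DESCRIPTION cannot break the JSON;
# REST API v3 expects the description in Atlassian Document Format.
jq -n \
  --arg summary "AEEF Incident: ${SEVERITY} - ${INCIDENT_TYPE}" \
  --arg priority "$(map_severity_to_priority "$SEVERITY")" \
  --arg description "${DESCRIPTION}" \
  '{
    fields: {
      project: {key: "AEEF"},
      summary: $summary,
      issuetype: {name: "Bug"},
      priority: {name: $priority},
      description: {
        type: "doc", version: 1,
        content: [{type: "paragraph", content: [{type: "text", text: $description}]}]
      }
    }
  }' |
curl -X POST "https://your-org.atlassian.net/rest/api/3/issue" \
  -H "Authorization: Basic ${JIRA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @-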
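
The Jira snippet calls `map_severity_to_priority`, which this page does not define. A possible shape for it is sketched below. The priority names are those of Jira's default priority scheme, and the pairing of P1 to P4 with them is an assumption that should be adjusted to the project's own scheme.

```shell
#!/usr/bin/env bash
# Hypothetical definition; the severity-to-priority pairing is an assumption.
map_severity_to_priority() {
  case "$1" in
    P1) echo "Highest" ;;
    P2) echo "High" ;;
    P3) echo "Medium" ;;
    P4) echo "Low" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```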

PagerDuty Integration

# scripts/page-oncall.sh
# map_severity_to_pd is not shown here; it must return one of the values the
# Events API v2 accepts for payload.severity: critical, error, warning, or info.
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "routing_key": "${PAGERDUTY_ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "AEEF ${SEVERITY}: ${INCIDENT_TYPE}",
    "severity": "$(map_severity_to_pd "$SEVERITY")",
    "source": "aeef-monitoring",
    "component": "governance"
  }
}
EOF
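
The PagerDuty snippet relies on `map_severity_to_pd`, which is also not defined on this page. The sketch below shows one plausible mapping onto the four severity values that the Events API v2 accepts. Which AEEF severity maps to which PagerDuty value is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical definition; the P1-P4 pairing is an assumption.
# PagerDuty Events API v2 accepts: critical, error, warning, info.
map_severity_to_pd() {
  case "$1" in
    P1) echo "critical" ;;
    P2) echo "error" ;;
    P3) echo "warning" ;;
    P4) echo "info" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```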

OpsGenie Integration

Supported via AlertManager webhook configuration. See the Monitoring Setup page for AlertManager routing.

Next Steps

Monitoring alerts: Monitoring Setup for configuring alert thresholds and routing
Sovereign compliance: Sovereign Compliance Overlays for jurisdiction-specific incident protocols
Governance framework: Incident Response for normative requirements
