
Incident Response Automation


git clone https://github.com/AEEF-AI/aeef-production.git

The Production tier includes automated incident response tooling that reduces mean time to resolution (MTTR) for AI governance incidents. This includes automated triage, rollback automation, structured alert routing, and incident record generation that feeds into post-incident review processes.

For the normative incident response requirements, see PRD-STD-010: AI Product Safety & Trust and the Incident Response governance guide.

Automated Triage Scripts

The triage system classifies incidents by type and severity, then routes them to the appropriate response automation:

Triage Script

#!/usr/bin/env bash
# scripts/triage.sh
set -euo pipefail

SEVERITY="${1:?Usage: triage.sh <P1|P2|P3|P4> <incident-type>}"
INCIDENT_TYPE="${2:-unknown}"

# Reject unknown severities up front so no incident falls through the routing below
case "$SEVERITY" in
    P1|P2|P3|P4) ;;
    *)
        echo "ERROR: invalid severity '$SEVERITY' (expected P1, P2, P3, or P4)" >&2
        exit 1
        ;;
esac

# Make the incident context available to the helper scripts
export SEVERITY INCIDENT_TYPE

echo "Triaging incident: severity=$SEVERITY type=$INCIDENT_TYPE"

# Create the incident record first so it exists even if a response script fails
./scripts/create-incident.sh "$SEVERITY" "$INCIDENT_TYPE"

# Run the type-specific response
case "$INCIDENT_TYPE" in
    drift)
        echo "Configuration drift detected"
        ./scripts/assess-drift-impact.sh
        ;;
    security)
        echo "Security incident detected"
        ./scripts/isolate-affected-services.sh
        ;;
    agent-violation)
        echo "Agent trust boundary violation"
        ./scripts/suspend-agent.sh
        ;;
    data-breach)
        echo "Potential data breach"
        ./scripts/activate-breach-protocol.sh
        ;;
    *)
        # Types without a dedicated handler in this script also land here
        echo "Unknown incident type, routing to on-call"
        ;;
esac

# Route based on severity
case "$SEVERITY" in
    P1)
        echo "P1: Initiating immediate rollback and page on-call"
        ./scripts/rollback.sh --to-last-known-good
        ./scripts/page-oncall.sh --severity P1
        ;;
    P2)
        echo "P2: Alerting on-call team"
        ./scripts/page-oncall.sh --severity P2
        ;;
    P3)
        echo "P3: Creating ticket for next business day"
        ./scripts/create-ticket.sh --priority high
        ;;
    P4)
        echo "P4: Logging for review"
        ./scripts/log-incident.sh
        ;;
esac

Incident Type Classification

| Type | Description | Auto-Response |
| --- | --- | --- |
| drift | Configuration has deviated from baseline | Impact assessment, baseline comparison report |
| security | SAST/SCA finding in production, vulnerability detected | Service isolation, emergency patch workflow |
| agent-violation | Agent exceeded trust boundary or permission scope | Agent suspension, audit log extraction |
| data-breach | PII exposure or unauthorized data access | Breach protocol activation, regulatory notification prep |
| quality-degradation | Coverage/mutation score below threshold | Alert + investigation assignment |
| deployment-failure | Production deployment failed health checks | Automatic rollback to last known good |
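
The type names in the table can be checked before triage starts, so that a typo is routed explicitly to the catch-all branch. The following is a minimal sketch; `is_known_incident_type` and `normalize_incident_type` are hypothetical helpers that are not part of the repository as shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for the incident types listed in the table above.
is_known_incident_type() {
  case "${1:-}" in
    drift|security|agent-violation|data-breach|quality-degradation|deployment-failure)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Hypothetical helper: prints the type unchanged if it is known, otherwise "unknown",
# which matches the default value triage.sh uses for its second argument.
normalize_incident_type() {
  if is_known_incident_type "${1:-}"; then
    printf '%s\n' "$1"
  else
    printf 'unknown\n'
  fi
}
```

A caller could then run `triage.sh P2 "$(normalize_incident_type "$RAW_TYPE")"`, where `RAW_TYPE` stands for whatever value the alert source provides.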

Rollback Automation

The rollback script reverts the deployment to the last known good commit, or to an explicitly named commit, after confirming that the target passed all quality gates:

#!/usr/bin/env bash
# scripts/rollback.sh
set -euo pipefail

TARGET="${1:---to-last-known-good}"
DEPLOYMENT_LOG="deployments/history.json"

case "$TARGET" in
    --to-last-known-good)
        # Takes the first healthy entry, so history.json is assumed to list deployments newest first
        ROLLBACK_SHA=$(jq -r 'first(.deployments[] | select(.status == "healthy") | .commit)' \
            "$DEPLOYMENT_LOG")
        ;;
    --to-commit)
        ROLLBACK_SHA="${2:?Usage: rollback.sh --to-commit <sha>}"
        ;;
    *)
        echo "Usage: rollback.sh [--to-last-known-good | --to-commit <sha>]" >&2
        exit 1
        ;;
esac

# Stop if the deployment log contains no healthy entry
if [ -z "$ROLLBACK_SHA" ] || [ "$ROLLBACK_SHA" = "null" ]; then
    echo "ERROR: No healthy deployment found in $DEPLOYMENT_LOG" >&2
    exit 1
fi

echo "Rolling back to commit: $ROLLBACK_SHA"

# Verify the target commit passed all quality gates
PROVENANCE="provenance/${ROLLBACK_SHA}.json"
if [ ! -f "$PROVENANCE" ]; then
    echo "ERROR: No provenance record for $ROLLBACK_SHA" >&2
    exit 1
fi

ALL_PASSED=$(jq -r '.stages | to_entries | all(.value.status == "pass")' "$PROVENANCE")
if [ "$ALL_PASSED" != "true" ]; then
    echo "ERROR: Target commit did not pass all quality gates" >&2
    exit 1
fi

# Execute rollback
git checkout "$ROLLBACK_SHA"
docker compose build
docker compose up -d --force-recreate

# Verify health; "|| true" keeps a failed request from aborting before the error is reported
sleep 10
HEALTH=$(curl -sf http://localhost:8080/health | jq -r '.status' || true)
if [ "$HEALTH" != "ok" ]; then
    echo "ERROR: Rollback health check failed" >&2
    exit 1
fi

echo "Rollback successful to $ROLLBACK_SHA"

# Record rollback event (write to a temp file first so the log is never left half-written)
TMP_LOG=$(mktemp)
jq --arg sha "$ROLLBACK_SHA" --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    '.rollbacks += [{"commit": $sha, "timestamp": $ts}]' \
    "$DEPLOYMENT_LOG" > "$TMP_LOG" && mv "$TMP_LOG" "$DEPLOYMENT_LOG"

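The health verification in the rollback script waits a fixed 10 seconds and probes once. A service that starts slowly may need several probes, so one option is to poll with a bounded number of retries. The sketch below assumes two hypothetical helpers, `wait_for_health` and `probe_health`; the endpoint and the expected `ok` status are taken from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command until it succeeds or attempts run out.
# Usage: wait_for_health <max_attempts> <delay_seconds> <command> [args...]
wait_for_health() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "Healthy after ${attempt} attempt(s)"
      return 0
    fi
    # No need to sleep after the final failed attempt
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
  done
  echo "ERROR: still unhealthy after ${max_attempts} attempts" >&2
  return 1
}

# Hypothetical probe using the same endpoint and status field as rollback.sh.
probe_health() {
  [ "$(curl -sf http://localhost:8080/health | jq -r '.status')" = "ok" ]
}
```

In the rollback script, `wait_for_health 10 3 probe_health || exit 1` could replace the `sleep 10` and the single probe. The attempt count and delay here are illustrative values, not recommendations from the source.
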
Alert Routing Configuration

Alerts are routed by severity:

| Severity | Response Time | Notification Channel | Escalation |
| --- | --- | --- | --- |
| P1 | Immediate | PagerDuty + Slack #aeef-incidents + Phone | VP Engineering within 15 min |
| P2 | 30 minutes | PagerDuty + Slack #aeef-incidents | Team lead within 1 hour |
| P3 | Next business day | Slack #aeef-warnings | Sprint backlog |
| P4 | Best effort | Slack #aeef-info | Monthly review |
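
The severity-to-channel routing in the table can be expressed as a small lookup. This is a sketch only: `slack_channel_for_severity` and `needs_page` are hypothetical helper names, and the channel names follow the Slack configuration in the next section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Slack channel for a severity level.
slack_channel_for_severity() {
  case "${1:-}" in
    P1|P2) echo "#aeef-incidents" ;;
    P3)    echo "#aeef-warnings" ;;
    P4)    echo "#aeef-info" ;;
    *)
      echo "ERROR: unknown severity: ${1:-}" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper: succeeds when the severity also pages via PagerDuty (P1 and P2).
needs_page() {
  case "${1:-}" in
    P1|P2) return 0 ;;
    *)     return 1 ;;
  esac
}
```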

Slack Integration

{
  "channels": {
    "critical": "#aeef-incidents",
    "warning": "#aeef-warnings",
    "info": "#aeef-info"
  },
  "templates": {
    "incident": {
      "blocks": [
        {
          "type": "header",
          "text": {
            "type": "plain_text",
            "text": "AEEF Incident: {{severity}} - {{type}}"
          }
        },
        {
          "type": "section",
          "text": {
            "type": "mrkdwn",
            "text": "*Description:* {{description}}\n*Detected:* {{timestamp}}\n*Auto-response:* {{action_taken}}"
          }
        }
      ]
    }
  }
}
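
The `{{severity}}`, `{{type}}`, and similar placeholders in the template need to be filled in before a message is posted. The source does not show how this is done; a minimal sketch using bash pattern substitution follows, with `render_template` as a hypothetical helper. Values are inserted verbatim, so anything destined for JSON still needs escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace {{key}} placeholders with values given as key=value pairs.
# Usage: render_template <template-text> key=value [key=value ...]
render_template() {
  local text="$1"
  shift
  local pair key value placeholder
  for pair in "$@"; do
    key="${pair%%=*}"      # text before the first "="
    value="${pair#*=}"     # text after the first "="
    placeholder="{{${key}}}"
    # Quoting the pattern makes bash match the placeholder literally
    text=${text//"$placeholder"/"$value"}
  done
  printf '%s\n' "$text"
}
```

For example, `render_template 'AEEF Incident: {{severity}} - {{type}}' severity=P1 type=drift` yields `AEEF Incident: P1 - drift`. Placeholders without a matching pair are left untouched.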

Incident Record Schema

Every incident generates a structured record for post-incident review:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["incidentId", "severity", "type", "detectedAt", "status"],
  "properties": {
    "incidentId": { "type": "string", "format": "uuid" },
    "severity": { "enum": ["P1", "P2", "P3", "P4"] },
    "type": { "type": "string" },
    "description": { "type": "string" },
    "detectedAt": { "type": "string", "format": "date-time" },
    "resolvedAt": { "type": "string", "format": "date-time" },
    "status": { "enum": ["open", "investigating", "mitigated", "resolved", "closed"] },
    "affectedServices": {
      "type": "array",
      "items": { "type": "string" }
    },
    "rootCause": { "type": "string" },
    "actionsTaken": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "timestamp": { "type": "string", "format": "date-time" },
          "action": { "type": "string" },
          "automated": { "type": "boolean" },
          "actor": { "type": "string" }
        }
      }
    },
    "postMortem": {
      "type": "object",
      "properties": {
        "timeline": { "type": "string" },
        "rootCauseAnalysis": { "type": "string" },
        "actionItems": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
  }
}
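
The source does not show `create-incident.sh`, so the following is only a sketch of how a record with the five required fields might be produced. `build_incident_record` and `json_escape` are hypothetical helpers; the escaping covers backslashes, double quotes, and newlines only, and a real script would more likely build the JSON with `jq`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: escape backslashes, double quotes, and newlines for a JSON string.
# Other control characters are not handled in this sketch.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  printf '%s' "$s"
}

# Hypothetical helper: print a minimal incident record with the required fields.
# Usage: build_incident_record <uuid> <P1|P2|P3|P4> <type> <detected-at>
build_incident_record() {
  local id="$1" severity="$2" type="$3" detected_at="$4"
  # Enforce the severity enum from the schema
  case "$severity" in
    P1|P2|P3|P4) ;;
    *)
      echo "ERROR: severity must be P1, P2, P3, or P4, got: $severity" >&2
      return 1
      ;;
  esac
  # New records start in the "open" status
  printf '{"incidentId":"%s","severity":"%s","type":"%s","detectedAt":"%s","status":"open"}\n' \
    "$(json_escape "$id")" "$severity" "$(json_escape "$type")" "$(json_escape "$detected_at")"
}
```

A caller would typically pass `"$(uuidgen)"` as the identifier and `"$(date -u +%Y-%m-%dT%H:%M:%SZ)"` as the detection time, the same timestamp format the rollback script records.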

Integration with Incident Management Platforms

The incident response system integrates with external platforms:

Jira Integration

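# scripts/create-ticket.sh
# Expects SEVERITY, INCIDENT_TYPE, DESCRIPTION, and JIRA_TOKEN in the environment.
# JIRA_TOKEN must be the base64 encoding of "<email>:<api-token>" for Basic auth.
# jq builds the payload so quotes and newlines in the variables are escaped, and
# wraps the description in Atlassian Document Format, which the v3 API requires.
jq -n \
    --arg summary "AEEF Incident: ${SEVERITY} - ${INCIDENT_TYPE}" \
    --arg priority "$(map_severity_to_priority "$SEVERITY")" \
    --arg description "${DESCRIPTION}" \
    '{
      fields: {
        project: {key: "AEEF"},
        summary: $summary,
        issuetype: {name: "Bug"},
        priority: {name: $priority},
        description: {
          type: "doc",
          version: 1,
          content: [{type: "paragraph", content: [{type: "text", text: $description}]}]
        }
      }
    }' |
curl -X POST "https://your-org.atlassian.net/rest/api/3/issue" \
    -H "Authorization: Basic ${JIRA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d @-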

PagerDuty Integration

# scripts/page-oncall.sh
# Expects SEVERITY, INCIDENT_TYPE, and PAGERDUTY_ROUTING_KEY in the environment.
# map_severity_to_pd must return one of: critical, error, warning, info.
curl -X POST "https://events.pagerduty.com/v2/enqueue" \
    -H "Content-Type: application/json" \
    -d @- <<EOF
{
  "routing_key": "${PAGERDUTY_ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "AEEF ${SEVERITY}: ${INCIDENT_TYPE}",
    "severity": "$(map_severity_to_pd "$SEVERITY")",
    "source": "aeef-monitoring",
    "component": "governance"
  }
}
EOF
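
The Jira and PagerDuty snippets call `map_severity_to_priority` and `map_severity_to_pd`, which are not shown in this page. A possible shape for both is sketched below. The specific mappings are assumptions: the PagerDuty values are the four severities the Events API v2 accepts, and the Jira names are the default priority names, which an individual Jira site may have renamed.

```shell
#!/usr/bin/env bash
# Assumed mapping from AEEF severity to PagerDuty Events API v2 severity.
map_severity_to_pd() {
  case "${1:-}" in
    P1) echo "critical" ;;
    P2) echo "error" ;;
    P3) echo "warning" ;;
    P4) echo "info" ;;
    *)
      echo "ERROR: unknown severity: ${1:-}" >&2
      return 1
      ;;
  esac
}

# Assumed mapping from AEEF severity to Jira's default priority names.
map_severity_to_priority() {
  case "${1:-}" in
    P1) echo "Highest" ;;
    P2) echo "High" ;;
    P3) echo "Medium" ;;
    P4) echo "Low" ;;
    *)
      echo "ERROR: unknown severity: ${1:-}" >&2
      return 1
      ;;
  esac
}
```

Returning a non-zero status for an unknown severity lets a caller running under `set -e` stop before sending a malformed request.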

OpsGenie Integration

OpsGenie is supported through an AlertManager webhook configuration. See the Monitoring Setup page for AlertManager routing.

Next Steps