Monitoring Setup
git clone https://github.com/AEEF-AI/aeef-production.git
The Production tier monitoring stack provides continuous visibility into AEEF governance health, KPI trends, and configuration drift. It deploys via Docker Compose and includes pre-configured Grafana dashboards, Prometheus metric collection, AlertManager routing, and a health check service.
Monitoring Stack Overview
docker-compose.monitoring.yml
  grafana      --> Pre-built AEEF dashboards (port 3001)
  prometheus   --> Metric collection and storage (port 9090)
  alertmanager --> Alert routing to Slack/PagerDuty/email (port 9093)
  healthcheck  --> Periodic governance validation service (port 8081)
Deployment
docker compose -f docker-compose.monitoring.yml up -d
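After the stack starts, a quick probe of each service can confirm that the containers are reachable. The sketch below assumes the stack runs on localhost with the host ports listed in the overview. `/-/healthy` (Prometheus, Alertmanager) and `/api/health` (Grafana) are those tools' standard health endpoints; `/metrics` on the health check service is an assumption inferred from the Prometheus scrape target. The names `ENDPOINTS`, `probe`, and `probe_all` are illustrative and not part of the repository.

```python
import urllib.error
import urllib.request

# Host ports come from docker-compose.monitoring.yml.
# The healthcheck path is an assumption (Prometheus scrapes /metrics by default).
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3001/api/health",
    "alertmanager": "http://localhost:9093/-/healthy",
    "healthcheck": "http://localhost:8081/metrics",
}


def probe(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def probe_all(endpoints=ENDPOINTS, opener=urllib.request.urlopen):
    """Probe every service and return a {name: reachable} mapping."""
    return {name: probe(url, opener=opener) for name, url in endpoints.items()}
```

Calling `probe_all()` returns a dictionary such as `{"prometheus": True, ...}`; any `False` entry points to a container that is not up or a port that is not mapped as expected.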
Docker Compose Configuration
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/rules/:/etc/prometheus/rules/
      - prometheus-data:/prometheus
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./monitoring/dashboards/:/var/lib/grafana/dashboards/
      - ./monitoring/provisioning/:/etc/grafana/provisioning/
      - grafana-data:/var/lib/grafana
    ports: ["3001:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]
  healthcheck:
    build: ./monitoring/healthcheck/
    ports: ["8081:8081"]
    environment:
      - CHECK_INTERVAL=300
      - AEEF_CONFIG_PATH=/etc/aeef/
volumes:
  prometheus-data:
  grafana-data:
KPI Dashboard Data Pipeline
The KPI dashboard visualizes metrics collected by the Metrics Pipeline:
Data Flow
CI Pipeline --> Provenance Records --> Collection Scripts --> JSON Records --> Prometheus Pushgateway --> Grafana
Prometheus Configuration
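# monitoring/prometheus.yml
global:
  scrape_interval: 30s
# Forward firing alerts to the alertmanager service from the compose file.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# The app and pushgateway targets are not part of the compose file shown above;
# they must be reachable from Prometheus on the same Docker network.
scrape_configs:
  - job_name: 'aeef-app'
    static_configs:
      - targets: ['app:8080']
  - job_name: 'aeef-healthcheck'
    static_configs:
      - targets: ['healthcheck:8081']
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']
rule_files:
  - /etc/prometheus/rules/aeef-alerts.yml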
KPI Dashboard Panels
The pre-built KPI dashboard (monitoring/dashboards/aeef-kpi.json) includes:
| Panel | Metric | Visualization |
|---|---|---|
| AI Contribution Ratio | aeef_ai_contributions_total | Time series line chart |
| PR Cycle Time | aeef_pr_cycle_time_seconds | Histogram heatmap |
| Deployment Frequency | aeef_deployments_total | Bar chart (weekly) |
| Defect Density | aeef_defects_per_kloc | Gauge with thresholds |
| Security Scan Pass Rate | aeef_security_scan_pass_ratio | Stat panel (percentage) |
| Coverage Trend | aeef_test_coverage_percent | Time series with threshold line |
| Mutation Score Trend | aeef_mutation_score_percent | Time series with threshold line |
Trust Metrics Dashboard
The trust metrics dashboard (monitoring/dashboards/aeef-trust.json) monitors PRD-STD-010 (AI Product Safety & Trust):
| Panel | Description |
|---|---|
| Agent Trust Levels | Current trust level for each active agent |
| Trust Boundary Violations | Count of agent actions exceeding declared permissions |
| Human Override Rate | Percentage of agent outputs overridden by human reviewers |
| Escalation Frequency | Rate of automatic escalations from agents to human operators |
| Bias Detection Results | Latest bias detection pipeline output |
| Safety Test Pass Rate | Percentage of safety test suites passing |
Drift Detection Alerts
Prometheus alerting rules for configuration drift and for quality metrics that fall below their thresholds:
# monitoring/rules/aeef-alerts.yml
groups:
  - name: aeef-drift
    rules:
      - alert: LinterConfigDrift
        expr: aeef_drift_detected{category="linting"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Linter configuration has drifted from baseline"
          runbook: "https://aeef.ai/reference-implementations/production-tier/monitoring-setup#drift-remediation"
      - alert: CIPipelineDrift
        expr: aeef_drift_detected{category="ci"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CI pipeline configuration has drifted from baseline"
      - alert: SecurityPolicyDrift
        expr: aeef_drift_detected{category="security"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Security policy configuration has drifted from baseline"
      - alert: CoverageBelowThreshold
        expr: aeef_test_coverage_percent < 80
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Test coverage has fallen below 80% threshold"
      - alert: MutationScoreBelowThreshold
        expr: aeef_mutation_score_percent < 70
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Mutation score has fallen below 70% threshold"
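The `for:` clause in these rules means an alert does not fire on the first bad sample: it stays pending until the expression has held at every evaluation for the full duration, and resets as soon as the expression stops matching. The sketch below simulates that behaviour for a "value below threshold" rule such as `CoverageBelowThreshold`. The `Sample` type and `alert_states` function are illustrative and are not part of Prometheus or the AEEF repository.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: int  # evaluation time in seconds
    value: float    # metric value at that evaluation


def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation of a `value < threshold` rule."""
    states = []
    active_since = None  # time the condition first became true, if it still holds
    for sample in samples:
        if sample.value < threshold:
            if active_since is None:
                active_since = sample.timestamp
            held = sample.timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
    return states
```

For example, with `threshold=80` and `for_seconds=3600`, coverage readings of 85, 78, 79, 77, and 82 taken every 30 minutes produce the states inactive, pending, pending, firing, inactive.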
AlertManager Routing
# monitoring/alertmanager.yml
# Alertmanager does not expand ${...} placeholders itself; render this file
# (e.g., with envsubst) before the container starts.
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 1h
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#aeef-alerts'
        api_url: '${SLACK_WEBHOOK_URL}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#aeef-warnings'
        api_url: '${SLACK_WEBHOOK_URL}'
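The routing tree above resolves each alert to a single receiver: child routes are tried in order, the first one whose matchers all equal the alert's labels wins, and alerts that match no child route fall back to the root receiver. A minimal model of that decision, using the receivers from the configuration above, might look like this. `ROUTES` and `pick_receiver` are illustrative names, and the sketch ignores grouping, `continue`, and repeat intervals.

```python
# Child routes in declaration order: (required labels, receiver name).
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "default"  # root receiver used when no child route matches


def pick_receiver(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver for an alert's labels (first matching route wins)."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default
```

Under this model, `CIPipelineDrift` (severity `critical`) pages through PagerDuty, `LinterConfigDrift` (severity `warning`) goes to `#aeef-warnings`, and any alert without a recognized severity lands in `#aeef-alerts`.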
Health Check Service
The health check service periodically validates all AEEF governance controls:
| Check | Frequency | What It Validates |
|---|---|---|
| CI pipeline status | Every 5 min | All required status checks are configured and passing |
| Branch protection | Every 15 min | Protected branches have required reviewers and checks |
| Agent registry | Every 15 min | All agent contracts are valid and within trust boundaries |
| Security rules | Every 30 min | Semgrep rules are present and not modified |
| Overlay compliance | Every 1 hour | Sovereign overlay controls are active |
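One way to reconcile the per-check frequencies in this table with the single `CHECK_INTERVAL=300` setting from the compose file is to treat the interval as a base tick and run each check only on ticks that are a multiple of its period. The source does not describe how the service actually schedules checks, so the sketch below is an assumption; `CHECK_PERIODS` and `due_checks` are hypothetical names.

```python
# Periods in seconds, taken from the frequency column of the table above.
CHECK_PERIODS = {
    "ci_pipeline_status": 5 * 60,
    "branch_protection": 15 * 60,
    "agent_registry": 15 * 60,
    "security_rules": 30 * 60,
    "overlay_compliance": 60 * 60,
}


def due_checks(elapsed_seconds, periods=CHECK_PERIODS):
    """Return the checks due at this tick, sorted by name.

    `elapsed_seconds` is the time since the service started and is assumed
    to advance in steps of CHECK_INTERVAL (300 seconds).
    """
    return sorted(
        name for name, period in periods.items() if elapsed_seconds % period == 0
    )
```

With a 300-second tick, only the CI pipeline check runs at 5 minutes, three checks run at 15 minutes, and all five coincide once per hour.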
Drift Remediation
When drift is detected:
- Warning alerts indicate non-critical drift (e.g., linter rule relaxation). Investigate and either update the baseline or restore the configuration.
- Critical alerts indicate governance-impacting drift (e.g., CI stages removed, security rules disabled). These require immediate remediation.
- Use scripts/restore-baseline.sh to restore a specific category to its baseline state.
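The `aeef_drift_detected{category="..."}` gauge used by the alert rules could be produced by comparing a hash of each category's configuration files against a stored baseline. The sketch below illustrates that idea and is not the repository's implementation: the `fingerprint` and `drift_metrics` functions and the mapping from categories to files are assumptions.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Hash file names and contents in a stable order for a reproducible digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def drift_metrics(current, baseline):
    """Render one gauge per baseline category: 1 if its hash changed, else 0.

    `current` and `baseline` map category names (e.g. "linting", "ci",
    "security") to fingerprints; a category missing from `current` counts
    as drifted.
    """
    lines = []
    for category in sorted(baseline):
        drifted = int(current.get(category) != baseline[category])
        lines.append(f'aeef_drift_detected{{category="{category}"}} {drifted}')
    return "\n".join(lines) + "\n"
```

After a category is restored to its baseline state, its fingerprint matches again, the gauge returns to 0, and the corresponding alert resolves.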
Next Steps
- Configure alerts: Customize routing rules for your team's notification preferences
- Add custom dashboards: Extend the pre-built dashboards with organization-specific panels
- Sovereign compliance: See Sovereign Compliance Overlays for jurisdiction-specific monitoring