Monitoring Setup

git clone https://github.com/AEEF-AI/aeef-production.git

The Production tier monitoring stack provides continuous visibility into AEEF governance health, KPI trends, and configuration drift. It deploys via Docker Compose and includes pre-configured Grafana dashboards, Prometheus metric collection, AlertManager routing, and a health check service.

Monitoring Stack Overview

docker-compose.monitoring.yml
  grafana       --> Pre-built AEEF dashboards (port 3001)
  prometheus    --> Metric collection and storage (port 9090)
  alertmanager  --> Alert routing to Slack/PagerDuty/email (port 9093)
  healthcheck   --> Periodic governance validation service (port 8081)

Deployment

docker compose -f docker-compose.monitoring.yml up -d
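
Once the containers are up, it can help to confirm that each service answers on its published port. The following is a minimal sketch, not part of the repository: the host ports come from the stack overview, the Prometheus, Alertmanager, and Grafana paths are their standard health endpoints, and the healthcheck path `/health` is an assumption that may differ in your build.

```python
import urllib.error
import urllib.request

# Host ports are taken from the stack overview. The healthcheck path is
# hypothetical; replace it with whatever the service actually exposes.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3001/api/health",
    "healthcheck": "http://localhost:8081/health",  # assumed path
}


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP error.
        return False


def probe_all(endpoints=ENDPOINTS, opener=urllib.request.urlopen):
    """Probe every endpoint and return {service_name: is_up}."""
    return {name: probe(url, opener=opener) for name, url in endpoints.items()}
```

Calling `probe_all()` on the Docker host returns a dictionary such as `{"prometheus": True, ...}`; any `False` entry is a prompt to inspect that container's logs with `docker compose -f docker-compose.monitoring.yml logs <service>`.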

Docker Compose Configuration

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/rules/:/etc/prometheus/rules/
      - prometheus-data:/prometheus
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./monitoring/dashboards/:/var/lib/grafana/dashboards/
      - ./monitoring/provisioning/:/etc/grafana/provisioning/
      - grafana-data:/var/lib/grafana
    ports: ["3001:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]

  healthcheck:
    build: ./monitoring/healthcheck/
    ports: ["8081:8081"]
    environment:
      - CHECK_INTERVAL=300
      - AEEF_CONFIG_PATH=/etc/aeef/

volumes:
  prometheus-data:
  grafana-data:

KPI Dashboard Data Pipeline

The KPI dashboard visualizes metrics collected by the Metrics Pipeline:

Data Flow

CI Pipeline --> Provenance Records --> Collection Scripts --> JSON Records --> Prometheus Pushgateway --> Prometheus --> Grafana

Prometheus Configuration

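# monitoring/prometheus.yml
global:
  scrape_interval: 30s

# Forward firing alerts to the alertmanager service from the Compose file.
# Added on the assumption that it is missing from this excerpt: without an
# alerting section, Prometheus evaluates rules but never notifies Alertmanager.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # 'app' and 'pushgateway' are not defined in the Compose file above; they
  # must be reachable on the same Docker network under these names.
  - job_name: 'aeef-app'
    static_configs:
      - targets: ['app:8080']

  - job_name: 'aeef-healthcheck'
    static_configs:
      - targets: ['healthcheck:8081']

  - job_name: 'pushgateway'
    honor_labels: true  # keep the job/instance labels set by the pushing scripts
    static_configs:
      - targets: ['pushgateway:9091']

rule_files:
  - /etc/prometheus/rules/aeef-alerts.yml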

KPI Dashboard Panels

The pre-built KPI dashboard (monitoring/dashboards/aeef-kpi.json) includes:

| Panel | Metric | Visualization |
|---|---|---|
| AI Contribution Ratio | aeef_ai_contributions_total | Time series line chart |
| PR Cycle Time | aeef_pr_cycle_time_seconds | Histogram heatmap |
| Deployment Frequency | aeef_deployments_total | Bar chart (weekly) |
| Defect Density | aeef_defects_per_kloc | Gauge with thresholds |
| Security Scan Pass Rate | aeef_security_scan_pass_ratio | Stat panel (percentage) |
| Coverage Trend | aeef_test_coverage_percent | Time series with threshold line |
| Mutation Score Trend | aeef_mutation_score_percent | Time series with threshold line |
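
The same metrics can be read outside Grafana through the Prometheus HTTP API, which is useful for spot-checking a panel. The sketch below uses the standard `/api/v1/query` endpoint; the base URL assumes the port mapping from the Compose file, and the example expressions in the note that follows are illustrative rather than the queries the shipped dashboards necessarily use.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base="http://localhost:9090"):
    """Build an instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Return [(labels, value), ...] from a decoded /api/v1/query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample is {"metric": {labels}, "value": [unix_ts, "value-as-string"]}.
    return [(sample["metric"], float(sample["value"][1])) for sample in data["result"]]


def query(expr, base="http://localhost:9090"):
    """Run an instant query against Prometheus and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(expr, base), timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

For example, `query("aeef_test_coverage_percent")` returns the current coverage value per label set, and `query("sum(increase(aeef_deployments_total[7d]))")` approximates a weekly deployment count.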

Trust Metrics Dashboard

The trust metrics dashboard (monitoring/dashboards/aeef-trust.json) monitors PRD-STD-010 (AI Product Safety & Trust):

| Panel | Description |
|---|---|
| Agent Trust Levels | Current trust level for each active agent |
| Trust Boundary Violations | Count of agent actions exceeding declared permissions |
| Human Override Rate | Percentage of agent outputs overridden by human reviewers |
| Escalation Frequency | Rate of automatic escalations from agents to human operators |
| Bias Detection Results | Latest bias detection pipeline output |
| Safety Test Pass Rate | Percentage of safety test suites passing |

Drift Detection Alerts

Prometheus alerting rules for configuration drift and quality thresholds:

# monitoring/rules/aeef-alerts.yml
groups:
  - name: aeef-drift
    rules:
      - alert: LinterConfigDrift
        expr: aeef_drift_detected{category="linting"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Linter configuration has drifted from baseline"
          runbook: "https://aeef.ai/reference-implementations/production-tier/monitoring-setup#drift-remediation"

      - alert: CIPipelineDrift
        expr: aeef_drift_detected{category="ci"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CI pipeline configuration has drifted from baseline"

      - alert: SecurityPolicyDrift
        expr: aeef_drift_detected{category="security"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Security policy configuration has drifted from baseline"

      - alert: CoverageBelowThreshold
        expr: aeef_test_coverage_percent < 80
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Test coverage has fallen below 80% threshold"

      - alert: MutationScoreBelowThreshold
        expr: aeef_mutation_score_percent < 70
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Mutation score has fallen below 70% threshold"
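
The `for:` clause means a rule fires only after its expression has held continuously for that duration; a brief dip below a threshold leaves the alert pending and then clears it. The following sketch simulates that behaviour for a "value below threshold" rule. It is a simplified model for reasoning about the rules above, not Prometheus's own evaluator, and the function name is hypothetical.

```python
def firing_times(samples, threshold, hold_seconds):
    """Return the timestamps at which a `value < threshold` alert is firing.

    samples: list of (unix_timestamp, value) pairs in ascending time order,
             one pair per rule evaluation.
    hold_seconds: the `for:` duration in seconds.
    """
    firing = []
    pending_since = None  # time the condition first became true, if still true
    for timestamp, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = timestamp
            if timestamp - pending_since >= hold_seconds:
                firing.append(timestamp)
        else:
            # Any evaluation where the condition is false resets the timer.
            pending_since = None
    return firing
```

With coverage samples taken every 30 minutes, `firing_times([(0, 79), (1800, 78), (3600, 77), (5400, 85)], threshold=80, hold_seconds=3600)` returns `[3600]`: the alert fires once the value has stayed under 80 for a full hour and clears when coverage recovers.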

AlertManager Routing

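# monitoring/alertmanager.yml
# Note: Alertmanager does not expand environment variables in this file.
# The ${...} placeholders must be substituted before the container starts
# (for example with envsubst), or replaced with literal values.
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    # 'matchers' replaces the deprecated 'match' syntax.
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'
      repeat_interval: 1h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#aeef-alerts'
        api_url: '${SLACK_WEBHOOK_URL}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#aeef-warnings'
        api_url: '${SLACK_WEBHOOK_URL}'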
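
Because Alertmanager reads its configuration file literally and does not expand environment variables, the `${SLACK_WEBHOOK_URL}` and `${PAGERDUTY_SERVICE_KEY}` placeholders need to be filled in before the container starts. One way to do that, sketched here under the assumption that the file is kept as a template and rendered at deploy time, uses Python's `string.Template`, whose `${NAME}` syntax matches the placeholders. The function names and file paths are hypothetical.

```python
import os
from pathlib import Path
from string import Template


def render_config(template_text, env):
    """Substitute ${NAME} placeholders; raise KeyError if a variable is missing."""
    # substitute() (rather than safe_substitute()) fails loudly on a missing
    # secret instead of writing a literal "${...}" into the live config.
    # Caveat: any other "$" in the file must be written as "$$".
    return Template(template_text).substitute(env)


def write_rendered(template_path, output_path, env=None):
    """Render a template file to output_path using env (defaults to os.environ)."""
    env = os.environ if env is None else env
    rendered = render_config(Path(template_path).read_text(encoding="utf-8"), env)
    Path(output_path).write_text(rendered, encoding="utf-8")
    return output_path
```

A deploy script might call `write_rendered("monitoring/alertmanager.yml.tmpl", "monitoring/alertmanager.yml")` before `docker compose up`, where the `.tmpl` filename is only an example.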

Health Check Service

The health check service periodically validates all AEEF governance controls:

| Check | Frequency | What It Validates |
|---|---|---|
| CI pipeline status | Every 5 min | All required status checks are configured and passing |
| Branch protection | Every 15 min | Protected branches have required reviewers and checks |
| Agent registry | Every 15 min | All agent contracts are valid and within trust boundaries |
| Security rules | Every 30 min | Semgrep rules are present and not modified |
| Overlay compliance | Every 1 hour | Sovereign overlay controls are active |
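
One plausible way to run checks at different frequencies from a single loop is to wake on the shortest interval (the 300-second `CHECK_INTERVAL` in the Compose file) and run only the checks that are due. The sketch below illustrates that idea with the frequencies from the table; the check names and the scheduling approach are assumptions, not the service's actual implementation.

```python
# Intervals in seconds, taken from the frequency table. The keys are
# hypothetical identifiers chosen for this example.
CHECK_INTERVALS = {
    "ci_pipeline_status": 5 * 60,
    "branch_protection": 15 * 60,
    "agent_registry": 15 * 60,
    "security_rules": 30 * 60,
    "overlay_compliance": 60 * 60,
}


def due_checks(now, last_run, intervals=CHECK_INTERVALS):
    """Return the checks whose interval has elapsed since they last ran.

    now: current time in seconds.
    last_run: {check_name: time of last run}; checks never run are always due.
    """
    due = []
    for name, interval in intervals.items():
        previous = last_run.get(name)
        if previous is None or now - previous >= interval:
            due.append(name)
    return due
```

If every check last ran at time 0, then `due_checks(900, ...)` returns the CI, branch protection, and agent registry checks, while the security and overlay checks wait for their longer intervals.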

Drift Remediation

When drift is detected:

  1. Warning alerts indicate non-critical drift (e.g., linter rule relaxation). Investigate and either update the baseline or restore the configuration.
  2. Critical alerts indicate governance-impacting drift (e.g., CI stages removed, security rules disabled). These require immediate remediation.
  3. Use scripts/restore-baseline.sh to restore a specific category to its baseline state.
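
For context on what produces these alerts, drift in a category can be detected by comparing file hashes against a recorded baseline and exporting the result as the `aeef_drift_detected` gauge used by the alert rules. The sketch below shows that approach under stated assumptions: the baseline format (category, then relative path, then SHA-256 digest) and the function names are hypothetical, and the repository's own detector may work differently.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def detect_drift(baseline, root="."):
    """Compare files under root against the baseline.

    baseline: {category: {relative_path: expected_sha256_hex}} (assumed format).
    Returns {category: 1 if any file is missing or changed, else 0}.
    """
    result = {}
    for category, files in baseline.items():
        drifted = False
        for relative_path, expected in files.items():
            path = Path(root) / relative_path
            if not path.is_file() or file_digest(path) != expected:
                drifted = True
                break
        result[category] = int(drifted)
    return result


def as_metrics(result):
    """Render drift results as Prometheus exposition lines."""
    return "".join(
        f'aeef_drift_detected{{category="{category}"}} {value}\n'
        for category, value in sorted(result.items())
    )
```

A category reporting `1` corresponds to a drift alert for that category; after restoring the files or updating the baseline, the next run reports `0` and the alert resolves.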

Next Steps

  • Configure alerts: Customize routing rules for your team's notification preferences
  • Add custom dashboards: Extend the pre-built dashboards with organization-specific panels
  • Sovereign compliance: See Sovereign Compliance Overlays for jurisdiction-specific monitoring