Monitoring Setup

git clone https://github.com/AEEF-AI/aeef-production.git

The Production tier monitoring stack provides continuous visibility into AEEF governance health, KPI trends, and configuration drift. It deploys via Docker Compose and includes pre-configured Grafana dashboards, Prometheus metric collection, AlertManager routing, and a health check service.

Monitoring Stack Overview

docker-compose.monitoring.yml
  grafana        --> Pre-built AEEF dashboards (port 3001)
  prometheus     --> Metric collection and storage (port 9090)
  alertmanager   --> Alert routing to Slack/PagerDuty/email (port 9093)
  healthcheck    --> Periodic governance validation service (port 8081)

Deployment

docker compose -f docker-compose.monitoring.yml up -d
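
Once the containers are up, it can help to confirm that each service answers on its published port. The following is a minimal sketch, not part of the AEEF repository: the ports come from the stack overview above, `/-/ready` and `/api/health` are the standard readiness endpoints of Prometheus, Alertmanager, and Grafana, and the health check path (`/`) is an assumption because the source does not document it. The function names are hypothetical.

```python
import http.client
import urllib.request


def readiness_urls(host: str = "localhost") -> dict[str, str]:
    """Map each monitoring service to a URL that should answer when it is up."""
    return {
        "prometheus": f"http://{host}:9090/-/ready",
        "grafana": f"http://{host}:3001/api/health",
        "alertmanager": f"http://{host}:9093/-/ready",
        # Assumption: the healthcheck service responds on its root path.
        "healthcheck": f"http://{host}:8081/",
    }


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or HTTP error status.
        return False


def report(host: str = "localhost") -> dict[str, bool]:
    """Probe every service once and return name -> reachable."""
    return {name: is_up(url) for name, url in readiness_urls(host).items()}
```

Calling `report()` on the Docker host returns a dictionary such as `{"prometheus": True, ...}`; a `False` entry usually means the container is still starting or the port mapping differs from the one shown here.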

Docker Compose Configuration

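# Note: prometheus.yml also scrapes app:8080 and pushgateway:9091. Neither
# service is defined in this excerpt, so both must be reachable on the same
# Docker network for those scrape jobs to succeed.
services: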
  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/rules/:/etc/prometheus/rules/
      - prometheus-data:/prometheus
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./monitoring/dashboards/:/var/lib/grafana/dashboards/
      - ./monitoring/provisioning/:/etc/grafana/provisioning/
      - grafana-data:/var/lib/grafana
    ports: ["3001:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports: ["9093:9093"]

  healthcheck:
    build: ./monitoring/healthcheck/
    ports: ["8081:8081"]
    environment:
      - CHECK_INTERVAL=300
      - AEEF_CONFIG_PATH=/etc/aeef/

volumes:
  prometheus-data:
  grafana-data:

KPI Dashboard Data Pipeline

The KPI dashboard visualizes metrics collected by the Metrics Pipeline:

Data Flow

CI Pipeline --> Provenance Records --> Collection Scripts --> JSON Records --> Prometheus Pushgateway --> Grafana

Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'aeef-app'
    static_configs:
      - targets: ['app:8080']

  - job_name: 'aeef-healthcheck'
    static_configs:
      - targets: ['healthcheck:8081']

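  - job_name: 'pushgateway'
    # honor_labels keeps the job and instance labels of pushed metrics instead
    # of renaming them to exported_job / exported_instance.
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']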

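rule_files:
  - /etc/prometheus/rules/aeef-alerts.yml

# Forward firing alerts to the alertmanager service. Without this block the
# rules are evaluated but no notifications are sent.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']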

KPI Dashboard Panels

The pre-built KPI dashboard (monitoring/dashboards/aeef-kpi.json) includes:

| Panel | Metric | Visualization |
| --- | --- | --- |
| AI Contribution Ratio | `aeef_ai_contributions_total` | Time series line chart |
| PR Cycle Time | `aeef_pr_cycle_time_seconds` | Histogram heatmap |
| Deployment Frequency | `aeef_deployments_total` | Bar chart (weekly) |
| Defect Density | `aeef_defects_per_kloc` | Gauge with thresholds |
| Security Scan Pass Rate | `aeef_security_scan_pass_ratio` | Stat panel (percentage) |
| Coverage Trend | `aeef_test_coverage_percent` | Time series with threshold line |
| Mutation Score Trend | `aeef_mutation_score_percent` | Time series with threshold line |

Trust Metrics Dashboard

The trust metrics dashboard (monitoring/dashboards/aeef-trust.json) monitors PRD-STD-010 (AI Product Safety & Trust):

| Panel | Description |
| --- | --- |
| Agent Trust Levels | Current trust level for each active agent |
| Trust Boundary Violations | Count of agent actions exceeding declared permissions |
| Human Override Rate | Percentage of agent outputs overridden by human reviewers |
| Escalation Frequency | Rate of automatic escalations from agents to human operators |
| Bias Detection Results | Latest bias detection pipeline output |
| Safety Test Pass Rate | Percentage of safety test suites passing |

Drift Detection Alerts

Prometheus alerting rules for configuration drift:

# monitoring/rules/aeef-alerts.yml
groups:
  - name: aeef-drift
    rules:
      - alert: LinterConfigDrift
        expr: aeef_drift_detected{category="linting"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Linter configuration has drifted from baseline"
          runbook: "https://aeef.ai/reference-implementations/production-tier/monitoring-setup#drift-remediation"

      - alert: CIPipelineDrift
        expr: aeef_drift_detected{category="ci"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CI pipeline configuration has drifted from baseline"

      - alert: SecurityPolicyDrift
        expr: aeef_drift_detected{category="security"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Security policy configuration has drifted from baseline"

      - alert: CoverageBelowThreshold
        expr: aeef_test_coverage_percent < 80
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Test coverage has fallen below 80% threshold"

      - alert: MutationScoreBelowThreshold
        expr: aeef_mutation_score_percent < 70
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Mutation score has fallen below 70% threshold"
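
The `for` clause in these rules means an alert does not fire the moment its expression becomes true: it first enters a pending state and fires only once the condition has held continuously for the stated duration. The sketch below models that behavior for a "value below threshold" rule such as `CoverageBelowThreshold`. It is a simplified illustration, not Prometheus code; the `Sample` type and `alert_state` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: int         # evaluation time in seconds
    value: float   # metric value at that time


def alert_state(samples: list[Sample], threshold: float, for_seconds: int) -> str:
    """Return 'inactive', 'pending', or 'firing' after the last sample.

    Models a rule of the form `metric < threshold` with a `for` duration.
    """
    active_since = None
    state = "inactive"
    for s in samples:
        if s.value < threshold:
            if active_since is None:
                active_since = s.t          # condition just became true
            held = s.t - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None             # any recovery resets the timer
            state = "inactive"
    return state
```

With `threshold=80` and `for_seconds=3600`, coverage that dips to 75% and recovers within the hour never leaves the pending state, which is why short-lived dips do not produce a warning.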

AlertManager Routing

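# monitoring/alertmanager.yml
# Note: Alertmanager does not expand ${VAR} placeholders itself. Render this
# file (for example with envsubst) before the container starts.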
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 1h

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#aeef-alerts'
        api_url: '${SLACK_WEBHOOK_URL}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#aeef-warnings'
        api_url: '${SLACK_WEBHOOK_URL}'
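
The routing tree above is evaluated top to bottom: the first child route whose labels match wins, and an alert that matches no child route falls back to the root receiver. A minimal sketch of that decision, assuming no route sets `continue: true` (the names `ROUTES` and `pick_receiver` are hypothetical and only mirror the configuration shown):

```python
# Child routes in the order they appear in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "default"  # receiver on the root route


def pick_receiver(labels: dict[str, str]) -> str:
    """Return the receiver for an alert with the given labels."""
    for match, receiver in ROUTES:
        # A route matches when every label it names has the expected value.
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER
```

Under this model `CIPipelineDrift` and `SecurityPolicyDrift` page through PagerDuty, `LinterConfigDrift` and the two threshold alerts go to `#aeef-warnings`, and any alert without a `severity` label lands in `#aeef-alerts`.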

Health Check Service

The health check service periodically validates all AEEF governance controls:

| Check | Frequency | What It Validates |
| --- | --- | --- |
| CI pipeline status | Every 5 min | All required status checks are configured and passing |
| Branch protection | Every 15 min | Protected branches have required reviewers and checks |
| Agent registry | Every 15 min | All agent contracts are valid and within trust boundaries |
| Security rules | Every 30 min | Semgrep rules are present and not modified |
| Overlay compliance | Every 1 hour | Sovereign overlay controls are active |
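
Every frequency in the table is a multiple of five minutes, which matches the `CHECK_INTERVAL=300` setting in the Compose file. One plausible way to schedule the checks, sketched here as an assumption rather than a description of the actual service, is to wake up once per interval and run whichever checks are due on that tick. The check names and the `due_checks` function are hypothetical.

```python
CHECK_INTERVAL = 300  # seconds; same value as CHECK_INTERVAL in the Compose file

# Period of each check in seconds, taken from the table above.
CHECKS = {
    "ci_pipeline_status": 5 * 60,
    "branch_protection": 15 * 60,
    "agent_registry": 15 * 60,
    "security_rules": 30 * 60,
    "overlay_compliance": 60 * 60,
}


def due_checks(tick: int) -> list[str]:
    """Return the checks to run on the given tick (tick 0 is service start)."""
    elapsed = tick * CHECK_INTERVAL
    return [name for name, period in CHECKS.items() if elapsed % period == 0]
```

On tick 0 every check runs, on tick 1 only the CI pipeline check runs, and the full set repeats every 12 ticks (one hour).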

Drift Remediation

When drift is detected:

Warning alerts indicate non-critical drift (e.g., linter rule relaxation). Investigate and either update the baseline or restore the configuration.
Critical alerts indicate governance-impacting drift (e.g., CI stages removed, security rules disabled). These require immediate remediation.
Use scripts/restore-baseline.sh to restore a specific category to its baseline state.
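
The source does not describe how drift is computed. One common approach, shown here purely as an assumption, is to compare a SHA-256 digest of each governed file against a stored baseline and report one value per category, which maps naturally onto the `aeef_drift_detected{category="..."}` metric used by the alert rules. The baseline layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_drift(baseline: dict[str, dict[str, str]], root: Path) -> dict[str, int]:
    """Compare files under root against the baseline.

    baseline maps category -> {relative path: expected sha256}.
    Returns category -> 1 if any file is missing or changed, else 0.
    """
    result = {}
    for category, files in baseline.items():
        drifted = any(
            not (root / rel).is_file() or file_digest(root / rel) != expected
            for rel, expected in files.items()
        )
        result[category] = int(drifted)
    return result
```

A category that reports `1` corresponds to an alert such as `LinterConfigDrift`; after the baseline is updated or the file is restored, the same call reports `0` and the alert resolves.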

Next Steps

Configure alerts: Customize routing rules for your team's notification preferences
Add custom dashboards: Extend the pre-built dashboards with organization-specific panels
Sovereign compliance: See Sovereign Compliance Overlays for jurisdiction-specific monitoring
