
Observability & Metrics

Signals

  • Metrics: Prometheus counters/histograms for alerts, automations, orchestrations, approvals, API latency.
  • Logs: structured JSON (structlog) shipped to Cloud Logging.
  • Traces: OpenTelemetry spans exported to Cloud Trace (or Jaeger), with instrumentation for FastAPI, SQLAlchemy, and requests.

Selected metrics

  • sre_automations_executed_total{agent,action_type,status,tenant_id}
  • sre_automation_duration_seconds{agent,action_type,tenant_id}
  • sre_alerts_* families
  • API: sre_api_request_duration_seconds, sre_api_requests_total
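
As a rough illustration of how the two automation metrics above could be declared and recorded with the prometheus_client library, here is a minimal sketch. The metric names and label names come from the list above; the help strings, the run_automation helper, the "success"/"failure" status values, and the dedicated registry are assumptions made for the example, not the definitions in src/monitoring/observability.py.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

# A dedicated registry keeps the example isolated from the global default one.
registry = CollectorRegistry()

AUTOMATIONS_EXECUTED = Counter(
    "sre_automations_executed_total",
    "Automations executed, by outcome (help text is illustrative).",
    ["agent", "action_type", "status", "tenant_id"],
    registry=registry,
)
AUTOMATION_DURATION = Histogram(
    "sre_automation_duration_seconds",
    "Wall-clock duration of automation runs (help text is illustrative).",
    ["agent", "action_type", "tenant_id"],
    registry=registry,
)


def run_automation(agent, action_type, tenant_id, action):
    """Hypothetical wrapper: run `action`, then record duration and outcome."""
    start = time.perf_counter()
    status = "success"  # assumed status value
    try:
        return action()
    except Exception:
        status = "failure"  # assumed status value
        raise
    finally:
        elapsed = time.perf_counter() - start
        AUTOMATION_DURATION.labels(
            agent=agent, action_type=action_type, tenant_id=tenant_id
        ).observe(elapsed)
        AUTOMATIONS_EXECUTED.labels(
            agent=agent, action_type=action_type, status=status, tenant_id=tenant_id
        ).inc()
```

Calling run_automation("remediator", "restart_pod", "tenant-a", some_callable) would increment the counter once and add one observation to the histogram for that label set. Recording in a finally block means failed runs are counted too, with a different status label.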

Flow

Configuration

  • Tracing, metrics, and log level are configured through environment settings (see src/core/config.py and src/monitoring/observability.py), including flags that enable or disable tracing and metrics.
  • Observability initialization is best-effort: when an optional dependency is missing, it logs a warning and the service continues.
  • src/monitoring/observability.py defines all Prometheus metrics (alerts, automations, orchestrations, approvals, rollbacks, API) and wires OpenTelemetry tracing and Cloud Logging.
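
The flag-driven, warn-and-continue behavior described above might look roughly like the following sketch. The environment variable names (ENABLE_METRICS, ENABLE_TRACING, LOG_LEVEL) and the helper functions are hypothetical and chosen for illustration; the actual setting names live in src/core/config.py.

```python
import importlib
import logging
import os

logger = logging.getLogger("observability")


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean environment variable such as 'true', '1', or 'off'."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def optional_import(module_name: str):
    """Return the imported module, or None (after a warning) if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s is not installed; continuing without it", module_name)
        return None


def init_observability() -> dict:
    """Enable each signal only if its flag is set and its dependency imports."""
    # LOG_LEVEL is an assumed variable name; fall back to INFO if unrecognized.
    level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), None)
    logging.basicConfig(level=level if isinstance(level, int) else logging.INFO)

    status = {"metrics": False, "tracing": False}
    if env_flag("ENABLE_METRICS"):  # assumed flag name
        status["metrics"] = optional_import("prometheus_client") is not None
    if env_flag("ENABLE_TRACING"):  # assumed flag name
        status["tracing"] = optional_import("opentelemetry.trace") is not None
    return status
```

The returned dictionary reports which signals were actually enabled, so a caller could log it at startup. A missing package never raises here, which matches the warn-and-continue behavior.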