Observability & Metrics
Signals
- Metrics: Prometheus counters/histograms for alerts, automations, orchestrations, approvals, API latency.
- Logs: structured JSON (structlog) shipped to Cloud Logging.
- Traces: OpenTelemetry to Cloud Trace (or Jaeger) with FastAPI/SQLAlchemy/requests instrumentation.
Selected metrics
- sre_automations_executed_total{agent,action_type,status,tenant_id}
- sre_automation_duration_seconds{agent,action_type,tenant_id}
- sre_alerts_* families
- API: sre_api_request_duration_seconds, sre_api_requests_total
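As an illustration of how the automation metrics listed above might be declared and recorded, here is a minimal sketch using the prometheus_client library. The metric names and label sets come from the list. The metric types (counter for the `_total` metric, histogram for the duration), the help strings, the `record_automation` helper, and the sample label values are assumptions made for this example, and the actual definitions in src/monitoring/observability.py may differ.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

# Assumed to be a counter: one increment per finished automation run.
AUTOMATIONS_EXECUTED = Counter(
    "sre_automations_executed_total",
    "Automations executed, by agent, action type, status, and tenant.",
    ["agent", "action_type", "status", "tenant_id"],
    registry=registry,
)

# Assumed to be a histogram: one observation per run, in seconds.
AUTOMATION_DURATION = Histogram(
    "sre_automation_duration_seconds",
    "Automation execution time in seconds.",
    ["agent", "action_type", "tenant_id"],
    registry=registry,
)


def record_automation(agent, action_type, tenant_id, status, duration_seconds):
    """Record one finished automation run (hypothetical helper)."""
    AUTOMATIONS_EXECUTED.labels(
        agent=agent, action_type=action_type, status=status, tenant_id=tenant_id
    ).inc()
    AUTOMATION_DURATION.labels(
        agent=agent, action_type=action_type, tenant_id=tenant_id
    ).observe(duration_seconds)


# Hypothetical label values, for illustration only.
record_automation("remediator", "restart_pod", "tenant-a", "success", 1.5)
```

Note that status is a label on the counter only. Leaving it off the histogram keeps the number of duration series lower, at the cost of not being able to split latency by outcome.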
Flow
Configuration
- Tracing and metrics are configurable via environment settings (see src/core/config.py and src/monitoring/observability.py), including flags for enabling tracing/metrics and log level.
- Observability initializes opportunistically; when optional deps are missing it logs a warning and continues.
src/monitoring/observability.py defines all Prometheus metrics (alerts, automations, orchestrations, approvals, rollbacks, API) and wires OpenTelemetry tracing and Cloud Logging.
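The environment-driven flags and the opportunistic initialization described under Configuration can be sketched roughly as follows, using only the standard library. The environment variable names (ENABLE_METRICS, ENABLE_TRACING, LOG_LEVEL) and the functions `env_flag`, `optional_import`, and `init_observability` are hypothetical names chosen for this example, not taken from src/core/config.py. The sketch only checks that each optional dependency can be imported; the real module also registers metrics and wires exporters.

```python
import importlib
import logging
import os

logger = logging.getLogger("observability")


def env_flag(name, default=True):
    """Read a boolean flag from the environment (variable names are hypothetical)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def optional_import(module_name):
    """Return the imported module, or None with a warning if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s not installed; feature disabled", module_name)
        return None


def init_observability():
    """Enable each signal only if its flag is set and its dependency imports."""
    logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO").upper())
    enabled = {"metrics": False, "tracing": False}
    if env_flag("ENABLE_METRICS"):
        enabled["metrics"] = optional_import("prometheus_client") is not None
    if env_flag("ENABLE_TRACING"):
        enabled["tracing"] = optional_import("opentelemetry.trace") is not None
    return enabled
```

Calling `init_observability()` at startup returns a dict such as `{"metrics": ..., "tracing": ...}` that the rest of the application could consult before touching a signal. A missing package results in a warning rather than a crash.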