Configuration Model and Multitenant/RBAC

This document consolidates environment-level settings, persistent configuration scopes, precedence rules, API contracts, and cache/audit behavior. It replaces configurability.md and CONFIGURATION_EXTERNALIZATION_RECOMMENDATIONS.md with a single source of truth.

Goals

  • Single, unambiguous source of truth for each setting.
  • Clear separation between operator-only (env) vs. tenant-editable (DB-backed) config.
  • Consistent precedence resolution across services with caching and auditing.

Source-of-Truth Matrix

  1. Env-Only (operator/process scope)
  • Examples: DATABASE_URL, DB_*, REDIS_URL, SMTP_*, SECRET_KEY, WEBHOOK_SECRET, BASE_URL, CORS_ORIGINS, FRONTEND_ORIGIN, LOG_LEVEL, CACHE_DEFAULT_TTL, PLATFORM_ADMIN_EMAILS, default HTTP client timeouts.
  • Ownership: platform operator.
  • Persistence: process env via src/core/config.py:Settings.
  • Change semantics: restart or operator-only reload endpoint; no tenant audit trails (logged as operator/system).
  2. Global (platform-wide, persisted)
  • Examples: default incident complexity/routing rules, allowed APIs/roles templates for onboarding, global alerting bands, other platform defaults that are safe to expose.
  • Ownership: platform_admin (optionally public via is_public=true).
  • Persistence: GlobalConfiguration.
  • Change semantics: invalidate cache key global_{key}; audit/log.
  3. Tenant Risk & Policy (persisted per tenant)
  • Examples: max_cost_impact_auto, max_cost_impact_approval, max_resource_scaling, approval_timeout_minutes, automation_enabled, business_hours_only, max_concurrent_automations; per service-type override JSON; notification_config (see schema below).
  • Ownership: tenant_admin.
  • Persistence: TenantRiskConfiguration and Tenant.notification_config.
  • Change semantics: invalidate risk_config_{tenant_id} and append ConfigurationHistory.
  4. Agent Defaults (persisted per tenant + agent type)
  • Examples: default_region/zone, metrics_duration_minutes, validation_timeout_seconds, default mem/cpu/scaling, monitoring thresholds, templates/tools.
  • Ownership: tenant_admin.
  • Persistence: AgentConfiguration.
  • Change semantics: invalidate agent_config_{tenant_id}_{agent_type} and append ConfigurationHistory.
  5. Service Type Defaults (persisted per service type)
  • Examples: default_settings, scaling_limits, monitoring_defaults for Cloud Run, GKE, BigQuery, Cloud SQL.
  • Ownership: platform_admin (optionally expose as presets).
  • Persistence: ServiceTypeConfiguration.
  • Change semantics: invalidate service_type_{service_type}_{configuration_name}; admin audit.
  6. Project-Level (persisted per tenant + project)
  • Examples: region, environment, SA reference, budget/labels, project_config flags.
  • Ownership: tenant_admin.
  • Persistence: GcpProject.
  • Change semantics: per-project cache invalidation and ConfigurationHistory (resource_id = project_id).
  7. Service Instance Overrides (persisted per tenant + service instance)
  • Examples: custom_thresholds, min/max_instances, monitoring_enabled, automation_enabled, maintenance_windows, notification_channels.
  • Ownership: tenant_admin (operators under guardrails).
  • Persistence: ServiceConfiguration.
  • Change semantics: invalidate instance cache; ConfigurationHistory with config_type=service_config.

Resolution order (highest → lowest):

  • Service instance override → AgentConfiguration (tenant+agent) → TenantRiskConfiguration → ServiceTypeConfiguration → GlobalConfiguration (public) → Env defaults (Settings).

Alignment: Env vs. Persisted Config

  • Keep a single env loader: src/core/config.py:Settings. Do not introduce a parallel loader. Use env exclusively for operator-only controls and as defaults when seeding DB rows.
  • Seed paths:
    • Tenant creation: seed TenantRiskConfiguration and AgentConfiguration using Settings values where sensible.
    • ConfigurationService should read settings.cache_default_ttl when constructing the cache TTL.
  • Transport/client knobs: defaults from env; allow per-tenant override inside Tenant.notification_config (e.g., webhook timeout/retries). Keep process-wide hard caps in env as safety rails.

Tenant.notification_config Schema

Add notification_config JSON to Tenant for approvals and notifications. Suggested schema:

Top-level fields

  • webhook: object
  • email: object
  • chat: object (optional; e.g., Google Chat)
  • approver_groups: object (risk level/group mapping)
  • escalation: object

webhook

  • Single endpoint form:
    • url (string, required)
    • headers (object, optional; default {})
    • auth (object, optional; one of)
      • bearer: { "type": "bearer", "token": "..." }
      • basic: { "type": "basic", "username": "...", "password": "..." }
      • api_key: { "type": "api_key", "api_key": "...", "header_name": "X-API-Key" }
    • secret (string, optional; HMAC signing)
    • timeout_seconds (int, optional; default 30)
    • retry_attempts (int, optional; default 3)
    • backoff (object, optional): { "initial_seconds": 1, "factor": 2, "max_seconds": 8 }
  • Multiple endpoints form:
    • endpoints (array of objects with the same fields as above)
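Both webhook forms can be reduced to one list before sending. The sketch below is illustrative only: normalize_webhook_endpoints and the two DEFAULT_* constants are hypothetical names (the real defaults would come from Settings), and it assumes url is the only required field.

```python
from typing import Any, Dict, List

# Assumed stand-ins for the process defaults held in Settings.
DEFAULT_TIMEOUT_SECONDS = 30
DEFAULT_RETRY_ATTEMPTS = 3


def normalize_webhook_endpoints(webhook: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a list of endpoint dicts for either webhook form, with defaults filled in."""
    if "endpoints" in webhook:
        raw = list(webhook["endpoints"])   # multiple endpoints form
    elif "url" in webhook:
        raw = [webhook]                    # single endpoint form
    else:
        raw = []                           # no webhook configured

    normalized: List[Dict[str, Any]] = []
    for endpoint in raw:
        if not endpoint.get("url"):
            raise ValueError("webhook endpoint is missing required field 'url'")
        item = dict(endpoint)              # copy so the stored config is not mutated
        item.setdefault("headers", {})
        item.setdefault("timeout_seconds", DEFAULT_TIMEOUT_SECONDS)
        item.setdefault("retry_attempts", DEFAULT_RETRY_ATTEMPTS)
        normalized.append(item)
    return normalized
```

Normalizing once at read time lets the sending code iterate over endpoints without checking which form the tenant used.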

email

  • approval_recipients (array[string], optional)
  • default_recipients (array[string], optional)
  • from_email (string, optional; override)
  • template_overrides (object, optional) e.g. { "approval_request": { "subject": "...", "branding": { ... } } }

chat (optional)

  • google_chat: { "webhook_url": "...", "spaces": ["..."], "include_links": true }

approver_groups

  • Map risk/action to approver sets, e.g.:
    • risk_levels: { "low": ["sre-oncall@"], "medium": ["sre@"], "high": ["secops@"], "critical": ["exec@"] }
    • action_overrides: { "scale_min_instances": ["app-owners@"], "increase_memory": ["platform@"] }
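One possible lookup over approver_groups is sketched below. The source does not state which map wins when both match, so this sketch assumes an action_overrides entry takes precedence over risk_levels; resolve_approvers is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def resolve_approvers(approver_groups: Dict[str, Any],
                      risk_level: str,
                      action: Optional[str] = None) -> List[str]:
    """Pick approvers for a request.

    Assumption: an entry in action_overrides wins over the risk_levels mapping.
    """
    overrides = approver_groups.get("action_overrides", {})
    if action is not None and action in overrides:
        return list(overrides[action])
    return list(approver_groups.get("risk_levels", {}).get(risk_level, []))
```

An unknown risk level returns an empty list here; a real implementation would probably fall back to email.default_recipients or reject the request.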

escalation

  • poll_interval_seconds (int, default 60)
  • escalation_window_minutes (int, default 60)
  • max_hops (int, default 2)

Validation and defaults

  • If timeout_seconds or retry_attempts are not set per endpoint, use process defaults from Settings (e.g., 30s and 3).
  • If secret is present, approver will compute X-SmartSRE-Signature (HMAC SHA256) as in src/approval/webhook_approver.py.
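For reference, an HMAC SHA256 signature over the request body can be computed with the standard library as sketched below. This is not the code in src/approval/webhook_approver.py: the function names are hypothetical, and signing the raw body bytes with a hex-encoded digest is an assumption.

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-SmartSRE-Signature"  # header name taken from this document


def compute_signature(secret: str, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (encoding is an assumption)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Receiver-side check; compare_digest avoids leaking timing information."""
    return hmac.compare_digest(compute_signature(secret, body), received)


body = b'{"incident_id": "demo"}'
headers = {SIGNATURE_HEADER: compute_signature("hmac-secret", body)}
```

The receiving endpoint recomputes the digest from the bytes it received and the shared secret, then compares it with the header value.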

Example JSON

{
  "webhook": {
    "endpoints": [
      {
        "url": "https://approval.example.com/hook",
        "headers": {"X-Env": "prod"},
        "auth": {"type": "bearer", "token": "***"},
        "secret": "hmac-secret",
        "timeout_seconds": 30,
        "retry_attempts": 3,
        "backoff": {"initial_seconds": 1, "factor": 2, "max_seconds": 8}
      }
    ]
  },
  "email": {
    "approval_recipients": ["oncall@example.com"],
    "default_recipients": ["sre@example.com"],
    "from_email": "no-reply@example.com"
  },
  "approver_groups": {
    "risk_levels": {"low": ["sre@"], "high": ["secops@"]},
    "action_overrides": {"increase_memory": ["platform@"]}
  },
  "escalation": {"poll_interval_seconds": 60, "escalation_window_minutes": 60, "max_hops": 2}
}

Precedence Helper (shared)

Provide a tiny utility used by both ConfigurationService and DynamicConfigService to apply consistent overlay semantics.

Proposed module: src/services/config_precedence.py

from typing import Any, Dict, Iterable, Optional

def layered_merge(layers: Iterable[Optional[Dict[str, Any]]]) -> Dict[str, Any]:
    """Overlay dicts from left→right, skipping None. Later layers win."""
    result: Dict[str, Any] = {}
    for layer in layers:
        if not layer:
            continue
        for k, v in layer.items():
            # shallow merge; nested objects are replaced by later layers
            result[k] = v
    return result

def resolve_value(*values: Any, default: Any = None) -> Any:
    """Return the first non-None value in order, else default."""
    for v in values:
        if v is not None:
            return v
    return default

Usage examples

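  • For agent config: final = layered_merge([env_defaults, global_public, service_type, tenant_risk, agent_tenant, service_instance]). Pass layers from lowest to highest precedence: later layers win, so the service instance override must come last.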
  • For scalar lookup: timeout = resolve_value(tenant_override, service_type_default, global_band, settings_default, default=30)

Notes

  • Keep merges shallow for predictability; if nested deep-merge is required, constrain keys and add type-aware merges per section.
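A small worked example of the two semantics that are easiest to get wrong: layer order and shallow replacement. The helper is repeated so the snippet runs on its own; the layer contents are made-up illustrative values.

```python
from typing import Any, Dict, Iterable, Optional


def layered_merge(layers: Iterable[Optional[Dict[str, Any]]]) -> Dict[str, Any]:
    """Overlay dicts left to right, skipping None. Later layers win."""
    result: Dict[str, Any] = {}
    for layer in layers:
        if not layer:
            continue
        for k, v in layer.items():
            result[k] = v  # shallow: nested dicts are replaced, not merged
    return result


# Illustrative layers, listed from lowest to highest precedence.
env_defaults = {"timeout": 30, "region": "us-central1", "limits": {"cpu": 1, "mem": 512}}
tenant_risk = {"timeout": 45}
service_instance = {"limits": {"cpu": 2}}

final = layered_merge([env_defaults, None, tenant_risk, service_instance])
print(final)
# {'timeout': 45, 'region': 'us-central1', 'limits': {'cpu': 2}}
```

Note that limits.mem from env_defaults is gone: the service instance layer replaced the whole nested object. A layer that overrides a nested key has to supply the full object, or that key needs a dedicated deep merge.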

Operator Reload Endpoint

Endpoint

  • POST /config/reload

Auth and scope

  • Role: platform_admin only.
  • This endpoint is operator-only and outside tenant control (no ConfigurationHistory entries). It logs an operator/system audit event.

Request body

{
  "reload_env": true,
  "invalidate": { "all": true, "tenant_ids": [] },
  "reinit": { "configuration_service": true, "dynamic_config_service": true },
  "dry_run": false
}

Behavior

  • If reload_env: re-instantiate src/core/config.py:Settings (or refresh the global instance) so new env values take effect.
  • If reinit.configuration_service: reinitialize ConfigurationService with cache_ttl_minutes=settings.cache_default_ttl and clear its cache.
  • If reinit.dynamic_config_service: clear DynamicConfigService caches; if invalidate.tenant_ids provided, target those; otherwise clear all.
  • Always return a summary of actions taken and timestamps.

Response body (example)

{
  "status": "ok",
  "reloaded_env": true,
  "invalidated": {"all": true, "tenant_ids": []},
  "reinitialized": ["configuration_service", "dynamic_config_service"],
  "timestamp": "2025-10-24T09:30:00Z"
}

Side effects & logging

  • Log an audit event: action config_reloaded, actor system/operator, include request payload (minus secrets).
  • If any sub-action fails, return partial success with an errors field and appropriate HTTP status code.
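The partial-success behavior can be sketched framework-free as below. Several things are assumptions rather than part of this spec: the handle_reload name, the injected actions mapping, the "partial" and "error" status values, the planned field, and the reading of dry_run as "report what would run without running it". Invalidation targeting is left out for brevity.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def handle_reload(request: Dict[str, Any],
                  actions: Dict[str, Callable[[], None]]) -> Dict[str, Any]:
    """Run the requested sub-actions and summarize the outcome."""
    dry_run = bool(request.get("dry_run", False))
    requested: List[str] = ["reload_env"] if request.get("reload_env") else []
    requested += [name for name, on in request.get("reinit", {}).items() if on]

    done: List[str] = []
    errors: Dict[str, str] = {}
    if not dry_run:
        for name in requested:
            try:
                actions[name]()
                done.append(name)
            except Exception as exc:  # record the failure, keep running the rest
                errors[name] = str(exc)

    if not errors:
        status = "ok"
    elif done:
        status = "partial"
    else:
        status = "error"

    summary: Dict[str, Any] = {
        "status": status,
        "reloaded_env": "reload_env" in done,
        "reinitialized": [n for n in done if n != "reload_env"],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if dry_run:
        summary["planned"] = requested
    if errors:
        summary["errors"] = errors
    return summary
```

A route handler would map the status to an HTTP code (for example 200 for ok and 207 or 500 otherwise, depending on team convention) and write the config_reloaded audit event with the request payload.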

API Surface (summary)

Existing

  • GET /configurations/ — aggregate tenant config for agent types.
  • POST /configurations/cache/invalidate — cache invalidation helper (ensure tenant scoping).

Recommended tenant APIs

  • GET/PUT /tenants/{tenant_id}/config/risk — reads/writes TenantRiskConfiguration; audit + cache bust.
  • GET/PUT /tenants/{tenant_id}/config/agents/{agent_type} — reads/writes AgentConfiguration; audit + cache bust.
  • GET /tenants/{tenant_id}/services and GET/PUT /tenants/{tenant_id}/services/{service_id} — reads/writes ServiceConfiguration; audit + cache bust.
  • GET/PUT /tenants/{tenant_id}/projects/{project_id}/config — project-level settings; audit + cache bust.

Recommended platform APIs

  • GET/PUT /config/global/{key} — reads/writes GlobalConfiguration (platform_admin; is_public gating when read by tenants).
  • POST /config/reload — operator-only env reload + cache management.
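The is_public gating and the requires_restart flag on global keys can be sketched as follows. GlobalConfigRow is a simplified in-memory stand-in for the GlobalConfiguration model, and read_global / write_global are hypothetical names; role strings follow the ones used in this document.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class GlobalConfigRow:
    """Simplified stand-in for a GlobalConfiguration row."""
    config_key: str
    config_value: Any
    is_public: bool = False
    requires_restart: bool = False


def read_global(rows: Dict[str, GlobalConfigRow], key: str, caller_role: str) -> Optional[Any]:
    """Tenant-scoped callers only see keys marked is_public."""
    row = rows.get(key)
    if row is None:
        return None
    if caller_role != "platform_admin" and not row.is_public:
        return None  # hide non-public keys rather than revealing that they exist
    return row.config_value


def write_global(rows: Dict[str, GlobalConfigRow], cache: Dict[str, Any],
                 key: str, value: Any, caller_role: str) -> Dict[str, Any]:
    """platform_admin-only write that busts global_{key} and reports requires_restart."""
    if caller_role != "platform_admin":
        raise PermissionError("platform_admin role required")
    row = rows.get(key)
    if row is None:
        row = rows[key] = GlobalConfigRow(config_key=key, config_value=value)
    row.config_value = value
    cache.pop(f"global_{key}", None)
    return {"key": key, "requires_restart": row.requires_restart}
```

Returning None for a non-public key makes it indistinguishable from a missing key for tenant callers; returning 403 instead is an equally valid choice.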

Caching and Audit

  • ConfigurationService uses an in-memory cache with TTL. Initialize using settings.cache_default_ttl. Invalidate precise keys on write:
    • Tenant risk: risk_config_{tenant_id}
    • Agent: agent_config_{tenant_id}_{agent_type}
    • Service type: service_type_{service_type}_{name}
    • Global: global_{key}
  • DynamicConfigService keeps a 5m cache; call invalidate_cache(tenant_id) after tenant-scoped writes, or clear all for platform changes.
  • Audit trails:
    • Tenant-scoped writes append ConfigurationHistory and produce audit logs.
    • Platform/global writes log audit events; optionally extend to a global history table if parity is desired.
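A minimal sketch of a TTL cache with per-key invalidation, using the key formats listed above. TTLCache and the key helper functions are hypothetical; the actual ConfigurationService cache may be structured differently.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        self._items.pop(key, None)


def risk_key(tenant_id: str) -> str:
    return f"risk_config_{tenant_id}"


def agent_key(tenant_id: str, agent_type: str) -> str:
    return f"agent_config_{tenant_id}_{agent_type}"
```

Building keys through helpers keeps the write path and the read path from drifting apart, which is the usual cause of an invalidation that silently misses.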

Phased Implementation Plan

Phase 0 — Prework

  • Add Tenant.notification_config JSON column and migration; wire approvers to use it.
  • Replace hardcoded transport knobs (timeouts/retries) in approvers with: tenant override → env default.

Phase 1 — Centralize env defaults

  • Use settings.cache_default_ttl to initialize ConfigurationService.
  • Replace remaining process-level hardcoded defaults with Settings fields, keeping them operator-only.
  • Document reload vs. restart semantics.

Phase 2 — Persisted config + APIs

  • Implement GET/PUT endpoints for Tenant Risk, Agent Defaults, Service Overrides, and Project Config with RBAC and audit.
  • Update ConfigurationService._create_default_* to seed from Settings and ServiceTypeConfiguration, not literals.
  • Refactor DynamicConfigService to read existing columns (drop dependencies on absent JSON fields) and adopt the shared precedence helper.

Phase 3 — UI self-service

  • Bind Settings pages to real endpoints; show "last updated by/at" from ConfigurationHistory.
  • On save, invalidate caches in scope and show propagation toast.

Phase 4 — Operations maturity

  • Add /config/reload with audit logging and targeted invalidation.
  • Use GlobalConfiguration.requires_restart for values that cannot be hot-reloaded.

Detailed Gap Closure Plan (by Phase)

The following is an actionable, file-specific checklist to complete the implementation. Follow in order; each step lists exact files, names, and expected side effects.

Phase 0 — Prework (Actionable Steps)

  1. Add Tenant.notification_config column
  • File: src/database/models/tenant_models.py

  • Change: add a new column on Tenant:

    from sqlalchemy import Column, JSON
    # ... inside class Tenant
    notification_config = Column(JSON, nullable=True)  # nullable: existing tenants have no config yet
  • Migration: create an Alembic revision to add the column.

    • Command (example): alembic revision -m "add notification_config to tenant"
    • Migration upgrade (PostgreSQL/SQLite):
      from alembic import op
      import sqlalchemy as sa

      def upgrade():
          op.add_column('tenants', sa.Column('notification_config', sa.JSON(), nullable=True))

      def downgrade():
          op.drop_column('tenants', 'notification_config')
  2. Webhook approver: per-endpoint transport knobs with env fallbacks
  • File: src/approval/webhook_approver.py
  • Behavior changes:
    • Respect per-endpoint fields from Tenant.notification_config.webhook:
      • timeout_seconds (int), retry_attempts (int), backoff (object with initial_seconds, factor, max_seconds).
    • Fall back to env defaults when any of the above is missing.
  • Add env defaults to Settings (see step 4 below) and use them here:
    • settings.approval_webhook_timeout_seconds
    • settings.approval_webhook_retry_attempts
    • Optional: settings.approval_webhook_backoff_initial_seconds, settings.approval_webhook_backoff_factor, settings.approval_webhook_backoff_max_seconds
  • Implementation sketch inside _send_webhook loop:
    timeout = webhook_config.get('timeout_seconds', settings.approval_webhook_timeout_seconds)
    retries = webhook_config.get('retry_attempts', settings.approval_webhook_retry_attempts)
    backoff_cfg = webhook_config.get('backoff', {})
    initial = backoff_cfg.get('initial_seconds', settings.approval_webhook_backoff_initial_seconds)
    factor = backoff_cfg.get('factor', settings.approval_webhook_backoff_factor)
    max_s = backoff_cfg.get('max_seconds', settings.approval_webhook_backoff_max_seconds)
    # sleep = min(max_s, initial * (factor ** attempt)) between attempts
  3. Email approver: unify env usage and defaults
  • File: src/approval/email_approver.py
  • Replace these lookups to match Settings field names:
    • Use settings.from_email (not smtp_from_email).
    • Add/use settings.smtp_use_tls boolean (see step 4 below).
  • The SMTP fields should be read as:
    • smtp_server, smtp_port, smtp_username, smtp_password, from_email, smtp_use_tls.
  4. Add explicit env defaults in Settings
  • File: src/core/config.py
  • Add fields to class Settings:
    approval_webhook_timeout_seconds: int = env_field(30, "APPROVAL_WEBHOOK_TIMEOUT_SECONDS")
    approval_webhook_retry_attempts: int = env_field(3, "APPROVAL_WEBHOOK_RETRY_ATTEMPTS")
    approval_webhook_backoff_initial_seconds: int = env_field(1, "APPROVAL_WEBHOOK_BACKOFF_INITIAL_SECONDS")
    approval_webhook_backoff_factor: int = env_field(2, "APPROVAL_WEBHOOK_BACKOFF_FACTOR")
    approval_webhook_backoff_max_seconds: int = env_field(8, "APPROVAL_WEBHOOK_BACKOFF_MAX_SECONDS")

    smtp_use_tls: bool = env_field(True, "SMTP_USE_TLS")
  5. Keep schema validation behavior for HMAC signing
  • File: src/approval/webhook_approver.py
  • Already supported via _generate_webhook_signature; no change needed beyond ensuring secret in endpoint config is honored.
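The backoff rule from the webhook approver step, sleep = min(max_s, initial * (factor ** attempt)), can be checked with a short standalone sketch. ENV_DEFAULTS is an assumed stand-in for the approval_webhook_* Settings fields, backoff_schedule is a hypothetical helper, and the sketch assumes one delay per retry attempt.

```python
from typing import Any, Dict, List

# Assumed stand-in for the approval_webhook_* fields on Settings.
ENV_DEFAULTS: Dict[str, Any] = {
    "retry_attempts": 3,
    "backoff": {"initial_seconds": 1, "factor": 2, "max_seconds": 8},
}


def backoff_schedule(endpoint: Dict[str, Any],
                     defaults: Dict[str, Any] = ENV_DEFAULTS) -> List[float]:
    """Delays between attempts: tenant override first, env default otherwise."""
    retries = endpoint.get("retry_attempts", defaults["retry_attempts"])
    backoff = endpoint.get("backoff") or {}
    env_backoff = defaults["backoff"]
    initial = backoff.get("initial_seconds", env_backoff["initial_seconds"])
    factor = backoff.get("factor", env_backoff["factor"])
    max_s = backoff.get("max_seconds", env_backoff["max_seconds"])
    return [min(max_s, initial * (factor ** attempt)) for attempt in range(retries)]


print(backoff_schedule({}))                     # [1, 2, 4]
print(backoff_schedule({"retry_attempts": 5}))  # [1, 2, 4, 8, 8]
```

With the defaults the delay doubles until it reaches max_seconds and then stays flat, so raising retry_attempts adds at most max_seconds per extra attempt.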

Acceptance for Phase 0:

  • Approvers read tenant notification_config and respect per-endpoint timeout_seconds, retry_attempts, backoff with env fallbacks.
  • Email approver uses settings.from_email and settings.smtp_use_tls.
  • Tenant.notification_config is persisted and accessible (migration applied).

Phase 1 — Centralize Env Defaults (Actionable Steps)

  1. Initialize ConfigurationService with Settings TTL
  • File: src/services/configuration_service.py
  • Update the global factory to use env TTL:
    from ..core.config import settings

    def get_configuration_service() -> ConfigurationService:
        global _configuration_service
        if _configuration_service is None:
            _configuration_service = ConfigurationService(cache_ttl_minutes=settings.cache_default_ttl)
        return _configuration_service
  2. Align default risk config structure
  • File: src/core/config.py
  • Update DEFAULT_RISK_CONFIG to include fields returned by ConfigurationService.get_tenant_risk_config:
    • Under global_settings: add max_cost_impact_approval, require_approval_for_high_risk, auto_approve_rollbacks, max_concurrent_automations.
    • Add service_overrides map with keys cloud_run, gke, bigquery, cloud_sql defaulting to {}.
  • Shape example:
    DEFAULT_RISK_CONFIG = {
        "global_settings": {
            "max_cost_impact_auto": 15.0,
            "max_cost_impact_approval": 50.0,
            "max_resource_scaling": 3.0,
            "approval_timeout_minutes": 30,
            "require_approval_for_high_risk": True,
            "auto_approve_rollbacks": True,
            "automation_enabled": True,
            "business_hours_only": False,
            "max_concurrent_automations": 5,
        },
        "service_overrides": {"cloud_run": {}, "gke": {}, "bigquery": {}, "cloud_sql": {}},
    }
  3. Document reload vs. restart semantics
  • Env-only settings (operator-only) require a process restart or the /config/reload endpoint added in Phase 4.
  • Persisted configs take effect immediately after cache invalidation.

Acceptance for Phase 1:

  • ConfigurationService TTL equals settings.cache_default_ttl.
  • Default risk config aligns with DB-backed structure and service expectations.

Phase 2 — Persisted Config + APIs (Actionable Steps)

  1. Add shared precedence helper
  • File: src/services/config_precedence.py
  • Add the utility exactly as specified in this doc (see Precedence Helper section) and import it in services.
  • Usage pattern: final = layered_merge([...]) and value = resolve_value(...).
  2. Refactor DynamicConfigService to match models
  • File: src/services/dynamic_config_service.py
  • Replace references to non-existent fields:
    • Remove is_default, type_specific_config, agent_specific_config, and value_type usage.
    • Service type: select by ServiceTypeConfiguration.service_type and a configuration_name (e.g., "default") with is_active=True; read from default_settings, scaling_limits, monitoring_defaults.
    • Global: read JSON from GlobalConfiguration.config_value directly (no type coercion layer).
  • Adopt layered_merge and resolve_value where overlaying dictionaries or scalar fallbacks are required.
  3. Correct MigrationManager seeds to current schema
  • File: src/database/models/migration_manager.py
  • Replace usage of value_type, profile_name, is_default, and type_specific_config with current fields:
    • GlobalConfiguration(config_key, config_value=<JSON>, data_type="json" or omit if not used)
    • ServiceTypeConfiguration(service_type, configuration_name="default", default_settings=<JSON>, scaling_limits, monitoring_defaults, is_active=True)
  • Optionally, reduce seeding here and rely on Alembic data migrations for authoritative defaults.
  4. Expose tenant-scoped configuration endpoints with RBAC and audit
  • File: src/api/routes/configurations.py (or new route modules under src/api/routes/)
  • Add endpoints (all write operations must append ConfigurationHistory and invalidate caches):
    • GET/PUT /tenants/{tenant_id}/config/risk
      • Reads/writes TenantRiskConfiguration.
      • Require the require_admin dependency and enforce that tenant_id matches the caller's security context.
      • Invalidate risk_config_{tenant_id}; append ConfigurationHistory with config_type="tenant_risk_config".
    • GET/PUT /tenants/{tenant_id}/config/agents/{agent_type}
      • Reads/writes AgentConfiguration for that tenant + agent type.
      • Invalidate agent_config_{tenant_id}_{agent_type}; append history.
    • GET /tenants/{tenant_id}/services and GET/PUT /tenants/{tenant_id}/services/{service_id}
      • Reads/writes ServiceConfiguration for instance overrides.
      • Invalidate instance cache key (service-specific cache) and append history with config_type="service_config".
    • GET/PUT /tenants/{tenant_id}/projects/{project_id}/config
      • Reads/writes GcpProject-level config attributes and custom JSON if present.
      • Invalidate per-project cache and append history with resource_id=project_id.
  5. Expose platform/global configuration endpoints
  • File: src/api/routes/configurations.py (or new src/api/routes/platform_config.py)
  • Endpoints:
    • GET/PUT /config/global/{key}
      • PUT requires PLATFORM_ADMIN role; GET returns only is_public=True keys for tenant-scoped callers.
      • Writes should invalidate global_{key} and honor requires_restart by returning a flag in response.
  6. Seed defaults from Settings/ServiceTypeConfiguration
  • File: src/services/configuration_service.py
  • _create_default_tenant_risk_config and _create_default_agent_config must seed values using Settings and ServiceTypeConfiguration instead of literals.
  • Example: approval_timeout_minutes=settings.DEFAULT_APPROVAL_TIMEOUT, or reuse the DEFAULT_RISK_CONFIG values aligned in Phase 1.

Acceptance for Phase 2:

  • DynamicConfigService reads only existing columns; config overlays use the shared helper.
  • Endpoints exist with RBAC, audit, and precise cache invalidation.
  • Default seeds originate from Settings and ServiceTypeConfiguration.
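The write path shared by the tenant-scoped PUT endpoints (scope check, persist, append history, bust cache) can be sketched with in-memory stand-ins. put_tenant_risk_config, the caller dict, and the history record fields other than config_type are hypothetical; the real handlers would use the ORM models and the require_admin dependency.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def put_tenant_risk_config(caller: Dict[str, Any], tenant_id: str, changes: Dict[str, Any],
                           store: Dict[str, Dict[str, Any]], cache: Dict[str, Any],
                           history: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply a tenant risk config update with scope check, audit, and cache bust."""
    # 1. RBAC: must be a tenant admin acting on their own tenant.
    if caller.get("role") != "tenant_admin" or caller.get("tenant_id") != tenant_id:
        raise PermissionError("caller may not modify this tenant's configuration")

    # 2. Persist (a dict stands in for the TenantRiskConfiguration row).
    old = dict(store.get(tenant_id, {}))
    new = {**old, **changes}
    store[tenant_id] = new

    # 3. Append a history record (stand-in for ConfigurationHistory).
    history.append({
        "config_type": "tenant_risk_config",
        "tenant_id": tenant_id,
        "changed_by": caller.get("user"),
        "old_value": old,
        "new_value": new,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })

    # 4. Invalidate the precise cache key so the next read sees the new value.
    cache.pop(f"risk_config_{tenant_id}", None)
    return new
```

Invalidating after the write and history append means a failed write leaves the cache untouched; in a real handler steps 2 and 3 should share one database transaction.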

Phase 3 — UI Self-Service (Actionable Steps)

  • Bind the admin UI Settings pages to the new endpoints from Phase 2.
  • Show last updated by and last updated at using ConfigurationHistory entries.
  • On save, call the appropriate endpoint, then trigger cache invalidation for the impacted scope and surface a propagation toast.

Acceptance for Phase 3:

  • UI updates persist via APIs, reflect in reads, and show audit metadata.

Phase 4 — Operations Maturity (Actionable Steps)

  1. Add operator reload endpoint
  • File: src/api/routes/configurations.py (or new src/api/routes/operator.py), mounted at /config.
  • Endpoint: POST /config/reload
  • RBAC: PLATFORM_ADMIN only; operator/system scope (no ConfigurationHistory), but do log an audit event config_reloaded.
  • Request body:
    {
      "reload_env": true,
      "invalidate": { "all": true, "tenant_ids": [] },
      "reinit": { "configuration_service": true, "dynamic_config_service": true },
      "dry_run": false
    }
  • Behavior:
    • If reload_env: re-instantiate src/core/config.py:Settings or refresh the global instance.
    • If reinit.configuration_service: reinitialize with cache_ttl_minutes=settings.cache_default_ttl and clear its cache.
    • If reinit.dynamic_config_service: clear caches; if tenant_ids provided, only clear those entries.
    • Return a summary with timestamps; include partial failures under errors if any sub-action fails.
  2. Enforce requires_restart for globals
  • File: src/api/routes/configurations.py (global PUT handler)
  • If updating a GlobalConfiguration where requires_restart is true, include {"requires_restart": true} in the response and document that an operator restart or /config/reload is needed for changes to take effect.

Acceptance for Phase 4:

  • /config/reload exists with proper RBAC, performs targeted invalidations, and reports actions taken.
  • Global updates communicate restart requirements using requires_restart.

Definition of Done and Verification

  • Approvers honor per-tenant webhook/email configuration with env fallbacks; HMAC signature set when secret present.
  • ConfigurationService uses settings.cache_default_ttl; cache keys invalidated precisely on writes.
  • DynamicConfigService references only existing model fields and uses the shared precedence helper.
  • Tenant and platform endpoints implemented with RBAC and audit; cache invalidations occur per-scope.
  • DEFAULT_RISK_CONFIG and seed paths align with database shapes and service return contracts.
  • Operator reload endpoint implemented; global requires_restart enforced in responses.

Quick checks

  • Run backend tests: .venv/bin/python -m pytest tests/
  • Smoke test endpoints:
    • GET /configurations/ (tenant aggregate)
    • PUT /tenants/{tenant_id}/config/risk (ensure cache bust + history)
    • PUT /config/global/{key} (platform admin only; verify requires_restart handling)
    • POST /config/reload (platform admin only)

Footguns and Notes

  • Avoid deep merges by default; if you need deep-merge behavior for specific nested keys, add dedicated, type-aware merges in the relevant service method and document them.
  • Keep a single env loader in src/core/config.py; do not introduce parallel loaders.
  • Ensure all write endpoints both invalidate caches and append ConfigurationHistory (for tenant scopes) or log audit events (for platform/global).
  • Update src/database/models/migration_manager.py to the current schema or move seeds into Alembic migrations to prevent drift.