Configuration Model and Multitenant/RBAC

This document consolidates environment-level settings, persistent configuration scopes, precedence rules, API contracts, and cache/audit behavior. It replaces configurability.md and CONFIGURATION_EXTERNALIZATION_RECOMMENDATIONS.md with a single source of truth.

Goals

Single, unambiguous source of truth for each setting.
Clear separation between operator-only (env) vs. tenant-editable (DB-backed) config.
Consistent precedence resolution across services with caching and auditing.

Source-of-Truth Matrix

Env-Only (operator/process scope)

Examples: DATABASE_URL, DB_*, REDIS_URL, SMTP_*, SECRET_KEY, WEBHOOK_SECRET, BASE_URL, CORS_ORIGINS, FRONTEND_ORIGIN, LOG_LEVEL, CACHE_DEFAULT_TTL, PLATFORM_ADMIN_EMAILS, default HTTP client timeouts.
Ownership: platform operator.
Persistence: process env via src/core/config.py:Settings.
Change semantics: restart or operator-only reload endpoint; no tenant audit trails (logged as operator/system).

Global (platform-wide, persisted)

Examples: default incident complexity/routing rules, allowed APIs/roles templates for onboarding, global alerting bands, other platform defaults that are safe to expose.
Ownership: platform_admin (optionally public via is_public=true).
Persistence: GlobalConfiguration.
Change semantics: cache key global_{key} invalidation and audit/log.

Tenant Risk & Policy (persisted per tenant)

Examples: max_cost_impact_auto, max_cost_impact_approval, max_resource_scaling, approval_timeout_minutes, automation_enabled, business_hours_only, max_concurrent_automations; per service-type override JSON; notification_config (see schema below).
Ownership: tenant_admin.
Persistence: TenantRiskConfiguration and Tenant.notification_config.
Change semantics: invalidate risk_config_{tenant_id} and append ConfigurationHistory.

Agent Defaults (persisted per tenant + agent type)

Examples: default_region/zone, metrics_duration_minutes, validation_timeout_seconds, default mem/cpu/scaling, monitoring thresholds, templates/tools.
Ownership: tenant_admin.
Persistence: AgentConfiguration.
Change semantics: invalidate agent_config_{tenant_id}_{agent_type} and append ConfigurationHistory.

Service Type Defaults (persisted per service type)

Examples: default_settings, scaling_limits, monitoring_defaults for Cloud Run, GKE, BigQuery, Cloud SQL.
Ownership: platform_admin (optionally expose as presets).
Persistence: ServiceTypeConfiguration.
Change semantics: invalidate service_type_{service_type}_{configuration_name}; admin audit.

Project-Level (persisted per tenant + project)

Examples: region, environment, SA reference, budget/labels, project_config flags.
Ownership: tenant_admin.
Persistence: GcpProject.
Change semantics: per-project cache invalidation and ConfigurationHistory (resource_id = project_id).

Service Instance Overrides (persisted per tenant + service instance)

Examples: custom_thresholds, min/max_instances, monitoring_enabled, automation_enabled, maintenance_windows, notification_channels.
Ownership: tenant_admin (operators under guardrails).
Persistence: ServiceConfiguration.
Change semantics: invalidate instance cache; ConfigurationHistory with config_type=service_config.

Resolution order (highest → lowest):

Service instance override → AgentConfiguration (tenant+agent) → TenantRiskConfiguration → ServiceTypeConfiguration → GlobalConfiguration (public) → Env defaults (Settings).
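
The resolution order can be illustrated with the standard library's collections.ChainMap, which searches its maps left to right. This is only a sketch of the lookup semantics, not the proposed shared helper; the scope contents and the resolve_scopes name are hypothetical.

```python
from collections import ChainMap
from typing import Any, Mapping, Optional


def resolve_scopes(*scopes: Optional[Mapping[str, Any]]) -> ChainMap:
    """Build a lookup view over scopes ordered highest -> lowest precedence.

    ChainMap returns the value from the first map that defines a key,
    so the leftmost scope wins. None or empty scopes are skipped.
    """
    return ChainMap(*[dict(s) for s in scopes if s])


# Hypothetical scope contents, for illustration only.
service_instance = {"min_instances": 2}
agent_config = {"metrics_duration_minutes": 15, "min_instances": 1}
tenant_risk = {"approval_timeout_minutes": 45}
service_type = {"min_instances": 0, "max_instances": 10}
global_public = None  # no public global values in this example
env_defaults = {"approval_timeout_minutes": 30, "metrics_duration_minutes": 5}

effective = resolve_scopes(
    service_instance, agent_config, tenant_risk,
    service_type, global_public, env_defaults,
)
print(effective["min_instances"])             # 2 (service instance override)
print(effective["approval_timeout_minutes"])  # 45 (tenant risk beats env)
print(effective["max_instances"])             # 10 (service type default)
```

Note that a key explicitly set to None in a higher scope still wins in this sketch; filter such keys first if None should mean "not set".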

Alignment: Env vs. Persisted Config

Keep a single env loader: src/core/config.py:Settings. Do not introduce a parallel loader. Use env exclusively for operator-only controls and as defaults when seeding DB rows.
Seed paths:
- Tenant creation: seed TenantRiskConfiguration and AgentConfiguration using Settings values where sensible.
- ConfigurationService should read settings.cache_default_ttl when constructing the cache TTL.
Transport/client knobs: defaults from env; allow per-tenant override inside Tenant.notification_config (e.g., webhook timeout/retries). Keep process-wide hard caps in env as safety rails.

Tenant.notification_config Schema

Add notification_config JSON to Tenant for approvals and notifications. Suggested schema:

Top-level fields

webhook: object
email: object
chat: object (optional; e.g., Google Chat)
approver_groups: object (risk level/group mapping)
escalation: object

webhook

Single endpoint form:
- url (string, required)
- headers (object, optional; default {})
- auth (object, optional; one of)
  - bearer: { "type": "bearer", "token": "..." }
  - basic: { "type": "basic", "username": "...", "password": "..." }
  - api_key: { "type": "api_key", "api_key": "...", "header_name": "X-API-Key" }
- secret (string, optional; HMAC signing)
- timeout_seconds (int, optional; default 30)
- retry_attempts (int, optional; default 3)
- backoff (object, optional): { "initial_seconds": 1, "factor": 2, "max_seconds": 8 }
Multiple endpoints form:
- endpoints (array of objects with the same fields as above)
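
Because the webhook object has two forms, consumers may find it easier to normalize to a list first. A minimal sketch follows; normalize_webhook_endpoints is a hypothetical name, and letting endpoints win when both forms are present is an assumption.

```python
from typing import Any, Dict, List, Optional


def normalize_webhook_endpoints(webhook: Optional[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a list of endpoint dicts for either webhook form."""
    if not webhook:
        return []
    if "endpoints" in webhook:
        # Multiple endpoints form (assumed to win if "url" is also present).
        endpoints = webhook["endpoints"] or []
    elif "url" in webhook:
        # Single endpoint form: the webhook object is the endpoint.
        endpoints = [webhook]
    else:
        endpoints = []

    normalized: List[Dict[str, Any]] = []
    for endpoint in endpoints:
        if not endpoint.get("url"):
            raise ValueError("webhook endpoint is missing required field 'url'")
        item = dict(endpoint)          # copy so the stored config is not mutated
        item.setdefault("headers", {}) # documented default for headers
        normalized.append(item)
    return normalized
```

Transport defaults (timeout_seconds, retry_attempts, backoff) are deliberately not filled in here, since those fall back to Settings at send time.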

email

approval_recipients (array[string], optional)
default_recipients (array[string], optional)
from_email (string, optional; override)
template_overrides (object, optional) e.g. { "approval_request": { "subject": "...", "branding": { ... } } }

chat (optional)

google_chat: { "webhook_url": "...", "spaces": ["..."], "include_links": true }

approver_groups

Map risk/action to approver sets, e.g.:
- risk_levels: { "low": ["sre-oncall@"], "medium": ["sre@"], "high": ["secops@"], "critical": ["exec@"] }
- action_overrides: { "scale_min_instances": ["app-owners@"], "increase_memory": ["platform@"] }
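
One possible lookup over this mapping is sketched below. The schema does not say how risk_levels and action_overrides combine; this sketch assumes an action override replaces the risk-level set. resolve_approvers is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def resolve_approvers(
    approver_groups: Dict[str, Any],
    risk_level: str,
    action: Optional[str] = None,
) -> List[str]:
    """Pick approvers for an action, falling back to the risk-level mapping."""
    overrides = approver_groups.get("action_overrides", {})
    if action is not None and action in overrides:
        # Assumption: an action override replaces the risk-level approvers.
        return list(overrides[action])
    return list(approver_groups.get("risk_levels", {}).get(risk_level, []))


groups = {
    "risk_levels": {"low": ["sre-oncall@"], "high": ["secops@"]},
    "action_overrides": {"increase_memory": ["platform@"]},
}
print(resolve_approvers(groups, "high"))                     # ['secops@']
print(resolve_approvers(groups, "high", "increase_memory"))  # ['platform@']
```

An unmapped risk level yields an empty list here; callers would need to decide whether that means "no approval possible" or "use default_recipients".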

escalation

poll_interval_seconds (int, default 60)
escalation_window_minutes (int, default 60)
max_hops (int, default 2)

Validation and defaults

If timeout_seconds or retry_attempts are not set per endpoint, use process defaults from Settings (e.g., 30s and 3).
If secret is present, the approver computes the X-SmartSRE-Signature header (HMAC SHA256), as in src/approval/webhook_approver.py.
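
For receivers that need to verify the signature, HMAC SHA256 signing can be sketched with the standard library. The exact wire format used by src/approval/webhook_approver.py is not shown in this document; signing the raw request body and sending a hex digest are assumptions, as are the function names.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Return the hex HMAC SHA256 digest of the request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign_payload(secret, body), received)


# Serialize once and sign exactly the bytes that are sent.
body = json.dumps({"approval_id": "abc", "action": "increase_memory"}, sort_keys=True).encode("utf-8")
signature = sign_payload("hmac-secret", body)
headers = {"X-SmartSRE-Signature": signature}
```

Signing the serialized bytes rather than the parsed object avoids mismatches caused by key ordering or whitespace differences between sender and receiver.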

Example JSON

{
  "webhook": {
    "endpoints": [
      {
        "url": "https://approval.example.com/hook",
        "headers": {"X-Env": "prod"},
        "auth": {"type": "bearer", "token": "***"},
        "secret": "hmac-secret",
        "timeout_seconds": 30,
        "retry_attempts": 3,
        "backoff": {"initial_seconds": 1, "factor": 2, "max_seconds": 8}
      }
    ]
  },
  "email": {
    "approval_recipients": ["oncall@example.com"],
    "default_recipients": ["sre@example.com"],
    "from_email": "no-reply@example.com"
  },
  "approver_groups": {
    "risk_levels": {"low": ["sre@"], "high": ["secops@"]},
    "action_overrides": {"increase_memory": ["platform@"]}
  },
  "escalation": {"poll_interval_seconds": 60, "escalation_window_minutes": 60, "max_hops": 2}
}

Precedence Helper (shared)

Provide a tiny utility used by both ConfigurationService and DynamicConfigService to apply consistent overlay semantics.

Proposed module: src/services/config_precedence.py

from typing import Any, Dict, Iterable, Optional

def layered_merge(layers: Iterable[Optional[Dict[str, Any]]]) -> Dict[str, Any]:
    """Overlay dicts from left→right, skipping None. Later layers win."""
    result: Dict[str, Any] = {}
    for layer in layers:
        if not layer:
            continue
        for k, v in layer.items():
            # shallow merge; nested objects are replaced by later layers
            result[k] = v
    return result

def resolve_value(*values: Any, default: Any = None) -> Any:
    """Return the first non-None value in order, else default."""
    for v in values:
        if v is not None:
            return v
    return default

Usage examples

For agent config (lowest precedence first, because later layers win): final = layered_merge([env_defaults, global_public, service_type, tenant_risk, agent_tenant, service_instance])
For scalar lookup: timeout = resolve_value(tenant_override, service_type_default, global_band, settings_default, default=30)

Notes

Keep merges shallow for predictability; if nested deep-merge is required, constrain keys and add type-aware merges per section.
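
If a section does need nested merging, one constrained approach is to allow a one-level merge only for an explicit allowlist of keys. The sketch below is an illustration under that assumption; merge_with_nested and the key names in the example are hypothetical.

```python
from typing import Any, Dict, FrozenSet, Iterable, Optional


def merge_with_nested(
    layers: Iterable[Optional[Dict[str, Any]]],
    nested_keys: FrozenSet[str],
) -> Dict[str, Any]:
    """Overlay layers left to right (later wins).

    Keys listed in nested_keys are merged one level deep when both sides
    are dicts; every other key is replaced wholesale, as in a shallow merge.
    """
    result: Dict[str, Any] = {}
    for layer in layers:
        if not layer:
            continue
        for key, value in layer.items():
            current = result.get(key)
            if key in nested_keys and isinstance(value, dict) and isinstance(current, dict):
                result[key] = {**current, **value}  # new dict; inputs untouched
            else:
                result[key] = value
    return result


base = {"custom_thresholds": {"cpu": 80, "mem": 70}, "labels": {"team": "sre"}}
override = {"custom_thresholds": {"cpu": 90}, "labels": {"env": "prod"}}
merged = merge_with_nested([base, override], frozenset({"custom_thresholds"}))
print(merged)
# {'custom_thresholds': {'cpu': 90, 'mem': 70}, 'labels': {'env': 'prod'}}
```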

Operator Reload Endpoint

Endpoint

POST /config/reload

Auth and scope

Role: platform_admin only.
This endpoint is operator-only and outside tenant control (no ConfigurationHistory entries). It logs an operator/system audit event.

Request body

{
  "reload_env": true,
  "invalidate": { "all": true, "tenant_ids": [] },
  "reinit": { "configuration_service": true, "dynamic_config_service": true },
  "dry_run": false
}

Behavior

If reload_env: re-instantiate src/core/config.py:Settings (or refresh the global instance) so new env values take effect.
If reinit.configuration_service: reinitialize ConfigurationService with cache_ttl_minutes=settings.cache_default_ttl and clear its cache.
If reinit.dynamic_config_service: clear DynamicConfigService caches; if invalidate.tenant_ids provided, target those; otherwise clear all.
Always return a summary of actions taken and timestamps.

Response body (example)

{
  "status": "ok",
  "reloaded_env": true,
  "invalidated": {"all": true, "tenant_ids": []},
  "reinitialized": ["configuration_service", "dynamic_config_service"],
  "timestamp": "2025-10-24T09:30:00Z"
}

Side effects & logging

Log an audit event: action config_reloaded, actor system/operator, include request payload (minus secrets).
If any sub-action fails, return partial success with an errors field and appropriate HTTP status code.
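
The partial-success behavior can be sketched independently of the web framework. In this sketch the actions mapping stands in for the real services, dry_run is assumed to mean "report what would run without running it" (the document does not define it), and invalidation targeting is left out.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

SUB_ACTIONS = ("configuration_service", "dynamic_config_service")


def run_reload(request: Dict[str, Any], actions: Dict[str, Callable[[], None]]) -> Dict[str, Any]:
    """Run the requested sub-actions and summarize the outcome."""
    reinit = request.get("reinit") or {}
    planned: List[str] = ["reload_env"] if request.get("reload_env") else []
    planned += [name for name in SUB_ACTIONS if reinit.get(name)]
    now = datetime.now(timezone.utc).isoformat()

    if request.get("dry_run"):
        # Assumed semantics: nothing is executed, only the plan is returned.
        return {"status": "ok", "dry_run": True, "planned": planned, "timestamp": now}

    done: List[str] = []
    errors: List[Dict[str, str]] = []
    for name in planned:
        try:
            actions[name]()
            done.append(name)
        except Exception as exc:  # record the failure and continue
            errors.append({"action": name, "error": str(exc)})

    response: Dict[str, Any] = {
        "status": "ok" if not errors else "partial",
        "reloaded_env": "reload_env" in done,
        "reinitialized": [name for name in done if name != "reload_env"],
        "timestamp": now,
    }
    if errors:
        response["errors"] = errors
    return response
```

The route handler would map status "partial" to whichever HTTP status code the API settles on and emit the config_reloaded audit event with the payload.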

API Surface (summary)

Existing

GET /configurations/ — aggregate tenant config for agent types.
POST /configurations/cache/invalidate — cache invalidation helper (ensure tenant scoping).

Recommended tenant APIs

GET/PUT /tenants/{tenant_id}/config/risk — reads/writes TenantRiskConfiguration; audit + cache bust.
GET/PUT /tenants/{tenant_id}/config/agents/{agent_type} — reads/writes AgentConfiguration; audit + cache bust.
GET /tenants/{tenant_id}/services and GET/PUT /tenants/{tenant_id}/services/{service_id} — reads/writes ServiceConfiguration; audit + cache bust.
GET/PUT /tenants/{tenant_id}/projects/{project_id}/config — project-level settings; audit + cache bust.

Recommended platform APIs

GET/PUT /config/global/{key} — reads/writes GlobalConfiguration (platform_admin; is_public gating when read by tenants).
POST /config/reload — operator-only env reload + cache management.

Caching and Audit

ConfigurationService uses an in-memory cache with TTL. Initialize using settings.cache_default_ttl. Invalidate precise keys on write:
- Tenant risk: risk_config_{tenant_id}
- Agent: agent_config_{tenant_id}_{agent_type}
- Service type: service_type_{service_type}_{name}
- Global: global_{key}
DynamicConfigService keeps a 5-minute cache; call invalidate_cache(tenant_id) after tenant-scoped writes, or clear all entries for platform changes.
Audit trails:
- Tenant-scoped writes append ConfigurationHistory and produce audit logs.
- Platform/global writes log audit events; optionally extend to a global history table if parity is desired.
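
The cache behavior described above can be sketched as a small TTL cache with exact-key invalidation. This is not the ConfigurationService implementation; the class, the injectable clock, and the key helper functions are hypothetical, and only the key formats come from this document.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


def risk_key(tenant_id: str) -> str:
    return f"risk_config_{tenant_id}"


def agent_key(tenant_id: str, agent_type: str) -> str:
    return f"agent_config_{tenant_id}_{agent_type}"
```

Invalidating by exact key keeps a write to one tenant's risk config from evicting that tenant's agent config or any other tenant's entries.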

Phased Implementation Plan

Phase 0 — Prework

Add Tenant.notification_config JSON column and migration; wire approvers to use it.
Replace hardcoded transport knobs (timeouts/retries) in approvers with: tenant override → env default.

Phase 1 — Centralize env defaults

Use settings.cache_default_ttl to initialize ConfigurationService.
Replace remaining process-level hardcoded defaults with Settings fields, keeping them operator-only.
Document reload vs. restart semantics.

Phase 2 — Persisted config + APIs

Implement GET/PUT endpoints for Tenant Risk, Agent Defaults, Service Overrides, and Project Config with RBAC and audit.
Update ConfigurationService._create_default_* to seed from Settings and ServiceTypeConfiguration, not literals.
Refactor DynamicConfigService to read existing columns (drop dependencies on absent JSON fields) and adopt the shared precedence helper.

Phase 3 — UI self-service

Bind Settings pages to real endpoints; show "last updated by/at" from ConfigurationHistory.
On save, invalidate caches in scope and show propagation toast.

Phase 4 — Operations maturity

Add /config/reload with audit logging and targeted invalidation.
Use GlobalConfiguration.requires_restart for values that cannot be hot-reloaded.

Detailed Gap Closure Plan (by Phase)

The following is an actionable, file-specific checklist to complete the implementation. Follow in order; each step lists exact files, names, and expected side effects.

Phase 0 — Prework (Actionable Steps)

Add Tenant.notification_config column

File: src/database/models/tenant_models.py

Change: add a new column on Tenant:

from sqlalchemy import JSON, Column
# ... inside class Tenant
notification_config = Column(JSON, nullable=True)

Migration: create an Alembic revision to add the column.

Command (example): alembic revision -m "add notification_config to tenant"

Migration upgrade (PostgreSQL/SQLite):

from alembic import op
import sqlalchemy as sa

def upgrade():
    op.add_column('tenants', sa.Column('notification_config', sa.JSON(), nullable=True))

def downgrade():
    op.drop_column('tenants', 'notification_config')

Webhook approver: per-endpoint transport knobs with env fallbacks

File: src/approval/webhook_approver.py
Behavior changes:
- Respect per-endpoint fields from Tenant.notification_config.webhook:
  - timeout_seconds (int), retry_attempts (int), backoff (object with initial_seconds, factor, max_seconds).
- Fall back to env defaults when any of the above is missing.
Add env defaults to Settings (see "Add explicit env defaults in Settings" below) and use them here:
- settings.approval_webhook_timeout_seconds
- settings.approval_webhook_retry_attempts
- Optional: settings.approval_webhook_backoff_initial_seconds, settings.approval_webhook_backoff_factor, settings.approval_webhook_backoff_max_seconds

Implementation sketch inside _send_webhook loop:

timeout = webhook_config.get('timeout_seconds', settings.approval_webhook_timeout_seconds)
retries = webhook_config.get('retry_attempts', settings.approval_webhook_retry_attempts)
backoff_cfg = webhook_config.get('backoff', {})
initial = backoff_cfg.get('initial_seconds', settings.approval_webhook_backoff_initial_seconds)
factor = backoff_cfg.get('factor', settings.approval_webhook_backoff_factor)
max_s = backoff_cfg.get('max_seconds', settings.approval_webhook_backoff_max_seconds)
# sleep = min(max_s, initial * (factor ** attempt)) between attempts
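
The backoff formula in the comment above can be checked in isolation. The helper below is a hypothetical sketch that assumes one delay per retry, with the attempt index starting at 0.

```python
from typing import List


def backoff_delays(
    retry_attempts: int,
    initial_seconds: float = 1,
    factor: float = 2,
    max_seconds: float = 8,
) -> List[float]:
    """Delay before each retry: min(max_seconds, initial * factor ** attempt)."""
    return [
        min(max_seconds, initial_seconds * (factor ** attempt))
        for attempt in range(retry_attempts)
    ]


print(backoff_delays(3))  # [1, 2, 4]
print(backoff_delays(5))  # [1, 2, 4, 8, 8]
```

With the documented defaults the cap of 8 seconds is reached on the fourth retry, so raising retry_attempts beyond 4 adds 8 seconds per extra retry.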

Email approver: unify env usage and defaults

File: src/approval/email_approver.py
Update these lookups to match the Settings field names:
- Use settings.from_email (not smtp_from_email).
- Add/use the settings.smtp_use_tls boolean (see "Add explicit env defaults in Settings" below).
The SMTP fields should be read as:
- smtp_server, smtp_port, smtp_username, smtp_password, from_email, smtp_use_tls.

Add explicit env defaults in Settings

File: src/core/config.py

Add fields to class Settings:

approval_webhook_timeout_seconds: int = env_field(30, "APPROVAL_WEBHOOK_TIMEOUT_SECONDS")
approval_webhook_retry_attempts: int = env_field(3, "APPROVAL_WEBHOOK_RETRY_ATTEMPTS")
approval_webhook_backoff_initial_seconds: int = env_field(1, "APPROVAL_WEBHOOK_BACKOFF_INITIAL_SECONDS")
approval_webhook_backoff_factor: int = env_field(2, "APPROVAL_WEBHOOK_BACKOFF_FACTOR")
approval_webhook_backoff_max_seconds: int = env_field(8, "APPROVAL_WEBHOOK_BACKOFF_MAX_SECONDS")

smtp_use_tls: bool = env_field(True, "SMTP_USE_TLS")
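
The project's env_field helper is not shown in this document. As a stand-in that illustrates typed env lookups with defaults, the sketch below uses a differently named, hypothetical read_env function; its signature and the accepted boolean spellings are assumptions.

```python
import os
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def parse_bool(raw: str) -> bool:
    """Parse common boolean spellings; reject anything else."""
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def read_env(
    name: str,
    default: T,
    cast: Callable[[str], T],
    env: Mapping[str, str] = os.environ,
) -> T:
    """Return the cast env value, or default when unset or blank."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return cast(raw)


timeout = read_env("APPROVAL_WEBHOOK_TIMEOUT_SECONDS", 30, int)
use_tls = read_env("SMTP_USE_TLS", True, parse_bool)
```

Parsing booleans explicitly matters because bool("false") is True in Python, which would silently enable TLS-off configurations to read as TLS-on.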

Keep schema validation behavior for HMAC signing

File: src/approval/webhook_approver.py
Already supported via _generate_webhook_signature; no change needed beyond ensuring secret in endpoint config is honored.

Acceptance for Phase 0:

Approvers read tenant notification_config and respect per-endpoint timeout_seconds, retry_attempts, backoff with env fallbacks.
Email approver uses settings.from_email and settings.smtp_use_tls.
Tenant.notification_config is persisted and accessible (migration applied).

Phase 1 — Centralize Env Defaults (Actionable Steps)

Initialize ConfigurationService with Settings TTL

File: src/services/configuration_service.py

Update the global factory to use env TTL:

from ..core.config import settings
def get_configuration_service() -> ConfigurationService:
    global _configuration_service
    if _configuration_service is None:
        _configuration_service = ConfigurationService(cache_ttl_minutes=settings.cache_default_ttl)
    return _configuration_service

Align default risk config structure

File: src/core/config.py
Update DEFAULT_RISK_CONFIG to include fields returned by ConfigurationService.get_tenant_risk_config:
- Under global_settings: add max_cost_impact_approval, require_approval_for_high_risk, auto_approve_rollbacks, max_concurrent_automations.
- Add service_overrides map with keys cloud_run, gke, bigquery, cloud_sql defaulting to {}.

Shape example:

DEFAULT_RISK_CONFIG = {
    "global_settings": {
        "max_cost_impact_auto": 15.0,
        "max_cost_impact_approval": 50.0,
        "max_resource_scaling": 3.0,
        "approval_timeout_minutes": 30,
        "require_approval_for_high_risk": True,
        "auto_approve_rollbacks": True,
        "automation_enabled": True,
        "business_hours_only": False,
        "max_concurrent_automations": 5,
    },
    "service_overrides": {"cloud_run": {}, "gke": {}, "bigquery": {}, "cloud_sql": {}},
}
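
Given this shape, the effective settings for one service type could be derived by overlaying its service_overrides entry on global_settings. The sketch assumes override entries reuse the global_settings keys, which this document does not state explicitly; effective_risk_settings is a hypothetical name.

```python
from typing import Any, Dict


def effective_risk_settings(risk_config: Dict[str, Any], service_type: str) -> Dict[str, Any]:
    """Overlay a service type's overrides on the tenant's global settings."""
    base = risk_config.get("global_settings", {})
    override = risk_config.get("service_overrides", {}).get(service_type) or {}
    return {**base, **override}  # shallow overlay; the override wins


example = {
    "global_settings": {"max_cost_impact_auto": 15.0, "automation_enabled": True},
    "service_overrides": {"cloud_run": {}, "gke": {"automation_enabled": False}},
}
print(effective_risk_settings(example, "gke"))
# {'max_cost_impact_auto': 15.0, 'automation_enabled': False}
print(effective_risk_settings(example, "cloud_run"))
# {'max_cost_impact_auto': 15.0, 'automation_enabled': True}
```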

Document reload vs. restart semantics

Env-only settings (operator-only) require a process restart or the /config/reload endpoint added in Phase 4.
Persisted configs take effect immediately after cache invalidation.

Acceptance for Phase 1:

ConfigurationService TTL equals settings.cache_default_ttl.
Default risk config aligns with DB-backed structure and service expectations.

Phase 2 — Persisted Config + APIs (Actionable Steps)

Add shared precedence helper

File: src/services/config_precedence.py
Add the utility exactly as specified in this doc (see Precedence Helper section) and import it in services.
Usage pattern: final = layered_merge([...]) and value = resolve_value(...).

Refactor DynamicConfigService to match models

File: src/services/dynamic_config_service.py
Replace references to non-existent fields:
- Remove is_default, type_specific_config, agent_specific_config, and value_type usage.
- Service type: select by ServiceTypeConfiguration.service_type and a configuration_name (e.g., "default") with is_active=True; read from default_settings, scaling_limits, monitoring_defaults.
- Global: read JSON from GlobalConfiguration.config_value directly (no type coercion layer).
Adopt layered_merge and resolve_value where overlaying dictionaries or scalar fallbacks are required.

Correct MigrationManager seeds to current schema

File: src/database/models/migration_manager.py
Replace usage of value_type, profile_name, is_default, and type_specific_config with current fields:
- GlobalConfiguration(config_key, config_value=<JSON>, data_type="json" or omit if not used)
- ServiceTypeConfiguration(service_type, configuration_name="default", default_settings=<JSON>, scaling_limits, monitoring_defaults, is_active=True)
Optionally, reduce seeding here and rely on Alembic data migrations for authoritative defaults.

Expose tenant-scoped configuration endpoints with RBAC and audit

File: src/api/routes/configurations.py (or new route modules under src/api/routes/)
Add endpoints (all write operations must append ConfigurationHistory and invalidate caches):
- GET/PUT /tenants/{tenant_id}/config/risk
  - Reads/writes TenantRiskConfiguration.
  - Require require_admin and enforce tenant_id matches security context.
  - Invalidate risk_config_{tenant_id}; append ConfigurationHistory with config_type="tenant_risk_config".
- GET/PUT /tenants/{tenant_id}/config/agents/{agent_type}
  - Reads/writes AgentConfiguration for that tenant + agent type.
  - Invalidate agent_config_{tenant_id}_{agent_type}; append history.
- GET /tenants/{tenant_id}/services and GET/PUT /tenants/{tenant_id}/services/{service_id}
  - Reads/writes ServiceConfiguration for instance overrides.
  - Invalidate instance cache key (service-specific cache) and append history with config_type="service_config".
- GET/PUT /tenants/{tenant_id}/projects/{project_id}/config
  - Reads/writes GcpProject-level config attributes and custom JSON if present.
  - Invalidate per-project cache and append history with resource_id=project_id.
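
The write path shared by these endpoints (authorize, persist, append history, invalidate) can be sketched with in-memory stand-ins. Everything here except the cache key format and the config_type value is hypothetical, including the history record fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class InMemoryStore:
    """Stand-in for the database tables and the service cache."""
    risk_configs: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    history: List[Dict[str, Any]] = field(default_factory=list)
    cache: Dict[str, Any] = field(default_factory=dict)


def put_risk_config(
    store: InMemoryStore,
    caller_tenant_id: str,
    caller_is_admin: bool,
    tenant_id: str,
    new_config: Dict[str, Any],
    changed_by: str,
) -> Dict[str, Any]:
    """Write a tenant risk config, record history, and bust the cache key."""
    # RBAC: admin role required, and the path tenant must match the caller's.
    if not caller_is_admin or caller_tenant_id != tenant_id:
        raise PermissionError("caller may not modify this tenant's risk config")

    old_config = store.risk_configs.get(tenant_id)
    store.risk_configs[tenant_id] = dict(new_config)
    store.history.append({
        "config_type": "tenant_risk_config",
        "tenant_id": tenant_id,
        "old_value": old_config,
        "new_value": dict(new_config),
        "changed_by": changed_by,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
    store.cache.pop(f"risk_config_{tenant_id}", None)  # precise invalidation
    return store.risk_configs[tenant_id]
```

In the real service the persist and history steps would share one database transaction, with cache invalidation after commit so readers never repopulate the cache from stale rows.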

Expose platform/global configuration endpoints

File: src/api/routes/configurations.py (or new src/api/routes/platform_config.py)
Endpoints:
- GET/PUT /config/global/{key}
  - PUT requires PLATFORM_ADMIN role; GET returns only is_public=True keys for tenant-scoped callers.
  - Writes should invalidate global_{key} and honor requires_restart by returning a flag in response.
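
The read gating and the requires_restart flag can be sketched with plain dicts standing in for GlobalConfiguration rows. The function names and the response shape are assumptions; only is_public, requires_restart, config_value, and the global_{key} cache key come from this document.

```python
from typing import Any, Dict, Optional


def read_global(row: Optional[Dict[str, Any]], is_platform_admin: bool) -> Optional[Any]:
    """Return the value if the caller may see it, else None."""
    if row is None:
        return None
    if is_platform_admin or row.get("is_public", False):
        return row["config_value"]
    return None  # tenant-scoped callers only see public keys


def put_global(
    rows: Dict[str, Dict[str, Any]],
    cache: Dict[str, Any],
    key: str,
    value: Any,
) -> Dict[str, Any]:
    """Write a global value, bust its cache key, and report restart needs."""
    row = rows.setdefault(key, {"is_public": False, "requires_restart": False})
    row["config_value"] = value
    cache.pop(f"global_{key}", None)
    return {"key": key, "requires_restart": bool(row.get("requires_restart"))}
```

Returning None for non-public keys makes a hidden key indistinguishable from a missing one, which avoids leaking the existence of operator-only settings to tenants.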

Seed defaults from Settings/ServiceTypeConfiguration

File: src/services/configuration_service.py
_create_default_tenant_risk_config and _create_default_agent_config must seed values using Settings and ServiceTypeConfiguration instead of literals.
Example: approval_timeout_minutes=settings.DEFAULT_APPROVAL_TIMEOUT or reuse settings keys added in Phase 1.

Acceptance for Phase 2:

DynamicConfigService reads only existing columns; config overlays use the shared helper.
Endpoints exist with RBAC, audit, and precise cache invalidation.
Default seeds originate from Settings and ServiceTypeConfiguration.

Phase 3 — UI Self-Service (Actionable Steps)

Bind the admin UI Settings pages to the new endpoints from Phase 2.
Show last updated by and last updated at using ConfigurationHistory entries.
On save, call the appropriate endpoint, then trigger cache invalidation for the impacted scope and surface a propagation toast.

Acceptance for Phase 3:

UI updates persist via APIs, reflect in reads, and show audit metadata.

Phase 4 — Operations Maturity (Actionable Steps)

Add operator reload endpoint

File: src/api/routes/configurations.py (or new src/api/routes/operator.py), mounted at /config.
Endpoint: POST /config/reload
RBAC: PLATFORM_ADMIN only; operator/system scope (no ConfigurationHistory), but do log an audit event config_reloaded.

Request body:

{
  "reload_env": true,
  "invalidate": { "all": true, "tenant_ids": [] },
  "reinit": { "configuration_service": true, "dynamic_config_service": true },
  "dry_run": false
}

Behavior:
- If reload_env: re-instantiate src/core/config.py:Settings or refresh the global instance.
- If reinit.configuration_service: reinitialize with cache_ttl_minutes=settings.cache_default_ttl and clear its cache.
- If reinit.dynamic_config_service: clear caches; if tenant_ids provided, only clear those entries.
- Return a summary with timestamps; include partial failures under errors if any sub-action fails.

Enforce requires_restart for globals

File: src/api/routes/configurations.py (global PUT handler)
If updating a GlobalConfiguration where requires_restart is true, include {"requires_restart": true} in the response and document that an operator restart or /config/reload is needed for changes to take effect.

Acceptance for Phase 4:

/config/reload exists with proper RBAC, performs targeted invalidations, and reports actions taken.
Global updates communicate restart requirements using requires_restart.

Definition of Done and Verification

Approvers honor per-tenant webhook/email configuration with env fallbacks; HMAC signature set when secret present.
ConfigurationService uses settings.cache_default_ttl; cache keys invalidated precisely on writes.
DynamicConfigService references only existing model fields and uses the shared precedence helper.
Tenant and platform endpoints implemented with RBAC and audit; cache invalidations occur per-scope.
DEFAULT_RISK_CONFIG and seed paths align with database shapes and service return contracts.
Operator reload endpoint implemented; global requires_restart enforced in responses.

Quick checks

Run backend tests: .venv/bin/python -m pytest tests/
Smoke test endpoints:
- GET /configurations/ (tenant aggregate)
- PUT /tenants/{tenant_id}/config/risk (ensure cache bust + history)
- PUT /config/global/{key} (platform admin only; verify requires_restart handling)
- POST /config/reload (platform admin only)

Footguns and Notes

Avoid deep merges by default; if you need deep-merge behavior for specific nested keys, add dedicated, type-aware merges in the relevant service method and document them.
Keep a single env loader in src/core/config.py; do not introduce parallel loaders.
Ensure all write endpoints both invalidate caches and append ConfigurationHistory (for tenant scopes) or log audit events (for platform/global).
Update src/database/models/migration_manager.py to the current schema or move seeds into Alembic migrations to prevent drift.
