Configuration Model and Multitenant/RBAC
This document consolidates environment-level settings, persistent configuration scopes, precedence rules, API contracts, and cache/audit behavior. It replaces configurability.md and CONFIGURATION_EXTERNALIZATION_RECOMMENDATIONS.md with a single source of truth.
Goals
- Single, unambiguous source of truth for each setting.
- Clear separation between operator-only (env) vs. tenant-editable (DB-backed) config.
- Consistent precedence resolution across services with caching and auditing.
Source-of-Truth Matrix
- Env-Only (operator/process scope)
  - Examples: DATABASE_URL, DB_*, REDIS_URL, SMTP_*, SECRET_KEY, WEBHOOK_SECRET, BASE_URL, CORS_ORIGINS, FRONTEND_ORIGIN, LOG_LEVEL, CACHE_DEFAULT_TTL, PLATFORM_ADMIN_EMAILS, default HTTP client timeouts.
  - Ownership: platform operator.
  - Persistence: process env via src/core/config.py:Settings.
  - Change semantics: restart or operator-only reload endpoint; no tenant audit trails (logged as operator/system).
- Global (platform-wide, persisted)
  - Examples: default incident complexity/routing rules, allowed APIs/roles templates for onboarding, global alerting bands, other platform defaults that are safe to expose.
  - Ownership: platform_admin (optionally public via is_public=true).
  - Persistence: GlobalConfiguration.
  - Change semantics: invalidate cache key global_{key}; audit/log.
- Tenant Risk & Policy (persisted per tenant)
  - Examples: max_cost_impact_auto, max_cost_impact_approval, max_resource_scaling, approval_timeout_minutes, automation_enabled, business_hours_only, max_concurrent_automations; per service-type override JSON; notification_config (see schema below).
  - Ownership: tenant_admin.
  - Persistence: TenantRiskConfiguration and Tenant.notification_config.
  - Change semantics: invalidate risk_config_{tenant_id} and append ConfigurationHistory.
- Agent Defaults (persisted per tenant + agent type)
  - Examples: default_region/zone, metrics_duration_minutes, validation_timeout_seconds, default mem/cpu/scaling, monitoring thresholds, templates/tools.
  - Ownership: tenant_admin.
  - Persistence: AgentConfiguration.
  - Change semantics: invalidate agent_config_{tenant_id}_{agent_type} and append ConfigurationHistory.
- Service Type Defaults (persisted per service type)
  - Examples: default_settings, scaling_limits, monitoring_defaults for Cloud Run, GKE, BigQuery, Cloud SQL.
  - Ownership: platform_admin (optionally expose as presets).
  - Persistence: ServiceTypeConfiguration.
  - Change semantics: invalidate service_type_{service_type}_{configuration_name}; admin audit.
- Project-Level (persisted per tenant + project)
  - Examples: region, environment, SA reference, budget/labels, project_config flags.
  - Ownership: tenant_admin.
  - Persistence: GcpProject.
  - Change semantics: per-project cache invalidation and ConfigurationHistory (resource_id = project_id).
- Service Instance Overrides (persisted per tenant + service instance)
  - Examples: custom_thresholds, min/max_instances, monitoring_enabled, automation_enabled, maintenance_windows, notification_channels.
  - Ownership: tenant_admin (operators under guardrails).
  - Persistence: ServiceConfiguration.
  - Change semantics: invalidate instance cache; ConfigurationHistory with config_type=service_config.
Resolution order (highest → lowest):
- Service instance override → AgentConfiguration (tenant+agent) → TenantRiskConfiguration → ServiceTypeConfiguration → GlobalConfiguration (public) → Env defaults (Settings).
Alignment: Env vs. Persisted Config
- Keep a single env loader: src/core/config.py:Settings. Do not introduce a parallel loader. Use env exclusively for operator-only controls and as defaults when seeding DB rows.
- Seed paths:
  - Tenant creation: seed TenantRiskConfiguration and AgentConfiguration using Settings values where sensible.
  - ConfigurationService should read settings.cache_default_ttl when constructing the cache TTL.
- Transport/client knobs: defaults from env; allow per-tenant override inside Tenant.notification_config (e.g., webhook timeout/retries). Keep process-wide hard caps in env as safety rails.
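The interaction between a tenant override, the env default, and a process-wide hard cap can be sketched as follows. This is a minimal illustration, not the project's code: the function name effective_timeout and the hard_cap parameter are hypothetical, and the numbers are examples only.

```python
from typing import Optional


def effective_timeout(tenant_value: Optional[int], env_default: int, hard_cap: int) -> int:
    """Pick the tenant override if set, else the env default, then clamp to the cap.

    All three names are illustrative; the real Settings fields may differ.
    """
    chosen = tenant_value if tenant_value is not None else env_default
    # The hard cap is a safety rail: a tenant can lower the value but not exceed it.
    return min(chosen, hard_cap)


if __name__ == "__main__":
    print(effective_timeout(None, env_default=30, hard_cap=60))  # env default applies
    print(effective_timeout(120, env_default=30, hard_cap=60))   # clamped to the cap
```

Clamping after the override is resolved keeps the operator-owned cap authoritative without rejecting the tenant's stored value.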
Tenant.notification_config Schema
Add notification_config JSON to Tenant for approvals and notifications. Suggested schema:
Top-level fields
- webhook: object
- email: object
- chat: object (optional; e.g., Google Chat)
- approver_groups: object (risk level/group mapping)
- escalation: object
webhook
- Single endpoint form:
  - url (string, required)
  - headers (object, optional; default {})
  - auth (object, optional; one of)
    - bearer: { "type": "bearer", "token": "..." }
    - basic: { "type": "basic", "username": "...", "password": "..." }
    - api_key: { "type": "api_key", "api_key": "...", "header_name": "X-API-Key" }
  - secret (string, optional; HMAC signing)
  - timeout_seconds (int, optional; default 30)
  - retry_attempts (int, optional; default 3)
  - backoff (object, optional): { "initial_seconds": 1, "factor": 2, "max_seconds": 8 }
- Multiple endpoints form:
  - endpoints (array of objects with the same fields as above)
email
- approval_recipients (array[string], optional)
- default_recipients (array[string], optional)
- from_email (string, optional; override)
- template_overrides (object, optional), e.g. { "approval_request": { "subject": "...", "branding": { ... } } }
chat (optional)
- google_chat: { "webhook_url": "...", "spaces": ["..."], "include_links": true }
approver_groups
- Map risk/action to approver sets, e.g.:
  - risk_levels: { "low": ["sre-oncall@"], "medium": ["sre@"], "high": ["secops@"], "critical": ["exec@"] }
  - action_overrides: { "scale_min_instances": ["app-owners@"], "increase_memory": ["platform@"] }
escalation
- poll_interval_seconds (int, default 60)
- escalation_window_minutes (int, default 60)
- max_hops (int, default 2)
Validation and defaults
- If timeout_seconds or retry_attempts is not set per endpoint, use process defaults from Settings (e.g., 30s and 3).
- If secret is present, the approver computes X-SmartSRE-Signature (HMAC SHA256) as in src/approval/webhook_approver.py.
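For readers unfamiliar with HMAC signing, the signature rule above can be sketched with the Python standard library. This is an illustration under assumptions: the helper names sign_payload and verify_signature are hypothetical, and signing the raw request body with a hex-encoded digest is assumed here; the actual encoding used in src/approval/webhook_approver.py may differ.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of the body (hex encoding is an assumption)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Receiver-side check; compare_digest avoids timing side channels."""
    return hmac.compare_digest(sign_payload(secret, body), received)


# Example payload; the field name is made up for the illustration.
body = json.dumps({"approval_id": "example-123"}, sort_keys=True).encode("utf-8")
headers = {"X-SmartSRE-Signature": sign_payload("hmac-secret", body)}

if __name__ == "__main__":
    print(verify_signature("hmac-secret", body, headers["X-SmartSRE-Signature"]))
```

The receiver must sign exactly the bytes that were sent, so the body is serialized once and reused for both the request and the signature.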
Example JSON
{
"webhook": {
"endpoints": [
{
"url": "https://approval.example.com/hook",
"headers": {"X-Env": "prod"},
"auth": {"type": "bearer", "token": "***"},
"secret": "hmac-secret",
"timeout_seconds": 30,
"retry_attempts": 3,
"backoff": {"initial_seconds": 1, "factor": 2, "max_seconds": 8}
}
]
},
"email": {
"approval_recipients": ["oncall@example.com"],
"default_recipients": ["sre@example.com"],
"from_email": "no-reply@example.com"
},
"approver_groups": {
"risk_levels": {"low": ["sre@"], "high": ["secops@"]},
"action_overrides": {"increase_memory": ["platform@"]}
},
"escalation": {"poll_interval_seconds": 60, "escalation_window_minutes": 60, "max_hops": 2}
}
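Because the webhook section accepts both a single-endpoint form and a multiple-endpoints form, consumers will likely want to normalize it before use. The following is a minimal sketch; the function name normalize_webhook_endpoints is hypothetical and not part of the existing approver.

```python
from typing import Any, Dict, List, Optional


def normalize_webhook_endpoints(webhook: Optional[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a list of endpoint dicts for either schema form.

    - {"endpoints": [...]} -> the listed endpoints
    - {"url": ..., ...}    -> a one-element list
    - missing/empty        -> an empty list
    """
    if not webhook:
        return []
    if "endpoints" in webhook:
        # Copy each endpoint so callers can mutate the result safely.
        return [dict(ep) for ep in webhook["endpoints"]]
    if "url" in webhook:
        return [dict(webhook)]
    return []


if __name__ == "__main__":
    single = {"url": "https://approval.example.com/hook", "timeout_seconds": 30}
    multi = {"endpoints": [{"url": "https://a.example.com"}, {"url": "https://b.example.com"}]}
    print(len(normalize_webhook_endpoints(single)), len(normalize_webhook_endpoints(multi)))
```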
Precedence Helper (shared)
Provide a tiny utility used by both ConfigurationService and DynamicConfigService to apply consistent overlay semantics.
Proposed module: src/services/config_precedence.py
from typing import Any, Dict, Iterable, Optional
def layered_merge(layers: Iterable[Optional[Dict[str, Any]]]) -> Dict[str, Any]:
"""Overlay dicts from left→right, skipping None. Later layers win."""
result: Dict[str, Any] = {}
for layer in layers:
if not layer:
continue
for k, v in layer.items():
# shallow merge; nested objects are replaced by later layers
result[k] = v
return result
def resolve_value(*values: Any, default: Any = None) -> Any:
"""Return the first non-None value in order, else default."""
for v in values:
if v is not None:
return v
return default
Usage examples
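- For agent config (pass the lowest-precedence layer first, because later layers win): final = layered_merge([env_defaults, global_public, service_type, tenant_risk, agent_tenant, service_instance])
- For scalar lookup (pass the highest-precedence value first, because the first non-None value wins): timeout = resolve_value(tenant_override, service_type_default, global_band, settings_default, default=30)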
Notes
- Keep merges shallow for predictability; if nested deep-merge is required, constrain keys and add type-aware merges per section.
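The consequence of a shallow merge is easiest to see on a nested key. The sketch below repeats the helper so it runs on its own; the layer contents are invented sample data.

```python
from typing import Any, Dict, Iterable, Optional


def layered_merge(layers: Iterable[Optional[Dict[str, Any]]]) -> Dict[str, Any]:
    """Overlay dicts left to right, skipping None/empty layers. Later layers win."""
    result: Dict[str, Any] = {}
    for layer in layers:
        if not layer:
            continue
        result.update(layer)  # shallow: nested dicts are replaced, not merged
    return result


# Sample layers, lowest precedence first (values are illustrative only).
env_defaults = {"timeout": 30, "thresholds": {"cpu": 80, "memory": 75}}
tenant_risk = None  # a missing layer is skipped
service_instance = {"thresholds": {"cpu": 90}}

final = layered_merge([env_defaults, tenant_risk, service_instance])

if __name__ == "__main__":
    # "thresholds" comes entirely from service_instance; the "memory" key is gone.
    print(final)
```

If a section such as thresholds needs per-key overlay, that is the case for a dedicated, type-aware merge rather than a general deep merge.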
Operator Reload Endpoint
Endpoint
POST /config/reload
Auth and scope
- Role: platform_admin only.
- This endpoint is operator-only and outside tenant control (no ConfigurationHistory entries). It logs an operator/system audit event.
Request body
{
"reload_env": true,
"invalidate": { "all": true, "tenant_ids": [] },
"reinit": { "configuration_service": true, "dynamic_config_service": true },
"dry_run": false
}
Behavior
- If reload_env: re-instantiate src/core/config.py:Settings (or refresh the global instance) so new env values take effect.
- If reinit.configuration_service: reinitialize ConfigurationService with cache_ttl_minutes=settings.cache_default_ttl and clear its cache.
- If reinit.dynamic_config_service: clear DynamicConfigService caches; if invalidate.tenant_ids is provided, target those; otherwise clear all.
- Always return a summary of actions taken and timestamps.
Response body (example)
{
"status": "ok",
"reloaded_env": true,
"invalidated": {"all": true, "tenant_ids": []},
"reinitialized": ["configuration_service", "dynamic_config_service"],
"timestamp": "2025-10-24T09:30:00Z"
}
Side effects & logging
- Log an audit event: action config_reloaded, actor system/operator, include request payload (minus secrets).
- If any sub-action fails, return partial success with an errors field and appropriate HTTP status code.
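To make the request-to-response mapping concrete, here is a framework-free sketch that only derives the summary from a request body. The function name plan_reload is hypothetical; actually reloading Settings, clearing caches, handling dry_run, and collecting errors are deliberately left out, and treating an empty tenant_ids list as "clear all" is an interpretation of the behavior described above.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def plan_reload(body: Dict[str, Any]) -> Dict[str, Any]:
    """Build the response summary for POST /config/reload (no side effects)."""
    reinit = body.get("reinit") or {}
    invalidate = body.get("invalidate") or {}
    tenant_ids = list(invalidate.get("tenant_ids") or [])
    known_services = ("configuration_service", "dynamic_config_service")
    return {
        "status": "ok",
        "reloaded_env": bool(body.get("reload_env", False)),
        # Assumption: explicit tenant_ids narrow the scope; otherwise everything is cleared.
        "invalidated": {"all": not tenant_ids, "tenant_ids": tenant_ids},
        "reinitialized": [name for name in known_services if reinit.get(name)],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


if __name__ == "__main__":
    request = {
        "reload_env": True,
        "invalidate": {"all": True, "tenant_ids": []},
        "reinit": {"configuration_service": True, "dynamic_config_service": True},
        "dry_run": False,
    }
    print(plan_reload(request))
```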
API Surface (summary)
Existing
- GET /configurations/: aggregate tenant config for agent types.
- POST /configurations/cache/invalidate: cache invalidation helper (ensure tenant scoping).
Recommended tenant APIs
- GET/PUT /tenants/{tenant_id}/config/risk: reads/writes TenantRiskConfiguration; audit + cache bust.
- GET/PUT /tenants/{tenant_id}/config/agents/{agent_type}: reads/writes AgentConfiguration; audit + cache bust.
- GET /tenants/{tenant_id}/services and GET/PUT /tenants/{tenant_id}/services/{service_id}: reads/writes ServiceConfiguration; audit + cache bust.
- GET/PUT /tenants/{tenant_id}/projects/{project_id}/config: project-level settings; audit + cache bust.
Recommended platform APIs
- GET/PUT /config/global/{key}: reads/writes GlobalConfiguration (platform_admin; is_public gating when read by tenants).
- POST /config/reload: operator-only env reload + cache management.
Caching and Audit
- ConfigurationService uses an in-memory cache with TTL. Initialize using settings.cache_default_ttl. Invalidate precise keys on write:
  - Tenant risk: risk_config_{tenant_id}
  - Agent: agent_config_{tenant_id}_{agent_type}
  - Service type: service_type_{service_type}_{configuration_name}
  - Global: global_{key}
- DynamicConfigService keeps a 5m cache; call invalidate_cache(tenant_id) after tenant-scoped writes, or clear all for platform changes.
- Audit trails:
  - Tenant-scoped writes append ConfigurationHistory and produce audit logs.
  - Platform/global writes log audit events; optionally extend to a global history table if parity is desired.
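The key scheme and precise invalidation described above can be illustrated with a small TTL cache. This is a sketch, not the ConfigurationService implementation: the class TTLCache and the key-builder functions are hypothetical names; only the key formats come from this document.

```python
import time
from typing import Any, Dict, Optional, Tuple


class TTLCache:
    """Minimal in-memory cache with per-entry expiry and per-key invalidation."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._items[key] = (time.monotonic() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._items[key]  # expired entries are dropped lazily on read
            return None
        return value

    def invalidate(self, key: str) -> None:
        self._items.pop(key, None)


def risk_key(tenant_id: str) -> str:
    return f"risk_config_{tenant_id}"


def agent_key(tenant_id: str, agent_type: str) -> str:
    return f"agent_config_{tenant_id}_{agent_type}"


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=300)
    cache.set(risk_key("t1"), {"automation_enabled": True})
    cache.set(agent_key("t1", "cloud_run"), {"default_region": "example-region"})
    cache.invalidate(risk_key("t1"))  # a risk write busts only the risk entry
    print(cache.get(risk_key("t1")), cache.get(agent_key("t1", "cloud_run")))
```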
Phased Implementation Plan
Phase 0 — Prework
- Add Tenant.notification_config JSON column and migration; wire approvers to use it.
- Replace hardcoded transport knobs (timeouts/retries) in approvers with: tenant override → env default.
Phase 1 — Centralize env defaults
- Use settings.cache_default_ttl to initialize ConfigurationService.
- Replace remaining process-level hardcoded defaults with Settings fields, keeping them operator-only.
- Document reload vs. restart semantics.
Phase 2 — Persisted config + APIs
- Implement GET/PUT endpoints for Tenant Risk, Agent Defaults, Service Overrides, and Project Config with RBAC and audit.
- Update ConfigurationService._create_default_* to seed from Settings and ServiceTypeConfiguration, not literals.
- Refactor DynamicConfigService to read existing columns (drop dependencies on absent JSON fields) and adopt the shared precedence helper.
Phase 3 — UI self-service
- Bind Settings pages to real endpoints; show "last updated by/at" from ConfigurationHistory.
- On save, invalidate caches in scope and show propagation toast.
Phase 4 — Operations maturity
- Add /config/reload with audit logging and targeted invalidation.
- Use GlobalConfiguration.requires_restart for values that cannot be hot-reloaded.
Detailed Gap Closure Plan (by Phase)
The following is an actionable, file-specific checklist to complete the implementation. Follow in order; each step lists exact files, names, and expected side effects.
Phase 0 — Prework (Actionable Steps)
- Add Tenant.notification_config column
  - File: src/database/models/tenant_models.py
  - Change: add a new column on Tenant:
    from sqlalchemy import JSON
    # ... inside class Tenant
    notification_config = Column(JSON, nullable=True)
  - Migration: create an Alembic revision to add the column.
    - Command (example): alembic revision -m "add notification_config to tenant"
    - Migration upgrade (PostgreSQL/SQLite):
      from alembic import op
      import sqlalchemy as sa

      def upgrade():
          # Nullable so existing tenant rows need no backfill.
          op.add_column('tenants', sa.Column('notification_config', sa.JSON(), nullable=True))

      def downgrade():
          op.drop_column('tenants', 'notification_config')
- Webhook approver: per-endpoint transport knobs with env fallbacks
  - File: src/approval/webhook_approver.py
  - Behavior changes:
    - Respect per-endpoint fields from Tenant.notification_config.webhook: timeout_seconds (int), retry_attempts (int), backoff (object with initial_seconds, factor, max_seconds).
    - Fall back to env defaults when any of the above is missing.
  - Add env defaults to Settings (see "Add explicit env defaults in Settings" below) and use them here:
    - settings.approval_webhook_timeout_seconds
    - settings.approval_webhook_retry_attempts
    - Optional: settings.approval_webhook_backoff_initial_seconds, settings.approval_webhook_backoff_factor, settings.approval_webhook_backoff_max_seconds
  - Implementation sketch inside the _send_webhook loop:
    timeout = webhook_config.get('timeout_seconds', settings.approval_webhook_timeout_seconds)
    retries = webhook_config.get('retry_attempts', settings.approval_webhook_retry_attempts)
    # "or {}" also covers an explicit null backoff in the stored JSON
    backoff_cfg = webhook_config.get('backoff') or {}
    initial = backoff_cfg.get('initial_seconds', settings.approval_webhook_backoff_initial_seconds)
    factor = backoff_cfg.get('factor', settings.approval_webhook_backoff_factor)
    max_s = backoff_cfg.get('max_seconds', settings.approval_webhook_backoff_max_seconds)
    # sleep = min(max_s, initial * (factor ** attempt)) between attempts
- Email approver: unify env usage and defaults
  - File: src/approval/email_approver.py
  - Replace these lookups to match Settings field names:
    - Use settings.from_email (not smtp_from_email).
    - Add/use settings.smtp_use_tls boolean (see "Add explicit env defaults in Settings" below).
  - The SMTP fields should be read as: smtp_server, smtp_port, smtp_username, smtp_password, from_email, smtp_use_tls.
- Add explicit env defaults in Settings
  - File: src/core/config.py
  - Add fields to class Settings:
    approval_webhook_timeout_seconds: int = env_field(30, "APPROVAL_WEBHOOK_TIMEOUT_SECONDS")
    approval_webhook_retry_attempts: int = env_field(3, "APPROVAL_WEBHOOK_RETRY_ATTEMPTS")
    approval_webhook_backoff_initial_seconds: int = env_field(1, "APPROVAL_WEBHOOK_BACKOFF_INITIAL_SECONDS")
    approval_webhook_backoff_factor: int = env_field(2, "APPROVAL_WEBHOOK_BACKOFF_FACTOR")
    approval_webhook_backoff_max_seconds: int = env_field(8, "APPROVAL_WEBHOOK_BACKOFF_MAX_SECONDS")
    smtp_use_tls: bool = env_field(True, "SMTP_USE_TLS")
- Keep schema validation behavior for HMAC signing
  - File: src/approval/webhook_approver.py
  - Already supported via _generate_webhook_signature; no change needed beyond ensuring that secret in the endpoint config is honored.
Acceptance for Phase 0:
- Approvers read tenant notification_config and respect per-endpoint timeout_seconds, retry_attempts, backoff with env fallbacks.
- Email approver uses settings.from_email and settings.smtp_use_tls.
- Tenant.notification_config is persisted and accessible (migration applied).
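The backoff rule from the webhook approver step, sleep = min(max_s, initial * (factor ** attempt)), produces the following schedule with the documented defaults (initial 1, factor 2, max 8). The helper name backoff_delays is hypothetical and exists only for this illustration.

```python
from typing import List


def backoff_delays(retries: int, initial: float = 1, factor: float = 2, max_s: float = 8) -> List[float]:
    """Delay before each retry: exponential growth, capped at max_s."""
    return [min(max_s, initial * (factor ** attempt)) for attempt in range(retries)]


if __name__ == "__main__":
    print(backoff_delays(3))  # [1, 2, 4]
    print(backoff_delays(5))  # [1, 2, 4, 8, 8]
```

With retry_attempts at its default of 3, the cap of 8 seconds is never reached; it only matters when a tenant raises the retry count.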
Phase 1 — Centralize Env Defaults (Actionable Steps)
- Initialize ConfigurationService with Settings TTL
  - File: src/services/configuration_service.py
  - Update the global factory to use env TTL:
    from ..core.config import settings

    def get_configuration_service() -> ConfigurationService:
        # Lazily build the singleton so the TTL comes from the loaded Settings.
        global _configuration_service
        if _configuration_service is None:
            _configuration_service = ConfigurationService(cache_ttl_minutes=settings.cache_default_ttl)
        return _configuration_service
- Align default risk config structure
  - File: src/core/config.py
  - Update DEFAULT_RISK_CONFIG to include fields returned by ConfigurationService.get_tenant_risk_config:
    - Under global_settings: add max_cost_impact_approval, require_approval_for_high_risk, auto_approve_rollbacks, max_concurrent_automations.
    - Add service_overrides map with keys cloud_run, gke, bigquery, cloud_sql defaulting to {}.
  - Shape example:
DEFAULT_RISK_CONFIG = {
"global_settings": {
"max_cost_impact_auto": 15.0,
"max_cost_impact_approval": 50.0,
"max_resource_scaling": 3.0,
"approval_timeout_minutes": 30,
"require_approval_for_high_risk": True,
"auto_approve_rollbacks": True,
"automation_enabled": True,
"business_hours_only": False,
"max_concurrent_automations": 5,
},
"service_overrides": {"cloud_run": {}, "gke": {}, "bigquery": {}, "cloud_sql": {}},
}
- Document reload vs. restart semantics
  - Env-only settings (operator-only) require a process restart or the /config/reload endpoint added in Phase 4.
  - Persisted configs take effect immediately after cache invalidation.
Acceptance for Phase 1:
- ConfigurationService TTL equals settings.cache_default_ttl.
- Default risk config aligns with DB-backed structure and service expectations.
Phase 2 — Persisted Config + APIs (Actionable Steps)
- Add shared precedence helper
  - File: src/services/config_precedence.py
  - Add the utility exactly as specified in this doc (see Precedence Helper section) and import it in services.
  - Usage pattern: final = layered_merge([...]) and value = resolve_value(...).
- Refactor DynamicConfigService to match models
  - File: src/services/dynamic_config_service.py
  - Replace references to non-existent fields:
    - Remove is_default, type_specific_config, agent_specific_config, and value_type usage.
    - Service type: select by ServiceTypeConfiguration.service_type and a configuration_name (e.g., "default") with is_active=True; read from default_settings, scaling_limits, monitoring_defaults.
    - Global: read JSON from GlobalConfiguration.config_value directly (no type coercion layer).
  - Adopt layered_merge and resolve_value where overlaying dictionaries or scalar fallbacks are required.
- Correct MigrationManager seeds to current schema
  - File: src/database/models/migration_manager.py
  - Replace usage of value_type, profile_name, is_default, and type_specific_config with current fields:
    - GlobalConfiguration(config_key, config_value=<JSON>, data_type="json" or omit if not used)
    - ServiceTypeConfiguration(service_type, configuration_name="default", default_settings=<JSON>, scaling_limits, monitoring_defaults, is_active=True)
  - Optionally, reduce seeding here and rely on Alembic data migrations for authoritative defaults.
- Expose tenant-scoped configuration endpoints with RBAC and audit
  - File: src/api/routes/configurations.py (or new route modules under src/api/routes/)
  - Add endpoints (all write operations must append ConfigurationHistory and invalidate caches):
    - GET/PUT /tenants/{tenant_id}/config/risk
      - Reads/writes TenantRiskConfiguration.
      - Require require_admin and enforce that tenant_id matches the security context.
      - Invalidate risk_config_{tenant_id}; append ConfigurationHistory with config_type="tenant_risk_config".
    - GET/PUT /tenants/{tenant_id}/config/agents/{agent_type}
      - Reads/writes AgentConfiguration for that tenant + agent type.
      - Invalidate agent_config_{tenant_id}_{agent_type}; append history.
    - GET /tenants/{tenant_id}/services and GET/PUT /tenants/{tenant_id}/services/{service_id}
      - Reads/writes ServiceConfiguration for instance overrides.
      - Invalidate instance cache key (service-specific cache) and append history with config_type="service_config".
    - GET/PUT /tenants/{tenant_id}/projects/{project_id}/config
      - Reads/writes GcpProject-level config attributes and custom JSON if present.
      - Invalidate per-project cache and append history with resource_id=project_id.
- Expose platform/global configuration endpoints
  - File: src/api/routes/configurations.py (or new src/api/routes/platform_config.py)
  - Endpoints:
    - GET/PUT /config/global/{key}
      - PUT requires PLATFORM_ADMIN role; GET returns only is_public=True keys for tenant-scoped callers.
      - Writes should invalidate global_{key} and honor requires_restart by returning a flag in response.
- Seed defaults from Settings/ServiceTypeConfiguration
  - File: src/services/configuration_service.py
  - _create_default_tenant_risk_config and _create_default_agent_config must seed values using Settings and ServiceTypeConfiguration instead of literals.
  - Example: approval_timeout_minutes=settings.DEFAULT_APPROVAL_TIMEOUT, or reuse settings keys added in Phase 1.
Acceptance for Phase 2:
- DynamicConfigService reads only existing columns; config overlays use the shared helper.
- Endpoints exist with RBAC, audit, and precise cache invalidation.
- Default seeds originate from Settings and ServiceTypeConfiguration.
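The write path required of every tenant-scoped PUT (RBAC check, persist, cache bust, history append) can be sketched without a web framework or database. Everything here is a stand-in: FakeStore, put_tenant_risk, and the history field names other than config_type are hypothetical, and the real handlers would use the ORM models and the require_admin dependency.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class FakeStore:
    """In-memory stand-in for the risk table, the cache, and ConfigurationHistory."""
    risk_rows: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    cache: Dict[str, Any] = field(default_factory=dict)
    history: List[Dict[str, Any]] = field(default_factory=list)


def put_tenant_risk(store: FakeStore, caller_tenant_id: str, is_admin: bool,
                    tenant_id: str, payload: Dict[str, Any], changed_by: str) -> Dict[str, Any]:
    # 1. RBAC: admin role and a tenant_id that matches the caller's security context.
    if not is_admin or caller_tenant_id != tenant_id:
        raise PermissionError("admin role and matching tenant_id required")
    # 2. Persist the new value, keeping the old one for the audit record.
    old_value = store.risk_rows.get(tenant_id)
    store.risk_rows[tenant_id] = dict(payload)
    # 3. Bust exactly the affected cache key.
    store.cache.pop(f"risk_config_{tenant_id}", None)
    # 4. Append a history entry (field names besides config_type are assumptions).
    store.history.append({
        "config_type": "tenant_risk_config",
        "tenant_id": tenant_id,
        "changed_by": changed_by,
        "old_value": old_value,
        "new_value": dict(payload),
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })
    return store.risk_rows[tenant_id]


if __name__ == "__main__":
    store = FakeStore(cache={"risk_config_t1": {"automation_enabled": False}})
    put_tenant_risk(store, "t1", True, "t1", {"automation_enabled": True}, "admin@example.com")
    print(store.cache, len(store.history))
```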
Phase 3 — UI Self-Service (Actionable Steps)
- Bind the admin UI Settings pages to the new endpoints from Phase 2.
- Show last updated by and last updated at using ConfigurationHistory entries.
- On save, call the appropriate endpoint, then trigger cache invalidation for the impacted scope and surface a propagation toast.
Acceptance for Phase 3:
- UI updates persist via APIs, reflect in reads, and show audit metadata.
Phase 4 — Operations Maturity (Actionable Steps)
- Add operator reload endpoint
  - File: src/api/routes/configurations.py (or new src/api/routes/operator.py), mounted at /config.
  - Endpoint: POST /config/reload
  - RBAC: PLATFORM_ADMIN only; operator/system scope (no ConfigurationHistory), but do log an audit event config_reloaded.
  - Request body:
    {
      "reload_env": true,
      "invalidate": { "all": true, "tenant_ids": [] },
      "reinit": { "configuration_service": true, "dynamic_config_service": true },
      "dry_run": false
    }
  - Behavior:
    - If reload_env: re-instantiate src/core/config.py:Settings or refresh the global instance.
    - If reinit.configuration_service: reinitialize with cache_ttl_minutes=settings.cache_default_ttl and clear its cache.
    - If reinit.dynamic_config_service: clear caches; if tenant_ids is provided, only clear those entries.
    - Return a summary with timestamps; include partial failures under errors if any sub-action fails.
- Enforce requires_restart for globals
  - File: src/api/routes/configurations.py (global PUT handler)
  - If updating a GlobalConfiguration where requires_restart is true, include {"requires_restart": true} in the response and document that an operator restart or /config/reload is needed for changes to take effect.
Acceptance for Phase 4:
- /config/reload exists with proper RBAC, performs targeted invalidations, and reports actions taken.
- Global updates communicate restart requirements using requires_restart.
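The two global-config rules, is_public gating on reads and the requires_restart flag on writes, reduce to a few lines. This is a sketch over plain dicts; the function names and the response fields other than requires_restart are hypothetical, and the dict keys mirror GlobalConfiguration column names as used in this document.

```python
from typing import Any, Dict, Optional


def read_global(entry: Dict[str, Any], caller_is_platform_admin: bool) -> Optional[Any]:
    """Tenant-scoped callers only see entries flagged is_public; admins see all."""
    if caller_is_platform_admin or entry.get("is_public", False):
        return entry["config_value"]
    return None  # a real handler would likely respond with 403 or 404 here


def put_global_response(key: str, entry: Dict[str, Any]) -> Dict[str, Any]:
    """Response for a successful write, surfacing whether a restart is needed."""
    return {
        "key": key,
        "status": "ok",
        "requires_restart": bool(entry.get("requires_restart", False)),
    }


if __name__ == "__main__":
    private_entry = {"config_value": {"band": "high"}, "is_public": False, "requires_restart": True}
    print(read_global(private_entry, caller_is_platform_admin=False))
    print(put_global_response("alerting_bands", private_entry))
```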
Definition of Done and Verification
- Approvers honor per-tenant webhook/email configuration with env fallbacks; HMAC signature set when secret present.
- ConfigurationService uses settings.cache_default_ttl; cache keys invalidated precisely on writes.
- DynamicConfigService references only existing model fields and uses the shared precedence helper.
- Tenant and platform endpoints implemented with RBAC and audit; cache invalidations occur per-scope.
- DEFAULT_RISK_CONFIG and seed paths align with database shapes and service return contracts.
- Operator reload endpoint implemented; global requires_restart enforced in responses.
Quick checks
- Run backend tests: .venv/bin/python -m pytest tests/
- Smoke test endpoints:
  - GET /config/ (tenant aggregate)
  - PUT /tenants/{tenant_id}/config/risk (ensure cache bust + history)
  - PUT /config/global/{key} (platform admin only; verify requires_restart handling)
  - POST /config/reload (platform admin only)
Footguns and Notes
- Avoid deep merges by default; if you need deep-merge behavior for specific nested keys, add dedicated, type-aware merges in the relevant service method and document them.
- Keep a single env loader in src/core/config.py; do not introduce parallel loaders.
- Ensure all write endpoints both invalidate caches and append ConfigurationHistory (for tenant scopes) or log audit events (for platform/global).
- Update src/database/models/migration_manager.py to the current schema or move seeds into Alembic migrations to prevent drift.