Risk Guardrails
Risk guardrails protect your infrastructure by enforcing cost and impact limits on all SmartSRE operations.
The 4-Tier Risk Model
SmartSRE classifies every remediation operation into one of four tiers:
| Tier | Behavior | Example Operations |
|---|---|---|
| Tier 1: Auto-Execute | Executes immediately without approval | Memory increase within limits, min instance adjustment |
| Tier 2: Execute + Notify | Executes and sends notification | Max instance scaling, CPU increase |
| Tier 3: Require Approval | Pauses for human approval | High-cost changes, service restart |
| Tier 4: Manual Only | Creates ticket, never auto-executes | Resource deletion, IAM changes, cross-region moves |
Risk Evaluation Flow
Configuring Guardrails
Global Settings
Navigate to Settings → Risk Policy to configure tenant-wide limits:
| Setting | Description | Default |
|---|---|---|
| max_cost_impact_auto_percent | Max cost increase (% of budget) for auto-execute | 5% |
| max_cost_impact_approval_percent | Max cost increase even with approval | 25% |
| max_impact_score_auto | Max impact score (0-100) for auto-execute | 50 |
| require_approval_for_high_risk | Always require approval for high-risk operations | true |
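As a rough illustration of how these four settings could combine into a decision, here is a minimal Python sketch. The evaluation order, the inclusive comparison at each limit, and the outcome labels (`auto_execute`, `require_approval`, `blocked`) are assumptions made for this example, not SmartSRE's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskPolicy:
    # Defaults mirror the Global Settings table.
    max_cost_impact_auto_percent: float = 5.0
    max_cost_impact_approval_percent: float = 25.0
    max_impact_score_auto: int = 50
    require_approval_for_high_risk: bool = True


def decide(policy: RiskPolicy, cost_impact_percent: float,
           impact_score: int, high_risk: bool = False) -> str:
    """Return a hypothetical outcome label for one proposed change."""
    # Assumed: changes above the approval ceiling cannot proceed even if approved.
    if cost_impact_percent > policy.max_cost_impact_approval_percent:
        return "blocked"
    # Assumed: the high-risk flag forces approval regardless of the numbers.
    if high_risk and policy.require_approval_for_high_risk:
        return "require_approval"
    # Assumed: exceeding either auto-execute limit is enough to require approval.
    if (cost_impact_percent > policy.max_cost_impact_auto_percent
            or impact_score > policy.max_impact_score_auto):
        return "require_approval"
    return "auto_execute"
```

Under these assumptions, a change with 3% cost impact and an impact score of 15 would auto-execute, while the same change at 10% cost impact would wait for approval.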
Per-Service Overrides
Override specific limits for individual services:
{
  "service_overrides": {
    "cloudrun": {
      "max_cost_impact_auto_percent": 10,
      "max_memory_gi": 8,
      "max_cpu_m": 4000
    },
    "bigquery": {
      "min_slots": 100,
      "max_slots": 1000
    }
  }
}
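One plausible way to read an override file like this is as a per-service layer on top of the global settings. The Python sketch below assumes that any key a service does not override falls back to the global value; the helper name `effective_limits` is hypothetical.

```python
import json

# Global defaults from the Global Settings table.
GLOBAL_SETTINGS = {
    "max_cost_impact_auto_percent": 5,
    "max_cost_impact_approval_percent": 25,
    "max_impact_score_auto": 50,
    "require_approval_for_high_risk": True,
}

OVERRIDES_JSON = """
{
  "service_overrides": {
    "cloudrun": {"max_cost_impact_auto_percent": 10, "max_memory_gi": 8, "max_cpu_m": 4000},
    "bigquery": {"min_slots": 100, "max_slots": 1000}
  }
}
"""


def effective_limits(global_settings: dict, overrides_doc: dict, service: str) -> dict:
    """Overlay a service's overrides on the global settings (assumed merge rule)."""
    service_overrides = overrides_doc.get("service_overrides", {}).get(service, {})
    # Later keys win, so an override replaces the global value of the same name.
    return {**global_settings, **service_overrides}


overrides_doc = json.loads(OVERRIDES_JSON)
cloudrun = effective_limits(GLOBAL_SETTINGS, overrides_doc, "cloudrun")
bigquery = effective_limits(GLOBAL_SETTINGS, overrides_doc, "bigquery")
```

With this merge rule, Cloud Run gets a 10% auto-execute cost limit while BigQuery keeps the global 5% and gains its slot bounds.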
Cost Baselines
Project Budget (Recommended)
If your GCP project has a configured monthly_budget_usd, guardrails use this as the baseline:
percent_impact = estimated_cost_change / monthly_budget_usd × 100
Virtual Baseline (Fallback)
When no budget is configured, SmartSRE uses a virtual baseline (default: $100/month):
- Allows small changes while catching large step-function increases
- Shown in approval UI as "virtual baseline" vs "budget"
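The baseline selection and the percent-impact formula can be sketched in a few lines of Python. The function name and the returned label are illustrative; treating a missing or non-positive budget as "no budget configured" is an assumption.

```python
from typing import Optional, Tuple

VIRTUAL_BASELINE_USD = 100.0  # documented default when no budget is configured


def percent_impact(estimated_cost_change: float,
                   monthly_budget_usd: Optional[float] = None) -> Tuple[float, str]:
    """Return (percent impact, baseline kind) for an estimated monthly cost change."""
    if monthly_budget_usd is not None and monthly_budget_usd > 0:
        baseline, kind = monthly_budget_usd, "budget"
    else:
        # Assumed: a missing or non-positive budget falls back to the virtual baseline.
        baseline, kind = VIRTUAL_BASELINE_USD, "virtual baseline"
    return estimated_cost_change / baseline * 100, kind
```

For example, a $20/month increase is 1% of a $2,000 budget but 20% of the $100 virtual baseline, so the same change can pass the default 5% auto-execute limit in one project and exceed it in another.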
Circuit Breakers
Automatic safety stops that halt automation when anomalies are detected:
| Circuit Breaker | Trigger | Effect |
|---|---|---|
| Failure Rate | > 10% of operations failing | Pause all auto-execute |
| Cost Overrun | Monthly costs exceed budget by 25% | Block cost-increasing changes |
| Error Spike | Error rate > 5% post-change | Trigger automatic rollback |
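The three triggers can be expressed as simple threshold checks. This Python sketch assumes that "exceed budget by 25%" means costs above 1.25 times the budget and leaves out measurement windows, which the table does not specify; the effect labels are hypothetical.

```python
from typing import List


def tripped_breakers(failed_ops: int, total_ops: int,
                     monthly_cost: float, monthly_budget: float,
                     post_change_error_rate: float) -> List[str]:
    """Return the effects of every circuit breaker whose trigger is met."""
    effects = []
    # Failure Rate: more than 10% of operations failing.
    if total_ops > 0 and failed_ops / total_ops > 0.10:
        effects.append("pause_auto_execute")
    # Cost Overrun: assumed to mean cost above 125% of the monthly budget.
    if monthly_cost > monthly_budget * 1.25:
        effects.append("block_cost_increasing_changes")
    # Error Spike: error rate above 5% after a change (rate given as a fraction).
    if post_change_error_rate > 0.05:
        effects.append("trigger_rollback")
    return effects
```

Returning a list lets more than one breaker trip at once, which seems likely during a systemic incident.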
Impact Scoring
Each operation has an impact score from 0-100 based on:
| Factor | Weight | Examples |
|---|---|---|
| Blast Radius | 40% | Single resource vs entire service |
| Reversibility | 30% | Easy rollback vs permanent delete |
| Service Criticality | 20% | Production vs development |
| Time Sensitivity | 10% | Business hours vs maintenance window |
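The weights suggest a weighted sum. The Python sketch below assumes each factor is first rated on its own 0-100 scale and that the score is a linear combination rounded to an integer; how SmartSRE actually rates the individual factors is not described here, and the factor key names are hypothetical.

```python
# Weights from the Impact Scoring table.
WEIGHTS = {
    "blast_radius": 0.40,
    "reversibility": 0.30,
    "service_criticality": 0.20,
    "time_sensitivity": 0.10,
}


def impact_score(factors: dict) -> int:
    """Combine per-factor ratings (each 0-100, higher = riskier) into one 0-100 score."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= factors[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    # Assumed: a plain weighted sum, rounded to the nearest integer.
    return round(sum(WEIGHTS[name] * factors[name] for name in WEIGHTS))
```

Because the weights sum to 1, the result stays within 0-100 whenever every factor does.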
Example Scores
| Operation | Impact Score |
|---|---|
| Scale Cloud Run memory up | 15 |
| Set min instances to 1 | 25 |
| Restart Cloud Run service | 45 |
| Delete GCS bucket | 90 |
Approval Integration
When guardrails require approval:
- Approval Request Created — Contains ChangeSet, cost estimate, risk assessment
- Notification Sent — Via configured channels (webhook, email)
- Timer Starts — Default 30-minute expiration
- Decision Made — Approved, Rejected, or Expired
- Execution or Rollback — Based on decision
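The lifecycle above can be modeled as a small state check. In this Python sketch the class and field names are hypothetical, and treating a request as expired exactly at the 30-minute mark is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ApprovalRequest:
    created_at: datetime
    ttl: timedelta = timedelta(minutes=30)  # documented default expiration
    decision: Optional[str] = None          # "approved" or "rejected" once decided

    def status(self, now: datetime) -> str:
        """Return approved, rejected, expired, or pending as of `now`."""
        if self.decision is not None:
            return self.decision
        # Assumed: the request expires at exactly created_at + ttl.
        if now >= self.created_at + self.ttl:
            return "expired"
        return "pending"


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
request = ApprovalRequest(created_at=t0)
```

Passing `now` in explicitly, rather than reading the clock inside `status`, keeps the expiry logic easy to test.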
See Approvals Guide for details.
Currency Handling
Mixed Currency Scenarios
When project cost currency differs from guardrail baseline (USD):
- Auto-execute is disabled for cost-impacting changes
- Approval is forced with CURRENCY_MISMATCH reason
- UI shows both project currency and baseline currency
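A minimal Python sketch of the mismatch check follows. The function name is hypothetical, and comparing currency codes case-insensitively is an assumption; only the `CURRENCY_MISMATCH` reason comes from the behavior described above.

```python
from typing import Optional


def forced_approval_reason(project_currency: str,
                           cost_impacting: bool,
                           baseline_currency: str = "USD") -> Optional[str]:
    """Return "CURRENCY_MISMATCH" when a cost-impacting change must be approved."""
    # Changes with no cost impact are unaffected by the currency check.
    if not cost_impacting:
        return None
    # Assumed: currency codes are compared case-insensitively.
    if project_currency.strip().upper() != baseline_currency.strip().upper():
        return "CURRENCY_MISMATCH"
    return None
```

A `None` result only means the currency check does not force approval; the other guardrails still apply.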
Best Practices
Start Conservative
Begin with low thresholds and increase as you gain confidence:
{
  "max_cost_impact_auto_percent": 2,
  "max_impact_score_auto": 30,
  "require_approval_for_high_risk": true
}
Use Per-Service Overrides
Tune limits per service instead of relying only on the global settings. For example, production Cloud Run may need stricter limits than development BigQuery.
Monitor Circuit Breakers
Enable alerts for circuit breaker activations to catch systemic issues.
Viewing Risk Assessments
Every scan and execution includes a risk assessment visible in:
- Run Details → Risk Tab — Full breakdown of risk factors
- Approval Requests — Summary of why approval is required
- Audit Trail — Historical risk decisions
Next Steps
- Approvals — Handle approval workflows
- Scope Management — Restrict allowed operations per scope
- Cost Control — Budget configuration