Risk Guardrails

Risk guardrails protect your infrastructure by enforcing cost and impact limits on all SmartSRE operations.

The 4-Tier Risk Model

SmartSRE classifies every remediation operation into one of four tiers:

| Tier | Behavior | Example Operations |
| --- | --- | --- |
| Tier 1: Auto-Execute | Executes immediately without approval | Memory increase within limits, min instance adjustment |
| Tier 2: Execute + Notify | Executes and sends notification | Max instance scaling, CPU increase |
| Tier 3: Require Approval | Pauses for human approval | High-cost changes, service restart |
| Tier 4: Manual Only | Creates ticket, never auto-executes | Resource deletion, IAM changes, cross-region moves |

Risk Evaluation Flow

Configuring Guardrails

Global Settings

Navigate to Settings → Risk Policy to configure tenant-wide limits:

| Setting | Description | Default |
| --- | --- | --- |
| max_cost_impact_auto_percent | Max cost increase (% of budget) for auto-execute | 5% |
| max_cost_impact_approval_percent | Max cost increase even with approval | 25% |
| max_impact_score_auto | Max impact score (0-100) for auto-execute | 50 |
| require_approval_for_high_risk | Always require approval for high-risk operations | true |
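
To make the interaction between these settings concrete, here is a minimal Python sketch of how they might combine into a decision. It is not SmartSRE's actual implementation: the `RiskPolicy` class, the `evaluate` function, the three outcome strings, and the choice to treat each limit as inclusive are all assumptions made for illustration. Only the setting names and defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskPolicy:
    # Field names and defaults mirror the Global Settings table.
    max_cost_impact_auto_percent: float = 5.0
    max_cost_impact_approval_percent: float = 25.0
    max_impact_score_auto: int = 50
    require_approval_for_high_risk: bool = True


def evaluate(policy: RiskPolicy, cost_impact_percent: float,
             impact_score: int, high_risk: bool = False) -> str:
    """Hypothetical helper: returns 'auto', 'approval', or 'blocked'."""
    # Above the approval ceiling the change is refused even with sign-off.
    if cost_impact_percent > policy.max_cost_impact_approval_percent:
        return "blocked"
    # High-risk operations always go to a human when the flag is set.
    if high_risk and policy.require_approval_for_high_risk:
        return "approval"
    # Auto-execute only when both the cost and impact limits are met
    # (inclusive comparison is an assumption).
    if (cost_impact_percent <= policy.max_cost_impact_auto_percent
            and impact_score <= policy.max_impact_score_auto):
        return "auto"
    return "approval"


policy = RiskPolicy()
print(evaluate(policy, cost_impact_percent=3.0, impact_score=15))   # auto
print(evaluate(policy, cost_impact_percent=12.0, impact_score=15))  # approval
print(evaluate(policy, cost_impact_percent=40.0, impact_score=15))  # blocked
```

How these outcomes map onto the four tiers is not specified on this page, so the sketch deliberately stops at three coarse results.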

Per-Service Overrides

Override specific limits for individual services:

{
  "service_overrides": {
    "cloudrun": {
      "max_cost_impact_auto_percent": 10,
      "max_memory_gi": 8,
      "max_cpu_m": 4000
    },
    "bigquery": {
      "min_slots": 100,
      "max_slots": 1000
    }
  }
}
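
One plausible way to read an override block is as a layer on top of the global settings: keys present in the override win, and everything else is inherited. The Python sketch below shows that reading. The precedence rule and the `effective_limits` helper are assumptions, not documented SmartSRE behavior; the values are copied from the defaults and the example above.

```python
# Global defaults, copied from the Global Settings table.
GLOBAL_DEFAULTS = {
    "max_cost_impact_auto_percent": 5,
    "max_cost_impact_approval_percent": 25,
    "max_impact_score_auto": 50,
    "require_approval_for_high_risk": True,
}

# Per-service overrides, copied from the JSON example.
SERVICE_OVERRIDES = {
    "cloudrun": {
        "max_cost_impact_auto_percent": 10,
        "max_memory_gi": 8,
        "max_cpu_m": 4000,
    },
    "bigquery": {"min_slots": 100, "max_slots": 1000},
}


def effective_limits(service: str) -> dict:
    """Hypothetical merge: global defaults with service overrides on top."""
    return {**GLOBAL_DEFAULTS, **SERVICE_OVERRIDES.get(service, {})}


print(effective_limits("cloudrun")["max_cost_impact_auto_percent"])  # 10
print(effective_limits("bigquery")["max_cost_impact_auto_percent"])  # 5
```

Under this reading, a service with no override entry simply gets the global defaults.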

Cost Baselines

If your GCP project has a configured monthly_budget_usd, guardrails use this as the baseline:

percent_impact = estimated_cost_change / monthly_budget_usd × 100

Virtual Baseline (Fallback)

When no budget is configured, SmartSRE uses a virtual baseline (default: $100/month):

  • Allows small changes while catching large step-function increases
  • Shown in approval UI as "virtual baseline" vs "budget"
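
The baseline selection and the percentage formula can be sketched in a few lines of Python. The function name, the returned label strings, and the decision to treat a missing or non-positive budget as "not configured" are assumptions for illustration; the formula and the $100 default come from the text above.

```python
import math
from typing import Optional

VIRTUAL_BASELINE_USD = 100.0  # default virtual baseline stated above


def percent_impact(estimated_cost_change: float,
                   monthly_budget_usd: Optional[float] = None) -> tuple[float, str]:
    """Return (percent impact, label of the baseline that was used)."""
    # Assumption: a missing or non-positive budget counts as unconfigured.
    if monthly_budget_usd is not None and monthly_budget_usd > 0:
        return estimated_cost_change / monthly_budget_usd * 100, "budget"
    return estimated_cost_change / VIRTUAL_BASELINE_USD * 100, "virtual baseline"


with_budget, label_a = percent_impact(20.0, monthly_budget_usd=2000.0)
without_budget, label_b = percent_impact(20.0)
assert math.isclose(with_budget, 1.0) and label_a == "budget"
assert math.isclose(without_budget, 20.0) and label_b == "virtual baseline"
```

The same $20/month increase is 1% of a $2,000 budget but 20% of the $100 virtual baseline, which is why projects without a budget hit approval thresholds much sooner.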

Circuit Breakers

Automatic safety stops that halt automation when anomalies are detected:

| Circuit Breaker | Trigger | Effect |
| --- | --- | --- |
| Failure Rate | > 10% of operations failing | Pause all auto-execute |
| Cost Overrun | Monthly costs exceed budget by 25% | Block cost-increasing changes |
| Error Spike | Error rate > 5% post-change | Trigger automatic rollback |
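
As a rough illustration of the three triggers, the Python sketch below checks a snapshot of metrics against the thresholds in the table. The `Metrics` fields and the `tripped_breakers` function are hypothetical, and "exceed budget by 25%" is interpreted here as spend above 1.25 times the budget, which is an assumption about the intended meaning.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    # Hypothetical field names; rates are fractions between 0 and 1.
    operation_failure_rate: float
    monthly_cost_usd: float
    monthly_budget_usd: float
    post_change_error_rate: float


def tripped_breakers(m: Metrics) -> dict[str, str]:
    """Map each tripped breaker to its effect, using the table's thresholds."""
    tripped = {}
    if m.operation_failure_rate > 0.10:
        tripped["Failure Rate"] = "Pause all auto-execute"
    # Assumed reading: spend is more than 25% over the monthly budget.
    if m.monthly_cost_usd > m.monthly_budget_usd * 1.25:
        tripped["Cost Overrun"] = "Block cost-increasing changes"
    if m.post_change_error_rate > 0.05:
        tripped["Error Spike"] = "Trigger automatic rollback"
    return tripped


snapshot = Metrics(operation_failure_rate=0.12, monthly_cost_usd=900.0,
                   monthly_budget_usd=1000.0, post_change_error_rate=0.01)
print(tripped_breakers(snapshot))  # {'Failure Rate': 'Pause all auto-execute'}
```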

Impact Scoring

Each operation receives an impact score from 0 to 100, based on four weighted factors:

| Factor | Weight | Examples |
| --- | --- | --- |
| Blast Radius | 40% | Single resource vs entire service |
| Reversibility | 30% | Easy rollback vs permanent delete |
| Service Criticality | 20% | Production vs development |
| Time Sensitivity | 10% | Business hours vs maintenance window |
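
Because the weights sum to 100%, one natural reading is a weighted sum of per-factor scores. The Python sketch below assumes each factor is itself rated from 0 to 100 and that the result is rounded to an integer. Both points are assumptions, since this page gives only the weights, and the factor values in the example are made up rather than taken from SmartSRE.

```python
# Weights from the factor table; key names are hypothetical.
WEIGHTS = {
    "blast_radius": 0.40,
    "reversibility": 0.30,
    "service_criticality": 0.20,
    "time_sensitivity": 0.10,
}


def impact_score(factors: dict[str, float]) -> int:
    """Combine per-factor ratings (each 0-100) into one weighted 0-100 score."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= factors[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return round(sum(WEIGHTS[name] * factors[name] for name in WEIGHTS))


# Made-up ratings for a small, easily reversed change.
print(impact_score({
    "blast_radius": 10,
    "reversibility": 10,
    "service_criticality": 30,
    "time_sensitivity": 20,
}))  # 15
```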

Example Scores

| Operation | Impact Score |
| --- | --- |
| Scale Cloud Run memory up | 15 |
| Set min instances to 1 | 25 |
| Restart Cloud Run service | 45 |
| Delete GCS bucket | 90 |

Approval Integration

When guardrails require approval:

  1. Approval Request Created — Contains ChangeSet, cost estimate, risk assessment
  2. Notification Sent — Via configured channels (webhook, email)
  3. Timer Starts — Default 30-minute expiration
  4. Decision Made — Approved, Rejected, or Expired
  5. Execution or Rollback — Based on decision
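
The lifecycle above can be modeled as a small state machine. The Python sketch below is illustrative only: the `ApprovalRequest` class, its field names, and the `Decision` enum are hypothetical, and the rule that a decision arriving after the timer has run out is recorded as expired is an assumption. The 30-minute default comes from step 3.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXPIRED = "expired"


@dataclass
class ApprovalRequest:
    # Step 1: the request carries the ChangeSet, cost estimate, and risk assessment.
    change_set: dict
    cost_estimate_usd: float
    risk_assessment: dict
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    ttl: timedelta = timedelta(minutes=30)  # step 3: default expiration
    decision: Decision = Decision.PENDING

    def status(self, now: datetime) -> Decision:
        """Report EXPIRED once the timer has run out on a pending request."""
        if self.decision is Decision.PENDING and now >= self.created_at + self.ttl:
            return Decision.EXPIRED
        return self.decision

    def decide(self, approved: bool, now: datetime) -> Decision:
        """Step 4: record a decision; late or repeated decisions change nothing."""
        if self.status(now) is Decision.EXPIRED:
            self.decision = Decision.EXPIRED
        elif self.decision is Decision.PENDING:
            self.decision = Decision.APPROVED if approved else Decision.REJECTED
        return self.decision


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
request = ApprovalRequest({"memory": "2Gi"}, 12.5, {"impact_score": 45}, created_at=start)
print(request.decide(approved=True, now=start + timedelta(minutes=10)))  # Decision.APPROVED
```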

See Approvals Guide for details.

Currency Handling

Mixed Currency Scenarios

When the project's cost currency differs from the guardrail baseline currency (USD):

  • Auto-execute is disabled for cost-impacting changes
  • Approval is forced with CURRENCY_MISMATCH reason
  • UI shows both project currency and baseline currency
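
The mixed-currency rule amounts to a simple gate in front of auto-execution. The Python sketch below is one way to express it; the `currency_gate` function and the keys of the returned dictionary are hypothetical, while the `CURRENCY_MISMATCH` reason and the USD baseline come from the text above.

```python
BASELINE_CURRENCY = "USD"  # guardrail baseline currency stated above


def currency_gate(project_currency: str, cost_impacting: bool) -> dict:
    """Hypothetical check applied before a change is considered for auto-execute."""
    project = project_currency.upper()
    if cost_impacting and project != BASELINE_CURRENCY:
        return {
            "auto_execute_allowed": False,
            "approval_forced": True,
            "reason": "CURRENCY_MISMATCH",
            # Both currencies are surfaced so the UI can show them side by side.
            "currencies": [project, BASELINE_CURRENCY],
        }
    return {
        "auto_execute_allowed": True,
        "approval_forced": False,
        "reason": None,
        "currencies": [project],
    }


print(currency_gate("EUR", cost_impacting=True)["reason"])   # CURRENCY_MISMATCH
print(currency_gate("EUR", cost_impacting=False)["reason"])  # None
```

In this sketch a change with no cost impact passes the gate regardless of currency, matching the statement that only cost-impacting changes lose auto-execute.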

Best Practices

Start Conservative

Begin with low thresholds and increase as you gain confidence:

{
  "max_cost_impact_auto_percent": 2,
  "max_impact_score_auto": 30,
  "require_approval_for_high_risk": true
}

Use Per-Service Overrides

A production Cloud Run service may need stricter limits than a development BigQuery workload, so set limits per service rather than relying only on the global defaults.

Monitor Circuit Breakers

Enable alerts for circuit breaker activations to catch systemic issues.

Viewing Risk Assessments

Every scan and execution includes a risk assessment visible in:

  • Run Details → Risk Tab — Full breakdown of risk factors
  • Approval Requests — Summary of why approval is required
  • Audit Trail — Historical risk decisions
