Rollbacks
SmartSRE creates rollback checkpoints before every execution, enabling safe recovery if issues arise.
How Rollbacks Work
Checkpoint Creation
Before applying any change, SmartSRE:
- Captures pre-change state — Current resource configuration
- Creates checkpoint record — Stored with TTL (default 72 hours)
- Links to execution — Associates checkpoint with the specific change
What's Captured
Each checkpoint includes:
- Original resource configuration
- Which changes were applied
- Timestamp and expiration
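
The creation steps and captured fields above can be pictured as a small record type. This is a minimal sketch, not SmartSRE's actual implementation: `Checkpoint`, `create_checkpoint`, and every field name are hypothetical, and only the 72-hour default TTL comes from this page.

```python
import copy
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(hours=72)  # default TTL stated on this page


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record; field names are illustrative only."""
    execution_id: str        # links the checkpoint to one specific change
    original_config: dict    # pre-change resource configuration
    applied_changes: dict    # which settings the execution changes
    created_at: datetime
    expires_at: datetime


def create_checkpoint(execution_id, current_config, planned_changes,
                      ttl=DEFAULT_TTL, now=None):
    """Capture pre-change state and return a checkpoint tied to the execution."""
    now = now or datetime.now(timezone.utc)
    return Checkpoint(
        execution_id=execution_id,
        # Deep copies so later mutation of live objects cannot alter the snapshot.
        original_config=copy.deepcopy(current_config),
        applied_changes=copy.deepcopy(planned_changes),
        created_at=now,
        expires_at=now + ttl,
    )
```

Taking a deep copy matters in a sketch like this: if the snapshot shared nested objects with the live configuration, applying the change would silently rewrite the "original" state as well.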
Triggering Rollbacks
From Run Details
- Navigate to the Run Details page
- Find the executed items with Rollback buttons
- Click Rollback on individual items, or Rollback All for bulk rollback
Automatic Rollback
SmartSRE can automatically trigger rollbacks when:
- Validation fails — Post-execution health checks detect issues
- Error rate spikes — Monitoring detects increased errors
- Circuit breaker activates — Systemic issues detected
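
To make the three triggers concrete, the sketch below evaluates them against a set of post-execution signals. Everything here is an assumption for illustration: `PostExecutionSignals`, `rollback_reasons`, and the spike rule (error rate at least doubles and exceeds 1%) are not documented SmartSRE behavior or thresholds.

```python
from dataclasses import dataclass


@dataclass
class PostExecutionSignals:
    """Hypothetical signals gathered after a change; not a SmartSRE type."""
    validation_passed: bool      # result of post-execution health checks
    baseline_error_rate: float   # error rate before the change (0.0 to 1.0)
    current_error_rate: float    # error rate after the change (0.0 to 1.0)
    circuit_breaker_open: bool   # True when systemic issues were detected


def rollback_reasons(signals, spike_factor=2.0, min_error_rate=0.01):
    """Return the triggers that fired; an empty list means no rollback."""
    reasons = []
    if not signals.validation_passed:
        reasons.append("validation_failed")
    # Assumed spike rule: the rate reaches a small floor AND grows by
    # spike_factor, so noise on a near-zero baseline is ignored.
    spiked = (
        signals.current_error_rate >= min_error_rate
        and signals.current_error_rate
        >= spike_factor * signals.baseline_error_rate
    )
    if spiked:
        reasons.append("error_rate_spike")
    if signals.circuit_breaker_open:
        reasons.append("circuit_breaker")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, keeps the cause of an automatic rollback available for alerts and history.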
Rollback-Capable Operations
Most SmartSRE operations support rollback:
| Service | Operation | Rollback Capability |
|---|---|---|
| Cloud Run | Memory/CPU scaling | ✅ Full |
| Cloud Run | Min/max instances | ✅ Full |
| BigQuery | Slot capacity | ✅ Full |
| GCS | Lifecycle rules | ✅ Full |
| GCS | Delete objects | ❌ None (destructive) |
| Cloud SQL | Resize instance | ⚠️ Partial |
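
One way to use the table above is as a pre-flight lookup, so that changes without full rollback support get extra scrutiny before they run. The sketch below is illustrative only: the `RollbackCapability` enum, the `CAPABILITIES` mapping, and `requires_extra_confirmation` are hypothetical names, and treating unknown operations as not rollback-capable is an assumption.

```python
from enum import Enum


class RollbackCapability(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


# Mirrors the table above; the key strings are illustrative identifiers.
CAPABILITIES = {
    ("Cloud Run", "Memory/CPU scaling"): RollbackCapability.FULL,
    ("Cloud Run", "Min/max instances"): RollbackCapability.FULL,
    ("BigQuery", "Slot capacity"): RollbackCapability.FULL,
    ("GCS", "Lifecycle rules"): RollbackCapability.FULL,
    ("GCS", "Delete objects"): RollbackCapability.NONE,
    ("Cloud SQL", "Resize instance"): RollbackCapability.PARTIAL,
}


def requires_extra_confirmation(service, operation):
    """True when a change cannot be fully rolled back.

    Operations missing from the mapping are treated as NONE (assumption),
    which is the conservative choice.
    """
    capability = CAPABILITIES.get((service, operation), RollbackCapability.NONE)
    return capability is not RollbackCapability.FULL
```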
Rollback States
| State | Description |
|---|---|
| Available | Checkpoint exists, rollback can be triggered |
| In Progress | Rollback is executing |
| Completed | Successfully restored previous state |
| Failed | Error occurred during rollback |
| Expired | Checkpoint TTL exceeded (72 hours by default) |
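
The states above suggest a simple state machine. This page lists the states but not how they connect, so the transitions in this sketch are an assumption, including the choice to treat Failed as terminal; the names `RollbackState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical.

```python
from enum import Enum


class RollbackState(Enum):
    AVAILABLE = "Available"
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    FAILED = "Failed"
    EXPIRED = "Expired"


# Assumed transitions; not taken from SmartSRE documentation.
ALLOWED_TRANSITIONS = {
    RollbackState.AVAILABLE: {RollbackState.IN_PROGRESS, RollbackState.EXPIRED},
    RollbackState.IN_PROGRESS: {RollbackState.COMPLETED, RollbackState.FAILED},
    RollbackState.COMPLETED: set(),
    RollbackState.FAILED: set(),
    RollbackState.EXPIRED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Under these assumptions an expired checkpoint can never start a rollback, which matches the table: a rollback can only be triggered while the state is Available.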
Failed Rollbacks
If a critical rollback fails:
- Immediate Alert — Sent to configured notification channels
- Manual Steps Provided — UI shows commands to manually restore state
- Support Escalation — Contact support if needed
Viewing Failed Rollbacks
Navigate to Operations → Rollback History to see:
- Failed rollback attempts
- Error messages
- Manual remediation steps
Checkpoint Lifecycle
Retention
- Default TTL: 72 hours
- Successful rollback: Checkpoint marked as used
- Expired: Automatically cleaned up
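
The retention rules above amount to comparing each checkpoint's expiration against the current time. A minimal sketch, assuming checkpoints are plain dictionaries with a hypothetical `expires_at` key; how SmartSRE actually stores and purges checkpoints is not described here.

```python
from datetime import datetime, timedelta, timezone


def split_expired(checkpoints, now=None):
    """Partition checkpoint dicts into (live, expired) by their expires_at."""
    now = now or datetime.now(timezone.utc)
    live = [c for c in checkpoints if c["expires_at"] > now]
    expired = [c for c in checkpoints if c["expires_at"] <= now]
    return live, expired
```

A cleanup job built on this would delete the `expired` list and keep the `live` one; passing `now` explicitly makes the behavior easy to test.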
Best Practices
Test Rollbacks in Non-Production
Before relying on rollbacks in production, verify they work in staging.
Monitor Post-Execution
Set up alerts for error rate spikes so rollbacks can be triggered quickly.
Review Rollback History
Regularly check rollback history to identify patterns in failed changes.
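
If rollback history can be exported, a simple tally is often enough to surface patterns. The sketch below assumes a hypothetical export format, a list of dictionaries with `service`, `operation`, and `state` keys; SmartSRE's real history format may differ.

```python
from collections import Counter


def failure_hotspots(history, top_n=3):
    """Count failed rollbacks per (service, operation), most frequent first."""
    failed = Counter(
        (entry["service"], entry["operation"])
        for entry in history
        if entry["state"] == "Failed"
    )
    return failed.most_common(top_n)
```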
Limitations
- Destructive operations cannot be rolled back (e.g., deleting GCS objects)
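- A checkpoint captures the resource's own configuration, so external dependencies may not be restored
- Time-sensitive data may be affected, since a rollback restores configuration rather than data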
Next Steps
- Applying Fixes — Execute changes with rollback protection
- Risk Guardrails — Configure automatic rollback triggers
- Approvals — Approve high-risk changes