
Rollbacks

SmartSRE creates rollback checkpoints before every execution, enabling safe recovery if issues arise.

How Rollbacks Work

Checkpoint Creation

Before applying any change, SmartSRE:

  1. Captures pre-change state — Current resource configuration
  2. Creates checkpoint record — Stored with TTL (default 72 hours)
  3. Links to execution — Associates checkpoint with the specific change
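
The three steps above can be sketched roughly as follows. This is an illustrative model only, not SmartSRE's actual implementation: the `Checkpoint` class, its field names, and `create_checkpoint` are hypothetical. Only the 72-hour default TTL comes from this page.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(hours=72)  # default TTL stated on this page


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record; field names are assumptions."""
    execution_id: str       # links the checkpoint to one specific change
    original_config: dict   # resource configuration before the change
    applied_changes: dict   # the changes that the execution applies
    created_at: datetime
    expires_at: datetime


def create_checkpoint(
    execution_id: str,
    current_config: dict,
    changes: dict,
    ttl: timedelta = DEFAULT_TTL,
    now: datetime | None = None,
) -> Checkpoint:
    """Capture pre-change state, stamp a TTL, and link it to the execution."""
    now = now or datetime.now(timezone.utc)
    return Checkpoint(
        execution_id=execution_id,
        # Deep copies keep the snapshot stable if the caller mutates its dicts.
        original_config=copy.deepcopy(current_config),
        applied_changes=copy.deepcopy(changes),
        created_at=now,
        expires_at=now + ttl,
    )
```

The deep copy matters in a sketch like this: the checkpoint has to preserve the configuration as it was before the change, even if the same dictionary is later updated in place.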

What's Captured

Each checkpoint includes:

  • Original resource configuration
  • Which changes were applied
  • Timestamp and expiration

Triggering Rollbacks

From Run Details

  1. Navigate to the Run Details page
  2. Find the executed items that show a Rollback button
  3. Click Rollback on individual items, or Rollback All for bulk rollback

Automatic Rollback

SmartSRE can automatically trigger rollbacks when:

  • Validation fails — Post-execution health checks detect issues
  • Error rate spikes — Monitoring detects increased errors
  • Circuit breaker activates — Systemic issues detected
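
One way to picture how these three triggers combine is a single decision function. The sketch below is an assumption-laden illustration: the `PostExecutionSignals` fields, the `spike_factor` and `min_rate` thresholds, and the order of the checks are all hypothetical and are not documented SmartSRE behavior.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PostExecutionSignals:
    """Hypothetical inputs gathered after a change is applied."""
    validation_passed: bool      # result of post-execution health checks
    error_rate: float            # current error rate, as a fraction (0.0 to 1.0)
    baseline_error_rate: float   # error rate before the change
    circuit_breaker_open: bool   # True when systemic issues were detected


def auto_rollback_reason(
    signals: PostExecutionSignals,
    spike_factor: float = 2.0,   # assumed: "spike" means 2x the baseline
    min_rate: float = 0.01,      # assumed floor so tiny baselines do not trigger
) -> str | None:
    """Return the reason to roll back automatically, or None to leave the change."""
    if not signals.validation_passed:
        return "validation_failed"
    threshold = max(signals.baseline_error_rate * spike_factor, min_rate)
    if signals.error_rate > threshold:
        return "error_rate_spike"
    if signals.circuit_breaker_open:
        return "circuit_breaker"
    return None
```

Returning a reason rather than a bare boolean makes it easier to record why a rollback was triggered.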

Rollback-Capable Operations

Most SmartSRE operations support rollback:

Service     Operation            Rollback Capability
Cloud Run   Memory/CPU scaling   ✅ Full
Cloud Run   Min/max instances    ✅ Full
BigQuery    Slot capacity        ✅ Full
GCS         Lifecycle rules      ✅ Full
GCS         Delete objects       ❌ None (destructive)
Cloud SQL   Resize instance      ⚠️ Partial
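
If you script around SmartSRE, the capability table can be mirrored as a small lookup so that operations without full rollback support are flagged before execution. This is a sketch: the enum, the dictionary, and both function names are hypothetical, and treating unlisted operations as not rollback-capable is a conservative assumption rather than documented behavior.

```python
from enum import Enum


class RollbackCapability(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


# Mirrors the capability table on this page; keys are (service, operation) pairs.
CAPABILITIES = {
    ("Cloud Run", "Memory/CPU scaling"): RollbackCapability.FULL,
    ("Cloud Run", "Min/max instances"): RollbackCapability.FULL,
    ("BigQuery", "Slot capacity"): RollbackCapability.FULL,
    ("GCS", "Lifecycle rules"): RollbackCapability.FULL,
    ("GCS", "Delete objects"): RollbackCapability.NONE,
    ("Cloud SQL", "Resize instance"): RollbackCapability.PARTIAL,
}


def rollback_capability(service: str, operation: str) -> RollbackCapability:
    """Look up an operation; unlisted operations default to NONE (assumption)."""
    return CAPABILITIES.get((service, operation), RollbackCapability.NONE)


def needs_extra_review(service: str, operation: str) -> bool:
    """Flag anything that cannot be fully rolled back."""
    return rollback_capability(service, operation) is not RollbackCapability.FULL
```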

Rollback States

State         Description
Available     Checkpoint exists, rollback can be triggered
In Progress   Rollback is executing
Completed     Successfully restored previous state
Failed        Error occurred during rollback
Expired       Checkpoint TTL exceeded (72 hours)
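
The states above suggest a simple state machine. The sketch below models one plausible set of transitions. The transition map is an assumption: this page does not say, for example, whether a Failed rollback can be retried. The names `RollbackState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical.

```python
from enum import Enum


class RollbackState(Enum):
    AVAILABLE = "Available"
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"
    FAILED = "Failed"
    EXPIRED = "Expired"


# Assumed transitions: a rollback starts only from AVAILABLE, and
# COMPLETED, FAILED, and EXPIRED are treated as terminal states.
ALLOWED_TRANSITIONS = {
    RollbackState.AVAILABLE: {RollbackState.IN_PROGRESS, RollbackState.EXPIRED},
    RollbackState.IN_PROGRESS: {RollbackState.COMPLETED, RollbackState.FAILED},
    RollbackState.COMPLETED: set(),
    RollbackState.FAILED: set(),
    RollbackState.EXPIRED: set(),
}


def transition(current: RollbackState, target: RollbackState) -> RollbackState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```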

Failed Rollbacks

If a critical rollback fails:

  1. Immediate Alert — Sent to configured notification channels
  2. Manual Steps Provided — UI shows commands to manually restore state
  3. Support Escalation — Contact support if needed

Viewing Failed Rollbacks

Navigate to Operations → Rollback History to see:

  • Failed rollback attempts
  • Error messages
  • Manual remediation steps

Checkpoint Lifecycle

Retention

  • Default TTL: 72 hours
  • Successful rollback: Checkpoint marked as used
  • Expired: Automatically cleaned up
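
The retention rule can be illustrated with a small expiry check and cleanup pass. This is a sketch under stated assumptions: `is_expired` and `purge_expired` are hypothetical helpers, checkpoints are modeled as a plain mapping from ID to creation time, and treating a checkpoint as expired at exactly 72 hours is a boundary choice this page does not specify.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(hours=72)  # default TTL stated on this page


def is_expired(created_at: datetime, now: datetime, ttl: timedelta = DEFAULT_TTL) -> bool:
    """True once the TTL has elapsed (the exact boundary is an assumption)."""
    return now >= created_at + ttl


def purge_expired(
    checkpoints: dict[str, datetime],
    now: datetime,
    ttl: timedelta = DEFAULT_TTL,
) -> dict[str, datetime]:
    """Return only the checkpoints that are still within their TTL."""
    return {
        checkpoint_id: created_at
        for checkpoint_id, created_at in checkpoints.items()
        if not is_expired(created_at, now, ttl)
    }
```

In practice this means a rollback has to be triggered within the TTL window; once the checkpoint is cleaned up, the pre-change state is no longer available from SmartSRE.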

Best Practices

Test Rollbacks in Non-Production

Before relying on rollbacks in production, verify they work in staging.

Monitor Post-Execution

Set up alerts for error rate spikes so rollbacks can be triggered quickly.

Review Rollback History

Regularly check rollback history to identify patterns in failed changes.

Limitations

  • Destructive operations cannot be rolled back (e.g., deleting GCS objects or buckets)
  • External dependencies may not be restored
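  • Time-sensitive data may be affected: a rollback restores the captured resource configuration, not data written while the change was live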

Next Steps