
Error Handling & Resilience

Principles

  • Classify errors consistently and attach rich context for auditability.
  • Prefer bounded retries with jitter for transient error classes; never loop forever.
  • Capture checkpoints before applying changes to enable targeted rollback.
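
The bounded-retry principle above can be sketched as follows. This is a minimal illustration, not the signature of smart_retry in src/utils/error_handling.py: the names retry_with_jitter and TransientError, and the default attempt count and delays, are assumptions made for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def retry_with_jitter(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call func() at most max_attempts times, sleeping a jittered delay between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the error instead of looping
            # Exponential backoff capped at max_delay, then "full jitter":
            # sleep a random fraction of the cap so clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(cap * rng())
```

Only TransientError is retried here; any other exception propagates on the first failure, which matches the fail-fast treatment of non-transient classes. The sleep and rng parameters exist so the delays can be replaced in tests.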

Core building blocks

  • src/utils/error_handling.py: smart_retry, ErrorContext, classify_error, error_recovery_manager.
  • src/services/change_set_executor.py: guardrails enforcement and safe parameter parsing.
  • src/services/agent_job_queue.py: rate and concurrency limiters, status change notifications.

Error taxonomy (example)

  • Network: timeouts, DNS, transient 5xx → retryable.
  • Rate limit / quota: back off and retry within a bounded retry budget.
  • GCP API errors: classify by code; retry only idempotent operations.
  • Validation errors: fail fast; return structured problem details.
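
One way to encode a taxonomy like the one above is a small pure function that maps an exception and an optional HTTP status code to a category and a retry decision. The sketch below is illustrative only and is not the behaviour of classify_error in src/utils/error_handling.py; the category names, the specific status codes treated as transient, and the idempotent flag are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(Enum):
    NETWORK = "network"
    RATE_LIMIT = "rate_limit"
    API = "api"
    VALIDATION = "validation"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Classification:
    category: ErrorCategory
    retryable: bool


# Assumption: these gateway-style codes are treated as transient.
TRANSIENT_STATUS = {502, 503, 504}


def classify(exc, status_code=None, idempotent=False):
    """Map an exception (plus optional HTTP status) to a category and retry decision."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Classification(ErrorCategory.NETWORK, True)
    if isinstance(exc, ValueError) or status_code in (400, 422):
        return Classification(ErrorCategory.VALIDATION, False)  # fail fast
    if status_code == 429:
        return Classification(ErrorCategory.RATE_LIMIT, True)
    if status_code in TRANSIENT_STATUS:
        return Classification(ErrorCategory.NETWORK, True)
    if status_code is not None and status_code >= 500:
        # Other server errors: retry only when repeating the call is safe.
        return Classification(ErrorCategory.API, idempotent)
    if status_code is not None and status_code >= 400:
        return Classification(ErrorCategory.API, False)
    return Classification(ErrorCategory.UNKNOWN, False)
```

Unknown errors default to non-retryable so that an unclassified failure cannot trigger repeated side effects.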

Sequence (retry + recovery)

Idempotency & timeouts

  • Use idempotency keys for multi‑step apply where possible (e.g., change set step hashes).
  • Enforce per‑step and global timeouts (configurable via dynamic config).
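
A rough sketch of how step hashes and timeouts might fit together is shown below. The helpers step_key and apply_steps, the shape of a step as a (params, func) pair, and the timeout defaults are all hypothetical; the sketch also assumes each step function accepts a timeout argument and honours it itself.

```python
import hashlib
import json
import time


def step_key(change_set_id, step_index, params):
    """Derive a stable idempotency key from the step's identity and parameters."""
    payload = json.dumps(
        {"change_set": change_set_id, "step": step_index, "params": params},
        sort_keys=True, separators=(",", ":"),  # canonical form: key order is irrelevant
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_steps(change_set_id, steps, applied, step_timeout=10.0,
                global_timeout=60.0, clock=time.monotonic):
    """Run each (params, func) step once, skipping keys already present in `applied`."""
    deadline = clock() + global_timeout
    for index, (params, func) in enumerate(steps):
        key = step_key(change_set_id, index, params)
        if key in applied:
            continue  # completed in an earlier attempt; do not re-apply
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"global timeout reached before step {index}")
        # The step gets the smaller of its own limit and the remaining global budget.
        func(params, timeout=min(step_timeout, remaining))
        applied.add(key)
    return applied
```

Because the key depends only on the change set, step position, and parameters, re-running apply_steps with the same applied set after a partial failure resumes at the first step that did not complete. In practice applied would need to live in durable storage rather than an in-memory set.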

Audit & metrics

  • Record success, failure, and timeout counts per agent and per service.
  • Persist audit events with context (tenant, service, action, duration, error category).
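
The two points above could be combined in a single wrapper that counts outcomes and emits one audit record per call. The following is a sketch under assumptions: AuditEvent, run_audited, the in-memory Counter, and the sink callback are hypothetical stand-ins, and the exception class name is used as a placeholder for a real error category.

```python
import json
import time
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    tenant: str
    service: str
    action: str
    outcome: str  # "success", "failed", or "timeout"
    duration_s: float
    error_category: Optional[str] = None


# In-memory stand-in for a metrics backend, keyed by (agent, service, outcome).
metrics = Counter()


def run_audited(tenant, agent, service, action, func, sink, clock=time.monotonic):
    """Run func(), then record one metric and one audit event whatever the outcome."""
    start = clock()
    outcome, category = "success", None
    try:
        return func()
    except TimeoutError:
        outcome, category = "timeout", "timeout"
        raise
    except Exception as exc:
        # Placeholder category: a real system would use its own error classifier.
        outcome, category = "failed", type(exc).__name__
        raise
    finally:
        event = AuditEvent(tenant, service, action, outcome, clock() - start, category)
        metrics[(agent, service, outcome)] += 1
        sink(json.dumps(asdict(event), sort_keys=True))
```

The finally block guarantees that every call produces exactly one metric increment and one audit record, and the original exception is re-raised so callers still see the failure.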