Error Handling & Resilience
Principles
- Classify errors consistently and attach rich context for auditability.
- Prefer bounded retries with jitter for transient classes; never loop forever.
- Capture checkpoints before applying changes to enable targeted rollback.
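The bounded-retry principle above can be sketched as follows. This is a minimal illustration, not the repository's smart_retry: the names retry_with_jitter and TransientError, the default limits, and the choice of "full jitter" (a uniform random delay up to an exponentially growing cap) are all assumptions made for the example.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def retry_with_jitter(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func at most max_attempts times, sleeping a random delay between tries.

    Exceptions not listed in retry_on propagate immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # Exponential cap (base, 2*base, 4*base, ...) bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: pick uniformly in [0, cap] to spread out retries.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Because the loop is bounded by max_attempts, a persistently failing call ends by re-raising the last error rather than looping forever. The injectable sleep parameter exists only to make the sketch easy to test.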
Core building blocks
- src/utils/error_handling.py: smart_retry, ErrorContext, classify_error, error_recovery_manager.
- src/services/change_set_executor.py: guardrails enforcement and safe parameter parsing.
- src/services/agent_job_queue.py: rate and concurrency limiters, status change notifications.
Error taxonomy (example)
- Network: timeouts, DNS, transient 5xx → retryable.
- Rate limit / quota: back off and retry within a retry budget.
- GCP API errors: classify by code; retry idempotent operations.
- Validation errors: fail fast; return structured problem details.
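One way to encode the example taxonomy above is sketched below. It is illustrative only and does not reproduce classify_error: the ErrorCategory enum, the categorize, is_retryable, and problem_details helpers, and the mapping of Python exception types and HTTP status codes to categories are all assumptions. The problem_details shape loosely follows the "problem details" JSON convention (type, title, status, detail).

```python
import socket
from enum import Enum
from typing import Any, Dict, Optional


class ErrorCategory(Enum):
    NETWORK = "network"
    RATE_LIMIT = "rate_limit"
    API = "api"
    VALIDATION = "validation"
    UNKNOWN = "unknown"


def categorize(exc: BaseException, http_status: Optional[int] = None) -> ErrorCategory:
    """Map an exception (plus an optional HTTP status) to a category."""
    if isinstance(exc, ValueError):
        return ErrorCategory.VALIDATION
    if http_status == 429:
        return ErrorCategory.RATE_LIMIT
    # Timeouts, dropped connections, and DNS failures.
    if isinstance(exc, (TimeoutError, ConnectionError, socket.gaierror)):
        return ErrorCategory.NETWORK
    if http_status is not None and 500 <= http_status < 600:
        return ErrorCategory.NETWORK  # transient 5xx is listed under network
    if http_status is not None and 400 <= http_status < 500:
        return ErrorCategory.API
    return ErrorCategory.UNKNOWN


def is_retryable(category: ErrorCategory, idempotent: bool) -> bool:
    """Decide whether a retry is allowed for this category."""
    if category in (ErrorCategory.NETWORK, ErrorCategory.RATE_LIMIT):
        return True
    if category is ErrorCategory.API:
        # Assumption: a real classifier would also inspect the specific code.
        return idempotent
    return False  # validation and unknown errors fail fast


def problem_details(exc: ValueError, status: int = 422) -> Dict[str, Any]:
    """Build a structured body for a validation failure."""
    return {
        "type": "about:blank",
        "title": "Validation failed",
        "status": status,
        "detail": str(exc),
    }
```

Keeping categorization separate from the retry decision lets the same category feed both the retry policy and the audit record.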
Sequence (retry + recovery)
Idempotency & timeouts
- Use idempotency keys for multi‑step apply where possible (e.g., change set step hashes).
- Enforce per‑step and global timeouts (configurable via dynamic config).
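The two points above can be combined in a small sketch. Everything here is hypothetical: step_key, apply_steps, the in-memory applied set, and the default timeout values are stand-ins, and the per-step timeout is cooperative, meaning the apply_step callable is assumed to honor the timeout it is given (for example by passing it to an HTTP client).

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, List, Set


def step_key(change_set_id: str, step: Dict[str, Any]) -> str:
    """Derive a stable idempotency key by hashing the canonical step JSON."""
    payload = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{change_set_id}:{payload}".encode("utf-8")).hexdigest()


def apply_steps(
    change_set_id: str,
    steps: List[Dict[str, Any]],
    apply_step: Callable[[Dict[str, Any], float], None],
    applied: Set[str],
    step_timeout: float = 30.0,
    global_timeout: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Apply steps in order, skipping any whose key is already in applied.

    Returns the keys applied during this call. Raises TimeoutError once the
    global deadline has passed.
    """
    deadline = clock() + global_timeout
    newly_applied: List[str] = []
    for step in steps:
        key = step_key(change_set_id, step)
        if key in applied:
            continue  # done in an earlier attempt; re-running is a no-op
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("global timeout exceeded before step " + key[:12])
        # The step never gets more time than the global budget has left.
        apply_step(step, min(step_timeout, remaining))
        applied.add(key)
        newly_applied.append(key)
    return newly_applied
```

Re-running the same change set after a partial failure then repeats only the steps whose keys were never recorded. In a real deployment the applied set would need to live in durable storage rather than in memory.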
Audit & metrics
- Record success, failure, and timeout metrics per agent and service.
- Persist audit events with context (tenant, service, action, duration, error category).
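A minimal sketch of the audit and metrics points above, using in-memory stand-ins. The AuditEvent fields mirror the context listed above; the Recorder class, the run_audited helper, the outcome labels, and the use of the exception class name as the error category are assumptions for illustration (a real system would plug in its own classifier and a durable sink).

```python
import json
import time
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class AuditEvent:
    tenant: str
    service: str
    action: str
    outcome: str  # "success", "failed", or "timeout"
    duration_s: float
    error_category: Optional[str] = None


class Recorder:
    """In-memory stand-in for a metrics backend and an audit log."""

    def __init__(self) -> None:
        self.metrics: Counter = Counter()
        self.events: List[str] = []

    def record(self, agent: str, event: AuditEvent) -> None:
        # One counter per (agent, service, outcome) combination.
        self.metrics[(agent, event.service, event.outcome)] += 1
        self.events.append(json.dumps(asdict(event), sort_keys=True))


def run_audited(
    recorder: Recorder,
    agent: str,
    tenant: str,
    service: str,
    action: str,
    func: Callable[[], T],
) -> T:
    """Run func, then record its outcome and duration whether or not it raised."""
    start = time.monotonic()
    outcome, category = "success", None
    try:
        return func()
    except TimeoutError as exc:
        outcome, category = "timeout", type(exc).__name__
        raise
    except Exception as exc:
        # Placeholder category: the exception class name.
        outcome, category = "failed", type(exc).__name__
        raise
    finally:
        duration = time.monotonic() - start
        recorder.record(
            agent, AuditEvent(tenant, service, action, outcome, duration, category)
        )
```

Recording in the finally block means every call produces exactly one audit event and one counter increment, including calls that raise.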