Error Handling & Resilience

Principles

Core building blocks

src/utils/error_handling.py: smart_retry, ErrorContext, classify_error, error_recovery_manager.
src/services/change_set_executor.py: guardrails enforcement and safe parameter parsing.
src/services/agent_job_queue.py: rate and concurrency limiters, status change notifications.
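
As a rough illustration of how retry and error classification can fit together, here is a minimal sketch. It does not reproduce the actual signatures of smart_retry or classify_error, which are not shown here. The names categorize, retry_call, and ErrorCategory, and the rule that timeouts and connection errors count as transient, are assumptions made for the example.

```python
import random
import time
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"   # worth retrying
    PERMANENT = "permanent"   # retrying will not help


def categorize(exc: Exception) -> ErrorCategory:
    """Hypothetical classifier: timeouts and connection errors are transient."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorCategory.TRANSIENT
    return ErrorCategory.PERMANENT


def retry_call(func, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on permanent errors or once the attempt budget is spent.
            if categorize(exc) is ErrorCategory.PERMANENT or attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The sleep parameter is injectable so the backoff can be skipped in tests. Permanent errors are re-raised immediately, so the attempt budget is spent only on failures that might succeed on a later try.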

Error taxonomy (example)

Sequence (retry + recovery)

Idempotency & timeouts

Use idempotency keys for multi‑step apply operations where possible (e.g., a hash of each change set step), so that a retried step is not applied twice.
Enforce per‑step and global timeouts (configurable via dynamic config).
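
The two points above can be sketched together as follows. This is an illustration under stated assumptions, not the executor's actual code: step_key, apply_change_set, the applied_keys set, and the default timeout values are hypothetical, and apply_step is assumed to honor the timeout it is given. In a real deployment the timeouts would come from dynamic config and the applied keys from durable storage.

```python
import hashlib
import json
import time


def step_key(change_set_id: str, step: dict) -> str:
    """Derive a stable idempotency key from the change set ID and step content."""
    # sort_keys makes the hash independent of dict key order.
    payload = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{change_set_id}:{payload}".encode()).hexdigest()


def apply_change_set(change_set_id, steps, apply_step, applied_keys,
                     step_timeout=5.0, global_timeout=30.0,
                     clock=time.monotonic):
    """Apply steps in order, skipping applied ones and enforcing both timeouts."""
    deadline = clock() + global_timeout
    for step in steps:
        key = step_key(change_set_id, step)
        if key in applied_keys:
            continue  # already applied on an earlier attempt
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"global timeout exceeded before step {key[:8]}")
        # The step gets the smaller budget: its own limit or what is left overall.
        apply_step(step, timeout=min(step_timeout, remaining))
        applied_keys.add(key)
```

Because the key is recorded only after apply_step returns, a step that fails or times out is attempted again on the next run, while completed steps are skipped.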

Audit & metrics

Record success, failure, and timeout metrics per agent and per service.
Persist audit events with context (tenant, service, action, duration, error category).
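
A minimal sketch of recording both a metric and an audit event around a single action is shown below. The AuditEvent fields mirror the context listed above. Everything else is an assumption for illustration: run_audited is a hypothetical wrapper, the in-memory metrics counter and audit_log list stand in for a real metrics backend and persistent store, and using the exception class name as the error category is a placeholder for a proper classifier.

```python
import json
import time
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AuditEvent:
    tenant: str
    service: str
    action: str
    duration_s: float
    outcome: str                        # "success", "failed", or "timeout"
    error_category: Optional[str] = None


metrics = Counter()   # in-memory stand-in for a real metrics backend
audit_log = []        # in-memory stand-in for persistent audit storage


def run_audited(tenant, service, agent, action, func):
    """Run func(), then record one metric and one audit event for the outcome."""
    start = time.monotonic()
    outcome, category = "success", None
    try:
        return func()
    except TimeoutError:
        outcome, category = "timeout", "timeout"
        raise
    except Exception as exc:
        # Placeholder category: the exception class name.
        outcome, category = "failed", type(exc).__name__
        raise
    finally:
        # Runs on every path, so failures are recorded as reliably as successes.
        duration = time.monotonic() - start
        metrics[(agent, service, outcome)] += 1
        event = AuditEvent(tenant, service, action, duration, outcome, category)
        audit_log.append(json.dumps(asdict(event)))
```

Recording in the finally clause keeps the bookkeeping in one place and leaves the original exception to propagate unchanged to the caller.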