Background Workflows
Queues & Workers
- AgentJobQueue
  - In‑process asyncio queue for advise/remediate jobs.
  - Provides per‑tenant rate limiting, concurrency limits, de‑duplication and stale job reaping.
  - Implemented in src/services/agent_job_queue.py and used by the /agents/advise and /agents/apply flows.
- AgentScheduler
  - Optional periodic scheduler that enqueues jobs into AgentJobQueue based on tenant scopes and inventory.
  - Respects maintenance/freeze windows and per‑resource intervals.
  - Implemented in src/services/agent_scheduler.py.
- Cloud Tasks / Scheduler (optional)
  - For managed execution in some deployments, Cloud Tasks / Scheduler can trigger internal endpoints.
  - Internal routes (/internal/*) are protected via OIDC (see runtime-flow.md).
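As an illustration of the queue behaviour described above, here is a minimal sketch of an in‑process asyncio queue with de‑duplication and a per‑tenant concurrency limit. It is not the code in src/services/agent_job_queue.py: the class TinyJobQueue, its methods, and the job keys are hypothetical, and rate limiting and stale job reaping are left out for brevity.

```python
import asyncio
from collections import defaultdict


class TinyJobQueue:
    """Hypothetical stand-in for AgentJobQueue; all names are assumptions."""

    def __init__(self, per_tenant_concurrency=1):
        self._queue = asyncio.Queue()
        self._pending_keys = set()
        # One semaphore per tenant, created lazily on first use.
        self._limits = defaultdict(
            lambda: asyncio.Semaphore(per_tenant_concurrency)
        )

    def enqueue(self, tenant_id, job_key, handler):
        """Queue a job unless the same (tenant, key) job is already pending."""
        key = (tenant_id, job_key)
        if key in self._pending_keys:
            return False  # de-duplicated
        self._pending_keys.add(key)
        self._queue.put_nowait((key, handler))
        return True

    async def worker(self):
        """Run jobs forever, holding the tenant's semaphore while a job runs."""
        while True:
            key, handler = await self._queue.get()
            try:
                async with self._limits[key[0]]:
                    await handler()
            finally:
                self._pending_keys.discard(key)
                self._queue.task_done()

    async def drain(self, workers=2):
        """Start workers, wait until the queue is empty, then stop them."""
        tasks = [asyncio.create_task(self.worker()) for _ in range(workers)]
        await self._queue.join()
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    ran = []

    async def advise():
        ran.append("advise")

    queue = TinyJobQueue()
    first = queue.enqueue("tenant-a", "advise:vm-1", advise)
    second = queue.enqueue("tenant-a", "advise:vm-1", advise)  # duplicate
    await queue.drain()
    return [first, second, ran]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The second enqueue call is rejected because an identical job is still pending, so the handler runs once. Keying the semaphore by tenant keeps one busy tenant from occupying every worker.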
Job Lifecycle
AgentJobQueue persists job state in the database via AgentRun records. On startup, it runs a recovery pass that marks stale running jobs as failed, so workers can resume safely without double‑executing changes.
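The recovery pass can be sketched as follows. This is an assumption‑laden illustration, not the actual implementation: the AgentRunRow fields, the status strings, the heartbeat timestamp, and the 15‑minute threshold are all hypothetical, and an in‑memory list stands in for the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AgentRunRow:
    """Hypothetical in-memory stand-in for an AgentRun record."""
    run_id: str
    status: str  # assumed values: "queued", "running", "succeeded", "failed"
    heartbeat_at: datetime
    error: str = ""


def recover_stale_runs(runs, now, stale_after=timedelta(minutes=15)):
    """Mark running jobs whose last heartbeat is older than stale_after as failed.

    Returns the ids of the runs that were reaped.
    """
    reaped = []
    for run in runs:
        if run.status == "running" and now - run.heartbeat_at > stale_after:
            run.status = "failed"
            run.error = "stale: no heartbeat before restart"
            reaped.append(run.run_id)
    return reaped


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    runs = [
        AgentRunRow("a", "running", now - timedelta(hours=1)),
        AgentRunRow("b", "running", now - timedelta(minutes=1)),
    ]
    print(recover_stale_runs(runs, now))
```

Failing a stale run instead of re‑queuing it means a half‑applied change is never retried blindly; a later job or an operator decides whether to roll back or re‑apply.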
Rollback Strategy
Principles
- Deterministic Reversal: Every ChangeStep must be reversible. If a step is irreversible (e.g., deleting a KMS key without soft-delete), it requires explicit "High Severity" approval.
- State-Based vs. Action-Based:
  - Action-Based: For simple API calls (e.g., add_iam_binding), the rollback is remove_iam_binding.
  - State-Based (Snapshots): For data-heavy resources (BigQuery Tables, Cloud SQL Databases), we DO NOT attempt to serialize the data into the Postgres checkpoint. Instead, we rely on native cloud snapshots (Time Travel).
These principles are implemented centrally in RollbackManager (src/rollback/rollback_manager.py), which consumes checkpoints produced during apply flows and generates compensating steps.
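A rough sketch of the action‑based case: given the steps that were applied, produce inverse steps in reverse order. The Step class, the INVERSE_ACTIONS table, and compensating_steps are hypothetical names used only for illustration; only the add_iam_binding / remove_iam_binding pair comes from the text above, and the real RollbackManager may be structured quite differently.

```python
from dataclasses import dataclass, field

# Assumed mapping from a forward action to its inverse. Only the first pair
# is taken from the document; the reverse direction is an assumption.
INVERSE_ACTIONS = {
    "add_iam_binding": "remove_iam_binding",
    "remove_iam_binding": "add_iam_binding",
}


@dataclass
class Step:
    """Hypothetical, simplified stand-in for a ChangeStep."""
    action: str
    params: dict = field(default_factory=dict)


def compensating_steps(applied):
    """Return inverse steps, newest first, for a list of applied steps."""
    plan = []
    for step in reversed(applied):
        inverse = INVERSE_ACTIONS.get(step.action)
        if inverse is None:
            # No deterministic reversal is known for this action.
            raise ValueError(f"no known inverse for {step.action!r}")
        plan.append(Step(inverse, dict(step.params)))
    return plan


if __name__ == "__main__":
    applied = [
        Step("add_iam_binding", {"member": "user:a"}),
        Step("add_iam_binding", {"member": "user:b"}),
    ]
    for step in compensating_steps(applied):
        print(step.action, step.params)
```

Undoing steps in reverse order matters when later steps depend on earlier ones. Raising on an unknown action surfaces an irreversible step before any rollback starts.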
BigQuery and Data‑Heavy Resources
- Checkpoint: For BigQuery tables (and similar stateful resources), RollbackCheckpoint captures a snapshot reference (e.g., snapshot_time for Time Travel or snapshot_table_id for a physical clone) instead of copying raw data into Postgres.
- Restore: RollbackManager and, in workflow contexts, RollbackNode use that checkpoint to issue restore operations (e.g., CREATE OR REPLACE TABLE ... CLONE ... FOR SYSTEM_TIME AS OF ...).
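To make the checkpoint‑to‑restore step concrete, the sketch below turns a snapshot reference into a restore statement of the shape quoted above. CheckpointRef and build_restore_sql are hypothetical; the field names snapshot_time and snapshot_table_id follow the text, but their types and the table‑id validation are assumptions. The sketch only builds a string, and the exact SQL should be checked against the BigQuery documentation before use.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed shape of a fully qualified table id: project.dataset.table
_TABLE_ID = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


@dataclass
class CheckpointRef:
    """Hypothetical subset of the fields a RollbackCheckpoint might carry."""
    table_id: str
    snapshot_time: Optional[datetime] = None   # Time Travel point
    snapshot_table_id: Optional[str] = None    # physical clone to restore from


def _checked(table_id):
    """Reject ids that do not look like project.dataset.table."""
    if not _TABLE_ID.match(table_id):
        raise ValueError(f"unexpected table id: {table_id!r}")
    return table_id


def build_restore_sql(cp):
    """Build a restore statement from whichever snapshot reference is present."""
    target = _checked(cp.table_id)
    if cp.snapshot_table_id:
        source = _checked(cp.snapshot_table_id)
        return f"CREATE OR REPLACE TABLE `{target}` CLONE `{source}`"
    if cp.snapshot_time:
        ts = cp.snapshot_time.isoformat()
        return (
            f"CREATE OR REPLACE TABLE `{target}` CLONE `{target}` "
            f"FOR SYSTEM_TIME AS OF TIMESTAMP '{ts}'"
        )
    raise ValueError("checkpoint has no snapshot reference")


if __name__ == "__main__":
    cp = CheckpointRef(
        table_id="proj.ds.events",
        snapshot_time=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    )
    print(build_restore_sql(cp))
```

Because the checkpoint stores only a timestamp or a table id, it stays small regardless of table size, which is the reason for keeping raw data out of Postgres.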
Limitations
- Schema Drift: Rollback assumes the current schema is still compatible with the snapshot.
- External Dependencies: We cannot roll back side effects (e.g., emails sent, webhooks triggered).
Notifications
- Real-time notifications are emitted on job and rollback state changes so the UI can update live via WebSockets.