Background Workflows

Queues & Workers

  • AgentJobQueue

    • In‑process asyncio queue for advise/remediate jobs.
    • Provides per‑tenant rate limiting, concurrency limits, de‑duplication and stale job reaping.
    • Implemented in src/services/agent_job_queue.py and used by the /agents/advise and /agents/apply flows.
  • AgentScheduler

    • Optional periodic scheduler that enqueues jobs into AgentJobQueue based on tenant scopes and inventory.
    • Respects maintenance/freeze windows and per‑resource intervals.
    • Implemented in src/services/agent_scheduler.py.
  • Cloud Tasks / Scheduler (optional)

    • For managed execution in some deployments, Cloud Tasks / Scheduler can trigger internal endpoints.
    • Internal routes (/internal/*) are protected via OIDC (see runtime-flow.md).
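
As a rough illustration of the AgentJobQueue behavior described above, the sketch below shows an in-process asyncio queue with de-duplication and a per-tenant concurrency cap. It is not the code in src/services/agent_job_queue.py: the class name SketchJobQueue, the (tenant_id, resource_id, kind) job key, and the per_tenant_limit parameter are all assumptions made for the example, and rate limiting, stale job reaping, and error handling are left out.

```python
import asyncio
from collections import defaultdict


class SketchJobQueue:
    """Illustrative in-process queue; a hypothetical stand-in, not AgentJobQueue."""

    def __init__(self, per_tenant_limit: int = 2) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        # Keys of jobs that are queued or running, used for de-duplication.
        self._pending_keys: set = set()
        # One semaphore per tenant caps that tenant's concurrent jobs.
        self._limits = defaultdict(lambda: asyncio.Semaphore(per_tenant_limit))

    def enqueue(self, tenant_id: str, resource_id: str, kind: str) -> bool:
        """Queue a job; return False if an identical job is already pending."""
        key = (tenant_id, resource_id, kind)
        if key in self._pending_keys:
            return False
        self._pending_keys.add(key)
        self._queue.put_nowait(key)
        return True

    async def worker(self, handler) -> None:
        """Pull jobs until cancelled, respecting the per-tenant cap."""
        while True:
            key = await self._queue.get()
            tenant_id = key[0]
            try:
                async with self._limits[tenant_id]:
                    await handler(*key)
            finally:
                # Release the key so the same job can be enqueued again later.
                self._pending_keys.discard(key)
                self._queue.task_done()
```

Create the queue inside a running event loop and start one or more worker tasks with asyncio.create_task(queue.worker(handler)). A second enqueue call for the same key returns False until the first job finishes.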

Job Lifecycle

AgentJobQueue persists job state in the database as AgentRun records. On startup, it runs a recovery pass that marks stale running jobs as failed, so workers can resume safely without executing a change twice.
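
A minimal sketch of such a recovery pass, working on in-memory records rather than the database. RunRecord is a hypothetical stand-in for an AgentRun row, and the heartbeat_at field, the status strings, and the max_age threshold are assumptions; the text does not say how staleness is actually detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass
class RunRecord:
    """Hypothetical stand-in for an AgentRun row."""
    run_id: str
    status: str  # assumed values: "queued", "running", "succeeded", "failed"
    heartbeat_at: datetime
    error: str = ""


def reap_stale_runs(
    runs: Iterable[RunRecord], now: datetime, max_age: timedelta
) -> List[str]:
    """Mark running records with an old heartbeat as failed; return their ids."""
    reaped = []
    for run in runs:
        if run.status == "running" and now - run.heartbeat_at > max_age:
            run.status = "failed"
            run.error = "stale: no recent heartbeat at recovery time"
            reaped.append(run.run_id)
    return reaped
```

Marking the record as failed, rather than re-running it, is what avoids a second execution of a change that may already have been partly applied.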


Rollback Strategy

Principles

  1. Deterministic Reversal: Every ChangeStep should be reversible. A step that cannot be reversed (e.g., deleting a KMS key without soft-delete) requires explicit "High Severity" approval before it runs.
  2. State-Based vs. Action-Based:
    • Action-Based: For simple API calls (e.g., add_iam_binding), the rollback is remove_iam_binding.
    • State-Based (Snapshots): For data-heavy resources (BigQuery tables, Cloud SQL databases), we do not serialize the data into the Postgres checkpoint. Instead, we rely on native cloud snapshots (e.g., BigQuery Time Travel).

These principles are implemented centrally in RollbackManager (src/rollback/rollback_manager.py), which consumes checkpoints produced during apply flows and generates compensating steps.
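
The action-based case can be sketched as a lookup from each applied action to its inverse, replayed in reverse order. This is not the RollbackManager implementation: the Step class, the INVERSE_ACTIONS table, and IrreversibleStepError are hypothetical names, and only the add_iam_binding / remove_iam_binding pair comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical inverse table; only the IAM binding pair is taken from the text.
INVERSE_ACTIONS: Dict[str, str] = {
    "add_iam_binding": "remove_iam_binding",
    "remove_iam_binding": "add_iam_binding",
}


@dataclass
class Step:
    """Hypothetical, simplified stand-in for a ChangeStep."""
    action: str
    params: Dict[str, str] = field(default_factory=dict)


class IrreversibleStepError(Exception):
    """Raised when no inverse action is known for a step."""


def compensating_steps(applied: List[Step]) -> List[Step]:
    """Return the inverse of each applied step, last-applied first."""
    plan = []
    for step in reversed(applied):
        inverse = INVERSE_ACTIONS.get(step.action)
        if inverse is None:
            raise IrreversibleStepError(step.action)
        # The inverse reuses the original parameters (same resource, same member).
        plan.append(Step(inverse, dict(step.params)))
    return plan
```

Undoing the steps in reverse order means later changes that depended on earlier ones are removed first.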

BigQuery and Data‑Heavy Resources

  • Checkpoint: For BigQuery tables (and similar stateful resources), RollbackCheckpoint captures a snapshot reference (e.g., snapshot_time for Time Travel or snapshot_table_id for a physical clone) instead of copying raw data into Postgres.
  • Restore: RollbackManager and, in workflow contexts, RollbackNode use that checkpoint to issue restore operations (e.g., CREATE OR REPLACE TABLE ... CLONE ... FOR SYSTEM_TIME AS OF ...).
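
To make the checkpoint-to-restore step concrete, here is a hedged sketch that turns a snapshot reference into a restore statement of the shape quoted above. TableCheckpoint and build_restore_sql are hypothetical names, not the fields or methods of RollbackCheckpoint, and the sketch writes the restored data to an explicitly named target table because the text does not say which destination the real restore uses.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Conservative check for "project.dataset.table" style identifiers.
_TABLE_ID = re.compile(r"^[A-Za-z0-9_\-]+(\.[A-Za-z0-9_\-]+){1,2}$")


@dataclass
class TableCheckpoint:
    """Hypothetical snapshot reference; holds no table data."""
    table_id: str
    snapshot_time: Optional[datetime] = None   # Time Travel timestamp (tz-aware)
    snapshot_table_id: Optional[str] = None    # physical clone, if one was made


def build_restore_sql(cp: TableCheckpoint, target_table_id: str) -> str:
    """Build a CREATE OR REPLACE TABLE ... CLONE statement from a checkpoint."""
    for ident in (cp.table_id, target_table_id, cp.snapshot_table_id):
        if ident is not None and not _TABLE_ID.match(ident):
            raise ValueError(f"unexpected table identifier: {ident!r}")
    if cp.snapshot_table_id:
        return f"CREATE OR REPLACE TABLE `{target_table_id}` CLONE `{cp.snapshot_table_id}`"
    if cp.snapshot_time is None or cp.snapshot_time.tzinfo is None:
        raise ValueError("checkpoint needs a tz-aware snapshot_time or a snapshot_table_id")
    ts = cp.snapshot_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00:00")
    return (
        f"CREATE OR REPLACE TABLE `{target_table_id}` CLONE `{cp.table_id}` "
        f"FOR SYSTEM_TIME AS OF TIMESTAMP '{ts}'"
    )
```

The function only builds the statement; running it against BigQuery, and checking that the timestamp still falls inside the table's Time Travel window, is left to the caller.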

Limitations

  • Schema Drift: Rollback assumes the schema is compatible with the snapshot.
  • External Dependencies: We cannot roll back side effects (e.g., emails sent, webhooks triggered).

Notifications

  • Realtime notifications are emitted on job and rollback state changes so the UI can update live via WebSockets.
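
One way to picture the notification fan-out is a small in-process broadcaster that hands each connected client its own queue. This is only a sketch under assumptions: StateChangeBroadcaster, the message fields (kind, id, state), and the idea of one asyncio.Queue per WebSocket connection are illustrative, and the WebSocket handler that would drain each queue is not shown.

```python
import asyncio
import json
from typing import Set


class StateChangeBroadcaster:
    """Hypothetical fan-out of state changes to per-connection queues."""

    def __init__(self) -> None:
        self._subscribers: Set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        """Register a new listener, e.g. one per WebSocket connection."""
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        """Drop a listener when its connection closes."""
        self._subscribers.discard(queue)

    def publish(self, kind: str, entity_id: str, state: str) -> int:
        """Send one JSON message to every listener; return the listener count."""
        message = json.dumps({"kind": kind, "id": entity_id, "state": state})
        for queue in self._subscribers:
            queue.put_nowait(message)
        return len(self._subscribers)
```

A connection handler would call subscribe() when a client connects, forward each message from its queue to the socket, and call unsubscribe() on disconnect.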