Auth Error Alerting (Log-Based Metrics)

This note describes how to create log-based metrics and alerts for auth-related error codes so incidents (e.g., DB outages impacting login) surface quickly.

Logging Shape

Backend auth routes emit structured logs and HTTP errors with:

  • detail.code in the HTTP JSON body (see Resources → Auth Error Codes).
  • error_code in the structured log fields for most 5xx paths.

Example response:

HTTP/1.1 500
{
  "detail": {
    "status": "error",
    "code": "auth_login_db_error",
    "message": "Database error during login"
  }
}

Example log field (Cloud Logging):

  • jsonPayload.error_code="auth_login_db_error"
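As a rough illustration of how one error could produce both shapes shown above, here is a minimal Python sketch. The helper names (auth_error_body, auth_error_log_line) are hypothetical and not taken from the backend code. It assumes the service writes JSON lines to stdout or stderr, which Cloud Run forwards to Cloud Logging as structured entries.

```python
import json
import sys


def auth_error_body(code: str, message: str) -> dict:
    """Build the HTTP JSON error body in the shape shown above (hypothetical helper)."""
    return {"detail": {"status": "error", "code": code, "message": message}}


def auth_error_log_line(code: str, message: str) -> str:
    """Build one JSON log line that carries the same code as error_code.

    "severity" is a special field that Cloud Logging lifts into the entry's
    severity; other keys, including error_code, appear under jsonPayload.
    """
    return json.dumps({"severity": "ERROR", "message": message, "error_code": code})


code = "auth_login_db_error"
message = "Database error during login"

# The log line goes to stderr; the body would be returned with the 500 response.
print(auth_error_log_line(code, message), file=sys.stderr)
print(json.dumps(auth_error_body(code, message), indent=2))
```

Using the same string for detail.code and error_code keeps the client-visible code and the log filter in step, which is what the metrics rely on.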

Create log-based metrics per auth error code in Cloud Logging:

  • Metric: auth/login_db_error_count

    • Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_login_db_error"
    • Value: 1 per matching log entry.
  • Metric: auth/login_internal_error_count

    • Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_login_internal_error"
  • Metric: auth/refresh_db_error_count

    • Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_refresh_db_error"
  • Metric: auth/webauthn_db_error_count

    • Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_webauthn_db_error"
  • Metric: auth/google_db_error_count

    • Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_google_db_error"

You can add similar metrics for:

  • auth_webauthn_internal_error
  • auth_google_internal_error
  • auth_api_key_internal_error
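Since every metric above follows the same naming and filter pattern, the definitions can be generated rather than typed by hand. The sketch below only builds names and filter strings; it does not call any Cloud Logging API. The function names are hypothetical, and the naming rule for the three additional codes (strip the auth_ prefix, add _count) is an assumption extrapolated from the metrics listed above.

```python
RESOURCE_TYPE = "cloud_run_revision"

# Error codes listed in this note, including the three suggested additions.
AUTH_ERROR_CODES = [
    "auth_login_db_error",
    "auth_login_internal_error",
    "auth_refresh_db_error",
    "auth_webauthn_db_error",
    "auth_google_db_error",
    "auth_webauthn_internal_error",
    "auth_google_internal_error",
    "auth_api_key_internal_error",
]


def metric_name(error_code: str) -> str:
    """Map auth_login_db_error to auth/login_db_error_count."""
    if not error_code.startswith("auth_"):
        raise ValueError(f"not an auth error code: {error_code}")
    return "auth/" + error_code[len("auth_"):] + "_count"


def metric_filter(error_code: str) -> str:
    """Build the Cloud Logging filter used for one error code."""
    return f'resource.type="{RESOURCE_TYPE}" AND jsonPayload.error_code="{error_code}"'


def metric_definitions(codes):
    """Return one {name, filter} pair per error code."""
    return [{"name": metric_name(c), "filter": metric_filter(c)} for c in codes]


for definition in metric_definitions(AUTH_ERROR_CODES):
    print(definition["name"], "<-", definition["filter"])
```

The resulting pairs can then be entered in the console or passed to whatever tooling the team uses to manage log-based metrics.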

Alert Policies

For each metric:

  • Condition: the metric is above 0 over a short window (e.g., 5 minutes) AND cluster health reports the DB as degraded or unhealthy.
  • Notify: SRE on-call and incident channel.

Suggested thresholds:

  • auth/login_db_error_count:

    • Critical: >= 5 events in 5 minutes.
    • Warning: >= 1 event in 5 minutes.
  • auth/refresh_db_error_count:

    • Critical: >= 10 events in 5 minutes (indicates many sessions failing to refresh).
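The suggested thresholds can be read as a small lookup table. The following sketch shows that logic in plain Python, for example to sanity-check a policy before configuring it; the classify function and the THRESHOLDS table are hypothetical names, and the real evaluation would happen in the alerting system rather than in application code.

```python
from typing import Optional

# Thresholds from the list above, as events per 5-minute window.
# None means this note gives no value for that level.
THRESHOLDS = {
    "auth/login_db_error_count": {"critical": 5, "warning": 1},
    "auth/refresh_db_error_count": {"critical": 10, "warning": None},
}


def classify(metric: str, events_in_window: int) -> Optional[str]:
    """Return "critical", "warning", or None for a 5-minute event count."""
    levels = THRESHOLDS[metric]
    # Check the most severe level first so it wins when both match.
    for level in ("critical", "warning"):
        limit = levels.get(level)
        if limit is not None and events_in_window >= limit:
            return level
    return None


print(classify("auth/login_db_error_count", 3))     # warning
print(classify("auth/refresh_db_error_count", 12))  # critical
```

This covers only the count thresholds; the DB health part of the condition is not modeled here.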

Dashboards

Add a small “Auth Health” widget group to the main SRE dashboard:

  • Time series for:
    • auth/login_db_error_count
    • auth/login_internal_error_count
    • auth/refresh_db_error_count
  • Stacked bar or heatmap by error_code for quick root-cause hints during incidents.
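A possible starting point for the time series widgets is sketched below as Python that builds widget definitions as plain dicts. The field names follow the Cloud Monitoring dashboard JSON layout as best understood here and should be checked against the current API reference before use; the function names, the 5-minute alignment window, and the ALIGN_SUM aligner are assumptions made for illustration. User-defined log-based metrics are exposed under the logging.googleapis.com/user/ prefix.

```python
import json

AUTH_METRICS = [
    "auth/login_db_error_count",
    "auth/login_internal_error_count",
    "auth/refresh_db_error_count",
]


def metric_type(name: str) -> str:
    """Full metric type for a user-defined log-based metric."""
    return f"logging.googleapis.com/user/{name}"


def time_series_widget(name: str, window_seconds: int = 300) -> dict:
    """Build one line-chart widget that sums events per alignment window."""
    return {
        "title": name,
        "xyChart": {
            "dataSets": [
                {
                    "plotType": "LINE",
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": f'metric.type="{metric_type(name)}"',
                            "aggregation": {
                                "alignmentPeriod": f"{window_seconds}s",
                                "perSeriesAligner": "ALIGN_SUM",
                            },
                        }
                    },
                }
            ]
        },
    }


def auth_health_widgets() -> list:
    """Widgets for the Auth Health group, one per metric."""
    return [time_series_widget(m) for m in AUTH_METRICS]


print(json.dumps(auth_health_widgets(), indent=2))
```

The printed widgets are meant to be merged into the existing SRE dashboard definition, not to replace it. The stacked bar or heatmap by error_code is not covered by this sketch.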