Auth Error Alerting (Log-Based Metrics)
This note describes how to create log-based metrics and alerts for auth-related error codes so incidents (e.g., DB outages impacting login) surface quickly.
Logging Shape
Backend auth routes emit structured logs and HTTP errors with:
- detail.code in the HTTP JSON body (see Resources → Auth Error Codes).
- error_code in the structured log fields for most 5xx paths.
Example response:
HTTP/1.1 500
{
  "detail": {
    "status": "error",
    "code": "auth_login_db_error",
    "message": "Database error during login"
  }
}
Example log field (Cloud Logging):
jsonPayload.error_code="auth_login_db_error"
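The response body and log field shown above could be produced together by a small helper. The following is a minimal sketch, not the backend's actual code: the function names are hypothetical, and it assumes the service runs on Cloud Run and writes one JSON object per line to stdout, so that the fields show up under jsonPayload in Cloud Logging.

```python
import json
import sys


def auth_error_detail(code: str, message: str) -> dict:
    """Build the HTTP JSON body in the shape shown in the example response."""
    return {"detail": {"status": "error", "code": code, "message": message}}


def auth_error_log_line(code: str, message: str) -> str:
    """Serialize one structured log entry that carries error_code."""
    # "severity" and "message" are conventional structured-logging fields;
    # "error_code" is the field the log-based metric filters match on.
    return json.dumps({"severity": "ERROR", "message": message, "error_code": code})


def log_auth_error(code: str, message: str) -> dict:
    """Emit the log line (assumption: stdout is ingested as structured logs)
    and return the matching HTTP error body."""
    print(auth_error_log_line(code, message), file=sys.stdout)
    return auth_error_detail(code, message)
```

Keeping the log field and the response code in one helper makes it harder for the two to drift apart, which matters because the metrics filter on the log field only.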
Recommended Log-Based Metrics
Create log-based metrics per auth error code in Cloud Logging:
- Metric: auth/login_db_error_count
  - Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_login_db_error"
  - Value: 1 per matching log entry.
- Metric: auth/login_internal_error_count
  - Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_login_internal_error"
- Metric: auth/refresh_db_error_count
  - Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_refresh_db_error"
- Metric: auth/webauthn_db_error_count
  - Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_webauthn_db_error"
- Metric: auth/google_db_error_count
  - Filter: resource.type="cloud_run_revision" AND jsonPayload.error_code="auth_google_db_error"
You can add similar metrics for:
- auth_webauthn_internal_error
- auth_google_internal_error
- auth_api_key_internal_error
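The metric names and filters listed in this section follow a regular pattern, so they can be generated rather than typed by hand. This is a sketch only: the helper names are hypothetical, and it assumes every code starts with auth_ and maps to auth/<rest>_count, as the listed metrics do. Creating the metrics in Cloud Logging is left to whatever tooling the team already uses.

```python
AUTH_ERROR_CODES = [
    "auth_login_db_error",
    "auth_login_internal_error",
    "auth_refresh_db_error",
    "auth_webauthn_db_error",
    "auth_google_db_error",
]


def metric_name(error_code: str) -> str:
    """Map e.g. auth_login_db_error to auth/login_db_error_count."""
    prefix = "auth_"
    if not error_code.startswith(prefix):
        raise ValueError(f"not an auth error code: {error_code!r}")
    return "auth/" + error_code[len(prefix):] + "_count"


def metric_filter(error_code: str) -> str:
    """Build the Cloud Logging filter string for one error code."""
    return (
        'resource.type="cloud_run_revision" AND '
        f'jsonPayload.error_code="{error_code}"'
    )


def metric_specs(codes):
    """Return one {name, filter} dict per error code."""
    return [{"name": metric_name(c), "filter": metric_filter(c)} for c in codes]
```

Adding one of the extra codes, such as auth_api_key_internal_error, then only requires appending it to the list.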
Alert Policies
For each metric:
- Condition: metric > 0 for a short window (e.g., 5 minutes) AND cluster health indicates DB is degraded/unhealthy.
- Notify: SRE on-call and incident channel.
Suggested thresholds:
- auth/login_db_error_count:
  - Critical: >= 5 events in 5 minutes.
  - Warning: >= 1 event in 5 minutes.
- auth/refresh_db_error_count:
  - Critical: >= 10 events in 5 minutes (indicates many sessions failing to refresh).
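To make the threshold semantics concrete, the sketch below evaluates the suggested thresholds over a 5-minute window of event timestamps. It is an illustration of the logic only, not how Cloud Monitoring evaluates alert policies; the names THRESHOLDS, count_in_window, and classify are hypothetical, and timestamps are assumed to be plain seconds.

```python
from typing import Iterable, Optional

WINDOW_SECONDS = 5 * 60

# Suggested thresholds from this note: events per 5-minute window.
THRESHOLDS = {
    "auth/login_db_error_count": {"critical": 5, "warning": 1},
    "auth/refresh_db_error_count": {"critical": 10},
}


def count_in_window(event_times: Iterable[float], now: float,
                    window: float = WINDOW_SECONDS) -> int:
    """Count events with a timestamp in the half-open window (now - window, now]."""
    return sum(1 for t in event_times if now - window < t <= now)


def classify(metric: str, event_times: Iterable[float], now: float) -> Optional[str]:
    """Return "critical", "warning", or None for the given metric."""
    levels = THRESHOLDS.get(metric, {})
    n = count_in_window(event_times, now)
    # Check the most severe level first so it wins when both are met.
    for severity in ("critical", "warning"):
        limit = levels.get(severity)
        if limit is not None and n >= limit:
            return severity
    return None
```

For example, five login DB errors within the last five minutes classify as critical, while a single one classifies as warning.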
Dashboards
Add a small “Auth Health” widget group to the main SRE dashboard:
- Time series for:
  - auth/login_db_error_count
  - auth/login_internal_error_count
  - auth/refresh_db_error_count
- Stacked bar or heatmap by error_code for quick root-cause hints during incidents.
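The stacked-bar view amounts to counting log entries per time bucket and per error_code. The sketch below shows that aggregation on exported log entries; it is an offline illustration under assumed inputs (pairs of timestamp in seconds and error code), with a hypothetical function name, and is not a dashboard configuration.

```python
from collections import Counter, defaultdict


def bucket_counts(entries, bucket_seconds=300):
    """Group (timestamp_seconds, error_code) pairs into fixed-size time buckets.

    Returns {bucket_start: {error_code: count}} ordered by bucket start.
    """
    buckets = defaultdict(Counter)
    for ts, code in entries:
        # Align each timestamp to the start of its bucket.
        start = int(ts // bucket_seconds) * bucket_seconds
        buckets[start][code] += 1
    return {start: dict(counts) for start, counts in sorted(buckets.items())}
```

Each inner dict corresponds to one stacked bar: a bucket dominated by DB error codes points at the database rather than at a single auth flow.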