Add Kubernetes liveness/readiness probes and middleware order validation

- Split /health into /health/live (liveness) and /health/ready (readiness) following Kubernetes conventions. Combined /health retained for backward compatibility with existing Docker HEALTHCHECK definitions.
- Add ReadyCheck and ReadyResponse models for structured readiness output.
- Add _assert_middleware_order() startup check enforcing: RateLimit → Csrf → CorrelationId middleware chain.
- Register CorrelationIdMiddleware, CsrfMiddleware, RateLimitMiddleware in create_app() with documented required order (reverse of processing).
- Add correlation.py, csrf.py, rate_limit.py middleware modules.
- Add health probe tests in test_health_probes.py.
- Update test_main.py with middleware order assertion tests.
- Update frontend useFetchData hook tests.
- Docs: update Deployment.md with Kubernetes probe config examples.
@@ -78,7 +78,12 @@ During rolling deployments:

## Health Checks

The backend container includes a health check endpoint at `GET /api/v1/health` that reports application and component status:
The backend container includes **three** health check endpoints:

### Combined Health Check — `GET /api/v1/health`

Reports application and component status for Docker HEALTHCHECK and legacy
monitoring integration:

- **HTTP 200** with `{"status": "ok", ...}` — all components healthy
- **HTTP 200** with `{"status": "degraded", ...}` — some components unhealthy (e.g., database error) but fail2ban reachable
@@ -93,6 +98,59 @@ The backend container includes a health check endpoint at `GET /api/v1/health` t
| scheduler | `scheduler.running` attribute | Returns degraded when stopped |
| cache | Session cache presence | Returns degraded when not initialised |

### Kubernetes Probes — Liveness and Readiness

Two separate probes following Kubernetes conventions:

| Endpoint | Purpose | HTTP Code | Kubernetes Action |
|---|---|---|---|
| `GET /api/v1/health/live` | Process alive | Always 200 | Restart container if non-2xx |
| `GET /api/v1/health/ready` | All subsystems ready | 200 (all pass) / 503 (any fail) | Stop routing traffic if non-2xx |

**`/health/live` — Liveness probe:**
Returns 200 when the Python process and event loop are responsive. No subsystem checks are performed — this endpoint is always fast. Use for Kubernetes `livenessProbe`.

**`/health/ready` — Readiness probe:**
Verifies all critical sub-systems are reachable before routing traffic. Returns 200 only when all pass; returns 503 with a JSON body listing every failed check otherwise.

| Subsystem | Check | Timeout |
|---|---|---|
| database | Opens and closes a test connection | 2 s |
| fail2ban | Socket reachability via cached server status | N/A (instant) |
| config_dir | Config directory read access (`os.R_OK`) | 2 s |
| scheduler | `scheduler.running` attribute | N/A (instant) |

**Readiness response example (all healthy — HTTP 200):**
```json
{
  "status": "ok",
  "checks": [
    {"name": "database", "healthy": true},
    {"name": "fail2ban", "healthy": true},
    {"name": "config_dir", "healthy": true},
    {"name": "scheduler", "healthy": true}
  ],
  "failed_count": 0
}
```

**Readiness response example (fail2ban offline — HTTP 503):**
```json
{
  "status": "error",
  "checks": [
    {"name": "database", "healthy": true},
    {"name": "fail2ban", "healthy": false, "message": "Socket not reachable"},
    {"name": "config_dir", "healthy": true},
    {"name": "scheduler", "healthy": true}
  ],
  "failed_count": 1
}
```

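The readiness behaviour described above (per-check timeouts, HTTP 503 when any check fails, and a body listing every check) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's implementation: `CheckResult`, `run_check`, `readiness`, and the stand-in checks `always_ok`, `socket_down`, and `too_slow` are hypothetical names, and the real endpoint presumably builds its output from the `ReadyCheck` and `ReadyResponse` models instead.

```python
import asyncio
from dataclasses import asdict, dataclass
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

# (name, async check function, timeout in seconds or None for instant checks)
CheckSpec = Tuple[str, Callable[[], Awaitable[None]], Optional[float]]


@dataclass
class CheckResult:  # hypothetical stand-in, not the project's ReadyCheck model
    name: str
    healthy: bool
    message: Optional[str] = None


async def run_check(
    name: str, check: Callable[[], Awaitable[None]], timeout: Optional[float]
) -> CheckResult:
    """Run one check; a timeout or any exception marks it unhealthy."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return CheckResult(name, True)
    except asyncio.TimeoutError:
        return CheckResult(name, False, f"Timed out after {timeout} s")
    except Exception as exc:
        return CheckResult(name, False, str(exc))


async def readiness(checks: List[CheckSpec]) -> Tuple[int, Dict]:
    """Run all checks concurrently and return (HTTP status code, JSON body)."""
    results = await asyncio.gather(*(run_check(n, c, t) for n, c, t in checks))
    failed = [r for r in results if not r.healthy]
    body = {
        "status": "ok" if not failed else "error",
        # Drop the message key for healthy checks, as in the examples above.
        "checks": [
            {k: v for k, v in asdict(r).items() if v is not None} for r in results
        ],
        "failed_count": len(failed),
    }
    return (200 if not failed else 503), body


# Hypothetical stand-in checks, used only to exercise the sketch.
async def always_ok() -> None:
    return None


async def socket_down() -> None:
    raise ConnectionError("Socket not reachable")


async def too_slow() -> None:
    await asyncio.sleep(1.0)
```

Calling `asyncio.run(readiness([("database", always_ok, 2.0), ("fail2ban", socket_down, None)]))` returns a 503 with `failed_count` equal to 1. Because the checks run concurrently, the worst-case latency of this sketch stays close to the longest single timeout rather than the sum of all timeouts.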
**Why separate liveness and readiness?**
Liveness (`/health/live`) must be cheap: a slow or hanging liveness probe causes Kubernetes to restart a perfectly healthy container. Readiness (`/health/ready`) can afford to check subsystems because traffic is only held back temporarily while a pod recovers.

**Docker Health Check:**

The Dockerfile includes a HEALTHCHECK that queries the combined endpoint (`GET /api/v1/health`). An HTTP 503 response fails the check, and after 3 consecutive failures (90 seconds with the default 30-second interval) Docker marks the container unhealthy so that it can be restarted.
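How the Dockerfile's HEALTHCHECK command is written is not shown here. As a rough sketch of the idea, a health check command only needs to turn the HTTP result into an exit code (0 for healthy, 1 for unhealthy). The helper names `probe_status` and `exit_code` below are hypothetical, and the sketch uses only the Python standard library.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def exit_code(status: Optional[int]) -> int:
    """Map a probe result to a health check exit code: 0 healthy, 1 unhealthy."""
    return 0 if status is not None and 200 <= status < 300 else 1
```

A script could finish with `sys.exit(exit_code(probe_status(url)))`, where `url` points at the combined endpoint, for example `http://localhost:8000/api/v1/health` (the host and port here are assumptions, not taken from the project). A `degraded` response still returns HTTP 200, so this sketch treats it as healthy, matching the table above.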
@@ -739,9 +797,9 @@ sqlite3 /data/bangui.db "ANALYZE;"

## Monitoring Setup

### Health Check Endpoint
### Health Check Endpoints

`GET /api/v1/health` — primary monitoring target.
**Combined health check** — `GET /api/v1/health` — primary monitoring target for Docker HEALTHCHECK.

| Status | HTTP Code | Meaning |
|--------|-----------|---------|
@@ -749,6 +807,17 @@ sqlite3 /data/bangui.db "ANALYZE;"
| `degraded` | 200 | Some components unhealthy — investigate |
| `unavailable` | 503 | fail2ban unreachable — container will be restarted |

**Kubernetes probes:**

`GET /api/v1/health/live` — Liveness probe. Always returns 200 if the process is alive.

`GET /api/v1/health/ready` — Readiness probe. Returns 200 when all subsystems pass, 503 otherwise.

| Probe | URL | Success | Failure |
|-------|---|---------|---------|
| Liveness | `/api/v1/health/live` | 200 | Non-2xx → restart |
| Readiness | `/api/v1/health/ready` | 200 | Non-2xx → stop traffic |

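For an external monitor, the two tables above amount to a small lookup from probe result to meaning. The sketch below is one possible encoding; the function names `classify_combined` and `probe_action` and the returned strings are hypothetical and only mirror the table wording.

```python
from typing import Optional


def classify_combined(http_code: int, status: Optional[str]) -> str:
    """Interpret a response from the combined GET /api/v1/health endpoint."""
    if http_code == 503 or status == "unavailable":
        return "unavailable: fail2ban unreachable"
    if http_code == 200 and status == "degraded":
        return "degraded: investigate"
    if http_code == 200 and status == "ok":
        return "ok"
    # Anything else is outside the documented table.
    return "unknown: unexpected response"


def probe_action(probe: str, http_code: int) -> str:
    """Return the orchestrator reaction for a liveness or readiness result."""
    if probe not in ("liveness", "readiness"):
        raise ValueError(f"unknown probe: {probe!r}")
    if 200 <= http_code < 300:
        return "none"
    return "restart" if probe == "liveness" else "stop traffic"
```

For example, `probe_action("readiness", 503)` yields `"stop traffic"`, while `classify_combined(200, "degraded")` yields `"degraded: investigate"`.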
### Structured Logging

All logs are structured (JSON via structlog). Key fields: