Add Kubernetes liveness/readiness probes and middleware order validation

- Split /health into /health/live (liveness) and /health/ready (readiness) following Kubernetes conventions. Combined /health retained for backward compatibility with existing Docker HEALTHCHECK definitions.
- Add ReadyCheck and ReadyResponse models for structured readiness output.
- Add _assert_middleware_order() startup check enforcing: RateLimit → Csrf → CorrelationId middleware chain.
- Register CorrelationIdMiddleware, CsrfMiddleware, RateLimitMiddleware in create_app() with documented required order (reverse of processing).
- Add correlation.py, csrf.py, rate_limit.py middleware modules.
- Add health probe tests in test_health_probes.py.
- Update test_main.py with middleware order assertion tests.
- Update frontend useFetchData hook tests.
- Docs: update Deployment.md with Kubernetes probe config examples.
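
The middleware order check named in the message can be illustrated with a minimal sketch. This is not the `_assert_middleware_order()` implementation from the commit, whose source is not shown here. The function name `assert_middleware_order`, its list-of-names input, and the error text are hypothetical. The sketch assumes the framework runs the most recently registered middleware first, which is how the message's "reverse of processing" note reads.

```python
from typing import Sequence

# Processing order named in the commit message, outermost first.
REQUIRED_PROCESSING_ORDER = (
    "RateLimitMiddleware",
    "CsrfMiddleware",
    "CorrelationIdMiddleware",
)


def assert_middleware_order(registered: Sequence[str]) -> None:
    """Raise RuntimeError unless the required middleware run in order.

    `registered` holds middleware class names in registration order.
    Assumption: the last registered middleware processes a request first,
    so the processing order is the registration order reversed.
    """
    # Ignore unrelated middleware; only the relative order of the three matters.
    processing = [
        name for name in reversed(registered)
        if name in REQUIRED_PROCESSING_ORDER
    ]
    if tuple(processing) != REQUIRED_PROCESSING_ORDER:
        raise RuntimeError(
            "Middleware order is wrong: expected "
            + " -> ".join(REQUIRED_PROCESSING_ORDER)
            + ", got "
            + (" -> ".join(processing) or "(none)")
        )
```

Under that assumption, registering `CorrelationIdMiddleware`, then `CsrfMiddleware`, then `RateLimitMiddleware` passes the check, while registering them in processing order raises at startup.
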
2026-05-04 02:42:09 +02:00
parent 65fe747cba
commit eb339efcfd
13 changed files with 882 additions and 129 deletions
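
The readiness behaviour this commit documents (per-check timeouts, HTTP 200 only when every check passes, HTTP 503 with a list of checks otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the code in `backend/app/routers/health.py`. `CheckResult`, `run_check`, and `readiness` are hypothetical names, and the 2-second default mirrors the timeout listed in the Deployment.md table.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class CheckResult:
    name: str
    healthy: bool
    message: Optional[str] = None


async def run_check(
    name: str,
    check: Callable[[], Awaitable[None]],
    timeout: float = 2.0,  # assumed default, taken from the docs table
) -> CheckResult:
    """Run one subsystem check; a timeout or exception marks it unhealthy."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return CheckResult(name, True)
    except asyncio.TimeoutError:
        return CheckResult(name, False, f"Timed out after {timeout} s")
    except Exception as exc:  # any failure means "not ready"
        return CheckResult(name, False, str(exc))


async def readiness(
    checks: dict[str, Callable[[], Awaitable[None]]],
) -> tuple[int, dict]:
    """Run all checks concurrently and return (HTTP status, JSON body)."""
    results = await asyncio.gather(
        *(run_check(name, check) for name, check in checks.items())
    )
    failed = sum(1 for r in results if not r.healthy)
    body = {
        "status": "ok" if failed == 0 else "error",
        "checks": [
            {"name": r.name, "healthy": r.healthy,
             **({"message": r.message} if r.message else {})}
            for r in results
        ],
        "failed_count": failed,
    }
    return (200 if failed == 0 else 503), body
```

Running the checks concurrently keeps the worst-case probe latency close to the longest single timeout rather than the sum of all timeouts, which matters when the orchestrator polls the endpoint frequently.
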
--- a/Docs/Deployment.md
+++ b/Docs/Deployment.md
@@ -78,7 +78,12 @@ During rolling deployments:

 ## Health Checks

-The backend container includes a health check endpoint at `GET /api/v1/health` that reports application and component status:
+The backend container includes **three** health check endpoints:
+
+### Combined Health Check — `GET /api/v1/health`
+
+Reports application and component status for Docker HEALTHCHECK and legacy
+monitoring integration:

 - **HTTP 200** with `{"status": "ok", ...}` — all components healthy
 - **HTTP 200** with `{"status": "degraded", ...}` — some components unhealthy (e.g., database error) but fail2ban reachable
@@ -93,6 +98,59 @@ The backend container includes a health check endpoint at `GET /api/v1/health` t
 | scheduler | `scheduler.running` attribute | Returns degraded when stopped |
 | cache | Session cache presence | Returns degraded when not initialised |

+### Kubernetes Probes — Liveness and Readiness
+
+Two separate probes following Kubernetes conventions:
+
+| Endpoint | Purpose | HTTP Code | Kubernetes Action |
+|---|---|---|---|
+| `GET /api/v1/health/live` | Process alive | Always 200 | Restart container if non-2xx |
+| `GET /api/v1/health/ready` | All subsystems ready | 200 (all pass) / 503 (any fail) | Stop routing traffic if non-2xx |
+
+**`/health/live` — Liveness probe:**
+Returns 200 when the Python process and event loop are responsive. No subsystem checks are performed — this endpoint is always fast. Use for Kubernetes `livenessProbe`.
+
+**`/health/ready` — Readiness probe:**
+Verifies all critical sub-systems are reachable before routing traffic. Returns 200 only when all pass; returns 503 with a JSON body listing every failed check otherwise.
+
+| Subsystem | Check | Timeout |
+|---|---|---|
+| database | Opens and closes a test connection | 2 s |
+| fail2ban | Socket reachability via cached server status | N/A (instant) |
+| config_dir | Config directory read access (`os.R_OK`) | 2 s |
+| scheduler | `scheduler.running` attribute | N/A (instant) |
+
+**Readiness response example (all healthy — HTTP 200):**
+```json
+{
+  "status": "ok",
+  "checks": [
+    {"name": "database", "healthy": true},
+    {"name": "fail2ban", "healthy": true},
+    {"name": "config_dir", "healthy": true},
+    {"name": "scheduler", "healthy": true}
+  ],
+  "failed_count": 0
+}
+```
+
+**Readiness response example (fail2ban offline — HTTP 503):**
+```json
+{
+  "status": "error",
+  "checks": [
+    {"name": "database", "healthy": true},
+    {"name": "fail2ban", "healthy": false, "message": "Socket not reachable"},
+    {"name": "config_dir", "healthy": true},
+    {"name": "scheduler", "healthy": true}
+  ],
+  "failed_count": 1
+}
+```
+
+**Why separate liveness and readiness?**
+Liveness (`/health/live`) must be cheap — a slow or hanging liveness probe causes Kubernetes to restart a perfectly healthy container. Readiness (`/health/ready`) can afford to check sub-systems because traffic is only held back temporarily while a pod recovers.
+
 **Docker Health Check:**

 The Dockerfile includes a HEALTHCHECK that queries the endpoint. Docker interprets HTTP 503 as unhealthy and restarts the container after 3 consecutive failures (90 seconds by default).
@@ -739,9 +797,9 @@ sqlite3 /data/bangui.db "ANALYZE;"

 ## Monitoring Setup

-### Health Check Endpoint
+### Health Check Endpoints

-`GET /api/v1/health` — primary monitoring target.
+**Combined health check** — `GET /api/v1/health` — primary monitoring target for Docker HEALTHCHECK.

 | Status | HTTP Code | Meaning |
 |--------|-----------|---------|
@@ -749,6 +807,17 @@ sqlite3 /data/bangui.db "ANALYZE;"
 | `degraded` | 200 | Some components unhealthy — investigate |
 | `unavailable` | 503 | fail2ban unreachable — container will be restarted |

+**Kubernetes probes:**
+
+`GET /api/v1/health/live` — Liveness probe. Always returns 200 if the process is alive.
+
+`GET /api/v1/health/ready` — Readiness probe. Returns 200 when all subsystems pass, 503 otherwise.
+
+| Probe | URL | Success | Failure |
+|-------|---|---------|---------|
+| Liveness | `/api/v1/health/live` | 200 | Non-2xx → restart |
+| Readiness | `/api/v1/health/ready` | 200 | Non-2xx → stop traffic |
+
 ### Structured Logging

 All logs are structured (JSON via structlog). Key fields:
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,84 +1,3 @@
-### Issue #56: MEDIUM - No API Versioning or Deprecation Strategy
-
-**Where found**:
- All backend routers register under `/api/v1/` prefix but no versioning mechanism exists
-
-**Why this is needed**:
-Breaking backend changes immediately break all frontend clients. Without a deprecation path, there is no safe way to evolve the API.
-
-**Goal**:
-Define and implement an API lifecycle policy.
-
-**What to do**:
-1. Document the versioning strategy (URL versioning is already in place; formalize it).
-2. Add a `Deprecation` response header to endpoints scheduled for removal.
-3. Implement a `/api/v2/` prefix for the next breaking change cycle.
-4. Add a CI check that flags new breaking changes against the OpenAPI spec.
-
-**Possible traps and issues**:
- Running two API versions simultaneously doubles maintenance surface; set a sunset date policy.
-
-**Docs changes needed**:
- `Docs/`: create `API_VERSIONING.md` documenting the lifecycle and deprecation process.
-
-**Doc references**:
- All router files under `backend/app/routers/`
-
---
-
-### Issue #57: MEDIUM - Health Endpoint Does Not Check Subsystems
-
-**Where found**:
- `backend/app/routers/health.py`
-
-**Why this is needed**:
-A process that is running but cannot reach the fail2ban socket, database, or config directory still returns `200 OK`. Load balancers and orchestrators treat it as healthy and route traffic to it, causing silent failures.
-
-**Goal**:
-Health endpoint reflects true readiness of all critical subsystems.
-
-**What to do**:
-1. Add a structured health check that tests: database connectivity, fail2ban socket accessibility, config directory read access, scheduler liveness.
-2. Return `200` only when all checks pass; return `503` with a JSON body listing failed checks otherwise.
-3. Expose a separate `/health/live` (process alive) and `/health/ready` (subsystems ready) endpoint for Kubernetes probes.
-
-**Possible traps and issues**:
- Slow health checks (e.g., DB connect timeout) can overwhelm the endpoint under load; set short timeouts per check.
-
-**Docs changes needed**:
- `Docs/Deployment.md`: document liveness vs readiness probe URLs.
-
-**Doc references**:
- `backend/app/routers/health.py`
-
---
-
-### Issue #58: MEDIUM - Abort Signal Not Propagated in Request Deduplication
-
-**Where found**:
- `frontend/src/hooks/useFetchData.ts:93-113`
-
-**Why this is needed**:
-When multiple hook instances share a `requestKey`, they await a single in-flight promise. When one component unmounts and aborts its signal, the shared request continues and calls `setData()` / `onSuccess()` on the unmounted component, causing React "state update on unmounted component" warnings and memory leaks.
-
-**Goal**:
-Unmounting a component that joined a deduplicated request must not receive the result.
-
-**What to do**:
-1. In the deduplication await path, check the component's own abort signal before calling `setData()` or `onSuccess()`.
-2. Wrap the deduplication subscriber list so each subscriber can individually opt out on abort.
-
-**Possible traps and issues**:
- If all subscribers abort before the request resolves, consider whether the underlying request should also be cancelled.
-
-**Docs changes needed**:
- `frontend/src/hooks/README.md`: document abort signal contract for deduplicated requests.
-
-**Doc references**:
- `frontend/src/hooks/README.md`
-
---
-
 ### Issue #59: MEDIUM - Middleware Registration Order Not Validated at Startup

 **Where found**: