refactor(backend): external logging metrics, required mode, health checks

- Add external_logging_init_failures counter
- Add external_log_required flag, raise if init fails and required
- Health endpoint: add external_logging status check
- Blocklist service: enrich with metadata fields, update import logic
- Health check task: add runtime_state dependency, fix return typing
- Metrics: add Histogram for request latencies
- Frontend: align BlocklistImportLogSection props
- Docs: update deployment guide, remove stale tasks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-04 03:45:13 +02:00
parent 42e177e6ea
commit 0a3f9c6c16
15 changed files with 172 additions and 131 deletions

@@ -97,6 +97,7 @@ monitoring integration:
| database | Opens and closes a test connection | Returns degraded when failing |
| scheduler | `scheduler.running` attribute | Returns degraded when stopped |
| cache | Session cache presence | Returns degraded when not initialised |
| external_logging | Handler initialization status | Returns degraded when failed |
### Kubernetes Probes — Liveness and Readiness

@@ -1,114 +1,3 @@
### Issue #60: MEDIUM - NavigationCancellationProvider Orphans Requests on Rapid Navigation
**Where found**:
- `frontend/src/providers/NavigationCancellationProvider.tsx`
- `frontend/src/hooks/useNavigationAbortSignal.ts:42-52`
**Why this is needed**:
When a user navigates A → B → C rapidly, B's in-flight requests are not cancelled because B's signal is replaced before B's requests check it. These requests complete and may write stale data into the wrong page's state.
**Goal**:
Every request initiated for a page is cancelled when that page is navigated away from, regardless of navigation speed.
**What to do**:
1. Associate each request with the pathname that was active when it started, not the current pathname.
2. On navigation, abort all controllers whose associated pathname no longer matches the current route.
**Possible traps and issues**:
- Requests that intentionally survive navigation (e.g., background syncs) must opt out; provide an `ignoreCancellation` flag.
**Docs changes needed**:
- `frontend/src/providers/PROVIDER_ORDER.md`: document the cancellation contract.
**Doc references**:
- `frontend/src/providers/NavigationCancellationProvider.tsx`
---
### Issue #61: MEDIUM - Pagination Offset vs Cursor Mode Indistinguishable to Frontend
**Where found**:
- `backend/app/utils/pagination.py:265-305`
- `backend/app/models/response.py:125-180`
**Why this is needed**:
The `PaginationMetadata` object uses sentinel values (`total=-1`, `total_pages=-1`) for cursor mode. If a backend endpoint silently switches pagination modes, frontend code using `total_pages` to render page controls will display `-1` with no error.
**Goal**:
Frontend code can reliably detect which pagination mode is in use and render accordingly.
**What to do**:
1. Add a `mode: "offset" | "cursor"` discriminator field to `PaginationMetadata`.
2. Update frontend pagination components to branch on `mode` rather than checking for `-1`.
**Possible traps and issues**:
- Adding a required field is a breaking change; make it optional with a default of `"offset"` for backward compatibility.
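A minimal Python sketch of the discriminator, assuming a plain dataclass stands in for the real `PaginationMetadata` model. The `next_cursor` field and the `describe_controls` helper are hypothetical names used only for illustration; the actual model may differ.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class PaginationMetadata:
    """Stand-in for the real response model (assumption, not the actual class)."""

    total: int = -1
    total_pages: int = -1
    next_cursor: Optional[str] = None  # hypothetical field for cursor mode
    # Optional with a default of "offset", so payloads built before the
    # field existed keep their meaning.
    mode: Literal["offset", "cursor"] = "offset"


def describe_controls(meta: PaginationMetadata) -> str:
    """Branch on the discriminator instead of testing for the -1 sentinel."""
    if meta.mode == "cursor":
        return "load-more" if meta.next_cursor else "end-of-list"
    return f"pages 1..{meta.total_pages}"
```

A consumer that branches on `mode` never has to interpret `total_pages=-1`, which is the failure the sentinel check is prone to.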
**Docs changes needed**:
- API reference: document the `mode` field and its values.
**Doc references**:
- `backend/app/utils/pagination.py`
---
### Issue #62: MEDIUM - Blocklist URL Validation Is Async With No Rollback on Failure
**Where found**:
- `backend/app/services/blocklist_service.py`
- `backend/app/models/blocklist.py:36-40`
**Why this is needed**:
DNS validation runs asynchronously after the model itself has been validated. If the DNS check fails or is slow, concurrent requests can insert duplicate or invalid blocklist sources before its result is checked, leaving the database in a dirty state.
**Goal**:
Blocklist source creation is atomic: either validation passes and the row is committed, or validation fails and no row exists.
**What to do**:
1. Perform DNS/URL validation inside a database transaction; roll back on failure.
2. Add a unique constraint on the URL column to catch duplicates at the DB level.
3. Return a conflict error (409) on duplicate URL submissions.
**Possible traps and issues**:
- Async DNS lookup inside a transaction holds the transaction open longer; use a short timeout.
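The steps above can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the service's actual code: the table name `blocklist_sources`, the exception classes, and the synchronous `validate` callback (standing in for a DNS lookup with a short timeout) are all hypothetical.

```python
import sqlite3
from typing import Callable


class Conflict(Exception):
    """Hypothetical error the API layer would map to HTTP 409."""

    status_code = 409


class ValidationFailed(Exception):
    """Hypothetical error raised when the URL check fails."""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The UNIQUE constraint rejects duplicates at the database level.
    conn.execute(
        "CREATE TABLE blocklist_sources (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
    )
    conn.commit()
    return conn


def create_source(conn: sqlite3.Connection, url: str, validate: Callable[[str], bool]) -> None:
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute("INSERT INTO blocklist_sources (url) VALUES (?)", (url,))
            if not validate(url):
                raise ValidationFailed(url)
    except sqlite3.IntegrityError as exc:
        raise Conflict(url) from exc
```

Because the insert and the validation share one transaction, a failed check leaves no row behind, and a concurrent duplicate surfaces as `Conflict` rather than a second row.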
**Docs changes needed**:
- API reference: document the 409 conflict response for duplicate URLs.
**Doc references**:
- `backend/app/services/blocklist_service.py`
---
### Issue #63: MEDIUM - Correlation ID Lost Across Background Task Boundaries
**Where found**:
- `backend/app/tasks/health_check.py:70-74`
- `backend/app/utils/correlation.py`
**Why this is needed**:
Background tasks that spawn sub-tasks (e.g., health check triggering failover logic) do not propagate the correlation ID `ContextVar` to child asyncio tasks. Logs from child tasks appear without a correlation ID, breaking distributed tracing.
Additionally, `reset_correlation_id()` in the `finally` block clears the ID before all child tasks have logged.
**Goal**:
Every log line emitted during a background job carries its originating correlation ID.
**What to do**:
1. Use `asyncio.create_task(coro, context=copy_context())` (the `context` keyword requires Python 3.11 or later) to propagate the `ContextVar` to child tasks.
2. Move `reset_correlation_id()` to after all child tasks have completed.
**Possible traps and issues**:
- `copy_context()` captures a snapshot; mutations in the parent after the copy won't be seen by the child (this is the desired behavior).
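A small sketch of the propagation and reset ordering, assuming a module-level `ContextVar` similar to the one in `correlation.py`. The names `correlation_id`, `spawn`, `child`, `run_job`, and `log_lines` are hypothetical, and the fallback branch is only there so the sketch also runs on interpreters older than Python 3.11. Note that asyncio tasks already copy the current context when they are created; passing an explicit snapshot makes that capture point deliberate.

```python
import asyncio
import contextvars
import sys

# Hypothetical stand-in for the ContextVar defined in correlation.py.
correlation_id = contextvars.ContextVar("correlation_id", default=None)
log_lines = []


def log(message):
    # Record the correlation ID visible at the moment of logging.
    log_lines.append((correlation_id.get(), message))


async def child(name):
    await asyncio.sleep(0)
    log(f"{name} finished")


def spawn(coro):
    """Start a child task that sees a snapshot of the caller's context."""
    ctx = contextvars.copy_context()
    if sys.version_info >= (3, 11):
        return asyncio.create_task(coro, context=ctx)
    # Older interpreters: create the task while running inside the snapshot.
    return ctx.run(asyncio.create_task, coro)


async def run_job(job_id):
    token = correlation_id.set(job_id)
    try:
        children = [spawn(child("failover")), spawn(child("notify"))]
        log("job started")
        # Wait for every child before the ID is reset in `finally`.
        await asyncio.gather(*children)
    finally:
        correlation_id.reset(token)


asyncio.run(run_job("job-123"))
```

Awaiting the children inside the `try` block is what keeps the reset from running before they have logged.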
**Docs changes needed**:
- Add inline comment in `health_check.py` explaining context propagation.
**Doc references**:
- `backend/app/utils/correlation.py`
---
### Issue #64: MEDIUM - External Logging Failure Silently Swallowed
**Where found**: