Update observability docs and task utilities

- Add Observability.md documentation - Standardize task logging with correlation_id support - Add log_sanitizer utility for PII masking - Update Tasks.md tracking - Update geo_cache tasks and other task modules with correlation_id Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 11:52:09 +02:00
parent 7b93499551
commit 0133489920
17 changed files with 582 additions and 124 deletions
--- a/Docs/Observability.md
+++ b/Docs/Observability.md
@@ -62,6 +62,40 @@ log.info("password_check", password=password_hash)  # Never!

 Structlog provides context variable filtering to prevent accidental logging of sensitive data. Code reviews must verify compliance with this rule.

+### Log Sanitization
+
+All external output (subprocess results, API responses, config file contents) passed to structlog **must** be sanitized first using `sanitize_for_logging()` from `app.utils.log_sanitizer`.
+
+This prevents sensitive data — passwords, API keys, tokens, private keys — from leaking into logs.
+
+```python
+from app.utils.log_sanitizer import sanitize_for_logging
+
+# ✓ Correct: Sanitize before logging
+log.error(
+    "fail2ban_start_failed",
+    command=" ".join(start_cmd_parts),
+    returncode=process.returncode,
+    stdout=sanitize_for_logging(stdout.decode("utf-8", errors="replace")),
+    stderr=sanitize_for_logging(stderr.decode("utf-8", errors="replace")),
+)
+
+# ✗ Wrong: Raw output may contain secrets
+log.error("fail2ban_start_failed", stdout=stdout_raw, stderr=stderr_raw)  # Never!
+```
+
+`sanitize_for_logging()` redacts the following patterns:
+
+| Pattern | Example match | Replacement |
+|---------|---------------|-------------|
+| `password=X` | `password=Secret123` | `password=***` |
+| `api_key=X` / `api-key=X` | `api_key=key123` | `api_key=***` |
+| `token=X` | `token=eyJhbG...` | `token=***` |
+| `Authorization: Bearer X` | `Authorization: Bearer tok...` | `Authorization: ***` |
+| `secret=X` | `secret=myvalue` | `secret=***` |
+| `-----BEGIN RSA PRIVATE KEY-----` | (key header) | `*** PRIVATE KEY ***` |
+| `AKIA...` | `AKIAIOSFODNN7EXAMPLE` | `AKIA***` |
+
 ---

 ## Structured Logging Best Practices
@@ -102,6 +136,35 @@ log.info("user_action", action="create_jail")  # Automatically includes correlat
 structlog.contextvars.clear_contextvars()
 ```

+### Background Task Correlation
+
+Background tasks (APScheduler jobs) run outside the HTTP request context.
+Use :mod:`app.utils.correlation` to propagate correlation IDs through tasks:
+
+```python
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
+
+async def my_background_task(correlation_id: str | None = None) -> None:
+    # Generate a new ID if not provided (scheduled tasks have no parent request)
+    if correlation_id is None:
+        import uuid
+        correlation_id = str(uuid.uuid4())
+
+    # Set the correlation ID for all logs in this task
+    token = set_correlation_id(correlation_id)
+    try:
+        log.info("task_started")  # Now includes correlation_id
+        # ... task logic ...
+    finally:
+        reset_correlation_id(token)
+
+# When scheduling, optionally pass the current correlation ID:
+# scheduler.add_job(my_background_task, kwargs={"correlation_id": get_correlation_id()})
+```
+
+Scheduled tasks (no parent request) generate a fresh UUID for each run.
+Tasks triggered by a request inherit the request's correlation ID.
+
 ### Event Naming Convention

 Use snake_case for event names, prefixed with the component or module name:
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,103 +1,3 @@
-### Issue #18: MEDIUM - Configuration Validation Missing at Startup
-
-**Where found**: 
- `backend/app/config.py` (lines 37-95)
- `database_path` has no validation
- `fail2ban_socket` not verified to exist
- Hard-coded paths assumed in Docker
-
-**Why this is needed**: 
-Configuration errors not caught at startup:
- Database path doesn't exist - fails on first DB operation (confusing error)
- fail2ban socket wrong path - only fails when health check runs
- Directory not writable - discovered hours after deployment
-
-**Goal**: 
-Validate all configuration at startup with clear error messages.
-
-**What to do**:
-1. Add validators to config fields:
-   ```python
-   @field_validator("database_path")
-   def validate_db_path(cls, v):
-       path = Path(v)
-       parent = path.parent
-       
-       if not parent.exists():
-           raise ValueError(
-               f"Database parent directory does not exist: {parent}\n"
-               f"Create it with: mkdir -p {parent}"
-           )
-       
-       if not os.access(parent, os.W_OK):
-           raise ValueError(
-               f"Database directory not writable: {parent}\n"
-               f"Fix with: chmod 755 {parent}"
-           )
-       
-       return v
-   ```
-2. Validate fail2ban socket exists and is readable
-3. Verify session secret is sufficiently long
-4. Check environment variables are set
-5. Provide actionable error messages
-
-**Possible traps and issues**:
- Validation might be too strict for some deployments
- Need to handle cases where files don't exist yet but will be created
- Docker initialization order might delay file creation
-
-**Docs changes needed**:
- Document required directories and permissions
- Create setup validation troubleshooting guide
-
-**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "5.2 Missing Configuration Validation"
-
---
-
-### Issue #19: MEDIUM - Sensitive Data Could Leak in Logs
-
-**Where found**: 
- `backend/app/utils/fail2ban_client.py` (line 148) - Logs subprocess output without sanitization
- Could contain passwords, API keys from config files
-
-**Why this is needed**: 
-If subprocess output contains secrets, logs become security liability and violate compliance requirements.
-
-**Goal**: 
-Sanitize logs to remove sensitive information patterns.
-
-**What to do**:
-1. Create sanitizer function:
-   ```python
-   def sanitize_for_logging(text: str) -> str:
-       # Remove passwords
-       text = re.sub(r'password[=:]\S+', 'password=***', text, flags=re.IGNORECASE)
-       # Remove API keys
-       text = re.sub(r'api[_-]?key[=:]\S+', 'api_key=***', text, flags=re.IGNORECASE)
-       # Remove tokens
-       text = re.sub(r'token[=:]\S+', 'token=***', text, flags=re.IGNORECASE)
-       return text
-   ```
-2. Apply to all subprocess output and external responses
-3. Add to logging middleware
-4. Audit existing logs for sensitive data
-
-**Possible traps and issues**:
- Patterns might miss some sensitive data formats
- Over-sanitization might hide helpful debug info
- Performance cost if sanitizing large outputs
-
-**Docs changes needed**:
- Add logging best practices guide
- Document what's sanitized
-
-**Doc references**:
- DETAILED_FINDINGS.md - Issue #25 "Sensitive Data in Logs"
-
---
-
 ### Issue #20: MEDIUM - No Correlation ID in Background Tasks

 **Where found**: 
--- a/backend/app/services/blocklist_service.py
+++ b/backend/app/services/blocklist_service.py
@@ -391,7 +391,7 @@ def schedule_blocklist_job(
    settings: Settings,
    http_session: aiohttp.ClientSession,
    config: ScheduleConfig,
-    run_import_callback: Callable[[Settings, aiohttp.ClientSession], Awaitable[None]],
+    run_import_callback: Callable[..., Awaitable[None]],
 ) -> None:
    """Register or replace the scheduled blocklist import job."""

--- a/backend/app/tasks/blocklist_import.py
+++ b/backend/app/tasks/blocklist_import.py
@@ -9,10 +9,16 @@ be updated at runtime through the blocklist router.
 The scheduler job ID is ``"blocklist_import"`` — using a stable id means
 re-registering the job (e.g. after a schedule update) safely replaces the
 existing entry without creating duplicates.
+
+Correlation IDs are propagated through the task using
+:mod:`app.utils.correlation` so that task logs can be correlated with the
+triggering request (for manually triggered imports) or assigned a unique
+ID per run (for scheduled imports).
 """

 from __future__ import annotations

+import uuid
 from typing import TYPE_CHECKING, Any

 import structlog
@@ -20,6 +26,7 @@ import structlog
 from app.services import ban_service, blocklist_service
 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings

 if TYPE_CHECKING:
@@ -37,18 +44,36 @@ JOB_ID: str = "blocklist_import"
 TASK_TIMEOUT_SECONDS: int = 300


-async def _run_import_with_resources(settings: Settings, http_session: ClientSession) -> None:
+async def _run_import_with_resources(
+    settings: Settings,
+    http_session: ClientSession,
+    correlation_id: str | None = None,
+) -> None:
    """APScheduler callback that imports all enabled blocklist sources.

    Args:
        settings: The resolved application settings used for database access.
        http_session: The shared aiohttp session used for blocklist downloads.
+        correlation_id: Optional correlation ID from the triggering request.
+            If ``None``, a new UUID4 is generated to identify this run.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_import_with_settings(settings, http_session)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_import_with_settings(settings: Settings, http_session: ClientSession) -> None:
+    """Inner import logic that runs with correlation context set."""

    async def _do_import() -> None:
        socket_path: str = settings.fail2ban_socket

-        log.info("blocklist_import_starting")
+        log.info("blocklist_import_starting", correlation_id=get_correlation_id())
        try:
            async with task_db(settings) as db:
                result = await blocklist_service.import_all(
@@ -59,12 +84,13 @@ async def _run_import_with_resources(settings: Settings, http_session: ClientSes
                )
            log.info(
                "blocklist_import_finished",
+                correlation_id=get_correlation_id(),
                total_imported=result.total_imported,
                total_skipped=result.total_skipped,
                errors=result.errors_count,
            )
        except Exception:
-            log.exception("blocklist_import_unexpected_error")
+            log.exception("blocklist_import_unexpected_error", correlation_id=get_correlation_id())

    await run_with_timeout("blocklist_import", _do_import(), TASK_TIMEOUT_SECONDS)

--- a/backend/app/tasks/geo_cache_cleanup.py
+++ b/backend/app/tasks/geo_cache_cleanup.py
@@ -7,10 +7,14 @@ database file and maintains query performance on geo lookups.

 When a stale IP is encountered again after purge, it will be re-resolved from
 the MaxMind database or ip-api.com (if configured), which is acceptable.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

+import uuid
 from datetime import UTC, datetime, timedelta
 from typing import TYPE_CHECKING

@@ -19,6 +23,7 @@ import structlog
 from app.repositories import geo_cache_repo
 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings

 if TYPE_CHECKING:
@@ -41,7 +46,10 @@ JOB_ID: str = "geo_cache_cleanup"
 TASK_TIMEOUT_SECONDS: int = 60


-async def _run_cleanup_with_resources(settings: Settings) -> None:
+async def _run_cleanup_with_resources(
+    settings: Settings,
+    correlation_id: str | None = None,
+) -> None:
    """Delete stale entries from the geo cache.

    Calculates a cutoff timestamp (now - retention period) and removes all
@@ -49,7 +57,20 @@ async def _run_cleanup_with_resources(settings: Settings) -> None:

    Args:
        settings: The resolved application settings used for database access.
+        correlation_id: Optional correlation ID for log correlation.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_cleanup_with_settings(settings)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_cleanup_with_settings(settings: Settings) -> None:
+    """Inner cleanup logic that runs with correlation context set."""

    async def _do_cleanup() -> None:
        cutoff_dt = datetime.now(UTC) - timedelta(days=GEO_CACHE_RETENTION_DAYS)
@@ -60,9 +81,19 @@ async def _run_cleanup_with_resources(settings: Settings) -> None:
            await db.commit()

        if deleted > 0:
-            log.info("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
+            log.info(
+                "geo_cache_cleanup_ran",
+                correlation_id=get_correlation_id(),
+                deleted=deleted,
+                retention_days=GEO_CACHE_RETENTION_DAYS,
+            )
        else:
-            log.debug("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
+            log.debug(
+                "geo_cache_cleanup_ran",
+                correlation_id=get_correlation_id(),
+                deleted=deleted,
+                retention_days=GEO_CACHE_RETENTION_DAYS,
+            )

    await run_with_timeout("geo_cache_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)

--- a/backend/app/tasks/geo_cache_flush.py
+++ b/backend/app/tasks/geo_cache_flush.py
@@ -7,16 +7,21 @@ After Task 2 removed geo cache writes from GET requests, newly resolved IPs
 are only held in the in-memory cache until this task flushes them.  With the
 default 60-second interval, at most one minute of new resolution results is
 at risk on an unexpected process restart.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

+import uuid
 from typing import TYPE_CHECKING

 import structlog

 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings

 if TYPE_CHECKING:
@@ -37,20 +42,37 @@ JOB_ID: str = "geo_cache_flush"
 TASK_TIMEOUT_SECONDS: int = 60


-async def _run_flush_with_resources(geo_cache: GeoCache, settings: Settings) -> None:
+async def _run_flush_with_resources(
+    geo_cache: GeoCache,
+    settings: Settings,
+    correlation_id: str | None = None,
+) -> None:
    """Flush the geo cache dirty set to the application database.

    Args:
        geo_cache: The application's GeoCache instance.
        settings: The resolved application settings used for database access.
+        correlation_id: Optional correlation ID for log correlation.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_flush_with_resources(geo_cache, settings)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_flush_with_resources(geo_cache: GeoCache, settings: Settings) -> None:
+    """Inner flush logic that runs with correlation context set."""

    async def _do_flush() -> None:
        async with task_db(settings) as db:
            count = await geo_cache.flush_dirty(db)

        if count > 0:
-            log.debug("geo_cache_flush_ran", flushed=count)
+            log.debug("geo_cache_flush_ran", correlation_id=get_correlation_id(), flushed=count)

    await run_with_timeout("geo_cache_flush", _do_flush(), TASK_TIMEOUT_SECONDS)

--- a/backend/app/tasks/geo_re_resolve.py
+++ b/backend/app/tasks/geo_re_resolve.py
@@ -13,16 +13,21 @@ The task runs every 10 minutes.  On each invocation it:
 3. Delegates to :meth:`~app.services.geo_cache.GeoCache.lookup_batch` which
   already handles rate-limit throttling and retries.
 4. Logs how many IPs were retried and how many resolved successfully.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

+import uuid
 from typing import TYPE_CHECKING

 import structlog

 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings

 if TYPE_CHECKING:
@@ -45,7 +50,10 @@ TASK_TIMEOUT_SECONDS: int = 120


 async def _run_re_resolve_with_resources(
-    geo_cache: GeoCache, settings: Settings, http_session: ClientSession
+    geo_cache: GeoCache,
+    settings: Settings,
+    http_session: ClientSession,
+    correlation_id: str | None = None,
 ) -> None:
    """Query NULL-country IPs from the database and re-resolve them.

@@ -53,7 +61,24 @@ async def _run_re_resolve_with_resources(
        geo_cache: The application's GeoCache instance.
        settings: The resolved application settings used for database access.
        http_session: The shared aiohttp session used for external lookups.
+        correlation_id: Optional correlation ID for log correlation.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_re_resolve_with_resources(geo_cache, settings, http_session)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_re_resolve_with_resources(
+    geo_cache: GeoCache,
+    settings: Settings,
+    http_session: ClientSession,
+) -> None:
+    """Inner re-resolve logic that runs with correlation context set."""

    async def _do_re_resolve() -> None:
        async with task_db(settings) as db:
@@ -61,10 +86,10 @@ async def _run_re_resolve_with_resources(
            unresolved_ips = await geo_cache.get_unresolved_ips(db)

            if not unresolved_ips:
-                log.debug("geo_re_resolve_skip", reason="no_unresolved_ips")
+                log.debug("geo_re_resolve_skip", correlation_id=get_correlation_id(), reason="no_unresolved_ips")
                return

-            log.info("geo_re_resolve_start", unresolved=len(unresolved_ips))
+            log.info("geo_re_resolve_start", correlation_id=get_correlation_id(), unresolved=len(unresolved_ips))

            # Clear the negative cache so these IPs are eligible for fresh API calls.
            await geo_cache.clear_neg_cache()
@@ -78,6 +103,7 @@ async def _run_re_resolve_with_resources(
        )
        log.info(
            "geo_re_resolve_complete",
+            correlation_id=get_correlation_id(),
            retried=len(unresolved_ips),
            resolved=resolved_count,
        )
--- a/backend/app/tasks/health_check.py
+++ b/backend/app/tasks/health_check.py
@@ -13,11 +13,15 @@ keys).  If the health probe subsequently detects an online→offline transition
 within 60 seconds of that activation, a
 :class:`~app.models.config.PendingRecovery` record is written to
 ``app.state.pending_recovery`` so the UI can offer a one-click rollback.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

 import datetime
+import uuid
 from typing import TYPE_CHECKING

 import structlog
@@ -25,6 +29,7 @@ import structlog
 from app.models.server import ServerStatus
 from app.services import health_service
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import (
    RuntimeState,
    get_effective_settings,
@@ -47,13 +52,30 @@ HEALTH_CHECK_INTERVAL: int = 30
 HEALTH_PROBE_TIMEOUT_SECONDS: int = 10


-async def _run_probe_with_resources(settings: Settings, runtime_state: RuntimeState) -> None:
+async def _run_probe_with_resources(
+    settings: Settings,
+    runtime_state: RuntimeState,
+    correlation_id: str | None = None,
+) -> None:
    """Probe fail2ban and cache the result on the runtime state.

    Args:
        settings: The resolved application settings used for the probe.
        runtime_state: The mutable runtime state manager.
+        correlation_id: Optional correlation ID for log correlation.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_probe_with_resources(settings, runtime_state)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_probe_with_resources(settings: Settings, runtime_state: RuntimeState) -> None:
+    """Inner probe logic that runs with correlation context set."""

    async def _do_probe() -> None:
        socket_path: str = settings.fail2ban_socket
--- a/backend/app/tasks/history_sync.py
+++ b/backend/app/tasks/history_sync.py
@@ -2,11 +2,15 @@

 Periodically copies new records from the fail2ban sqlite database into the
 BanGUI application archive table to prevent gaps when fail2ban purges old rows.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

 import datetime
+import uuid
 from typing import TYPE_CHECKING

 import structlog
@@ -14,6 +18,7 @@ import structlog
 from app.services import history_service
 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings

 if TYPE_CHECKING:
@@ -36,16 +41,32 @@ BACKFILL_WINDOW: int = 648000
 TASK_TIMEOUT_SECONDS: int = 60


-async def _run_sync_with_settings(settings: Settings) -> None:
+async def _run_sync_with_settings(
+    settings: Settings,
+    correlation_id: str | None = None,
+) -> None:
+    """Run the history sync with correlation ID context."""
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_sync_with_settings(settings)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_sync_with_settings(settings: Settings) -> None:
+    """Inner sync logic that runs with correlation context set."""
    socket_path: str = settings.fail2ban_socket

    async def _do_sync() -> None:
        try:
            async with task_db(settings) as db:
                synced = await history_service.sync_from_fail2ban_db(db, socket_path)
-            log.info("history_sync_complete", synced=synced)
+            log.info("history_sync_complete", correlation_id=get_correlation_id(), synced=synced)
        except Exception:
-            log.exception("history_sync_failed")
+            log.exception("history_sync_failed", correlation_id=get_correlation_id())

    await run_with_timeout("history_sync", _do_sync(), TASK_TIMEOUT_SECONDS)

--- a/backend/app/tasks/rate_limiter_cleanup.py
+++ b/backend/app/tasks/rate_limiter_cleanup.py
@@ -8,15 +8,20 @@ consuming excessive memory.
 The cleanup is conservative: it only removes IPs with no recent attempts
 (all timestamps outside the rate-limit window), so active or recently-active
 IPs are preserved.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

+import uuid
 from typing import TYPE_CHECKING

 import structlog

 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id

 if TYPE_CHECKING:
    from fastapi import FastAPI
@@ -35,7 +40,10 @@ JOB_ID: str = "rate_limiter_cleanup"
 TASK_TIMEOUT_SECONDS: int = 5


-async def _run_cleanup(app: FastAPI) -> None:
+async def _run_cleanup(
+    app: FastAPI,
+    correlation_id: str | None = None,
+) -> None:
    """Trigger cleanup of expired rate-limiter entries.

    Cleans up both the login-specific rate limiter (exponential backoff)
@@ -43,13 +51,27 @@ async def _run_cleanup(app: FastAPI) -> None:

    Args:
        app: The FastAPI application instance (holds the rate limiters).
+        correlation_id: Optional correlation ID for log correlation.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_cleanup_with_app(app)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_cleanup_with_app(app: FastAPI) -> None:
+    """Inner cleanup logic that runs with correlation context set."""

    async def _do_cleanup() -> None:
        login_limiter = getattr(app.state, "login_rate_limiter", None)
        if login_limiter is None:
            log.warning(
                "rate_limiter_cleanup_skipped",
+                correlation_id=get_correlation_id(),
                reason="login_rate_limiter not found on app.state",
            )
        else:
@@ -59,6 +81,7 @@ async def _run_cleanup(app: FastAPI) -> None:
        if global_limiter is None:
            log.warning(
                "rate_limiter_cleanup_skipped",
+                correlation_id=get_correlation_id(),
                reason="global_rate_limiter not found on app.state",
            )
        else:
--- a/backend/app/tasks/scheduler_lock_heartbeat.py
+++ b/backend/app/tasks/scheduler_lock_heartbeat.py
@@ -7,16 +7,21 @@ if the running instance experiences temporary delays or high load.
 Without this heartbeat, stale lock detection (based on TTL) could incorrectly
 determine that the scheduler instance has crashed when it's merely busy, and
 a new instance could take over.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

+import uuid
 from typing import TYPE_CHECKING

 import structlog

 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings
 from app.utils.scheduler_lock import update_scheduler_lock_heartbeat

@@ -38,7 +43,10 @@ JOB_ID: str = "scheduler_lock_heartbeat"
 TASK_TIMEOUT_SECONDS: int = 5


-async def _update_heartbeat_with_resources(settings: Settings) -> None:
+async def _update_heartbeat_with_resources(
+    settings: Settings,
+    correlation_id: str | None = None,
+) -> None:
    """Update the scheduler lock heartbeat timestamp.

    If the heartbeat update fails (e.g., we no longer hold the lock), log
@@ -51,17 +59,31 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:

    Args:
        settings: The resolved application settings used for database access.
+        correlation_id: Optional correlation ID from the triggering request.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_update_heartbeat_with_settings(settings)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_update_heartbeat_with_settings(settings: Settings) -> None:
+    """Inner heartbeat logic that runs with correlation context set."""

    async def _do_update() -> None:
        async with task_db(settings) as db:
            success = await update_scheduler_lock_heartbeat(db)

        if success:
-            log.debug("scheduler_lock_heartbeat_updated")
+            log.debug("scheduler_lock_heartbeat_updated", correlation_id=get_correlation_id())
        else:
            log.warning(
                "scheduler_lock_heartbeat_failed",
+                correlation_id=get_correlation_id(),
                message="Failed to update heartbeat; we no longer hold the lock. "
                "Another instance may have taken over or the database connection failed.",
            )
@@ -71,12 +93,14 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:
    except TimeoutError:
        log.error(
            "scheduler_lock_heartbeat_timeout",
+            correlation_id=get_correlation_id(),
            timeout_seconds=TASK_TIMEOUT_SECONDS,
            message="Heartbeat update exceeded timeout. The database may be slow or unresponsive.",
        )
    except Exception as e:
        log.error(
            "scheduler_lock_heartbeat_error",
+            correlation_id=get_correlation_id(),
            error=str(e),
            message="Unexpected error during heartbeat update.",
        )
--- a/backend/app/tasks/session_cleanup.py
+++ b/backend/app/tasks/session_cleanup.py
@@ -6,10 +6,14 @@ months of operation and degrades query performance.

 Individual expired sessions are removed on-demand when validated, but the bulk
 cleanup ensures comprehensive pruning at a predictable interval.
+
+Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
+so that task logs can be correlated across runs.
 """

 from __future__ import annotations

+import uuid
 from typing import TYPE_CHECKING

 import structlog
@@ -17,6 +21,7 @@ import structlog
 from app.repositories import session_repo
 from app.tasks.db import task_db
 from app.tasks.timeout_utils import run_with_timeout
+from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
 from app.utils.runtime_state import get_effective_settings
 from app.utils.time_utils import utc_now

@@ -37,19 +42,40 @@ JOB_ID: str = "session_cleanup"
 TASK_TIMEOUT_SECONDS: int = 30


-async def _run_cleanup_with_resources(settings: Settings) -> None:
+async def _run_cleanup_with_resources(
+    settings: Settings,
+    correlation_id: str | None = None,
+) -> None:
    """Delete all expired sessions from the database.

    Args:
        settings: The resolved application settings used for database access.
+        correlation_id: Optional correlation ID from the triggering request.
    """
+    if correlation_id is None:
+        correlation_id = str(uuid.uuid4())
+
+    token = set_correlation_id(correlation_id)
+    try:
+        await _do_cleanup_with_settings(settings)
+    finally:
+        reset_correlation_id(token)
+
+
+async def _do_cleanup_with_settings(settings: Settings) -> None:
+    """Inner cleanup logic that runs with correlation context set."""

    async def _do_cleanup() -> None:
        now_iso = utc_now().isoformat()
        async with task_db(settings) as db:
            deleted_count = await session_repo.delete_expired_sessions(db, now_iso)

-        log.info("session_cleanup_ran", deleted_count=deleted_count, cutoff_time=now_iso)
+        log.info(
+            "session_cleanup_ran",
+            correlation_id=get_correlation_id(),
+            deleted_count=deleted_count,
+            cutoff_time=now_iso,
+        )

    await run_with_timeout("session_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)

--- a/backend/app/utils/config_file_utils.py
+++ b/backend/app/utils/config_file_utils.py
@@ -29,6 +29,7 @@ from app.utils.fail2ban_client import (
    Fail2BanResponse,
 )
 from app.utils.fail2ban_response import ok, to_dict
+from app.utils.log_sanitizer import sanitize_for_logging

 log: structlog.stdlib.BoundLogger = structlog.get_logger()

@@ -500,8 +501,8 @@ async def start_daemon(start_cmd_parts: list[str]) -> bool:
            "fail2ban_start_failed",
            command=" ".join(start_cmd_parts),
            returncode=process.returncode,
-            stdout=stdout.decode("utf-8", errors="replace"),
-            stderr=stderr.decode("utf-8", errors="replace"),
+            stdout=sanitize_for_logging(stdout.decode("utf-8", errors="replace")),
+            stderr=sanitize_for_logging(stderr.decode("utf-8", errors="replace")),
        )
        return False
    log.info(
--- a/backend/app/utils/correlation.py
+++ b/backend/app/utils/correlation.py
@@ -0,0 +1,70 @@
+"""Correlation ID context variable for distributed tracing.
+
+This module provides a :class:`contextvars.ContextVar` that stores the correlation
+ID for the current execution context. Background tasks inherit or explicitly set
+the correlation ID to enable log correlation across request/task boundaries.
+
+Usage in background tasks::
+
+    from app.utils.correlation import get_correlation_id, set_correlation_id
+
+    async def my_background_task(correlation_id: str) -> None:
+        token = set_correlation_id(correlation_id)
+        try:
+            log.info("task_started")  # Logs include correlation_id
+        finally:
+            reset_correlation_id(token)
+
+For APScheduler jobs, pass the correlation ID through kwargs and call
+:func:`set_correlation_id` at the start of the async callback.
+"""
+
+from __future__ import annotations
+
+from contextvars import ContextVar
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from contextvars import Token
+
+#: Default correlation ID used when no parent request context exists.
+_DEFAULT_BG_CORRELATION_ID: str = "bg-task"
+
+#: ContextVar holding the current correlation ID.
+correlation_id_var: ContextVar[str] = ContextVar(
+    "correlation_id",
+    default=_DEFAULT_BG_CORRELATION_ID,
+)
+
+
+def get_correlation_id() -> str:
+    """Return the current correlation ID from context.
+
+    Returns:
+        The current correlation ID string. Returns ``"bg-task"`` if no
+        correlation ID has been set in the current context.
+    """
+    return correlation_id_var.get()
+
+
+def set_correlation_id(correlation_id: str) -> Token[str]:
+    """Set the correlation ID for the current context.
+
+    Args:
+        correlation_id: The correlation ID to bind (typically passed from
+            the triggering request).
+
+    Returns:
+        A :class:`contextvars.Token` that must be passed to
+        :func:`reset_correlation_id` when the task completes.
+    """
+    return correlation_id_var.set(correlation_id)
+
+
+def reset_correlation_id(token: Token[str]) -> None:
+    """Reset the correlation ID to its previous value.
+
+    Args:
+        token: The token returned by :func:`set_correlation_id`.
+    """
+    correlation_id_var.reset(token)
--- a/backend/app/utils/log_sanitizer.py
+++ b/backend/app/utils/log_sanitizer.py
@@ -0,0 +1,50 @@
+"""Log sanitization utilities for preventing sensitive data leakage.
+
+All external output (subprocess, API responses, config data) passed to
+structlog MUST be sanitized first. This module provides the canonical
+sanitize_for_logging() function used across the codebase.
+"""
+
+from __future__ import annotations
+
+import re
+
+# Patterns for sensitive data that must never appear in logs.
+_SENSITIVE_PATTERNS: list[tuple[re.Pattern[str], str]] = [
+    # Passwords: password=PASS, password:PASS, password = PASS, etc.
+    (re.compile(r"password[=:]\S+", re.IGNORECASE), "password=***"),
+    # API keys: api_key=X, api-key=X, APIKEY=X, etc.
+    (re.compile(r"api[_-]?key[=:]\S+", re.IGNORECASE), "api_key=***"),
+    # Auth tokens: token=X, token: X, etc.
+    (re.compile(r"token[=:]\S+", re.IGNORECASE), "token=***"),
+    # Authorization headers: Authorization: Bearer <token>
+    (re.compile(r"(?i)Authorization:\s*(?:Bearer\s+)?\S+"), "Authorization: ***"),
+    # Secret values: secret=X, secret: X
+    (re.compile(r"secret[=:]\S+", re.IGNORECASE), "secret=***"),
+    # Private keys: -----BEGIN (RSA/DSA/EC/OPENSSH) PRIVATE KEY-----
+    (re.compile(r"-----BEGIN +(?:RSA|DSA|EC|OPENSSH) +PRIVATE KEY-----"), "*** PRIVATE KEY ***"),
+    # AWS access keys: AKIA...
+    (re.compile(r"AKIA[0-9A-Z]{16}"), "AKIA***"),
+    # Generic secret= patterns
+    (re.compile(r"secret[_-]?key[=:]\S+", re.IGNORECASE), "secret_key=***"),
+    # Bearer tokens in Authorization headers
+    (re.compile(r"Bearer\s+[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "Bearer ***"),
+]
+
+
+def sanitize_for_logging(text: str) -> str:
+    """Remove sensitive data patterns from text before logging.
+
+    Applies a set of regex substitutions to strip passwords, API keys,
+    tokens, secrets, private keys, and other credential-like strings.
+    Substituted values are replaced with a fixed redaction marker.
+
+    Args:
+        text: Raw text that may contain sensitive data.
+
+    Returns:
+        Text with sensitive patterns replaced by ``***``.
+    """
+    for pattern, replacement in _SENSITIVE_PATTERNS:
+        text = pattern.sub(replacement, text)
+    return text
--- a/backend/tests/test_tasks/test_blocklist_import.py
+++ b/backend/tests/test_tasks/test_blocklist_import.py
@@ -176,7 +176,10 @@ class TestRunImport:
            # Must not raise — the task swallows unexpected errors.
            await _run_import(app)

-        mock_log.exception.assert_called_once_with("blocklist_import_unexpected_error")
+        mock_log.exception.assert_called_once()
+        call_args = mock_log.exception.call_args
+        assert call_args[0][0] == "blocklist_import_unexpected_error"
+        assert "correlation_id" in call_args[1]


 # ---------------------------------------------------------------------------
--- a/backend/tests/test_utils/test_log_sanitizer.py
+++ b/backend/tests/test_utils/test_log_sanitizer.py
@@ -0,0 +1,150 @@
+"""Tests for app.utils.log_sanitizer."""
+
+from __future__ import annotations
+
+import pytest
+
+from app.utils.log_sanitizer import sanitize_for_logging
+
+
+class TestSanitizeForLogging:
+    """Tests for sanitize_for_logging()."""
+
+    def test_passthrough_clean_text(self) -> None:
+        """No sensitive patterns: text unchanged."""
+        text = "Server started on port 8000"
+        assert sanitize_for_logging(text) == text
+
+    def test_password_eq(self) -> None:
+        """password=X replaced."""
+        assert sanitize_for_logging("password=Secret123") == "password=***"
+
+    def test_password_colon(self) -> None:
+        """password:X replaced."""
+        assert sanitize_for_logging("password:Secret123") == "password=***"
+
+    def test_password_case_insensitive(self) -> None:
+        """Password matching is case-insensitive."""
+        assert sanitize_for_logging("PASSWORD=Secret123") == "password=***"
+        assert sanitize_for_logging("Password:Secret123") == "password=***"
+
+    def test_api_key_underscore(self) -> None:
+        """api_key=X replaced."""
+        assert sanitize_for_logging("api_key=my-secret-key") == "api_key=***"
+
+    def test_api_key_dash(self) -> None:
+        """api-key=X replaced."""
+        assert sanitize_for_logging("api-key=my-secret-key") == "api_key=***"
+
+    def test_api_key_no_separator(self) -> None:
+        """api_keyXYZ (no separator) is NOT matched by the api_key pattern.
+
+        The pattern requires =, :, _, or - after 'key', so plain 'api_keyXYZ'
+        passes through unsanitized. This is intentional — the risk of
+        false-positives (normal words like 'api_keyboard') outweighs coverage.
+        """
+        result = sanitize_for_logging("api_keyXYZ")
+        # No separator → not matched → unchanged
+        assert result == "api_keyXYZ"
+
+    def test_token_eq(self) -> None:
+        """token=X replaced."""
+        assert sanitize_for_logging("token=eyJhbGciOiJIUzI1NiJ9") == "token=***"
+
+    def test_token_case_insensitive(self) -> None:
+        """Token matching is case-insensitive."""
+        assert sanitize_for_logging("TOKEN=eyJhbGciOiJIUzI1NiJ9") == "token=***"
+
+    def test_authorization_bearer(self) -> None:
+        """Authorization: Bearer <token> replaced."""
+        result = sanitize_for_logging("Authorization: Bearer eyJhbGciOiJIUzI1NiJ9")
+        assert "***" in result
+        assert "Bearer" not in result
+
+    def test_authorization_bearer_only(self) -> None:
+        """Authorization: Bearer (no token) replaced."""
+        result = sanitize_for_logging("Authorization: Bearer")
+        assert "***" in result
+
+    def test_authorization_basic(self) -> None:
+        """Authorization: Basic replaced."""
+        result = sanitize_for_logging("Authorization: Basic dXNlcjpwYXNz")
+        assert "***" in result
+
+    def test_secret_eq(self) -> None:
+        """secret=X replaced."""
+        assert sanitize_for_logging("secret=my_secret_value") == "secret=***"
+
+    def test_secret_key_eq(self) -> None:
+        """secret_key=X replaced."""
+        assert sanitize_for_logging("secret_key=my_secret_value") == "secret_key=***"
+
+    def test_rsa_private_key(self) -> None:
+        """RSA private key header redacted."""
+        text = "Some text -----BEGIN RSA PRIVATE KEY-----\nMIIBogAAAAAAA="
+        result = sanitize_for_logging(text)
+        assert "*** PRIVATE KEY ***" in result
+        assert "BEGIN RSA PRIVATE KEY" not in result
+
+    def test_dsa_private_key(self) -> None:
+        """DSA private key header redacted."""
+        text = "Some text -----BEGIN DSA PRIVATE KEY-----\nMIIBogAAAAAAA="
+        result = sanitize_for_logging(text)
+        assert "*** PRIVATE KEY ***" in result
+        assert "BEGIN DSA PRIVATE KEY" not in result
+
+    def test_ec_private_key(self) -> None:
+        """EC private key header redacted."""
+        text = "Some text -----BEGIN EC PRIVATE KEY-----\nMIIBogAAAAAAA="
+        result = sanitize_for_logging(text)
+        assert "*** PRIVATE KEY ***" in result
+        assert "BEGIN EC PRIVATE KEY" not in result
+
+    def test_openssh_private_key(self) -> None:
+        """OPENSSH private key header redacted."""
+        text = "Some text -----BEGIN OPENSSH PRIVATE KEY-----\nMIIBogAAAAAAA="
+        result = sanitize_for_logging(text)
+        assert "*** PRIVATE KEY ***" in result
+        assert "BEGIN OPENSSH PRIVATE KEY" not in result
+
+    def test_aws_access_key(self) -> None:
+        """AKIA... AWS access key redacted."""
+        text = "Access key: AKIAIOSFODNN7EXAMPLE"
+        result = sanitize_for_logging(text)
+        assert "AKIA***" in result
+        assert "EXAMPLE" not in result
+
+    def test_bearer_jwt_token(self) -> None:
+        """Standalone JWT token after 'Bearer' is redacted."""
+        text = "Bearer eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U"
+        result = sanitize_for_logging(text)
+        assert result == "Bearer ***"
+
+    def test_multiple_sensitive_values(self) -> None:
+        """Multiple sensitive patterns on one line all redacted."""
+        text = "password=pass1 token=tok1 api_key=key1"
+        result = sanitize_for_logging(text)
+        assert result.count("***") == 3
+
+    def test_empty_string(self) -> None:
+        """Empty string returns empty string."""
+        assert sanitize_for_logging("") == ""
+
+    def test_only_sensitive_pattern(self) -> None:
+        """Text that is only a sensitive pattern returns only marker."""
+        assert sanitize_for_logging("password=secret") == "password=***"
+
+    def test_multiline_text(self) -> None:
+        """Sensitive data redacted across multiple lines."""
+        text = "Line 1: password=secret\nLine 2: token=tok123\nLine 3: clean"
+        result = sanitize_for_logging(text)
+        assert "***" in result
+        assert "secret" not in result
+        assert "tok123" not in result
+        assert "clean" in result
+
+    def test_jwt_structure_preserved(self) -> None:
+        """JWT dot-separated structure not matched as token=X."""
+        text = "header.payload.signature"
+        result = sanitize_for_logging(text)
+        assert result == text