Update observability docs and task utilities

- Add Observability.md documentation
- Standardize task logging with correlation_id support
- Add log_sanitizer utility for PII masking
- Update Tasks.md tracking
- Update geo_cache tasks and other task modules with correlation_id

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-05-03 11:52:09 +02:00
parent 7b93499551
commit 0133489920
17 changed files with 582 additions and 124 deletions

View File

@@ -62,6 +62,40 @@ log.info("password_check", password=password_hash) # Never!
Structlog provides context variable filtering to prevent accidental logging of sensitive data. Code reviews must verify compliance with this rule.
### Log Sanitization
All external output (subprocess results, API responses, config file contents) passed to structlog **must** be sanitized first using `sanitize_for_logging()` from `app.utils.log_sanitizer`.
This prevents sensitive data — passwords, API keys, tokens, private keys — from leaking into logs.
```python
from app.utils.log_sanitizer import sanitize_for_logging
# ✓ Correct: Sanitize before logging
log.error(
"fail2ban_start_failed",
command=" ".join(start_cmd_parts),
returncode=process.returncode,
stdout=sanitize_for_logging(stdout.decode("utf-8", errors="replace")),
stderr=sanitize_for_logging(stderr.decode("utf-8", errors="replace")),
)
# ✗ Wrong: Raw output may contain secrets
log.error("fail2ban_start_failed", stdout=stdout_raw, stderr=stderr_raw) # Never!
```
`sanitize_for_logging()` redacts the following patterns:
| Pattern | Example match | Replacement |
|---------|---------------|-------------|
| `password=X` | `password=Secret123` | `password=***` |
| `api_key=X` / `api-key=X` | `api_key=key123` | `api_key=***` |
| `token=X` | `token=eyJhbG...` | `token=***` |
| `Authorization: Bearer X` | `Authorization: Bearer tok...` | `Authorization: ***` |
| `secret=X` | `secret=myvalue` | `secret=***` |
| `-----BEGIN RSA PRIVATE KEY-----` | (key header) | `*** PRIVATE KEY ***` |
| `AKIA...` | `AKIAIOSFODNN7EXAMPLE` | `AKIA***` |
---
## Structured Logging Best Practices
@@ -102,6 +136,35 @@ log.info("user_action", action="create_jail") # Automatically includes correlat
structlog.contextvars.clear_contextvars()
```
### Background Task Correlation
Background tasks (APScheduler jobs) run outside the HTTP request context.
Use :mod:`app.utils.correlation` to propagate correlation IDs through tasks:
```python
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
async def my_background_task(correlation_id: str | None = None) -> None:
# Generate a new ID if not provided (scheduled tasks have no parent request)
if correlation_id is None:
import uuid
correlation_id = str(uuid.uuid4())
# Set the correlation ID for all logs in this task
token = set_correlation_id(correlation_id)
try:
log.info("task_started") # Now includes correlation_id
# ... task logic ...
finally:
reset_correlation_id(token)
# When scheduling, optionally pass the current correlation ID:
# scheduler.add_job(my_background_task, kwargs={"correlation_id": get_correlation_id()})
```
Scheduled tasks (no parent request) generate a fresh UUID for each run.
Tasks triggered by a request inherit the request's correlation ID.
### Event Naming Convention
Use snake_case for event names, prefixed with the component or module name:

View File

@@ -1,103 +1,3 @@
### Issue #18: MEDIUM - Configuration Validation Missing at Startup
**Where found**:
- `backend/app/config.py` (lines 37-95)
- `database_path` has no validation
- `fail2ban_socket` not verified to exist
- Hard-coded paths assumed in Docker
**Why this is needed**:
Configuration errors not caught at startup:
- Database path doesn't exist - fails on first DB operation (confusing error)
- fail2ban socket wrong path - only fails when health check runs
- Directory not writable - discovered hours after deployment
**Goal**:
Validate all configuration at startup with clear error messages.
**What to do**:
1. Add validators to config fields:
```python
@field_validator("database_path")
def validate_db_path(cls, v):
path = Path(v)
parent = path.parent
if not parent.exists():
raise ValueError(
f"Database parent directory does not exist: {parent}\n"
f"Create it with: mkdir -p {parent}"
)
if not os.access(parent, os.W_OK):
raise ValueError(
f"Database directory not writable: {parent}\n"
f"Fix with: chmod 755 {parent}"
)
return v
```
2. Validate fail2ban socket exists and is readable
3. Verify session secret is sufficiently long
4. Check environment variables are set
5. Provide actionable error messages
**Possible traps and issues**:
- Validation might be too strict for some deployments
- Need to handle cases where files don't exist yet but will be created
- Docker initialization order might delay file creation
**Docs changes needed**:
- Document required directories and permissions
- Create setup validation troubleshooting guide
**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "5.2 Missing Configuration Validation"
---
### Issue #19: MEDIUM - Sensitive Data Could Leak in Logs
**Where found**:
- `backend/app/utils/fail2ban_client.py` (line 148) - Logs subprocess output without sanitization
- Could contain passwords, API keys from config files
**Why this is needed**:
If subprocess output contains secrets, logs become security liability and violate compliance requirements.
**Goal**:
Sanitize logs to remove sensitive information patterns.
**What to do**:
1. Create sanitizer function:
```python
def sanitize_for_logging(text: str) -> str:
# Remove passwords
text = re.sub(r'password[=:]\S+', 'password=***', text, flags=re.IGNORECASE)
# Remove API keys
text = re.sub(r'api[_-]?key[=:]\S+', 'api_key=***', text, flags=re.IGNORECASE)
# Remove tokens
text = re.sub(r'token[=:]\S+', 'token=***', text, flags=re.IGNORECASE)
return text
```
2. Apply to all subprocess output and external responses
3. Add to logging middleware
4. Audit existing logs for sensitive data
**Possible traps and issues**:
- Patterns might miss some sensitive data formats
- Over-sanitization might hide helpful debug info
- Performance cost if sanitizing large outputs
**Docs changes needed**:
- Add logging best practices guide
- Document what's sanitized
**Doc references**:
- DETAILED_FINDINGS.md - Issue #25 "Sensitive Data in Logs"
---
### Issue #20: MEDIUM - No Correlation ID in Background Tasks
**Where found**:

View File

@@ -391,7 +391,7 @@ def schedule_blocklist_job(
settings: Settings,
http_session: aiohttp.ClientSession,
config: ScheduleConfig,
run_import_callback: Callable[[Settings, aiohttp.ClientSession], Awaitable[None]],
run_import_callback: Callable[..., Awaitable[None]],
) -> None:
"""Register or replace the scheduled blocklist import job."""

View File

@@ -9,10 +9,16 @@ be updated at runtime through the blocklist router.
The scheduler job ID is ``"blocklist_import"`` — using a stable id means
re-registering the job (e.g. after a schedule update) safely replaces the
existing entry without creating duplicates.
Correlation IDs are propagated through the task using
:mod:`app.utils.correlation` so that task logs can be correlated with the
triggering request (for manually triggered imports) or assigned a unique
ID per run (for scheduled imports).
"""
from __future__ import annotations
import uuid
from typing import TYPE_CHECKING, Any
import structlog
@@ -20,6 +26,7 @@ import structlog
from app.services import ban_service, blocklist_service
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
if TYPE_CHECKING:
@@ -37,18 +44,36 @@ JOB_ID: str = "blocklist_import"
TASK_TIMEOUT_SECONDS: int = 300
async def _run_import_with_resources(settings: Settings, http_session: ClientSession) -> None:
async def _run_import_with_resources(
settings: Settings,
http_session: ClientSession,
correlation_id: str | None = None,
) -> None:
"""APScheduler callback that imports all enabled blocklist sources.
Args:
settings: The resolved application settings used for database access.
http_session: The shared aiohttp session used for blocklist downloads.
correlation_id: Optional correlation ID from the triggering request.
If ``None``, a new UUID4 is generated to identify this run.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_import_with_settings(settings, http_session)
finally:
reset_correlation_id(token)
async def _do_import_with_settings(settings: Settings, http_session: ClientSession) -> None:
"""Inner import logic that runs with correlation context set."""
async def _do_import() -> None:
socket_path: str = settings.fail2ban_socket
log.info("blocklist_import_starting")
log.info("blocklist_import_starting", correlation_id=get_correlation_id())
try:
async with task_db(settings) as db:
result = await blocklist_service.import_all(
@@ -59,12 +84,13 @@ async def _run_import_with_resources(settings: Settings, http_session: ClientSes
)
log.info(
"blocklist_import_finished",
correlation_id=get_correlation_id(),
total_imported=result.total_imported,
total_skipped=result.total_skipped,
errors=result.errors_count,
)
except Exception:
log.exception("blocklist_import_unexpected_error")
log.exception("blocklist_import_unexpected_error", correlation_id=get_correlation_id())
await run_with_timeout("blocklist_import", _do_import(), TASK_TIMEOUT_SECONDS)

View File

@@ -7,10 +7,14 @@ database file and maintains query performance on geo lookups.
When a stale IP is encountered again after purge, it will be re-resolved from
the MaxMind database or ip-api.com (if configured), which is acceptable.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import uuid
from datetime import UTC, datetime, timedelta
from typing import TYPE_CHECKING
@@ -19,6 +23,7 @@ import structlog
from app.repositories import geo_cache_repo
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
if TYPE_CHECKING:
@@ -41,7 +46,10 @@ JOB_ID: str = "geo_cache_cleanup"
TASK_TIMEOUT_SECONDS: int = 60
async def _run_cleanup_with_resources(settings: Settings) -> None:
async def _run_cleanup_with_resources(
settings: Settings,
correlation_id: str | None = None,
) -> None:
"""Delete stale entries from the geo cache.
Calculates a cutoff timestamp (now - retention period) and removes all
@@ -49,7 +57,20 @@ async def _run_cleanup_with_resources(settings: Settings) -> None:
Args:
settings: The resolved application settings used for database access.
correlation_id: Optional correlation ID for log correlation.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_cleanup_with_settings(settings)
finally:
reset_correlation_id(token)
async def _do_cleanup_with_settings(settings: Settings) -> None:
"""Inner cleanup logic that runs with correlation context set."""
async def _do_cleanup() -> None:
cutoff_dt = datetime.now(UTC) - timedelta(days=GEO_CACHE_RETENTION_DAYS)
@@ -60,9 +81,19 @@ async def _run_cleanup_with_resources(settings: Settings) -> None:
await db.commit()
if deleted > 0:
log.info("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
log.info(
"geo_cache_cleanup_ran",
correlation_id=get_correlation_id(),
deleted=deleted,
retention_days=GEO_CACHE_RETENTION_DAYS,
)
else:
log.debug("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
log.debug(
"geo_cache_cleanup_ran",
correlation_id=get_correlation_id(),
deleted=deleted,
retention_days=GEO_CACHE_RETENTION_DAYS,
)
await run_with_timeout("geo_cache_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)

View File

@@ -7,16 +7,21 @@ After Task 2 removed geo cache writes from GET requests, newly resolved IPs
are only held in the in-memory cache until this task flushes them. With the
default 60-second interval, at most one minute of new resolution results is
at risk on an unexpected process restart.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import uuid
from typing import TYPE_CHECKING
import structlog
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
if TYPE_CHECKING:
@@ -37,20 +42,37 @@ JOB_ID: str = "geo_cache_flush"
TASK_TIMEOUT_SECONDS: int = 60
async def _run_flush_with_resources(geo_cache: GeoCache, settings: Settings) -> None:
async def _run_flush_with_resources(
geo_cache: GeoCache,
settings: Settings,
correlation_id: str | None = None,
) -> None:
"""Flush the geo cache dirty set to the application database.
Args:
geo_cache: The application's GeoCache instance.
settings: The resolved application settings used for database access.
correlation_id: Optional correlation ID for log correlation.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_flush_with_resources(geo_cache, settings)
finally:
reset_correlation_id(token)
async def _do_flush_with_resources(geo_cache: GeoCache, settings: Settings) -> None:
"""Inner flush logic that runs with correlation context set."""
async def _do_flush() -> None:
async with task_db(settings) as db:
count = await geo_cache.flush_dirty(db)
if count > 0:
log.debug("geo_cache_flush_ran", flushed=count)
log.debug("geo_cache_flush_ran", correlation_id=get_correlation_id(), flushed=count)
await run_with_timeout("geo_cache_flush", _do_flush(), TASK_TIMEOUT_SECONDS)

View File

@@ -13,16 +13,21 @@ The task runs every 10 minutes. On each invocation it:
3. Delegates to :meth:`~app.services.geo_cache.GeoCache.lookup_batch` which
already handles rate-limit throttling and retries.
4. Logs how many IPs were retried and how many resolved successfully.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import uuid
from typing import TYPE_CHECKING
import structlog
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
if TYPE_CHECKING:
@@ -45,7 +50,10 @@ TASK_TIMEOUT_SECONDS: int = 120
async def _run_re_resolve_with_resources(
geo_cache: GeoCache, settings: Settings, http_session: ClientSession
geo_cache: GeoCache,
settings: Settings,
http_session: ClientSession,
correlation_id: str | None = None,
) -> None:
"""Query NULL-country IPs from the database and re-resolve them.
@@ -53,7 +61,24 @@ async def _run_re_resolve_with_resources(
geo_cache: The application's GeoCache instance.
settings: The resolved application settings used for database access.
http_session: The shared aiohttp session used for external lookups.
correlation_id: Optional correlation ID for log correlation.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_re_resolve_with_resources(geo_cache, settings, http_session)
finally:
reset_correlation_id(token)
async def _do_re_resolve_with_resources(
geo_cache: GeoCache,
settings: Settings,
http_session: ClientSession,
) -> None:
"""Inner re-resolve logic that runs with correlation context set."""
async def _do_re_resolve() -> None:
async with task_db(settings) as db:
@@ -61,10 +86,10 @@ async def _run_re_resolve_with_resources(
unresolved_ips = await geo_cache.get_unresolved_ips(db)
if not unresolved_ips:
log.debug("geo_re_resolve_skip", reason="no_unresolved_ips")
log.debug("geo_re_resolve_skip", correlation_id=get_correlation_id(), reason="no_unresolved_ips")
return
log.info("geo_re_resolve_start", unresolved=len(unresolved_ips))
log.info("geo_re_resolve_start", correlation_id=get_correlation_id(), unresolved=len(unresolved_ips))
# Clear the negative cache so these IPs are eligible for fresh API calls.
await geo_cache.clear_neg_cache()
@@ -78,6 +103,7 @@ async def _run_re_resolve_with_resources(
)
log.info(
"geo_re_resolve_complete",
correlation_id=get_correlation_id(),
retried=len(unresolved_ips),
resolved=resolved_count,
)

View File

@@ -13,11 +13,15 @@ keys). If the health probe subsequently detects an online→offline transition
within 60 seconds of that activation, a
:class:`~app.models.config.PendingRecovery` record is written to
``app.state.pending_recovery`` so the UI can offer a one-click rollback.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import datetime
import uuid
from typing import TYPE_CHECKING
import structlog
@@ -25,6 +29,7 @@ import structlog
from app.models.server import ServerStatus
from app.services import health_service
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import reset_correlation_id, set_correlation_id
from app.utils.runtime_state import (
RuntimeState,
get_effective_settings,
@@ -47,13 +52,30 @@ HEALTH_CHECK_INTERVAL: int = 30
HEALTH_PROBE_TIMEOUT_SECONDS: int = 10
async def _run_probe_with_resources(settings: Settings, runtime_state: RuntimeState) -> None:
async def _run_probe_with_resources(
settings: Settings,
runtime_state: RuntimeState,
correlation_id: str | None = None,
) -> None:
"""Probe fail2ban and cache the result on the runtime state.
Args:
settings: The resolved application settings used for the probe.
runtime_state: The mutable runtime state manager.
correlation_id: Optional correlation ID for log correlation.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_probe_with_resources(settings, runtime_state)
finally:
reset_correlation_id(token)
async def _do_probe_with_resources(settings: Settings, runtime_state: RuntimeState) -> None:
"""Inner probe logic that runs with correlation context set."""
async def _do_probe() -> None:
socket_path: str = settings.fail2ban_socket

View File

@@ -2,11 +2,15 @@
Periodically copies new records from the fail2ban sqlite database into the
BanGUI application archive table to prevent gaps when fail2ban purges old rows.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import datetime
import uuid
from typing import TYPE_CHECKING
import structlog
@@ -14,6 +18,7 @@ import structlog
from app.services import history_service
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
if TYPE_CHECKING:
@@ -36,16 +41,32 @@ BACKFILL_WINDOW: int = 648000
TASK_TIMEOUT_SECONDS: int = 60
async def _run_sync_with_settings(settings: Settings) -> None:
async def _run_sync_with_settings(
settings: Settings,
correlation_id: str | None = None,
) -> None:
"""Run the history sync with correlation ID context."""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_sync_with_settings(settings)
finally:
reset_correlation_id(token)
async def _do_sync_with_settings(settings: Settings) -> None:
"""Inner sync logic that runs with correlation context set."""
socket_path: str = settings.fail2ban_socket
async def _do_sync() -> None:
try:
async with task_db(settings) as db:
synced = await history_service.sync_from_fail2ban_db(db, socket_path)
log.info("history_sync_complete", synced=synced)
log.info("history_sync_complete", correlation_id=get_correlation_id(), synced=synced)
except Exception:
log.exception("history_sync_failed")
log.exception("history_sync_failed", correlation_id=get_correlation_id())
await run_with_timeout("history_sync", _do_sync(), TASK_TIMEOUT_SECONDS)

View File

@@ -8,15 +8,20 @@ consuming excessive memory.
The cleanup is conservative: it only removes IPs with no recent attempts
(all timestamps outside the rate-limit window), so active or recently-active
IPs are preserved.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import uuid
from typing import TYPE_CHECKING
import structlog
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
if TYPE_CHECKING:
from fastapi import FastAPI
@@ -35,7 +40,10 @@ JOB_ID: str = "rate_limiter_cleanup"
TASK_TIMEOUT_SECONDS: int = 5
async def _run_cleanup(app: FastAPI) -> None:
async def _run_cleanup(
app: FastAPI,
correlation_id: str | None = None,
) -> None:
"""Trigger cleanup of expired rate-limiter entries.
Cleans up both the login-specific rate limiter (exponential backoff)
@@ -43,13 +51,27 @@ async def _run_cleanup(app: FastAPI) -> None:
Args:
app: The FastAPI application instance (holds the rate limiters).
correlation_id: Optional correlation ID for log correlation.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_cleanup_with_app(app)
finally:
reset_correlation_id(token)
async def _do_cleanup_with_app(app: FastAPI) -> None:
"""Inner cleanup logic that runs with correlation context set."""
async def _do_cleanup() -> None:
login_limiter = getattr(app.state, "login_rate_limiter", None)
if login_limiter is None:
log.warning(
"rate_limiter_cleanup_skipped",
correlation_id=get_correlation_id(),
reason="login_rate_limiter not found on app.state",
)
else:
@@ -59,6 +81,7 @@ async def _run_cleanup(app: FastAPI) -> None:
if global_limiter is None:
log.warning(
"rate_limiter_cleanup_skipped",
correlation_id=get_correlation_id(),
reason="global_rate_limiter not found on app.state",
)
else:

View File

@@ -7,16 +7,21 @@ if the running instance experiences temporary delays or high load.
Without this heartbeat, stale lock detection (based on TTL) could incorrectly
determine that the scheduler instance has crashed when it's merely busy, and
a new instance could take over.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import uuid
from typing import TYPE_CHECKING
import structlog
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
from app.utils.scheduler_lock import update_scheduler_lock_heartbeat
@@ -38,7 +43,10 @@ JOB_ID: str = "scheduler_lock_heartbeat"
TASK_TIMEOUT_SECONDS: int = 5
async def _update_heartbeat_with_resources(settings: Settings) -> None:
async def _update_heartbeat_with_resources(
settings: Settings,
correlation_id: str | None = None,
) -> None:
"""Update the scheduler lock heartbeat timestamp.
If the heartbeat update fails (e.g., we no longer hold the lock), log
@@ -51,17 +59,31 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:
Args:
settings: The resolved application settings used for database access.
correlation_id: Optional correlation ID from the triggering request.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_update_heartbeat_with_settings(settings)
finally:
reset_correlation_id(token)
async def _do_update_heartbeat_with_settings(settings: Settings) -> None:
"""Inner heartbeat logic that runs with correlation context set."""
async def _do_update() -> None:
async with task_db(settings) as db:
success = await update_scheduler_lock_heartbeat(db)
if success:
log.debug("scheduler_lock_heartbeat_updated")
log.debug("scheduler_lock_heartbeat_updated", correlation_id=get_correlation_id())
else:
log.warning(
"scheduler_lock_heartbeat_failed",
correlation_id=get_correlation_id(),
message="Failed to update heartbeat; we no longer hold the lock. "
"Another instance may have taken over or the database connection failed.",
)
@@ -71,12 +93,14 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:
except TimeoutError:
log.error(
"scheduler_lock_heartbeat_timeout",
correlation_id=get_correlation_id(),
timeout_seconds=TASK_TIMEOUT_SECONDS,
message="Heartbeat update exceeded timeout. The database may be slow or unresponsive.",
)
except Exception as e:
log.error(
"scheduler_lock_heartbeat_error",
correlation_id=get_correlation_id(),
error=str(e),
message="Unexpected error during heartbeat update.",
)

View File

@@ -6,10 +6,14 @@ months of operation and degrades query performance.
Individual expired sessions are removed on-demand when validated, but the bulk
cleanup ensures comprehensive pruning at a predictable interval.
Correlation IDs are propagated through the task using :mod:`app.utils.correlation`
so that task logs can be correlated across runs.
"""
from __future__ import annotations
import uuid
from typing import TYPE_CHECKING
import structlog
@@ -17,6 +21,7 @@ import structlog
from app.repositories import session_repo
from app.tasks.db import task_db
from app.tasks.timeout_utils import run_with_timeout
from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id
from app.utils.runtime_state import get_effective_settings
from app.utils.time_utils import utc_now
@@ -37,19 +42,40 @@ JOB_ID: str = "session_cleanup"
TASK_TIMEOUT_SECONDS: int = 30
async def _run_cleanup_with_resources(settings: Settings) -> None:
async def _run_cleanup_with_resources(
settings: Settings,
correlation_id: str | None = None,
) -> None:
"""Delete all expired sessions from the database.
Args:
settings: The resolved application settings used for database access.
correlation_id: Optional correlation ID from the triggering request.
"""
if correlation_id is None:
correlation_id = str(uuid.uuid4())
token = set_correlation_id(correlation_id)
try:
await _do_cleanup_with_settings(settings)
finally:
reset_correlation_id(token)
async def _do_cleanup_with_settings(settings: Settings) -> None:
"""Inner cleanup logic that runs with correlation context set."""
async def _do_cleanup() -> None:
now_iso = utc_now().isoformat()
async with task_db(settings) as db:
deleted_count = await session_repo.delete_expired_sessions(db, now_iso)
log.info("session_cleanup_ran", deleted_count=deleted_count, cutoff_time=now_iso)
log.info(
"session_cleanup_ran",
correlation_id=get_correlation_id(),
deleted_count=deleted_count,
cutoff_time=now_iso,
)
await run_with_timeout("session_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)

View File

@@ -29,6 +29,7 @@ from app.utils.fail2ban_client import (
Fail2BanResponse,
)
from app.utils.fail2ban_response import ok, to_dict
from app.utils.log_sanitizer import sanitize_for_logging
log: structlog.stdlib.BoundLogger = structlog.get_logger()
@@ -500,8 +501,8 @@ async def start_daemon(start_cmd_parts: list[str]) -> bool:
"fail2ban_start_failed",
command=" ".join(start_cmd_parts),
returncode=process.returncode,
stdout=stdout.decode("utf-8", errors="replace"),
stderr=stderr.decode("utf-8", errors="replace"),
stdout=sanitize_for_logging(stdout.decode("utf-8", errors="replace")),
stderr=sanitize_for_logging(stderr.decode("utf-8", errors="replace")),
)
return False
log.info(

View File

@@ -0,0 +1,70 @@
"""Correlation ID context variable for distributed tracing.
This module provides a :class:`contextvars.ContextVar` that stores the correlation
ID for the current execution context. Background tasks inherit or explicitly set
the correlation ID to enable log correlation across request/task boundaries.
Usage in background tasks::
from app.utils.correlation import get_correlation_id, set_correlation_id
async def my_background_task(correlation_id: str) -> None:
token = set_correlation_id(correlation_id)
try:
log.info("task_started") # Logs include correlation_id
finally:
reset_correlation_id(token)
For APScheduler jobs, pass the correlation ID through kwargs and call
:func:`set_correlation_id` at the start of the async callback.
"""
from __future__ import annotations
from contextvars import ContextVar
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from contextvars import Token
#: Default correlation ID used when no parent request context exists.
_DEFAULT_BG_CORRELATION_ID: str = "bg-task"
#: ContextVar holding the current correlation ID.
correlation_id_var: ContextVar[str] = ContextVar(
"correlation_id",
default=_DEFAULT_BG_CORRELATION_ID,
)
def get_correlation_id() -> str:
"""Return the current correlation ID from context.
Returns:
The current correlation ID string. Returns ``"bg-task"`` if no
correlation ID has been set in the current context.
"""
return correlation_id_var.get()
def set_correlation_id(correlation_id: str) -> Token[str]:
"""Set the correlation ID for the current context.
Args:
correlation_id: The correlation ID to bind (typically passed from
the triggering request).
Returns:
A :class:`contextvars.Token` that must be passed to
:func:`reset_correlation_id` when the task completes.
"""
return correlation_id_var.set(correlation_id)
def reset_correlation_id(token: Token[str]) -> None:
"""Reset the correlation ID to its previous value.
Args:
token: The token returned by :func:`set_correlation_id`.
"""
correlation_id_var.reset(token)

View File

@@ -0,0 +1,50 @@
"""Log sanitization utilities for preventing sensitive data leakage.
All external output (subprocess, API responses, config data) passed to
structlog MUST be sanitized first. This module provides the canonical
sanitize_for_logging() function used across the codebase.
"""
from __future__ import annotations
import re
# Patterns for sensitive data that must never appear in logs.
_SENSITIVE_PATTERNS: list[tuple[re.Pattern[str], str]] = [
# Passwords: password=PASS, password:PASS, password = PASS, etc.
(re.compile(r"password[=:]\S+", re.IGNORECASE), "password=***"),
# API keys: api_key=X, api-key=X, APIKEY=X, etc.
(re.compile(r"api[_-]?key[=:]\S+", re.IGNORECASE), "api_key=***"),
# Auth tokens: token=X, token: X, etc.
(re.compile(r"token[=:]\S+", re.IGNORECASE), "token=***"),
# Authorization headers: Authorization: Bearer <token>
(re.compile(r"(?i)Authorization:\s*(?:Bearer\s+)?\S+"), "Authorization: ***"),
# Secret values: secret=X, secret: X
(re.compile(r"secret[=:]\S+", re.IGNORECASE), "secret=***"),
# Private keys: -----BEGIN (RSA/DSA/EC/OPENSSH) PRIVATE KEY-----
(re.compile(r"-----BEGIN +(?:RSA|DSA|EC|OPENSSH) +PRIVATE KEY-----"), "*** PRIVATE KEY ***"),
# AWS access keys: AKIA...
(re.compile(r"AKIA[0-9A-Z]{16}"), "AKIA***"),
# Generic secret= patterns
(re.compile(r"secret[_-]?key[=:]\S+", re.IGNORECASE), "secret_key=***"),
# Bearer tokens in Authorization headers
(re.compile(r"Bearer\s+[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "Bearer ***"),
]
def sanitize_for_logging(text: str) -> str:
"""Remove sensitive data patterns from text before logging.
Applies a set of regex substitutions to strip passwords, API keys,
tokens, secrets, private keys, and other credential-like strings.
Substituted values are replaced with a fixed redaction marker.
Args:
text: Raw text that may contain sensitive data.
Returns:
Text with sensitive patterns replaced by ``***``.
"""
for pattern, replacement in _SENSITIVE_PATTERNS:
text = pattern.sub(replacement, text)
return text

View File

@@ -176,7 +176,10 @@ class TestRunImport:
# Must not raise — the task swallows unexpected errors.
await _run_import(app)
mock_log.exception.assert_called_once_with("blocklist_import_unexpected_error")
mock_log.exception.assert_called_once()
call_args = mock_log.exception.call_args
assert call_args[0][0] == "blocklist_import_unexpected_error"
assert "correlation_id" in call_args[1]
# ---------------------------------------------------------------------------

View File

@@ -0,0 +1,150 @@
"""Tests for app.utils.log_sanitizer."""
from __future__ import annotations
import pytest
from app.utils.log_sanitizer import sanitize_for_logging
class TestSanitizeForLogging:
"""Tests for sanitize_for_logging()."""
def test_passthrough_clean_text(self) -> None:
"""No sensitive patterns: text unchanged."""
text = "Server started on port 8000"
assert sanitize_for_logging(text) == text
def test_password_eq(self) -> None:
"""password=X replaced."""
assert sanitize_for_logging("password=Secret123") == "password=***"
def test_password_colon(self) -> None:
"""password:X replaced."""
assert sanitize_for_logging("password:Secret123") == "password=***"
def test_password_case_insensitive(self) -> None:
"""Password matching is case-insensitive."""
assert sanitize_for_logging("PASSWORD=Secret123") == "password=***"
assert sanitize_for_logging("Password:Secret123") == "password=***"
def test_api_key_underscore(self) -> None:
"""api_key=X replaced."""
assert sanitize_for_logging("api_key=my-secret-key") == "api_key=***"
def test_api_key_dash(self) -> None:
"""api-key=X replaced."""
assert sanitize_for_logging("api-key=my-secret-key") == "api_key=***"
def test_api_key_no_separator(self) -> None:
"""api_keyXYZ (no separator) is NOT matched by the api_key pattern.
The pattern requires =, :, _, or - after 'key', so plain 'api_keyXYZ'
passes through unsanitized. This is intentional — the risk of
false-positives (normal words like 'api_keyboard') outweighs coverage.
"""
result = sanitize_for_logging("api_keyXYZ")
# No separator → not matched → unchanged
assert result == "api_keyXYZ"
def test_token_eq(self) -> None:
"""token=X replaced."""
assert sanitize_for_logging("token=eyJhbGciOiJIUzI1NiJ9") == "token=***"
def test_token_case_insensitive(self) -> None:
"""Token matching is case-insensitive."""
assert sanitize_for_logging("TOKEN=eyJhbGciOiJIUzI1NiJ9") == "token=***"
def test_authorization_bearer(self) -> None:
"""Authorization: Bearer <token> replaced."""
result = sanitize_for_logging("Authorization: Bearer eyJhbGciOiJIUzI1NiJ9")
assert "***" in result
assert "Bearer" not in result
def test_authorization_bearer_only(self) -> None:
"""Authorization: Bearer (no token) replaced."""
result = sanitize_for_logging("Authorization: Bearer")
assert "***" in result
def test_authorization_basic(self) -> None:
"""Authorization: Basic replaced."""
result = sanitize_for_logging("Authorization: Basic dXNlcjpwYXNz")
assert "***" in result
def test_secret_eq(self) -> None:
"""secret=X replaced."""
assert sanitize_for_logging("secret=my_secret_value") == "secret=***"
def test_secret_key_eq(self) -> None:
"""secret_key=X replaced."""
assert sanitize_for_logging("secret_key=my_secret_value") == "secret_key=***"
def test_rsa_private_key(self) -> None:
"""RSA private key header redacted."""
text = "Some text -----BEGIN RSA PRIVATE KEY-----\nMIIBogAAAAAAA="
result = sanitize_for_logging(text)
assert "*** PRIVATE KEY ***" in result
assert "BEGIN RSA PRIVATE KEY" not in result
def test_dsa_private_key(self) -> None:
"""DSA private key header redacted."""
text = "Some text -----BEGIN DSA PRIVATE KEY-----\nMIIBogAAAAAAA="
result = sanitize_for_logging(text)
assert "*** PRIVATE KEY ***" in result
assert "BEGIN DSA PRIVATE KEY" not in result
def test_ec_private_key(self) -> None:
"""EC private key header redacted."""
text = "Some text -----BEGIN EC PRIVATE KEY-----\nMIIBogAAAAAAA="
result = sanitize_for_logging(text)
assert "*** PRIVATE KEY ***" in result
assert "BEGIN EC PRIVATE KEY" not in result
def test_openssh_private_key(self) -> None:
"""OPENSSH private key header redacted."""
text = "Some text -----BEGIN OPENSSH PRIVATE KEY-----\nMIIBogAAAAAAA="
result = sanitize_for_logging(text)
assert "*** PRIVATE KEY ***" in result
assert "BEGIN OPENSSH PRIVATE KEY" not in result
def test_aws_access_key(self) -> None:
"""AKIA... AWS access key redacted."""
text = "Access key: AKIAIOSFODNN7EXAMPLE"
result = sanitize_for_logging(text)
assert "AKIA***" in result
assert "EXAMPLE" not in result
def test_bearer_jwt_token(self) -> None:
"""Standalone JWT token after 'Bearer' is redacted."""
text = "Bearer eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U"
result = sanitize_for_logging(text)
assert result == "Bearer ***"
def test_multiple_sensitive_values(self) -> None:
"""Multiple sensitive patterns on one line all redacted."""
text = "password=pass1 token=tok1 api_key=key1"
result = sanitize_for_logging(text)
assert result.count("***") == 3
def test_empty_string(self) -> None:
"""Empty string returns empty string."""
assert sanitize_for_logging("") == ""
def test_only_sensitive_pattern(self) -> None:
"""Text that is only a sensitive pattern returns only marker."""
assert sanitize_for_logging("password=secret") == "password=***"
def test_multiline_text(self) -> None:
"""Sensitive data redacted across multiple lines."""
text = "Line 1: password=secret\nLine 2: token=tok123\nLine 3: clean"
result = sanitize_for_logging(text)
assert "***" in result
assert "secret" not in result
assert "tok123" not in result
assert "clean" in result
def test_jwt_structure_preserved(self) -> None:
"""JWT dot-separated structure not matched as token=X."""
text = "header.payload.signature"
result = sanitize_for_logging(text)
assert result == text