diff --git a/Docs/Backend-Development.md b/Docs/Backend-Development.md index c98233e..465c1db 100644 --- a/Docs/Backend-Development.md +++ b/Docs/Backend-Development.md @@ -705,35 +705,198 @@ async def test_migration_2_is_atomic(tmp_path: Path) -> None: --- -## 7. Logging +## 7. Structured Logging Policy -- Use **structlog** for every log message. -- Bind contextual key-value pairs — never format strings manually. -- Log levels: `debug` for development detail, `info` for operational events, `warning` for recoverable issues, `error` for failures, `critical` for fatal problems. -- **Never log sensitive data** (passwords, tokens, session tokens, raw credentials, private keys). - - For session correlation without leaking token material, use a one-way hash fragment: `hashlib.sha256(token.encode()).hexdigest()[:12]`. - - Use numeric database IDs for entity correlation instead of raw identifiers: `session_id=session.id` instead of `token=session.token`. +All logging in BanGUI services and tasks must use **structlog** for consistent, queryable event tracking. This policy defines when and how to log at each level. + +### 7.1 Logging Levels and When to Use Them + +**INFO** — User-facing operations and state changes +- Use for significant operational events that the operator should know about. +- Examples: service startup/completion, resource creation/deletion, state transitions, successful operations with business impact. +- Never excessive — keep volume reasonable to maintain log clarity. +- Include relevant context: resource IDs, counts, configuration changes. ```python -import structlog -import hashlib +log.info("jail_activated", jail=name) +log.info("blocklist_source_created", id=new_id, name=name, url=url) +log.info("session_cleanup_ran", deleted_count=count, cutoff_time=now_iso) +``` -log: structlog.stdlib.BoundLogger = structlog.get_logger() +**WARNING** — Recoverable failures, degraded functionality, unexpected but handled conditions +- Use for issues that are not fatal but indicate something is wrong or suboptimal. +- Examples: missing optional config, fallback behavior triggered, non-critical API call failed, parsing errors, missing resources that have workarounds. +- Include error details and context to enable investigation. +- Do NOT use for expected error paths (e.g., wrong password in login attempt). -async def ban_ip(ip: str, jail: str) -> None: - log.info("banning_ip", ip=ip, jail=jail) +```python +log.warning("jail_status_parse_error", jail=name, error=str(exc)) +log.warning("geo_lookup_failed", ip=ip, error=type(exc).__name__) +log.warning("geoip_mmdb_not_found", path=mmdb_path) +``` + +**ERROR** — Fatal/unrecoverable failures within a request or task +- Use for errors that prevent an operation from completing. +- Examples: database write failed, critical resource is missing, state is corrupted. +- Include the full error context to enable debugging. +- Pair with exception handling — if you log an error, you've decided the caller should handle it or return a failure. + +```python +log.error("jail_activation_rollback_restore_failed", jail=name, error=str(exc)) +log.error("fail2ban_probe_parse_error", error=str(exc)) +``` + +**EXCEPTION** — Unhandled exceptions in background tasks and scheduled jobs +- Use `log.exception()` (not `log.error()`) in catch-all exception handlers to automatically capture the full traceback. +- This level is ONLY for surprises that should never happen in production. +- Use in task callback exception handlers and top-level task runners. + +```python +try: + result = await blocklist_service.import_all(...) +except Exception: + log.exception("blocklist_import_unexpected_error") +``` + +**DEBUG** — Low-level details for development and troubleshooting +- Use for details too verbose for normal operation (e.g., successful lookups, parsed files, state transitions, loop iterations). +- Include data samples and validation results. +- Safe to leave in the code — debug logs are not emitted by default in production. + +```python +log.debug("geo_lookup_success_mmdb", ip=ip, country=result.country_code) +log.debug("action_file_parsed", name=raw.name) +log.debug("backend_cmd_supported_detected") +``` + +### 7.2 Event Naming Convention + +All log event names must follow a consistent pattern for queryability: + +**Pattern:** `{domain}_{entity_or_action}[_{result_or_detail}]` + +- **Domain** — The service or feature area (e.g., `jail`, `blocklist`, `geo`, `auth`, `ban`). +- **Entity or Action** — The noun or verb describing what happened (e.g., `activated`, `created`, `failed`, `synced`). +- **Result or Detail** (optional) — Additional specificity for complex scenarios (e.g., `_restore_failed`, `_non_200`, `_no_fallback`). + +**Examples (organized by domain):** + +| Domain | INFO | WARNING | ERROR | DEBUG | +|--------|------|---------|-------|-------| +| **jail** | `jail_activated`, `jail_deactivated`, `jail_reloaded` | `jail_status_parse_error`, `jail_rollback_failed` | `jail_activation_rollback_restore_failed` | `jail_file_parsed`, `backend_cmd_supported_detected` | +| **geo** | `geo_cache_loaded_from_db`, `geo_flush_dirty_complete` | `geo_lookup_failed`, `geo_persist_failed` | - | `geo_lookup_success_mmdb`, `geo_cache_cleanup_ran` | +| **blocklist** | `blocklist_import_starting`, `blocklist_source_created` | `blocklist_schedule_invalid`, `blocklist_preview_failed` | - | `blocklist_ban_failed` | +| **ban** | `active_bans_fetched`, `all_ips_unbanned` | `ban_service_geo_lookup_failed` | `ban_service_geo_lookup_unexpected_error` | `ban_entry_parse_error` | +| **auth** | `bangui_login_success`, `bangui_logout` | `bangui_login_wrong_password`, `bangui_login_no_hash` | - | - | +| **config** | `filter_created`, `action_updated` | `filter_read_error`, `action_d_not_found` | - | `filter_file_parsed` | +| **setup** | `bangui_setup_started`, `bangui_setup_completed` | - | `bangui_setup_failed` | - | + +**Key rules:** +- Use **snake_case** (never camelCase or PascalCase). +- Keep event names short but descriptive — aim for 2–4 words. +- Use consistent terminology across the codebase (e.g., always `_created`, never `_added` or `_new`). +- Prefix task/background job events with the job name (e.g., `blocklist_import_starting`, `session_cleanup_ran`). + +### 7.3 Structured Context and Key-Value Pairs + +Always log with **structured context** — key-value pairs that make logs queryable and analyzable. + +**Essential patterns:** + +```python +# Operation with count +log.info("active_bans_fetched", total=len(bans)) + +# Resource manipulation with ID +log.info("blocklist_source_deleted", id=source_id) + +# State transition with reason or error +log.warning("geo_persist_failed", ip=ip, error=type(exc).__name__) + +# Time-bounded operation +log.info("session_cleanup_ran", deleted_count=count, cutoff_time=now_iso) + +# Config or feature change +log.info("blocklist_schedule_updated", frequency=config.frequency, hour=config.hour) + +# Batch operation with metrics +log.info("blocklist_import_finished", total_imported=result.total_imported, + total_skipped=result.total_skipped, errors=result.errors_count) +``` + +**What to include:** +- Resource IDs (jail names, IP addresses, source IDs). +- Counts and metrics (rows processed, items synced, errors). +- Configuration or decision points (enabled/disabled flags, thresholds). +- Timestamps and durations for long-running operations (optional but useful). +- Error types and short error messages (use `type(exc).__name__` or `str(exc)` depending on context). + +**What to NEVER log:** +- Sensitive data: passwords, tokens, session IDs, API keys, hashes. + - For session correlation without leaking token material, use a one-way hash fragment: `hashlib.sha256(token.encode()).hexdigest()[:12]`. + - Use numeric database IDs for entity correlation instead of raw identifiers: `session_id=session.id` instead of `token=session.token`. +- Full exception tracebacks in INFO/WARNING (use `log.exception()` only in catch-all handlers). +- Redundant system time (structlog adds `timestamp` automatically). +- User PII in most cases (name, email, phone, etc.) — unless directly relevant to debugging and anonymized. + +### 7.4 Background Tasks and Scheduled Jobs + +Every background task (APScheduler job) must follow this pattern: + +1. **On startup:** Log `{job_name}_scheduled` with interval or cron expression. +2. **On execution:** Log `{job_name}_starting` (INFO) and `{job_name}_finished` or `{job_name}_ran` (INFO) with metrics. +3. **On exception:** Use `log.exception("{job_name}_unexpected_error")` in the top-level try/except. + +```python +# app/tasks/blocklist_import.py +async def _run_import_with_resources(settings: Settings, http_session: ClientSession) -> None: + """APScheduler callback that imports all enabled blocklist sources.""" + log.info("blocklist_import_starting") try: - await _execute_ban(ip, jail) - log.info("ip_banned", ip=ip, jail=jail) - except BanError as exc: - log.error("ban_failed", ip=ip, jail=jail, error=str(exc)) - raise + result = await blocklist_service.import_all(...) + log.info("blocklist_import_finished", + total_imported=result.total_imported, + total_skipped=result.total_skipped, + errors=result.errors_count) + except Exception: + log.exception("blocklist_import_unexpected_error") -async def logout_session(db: aiosqlite.Connection, token: str) -> None: - # Use a one-way hash for token correlation in logs - token_hash = hashlib.sha256(token.encode()).hexdigest()[:12] - await session_repo.delete_session(db, token) - log.info("session_terminated", token_hash=token_hash) +# Register and log in the lifespan or register function: +log.info("blocklist_import_scheduled", interval_seconds=INTERVAL_SECONDS) +``` + +### 7.5 Service Functions and Methods + +Service functions should log at entry/exit for significant operations, or when errors occur: + +**Entry logging** (optional, use for complex or long-running operations): +```python +async def import_all(...) -> ImportResult: + log.debug("blocklist_import_starting", count=len(sources)) + try: + ... + except SomeError: + log.warning("blocklist_import_partial_failure", imported=count, error=str(exc)) +``` + +**Exit/success logging** (log results with metrics): +```python +async def get_all_jails(socket_path: str) -> list[JailResponse]: + jails = await _fetch_jails(socket_path) + log.info("jail_list_fetched", count=len(jails)) + return jails +``` + +**Error handling** (log with context, let caller decide how to respond): +```python +try: + result = await fetch_external_data(url) +except TimeoutError: + log.warning("external_fetch_timeout", url=url, timeout_seconds=TIMEOUT) + raise +except Exception as exc: + log.error("external_fetch_failed", url=url, error=type(exc).__name__) + raise ``` --- @@ -1588,7 +1751,7 @@ atomic_write(path, updated_content) # Atomic write, auto-cleanup on error These principles are **non-negotiable**. Every backend contributor must internalise and apply them daily. -### 15.1 Clean Code +### 17.1 Clean Code - Write code that **reads like well-written prose** — a new developer should understand intent without asking. - **Meaningful names** — variables, functions, and classes must reveal their purpose. Avoid abbreviations (`cnt`, `mgr`, `tmp`) unless universally understood. @@ -1619,7 +1782,7 @@ async def check(ip, j): raise Exception("not found") ``` -### 15.2 Separation of Concerns (SoC) +### 17.2 Separation of Concerns (SoC) - Each module, class, and function must have a **single, well-defined responsibility**. - **Routers** → HTTP layer only (parse requests, return responses). @@ -1629,29 +1792,29 @@ async def check(ip, j): - **Tasks** → scheduled background jobs. - Never mix layers — a router must not execute SQL, and a repository must not raise `HTTPException`. -### 15.3 Single Responsibility Principle (SRP) +### 17.3 Single Responsibility Principle (SRP) - A class or module should have **one and only one reason to change**. - If a service handles both ban management *and* email notifications, split it into `BanService` and `NotificationService`. -### 15.4 Don't Repeat Yourself (DRY) +### 17.4 Don't Repeat Yourself (DRY) - Extract shared logic into utility functions, base classes, or dependency providers. - If the same block of code appears in more than one place, **refactor it** into a single source of truth. - But don't over-abstract — premature DRY that couples unrelated features is worse than a little duplication (see **Rule of Three**: refactor when something appears a third time). -### 15.5 KISS — Keep It Simple, Stupid +### 17.5 KISS — Keep It Simple, Stupid - Choose the simplest solution that works correctly. - Avoid clever tricks, premature optimisation, and over-engineering. - If a standard library function does the job, prefer it over a custom implementation. -### 15.6 YAGNI — You Aren't Gonna Need It +### 17.6 YAGNI — You Aren't Gonna Need It - Do **not** build features, abstractions, or config options "just in case". - Implement what is required **now**. Extend later when a real need emerges. -### 15.7 Dependency Inversion Principle (DIP) +### 17.7 Dependency Inversion Principle (DIP) - High-level modules (services) must not depend on low-level modules (repositories) directly. Both should depend on **abstractions** (protocols / interfaces). - Use FastAPI's `Depends()` to inject implementations — this makes swapping and testing trivial. @@ -1940,17 +2103,17 @@ To adopt a Redis backend: - TTL must be respected — expired entries must be removed on access. - See `app/utils/session_cache.py` for the full Protocol definition and current implementations. -### 15.8 Composition over Inheritance +### 17.8 Composition over Inheritance - Favour **composing** small, focused objects over deep inheritance hierarchies. - Use mixins or protocols only when a clear "is-a" relationship exists; otherwise, pass collaborators as constructor arguments. -### 15.9 Fail Fast +### 17.9 Fail Fast - Validate inputs as early as possible — at the API boundary with Pydantic, at service entry with assertions or domain checks. - Raise specific exceptions immediately rather than letting bad data propagate silently. -### 15.10 Law of Demeter (Principle of Least Knowledge) +### 17.10 Law of Demeter (Principle of Least Knowledge) - A function should only call methods on: 1. Its own object (`self`). @@ -1958,13 +2121,13 @@ To adopt a Redis backend: 3. Objects it creates. - Avoid long accessor chains like `request.state.db.cursor().execute(...)` — wrap them in a meaningful method. -### 15.11 Defensive Programming +### 17.11 Defensive Programming - Never trust external input — validate and sanitise everything that crosses a boundary (HTTP request, file, socket, environment variable). - Handle edge cases explicitly: empty lists, `None` values, negative numbers, empty strings. - Use type narrowing and exhaustive pattern matching (`match` / `case`) to eliminate impossible states. -### 15.12 SSRF Prevention (Server-Side Request Forgery) +### 17.12 SSRF Prevention (Server-Side Request Forgery) When user-supplied URLs are fetched by the backend, validate them before making any HTTP requests: @@ -1989,8 +2152,6 @@ When user-supplied URLs are fetched by the backend, validate them before making - `async validate_blocklist_url(url: AnyHttpUrl) → None`: Async DNS resolution + private IP check. - Service layer calls `await validate_blocklist_url(url)` before persisting; router catches `ValueError` and returns 400. ---- - ## 18. Quick Reference — Do / Don't | Do | Don't | @@ -1998,7 +2159,8 @@ When user-supplied URLs are fetched by the backend, validate them before making | Type every function, variable, return | Leave types implicit | | Use `async def` for I/O | Use sync functions for I/O | | Validate with Pydantic at the boundary | Pass raw dicts through the codebase | -| Log with structlog + context keys | Use `print()` or format strings in logs | +| Log with structlog + context keys (INFO/WARNING/ERROR/DEBUG) | Use `print()` or format strings in logs | +| Use `log.exception()` in catch-all handlers (captures traceback) | Use `log.error()` for exceptions; let exceptions get lost | | Write tests for every feature | Ship untested code | | Use `aiohttp` for HTTP calls | Use `requests` | | Handle errors with custom exceptions | Use bare `except:` | diff --git a/Docs/Tasks.md b/Docs/Tasks.md index 2a6c27c..c0cc2bd 100644 --- a/Docs/Tasks.md +++ b/Docs/Tasks.md @@ -1,22 +1,3 @@ -## 10) Startup sequence depends on implicit ordering -- Where found: - - [backend/app/startup.py](backend/app/startup.py) -- Why this is needed: - - Hidden order dependencies make lifecycle changes risky. -- Goal: - - Explicitly define resource/task startup graph and required prerequisites. -- What to do: - - Document dependency graph. - - Add startup assertions and health checks for each stage. -- Possible traps and issues: - - Partial startup failures may leave stale scheduler or open sessions. -- Docs changes needed: - - Add startup DAG and failure-mode behavior. -- Doc references: - - [Docs/Architekture.md](Docs/Architekture.md) - ---- - ## 11) Logging semantics are inconsistent across backend modules - Where found: - [backend/app/services](backend/app/services)