Files
BanGUI/Docs/Tasks.md
Lukas ecb8542496 docs: add comprehensive task backlog and bump version to rc.5
- Document database error handling, logging duplication, ban service
timestamp, and orphaned SQLite file issues in Tasks.md
- Bump backend version from 0.9.19-rc.4 to 0.9.19-rc.5
2026-05-24 22:05:34 +02:00

24 KiB
Raw Blame History

Tasks

Task: Improve database_open_failed Error Handling in get_db() Dependency

Issue in Detail

The database_open_failed error event is emitted in app.dependencies.get_db() when app.db.open_db() raises an exception. Currently, the code uses a broad except Exception catch-all that:

  1. Logs the error with log.error("database_open_failed", error=str(exc)).
  2. Raises a generic HTTPException(status_code=503, detail="Database is not available.").

This approach has several problems:

  • No differentiation between failure modes (missing directory, permission denied, corrupted database, WAL file lock, disk full, etc.).
  • No actionable guidance for operators or API consumers — the message is always the same.
  • No retry logic for transient failures like sqlite3.OperationalError (database is locked / busy).
  • Error details are lost in the HTTP response, making client-side debugging impossible.
  • Inconsistent with the error contract documented in Docs/Backend-Development.md §8, which requires structured ErrorResponse objects with error_code, message, and details.

Code location: backend/app/dependencies.py:174

# Current implementation (problematic)
try:
    db = await open_db(settings.database_path)
except Exception as exc:
    log.error("database_open_failed", error=str(exc))
    raise HTTPException(
        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        detail="Database is not available.",
    ) from exc

Why This Happens

  1. Broad exception catch: except Exception swallows all failure modes indiscriminately.
  2. No mapping to error codes: The codebase uses a structured ErrorResponse schema (error_code, message, details), but get_db() bypasses it by raising a raw HTTPException with a plain string detail.
  3. Missing retry for transient errors: SQLite in WAL mode with busy_timeout=5000 can still fail with OperationalError under heavy concurrency. A single immediate failure is suboptimal.
  4. WAL file cleanup race condition: _cleanup_wal_files() in open_db() removes orphaned .wal/.shm files, but if another process has the file open, OSError is silently swallowed. This can leave the database in a state where open_db() fails with a cryptic error.
  5. No directory existence check: If the parent directory for database_path does not exist, aiosqlite.connect() raises sqlite3.OperationalError: unable to open database file, which is indistinguishable from other errors in the current handler.
  6. Blocklist imports run in parallel and hammer the DB: Blocklist import tasks (scheduled via APScheduler) run concurrently with API requests and each other. Each import opens its own database connection via task_db() and performs mass INSERT operations into import_log and potentially blocklist_sources. Under heavy load, multiple concurrent writers contend for the SQLite WAL lock, causing OperationalError: database is locked or busy failures that bubble up to get_db() for unrelated API requests. The current busy_timeout=5000 is insufficient when multiple blocklist imports run simultaneously.

How to Fix It

Step 1: Introduce Specific Exception Handling

Map different exception types to appropriate HTTP responses and error codes:

Exception Type Error Code HTTP Status Client Message
PermissionError DATABASE_PERMISSION_DENIED 503 Insufficient permissions to access the database file.
sqlite3.OperationalError (unable to open) DATABASE_PATH_INVALID 503 Database directory does not exist or path is invalid.
sqlite3.OperationalError (database is locked) DATABASE_BUSY 503 Database is temporarily busy. Retry the request.
sqlite3.DatabaseError (corrupt) DATABASE_CORRUPTED 503 Database file is corrupted.
Other / unexpected DATABASE_UNAVAILABLE 503 Database is not available.

Step 2: Add Retry Logic for Transient DATABASE_BUSY

For sqlite3.OperationalError with "database is locked" or "database is busy", implement an async retry with exponential backoff (max 3 retries, base delay 100ms). Use asyncio.sleep() between retries.

Important: Only retry OperationalError that indicates transient locking. Do NOT retry on "unable to open database file" or corruption errors.

Step 3: Ensure ErrorResponse Compliance

Instead of raising HTTPException(detail="..."), raise a custom domain exception that the global exception handler converts to ErrorResponse. See Docs/Backend-Development.md §8 and app.exceptions / app.middleware.error_handling for the pattern.

Example pattern from the codebase:

from app.exceptions import ServiceUnavailableError

raise ServiceUnavailableError(
    error_code="DATABASE_BUSY",
    message="Database is temporarily busy.",
    details={"database_path": settings.database_path, "retries": 3},
)

If ServiceUnavailableError does not exist, define it in app/exceptions.py following the existing exception hierarchy (see app.exceptions.BanGUIException).

Step 4: Improve open_db() Resilience

In backend/app/db.py, enhance open_db() to:

  1. Check if the parent directory of database_path exists before connecting. If not, attempt to create it with Path.mkdir(parents=True, exist_ok=True).
  2. Distinguish between OSError during WAL cleanup and connection errors.
  3. Log the specific SQLite error code (exc.sqlite_errorcode if available) for better observability.

Step 5: Sanitize Error Responses

Never expose absolute file system paths in production error responses. Use relative paths or redact them:

# Safe
"database_path": "/data/bangui.db"  # OK — this is a known container path

# Unsafe (don't do this)
"database_path": "/home/alice/secret/project/bangui.db"

In the BanGUI Docker setup, /data/bangui.db is a well-known path, so including it is acceptable. For non-Docker deployments, consider redacting.

Step 6: Reduce Concurrent Write Pressure from Blocklist Imports

The root cause of many DATABASE_BUSY errors is not API load — it is parallel blocklist import tasks that open independent database connections and perform bulk INSERTs. Mitigate this by:

  1. Serializing imports: Ensure APScheduler jobs for blocklist imports use max_instances=1 and a job store lock (e.g., coalesce=True) so only one import runs at a time.
  2. Batching inserts: If blocklist imports insert one row at a time, wrap them in explicit transactions and batch commits (e.g., commit every 100 rows) to reduce WAL checkpoint frequency.
  3. Using a dedicated connection for imports: Instead of opening a new connection per import task, reuse a single long-lived connection or use a connection pool with a max size of 1 for write operations.
  4. Scheduling imports during low-traffic windows: Offset import schedules so they do not coincide with peak API usage.

Code to audit: app/tasks/blocklist_import.py, app/services/blocklist_service.py, and the APScheduler job configuration in app/main.py or app/startup.py.

Issues and Trapfalls

  1. aiosqlite wraps sqlite3 exceptions: aiosqlite wraps sqlite3.Error subclasses in aiosqlite.Error. You may need to inspect exc.__cause__ or exc.__context__ to get the underlying sqlite3.OperationalError. Test this carefully.
  2. Event loop blocking: Retry logic with asyncio.sleep() is safe, but ensure the total retry duration does not exceed FastAPI/uvicorn timeout thresholds.
  3. WAL mode and concurrent readers: Even with busy_timeout=5000, long-running reads can block writes. The retry logic mitigates but does not eliminate this.
  4. ErrorResponse schema validation: If you introduce a new error code (DATABASE_BUSY), add it to the ErrorCode enum in app.models.error (or equivalent) and ensure OpenAPI schema generation includes it.
  5. Testing file permissions: Creating permission-denied scenarios in unit tests requires mocking aiosqlite.connect() or using temporary directories with restricted permissions. Avoid tests that depend on actual chmod operations for portability.
  6. Don't break existing tests: test_get_db_uses_effective_runtime_database_path in backend/tests/test_dependencies.py patches app.db.open_db. Ensure your changes remain compatible with this mock-based test.
  7. Blocklist imports are the primary source of DATABASE_BUSY: The retry logic in get_db() helps API requests survive transient lock contention, but it does not fix the root cause. Consider throttling or serializing blocklist imports in app.tasks.blocklist_import to reduce concurrent write pressure. If imports are triggered by APScheduler with max_instances > 1, overlapping runs will compound the problem.
  8. Connection pool exhaustion: If every API request opens a new connection via get_db() and blocklist imports also open connections, the total concurrent connection count can spike. SQLite has a hard limit on concurrent writers in WAL mode. Monitor PRAGMA wal_checkpoint and connection counts in production.

Documentation References

  • Docs/Backend-Development.md §8 (Error Handling): Describes the error contract — routers must NOT construct HTTPException for domain errors; use custom exceptions that propagate to the global handler.
  • Docs/Backend-Development.md §2 (Core Libraries): Mentions aiosqlite for async SQLite access and the requirement to use structured logging (already done).
  • Docs/Backend-Development.md §6 (Dependency Injection): get_db() is a FastAPI dependency — it must yield/return the connection and handle cleanup in finally.
  • Docs/Backend-Development.md §7 (Testing): Every feature needs tests — mock external dependencies, test error paths.

Updates on Documentation

After implementing the fix, update the following:

  1. Docs/Backend-Development.md §8: Add a subsection or example showing how dependency-level database errors should be handled (custom exception → global handler → ErrorResponse).
  2. Docs/API_STATUS_CODES.md (if it exists): Document the new DATABASE_BUSY, DATABASE_PERMISSION_DENIED, DATABASE_PATH_INVALID, and DATABASE_CORRUPTED error codes and their HTTP 503 mapping.
  3. Inline docstrings: Update the docstring for get_db() in app/dependencies.py to document which exceptions it can raise and what they mean.

Tests to Write

Add the following tests to backend/tests/test_dependencies.py (or create backend/tests/test_db_errors.py if the file grows large):

1. test_get_db_raises_service_unavailable_on_permission_error

  • Mock: Patch app.db.open_db to raise PermissionError(13, "Permission denied").
  • Assert: The raised exception (or HTTP response, if tested via TestClient) has status 503 and error code DATABASE_PERMISSION_DENIED.
  • Assert: The error response follows the ErrorResponse schema.

2. test_get_db_raises_service_unavailable_on_missing_directory

  • Mock: Patch app.db.open_db to raise sqlite3.OperationalError("unable to open database file").
  • Assert: Status 503, error code DATABASE_PATH_INVALID.

3. test_get_db_retries_on_database_locked

  • Mock: Patch app.db.open_db to raise sqlite3.OperationalError("database is locked") on the first two calls, then succeed on the third.
  • Assert: open_db is called exactly 3 times.
  • Assert: The final call yields a valid connection.

4. test_get_db_fails_after_max_retries_on_database_locked

  • Mock: Patch app.db.open_db to always raise sqlite3.OperationalError("database is locked").
  • Assert: After 3 retries, raises 503 with error code DATABASE_BUSY.

5. test_get_db_raises_service_unavailable_on_corrupted_database

  • Mock: Patch app.db.open_db to raise sqlite3.DatabaseError("database disk image is malformed").
  • Assert: Status 503, error code DATABASE_CORRUPTED.

6. test_open_db_creates_parent_directory_if_missing

  • Setup: Provide a database_path in a non-existent subdirectory of tmp_path.
  • Action: Call open_db(database_path).
  • Assert: The parent directory is created and the database opens successfully.

7. test_open_db_logs_specific_sqlite_error_code

  • Mock: Patch aiosqlite.connect to raise an exception with a known sqlite_errorcode.
  • Assert: The log output includes the error code for observability.

8. test_blocklist_import_serializes_to_avoid_database_busy

  • Setup: Configure APScheduler with two blocklist import jobs that would overlap.
  • Action: Trigger both jobs simultaneously.
  • Assert: Only one job executes at a time (verify via job store or mock call order).
  • Assert: No DATABASE_BUSY errors occur during import execution.

9. test_get_db_succeeds_during_blocklist_import

  • Setup: Start a blocklist import task that holds a database write lock.
  • Action: While the import is running, invoke get_db() via a FastAPI test client.
  • Assert: The API request succeeds (either immediately or after retry), demonstrating that the retry logic and/or serialized imports prevent user-facing failures.

Task: Fix Massive Log Duplication and Malformed JSON Output

Issue in Detail

Log output in production Docker containers shows severe duplication and malformed JSON. Every log event is emitted 420+ times with identical timestamps, and lines contain repeated "INF INF INF..." or "warning warning..." prefixes instead of clean JSON.

Example from actual logs:

INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path timestamp=2026-05-23T19:39:38.707935+00:00 ...

This makes logs unreadable, increases storage I/O, and breaks any log parsing pipeline.

Why This Happens

  1. Handler multiplication: Uvicorn pre-configures the root logger with a StreamHandler before _configure_logging() in app/main.py runs. When logging.basicConfig(level=level, handlers=handlers) is called, it replaces the root logger's handlers — but existing child loggers may still propagate to the root, and the uvicorn access logger (uvicorn.access) retains its own handler. The result is multiple handlers firing for the same log record.

  2. External logging handler recursion risk: _ExternalLoggingHandler is attached to the root logger (logging.getLogger()). When it internally calls log.warning() or log.debug() (e.g., external_log_buffer_full), those logs propagate back to the root and re-trigger the external handler. While not yet causing a RecursionError, this compounds the duplication.

  3. Plain text logs before JSON formatter is ready: During startup, before _configure_logging() runs, module-level get_logger() calls in db.py and startup.py emit plain text lines like orphaned_sqlite_file_removed and rate_limiting_process_local_only that bypass the JSON formatter entirely.

How to Fix It

Step 1: Clear Existing Handlers Before basicConfig

In backend/app/main.py, modify _configure_logging() to remove all pre-existing handlers from the root logger and uvicorn loggers before applying the new configuration:

def _configure_logging(...) -> None:
    # Remove uvicorn's default handlers to prevent duplication
    for logger_name in ("uvicorn", "uvicorn.access", "uvicorn.error"):
        logger = logging.getLogger(logger_name)
        for handler in logger.handlers[:]:
            logger.removeHandler(handler)
            handler.close()

    root = logging.getLogger()
    for handler in root.handlers[:]:
        root.removeHandler(handler)
        handler.close()

    # Now safe to configure
    logging.basicConfig(level=level, handlers=handlers)

Step 2: Prevent External Logging Recursion

In _ExternalLoggingHandler.emit(), filter out logs originating from app.utils.external_logging to prevent the handler from re-triggering itself:

def emit(self, record: logging.LogRecord) -> None:
    if record.name.startswith("app.utils.external_logging"):
        return
    _external_logging_processor(record)

Alternatively, attach the external handler to a specific logger (logging.getLogger("bangui")) instead of the root logger.

Step 3: Move Logging Configuration Earlier

Ensure _configure_logging() runs before any module imports that call get_logger() at the top level. In app/main.py, the lifespan function already calls _configure_logging() early, but db.py and startup.py are imported earlier in the lifespan. Consider:

  • Calling _configure_logging() at module import time ( guarded by if __name__ == "__main__" or a _logging_configured flag).
  • Or restructuring imports so that _configure_logging() happens before startup_shared_resources().

Step 4: Ensure JSONFormatter Is the Only Formatter

Verify that all StreamHandler instances use JSONFormatter. Uvicorn's default formatter is plain text. After clearing uvicorn handlers and re-adding them via basicConfig, they will use the JSON formatter.

Issues and Trapfalls

  1. Uvicorn access logs: Clearing uvicorn.access handlers will suppress access logs unless you re-add a handler. The basicConfig call adds a handler to the root logger, which uvicorn.access propagates to — so access logs will still appear, but now as JSON. Verify this behavior.
  2. File handler duplication: If BANGUI_LOG_FILE is set, ensure the FileHandler is only added once. basicConfig with handlers=[...] replaces root handlers, but if called multiple times (e.g., during tests), it may leak.
  3. Test isolation: Tests that call create_app() may re-configure logging. Add a _logging_configured global flag or use logging.disable() in test fixtures to prevent side effects.
  4. Docker PYTHONUNBUFFERED: The Dockerfile sets PYTHONUNBUFFERED=1, which is correct for containerized logging. Do not change this.

Documentation References

  • Docs/Backend-Development.md §2: Mentions structured logging with structlog — the codebase currently uses a custom JSONFormatter instead. Document this choice.
  • Docs/Backend-Development.md §7 (Testing): Test logging configuration in isolation to avoid side effects.

Updates on Documentation

  1. Update Docs/Backend-Development.md §2 to clarify that the project uses a custom JSONFormatter on stdlib logging, not structlog.
  2. Add a comment in Docker/Dockerfile.backend explaining why PYTHONUNBUFFERED=1 is required for JSON log streaming.

Tests to Write

1. test_logging_configuration_no_duplicate_handlers

  • Action: Call create_app() twice in the same process.
  • Assert: The root logger has exactly one StreamHandler after each call.

2. test_uvicorn_access_logs_are_json

  • Action: Emit a log record through the uvicorn.access logger.
  • Assert: The output is valid JSON containing "event", "level", and "timestamp".

3. test_external_logging_handler_no_recursion

  • Action: Emit a log record from app.utils.external_logging while the external handler is active.
  • Assert: The external handler's queue_log is NOT called for that record.

4. test_plain_text_logs_not_emitted_after_startup

  • Action: After create_app() completes startup, emit a log via get_logger("app.db").
  • Assert: Output is JSON, not plain text.

Task: Fix Incorrect since Timestamp in Ban Service Queries

Issue in Detail

The ban_service_ban_trend and ban_service_list_bans log events show since=1779479327 with range=24h, but 17794793272026-05-21 18:42 UTC — roughly 48 hours before the log timestamp (2026-05-23 19:49 UTC). For a 24h range, since should be approximately 1779565827.

Example from logs:

event=ban_service_ban_trend since=1779479327 range=24h

This suggests the since calculation in app/services/ban_service.py is using an incorrect time base or double-applying a timezone offset.

Why This Happens

  1. Possible double UTC conversion: If since_unix() returns a UTC timestamp and the caller also subtracts a timezone offset, the result is shifted by the offset amount.
  2. Wrong reference time: The since calculation may use datetime.now() (naive local time) instead of datetime.now(timezone.utc) as the reference point.
  3. Hardcoded test data: If the since value is cached or computed once at module import time, it will be stale for long-running processes.

How to Fix It

  1. Audit since_unix() in app/utils/time_utils.py: Verify it uses datetime.now(timezone.utc) or time.time() as the reference.
  2. Audit callers in ban_service.py: Ensure since is computed per-request, not cached. Check for any manual datetime arithmetic that bypasses since_unix().
  3. Add a runtime assertion: In debug mode, assert that since is within [now - range - slack, now].

Issues and Trapfalls

  1. since_unix() slack: The function adds TIME_RANGE_SLACK_SECONDS = 60 to the window. Ensure this is documented and expected.
  2. fail2ban epoch compatibility: fail2ban stores timeofban as Unix epoch seconds (UTC). The since value must be in the same unit.
  3. Caching: If since is cached in a service-level variable, it will drift over time. Ensure it is computed per-request.

Documentation References

  • Docs/Backend-Development.md §2 (Timestamp Handling): Rules for UTC timestamps, since_unix() usage, and the 60-second slack.

Tests to Write

1. test_ban_trend_since_is_within_expected_range

  • Action: Call the ban trend service with range="24h".
  • Assert: The computed since is within the last 24 hours + 60s slack.

2. test_since_unix_returns_utc_epoch

  • Action: Call since_unix("24h").
  • Assert: The result is within 24 hours + 60s of time.time().

Task: Investigate Orphaned SQLite Shared Memory Files on Startup

Issue in Detail

The log shows repeated warnings:

event=orphaned_sqlite_file_removed path=/data/bangui.db-shm

This occurs at 19:39:48 and again at 19:49:39 (after restart). The -shm file is SQLite's shared memory file for WAL mode. Its presence indicates unclean shutdowns (crashes or SIGKILL instead of graceful SIGTERM).

Why This Happens

  1. Docker stop timeout: Docker sends SIGTERM, waits stop_grace_period (default 10s), then sends SIGKILL. The backend allows 25s for graceful shutdown, but if the container's stop_grace_period is shorter, the process is killed before cleanup completes.
  2. Missing connection close: If the application crashes or is killed, SQLite connections are not closed, leaving .wal and .shm files behind.
  3. _cleanup_wal_files() is a workaround, not a fix: It removes stale files on the next startup, but the underlying cause (unclean shutdown) remains.

How to Fix It

  1. Verify Docker Compose stop_grace_period: In Docker/compose.prod.yml, ensure the backend service has stop_grace_period: 30s (matching the 25s internal timeout + margin).
  2. Improve shutdown logging: Add explicit logs when the database connection is closed during lifespan shutdown.
  3. Consider PRAGMA journal_mode = DELETE for single-process setups: WAL mode is beneficial for concurrent readers, but if BanGUI runs with a single worker and single process, DELETE mode eliminates .wal/.shm files entirely. Evaluate the tradeoff.

Issues and Trapfalls

  1. WAL mode is required for concurrent reads: If you switch to DELETE mode, readers block writers. This may degrade API performance under load.
  2. The _cleanup_wal_files() 10-second threshold: Files modified within 10 seconds are skipped. If the container restarts rapidly (e.g., health check failure → restart), the files may not be cleaned up.

Documentation References

  • Docs/Deployment.md: Docker deployment configuration and graceful shutdown behavior.
  • Docs/Architekture.md: Deployment constraints and process-local state.

Tests to Write

1. test_cleanup_wal_files_removes_stale_files

  • Setup: Create fake .wal and .shm files with mtime > 10s ago.
  • Action: Call _cleanup_wal_files().
  • Assert: Files are removed.

2. test_cleanup_wal_files_skips_recent_files

  • Setup: Create fake .wal and .shm files with mtime < 10s ago.
  • Action: Call _cleanup_wal_files().
  • Assert: Files are NOT removed.