Files
BanGUI/Docs/Tasks.md
Lukas ef8feba4b2 docs: add comprehensive task backlog and bump version to rc.5
- Document database error handling, logging duplication, ban service
timestamp, and orphaned SQLite file issues in Tasks.md
- Bump backend version from 0.9.19-rc.4 to 0.9.19-rc.5
2026-05-23 22:09:06 +02:00

382 lines
24 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Tasks
## Task: Improve `database_open_failed` Error Handling in `get_db()` Dependency
### Issue in Detail
The `database_open_failed` error event is emitted in `app.dependencies.get_db()` when `app.db.open_db()` raises an exception. Currently, the code uses a broad `except Exception` catch-all that:
1. Logs the error with `log.error("database_open_failed", error=str(exc))`.
2. Raises a generic `HTTPException(status_code=503, detail="Database is not available.")`.
This approach has several problems:
- **No differentiation** between failure modes (missing directory, permission denied, corrupted database, WAL file lock, disk full, etc.).
- **No actionable guidance** for operators or API consumers — the message is always the same.
- **No retry logic** for transient failures like `sqlite3.OperationalError` (database is locked / busy).
- **Error details are lost** in the HTTP response, making client-side debugging impossible.
- **Inconsistent with the error contract** documented in `Docs/Backend-Development.md` §8, which requires structured `ErrorResponse` objects with `error_code`, `message`, and `details`.
**Code location:** `backend/app/dependencies.py:174`
```python
# Current implementation (problematic)
try:
db = await open_db(settings.database_path)
except Exception as exc:
log.error("database_open_failed", error=str(exc))
raise HTTPException(
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
detail="Database is not available.",
) from exc
```
### Why This Happens
1. **Broad exception catch:** `except Exception` swallows all failure modes indiscriminately.
2. **No mapping to error codes:** The codebase uses a structured `ErrorResponse` schema (`error_code`, `message`, `details`), but `get_db()` bypasses it by raising a raw `HTTPException` with a plain string `detail`.
3. **Missing retry for transient errors:** SQLite in WAL mode with `busy_timeout=5000` can still fail with `OperationalError` under heavy concurrency. A single immediate failure is suboptimal.
4. **WAL file cleanup race condition:** `_cleanup_wal_files()` in `open_db()` removes orphaned `.wal`/`.shm` files, but if another process has the file open, `OSError` is silently swallowed. This can leave the database in a state where `open_db()` fails with a cryptic error.
5. **No directory existence check:** If the parent directory for `database_path` does not exist, `aiosqlite.connect()` raises `sqlite3.OperationalError: unable to open database file`, which is indistinguishable from other errors in the current handler.
6. **Blocklist imports run in parallel and hammer the DB:** Blocklist import tasks (scheduled via APScheduler) run concurrently with API requests and each other. Each import opens its own database connection via `task_db()` and performs mass `INSERT` operations into `import_log` and potentially `blocklist_sources`. Under heavy load, multiple concurrent writers contend for the SQLite WAL lock, causing `OperationalError: database is locked` or `busy` failures that bubble up to `get_db()` for unrelated API requests. The current `busy_timeout=5000` is insufficient when multiple blocklist imports run simultaneously.
### How to Fix It
#### Step 1: Introduce Specific Exception Handling
Map different exception types to appropriate HTTP responses and error codes:
| Exception Type | Error Code | HTTP Status | Client Message |
|---|---|---|---|
| `PermissionError` | `DATABASE_PERMISSION_DENIED` | 503 | Insufficient permissions to access the database file. |
| `sqlite3.OperationalError` (unable to open) | `DATABASE_PATH_INVALID` | 503 | Database directory does not exist or path is invalid. |
| `sqlite3.OperationalError` (database is locked) | `DATABASE_BUSY` | 503 | Database is temporarily busy. Retry the request. |
| `sqlite3.DatabaseError` (corrupt) | `DATABASE_CORRUPTED` | 503 | Database file is corrupted. |
| Other / unexpected | `DATABASE_UNAVAILABLE` | 503 | Database is not available. |
#### Step 2: Add Retry Logic for Transient `DATABASE_BUSY`
For `sqlite3.OperationalError` with "database is locked" or "database is busy", implement an async retry with exponential backoff (max 3 retries, base delay 100ms). Use `asyncio.sleep()` between retries.
**Important:** Only retry `OperationalError` that indicates transient locking. Do NOT retry on "unable to open database file" or corruption errors.
#### Step 3: Ensure ErrorResponse Compliance
Instead of raising `HTTPException(detail="...")`, raise a custom domain exception that the global exception handler converts to `ErrorResponse`. See `Docs/Backend-Development.md` §8 and `app.exceptions` / `app.middleware.error_handling` for the pattern.
Example pattern from the codebase:
```python
from app.exceptions import ServiceUnavailableError
raise ServiceUnavailableError(
error_code="DATABASE_BUSY",
message="Database is temporarily busy.",
details={"database_path": settings.database_path, "retries": 3},
)
```
If `ServiceUnavailableError` does not exist, define it in `app/exceptions.py` following the existing exception hierarchy (see `app.exceptions.BanGUIException`).
#### Step 4: Improve `open_db()` Resilience
In `backend/app/db.py`, enhance `open_db()` to:
1. Check if the parent directory of `database_path` exists before connecting. If not, attempt to create it with `Path.mkdir(parents=True, exist_ok=True)`.
2. Distinguish between `OSError` during WAL cleanup and connection errors.
3. Log the specific SQLite error code (`exc.sqlite_errorcode` if available) for better observability.
#### Step 5: Sanitize Error Responses
Never expose absolute file system paths in production error responses. Use relative paths or redact them:
```python
# Safe
"database_path": "/data/bangui.db" # OK — this is a known container path
# Unsafe (don't do this)
"database_path": "/home/alice/secret/project/bangui.db"
```
In the BanGUI Docker setup, `/data/bangui.db` is a well-known path, so including it is acceptable. For non-Docker deployments, consider redacting.
#### Step 6: Reduce Concurrent Write Pressure from Blocklist Imports
The root cause of many `DATABASE_BUSY` errors is not API load — it is parallel blocklist import tasks that open independent database connections and perform bulk `INSERT`s. Mitigate this by:
1. **Serializing imports:** Ensure APScheduler jobs for blocklist imports use `max_instances=1` and a job store lock (e.g., `coalesce=True`) so only one import runs at a time.
2. **Batching inserts:** If blocklist imports insert one row at a time, wrap them in explicit transactions and batch commits (e.g., commit every 100 rows) to reduce WAL checkpoint frequency.
3. **Using a dedicated connection for imports:** Instead of opening a new connection per import task, reuse a single long-lived connection or use a connection pool with a max size of 1 for write operations.
4. **Scheduling imports during low-traffic windows:** Offset import schedules so they do not coincide with peak API usage.
**Code to audit:** `app/tasks/blocklist_import.py`, `app/services/blocklist_service.py`, and the APScheduler job configuration in `app/main.py` or `app/startup.py`.
### Issues and Trapfalls
1. **aiosqlite wraps sqlite3 exceptions:** `aiosqlite` wraps `sqlite3.Error` subclasses in `aiosqlite.Error`. You may need to inspect `exc.__cause__` or `exc.__context__` to get the underlying `sqlite3.OperationalError`. Test this carefully.
2. **Event loop blocking:** Retry logic with `asyncio.sleep()` is safe, but ensure the total retry duration does not exceed FastAPI/uvicorn timeout thresholds.
3. **WAL mode and concurrent readers:** Even with `busy_timeout=5000`, long-running reads can block writes. The retry logic mitigates but does not eliminate this.
4. **ErrorResponse schema validation:** If you introduce a new error code (`DATABASE_BUSY`), add it to the `ErrorCode` enum in `app.models.error` (or equivalent) and ensure OpenAPI schema generation includes it.
5. **Testing file permissions:** Creating permission-denied scenarios in unit tests requires mocking `aiosqlite.connect()` or using temporary directories with restricted permissions. Avoid tests that depend on actual `chmod` operations for portability.
6. **Don't break existing tests:** `test_get_db_uses_effective_runtime_database_path` in `backend/tests/test_dependencies.py` patches `app.db.open_db`. Ensure your changes remain compatible with this mock-based test.
7. **Blocklist imports are the primary source of `DATABASE_BUSY`:** The retry logic in `get_db()` helps API requests survive transient lock contention, but it does not fix the root cause. Consider throttling or serializing blocklist imports in `app.tasks.blocklist_import` to reduce concurrent write pressure. If imports are triggered by APScheduler with `max_instances > 1`, overlapping runs will compound the problem.
8. **Connection pool exhaustion:** If every API request opens a new connection via `get_db()` and blocklist imports also open connections, the total concurrent connection count can spike. SQLite has a hard limit on concurrent writers in WAL mode. Monitor `PRAGMA wal_checkpoint` and connection counts in production.
### Documentation References
- **`Docs/Backend-Development.md` §8 (Error Handling):** Describes the error contract — routers must NOT construct `HTTPException` for domain errors; use custom exceptions that propagate to the global handler.
- **`Docs/Backend-Development.md` §2 (Core Libraries):** Mentions `aiosqlite` for async SQLite access and the requirement to use structured logging (already done).
- **`Docs/Backend-Development.md` §6 (Dependency Injection):** `get_db()` is a FastAPI dependency — it must yield/return the connection and handle cleanup in `finally`.
- **`Docs/Backend-Development.md` §7 (Testing):** Every feature needs tests — mock external dependencies, test error paths.
### Updates on Documentation
After implementing the fix, update the following:
1. **`Docs/Backend-Development.md` §8:** Add a subsection or example showing how dependency-level database errors should be handled (custom exception → global handler → `ErrorResponse`).
2. **`Docs/API_STATUS_CODES.md` (if it exists):** Document the new `DATABASE_BUSY`, `DATABASE_PERMISSION_DENIED`, `DATABASE_PATH_INVALID`, and `DATABASE_CORRUPTED` error codes and their HTTP 503 mapping.
3. **Inline docstrings:** Update the docstring for `get_db()` in `app/dependencies.py` to document which exceptions it can raise and what they mean.
### Tests to Write
Add the following tests to `backend/tests/test_dependencies.py` (or create `backend/tests/test_db_errors.py` if the file grows large):
#### 1. `test_get_db_raises_service_unavailable_on_permission_error`
- **Mock:** Patch `app.db.open_db` to raise `PermissionError(13, "Permission denied")`.
- **Assert:** The raised exception (or HTTP response, if tested via `TestClient`) has status 503 and error code `DATABASE_PERMISSION_DENIED`.
- **Assert:** The error response follows the `ErrorResponse` schema.
#### 2. `test_get_db_raises_service_unavailable_on_missing_directory`
- **Mock:** Patch `app.db.open_db` to raise `sqlite3.OperationalError("unable to open database file")`.
- **Assert:** Status 503, error code `DATABASE_PATH_INVALID`.
#### 3. `test_get_db_retries_on_database_locked`
- **Mock:** Patch `app.db.open_db` to raise `sqlite3.OperationalError("database is locked")` on the first two calls, then succeed on the third.
- **Assert:** `open_db` is called exactly 3 times.
- **Assert:** The final call yields a valid connection.
#### 4. `test_get_db_fails_after_max_retries_on_database_locked`
- **Mock:** Patch `app.db.open_db` to always raise `sqlite3.OperationalError("database is locked")`.
- **Assert:** After 3 retries, raises 503 with error code `DATABASE_BUSY`.
#### 5. `test_get_db_raises_service_unavailable_on_corrupted_database`
- **Mock:** Patch `app.db.open_db` to raise `sqlite3.DatabaseError("database disk image is malformed")`.
- **Assert:** Status 503, error code `DATABASE_CORRUPTED`.
#### 6. `test_open_db_creates_parent_directory_if_missing`
- **Setup:** Provide a `database_path` in a non-existent subdirectory of `tmp_path`.
- **Action:** Call `open_db(database_path)`.
- **Assert:** The parent directory is created and the database opens successfully.
#### 7. `test_open_db_logs_specific_sqlite_error_code`
- **Mock:** Patch `aiosqlite.connect` to raise an exception with a known `sqlite_errorcode`.
- **Assert:** The log output includes the error code for observability.
#### 8. `test_blocklist_import_serializes_to_avoid_database_busy`
- **Setup:** Configure APScheduler with two blocklist import jobs that would overlap.
- **Action:** Trigger both jobs simultaneously.
- **Assert:** Only one job executes at a time (verify via job store or mock call order).
- **Assert:** No `DATABASE_BUSY` errors occur during import execution.
#### 9. `test_get_db_succeeds_during_blocklist_import`
- **Setup:** Start a blocklist import task that holds a database write lock.
- **Action:** While the import is running, invoke `get_db()` via a FastAPI test client.
- **Assert:** The API request succeeds (either immediately or after retry), demonstrating that the retry logic and/or serialized imports prevent user-facing failures.
---
## Task: Fix Massive Log Duplication and Malformed JSON Output
### Issue in Detail
Log output in production Docker containers shows severe duplication and malformed JSON. Every log event is emitted **420+ times** with identical timestamps, and lines contain repeated `"INF INF INF..."` or `"warning warning..."` prefixes instead of clean JSON.
Example from actual logs:
```
INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path timestamp=2026-05-23T19:39:38.707935+00:00 ...
```
This makes logs unreadable, increases storage I/O, and breaks any log parsing pipeline.
### Why This Happens
1. **Handler multiplication:** Uvicorn pre-configures the root logger with a `StreamHandler` before `_configure_logging()` in `app/main.py` runs. When `logging.basicConfig(level=level, handlers=handlers)` is called, it replaces the root logger's handlers — but existing child loggers may still propagate to the root, and the uvicorn access logger (`uvicorn.access`) retains its own handler. The result is multiple handlers firing for the same log record.
2. **External logging handler recursion risk:** `_ExternalLoggingHandler` is attached to the root logger (`logging.getLogger()`). When it internally calls `log.warning()` or `log.debug()` (e.g., `external_log_buffer_full`), those logs propagate back to the root and re-trigger the external handler. While not yet causing a `RecursionError`, this compounds the duplication.
3. **Plain text logs before JSON formatter is ready:** During startup, before `_configure_logging()` runs, module-level `get_logger()` calls in `db.py` and `startup.py` emit plain text lines like `orphaned_sqlite_file_removed` and `rate_limiting_process_local_only` that bypass the JSON formatter entirely.
### How to Fix It
#### Step 1: Clear Existing Handlers Before `basicConfig`
In `backend/app/main.py`, modify `_configure_logging()` to remove all pre-existing handlers from the root logger and uvicorn loggers before applying the new configuration:
```python
def _configure_logging(...) -> None:
# Remove uvicorn's default handlers to prevent duplication
for logger_name in ("uvicorn", "uvicorn.access", "uvicorn.error"):
logger = logging.getLogger(logger_name)
for handler in logger.handlers[:]:
logger.removeHandler(handler)
handler.close()
root = logging.getLogger()
for handler in root.handlers[:]:
root.removeHandler(handler)
handler.close()
# Now safe to configure
logging.basicConfig(level=level, handlers=handlers)
```
#### Step 2: Prevent External Logging Recursion
In `_ExternalLoggingHandler.emit()`, filter out logs originating from `app.utils.external_logging` to prevent the handler from re-triggering itself:
```python
def emit(self, record: logging.LogRecord) -> None:
if record.name.startswith("app.utils.external_logging"):
return
_external_logging_processor(record)
```
Alternatively, attach the external handler to a specific logger (`logging.getLogger("bangui")`) instead of the root logger.
#### Step 3: Move Logging Configuration Earlier
Ensure `_configure_logging()` runs **before** any module imports that call `get_logger()` at the top level. In `app/main.py`, the lifespan function already calls `_configure_logging()` early, but `db.py` and `startup.py` are imported earlier in the lifespan. Consider:
- Calling `_configure_logging()` at module import time ( guarded by `if __name__ == "__main__"` or a `_logging_configured` flag).
- Or restructuring imports so that `_configure_logging()` happens before `startup_shared_resources()`.
#### Step 4: Ensure `JSONFormatter` Is the Only Formatter
Verify that all `StreamHandler` instances use `JSONFormatter`. Uvicorn's default formatter is plain text. After clearing uvicorn handlers and re-adding them via `basicConfig`, they will use the JSON formatter.
### Issues and Trapfalls
1. **Uvicorn access logs:** Clearing `uvicorn.access` handlers will suppress access logs unless you re-add a handler. The `basicConfig` call adds a handler to the root logger, which `uvicorn.access` propagates to — so access logs will still appear, but now as JSON. Verify this behavior.
2. **File handler duplication:** If `BANGUI_LOG_FILE` is set, ensure the `FileHandler` is only added once. `basicConfig` with `handlers=[...]` replaces root handlers, but if called multiple times (e.g., during tests), it may leak.
3. **Test isolation:** Tests that call `create_app()` may re-configure logging. Add a `_logging_configured` global flag or use `logging.disable()` in test fixtures to prevent side effects.
4. **Docker `PYTHONUNBUFFERED`:** The Dockerfile sets `PYTHONUNBUFFERED=1`, which is correct for containerized logging. Do not change this.
### Documentation References
- **`Docs/Backend-Development.md` §2:** Mentions structured logging with `structlog` — the codebase currently uses a custom `JSONFormatter` instead. Document this choice.
- **`Docs/Backend-Development.md` §7 (Testing):** Test logging configuration in isolation to avoid side effects.
### Updates on Documentation
1. Update `Docs/Backend-Development.md` §2 to clarify that the project uses a custom `JSONFormatter` on stdlib `logging`, not `structlog`.
2. Add a comment in `Docker/Dockerfile.backend` explaining why `PYTHONUNBUFFERED=1` is required for JSON log streaming.
### Tests to Write
#### 1. `test_logging_configuration_no_duplicate_handlers`
- **Action:** Call `create_app()` twice in the same process.
- **Assert:** The root logger has exactly one `StreamHandler` after each call.
#### 2. `test_uvicorn_access_logs_are_json`
- **Action:** Emit a log record through the `uvicorn.access` logger.
- **Assert:** The output is valid JSON containing `"event"`, `"level"`, and `"timestamp"`.
#### 3. `test_external_logging_handler_no_recursion`
- **Action:** Emit a log record from `app.utils.external_logging` while the external handler is active.
- **Assert:** The external handler's `queue_log` is NOT called for that record.
#### 4. `test_plain_text_logs_not_emitted_after_startup`
- **Action:** After `create_app()` completes startup, emit a log via `get_logger("app.db")`.
- **Assert:** Output is JSON, not plain text.
---
## Task: Fix Incorrect `since` Timestamp in Ban Service Queries
### Issue in Detail
The `ban_service_ban_trend` and `ban_service_list_bans` log events show `since=1779479327` with `range=24h`, but `1779479327``2026-05-21 18:42 UTC` — roughly **48 hours** before the log timestamp (`2026-05-23 19:49 UTC`). For a 24h range, `since` should be approximately `1779565827`.
Example from logs:
```
event=ban_service_ban_trend since=1779479327 range=24h
```
This suggests the `since` calculation in `app/services/ban_service.py` is using an incorrect time base or double-applying a timezone offset.
### Why This Happens
1. **Possible double UTC conversion:** If `since_unix()` returns a UTC timestamp and the caller also subtracts a timezone offset, the result is shifted by the offset amount.
2. **Wrong reference time:** The `since` calculation may use `datetime.now()` (naive local time) instead of `datetime.now(timezone.utc)` as the reference point.
3. **Hardcoded test data:** If the `since` value is cached or computed once at module import time, it will be stale for long-running processes.
### How to Fix It
1. **Audit `since_unix()` in `app/utils/time_utils.py`:** Verify it uses `datetime.now(timezone.utc)` or `time.time()` as the reference.
2. **Audit callers in `ban_service.py`:** Ensure `since` is computed per-request, not cached. Check for any manual datetime arithmetic that bypasses `since_unix()`.
3. **Add a runtime assertion:** In debug mode, assert that `since` is within `[now - range - slack, now]`.
### Issues and Trapfalls
1. **`since_unix()` slack:** The function adds `TIME_RANGE_SLACK_SECONDS = 60` to the window. Ensure this is documented and expected.
2. **fail2ban epoch compatibility:** fail2ban stores `timeofban` as Unix epoch seconds (UTC). The `since` value must be in the same unit.
3. **Caching:** If `since` is cached in a service-level variable, it will drift over time. Ensure it is computed per-request.
### Documentation References
- **`Docs/Backend-Development.md` §2 (Timestamp Handling):** Rules for UTC timestamps, `since_unix()` usage, and the 60-second slack.
### Tests to Write
#### 1. `test_ban_trend_since_is_within_expected_range`
- **Action:** Call the ban trend service with `range="24h"`.
- **Assert:** The computed `since` is within the last 24 hours + 60s slack.
#### 2. `test_since_unix_returns_utc_epoch`
- **Action:** Call `since_unix("24h")`.
- **Assert:** The result is within 24 hours + 60s of `time.time()`.
---
## Task: Investigate Orphaned SQLite Shared Memory Files on Startup
### Issue in Detail
The log shows repeated warnings:
```
event=orphaned_sqlite_file_removed path=/data/bangui.db-shm
```
This occurs at `19:39:48` and again at `19:49:39` (after restart). The `-shm` file is SQLite's shared memory file for WAL mode. Its presence indicates **unclean shutdowns** (crashes or SIGKILL instead of graceful SIGTERM).
### Why This Happens
1. **Docker stop timeout:** Docker sends SIGTERM, waits `stop_grace_period` (default 10s), then sends SIGKILL. The backend allows 25s for graceful shutdown, but if the container's `stop_grace_period` is shorter, the process is killed before cleanup completes.
2. **Missing connection close:** If the application crashes or is killed, SQLite connections are not closed, leaving `.wal` and `.shm` files behind.
3. **`_cleanup_wal_files()` is a workaround, not a fix:** It removes stale files on the *next* startup, but the underlying cause (unclean shutdown) remains.
### How to Fix It
1. **Verify Docker Compose `stop_grace_period`:** In `Docker/compose.prod.yml`, ensure the backend service has `stop_grace_period: 30s` (matching the 25s internal timeout + margin).
2. **Improve shutdown logging:** Add explicit logs when the database connection is closed during lifespan shutdown.
3. **Consider `PRAGMA journal_mode = DELETE` for single-process setups:** WAL mode is beneficial for concurrent readers, but if BanGUI runs with a single worker and single process, DELETE mode eliminates `.wal`/`.shm` files entirely. Evaluate the tradeoff.
### Issues and Trapfalls
1. **WAL mode is required for concurrent reads:** If you switch to DELETE mode, readers block writers. This may degrade API performance under load.
2. **The `_cleanup_wal_files()` 10-second threshold:** Files modified within 10 seconds are skipped. If the container restarts rapidly (e.g., health check failure → restart), the files may not be cleaned up.
### Documentation References
- **`Docs/Deployment.md`:** Docker deployment configuration and graceful shutdown behavior.
- **`Docs/Architekture.md`:** Deployment constraints and process-local state.
### Tests to Write
#### 1. `test_cleanup_wal_files_removes_stale_files`
- **Setup:** Create fake `.wal` and `.shm` files with mtime > 10s ago.
- **Action:** Call `_cleanup_wal_files()`.
- **Assert:** Files are removed.
#### 2. `test_cleanup_wal_files_skips_recent_files`
- **Setup:** Create fake `.wal` and `.shm` files with mtime < 10s ago.
- **Action:** Call `_cleanup_wal_files()`.
- **Assert:** Files are NOT removed.
---