docs: add comprehensive task backlog and bump version to rc.5
- Document database error handling, logging duplication, ban service timestamp, and orphaned SQLite file issues in Tasks.md - Bump backend version from 0.9.19-rc.4 to 0.9.19-rc.5
This commit is contained in:
381
Docs/Tasks.md
381
Docs/Tasks.md
@@ -0,0 +1,381 @@
|
||||
# Tasks
|
||||
|
||||
## Task: Improve `database_open_failed` Error Handling in `get_db()` Dependency
|
||||
|
||||
### Issue in Detail
|
||||
|
||||
The `database_open_failed` error event is emitted in `app.dependencies.get_db()` when `app.db.open_db()` raises an exception. Currently, the code uses a broad `except Exception` catch-all that:
|
||||
|
||||
1. Logs the error with `log.error("database_open_failed", error=str(exc))`.
|
||||
2. Raises a generic `HTTPException(status_code=503, detail="Database is not available.")`.
|
||||
|
||||
This approach has several problems:
|
||||
- **No differentiation** between failure modes (missing directory, permission denied, corrupted database, WAL file lock, disk full, etc.).
|
||||
- **No actionable guidance** for operators or API consumers — the message is always the same.
|
||||
- **No retry logic** for transient failures like `sqlite3.OperationalError` (database is locked / busy).
|
||||
- **Error details are lost** in the HTTP response, making client-side debugging impossible.
|
||||
- **Inconsistent with the error contract** documented in `Docs/Backend-Development.md` §8, which requires structured `ErrorResponse` objects with `error_code`, `message`, and `details`.
|
||||
|
||||
**Code location:** `backend/app/dependencies.py:174`
|
||||
|
||||
```python
|
||||
# Current implementation (problematic)
|
||||
try:
|
||||
db = await open_db(settings.database_path)
|
||||
except Exception as exc:
|
||||
log.error("database_open_failed", error=str(exc))
|
||||
raise HTTPException(
|
||||
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
|
||||
detail="Database is not available.",
|
||||
) from exc
|
||||
```
|
||||
|
||||
### Why This Happens
|
||||
|
||||
1. **Broad exception catch:** `except Exception` swallows all failure modes indiscriminately.
|
||||
2. **No mapping to error codes:** The codebase uses a structured `ErrorResponse` schema (`error_code`, `message`, `details`), but `get_db()` bypasses it by raising a raw `HTTPException` with a plain string `detail`.
|
||||
3. **Missing retry for transient errors:** SQLite in WAL mode with `busy_timeout=5000` can still fail with `OperationalError` under heavy concurrency. A single immediate failure is suboptimal.
|
||||
4. **WAL file cleanup race condition:** `_cleanup_wal_files()` in `open_db()` removes orphaned `.wal`/`.shm` files, but if another process has the file open, `OSError` is silently swallowed. This can leave the database in a state where `open_db()` fails with a cryptic error.
|
||||
5. **No directory existence check:** If the parent directory for `database_path` does not exist, `aiosqlite.connect()` raises `sqlite3.OperationalError: unable to open database file`, which is indistinguishable from other errors in the current handler.
|
||||
6. **Blocklist imports run in parallel and hammer the DB:** Blocklist import tasks (scheduled via APScheduler) run concurrently with API requests and each other. Each import opens its own database connection via `task_db()` and performs mass `INSERT` operations into `import_log` and potentially `blocklist_sources`. Under heavy load, multiple concurrent writers contend for the SQLite WAL lock, causing `OperationalError: database is locked` or `busy` failures that bubble up to `get_db()` for unrelated API requests. The current `busy_timeout=5000` is insufficient when multiple blocklist imports run simultaneously.
|
||||
|
||||
### How to Fix It
|
||||
|
||||
#### Step 1: Introduce Specific Exception Handling
|
||||
|
||||
Map different exception types to appropriate HTTP responses and error codes:
|
||||
|
||||
| Exception Type | Error Code | HTTP Status | Client Message |
|
||||
|---|---|---|---|
|
||||
| `PermissionError` | `DATABASE_PERMISSION_DENIED` | 503 | Insufficient permissions to access the database file. |
|
||||
| `sqlite3.OperationalError` (unable to open) | `DATABASE_PATH_INVALID` | 503 | Database directory does not exist or path is invalid. |
|
||||
| `sqlite3.OperationalError` (database is locked) | `DATABASE_BUSY` | 503 | Database is temporarily busy. Retry the request. |
|
||||
| `sqlite3.DatabaseError` (corrupt) | `DATABASE_CORRUPTED` | 503 | Database file is corrupted. |
|
||||
| Other / unexpected | `DATABASE_UNAVAILABLE` | 503 | Database is not available. |
|
||||
|
||||
#### Step 2: Add Retry Logic for Transient `DATABASE_BUSY`
|
||||
|
||||
For `sqlite3.OperationalError` with "database is locked" or "database is busy", implement an async retry with exponential backoff (max 3 retries, base delay 100ms). Use `asyncio.sleep()` between retries.
|
||||
|
||||
**Important:** Only retry `OperationalError` that indicates transient locking. Do NOT retry on "unable to open database file" or corruption errors.
|
||||
|
||||
#### Step 3: Ensure ErrorResponse Compliance
|
||||
|
||||
Instead of raising `HTTPException(detail="...")`, raise a custom domain exception that the global exception handler converts to `ErrorResponse`. See `Docs/Backend-Development.md` §8 and `app.exceptions` / `app.middleware.error_handling` for the pattern.
|
||||
|
||||
Example pattern from the codebase:
|
||||
```python
|
||||
from app.exceptions import ServiceUnavailableError
|
||||
|
||||
raise ServiceUnavailableError(
|
||||
error_code="DATABASE_BUSY",
|
||||
message="Database is temporarily busy.",
|
||||
details={"database_path": settings.database_path, "retries": 3},
|
||||
)
|
||||
```
|
||||
|
||||
If `ServiceUnavailableError` does not exist, define it in `app/exceptions.py` following the existing exception hierarchy (see `app.exceptions.BanGUIException`).
|
||||
|
||||
#### Step 4: Improve `open_db()` Resilience
|
||||
|
||||
In `backend/app/db.py`, enhance `open_db()` to:
|
||||
1. Check if the parent directory of `database_path` exists before connecting. If not, attempt to create it with `Path.mkdir(parents=True, exist_ok=True)`.
|
||||
2. Distinguish between `OSError` during WAL cleanup and connection errors.
|
||||
3. Log the specific SQLite error code (`exc.sqlite_errorcode` if available) for better observability.
|
||||
|
||||
#### Step 5: Sanitize Error Responses
|
||||
|
||||
Never expose absolute file system paths in production error responses. Use relative paths or redact them:
|
||||
```python
|
||||
# Safe
|
||||
"database_path": "/data/bangui.db" # OK — this is a known container path
|
||||
|
||||
# Unsafe (don't do this)
|
||||
"database_path": "/home/alice/secret/project/bangui.db"
|
||||
```
|
||||
|
||||
In the BanGUI Docker setup, `/data/bangui.db` is a well-known path, so including it is acceptable. For non-Docker deployments, consider redacting.
|
||||
|
||||
#### Step 6: Reduce Concurrent Write Pressure from Blocklist Imports
|
||||
|
||||
The root cause of many `DATABASE_BUSY` errors is not API load — it is parallel blocklist import tasks that open independent database connections and perform bulk `INSERT`s. Mitigate this by:
|
||||
|
||||
1. **Serializing imports:** Ensure APScheduler jobs for blocklist imports use `max_instances=1` and a job store lock (e.g., `coalesce=True`) so only one import runs at a time.
|
||||
2. **Batching inserts:** If blocklist imports insert one row at a time, wrap them in explicit transactions and batch commits (e.g., commit every 100 rows) to reduce WAL checkpoint frequency.
|
||||
3. **Using a dedicated connection for imports:** Instead of opening a new connection per import task, reuse a single long-lived connection or use a connection pool with a max size of 1 for write operations.
|
||||
4. **Scheduling imports during low-traffic windows:** Offset import schedules so they do not coincide with peak API usage.
|
||||
|
||||
**Code to audit:** `app/tasks/blocklist_import.py`, `app/services/blocklist_service.py`, and the APScheduler job configuration in `app/main.py` or `app/startup.py`.
|
||||
|
||||
### Issues and Trapfalls
|
||||
|
||||
1. **aiosqlite wraps sqlite3 exceptions:** `aiosqlite` wraps `sqlite3.Error` subclasses in `aiosqlite.Error`. You may need to inspect `exc.__cause__` or `exc.__context__` to get the underlying `sqlite3.OperationalError`. Test this carefully.
|
||||
2. **Event loop blocking:** Retry logic with `asyncio.sleep()` is safe, but ensure the total retry duration does not exceed FastAPI/uvicorn timeout thresholds.
|
||||
3. **WAL mode and concurrent readers:** Even with `busy_timeout=5000`, long-running reads can block writes. The retry logic mitigates but does not eliminate this.
|
||||
4. **ErrorResponse schema validation:** If you introduce a new error code (`DATABASE_BUSY`), add it to the `ErrorCode` enum in `app.models.error` (or equivalent) and ensure OpenAPI schema generation includes it.
|
||||
5. **Testing file permissions:** Creating permission-denied scenarios in unit tests requires mocking `aiosqlite.connect()` or using temporary directories with restricted permissions. Avoid tests that depend on actual `chmod` operations for portability.
|
||||
6. **Don't break existing tests:** `test_get_db_uses_effective_runtime_database_path` in `backend/tests/test_dependencies.py` patches `app.db.open_db`. Ensure your changes remain compatible with this mock-based test.
|
||||
7. **Blocklist imports are the primary source of `DATABASE_BUSY`:** The retry logic in `get_db()` helps API requests survive transient lock contention, but it does not fix the root cause. Consider throttling or serializing blocklist imports in `app.tasks.blocklist_import` to reduce concurrent write pressure. If imports are triggered by APScheduler with `max_instances > 1`, overlapping runs will compound the problem.
|
||||
8. **Connection pool exhaustion:** If every API request opens a new connection via `get_db()` and blocklist imports also open connections, the total concurrent connection count can spike. SQLite has a hard limit on concurrent writers in WAL mode. Monitor `PRAGMA wal_checkpoint` and connection counts in production.
|
||||
|
||||
### Documentation References
|
||||
|
||||
- **`Docs/Backend-Development.md` §8 (Error Handling):** Describes the error contract — routers must NOT construct `HTTPException` for domain errors; use custom exceptions that propagate to the global handler.
|
||||
- **`Docs/Backend-Development.md` §2 (Core Libraries):** Mentions `aiosqlite` for async SQLite access and the requirement to use structured logging (already done).
|
||||
- **`Docs/Backend-Development.md` §6 (Dependency Injection):** `get_db()` is a FastAPI dependency — it must yield/return the connection and handle cleanup in `finally`.
|
||||
- **`Docs/Backend-Development.md` §7 (Testing):** Every feature needs tests — mock external dependencies, test error paths.
|
||||
|
||||
### Updates on Documentation
|
||||
|
||||
After implementing the fix, update the following:
|
||||
|
||||
1. **`Docs/Backend-Development.md` §8:** Add a subsection or example showing how dependency-level database errors should be handled (custom exception → global handler → `ErrorResponse`).
|
||||
2. **`Docs/API_STATUS_CODES.md` (if it exists):** Document the new `DATABASE_BUSY`, `DATABASE_PERMISSION_DENIED`, `DATABASE_PATH_INVALID`, and `DATABASE_CORRUPTED` error codes and their HTTP 503 mapping.
|
||||
3. **Inline docstrings:** Update the docstring for `get_db()` in `app/dependencies.py` to document which exceptions it can raise and what they mean.
|
||||
|
||||
### Tests to Write
|
||||
|
||||
Add the following tests to `backend/tests/test_dependencies.py` (or create `backend/tests/test_db_errors.py` if the file grows large):
|
||||
|
||||
#### 1. `test_get_db_raises_service_unavailable_on_permission_error`
|
||||
- **Mock:** Patch `app.db.open_db` to raise `PermissionError(13, "Permission denied")`.
|
||||
- **Assert:** The raised exception (or HTTP response, if tested via `TestClient`) has status 503 and error code `DATABASE_PERMISSION_DENIED`.
|
||||
- **Assert:** The error response follows the `ErrorResponse` schema.
|
||||
|
||||
#### 2. `test_get_db_raises_service_unavailable_on_missing_directory`
|
||||
- **Mock:** Patch `app.db.open_db` to raise `sqlite3.OperationalError("unable to open database file")`.
|
||||
- **Assert:** Status 503, error code `DATABASE_PATH_INVALID`.
|
||||
|
||||
#### 3. `test_get_db_retries_on_database_locked`
|
||||
- **Mock:** Patch `app.db.open_db` to raise `sqlite3.OperationalError("database is locked")` on the first two calls, then succeed on the third.
|
||||
- **Assert:** `open_db` is called exactly 3 times.
|
||||
- **Assert:** The final call yields a valid connection.
|
||||
|
||||
#### 4. `test_get_db_fails_after_max_retries_on_database_locked`
|
||||
- **Mock:** Patch `app.db.open_db` to always raise `sqlite3.OperationalError("database is locked")`.
|
||||
- **Assert:** After 3 retries, raises 503 with error code `DATABASE_BUSY`.
|
||||
|
||||
#### 5. `test_get_db_raises_service_unavailable_on_corrupted_database`
|
||||
- **Mock:** Patch `app.db.open_db` to raise `sqlite3.DatabaseError("database disk image is malformed")`.
|
||||
- **Assert:** Status 503, error code `DATABASE_CORRUPTED`.
|
||||
|
||||
#### 6. `test_open_db_creates_parent_directory_if_missing`
|
||||
- **Setup:** Provide a `database_path` in a non-existent subdirectory of `tmp_path`.
|
||||
- **Action:** Call `open_db(database_path)`.
|
||||
- **Assert:** The parent directory is created and the database opens successfully.
|
||||
|
||||
#### 7. `test_open_db_logs_specific_sqlite_error_code`
|
||||
- **Mock:** Patch `aiosqlite.connect` to raise an exception with a known `sqlite_errorcode`.
|
||||
- **Assert:** The log output includes the error code for observability.
|
||||
|
||||
#### 8. `test_blocklist_import_serializes_to_avoid_database_busy`
|
||||
- **Setup:** Configure APScheduler with two blocklist import jobs that would overlap.
|
||||
- **Action:** Trigger both jobs simultaneously.
|
||||
- **Assert:** Only one job executes at a time (verify via job store or mock call order).
|
||||
- **Assert:** No `DATABASE_BUSY` errors occur during import execution.
|
||||
|
||||
#### 9. `test_get_db_succeeds_during_blocklist_import`
|
||||
- **Setup:** Start a blocklist import task that holds a database write lock.
|
||||
- **Action:** While the import is running, invoke `get_db()` via a FastAPI test client.
|
||||
- **Assert:** The API request succeeds (either immediately or after retry), demonstrating that the retry logic and/or serialized imports prevent user-facing failures.
|
||||
|
||||
---
|
||||
|
||||
## Task: Fix Massive Log Duplication and Malformed JSON Output
|
||||
|
||||
### Issue in Detail
|
||||
|
||||
Log output in production Docker containers shows severe duplication and malformed JSON. Every log event is emitted **4–20+ times** with identical timestamps, and lines contain repeated `"INF INF INF..."` or `"warning warning..."` prefixes instead of clean JSON.
|
||||
|
||||
Example from actual logs:
|
||||
```
|
||||
INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path timestamp=2026-05-23T19:39:38.707935+00:00 ...
|
||||
```
|
||||
|
||||
This makes logs unreadable, increases storage I/O, and breaks any log parsing pipeline.
|
||||
|
||||
### Why This Happens
|
||||
|
||||
1. **Handler multiplication:** Uvicorn pre-configures the root logger with a `StreamHandler` before `_configure_logging()` in `app/main.py` runs. When `logging.basicConfig(level=level, handlers=handlers)` is called, it replaces the root logger's handlers — but existing child loggers may still propagate to the root, and the uvicorn access logger (`uvicorn.access`) retains its own handler. The result is multiple handlers firing for the same log record.
|
||||
|
||||
2. **External logging handler recursion risk:** `_ExternalLoggingHandler` is attached to the root logger (`logging.getLogger()`). When it internally calls `log.warning()` or `log.debug()` (e.g., `external_log_buffer_full`), those logs propagate back to the root and re-trigger the external handler. While not yet causing a `RecursionError`, this compounds the duplication.
|
||||
|
||||
3. **Plain text logs before JSON formatter is ready:** During startup, before `_configure_logging()` runs, module-level `get_logger()` calls in `db.py` and `startup.py` emit plain text lines like `orphaned_sqlite_file_removed` and `rate_limiting_process_local_only` that bypass the JSON formatter entirely.
|
||||
|
||||
### How to Fix It
|
||||
|
||||
#### Step 1: Clear Existing Handlers Before `basicConfig`
|
||||
|
||||
In `backend/app/main.py`, modify `_configure_logging()` to remove all pre-existing handlers from the root logger and uvicorn loggers before applying the new configuration:
|
||||
|
||||
```python
|
||||
def _configure_logging(...) -> None:
|
||||
# Remove uvicorn's default handlers to prevent duplication
|
||||
for logger_name in ("uvicorn", "uvicorn.access", "uvicorn.error"):
|
||||
logger = logging.getLogger(logger_name)
|
||||
for handler in logger.handlers[:]:
|
||||
logger.removeHandler(handler)
|
||||
handler.close()
|
||||
|
||||
root = logging.getLogger()
|
||||
for handler in root.handlers[:]:
|
||||
root.removeHandler(handler)
|
||||
handler.close()
|
||||
|
||||
# Now safe to configure
|
||||
logging.basicConfig(level=level, handlers=handlers)
|
||||
```
|
||||
|
||||
#### Step 2: Prevent External Logging Recursion
|
||||
|
||||
In `_ExternalLoggingHandler.emit()`, filter out logs originating from `app.utils.external_logging` to prevent the handler from re-triggering itself:
|
||||
|
||||
```python
|
||||
def emit(self, record: logging.LogRecord) -> None:
|
||||
if record.name.startswith("app.utils.external_logging"):
|
||||
return
|
||||
_external_logging_processor(record)
|
||||
```
|
||||
|
||||
Alternatively, attach the external handler to a specific logger (`logging.getLogger("bangui")`) instead of the root logger.
|
||||
|
||||
#### Step 3: Move Logging Configuration Earlier
|
||||
|
||||
Ensure `_configure_logging()` runs **before** any module imports that call `get_logger()` at the top level. In `app/main.py`, the lifespan function already calls `_configure_logging()` early, but `db.py` and `startup.py` are imported earlier in the lifespan. Consider:
|
||||
- Calling `_configure_logging()` at module import time ( guarded by `if __name__ == "__main__"` or a `_logging_configured` flag).
|
||||
- Or restructuring imports so that `_configure_logging()` happens before `startup_shared_resources()`.
|
||||
|
||||
#### Step 4: Ensure `JSONFormatter` Is the Only Formatter
|
||||
|
||||
Verify that all `StreamHandler` instances use `JSONFormatter`. Uvicorn's default formatter is plain text. After clearing uvicorn handlers and re-adding them via `basicConfig`, they will use the JSON formatter.
|
||||
|
||||
### Issues and Trapfalls
|
||||
|
||||
1. **Uvicorn access logs:** Clearing `uvicorn.access` handlers will suppress access logs unless you re-add a handler. The `basicConfig` call adds a handler to the root logger, which `uvicorn.access` propagates to — so access logs will still appear, but now as JSON. Verify this behavior.
|
||||
2. **File handler duplication:** If `BANGUI_LOG_FILE` is set, ensure the `FileHandler` is only added once. `basicConfig` with `handlers=[...]` replaces root handlers, but if called multiple times (e.g., during tests), it may leak.
|
||||
3. **Test isolation:** Tests that call `create_app()` may re-configure logging. Add a `_logging_configured` global flag or use `logging.disable()` in test fixtures to prevent side effects.
|
||||
4. **Docker `PYTHONUNBUFFERED`:** The Dockerfile sets `PYTHONUNBUFFERED=1`, which is correct for containerized logging. Do not change this.
|
||||
|
||||
### Documentation References
|
||||
|
||||
- **`Docs/Backend-Development.md` §2:** Mentions structured logging with `structlog` — the codebase currently uses a custom `JSONFormatter` instead. Document this choice.
|
||||
- **`Docs/Backend-Development.md` §7 (Testing):** Test logging configuration in isolation to avoid side effects.
|
||||
|
||||
### Updates on Documentation
|
||||
|
||||
1. Update `Docs/Backend-Development.md` §2 to clarify that the project uses a custom `JSONFormatter` on stdlib `logging`, not `structlog`.
|
||||
2. Add a comment in `Docker/Dockerfile.backend` explaining why `PYTHONUNBUFFERED=1` is required for JSON log streaming.
|
||||
|
||||
### Tests to Write
|
||||
|
||||
#### 1. `test_logging_configuration_no_duplicate_handlers`
|
||||
- **Action:** Call `create_app()` twice in the same process.
|
||||
- **Assert:** The root logger has exactly one `StreamHandler` after each call.
|
||||
|
||||
#### 2. `test_uvicorn_access_logs_are_json`
|
||||
- **Action:** Emit a log record through the `uvicorn.access` logger.
|
||||
- **Assert:** The output is valid JSON containing `"event"`, `"level"`, and `"timestamp"`.
|
||||
|
||||
#### 3. `test_external_logging_handler_no_recursion`
|
||||
- **Action:** Emit a log record from `app.utils.external_logging` while the external handler is active.
|
||||
- **Assert:** The external handler's `queue_log` is NOT called for that record.
|
||||
|
||||
#### 4. `test_plain_text_logs_not_emitted_after_startup`
|
||||
- **Action:** After `create_app()` completes startup, emit a log via `get_logger("app.db")`.
|
||||
- **Assert:** Output is JSON, not plain text.
|
||||
|
||||
---
|
||||
|
||||
## Task: Fix Incorrect `since` Timestamp in Ban Service Queries
|
||||
|
||||
### Issue in Detail
|
||||
|
||||
The `ban_service_ban_trend` and `ban_service_list_bans` log events show `since=1779479327` with `range=24h`, but `1779479327` ≈ `2026-05-21 18:42 UTC` — roughly **48 hours** before the log timestamp (`2026-05-23 19:49 UTC`). For a 24h range, `since` should be approximately `1779565827`.
|
||||
|
||||
Example from logs:
|
||||
```
|
||||
event=ban_service_ban_trend since=1779479327 range=24h
|
||||
```
|
||||
|
||||
This suggests the `since` calculation in `app/services/ban_service.py` is using an incorrect time base or double-applying a timezone offset.
|
||||
|
||||
### Why This Happens
|
||||
|
||||
1. **Possible double UTC conversion:** If `since_unix()` returns a UTC timestamp and the caller also subtracts a timezone offset, the result is shifted by the offset amount.
|
||||
2. **Wrong reference time:** The `since` calculation may use `datetime.now()` (naive local time) instead of `datetime.now(timezone.utc)` as the reference point.
|
||||
3. **Hardcoded test data:** If the `since` value is cached or computed once at module import time, it will be stale for long-running processes.
|
||||
|
||||
### How to Fix It
|
||||
|
||||
1. **Audit `since_unix()` in `app/utils/time_utils.py`:** Verify it uses `datetime.now(timezone.utc)` or `time.time()` as the reference.
|
||||
2. **Audit callers in `ban_service.py`:** Ensure `since` is computed per-request, not cached. Check for any manual datetime arithmetic that bypasses `since_unix()`.
|
||||
3. **Add a runtime assertion:** In debug mode, assert that `since` is within `[now - range - slack, now]`.
|
||||
|
||||
### Issues and Trapfalls
|
||||
|
||||
1. **`since_unix()` slack:** The function adds `TIME_RANGE_SLACK_SECONDS = 60` to the window. Ensure this is documented and expected.
|
||||
2. **fail2ban epoch compatibility:** fail2ban stores `timeofban` as Unix epoch seconds (UTC). The `since` value must be in the same unit.
|
||||
3. **Caching:** If `since` is cached in a service-level variable, it will drift over time. Ensure it is computed per-request.
|
||||
|
||||
### Documentation References
|
||||
|
||||
- **`Docs/Backend-Development.md` §2 (Timestamp Handling):** Rules for UTC timestamps, `since_unix()` usage, and the 60-second slack.
|
||||
|
||||
### Tests to Write
|
||||
|
||||
#### 1. `test_ban_trend_since_is_within_expected_range`
|
||||
- **Action:** Call the ban trend service with `range="24h"`.
|
||||
- **Assert:** The computed `since` is within the last 24 hours + 60s slack.
|
||||
|
||||
#### 2. `test_since_unix_returns_utc_epoch`
|
||||
- **Action:** Call `since_unix("24h")`.
|
||||
- **Assert:** The result is within 24 hours + 60s of `time.time()`.
|
||||
|
||||
---
|
||||
|
||||
## Task: Investigate Orphaned SQLite Shared Memory Files on Startup
|
||||
|
||||
### Issue in Detail
|
||||
|
||||
The log shows repeated warnings:
|
||||
```
|
||||
event=orphaned_sqlite_file_removed path=/data/bangui.db-shm
|
||||
```
|
||||
|
||||
This occurs at `19:39:48` and again at `19:49:39` (after restart). The `-shm` file is SQLite's shared memory file for WAL mode. Its presence indicates **unclean shutdowns** (crashes or SIGKILL instead of graceful SIGTERM).
|
||||
|
||||
### Why This Happens
|
||||
|
||||
1. **Docker stop timeout:** Docker sends SIGTERM, waits `stop_grace_period` (default 10s), then sends SIGKILL. The backend allows 25s for graceful shutdown, but if the container's `stop_grace_period` is shorter, the process is killed before cleanup completes.
|
||||
2. **Missing connection close:** If the application crashes or is killed, SQLite connections are not closed, leaving `.wal` and `.shm` files behind.
|
||||
3. **`_cleanup_wal_files()` is a workaround, not a fix:** It removes stale files on the *next* startup, but the underlying cause (unclean shutdown) remains.
|
||||
|
||||
### How to Fix It
|
||||
|
||||
1. **Verify Docker Compose `stop_grace_period`:** In `Docker/compose.prod.yml`, ensure the backend service has `stop_grace_period: 30s` (matching the 25s internal timeout + margin).
|
||||
2. **Improve shutdown logging:** Add explicit logs when the database connection is closed during lifespan shutdown.
|
||||
3. **Consider `PRAGMA journal_mode = DELETE` for single-process setups:** WAL mode is beneficial for concurrent readers, but if BanGUI runs with a single worker and single process, DELETE mode eliminates `.wal`/`.shm` files entirely. Evaluate the tradeoff.
|
||||
|
||||
### Issues and Trapfalls
|
||||
|
||||
1. **WAL mode is required for concurrent reads:** If you switch to DELETE mode, readers block writers. This may degrade API performance under load.
|
||||
2. **The `_cleanup_wal_files()` 10-second threshold:** Files modified within 10 seconds are skipped. If the container restarts rapidly (e.g., health check failure → restart), the files may not be cleaned up.
|
||||
|
||||
### Documentation References
|
||||
|
||||
- **`Docs/Deployment.md`:** Docker deployment configuration and graceful shutdown behavior.
|
||||
- **`Docs/Architekture.md`:** Deployment constraints and process-local state.
|
||||
|
||||
### Tests to Write
|
||||
|
||||
#### 1. `test_cleanup_wal_files_removes_stale_files`
|
||||
- **Setup:** Create fake `.wal` and `.shm` files with mtime > 10s ago.
|
||||
- **Action:** Call `_cleanup_wal_files()`.
|
||||
- **Assert:** Files are removed.
|
||||
|
||||
#### 2. `test_cleanup_wal_files_skips_recent_files`
|
||||
- **Setup:** Create fake `.wal` and `.shm` files with mtime < 10s ago.
|
||||
- **Action:** Call `_cleanup_wal_files()`.
|
||||
- **Assert:** Files are NOT removed.
|
||||
|
||||
---
|
||||
|
||||
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
|
||||
|
||||
[project]
|
||||
name = "bangui-backend"
|
||||
version = "0.9.19-rc.4"
|
||||
version = "0.9.19-rc.5"
|
||||
description = "BanGUI backend — fail2ban web management interface"
|
||||
requires-python = ">=3.12"
|
||||
dependencies = [
|
||||
|
||||
Reference in New Issue
Block a user