- Document database error handling, logging duplication, ban service timestamp, and orphaned SQLite file issues in Tasks.md - Bump backend version from 0.9.19-rc.4 to 0.9.19-rc.5
24 KiB
Tasks
Task: Improve database_open_failed Error Handling in get_db() Dependency
Issue in Detail
The database_open_failed error event is emitted in app.dependencies.get_db() when app.db.open_db() raises an exception. Currently, the code uses a broad except Exception catch-all that:
- Logs the error with
log.error("database_open_failed", error=str(exc)). - Raises a generic
HTTPException(status_code=503, detail="Database is not available.").
This approach has several problems:
- No differentiation between failure modes (missing directory, permission denied, corrupted database, WAL file lock, disk full, etc.).
- No actionable guidance for operators or API consumers — the message is always the same.
- No retry logic for transient failures like
sqlite3.OperationalError(database is locked / busy). - Error details are lost in the HTTP response, making client-side debugging impossible.
- Inconsistent with the error contract documented in
Docs/Backend-Development.md§8, which requires structuredErrorResponseobjects witherror_code,message, anddetails.
Code location: backend/app/dependencies.py:174
# Current implementation (problematic)
try:
db = await open_db(settings.database_path)
except Exception as exc:
log.error("database_open_failed", error=str(exc))
raise HTTPException(
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
detail="Database is not available.",
) from exc
Why This Happens
- Broad exception catch:
except Exceptionswallows all failure modes indiscriminately. - No mapping to error codes: The codebase uses a structured
ErrorResponseschema (error_code,message,details), butget_db()bypasses it by raising a rawHTTPExceptionwith a plain stringdetail. - Missing retry for transient errors: SQLite in WAL mode with
busy_timeout=5000can still fail withOperationalErrorunder heavy concurrency. A single immediate failure is suboptimal. - WAL file cleanup race condition:
_cleanup_wal_files()inopen_db()removes orphaned.wal/.shmfiles, but if another process has the file open,OSErroris silently swallowed. This can leave the database in a state whereopen_db()fails with a cryptic error. - No directory existence check: If the parent directory for
database_pathdoes not exist,aiosqlite.connect()raisessqlite3.OperationalError: unable to open database file, which is indistinguishable from other errors in the current handler. - Blocklist imports run in parallel and hammer the DB: Blocklist import tasks (scheduled via APScheduler) run concurrently with API requests and each other. Each import opens its own database connection via
task_db()and performs massINSERToperations intoimport_logand potentiallyblocklist_sources. Under heavy load, multiple concurrent writers contend for the SQLite WAL lock, causingOperationalError: database is lockedorbusyfailures that bubble up toget_db()for unrelated API requests. The currentbusy_timeout=5000is insufficient when multiple blocklist imports run simultaneously.
How to Fix It
Step 1: Introduce Specific Exception Handling
Map different exception types to appropriate HTTP responses and error codes:
| Exception Type | Error Code | HTTP Status | Client Message |
|---|---|---|---|
PermissionError |
DATABASE_PERMISSION_DENIED |
503 | Insufficient permissions to access the database file. |
sqlite3.OperationalError (unable to open) |
DATABASE_PATH_INVALID |
503 | Database directory does not exist or path is invalid. |
sqlite3.OperationalError (database is locked) |
DATABASE_BUSY |
503 | Database is temporarily busy. Retry the request. |
sqlite3.DatabaseError (corrupt) |
DATABASE_CORRUPTED |
503 | Database file is corrupted. |
| Other / unexpected | DATABASE_UNAVAILABLE |
503 | Database is not available. |
Step 2: Add Retry Logic for Transient DATABASE_BUSY
For sqlite3.OperationalError with "database is locked" or "database is busy", implement an async retry with exponential backoff (max 3 retries, base delay 100ms). Use asyncio.sleep() between retries.
Important: Only retry OperationalError that indicates transient locking. Do NOT retry on "unable to open database file" or corruption errors.
Step 3: Ensure ErrorResponse Compliance
Instead of raising HTTPException(detail="..."), raise a custom domain exception that the global exception handler converts to ErrorResponse. See Docs/Backend-Development.md §8 and app.exceptions / app.middleware.error_handling for the pattern.
Example pattern from the codebase:
from app.exceptions import ServiceUnavailableError
raise ServiceUnavailableError(
error_code="DATABASE_BUSY",
message="Database is temporarily busy.",
details={"database_path": settings.database_path, "retries": 3},
)
If ServiceUnavailableError does not exist, define it in app/exceptions.py following the existing exception hierarchy (see app.exceptions.BanGUIException).
Step 4: Improve open_db() Resilience
In backend/app/db.py, enhance open_db() to:
- Check if the parent directory of
database_pathexists before connecting. If not, attempt to create it withPath.mkdir(parents=True, exist_ok=True). - Distinguish between
OSErrorduring WAL cleanup and connection errors. - Log the specific SQLite error code (
exc.sqlite_errorcodeif available) for better observability.
Step 5: Sanitize Error Responses
Never expose absolute file system paths in production error responses. Use relative paths or redact them:
# Safe
"database_path": "/data/bangui.db" # OK — this is a known container path
# Unsafe (don't do this)
"database_path": "/home/alice/secret/project/bangui.db"
In the BanGUI Docker setup, /data/bangui.db is a well-known path, so including it is acceptable. For non-Docker deployments, consider redacting.
Step 6: Reduce Concurrent Write Pressure from Blocklist Imports
The root cause of many DATABASE_BUSY errors is not API load — it is parallel blocklist import tasks that open independent database connections and perform bulk INSERTs. Mitigate this by:
- Serializing imports: Ensure APScheduler jobs for blocklist imports use
max_instances=1and a job store lock (e.g.,coalesce=True) so only one import runs at a time. - Batching inserts: If blocklist imports insert one row at a time, wrap them in explicit transactions and batch commits (e.g., commit every 100 rows) to reduce WAL checkpoint frequency.
- Using a dedicated connection for imports: Instead of opening a new connection per import task, reuse a single long-lived connection or use a connection pool with a max size of 1 for write operations.
- Scheduling imports during low-traffic windows: Offset import schedules so they do not coincide with peak API usage.
Code to audit: app/tasks/blocklist_import.py, app/services/blocklist_service.py, and the APScheduler job configuration in app/main.py or app/startup.py.
Issues and Trapfalls
- aiosqlite wraps sqlite3 exceptions:
aiosqlitewrapssqlite3.Errorsubclasses inaiosqlite.Error. You may need to inspectexc.__cause__orexc.__context__to get the underlyingsqlite3.OperationalError. Test this carefully. - Event loop blocking: Retry logic with
asyncio.sleep()is safe, but ensure the total retry duration does not exceed FastAPI/uvicorn timeout thresholds. - WAL mode and concurrent readers: Even with
busy_timeout=5000, long-running reads can block writes. The retry logic mitigates but does not eliminate this. - ErrorResponse schema validation: If you introduce a new error code (
DATABASE_BUSY), add it to theErrorCodeenum inapp.models.error(or equivalent) and ensure OpenAPI schema generation includes it. - Testing file permissions: Creating permission-denied scenarios in unit tests requires mocking
aiosqlite.connect()or using temporary directories with restricted permissions. Avoid tests that depend on actualchmodoperations for portability. - Don't break existing tests:
test_get_db_uses_effective_runtime_database_pathinbackend/tests/test_dependencies.pypatchesapp.db.open_db. Ensure your changes remain compatible with this mock-based test. - Blocklist imports are the primary source of
DATABASE_BUSY: The retry logic inget_db()helps API requests survive transient lock contention, but it does not fix the root cause. Consider throttling or serializing blocklist imports inapp.tasks.blocklist_importto reduce concurrent write pressure. If imports are triggered by APScheduler withmax_instances > 1, overlapping runs will compound the problem. - Connection pool exhaustion: If every API request opens a new connection via
get_db()and blocklist imports also open connections, the total concurrent connection count can spike. SQLite has a hard limit on concurrent writers in WAL mode. MonitorPRAGMA wal_checkpointand connection counts in production.
Documentation References
Docs/Backend-Development.md§8 (Error Handling): Describes the error contract — routers must NOT constructHTTPExceptionfor domain errors; use custom exceptions that propagate to the global handler.Docs/Backend-Development.md§2 (Core Libraries): Mentionsaiosqlitefor async SQLite access and the requirement to use structured logging (already done).Docs/Backend-Development.md§6 (Dependency Injection):get_db()is a FastAPI dependency — it must yield/return the connection and handle cleanup infinally.Docs/Backend-Development.md§7 (Testing): Every feature needs tests — mock external dependencies, test error paths.
Updates on Documentation
After implementing the fix, update the following:
Docs/Backend-Development.md§8: Add a subsection or example showing how dependency-level database errors should be handled (custom exception → global handler →ErrorResponse).Docs/API_STATUS_CODES.md(if it exists): Document the newDATABASE_BUSY,DATABASE_PERMISSION_DENIED,DATABASE_PATH_INVALID, andDATABASE_CORRUPTEDerror codes and their HTTP 503 mapping.- Inline docstrings: Update the docstring for
get_db()inapp/dependencies.pyto document which exceptions it can raise and what they mean.
Tests to Write
Add the following tests to backend/tests/test_dependencies.py (or create backend/tests/test_db_errors.py if the file grows large):
1. test_get_db_raises_service_unavailable_on_permission_error
- Mock: Patch
app.db.open_dbto raisePermissionError(13, "Permission denied"). - Assert: The raised exception (or HTTP response, if tested via
TestClient) has status 503 and error codeDATABASE_PERMISSION_DENIED. - Assert: The error response follows the
ErrorResponseschema.
2. test_get_db_raises_service_unavailable_on_missing_directory
- Mock: Patch
app.db.open_dbto raisesqlite3.OperationalError("unable to open database file"). - Assert: Status 503, error code
DATABASE_PATH_INVALID.
3. test_get_db_retries_on_database_locked
- Mock: Patch
app.db.open_dbto raisesqlite3.OperationalError("database is locked")on the first two calls, then succeed on the third. - Assert:
open_dbis called exactly 3 times. - Assert: The final call yields a valid connection.
4. test_get_db_fails_after_max_retries_on_database_locked
- Mock: Patch
app.db.open_dbto always raisesqlite3.OperationalError("database is locked"). - Assert: After 3 retries, raises 503 with error code
DATABASE_BUSY.
5. test_get_db_raises_service_unavailable_on_corrupted_database
- Mock: Patch
app.db.open_dbto raisesqlite3.DatabaseError("database disk image is malformed"). - Assert: Status 503, error code
DATABASE_CORRUPTED.
6. test_open_db_creates_parent_directory_if_missing
- Setup: Provide a
database_pathin a non-existent subdirectory oftmp_path. - Action: Call
open_db(database_path). - Assert: The parent directory is created and the database opens successfully.
7. test_open_db_logs_specific_sqlite_error_code
- Mock: Patch
aiosqlite.connectto raise an exception with a knownsqlite_errorcode. - Assert: The log output includes the error code for observability.
8. test_blocklist_import_serializes_to_avoid_database_busy
- Setup: Configure APScheduler with two blocklist import jobs that would overlap.
- Action: Trigger both jobs simultaneously.
- Assert: Only one job executes at a time (verify via job store or mock call order).
- Assert: No
DATABASE_BUSYerrors occur during import execution.
9. test_get_db_succeeds_during_blocklist_import
- Setup: Start a blocklist import task that holds a database write lock.
- Action: While the import is running, invoke
get_db()via a FastAPI test client. - Assert: The API request succeeds (either immediately or after retry), demonstrating that the retry logic and/or serialized imports prevent user-facing failures.
Task: Fix Massive Log Duplication and Malformed JSON Output
Issue in Detail
Log output in production Docker containers shows severe duplication and malformed JSON. Every log event is emitted 4–20+ times with identical timestamps, and lines contain repeated "INF INF INF..." or "warning warning..." prefixes instead of clean JSON.
Example from actual logs:
INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path INF INF INF INF INF INF INF INF event=fail2ban_metadata_service_resolved_db_path timestamp=2026-05-23T19:39:38.707935+00:00 ...
This makes logs unreadable, increases storage I/O, and breaks any log parsing pipeline.
Why This Happens
-
Handler multiplication: Uvicorn pre-configures the root logger with a
StreamHandlerbefore_configure_logging()inapp/main.pyruns. Whenlogging.basicConfig(level=level, handlers=handlers)is called, it replaces the root logger's handlers — but existing child loggers may still propagate to the root, and the uvicorn access logger (uvicorn.access) retains its own handler. The result is multiple handlers firing for the same log record. -
External logging handler recursion risk:
_ExternalLoggingHandleris attached to the root logger (logging.getLogger()). When it internally callslog.warning()orlog.debug()(e.g.,external_log_buffer_full), those logs propagate back to the root and re-trigger the external handler. While not yet causing aRecursionError, this compounds the duplication. -
Plain text logs before JSON formatter is ready: During startup, before
_configure_logging()runs, module-levelget_logger()calls indb.pyandstartup.pyemit plain text lines likeorphaned_sqlite_file_removedandrate_limiting_process_local_onlythat bypass the JSON formatter entirely.
How to Fix It
Step 1: Clear Existing Handlers Before basicConfig
In backend/app/main.py, modify _configure_logging() to remove all pre-existing handlers from the root logger and uvicorn loggers before applying the new configuration:
def _configure_logging(...) -> None:
# Remove uvicorn's default handlers to prevent duplication
for logger_name in ("uvicorn", "uvicorn.access", "uvicorn.error"):
logger = logging.getLogger(logger_name)
for handler in logger.handlers[:]:
logger.removeHandler(handler)
handler.close()
root = logging.getLogger()
for handler in root.handlers[:]:
root.removeHandler(handler)
handler.close()
# Now safe to configure
logging.basicConfig(level=level, handlers=handlers)
Step 2: Prevent External Logging Recursion
In _ExternalLoggingHandler.emit(), filter out logs originating from app.utils.external_logging to prevent the handler from re-triggering itself:
def emit(self, record: logging.LogRecord) -> None:
if record.name.startswith("app.utils.external_logging"):
return
_external_logging_processor(record)
Alternatively, attach the external handler to a specific logger (logging.getLogger("bangui")) instead of the root logger.
Step 3: Move Logging Configuration Earlier
Ensure _configure_logging() runs before any module imports that call get_logger() at the top level. In app/main.py, the lifespan function already calls _configure_logging() early, but db.py and startup.py are imported earlier in the lifespan. Consider:
- Calling
_configure_logging()at module import time ( guarded byif __name__ == "__main__"or a_logging_configuredflag). - Or restructuring imports so that
_configure_logging()happens beforestartup_shared_resources().
Step 4: Ensure JSONFormatter Is the Only Formatter
Verify that all StreamHandler instances use JSONFormatter. Uvicorn's default formatter is plain text. After clearing uvicorn handlers and re-adding them via basicConfig, they will use the JSON formatter.
Issues and Trapfalls
- Uvicorn access logs: Clearing
uvicorn.accesshandlers will suppress access logs unless you re-add a handler. ThebasicConfigcall adds a handler to the root logger, whichuvicorn.accesspropagates to — so access logs will still appear, but now as JSON. Verify this behavior. - File handler duplication: If
BANGUI_LOG_FILEis set, ensure theFileHandleris only added once.basicConfigwithhandlers=[...]replaces root handlers, but if called multiple times (e.g., during tests), it may leak. - Test isolation: Tests that call
create_app()may re-configure logging. Add a_logging_configuredglobal flag or uselogging.disable()in test fixtures to prevent side effects. - Docker
PYTHONUNBUFFERED: The Dockerfile setsPYTHONUNBUFFERED=1, which is correct for containerized logging. Do not change this.
Documentation References
Docs/Backend-Development.md§2: Mentions structured logging withstructlog— the codebase currently uses a customJSONFormatterinstead. Document this choice.Docs/Backend-Development.md§7 (Testing): Test logging configuration in isolation to avoid side effects.
Updates on Documentation
- Update
Docs/Backend-Development.md§2 to clarify that the project uses a customJSONFormatteron stdliblogging, notstructlog. - Add a comment in
Docker/Dockerfile.backendexplaining whyPYTHONUNBUFFERED=1is required for JSON log streaming.
Tests to Write
1. test_logging_configuration_no_duplicate_handlers
- Action: Call
create_app()twice in the same process. - Assert: The root logger has exactly one
StreamHandlerafter each call.
2. test_uvicorn_access_logs_are_json
- Action: Emit a log record through the
uvicorn.accesslogger. - Assert: The output is valid JSON containing
"event","level", and"timestamp".
3. test_external_logging_handler_no_recursion
- Action: Emit a log record from
app.utils.external_loggingwhile the external handler is active. - Assert: The external handler's
queue_logis NOT called for that record.
4. test_plain_text_logs_not_emitted_after_startup
- Action: After
create_app()completes startup, emit a log viaget_logger("app.db"). - Assert: Output is JSON, not plain text.
Task: Fix Incorrect since Timestamp in Ban Service Queries
Issue in Detail
The ban_service_ban_trend and ban_service_list_bans log events show since=1779479327 with range=24h, but 1779479327 ≈ 2026-05-21 18:42 UTC — roughly 48 hours before the log timestamp (2026-05-23 19:49 UTC). For a 24h range, since should be approximately 1779565827.
Example from logs:
event=ban_service_ban_trend since=1779479327 range=24h
This suggests the since calculation in app/services/ban_service.py is using an incorrect time base or double-applying a timezone offset.
Why This Happens
- Possible double UTC conversion: If
since_unix()returns a UTC timestamp and the caller also subtracts a timezone offset, the result is shifted by the offset amount. - Wrong reference time: The
sincecalculation may usedatetime.now()(naive local time) instead ofdatetime.now(timezone.utc)as the reference point. - Hardcoded test data: If the
sincevalue is cached or computed once at module import time, it will be stale for long-running processes.
How to Fix It
- Audit
since_unix()inapp/utils/time_utils.py: Verify it usesdatetime.now(timezone.utc)ortime.time()as the reference. - Audit callers in
ban_service.py: Ensuresinceis computed per-request, not cached. Check for any manual datetime arithmetic that bypassessince_unix(). - Add a runtime assertion: In debug mode, assert that
sinceis within[now - range - slack, now].
Issues and Trapfalls
since_unix()slack: The function addsTIME_RANGE_SLACK_SECONDS = 60to the window. Ensure this is documented and expected.- fail2ban epoch compatibility: fail2ban stores
timeofbanas Unix epoch seconds (UTC). Thesincevalue must be in the same unit. - Caching: If
sinceis cached in a service-level variable, it will drift over time. Ensure it is computed per-request.
Documentation References
Docs/Backend-Development.md§2 (Timestamp Handling): Rules for UTC timestamps,since_unix()usage, and the 60-second slack.
Tests to Write
1. test_ban_trend_since_is_within_expected_range
- Action: Call the ban trend service with
range="24h". - Assert: The computed
sinceis within the last 24 hours + 60s slack.
2. test_since_unix_returns_utc_epoch
- Action: Call
since_unix("24h"). - Assert: The result is within 24 hours + 60s of
time.time().
Task: Investigate Orphaned SQLite Shared Memory Files on Startup
Issue in Detail
The log shows repeated warnings:
event=orphaned_sqlite_file_removed path=/data/bangui.db-shm
This occurs at 19:39:48 and again at 19:49:39 (after restart). The -shm file is SQLite's shared memory file for WAL mode. Its presence indicates unclean shutdowns (crashes or SIGKILL instead of graceful SIGTERM).
Why This Happens
- Docker stop timeout: Docker sends SIGTERM, waits
stop_grace_period(default 10s), then sends SIGKILL. The backend allows 25s for graceful shutdown, but if the container'sstop_grace_periodis shorter, the process is killed before cleanup completes. - Missing connection close: If the application crashes or is killed, SQLite connections are not closed, leaving
.waland.shmfiles behind. _cleanup_wal_files()is a workaround, not a fix: It removes stale files on the next startup, but the underlying cause (unclean shutdown) remains.
How to Fix It
- Verify Docker Compose
stop_grace_period: InDocker/compose.prod.yml, ensure the backend service hasstop_grace_period: 30s(matching the 25s internal timeout + margin). - Improve shutdown logging: Add explicit logs when the database connection is closed during lifespan shutdown.
- Consider
PRAGMA journal_mode = DELETEfor single-process setups: WAL mode is beneficial for concurrent readers, but if BanGUI runs with a single worker and single process, DELETE mode eliminates.wal/.shmfiles entirely. Evaluate the tradeoff.
Issues and Trapfalls
- WAL mode is required for concurrent reads: If you switch to DELETE mode, readers block writers. This may degrade API performance under load.
- The
_cleanup_wal_files()10-second threshold: Files modified within 10 seconds are skipped. If the container restarts rapidly (e.g., health check failure → restart), the files may not be cleaned up.
Documentation References
Docs/Deployment.md: Docker deployment configuration and graceful shutdown behavior.Docs/Architekture.md: Deployment constraints and process-local state.
Tests to Write
1. test_cleanup_wal_files_removes_stale_files
- Setup: Create fake
.waland.shmfiles with mtime > 10s ago. - Action: Call
_cleanup_wal_files(). - Assert: Files are removed.
2. test_cleanup_wal_files_skips_recent_files
- Setup: Create fake
.waland.shmfiles with mtime < 10s ago. - Action: Call
_cleanup_wal_files(). - Assert: Files are NOT removed.