Fix: Enforce single-worker deployment for session cache cluster safety

Addresses: Backend session cache not cluster-safe (multi-worker issue)

Problem:
- Session cache is process-local (InMemorySessionCache)
- Multi-worker deployments (uvicorn --workers N) create separate processes
- Each process has its own independent session cache
- Sessions cached in Worker A are invisible to Workers B, C, D
- Users randomly logged out when requests land on different workers
- Also affects RuntimeState, rate limiter, and background jobs

Solution (Option A - Strict single-worker enforcement):
- Enhance startup validation with clearer error messages
- Update error messages to explain the problem and how to fix it
- Document single-worker requirement prominently in Docker configs
- Update module docstrings to clarify constraints

Changes:
1. app/startup.py:
   - Enhanced _check_single_worker_mode() error message with troubleshooting
   - Enhanced _stage_check_worker_mode_and_acquire_lock() error message
   - Removed unused import
2. app/utils/session_cache.py:
   - Updated module docstring to explain constraints more clearly
   - Added references to deployment documentation
   - Clarified multi-worker solution for future implementation
3. app/utils/runtime_state.py:
   - Updated module docstring with deployment constraint references
   - Aligned messaging with session_cache.py
4. Docker/Dockerfile.backend:
   - Added comprehensive comments about single-worker requirement
   - Explained impact in multi-worker deployments
   - Referenced deployment constraints documentation
5. Docker/docker-compose.yml, compose.prod.yml, compose.debug.yml:
   - Added documentation comments about BANGUI_WORKERS constraint
   - Explained why single-worker is required
6. backend/tests/test_startup_integration.py:
   - Fixed test unpacking to match function return signature (3 values, not 2)

This ensures multi-worker deployments fail loudly at startup with clear guidance on what went wrong and how to fix it. The database-backed scheduler lock provides defense-in-depth for container orchestration scenarios.

For future multi-worker support, implement:
- Redis or database-backed session cache
- Shared RuntimeState coordination
- Distributed APScheduler backend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
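
Illustration (not part of the commit): a minimal sketch of the fail-loudly startup check described above. It assumes the worker count is read from the BANGUI_WORKERS environment variable, as the error message in the diff suggests. The function name check_single_worker and its env parameter are hypothetical and do not reproduce the actual body of _check_single_worker_mode().

```python
import os
from typing import Mapping, Optional


def check_single_worker(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise RuntimeError if BANGUI_WORKERS requests more than one worker.

    Hypothetical helper for illustration; `env` defaults to os.environ and
    is injectable so the check can be exercised without touching the
    real environment.
    """
    env = os.environ if env is None else env
    raw = env.get("BANGUI_WORKERS")
    if raw is None or raw.strip() == "":
        # Assumption: an unset variable means the server runs one worker.
        return
    try:
        worker_count = int(raw)
    except ValueError as exc:
        raise RuntimeError(
            f"BANGUI_WORKERS must be an integer, got {raw!r}."
        ) from exc
    if worker_count > 1:
        # Fail at startup rather than let per-process caches diverge.
        raise RuntimeError(
            f"BANGUI_WORKERS is set to {worker_count}. Set it to 1 or remove it."
        )


if __name__ == "__main__":
    check_single_worker({"BANGUI_WORKERS": "1"})  # passes silently
    try:
        check_single_worker({"BANGUI_WORKERS": "4"})
    except RuntimeError as err:
        print(f"startup refused: {err}")
```

This sketch only rejects values greater than 1; it does not detect workers started with uvicorn --workers or gunicorn -w when the variable is unset, which is the gap the scheduler lock is described as covering.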
2026-04-30 20:54:24 +02:00
parent f074882f2d
commit c4ede71fa6
8 changed files with 89 additions and 34 deletions
--- a/backend/app/startup.py
+++ b/backend/app/startup.py
@@ -50,7 +50,6 @@ from app.utils.jail_config import ensure_jail_configs
 from app.utils.runtime_state import set_runtime_settings
 from app.utils.scheduler_lock import (
    acquire_scheduler_lock,
-    release_scheduler_lock,
 )
 from app.utils.setup_state import set_setup_complete_cache

@@ -84,7 +83,18 @@ def _check_single_worker_mode() -> None:
                raise RuntimeError(
                    "BanGUI background scheduler cannot run with multiple workers.\n"
                    f"BANGUI_WORKERS is set to {worker_count}. Set it to 1 or remove it.\n"
-                    "See Architekture.md § Deployment Constraints for details."
+                    "\n"
+                    "Why this matters:\n"
+                    "  - Session cache is process-local; users may be randomly logged out\n"
+                    "  - Background jobs (blocklist imports, history sync) would run N times\n"
+                    "  - Database lock contention will cause timeouts\n"
+                    "\n"
+                    "To fix:\n"
+                    "  1. Remove BANGUI_WORKERS=N from your environment\n"
+                    "  2. Don't pass --workers to uvicorn or -w to gunicorn\n"
+                    "  3. Deploy as a single process (use container orchestration for HA)\n"
+                    "\n"
+                    "See Docs/Architekture.md § Deployment Constraints for details."
                )
        except ValueError as e:
            raise RuntimeError(
@@ -275,14 +285,20 @@ async def _stage_check_worker_mode_and_acquire_lock(startup_db: Any) -> None:
    if not await acquire_scheduler_lock(startup_db):
        raise RuntimeError(
            "Could not acquire scheduler lock. Another BanGUI instance is already running the scheduler.\n"
+            "\n"
            "This prevents duplicate background jobs (blocklist imports, history sync, etc.).\n"
            "\n"
+            "IMPORTANT: This also indicates a possible multi-worker misconfiguration:\n"
+            "  - If BANGUI_WORKERS > 1, multiple workers are trying to acquire the lock\n"
+            "  - If --workers or -w was passed to uvicorn/gunicorn, remove it\n"
+            "  - BanGUI must run with exactly 1 worker process (use HA at container level)\n"
+            "\n"
            "To recover from a stale lock (e.g., after a crash):\n"
            "  1. Verify no other BanGUI instances are running\n"
            "  2. Inspect the lock: sqlite3 bangui.db 'SELECT * FROM scheduler_lock;'\n"
            "  3. If stale, clean it: sqlite3 bangui.db 'DELETE FROM scheduler_lock;'\n"
            "\n"
-            "See Architekture.md § Deployment Constraints for details."
+            "See Docs/Architekture.md § Deployment Constraints for details."
        )