Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a robust database-backed lock mechanism that works reliably in container orchestration environments.

Key changes:

1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
- Atomicity: SQLite transactions prevent TOCTOU race windows
- Container-safe: Works across containers with shared DB volumes
- No NFS/SMB edge cases
- Timestamp-based stale detection (PID reuse is unreliable)
- More reliable in rolling deployments

Benefits:
- Works with any process manager (uvicorn, gunicorn, etc.)
- Handles simultaneous startup attempts correctly
- Automatic failover on instance crash (stale lock cleanup)
- Clear error messages with troubleshooting steps
- No environment variable required (lock is authoritative)
- Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
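
Illustrative note: the singleton-row lock with TTL-based stale detection described in the message could look roughly like the sketch below. This is not the code in app/utils/scheduler_lock.py, which this excerpt does not show. The column names, the function names (acquire_lock, heartbeat, release_lock), and the use of the synchronous stdlib sqlite3 module are all assumptions made for illustration; the real helpers are awaited in the diff, so they are presumably async. Only the table name, the id=1 singleton row, the stored fields, and the 60-second expiry come from the message.

```python
import os
import socket
import sqlite3
import time

# From the commit message: a lock whose heartbeat is older than 60 s is stale.
LOCK_TTL_SECONDS = 60

# Hypothetical schema; only the table name and the stored fields are taken
# from the commit message. CHECK (id = 1) enforces the singleton row.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id           INTEGER PRIMARY KEY CHECK (id = 1),
    pid          INTEGER NOT NULL,
    hostname     TEXT    NOT NULL,
    created_at   REAL    NOT NULL,
    heartbeat_at REAL    NOT NULL
)
"""


def acquire_lock(conn, pid=None, hostname=None, now=None):
    """Try to take the lock; return True on success, False if it is held.

    `conn` must be opened with isolation_level=None so that the explicit
    BEGIN IMMEDIATE / COMMIT statements below control the transaction.
    """
    pid = os.getpid() if pid is None else pid
    hostname = socket.gethostname() if hostname is None else hostname
    now = time.time() if now is None else now

    # BEGIN IMMEDIATE takes the write lock up front, so the stale check and
    # the insert cannot interleave with another process doing the same.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Drop a lock whose owner stopped sending heartbeats.
        conn.execute(
            "DELETE FROM scheduler_lock WHERE id = 1 AND heartbeat_at < ?",
            (now - LOCK_TTL_SECONDS,),
        )
        # The insert is ignored (rowcount 0) if a live lock row still exists.
        cur = conn.execute(
            "INSERT OR IGNORE INTO scheduler_lock "
            "(id, pid, hostname, created_at, heartbeat_at) "
            "VALUES (1, ?, ?, ?, ?)",
            (pid, hostname, now, now),
        )
        acquired = cur.rowcount == 1
        conn.execute("COMMIT")
        return acquired
    except Exception:
        conn.execute("ROLLBACK")
        raise


def heartbeat(conn, pid, hostname, now=None):
    """Refresh the heartbeat; return False if this process no longer owns the lock."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scheduler_lock SET heartbeat_at = ? "
        "WHERE id = 1 AND pid = ? AND hostname = ?",
        (now, pid, hostname),
    )
    return cur.rowcount == 1


def release_lock(conn, pid, hostname):
    """Delete the lock row only if this process owns it."""
    cur = conn.execute(
        "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ? AND hostname = ?",
        (pid, hostname),
    )
    return cur.rowcount == 1
```

In this sketch, staleness is decided only by the heartbeat timestamp, matching the message's point that PID reuse makes PID-based checks unreliable. The PID and hostname serve only to make sure a process refreshes or releases its own lock and not one that another instance took over after expiry.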
2026-04-29 20:10:53 +02:00
parent 336242ad06
commit 187cd8250d
8 changed files with 768 additions and 82 deletions
--- a/backend/app/main.py
+++ b/backend/app/main.py
@@ -59,6 +59,7 @@ from app.routers import (
 from app.startup import startup_shared_resources
 from app.utils.rate_limiter import RateLimiter
 from app.utils.runtime_state import ApplicationState, RuntimeState
+from app.utils.scheduler_lock import release_scheduler_lock
 from app.utils.session_cache import InMemorySessionCache, NoOpSessionCache
 from app.utils.setup_state import is_setup_complete_cached, set_setup_complete_cache

@@ -128,6 +129,9 @@ async def _lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    order on shutdown.  They are stored on ``app.state`` so they are
    accessible to dependency providers and tests.

+    The scheduler lock is released on shutdown to allow other instances to
+    acquire it during rolling deployments or after a crash.
+
    Args:
        app: The :class:`fastapi.FastAPI` instance being started.
    """
@@ -136,9 +140,10 @@ async def _lifespan(app: FastAPI) -> AsyncGenerator[None, None]:

    log.info("bangui_starting_up", database_path=settings.database_path)

-    http_session, scheduler = await startup_shared_resources(app, settings)
+    http_session, scheduler, startup_db = await startup_shared_resources(app, settings)
    app.state.http_session = http_session
    app.state.scheduler = scheduler
+    app.state.startup_db = startup_db

    # Ensure session cache is initialized based on effective settings.
    # This cache is process-local and not cluster-safe. In multi-worker
@@ -158,6 +163,13 @@ async def _lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
        log.info("bangui_shutting_down")
        scheduler.shutdown(wait=False)
        await http_session.close()
+        # Release the scheduler lock to allow other instances to take over
+        try:
+            await release_scheduler_lock(startup_db)
+        except Exception as e:
+            log.error("scheduler_lock_release_failed", error=str(e))
+        finally:
+            await startup_db.close()
        log.info("bangui_shut_down")