Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID reuse is unreliable)
   - More reliable in rolling deployments

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-04-29 20:10:53 +02:00
parent 336242ad06
commit 187cd8250d
8 changed files with 768 additions and 82 deletions

View File

@@ -59,6 +59,7 @@ from app.routers import (
from app.startup import startup_shared_resources
from app.utils.rate_limiter import RateLimiter
from app.utils.runtime_state import ApplicationState, RuntimeState
from app.utils.scheduler_lock import release_scheduler_lock
from app.utils.session_cache import InMemorySessionCache, NoOpSessionCache
from app.utils.setup_state import is_setup_complete_cached, set_setup_complete_cache
@@ -128,6 +129,9 @@ async def _lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
order on shutdown. They are stored on ``app.state`` so they are
accessible to dependency providers and tests.
The scheduler lock is released on shutdown to allow other instances to
acquire it during rolling deployments or after a crash.
Args:
app: The :class:`fastapi.FastAPI` instance being started.
"""
@@ -136,9 +140,10 @@ async def _lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
log.info("bangui_starting_up", database_path=settings.database_path)
http_session, scheduler = await startup_shared_resources(app, settings)
http_session, scheduler, startup_db = await startup_shared_resources(app, settings)
app.state.http_session = http_session
app.state.scheduler = scheduler
app.state.startup_db = startup_db
# Ensure session cache is initialized based on effective settings.
# This cache is process-local and not cluster-safe. In multi-worker
@@ -158,6 +163,13 @@ async def _lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
log.info("bangui_shutting_down")
scheduler.shutdown(wait=False)
await http_session.close()
# Release the scheduler lock to allow other instances to take over
try:
await release_scheduler_lock(startup_db)
except Exception as e:
log.error("scheduler_lock_release_failed", error=str(e))
finally:
await startup_db.close()
log.info("bangui_shut_down")