Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:

1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
- Atomicity: SQLite transactions prevent TOCTOU race windows
- Container-safe: Works across containers with shared DB volumes
- No NFS/SMB edge cases
- Timestamp-based stale detection (PID reuse is unreliable)
- More reliable in rolling deployments
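
To illustrate the atomicity and timestamp-based stale detection points
above, here is a minimal sketch of a singleton-row lock. It is not the code
in app/utils/scheduler_lock.py. The column names, the helper names
try_acquire and heartbeat, and the use of the stdlib sqlite3 module in
autocommit mode are assumptions made for illustration; only the singleton
row (id=1), the stored fields, and the 60 second TTL come from this commit
message.

```python
import os
import socket
import sqlite3
import time
from typing import Optional

LOCK_TTL_SECONDS = 60  # TTL from the commit message; everything else is assumed

# Hypothetical schema: the CHECK constraint allows only the row with id = 1.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    pid INTEGER NOT NULL,
    hostname TEXT NOT NULL,
    created_at REAL NOT NULL,
    heartbeat_at REAL NOT NULL
)
"""


def try_acquire(conn: sqlite3.Connection, now: Optional[float] = None) -> bool:
    """Acquire the lock, or return False if a live holder exists.

    Assumes ``conn`` was opened with ``isolation_level=None`` so that the
    explicit BEGIN/COMMIT statements below are not wrapped by the driver.
    """
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock up front, so the stale-row
    # cleanup and the insert happen as one atomic step.
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Stale detection is purely timestamp-based: drop a row whose
        # heartbeat is older than the TTL.
        conn.execute(
            "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
            (now - LOCK_TTL_SECONDS,),
        )
        conn.execute(
            "INSERT INTO scheduler_lock "
            "(id, pid, hostname, created_at, heartbeat_at) "
            "VALUES (1, ?, ?, ?, ?)",
            (os.getpid(), socket.gethostname(), now, now),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # A non-stale row with id = 1 already exists: someone holds the lock.
        conn.execute("ROLLBACK")
        return False
    except Exception:
        conn.execute("ROLLBACK")
        raise


def heartbeat(conn: sqlite3.Connection, now: Optional[float] = None) -> bool:
    """Refresh the heartbeat; return False if this process is not the holder."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scheduler_lock SET heartbeat_at = ? "
        "WHERE id = 1 AND pid = ? AND hostname = ?",
        (now, os.getpid(), socket.gethostname()),
    )
    return cur.rowcount == 1
```

In this sketch a second caller fails on the primary-key conflict rather
than on a separate "does a row exist?" query, which is what closes the
check-then-act window. A crashed holder stops refreshing heartbeat_at, so
the next caller deletes the expired row and takes over.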

Benefits:
- Works with any process manager (uvicorn, gunicorn, etc.)
- Handles simultaneous startup attempts correctly
- Automatic failover on instance crash (stale lock cleanup)
- Clear error messages with troubleshooting steps
- No environment variable required (lock is authoritative)
- Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -42,11 +42,16 @@ from app.tasks import (
     health_check,
     history_sync,
     rate_limiter_cleanup,
+    scheduler_lock_heartbeat,
     session_cleanup,
 )
 from app.utils.async_utils import run_blocking
 from app.utils.jail_config import ensure_jail_configs
 from app.utils.runtime_state import set_runtime_settings
+from app.utils.scheduler_lock import (
+    acquire_scheduler_lock,
+    release_scheduler_lock,
+)
 from app.utils.setup_state import set_setup_complete_cache
 
 if TYPE_CHECKING:
@@ -58,22 +63,19 @@ log: structlog.stdlib.BoundLogger = structlog.get_logger()
 
 
 def _check_single_worker_mode() -> None:
-    """Verify that the application is running with a single worker.
+    """Fast check: verify BANGUI_WORKERS environment variable if set.
 
-    APScheduler's AsyncIOScheduler is bound to a single asyncio event loop
-    and cannot be safely shared across multiple worker processes. If each
-    worker starts its own scheduler instance, all background jobs execute N
-    times (where N is the number of workers), resulting in duplicate blocklist
-    imports, duplicate ban operations, duplicate history writes, and SQLite
-    lock contention.
+    This is the first-line guard: if BANGUI_WORKERS is explicitly set to a
+    value > 1, reject immediately without requiring database access. This
+    catches obvious misconfiguration early.
 
-    This function detects multi-worker configurations and raises a clear
-    RuntimeError with instructions.
+    The authoritative check is the database-backed lock acquired in
+    _stage_check_worker_mode_and_acquire_lock(), which handles the general
+    case where multiple instances start without proper environment setup.
 
     Raises:
-        RuntimeError: If the app would run with multiple workers.
+        RuntimeError: If BANGUI_WORKERS is explicitly set to > 1.
     """
-    # Check for explicit worker count env var (convention used in deployment)
     workers_env = os.environ.get("BANGUI_WORKERS")
     if workers_env is not None:
         try:
@@ -124,7 +126,7 @@ def _create_http_session(settings: Settings) -> aiohttp.ClientSession:
 async def startup_shared_resources(
     app: FastAPI,
     settings: Settings,
-) -> tuple[aiohttp.ClientSession, AsyncIOScheduler]:
+) -> tuple[aiohttp.ClientSession, AsyncIOScheduler, Any]:
     """Create shared resources needed during the application lifespan.
 
     This function orchestrates the entire startup sequence through a StartupDAG,
@@ -133,8 +135,8 @@ async def startup_shared_resources(
     rolled back.
 
     The startup stages are:
-    1. WORKER_MODE: Validate single-worker configuration
-    2. DATABASE: Initialize database and load setup state
+    1. DATABASE: Initialize database and load setup state
+    2. WORKER_MODE: Validate single-worker configuration and acquire scheduler lock
     3. GEO_CACHE: Load IP geolocation cache
     4. HTTP_SESSION: Create shared aiohttp session
     5. SCHEDULER: Create and start APScheduler
@@ -145,7 +147,7 @@ async def startup_shared_resources(
         settings: Resolved application settings.
 
     Returns:
-        A tuple of ``(http_session, scheduler)``.
+        A tuple of ``(http_session, scheduler, startup_db)``.
 
     Raises:
         RuntimeError: If any startup stage fails or prerequisites are not met.
@@ -153,20 +155,21 @@ async def startup_shared_resources(
     dag = StartupDAG()
 
     # Register all startup stages with their dependencies.
-    dag.register_stage(
-        StartupStage.WORKER_MODE,
-        "Verify single-worker mode (scheduler must not run in multiple workers)",
-        prerequisites=frozenset(),
-    )
+    # NOTE: DATABASE stage must come before WORKER_MODE for lock acquisition
     dag.register_stage(
         StartupStage.DATABASE,
         "Initialize database schema and load setup state",
-        prerequisites=frozenset([StartupStage.WORKER_MODE]),
+        prerequisites=frozenset(),
     )
+    dag.register_stage(
+        StartupStage.WORKER_MODE,
+        "Verify single-worker mode and acquire scheduler lock",
+        prerequisites=frozenset([StartupStage.DATABASE]),
+    )
     dag.register_stage(
         StartupStage.GEO_CACHE,
         "Load IP geolocation cache from database",
-        prerequisites=frozenset([StartupStage.DATABASE]),
+        prerequisites=frozenset([StartupStage.WORKER_MODE]),
     )
     dag.register_stage(
         StartupStage.HTTP_SESSION,
@@ -185,18 +188,18 @@ async def startup_shared_resources(
     )
 
     try:
-        # Stage 1: Validate single-worker mode
-        await dag.execute_stage(
-            StartupStage.WORKER_MODE,
-            _stage_check_worker_mode,
-        )
-
-        # Stage 2: Initialize database
+        # Stage 1: Initialize database (must come first for lock acquisition)
         startup_db = await dag.execute_stage(
             StartupStage.DATABASE,
             lambda: _stage_init_database(app, settings),
         )
 
+        # Stage 2: Validate single-worker mode and acquire scheduler lock
+        await dag.execute_stage(
+            StartupStage.WORKER_MODE,
+            lambda: _stage_check_worker_mode_and_acquire_lock(startup_db),
+        )
+
         # Stage 3: Load GeoCache
         geo_cache = await dag.execute_stage(
             StartupStage.GEO_CACHE,
@@ -233,7 +236,7 @@ async def startup_shared_resources(
             stages=len(dag.context.completed_stages),
         )
 
-        return http_session, scheduler
+        return http_session, scheduler, startup_db
 
     except Exception:
         # Clean up on failure
@@ -246,13 +249,42 @@ async def startup_shared_resources(
         raise
 
 
-async def _stage_check_worker_mode() -> None:
-    """Check that the application is running with a single worker.
+async def _stage_check_worker_mode_and_acquire_lock(startup_db: Any) -> None:
+    """Check single-worker mode and acquire the scheduler lock.
 
-    This is stage 1 of the startup DAG.
+    This is stage 2 of the startup DAG. It performs two checks:
+    1. Fast check: Verify BANGUI_WORKERS env var if explicitly set
+    2. Authoritative check: Acquire database-backed scheduler lock
+
+    The database lock ensures that only one instance runs the scheduler, even
+    in container orchestration scenarios where multiple instances may start
+    simultaneously. This prevents duplicate background jobs, duplicate history
+    entries, and SQLite lock contention.
+
+    Args:
+        startup_db: The initialized database connection.
+
+    Raises:
+        RuntimeError: If the env var check fails or the scheduler lock cannot
+            be acquired (another instance is running the scheduler).
     """
+    # Fast check: verify BANGUI_WORKERS if explicitly set
     _check_single_worker_mode()
 
+    # Authoritative check: acquire the database-backed lock
+    if not await acquire_scheduler_lock(startup_db):
+        raise RuntimeError(
+            "Could not acquire scheduler lock. Another BanGUI instance is already running the scheduler.\n"
+            "This prevents duplicate background jobs (blocklist imports, history sync, etc.).\n"
+            "\n"
+            "To recover from a stale lock (e.g., after a crash):\n"
+            "  1. Verify no other BanGUI instances are running\n"
+            "  2. Inspect the lock: sqlite3 bangui.db 'SELECT * FROM scheduler_lock;'\n"
+            "  3. If stale, clean it: sqlite3 bangui.db 'DELETE FROM scheduler_lock;'\n"
+            "\n"
+            "See Architekture.md § Deployment Constraints for details."
+        )
+
 
 async def _stage_init_database(app: FastAPI, settings: Settings) -> Any:
     """Initialize database schema and load setup state.
@@ -389,6 +421,7 @@ async def _stage_register_tasks(app: FastAPI, scheduler: AsyncIOScheduler) -> No
     """Register all background jobs.
 
     This is stage 6 of the startup DAG. It registers:
+    - scheduler_lock_heartbeat: Periodic update of scheduler lock (keeps it alive)
     - health_check: Periodic fail2ban connectivity probe
     - blocklist_import: Scheduled blocklist download and application
     - geo_cache_cleanup: Periodic purge of stale geo cache entries
@@ -402,6 +435,7 @@ async def _stage_register_tasks(app: FastAPI, scheduler: AsyncIOScheduler) -> No
         app: The FastAPI application instance.
         scheduler: The APScheduler scheduler to register tasks with.
     """
+    scheduler_lock_heartbeat.register(app)
     health_check.register(app)
     await blocklist_import.register(app)
     geo_cache_cleanup.register(app)
@@ -411,4 +445,4 @@ async def _stage_register_tasks(app: FastAPI, scheduler: AsyncIOScheduler) -> No
     session_cleanup.register(app)
     rate_limiter_cleanup.register(app)
 
-    log.info("startup_tasks_registered", count=8)
+    log.info("startup_tasks_registered", count=9)