BanGUI/backend/app/utils/scheduler_lock.py
Lukas 187cd8250d Implement database-backed scheduler lock for multi-worker safety
Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents a live lock from being mistaken for stale during temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)
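
The heartbeat task from item 5 can be pictured as a small asyncio loop. The
following is a minimal sketch, not the code in
app/tasks/scheduler_lock_heartbeat.py: heartbeat_loop, fake_update and the stop
event are hypothetical names, and fake_update stands in for
update_scheduler_lock_heartbeat().

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat_loop(
    update_heartbeat: Callable[[], Awaitable[bool]],
    interval_seconds: float,
    stop_event: asyncio.Event,
) -> int:
    """Refresh the lock every interval until stopped or the lock is lost.

    Returns the number of successful heartbeats.
    """
    beats = 0
    while not stop_event.is_set():
        if not await update_heartbeat():
            # The lock is no longer ours, so stop refreshing it.
            break
        beats += 1
        try:
            # Sleep for one interval, waking early if shutdown is requested.
            await asyncio.wait_for(stop_event.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass
    return beats


async def demo() -> int:
    # Hypothetical stand-in: succeeds three times, then reports the lock as lost.
    results = iter([True, True, True, False])

    async def fake_update() -> bool:
        return next(results)

    return await heartbeat_loop(fake_update, 0.01, asyncio.Event())


beats = asyncio.run(demo())
```

In a real deployment the interval would presumably be the 10 second value
named above, and a False result would be a signal to stop running scheduled
jobs in this process.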

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID checks are unreliable because PIDs get reused)
   - More reliable in rolling deployments
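
The atomicity argument can be tried out with the standard library alone. This
is a sketch under assumptions: it uses the synchronous sqlite3 module rather
than aiosqlite, and try_insert_lock, the temporary database path and the two
instance names are made up for illustration. Two connections to the same file
stand in for two instances starting at once.

```python
import os
import sqlite3
import tempfile
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    pid INTEGER NOT NULL,
    hostname TEXT NOT NULL,
    created_at REAL NOT NULL,
    heartbeat_at REAL NOT NULL
)
"""


def try_insert_lock(conn: sqlite3.Connection, pid: int, hostname: str) -> bool:
    """Return True if this connection inserted the singleton row."""
    now = time.time()
    try:
        conn.execute(
            "INSERT INTO scheduler_lock (id, pid, hostname, created_at, heartbeat_at) "
            "VALUES (1, ?, ?, ?, ?)",
            (pid, hostname, now, now),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The row already exists: roll back so no write transaction stays open.
        conn.rollback()
        return False


# Two independent connections to one database file stand in for two instances.
db_path = os.path.join(tempfile.mkdtemp(), "lock_demo.db")
first = sqlite3.connect(db_path)
second = sqlite3.connect(db_path)
first.execute(SCHEMA)
first.commit()

first_won = try_insert_lock(first, pid=101, hostname="instance-a")
second_won = try_insert_lock(second, pid=202, hostname="instance-b")
holder = second.execute("SELECT pid, hostname FROM scheduler_lock").fetchone()

first.close()
second.close()
```

Only the first INSERT succeeds. The second connection gets an IntegrityError
from the primary key on id, so there is no separate check-then-create step for
a race to slip into.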

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store
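
The failover behaviour can be sketched by passing the clock in as a parameter
instead of calling time.time(). Everything here is illustrative: acquire, the
in-memory database and the timestamps are assumptions for the sketch, and the
real function is acquire_scheduler_lock() on an aiosqlite connection.

```python
import sqlite3

TTL_SECONDS = 60  # mirrors the 60 second expiry described in the commit message


def acquire(conn: sqlite3.Connection, pid: int, hostname: str, now: float) -> bool:
    """Drop the lock if its heartbeat is older than the TTL, then try to take it."""
    try:
        conn.execute(
            "DELETE FROM scheduler_lock WHERE (? - heartbeat_at) > ?",
            (now, TTL_SECONDS),
        )
        conn.execute(
            "INSERT INTO scheduler_lock (id, pid, hostname, created_at, heartbeat_at) "
            "VALUES (1, ?, ?, ?, ?)",
            (pid, hostname, now, now),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # A live lock exists: undo the pending transaction and report failure.
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduler_lock ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), pid INTEGER NOT NULL, "
    "hostname TEXT NOT NULL, created_at REAL NOT NULL, heartbeat_at REAL NOT NULL)"
)

t0 = 1_000.0
# instance-a takes the lock, then "crashes" and never sends another heartbeat.
took_lock = acquire(conn, 101, "instance-a", now=t0)
# 30 seconds later the heartbeat is still within the TTL, so instance-b is refused.
too_early = acquire(conn, 202, "instance-b", now=t0 + 30)
# 61 seconds later the lock counts as stale and instance-b takes over.
after_ttl = acquire(conn, 202, "instance-b", now=t0 + 61)
holder_pid = conn.execute("SELECT pid FROM scheduler_lock").fetchone()[0]
```

With a 60 second TTL and a 10 second heartbeat, a healthy holder would have to
miss several consecutive heartbeats before another instance can take over.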

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:10:53 +02:00


"""Database-backed scheduler lock for single-executor enforcement.

This module implements a database-backed lock mechanism that ensures only one
BanGUI instance runs the background scheduler, even in container orchestration
environments where multiple instances might start simultaneously.

The lock uses atomic database operations to prevent race conditions:

- Lock acquisition is atomic: INSERT fails if the singleton row already exists
- Lock release is atomic: DELETE with PID check ensures only the owner releases
- Stale lock detection uses heartbeat timestamps: a lock whose heartbeat is
  older than the TTL is considered abandoned and eligible for cleanup on the
  next startup

This approach is more reliable than filesystem-based locking in containerized
environments because:

1. Database transactions are atomic (no TOCTOU race windows)
2. No NFS/network filesystem edge cases
3. Stale lock detection is timestamp-based, not PID-based (PIDs get reused, so
   a PID check is unreliable)
4. Works across container restarts and rolling deployments

The lock record stores:

- id: Always 1 (singleton table)
- pid: Process ID of the lock holder
- hostname: Container/host name for debugging
- created_at: When the lock was first acquired
- heartbeat_at: When the lock was last confirmed alive (updated periodically)

On startup:

1. Clean up any stale lock (where now - heartbeat_at > TTL)
2. Try to insert the lock for this instance
3. If INSERT succeeds, the lock is acquired
4. If INSERT fails (IntegrityError), another instance holds the lock

While running (periodic):

- Update heartbeat_at to keep the lock alive, so a live lock is not mistaken
  for a stale one

On shutdown:

- Delete the lock (this instance is no longer running the scheduler)
"""

from __future__ import annotations

import os
import socket
import time
from typing import Any

import aiosqlite
import structlog

log: structlog.stdlib.BoundLogger = structlog.get_logger()

# Lock record expires if heartbeat hasn't been updated for this many seconds.
# This prevents stale locks from a crashed instance from blocking new startups.
SCHEDULER_LOCK_TTL_SECONDS: int = 60

# Heartbeat interval: how often to update the lock's heartbeat_at timestamp.
# Must be less than TTL to prevent premature expiration.
SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS: int = 10


async def init_scheduler_lock_table(db: aiosqlite.Connection) -> None:
    """Create the scheduler_lock table if it doesn't exist.

    This is called during database schema initialization and is safe to call
    multiple times (CREATE TABLE IF NOT EXISTS is idempotent).

    Args:
        db: The SQLite database connection.
    """
    await db.execute(
        """
        CREATE TABLE IF NOT EXISTS scheduler_lock (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            pid INTEGER NOT NULL,
            hostname TEXT NOT NULL,
            created_at REAL NOT NULL,
            heartbeat_at REAL NOT NULL
        );
        """
    )
    await db.commit()


async def acquire_scheduler_lock(db: aiosqlite.Connection) -> bool:
    """Try to acquire the scheduler lock.

    This function performs two operations:

    1. Clean up any stale locks (where heartbeat_at + TTL < now)
    2. Try to insert a lock record for this instance

    If another instance already holds a valid lock, the INSERT will fail and
    this function returns False. The caller should reject startup with a clear
    error message.

    Args:
        db: The SQLite database connection.

    Returns:
        True if the lock was successfully acquired, False if held by another instance.

    Raises:
        RuntimeError: If database operations fail for reasons other than the lock
            being held (e.g., database is corrupted or inaccessible).
    """
    now = time.time()
    pid = os.getpid()
    hostname = socket.gethostname()
    try:
        # Clean up stale locks first
        await db.execute(
            """
            DELETE FROM scheduler_lock
            WHERE (? - heartbeat_at) > ?
            """,
            (now, SCHEDULER_LOCK_TTL_SECONDS),
        )
        # Try to acquire the lock (atomic: INSERT fails if row exists)
        await db.execute(
            """
            INSERT INTO scheduler_lock (id, pid, hostname, created_at, heartbeat_at)
            VALUES (1, ?, ?, ?, ?)
            """,
            (pid, hostname, now, now),
        )
        await db.commit()
        log.info(
            "scheduler_lock_acquired",
            pid=pid,
            hostname=hostname,
        )
        return True
    except aiosqlite.IntegrityError:
        # Lock is already held by another instance (INSERT violated the PRIMARY KEY
        # constraint). The DELETE above opened an implicit transaction, so roll it
        # back; otherwise this connection keeps a write transaction open.
        await db.rollback()
        # Log details about who holds the lock to help with debugging
        try:
            cursor = await db.execute(
                "SELECT pid, hostname, created_at, heartbeat_at FROM scheduler_lock WHERE id = 1"
            )
            row = await cursor.fetchone()
            if row:
                lock_pid, lock_hostname, lock_created, lock_heartbeat = row
                age_seconds = now - lock_created
                heartbeat_age = now - lock_heartbeat
                log.warning(
                    "scheduler_lock_held_by_other_instance",
                    our_pid=pid,
                    lock_pid=lock_pid,
                    lock_hostname=lock_hostname,
                    lock_age_seconds=age_seconds,
                    heartbeat_age_seconds=heartbeat_age,
                )
        except Exception as e:
            log.warning("scheduler_lock_held_but_could_not_read_holder", error=str(e))
        return False
    except Exception as e:
        # Unexpected database error (not an IntegrityError)
        raise RuntimeError(
            f"Failed to acquire scheduler lock due to database error: {e}\n"
            "Check that the database is accessible and not corrupted."
        ) from e


async def release_scheduler_lock(db: aiosqlite.Connection) -> None:
    """Release the scheduler lock.

    This function should be called during application shutdown. It removes the
    lock record, allowing other instances to acquire it.

    Args:
        db: The SQLite database connection.

    Raises:
        RuntimeError: If database operations fail.
    """
    pid = os.getpid()
    try:
        cursor = await db.execute(
            "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ?",
            (pid,),
        )
        await db.commit()
        if cursor.rowcount == 0:
            # This shouldn't happen in normal operation, but log it for visibility
            log.warning(
                "scheduler_lock_release_mismatch",
                our_pid=pid,
                message="Tried to release lock but we don't hold it. Another instance may have replaced us.",
            )
        else:
            log.info("scheduler_lock_released", pid=pid)
    except Exception as e:
        raise RuntimeError(f"Failed to release scheduler lock: {e}") from e


async def update_scheduler_lock_heartbeat(db: aiosqlite.Connection) -> bool:
    """Update the heartbeat timestamp to keep the lock alive.

    This function should be called periodically (every ~10 seconds) to prevent
    the lock from being considered stale. It only succeeds if this process
    still holds the lock.

    Args:
        db: The SQLite database connection.

    Returns:
        True if the heartbeat was updated (we still hold the lock), False if
        we no longer hold the lock (another instance has taken over).

    Raises:
        RuntimeError: If database operations fail.
    """
    now = time.time()
    pid = os.getpid()
    try:
        cursor = await db.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? WHERE id = 1 AND pid = ?",
            (now, pid),
        )
        await db.commit()
        if cursor.rowcount == 0:
            # We no longer hold the lock
            log.warning(
                "scheduler_lock_heartbeat_lost",
                our_pid=pid,
                message="Heartbeat failed; we no longer hold the lock.",
            )
            return False
        return True
    except Exception as e:
        raise RuntimeError(f"Failed to update scheduler lock heartbeat: {e}") from e


async def get_scheduler_lock_info(db: aiosqlite.Connection) -> dict[str, Any] | None:
    """Retrieve information about the current scheduler lock.

    This function is useful for debugging and monitoring. Returns None if no
    lock is currently held.

    Args:
        db: The SQLite database connection.

    Returns:
        A dict with keys: pid, hostname, created_at, heartbeat_at, or None
        if no lock exists.
    """
    try:
        cursor = await db.execute(
            "SELECT pid, hostname, created_at, heartbeat_at FROM scheduler_lock WHERE id = 1"
        )
        row = await cursor.fetchone()
        if row:
            pid, hostname, created_at, heartbeat_at = row
            return {
                "pid": pid,
                "hostname": hostname,
                "created_at": created_at,
                "heartbeat_at": heartbeat_at,
            }
        return None
    except Exception as e:
        log.warning("scheduler_lock_info_query_failed", error=str(e))
        return None