Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:

1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)
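
The two-tier check from item 4 could look roughly like the sketch below. It is
an illustration under stated assumptions, not the code in app/startup.py:
check_single_executor, MultiWorkerError, and the try_acquire_lock callable are
hypothetical names, and the callable stands in for a wrapper around
acquire_scheduler_lock(). Only the BANGUI_WORKERS variable is taken from this
change.

```python
from collections.abc import Callable, Mapping


class MultiWorkerError(RuntimeError):
    """Hypothetical error raised when a second scheduler executor would start."""


def check_single_executor(
    env: Mapping[str, str],
    try_acquire_lock: Callable[[], bool],
) -> None:
    """Raise MultiWorkerError unless this process may run the scheduler."""
    # Tier 1 (fast): reject an explicit BANGUI_WORKERS > 1 without touching the DB.
    raw = env.get("BANGUI_WORKERS")
    if raw is not None:
        try:
            workers = int(raw)
        except ValueError as exc:
            raise MultiWorkerError(
                f"BANGUI_WORKERS must be an integer, got {raw!r}"
            ) from exc
        if workers > 1:
            raise MultiWorkerError(
                f"BANGUI_WORKERS={workers}: the scheduler needs a single worker"
            )

    # Tier 2 (authoritative): the database lock also catches launches where
    # the variable is unset or does not match the real worker count.
    if not try_acquire_lock():
        raise MultiWorkerError("another instance already holds the scheduler lock")
```

Running the cheap environment check first means an obviously wrong
configuration fails before any lock row is written; the lock check still runs
whenever the variable is absent or set to 1.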

Why database-backed instead of filesystem?

- Atomicity: SQLite transactions prevent TOCTOU race windows
- Container-safe: Works across containers with shared DB volumes
- No NFS/SMB edge cases
- Timestamp-based stale detection (PID checks are unreliable because PIDs get reused)
- More reliable in rolling deployments
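
The atomicity point can be checked with the standard library alone. The sketch
below is an illustration, not project code: it uses the synchronous sqlite3
module instead of aiosqlite, trims the table to two columns, and try_insert and
demo are made-up names. Two connections to one database file stand in for two
instances that share a DB volume.

```python
import os
import sqlite3
import tempfile


def try_insert(conn: sqlite3.Connection, pid: int) -> bool:
    """Return True if this connection created the singleton row."""
    try:
        conn.execute("INSERT INTO scheduler_lock (id, pid) VALUES (1, ?)", (pid,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The row with id = 1 already exists; end the implicit transaction.
        conn.rollback()
        return False


def demo() -> tuple[bool, bool]:
    """Let two connections compete for the lock row; report who won."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lock_demo.db")
        first = sqlite3.connect(path)
        second = sqlite3.connect(path)
        try:
            first.execute(
                "CREATE TABLE scheduler_lock ("
                "id INTEGER PRIMARY KEY CHECK (id = 1), pid INTEGER NOT NULL)"
            )
            first.commit()
            return try_insert(first, 101), try_insert(second, 202)
        finally:
            first.close()
            second.close()


print(demo())  # (True, False)
```

There is no separate "does a lock exist?" read before the write, so there is
no window between check and use: the INSERT itself is the check.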

Benefits:

- Works with any process manager (uvicorn, gunicorn, etc.)
- Handles simultaneous startup attempts correctly
- Automatic failover on instance crash (stale lock cleanup)
- Clear error messages with troubleshooting steps
- No environment variable required (lock is authoritative)
- Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
276 lines
9.0 KiB
Python
"""Database-based scheduler lock for single-executor enforcement.

This module implements a database-backed lock mechanism that ensures only one
BanGUI instance runs the background scheduler, even in container orchestration
environments where multiple instances might start simultaneously.

The lock uses atomic database operations to prevent race conditions:
- Lock acquisition is atomic: INSERT fails if the singleton row already exists
- Lock release is atomic: DELETE with PID check ensures only the owner releases
- Stale lock detection uses heartbeat timestamps: a lock whose heartbeat is
  older than the TTL is considered abandoned and eligible for cleanup on the
  next startup

This approach is more reliable than filesystem-based locking in containerized
environments because:
1. Database transactions are atomic (no TOCTOU race windows)
2. No NFS/network filesystem edge cases
3. Stale lock detection is timestamp-based, not PID-based (PID reuse is unreliable)
4. Works across container restarts and rolling deployments

The lock record stores:
- id: Always 1 (singleton table)
- pid: Process ID of the lock holder
- hostname: Container/host name for debugging
- created_at: When the lock was first acquired
- heartbeat_at: When the lock was last confirmed alive (updated periodically)

On startup:
1. Clean up any stale locks (where now - heartbeat_at > TTL)
2. Try to insert the lock for this instance
3. If INSERT succeeds, lock is acquired
4. If INSERT fails (IntegrityError), another instance holds the lock

While running (periodic):
- Update heartbeat_at to keep the lock alive and prevent false positives

On shutdown:
- Delete the lock (this instance is no longer running the scheduler)
"""

from __future__ import annotations

import os
import socket
import time
from typing import Any

import aiosqlite
import structlog

log: structlog.stdlib.BoundLogger = structlog.get_logger()

# Lock record expires if heartbeat hasn't been updated for this many seconds.
# This prevents stale locks from a crashed instance from blocking new startups.
SCHEDULER_LOCK_TTL_SECONDS: int = 60

# Heartbeat interval: how often to update the lock's heartbeat_at timestamp.
# Must be less than TTL to prevent premature expiration.
SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS: int = 10


async def init_scheduler_lock_table(db: aiosqlite.Connection) -> None:
    """Create the scheduler_lock table if it doesn't exist.

    This is called during database schema initialization and is safe to call
    multiple times (CREATE TABLE IF NOT EXISTS is idempotent).

    Args:
        db: The SQLite database connection.
    """
    await db.execute(
        """
        CREATE TABLE IF NOT EXISTS scheduler_lock (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            pid INTEGER NOT NULL,
            hostname TEXT NOT NULL,
            created_at REAL NOT NULL,
            heartbeat_at REAL NOT NULL
        );
        """
    )
    await db.commit()


async def acquire_scheduler_lock(db: aiosqlite.Connection) -> bool:
    """Try to acquire the scheduler lock.

    This function performs two operations:
    1. Clean up any stale locks (where heartbeat_at + TTL < now)
    2. Try to insert a lock record for this instance

    If another instance already holds a valid lock, the INSERT will fail and
    this function returns False. The caller should reject startup with a clear
    error message.

    Args:
        db: The SQLite database connection.

    Returns:
        True if the lock was successfully acquired, False if held by another instance.

    Raises:
        RuntimeError: If database operations fail for reasons other than the lock
            being held (e.g., database is corrupted or inaccessible).
    """
    now = time.time()
    pid = os.getpid()
    hostname = socket.gethostname()

    try:
        # Clean up stale locks first
        await db.execute(
            """
            DELETE FROM scheduler_lock
            WHERE (? - heartbeat_at) > ?
            """,
            (now, SCHEDULER_LOCK_TTL_SECONDS),
        )

        # Try to acquire the lock (atomic: INSERT fails if row exists)
        await db.execute(
            """
            INSERT INTO scheduler_lock (id, pid, hostname, created_at, heartbeat_at)
            VALUES (1, ?, ?, ?, ?)
            """,
            (pid, hostname, now, now),
        )
        await db.commit()

        log.info(
            "scheduler_lock_acquired",
            pid=pid,
            hostname=hostname,
        )
        return True

    except aiosqlite.IntegrityError:
        # Lock is already held by another instance (INSERT failed due to UNIQUE constraint)
        # Log details about who holds the lock to help with debugging
        try:
            # End the implicit transaction opened by the DELETE above so this
            # connection does not keep a pending write on the database.
            await db.rollback()
            cursor = await db.execute(
                "SELECT pid, hostname, created_at, heartbeat_at FROM scheduler_lock WHERE id = 1"
            )
            row = await cursor.fetchone()
            if row:
                lock_pid, lock_hostname, lock_created, lock_heartbeat = row
                age_seconds = now - lock_created
                heartbeat_age = now - lock_heartbeat
                log.warning(
                    "scheduler_lock_held_by_other_instance",
                    our_pid=pid,
                    lock_pid=lock_pid,
                    lock_hostname=lock_hostname,
                    lock_age_seconds=age_seconds,
                    heartbeat_age_seconds=heartbeat_age,
                )
        except Exception as e:
            log.warning("scheduler_lock_held_but_could_not_read_holder", error=str(e))

        return False

    except Exception as e:
        # Unexpected database error (not an IntegrityError)
        raise RuntimeError(
            f"Failed to acquire scheduler lock due to database error: {e}\n"
            "Check that the database is accessible and not corrupted."
        ) from e


async def release_scheduler_lock(db: aiosqlite.Connection) -> None:
    """Release the scheduler lock.

    This function should be called during application shutdown. It removes the
    lock record, allowing other instances to acquire it.

    Args:
        db: The SQLite database connection.

    Raises:
        RuntimeError: If database operations fail.
    """
    pid = os.getpid()

    try:
        cursor = await db.execute(
            "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ?",
            (pid,),
        )
        await db.commit()

        if cursor.rowcount == 0:
            # This shouldn't happen in normal operation, but log it for visibility
            log.warning(
                "scheduler_lock_release_mismatch",
                our_pid=pid,
                message="Tried to release lock but we don't hold it. Another instance may have replaced us.",
            )
        else:
            log.info("scheduler_lock_released", pid=pid)

    except Exception as e:
        raise RuntimeError(f"Failed to release scheduler lock: {e}") from e


async def update_scheduler_lock_heartbeat(db: aiosqlite.Connection) -> bool:
    """Update the heartbeat timestamp to keep the lock alive.

    This function should be called periodically (every ~10 seconds) to prevent
    the lock from being considered stale. It only succeeds if this process
    still holds the lock.

    Args:
        db: The SQLite database connection.

    Returns:
        True if the heartbeat was updated (we still hold the lock), False if
        we no longer hold the lock (another instance has taken over).

    Raises:
        RuntimeError: If database operations fail.
    """
    now = time.time()
    pid = os.getpid()

    try:
        cursor = await db.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? WHERE id = 1 AND pid = ?",
            (now, pid),
        )
        await db.commit()

        if cursor.rowcount == 0:
            # We no longer hold the lock
            log.warning(
                "scheduler_lock_heartbeat_lost",
                our_pid=pid,
                message="Heartbeat failed; we no longer hold the lock.",
            )
            return False

        return True

    except Exception as e:
        raise RuntimeError(f"Failed to update scheduler lock heartbeat: {e}") from e


async def get_scheduler_lock_info(db: aiosqlite.Connection) -> dict[str, Any] | None:
    """Retrieve information about the current scheduler lock.

    This function is useful for debugging and monitoring. Returns None if no
    lock is currently held.

    Args:
        db: The SQLite database connection.

    Returns:
        A dict with keys: pid, hostname, created_at, heartbeat_at, or None
        if no lock exists.
    """
    try:
        cursor = await db.execute(
            "SELECT pid, hostname, created_at, heartbeat_at FROM scheduler_lock WHERE id = 1"
        )
        row = await cursor.fetchone()
        if row:
            pid, hostname, created_at, heartbeat_at = row
            return {
                "pid": pid,
                "hostname": hostname,
                "created_at": created_at,
                "heartbeat_at": heartbeat_at,
            }
        return None
    except Exception as e:
        log.warning("scheduler_lock_info_query_failed", error=str(e))
        return None
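
The module leaves the periodic call to update_scheduler_lock_heartbeat() to a
separate task (app/tasks/scheduler_lock_heartbeat.py, not shown here). One way
such a loop might be structured is sketched below. Everything in it is an
assumption made for illustration: heartbeat_loop, its parameters, and the stop
event are hypothetical names, and the update_heartbeat callable stands in for
a wrapper that calls update_scheduler_lock_heartbeat(db).

```python
import asyncio
from collections.abc import Awaitable, Callable


async def heartbeat_loop(
    update_heartbeat: Callable[[], Awaitable[bool]],
    interval_seconds: float,
    stop: asyncio.Event,
) -> bool:
    """Refresh the lock until stopped.

    Returns True if the loop ended because `stop` was set, and False if a
    heartbeat reported that this process no longer holds the lock.
    """
    while not stop.is_set():
        if not await update_heartbeat():
            # Lock lost: the caller should stop running scheduled jobs.
            return False
        try:
            # Sleep for one interval, but wake up early on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            continue
    return True
```

With the module's values (a heartbeat every 10 seconds against a 60 second
TTL), several consecutive heartbeats can be missed before another instance
would treat the lock as stale. Waiting on the event instead of a plain sleep
lets shutdown proceed without waiting out a full interval.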