# ADR-005: Single-Instance Scheduler Enforcement

## Status

Accepted

## Context

APScheduler's `AsyncIOScheduler` is bound to a single asyncio event loop. Running multiple scheduler instances leads to duplicate jobs, database lock contention, and undefined behaviour.

## Decision

Enforce exactly **one scheduler instance** across the entire application lifecycle. The mechanism is a database-level lock that every process opening the same SQLite database must acquire.

## Mechanism

### 1. Startup gate: `BANGUI_WORKERS=1`

The Docker Compose file sets `BANGUI_WORKERS=1`, and the startup DAG validates this variable. If the variable is unset or set to anything other than `1`, startup aborts with a clear error message.

### 2. Runtime lock: `scheduler_lock` table

During startup, after opening the SQLite database, the application attempts:

```sql
-- Acquire the lock, or take it over if the current holder's heartbeat is stale
-- (at least 30 seconds old). A recent heartbeat makes the statement a no-op.
-- unixepoch() requires SQLite 3.38.0+; UPSERT itself requires 3.24.0+.
INSERT INTO scheduler_lock (lock_name, heartbeat_at)
VALUES ('scheduler', unixepoch())
ON CONFLICT(lock_name) DO UPDATE
    SET heartbeat_at = unixepoch()
    WHERE (unixepoch() - heartbeat_at) >= 30;
```

- **Lock acquired:** the statement changes a row, so `changes()` returns 1. This happens on a fresh insert or on a takeover of an expired lock. This instance holds the lock and starts the scheduler.
- **Lock held elsewhere:** the statement is a no-op because the heartbeat is recent and the `WHERE` clause filters out the update. Another instance holds the lock, and startup continues without starting the scheduler.
- **Heartbeat:** a background task (`scheduler_lock_heartbeat`) refreshes the holder's heartbeat every 10 seconds.
- **Crash recovery:** if the process crashes, the heartbeat stops. The lock expires 30 seconds after the last refresh, after which the next instance to start can acquire it. A restart within that window finds the lock still held.

### 3. Deployment topology

| Deployment | Behaviour |
|---|---|
| Single container | Scheduler runs normally |
| Single Pod (Kubernetes) | Scheduler runs normally |
| Accidental multi-process restart | Second process starts without a scheduler; first continues |
| Intentional multi-worker | Not supported; requires an external job store (future) |

## Rationale

### Why this approach?

- **No external coordination service:** No ZooKeeper, etcd, or Redis is needed. The existing SQLite database is reused.
- **Atomic:** SQLite runs the `INSERT ... ON CONFLICT` upsert as a single statement under its database write lock. Two instances therefore cannot both acquire the lock.
- **Self-healing and crash-safe:** The heartbeat-based TTL means a crashed instance's lock expires on its own. Stale locks are never held indefinitely, and no manual cleanup is required.

## Consequences

- `BANGUI_WORKERS` must always be `1`. This is documented and enforced at startup.
- Future multi-worker deployments require migration to a persistent job store, such as PostgreSQL with APScheduler's SQLAlchemy job store, or Redis. Note that APScheduler 3.x does not support several scheduler processes sharing one job store, so a persistent store alone is not sufficient on that version.
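The startup gate described under *Mechanism* amounts to a strict equality check on one environment variable. The sketch below assumes the check runs as one step of the startup DAG. `check_single_worker` and `StartupError` are hypothetical names, and the real error message may differ.

```python
import os
from typing import Mapping, Optional


class StartupError(RuntimeError):
    """Hypothetical error type raised when a startup precondition fails."""


def check_single_worker(env: Optional[Mapping[str, str]] = None) -> None:
    """Abort startup unless BANGUI_WORKERS is exactly "1" (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get("BANGUI_WORKERS")
    # An unset variable is treated the same as a wrong value: startup aborts.
    if value != "1":
        raise StartupError(
            f"BANGUI_WORKERS must be set to 1 (got {value!r}); "
            "only one APScheduler instance may run, so multi-worker "
            "deployments are not supported."
        )
```

Calling `check_single_worker()` with no argument reads the real process environment. Passing a mapping keeps the check easy to test.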
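The lock acquisition and heartbeat described under *Runtime lock* can be sketched with Python's standard `sqlite3` and `asyncio` modules. This is a minimal sketch under stated assumptions, not BanGUI's implementation:

- The column types are assumed.
- The function names `try_acquire_lock`, `refresh_heartbeat` and `heartbeat_loop` are hypothetical. `heartbeat_loop` plays the role of the ADR's `scheduler_lock_heartbeat` task.
- The current time is passed as a bound `now` parameter instead of calling `unixepoch()`. This keeps the sketch runnable on SQLite builds older than 3.38 and lets tests control the clock.

```python
import asyncio
import sqlite3
import time
from typing import Optional

# Values taken from this ADR.
LOCK_NAME = "scheduler"
LOCK_TTL_SECONDS = 30
HEARTBEAT_INTERVAL_SECONDS = 10

# Assumed schema: the ADR names the table and columns but not their types.
_CREATE_SQL = (
    "CREATE TABLE IF NOT EXISTS scheduler_lock ("
    " lock_name TEXT PRIMARY KEY,"
    " heartbeat_at INTEGER NOT NULL)"
)

# Insert the row, or take it over only if the heartbeat is at least TTL seconds old.
_ACQUIRE_SQL = (
    "INSERT INTO scheduler_lock (lock_name, heartbeat_at) VALUES (:name, :now) "
    "ON CONFLICT(lock_name) DO UPDATE SET heartbeat_at = excluded.heartbeat_at "
    "WHERE (:now - scheduler_lock.heartbeat_at) >= :ttl"
)

# Refresh the heartbeat only while the lock has not yet expired.
_REFRESH_SQL = (
    "UPDATE scheduler_lock SET heartbeat_at = :now "
    "WHERE lock_name = :name AND (:now - heartbeat_at) < :ttl"
)


def _params(now: Optional[int]) -> dict:
    return {
        "name": LOCK_NAME,
        "now": int(time.time()) if now is None else now,
        "ttl": LOCK_TTL_SECONDS,
    }


def create_lock_table(conn: sqlite3.Connection) -> None:
    with conn:
        conn.execute(_CREATE_SQL)


def try_acquire_lock(conn: sqlite3.Connection, now: Optional[int] = None) -> bool:
    """Return True if this instance now holds the lock."""
    with conn:  # commits the single upsert statement
        cur = conn.execute(_ACQUIRE_SQL, _params(now))
    # 1 for a fresh insert or a takeover; 0 when WHERE filtered out the update.
    return cur.rowcount == 1


def refresh_heartbeat(conn: sqlite3.Connection, now: Optional[int] = None) -> bool:
    """Return False if the lock had already expired (caller should stop the scheduler)."""
    with conn:
        cur = conn.execute(_REFRESH_SQL, _params(now))
    return cur.rowcount == 1


async def heartbeat_loop(conn: sqlite3.Connection, stop: asyncio.Event) -> None:
    """Refresh the heartbeat every interval until stopped or the lock is lost."""
    while not stop.is_set():
        # One short synchronous write; it blocks the event loop only briefly.
        if not refresh_heartbeat(conn):
            return
        try:
            await asyncio.wait_for(stop.wait(), timeout=HEARTBEAT_INTERVAL_SECONDS)
        except asyncio.TimeoutError:
            pass  # interval elapsed; refresh again
```

In this sketch the application calls `try_acquire_lock` once at startup. Only if it returns `True` does it start the `AsyncIOScheduler` and launch `heartbeat_loop` with `asyncio.create_task`.

The table has no holder identifier, so `refresh_heartbeat` cannot tell whether another instance took over after a stall longer than the TTL. A hypothetical holder ID column, checked in both statements, would close that gap.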