Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:

1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
- Atomicity: SQLite transactions prevent TOCTOU race windows
- Container-safe: Works across containers with shared DB volumes
- No NFS/SMB edge cases
- Timestamp-based stale detection (PID-based checks are unreliable because PIDs get reused)
- More reliable in rolling deployments

Benefits:
- Works with any process manager (uvicorn, gunicorn, etc.)
- Handles simultaneous startup attempts correctly
- Automatic failover on instance crash (stale lock cleanup)
- Clear error messages with troubleshooting steps
- No environment variable required (lock is authoritative)
- Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -1170,9 +1170,9 @@ Fluent UI v9 applies styles via inline `style` attributes on DOM elements. To su

## 9.3 Deployment Constraints

### Single-Worker Requirement
### Single-Executor Scheduler Requirement

**BanGUI's background scheduler must run with exactly one uvicorn worker process.**
**BanGUI's background scheduler must run with exactly one executor process.**

The application uses APScheduler's `AsyncIOScheduler`, which is bound to a single asyncio event loop and cannot be safely shared across multiple worker processes. If the app is deployed with `--workers N` (where N > 1), the following failures occur:

@@ -1184,21 +1184,105 @@ The application uses APScheduler's `AsyncIOScheduler`, which is bound to a singl
- **Duplicate ban operations:** bans are executed multiple times, with potential state conflicts.
- **SQLite lock contention:** concurrent writes to the same database from N workers cause lock timeouts.

### Enforcement
### Enforcement Mechanism

1. **Environment variable:** Set `BANGUI_WORKERS=1` (default in Dockerfile.backend).
2. **Detection:** On startup, `startup_shared_resources()` validates `BANGUI_WORKERS` and raises a clear `RuntimeError` if it is not 1.
3. **Single-process design:** The application is optimized for a single-process, high-concurrency model using asyncio. Request handling is fully async and leverages the event loop efficiently.

BanGUI enforces single-executor safety through a **database-backed lock** that works reliably in container orchestration environments:

1. **Fast check (env var):** On startup, the `BANGUI_WORKERS` environment variable is checked (if set). If explicitly set to a value > 1, startup fails immediately with a clear error.

2. **Authoritative check (database lock):** During startup, BanGUI acquires an atomic database lock in the `scheduler_lock` table. This lock:
   - Uses a singleton row (id=1) to prevent race conditions across simultaneously starting instances
   - Stores the PID, hostname, creation timestamp, and heartbeat timestamp of the lock holder
   - Is considered stale if the heartbeat hasn't been updated for 60 seconds
   - Is automatically cleaned up on stale instance detection, allowing failover in rolling deployments

3. **Lock acquisition (startup):**
   - Clean up any stale locks (heartbeat older than 60 seconds)
   - Attempt to insert a new lock row with this instance's PID and hostname
   - If the INSERT fails (row already exists), reject startup with a clear error
   - If the INSERT succeeds, this instance holds the lock and will start the scheduler

4. **Lock maintenance (runtime):** A periodic background task (`scheduler_lock_heartbeat`) updates the lock's heartbeat timestamp every 10 seconds, keeping it alive and preventing false positives from temporary load spikes.

5. **Lock release (shutdown):** On graceful shutdown, the lock is released, allowing other instances to acquire it.

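The acquire, heartbeat, and release steps above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the code in `app/utils/scheduler_lock.py`: the function names come from the commit message, but their signatures, the `now` parameter, and every column name except `heartbeat_at` are assumptions, as is storing timestamps as Unix epoch seconds.

```python
import os
import socket
import sqlite3
import time

STALE_AFTER_SECONDS = 60  # TTL described in the text

# Assumed schema: only the table name and heartbeat_at appear in the docs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    pid INTEGER NOT NULL,
    hostname TEXT NOT NULL,
    created_at REAL NOT NULL,
    heartbeat_at REAL NOT NULL
)
"""


def acquire_scheduler_lock(conn, now=None):
    """Return True if this process now holds the lock, False otherwise."""
    now = time.time() if now is None else now
    try:
        with conn:  # stale cleanup and insert commit (or roll back) together
            conn.execute(
                "DELETE FROM scheduler_lock WHERE ? - heartbeat_at > ?",
                (now, STALE_AFTER_SECONDS),
            )
            conn.execute(
                "INSERT INTO scheduler_lock "
                "(id, pid, hostname, created_at, heartbeat_at) "
                "VALUES (1, ?, ?, ?, ?)",
                (os.getpid(), socket.gethostname(), now, now),
            )
        return True
    except sqlite3.IntegrityError:
        # Row id=1 exists and is still fresh: another holder is alive.
        return False


def update_scheduler_lock_heartbeat(conn, now=None):
    """Refresh the heartbeat; return False if we no longer own the row."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? "
            "WHERE id = 1 AND pid = ? AND hostname = ?",
            (now, os.getpid(), socket.gethostname()),
        )
    return cur.rowcount == 1


def release_scheduler_lock(conn):
    """Delete the lock row, but only if this process owns it."""
    with conn:
        conn.execute(
            "DELETE FROM scheduler_lock "
            "WHERE id = 1 AND pid = ? AND hostname = ?",
            (os.getpid(), socket.gethostname()),
        )
```

Scoping the update and delete to the owning PID and hostname is a design choice of this sketch: it keeps an instance that lost the lock (for example after a long pause) from refreshing or releasing a row that now belongs to someone else.
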
**Why database-backed instead of filesystem?**

Database-backed locking is more reliable in container orchestration because:
- **Atomicity:** SQLite transactions are atomic, so there is no race condition window between checking and inserting
- **Container-safe:** Works across containers with shared database volumes (no NFS/SMB edge cases)
- **Stale detection:** Heartbeat-based TTL is simpler and more reliable than PID-based checks (PID reuse is common in containers)
- **No false positives:** Timestamp-based expiration eliminates issues with PID reuse

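To make the atomicity claim concrete, the following sketch starts several contenders at once against one SQLite file and counts how many manage to insert the singleton row. It is an illustration under assumptions: the two-column table, the `holder` column, and the helper names `try_claim` and `race` are hypothetical, and threads stand in for separate BanGUI processes.

```python
import os
import sqlite3
import tempfile
import threading


def try_claim(db_path, name, winners, barrier):
    # isolation_level=None: transactions are controlled explicitly below.
    conn = sqlite3.connect(db_path, timeout=10, isolation_level=None)
    try:
        barrier.wait()  # release all contenders at nearly the same moment
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        try:
            conn.execute(
                "INSERT INTO scheduler_lock (id, holder) VALUES (1, ?)",
                (name,),
            )
            conn.execute("COMMIT")
            winners.append(name)
        except sqlite3.IntegrityError:
            conn.execute("ROLLBACK")  # row id=1 already belongs to someone
    finally:
        conn.close()


def race(n_contenders=4):
    """Return the names of the contenders that obtained the lock."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "lock.db")
        setup = sqlite3.connect(db_path)
        setup.execute(
            "CREATE TABLE scheduler_lock ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), holder TEXT NOT NULL)"
        )
        setup.commit()
        setup.close()

        winners = []
        barrier = threading.Barrier(n_contenders)
        threads = [
            threading.Thread(
                target=try_claim,
                args=(db_path, f"instance-{i}", winners, barrier),
            )
            for i in range(n_contenders)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return winners
```

Because the primary key admits only one row with `id = 1`, the database itself decides the winner; no contender has to check first and insert second.
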
### Startup Sequence with Scheduler Lock

```
1. DATABASE stage
   └─ Initialize SQLite schema (including scheduler_lock table)

2. WORKER_MODE stage (formerly first, now depends on DATABASE)
   ├─ Fast check: Verify BANGUI_WORKERS env var if explicitly set
   └─ Authoritative check: Acquire scheduler lock in database
      → If lock held by another instance: Fail with clear error
      → If lock acquired: Continue to GEO_CACHE stage

3. (rest of startup continues as normal)
```

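The stage ordering can be expressed as a small dependency graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+) rather than BanGUI's own startup DAG code, which is not shown here. The stage names come from the sequence above; the `STAGE_DEPENDENCIES` mapping, the `startup_order` helper, and the assumption that `GEO_CACHE` depends directly on `WORKER_MODE` are illustrative.

```python
from graphlib import TopologicalSorter

# Stage -> set of stages that must finish before it may run.
# Assumed edges, reconstructed from the startup sequence in the text.
STAGE_DEPENDENCIES = {
    "DATABASE": set(),
    "WORKER_MODE": {"DATABASE"},   # lock check needs the schema in place
    "GEO_CACHE": {"WORKER_MODE"},  # only reached once the lock is held
}


def startup_order(dependencies):
    """Return the stages in an order that satisfies every dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```

With these edges the only valid order is `DATABASE`, `WORKER_MODE`, `GEO_CACHE`, and a circular dependency would raise `graphlib.CycleError` instead of silently producing a bad order.
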
### Troubleshooting

**Problem:** Startup fails with "Could not acquire scheduler lock"

**Solution:**
1. Verify no other BanGUI instances are running
2. Inspect the lock: `sqlite3 bangui.db "SELECT * FROM scheduler_lock;"`
3. Check who holds the lock (hostname, PID, heartbeat time)
4. If stale (heartbeat older than 60 seconds), clean it:
   ```sh
   sqlite3 bangui.db "DELETE FROM scheduler_lock WHERE (strftime('%s', 'now') - heartbeat_at) > 60;"
   ```
5. Retry the failed instance

**Problem:** Stale lock after instance crash

BanGUI handles this automatically:
- The next instance to start will detect the stale lock (heartbeat older than 60 seconds)
- It will clean it up and acquire the lock
- The new instance starts the scheduler as normal

No manual intervention is required.

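For steps 2 to 4 of the first problem, a short script can report who holds the lock and whether it counts as stale. This is a sketch and not the project's `get_scheduler_lock_info()`: the helper name `describe_lock` and the `pid` and `hostname` column names are assumptions, and `heartbeat_at` is assumed to hold Unix epoch seconds, as the cleanup query in step 4 suggests.

```python
import sqlite3
import time


def describe_lock(conn, stale_after=60, now=None):
    """Return holder details and staleness, or None if no lock row exists."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT pid, hostname, heartbeat_at FROM scheduler_lock WHERE id = 1"
    ).fetchone()
    if row is None:
        return None
    pid, hostname, heartbeat_at = row
    age = now - heartbeat_at
    return {
        "pid": pid,
        "hostname": hostname,
        "heartbeat_age_seconds": age,
        "stale": age > stale_after,  # same 60-second rule as the text
    }
```

A result of `None` means no instance currently holds the lock, so a failed start would have to have another cause.
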
### Environment Variables

- **`BANGUI_WORKERS`** (optional, default: unset)
  - If set to `1` or unset: Normal operation (any number of instances may attempt to start, but only the one that acquires the lock proceeds; the others fail at startup)
  - If set to a value greater than `1`: Startup fails immediately with an error (fast check)
  - Reason: Legacy env var for explicitly forbidding multi-worker deployments

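The fast check on `BANGUI_WORKERS` could look roughly like the sketch below. The function name `check_worker_env`, the error wording, and the choice to reject non-integer values are assumptions; the documentation only specifies the behavior for unset, `1`, and values greater than `1`.

```python
def check_worker_env(environ):
    """Raise RuntimeError if BANGUI_WORKERS explicitly asks for >1 worker."""
    raw = environ.get("BANGUI_WORKERS")
    if raw is None or raw.strip() == "":
        return  # unset: defer to the database lock
    try:
        workers = int(raw)
    except ValueError:
        # Assumption: a malformed value is treated as a configuration error.
        raise RuntimeError(
            f"BANGUI_WORKERS must be an integer, got {raw!r}"
        ) from None
    if workers > 1:
        raise RuntimeError(
            f"BANGUI_WORKERS={workers} is not supported: the scheduler "
            "requires exactly one executor process"
        )
```

Passing the environment as a mapping (for example `check_worker_env(os.environ)`) keeps the check easy to test without touching the real process environment.
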
### Container Orchestration Examples

**Docker Compose:**
- Single service instance (no scaling): scheduler runs normally

**Kubernetes:**
- Single Pod replica: scheduler runs normally
- Multiple Pod replicas (during rolling update): old Pod releases lock on shutdown, new Pod acquires it
  - No duplicate jobs, no startup failures
  - Health check should allow 30-60 seconds for lock handoff

**systemd / process manager:**
- Single process: scheduler runs normally
- Accidental multi-process restart: lock prevents duplicate jobs, other processes fail to start scheduler

### Future Multi-Worker Support

To safely support multiple workers in the future:

1. **External job store:** Move APScheduler from in-memory to a persistent store (e.g., SQLAlchemy-backed job store with PostgreSQL or Redis).
2. **Distributed locking:** Use a distributed lock (Redis, etcd) to ensure only one worker executes each scheduled job.
2. **Distributed locking:** Use a distributed lock (Redis, etcd) instead of database lock for better performance.
3. **Process coordination:** Implement a process-to-worker pool communication mechanism so the scheduler runs only on one designated worker.

Currently, the single-worker approach is simple, maintainable, and sufficient for BanGUI's operational requirements.
Currently, the single-executor approach is simple, maintainable, and sufficient for BanGUI's operational requirements. The database lock provides reliable enforcement across all deployment scenarios.

---