Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents a live lock from being judged stale during temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID-based checks are unreliable because PIDs get reused)
   - More reliable in rolling deployments

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:10:53 +02:00
parent 336242ad06
commit 187cd8250d
8 changed files with 768 additions and 82 deletions


@@ -1170,9 +1170,9 @@ Fluent UI v9 applies styles via inline `style` attributes on DOM elements. To su
## 9.3 Deployment Constraints
### Single-Worker Requirement
### Single-Executor Scheduler Requirement
**BanGUI's background scheduler must run with exactly one uvicorn worker process.**
**BanGUI's background scheduler must run with exactly one executor process.**
The application uses APScheduler's `AsyncIOScheduler`, which is bound to a single asyncio event loop and cannot be safely shared across multiple worker processes. If the app is deployed with `--workers N` (where N > 1), the following failures occur:
@@ -1184,21 +1184,105 @@ The application uses APScheduler's `AsyncIOScheduler`, which is bound to a singl
- **Duplicate ban operations** — bans are executed multiple times, with potential state conflicts.
- **SQLite lock contention** — concurrent writes to the same database from N workers cause lock timeouts.
### Enforcement
### Enforcement Mechanism
1. **Environment variable:** Set `BANGUI_WORKERS=1` (default in Dockerfile.backend).
2. **Detection:** On startup, `startup_shared_resources()` validates `BANGUI_WORKERS` and raises a clear `RuntimeError` if it is not 1.
3. **Single-process design:** The application is optimized for a single-process, high-concurrency model using asyncio. Request handling is fully async and leverages the event loop efficiently.
BanGUI enforces single-executor safety through a **database-backed lock** that works reliably in container orchestration environments:
1. **Fast check (env var):** On startup, the `BANGUI_WORKERS` environment variable is checked (if set). If explicitly set to a value > 1, startup fails immediately with a clear error.
2. **Authoritative check (database lock):** During startup, BanGUI acquires an atomic database lock in the `scheduler_lock` table. This lock:
- Uses a singleton row (id=1) to prevent race conditions across simultaneously starting instances
- Stores the PID, hostname, creation timestamp, and heartbeat timestamp of the lock holder
- Is considered stale if the heartbeat hasn't been updated for 60 seconds
- Is removed automatically once it is detected as stale, allowing failover in rolling deployments
3. **Lock acquisition (startup):**
- Clean up any stale locks (heartbeat older than 60 seconds)
- Attempt to insert a new lock row with this instance's PID and hostname
- If the INSERT fails (row already exists), reject startup with a clear error
- If the INSERT succeeds, this instance holds the lock and will start the scheduler
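4. **Lock maintenance (runtime):** A periodic background task (`scheduler_lock_heartbeat`) updates the lock's heartbeat timestamp every 10 seconds. Because the lock only expires after 60 seconds without an update, several heartbeats can be delayed or missed during a temporary load spike before a live lock is mistaken for a stale one.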
5. **Lock release (shutdown):** On graceful shutdown, the lock is released, allowing other instances to acquire it.
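The acquisition steps above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the code in `app/utils/scheduler_lock.py`: the column names, the use of epoch-second timestamps, and the helper name `try_acquire` are assumptions made for the example.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 60  # TTL described in the text above

# Hypothetical schema: column names and types are assumptions,
# not the actual contents of migration 4.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    pid INTEGER NOT NULL,
    hostname TEXT NOT NULL,
    created_at REAL NOT NULL,
    heartbeat_at REAL NOT NULL
)
"""


def try_acquire(conn, pid, hostname, now=None):
    """Return True if this instance now holds the lock, False otherwise."""
    now = time.time() if now is None else now
    try:
        # "with conn" commits on success and rolls back on error, so the
        # stale cleanup and the insert happen in a single transaction.
        with conn:
            conn.execute(
                "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
                (now - STALE_AFTER_SECONDS,),
            )
            conn.execute(
                "INSERT INTO scheduler_lock "
                "(id, pid, hostname, created_at, heartbeat_at) "
                "VALUES (1, ?, ?, ?, ?)",
                (pid, hostname, now, now),
            )
        return True
    except sqlite3.IntegrityError:
        # The singleton row already exists and is still fresh.
        return False
```

The primary key on `id` is what makes the insert fail for a second instance; the `CHECK (id = 1)` constraint is an extra guard in this sketch so that no other row can ever be created.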
**Why database-backed instead of filesystem?**
Database-backed locking is more reliable in container orchestration because:
- **Atomicity:** SQLite transactions are atomic — no race condition window between checking and inserting
- **Container-safe:** Works across containers with shared database volumes (no NFS/SMB edge cases)
- **Stale detection:** Heartbeat-based TTL is simpler and more reliable than PID-based checks (PID reuse is common in containers)
- **No false positives:** Timestamp-based expiration eliminates issues with PID reuse
### Startup Sequence with Scheduler Lock
```
1. DATABASE stage
└─ Initialize SQLite schema (including scheduler_lock table)
2. WORKER_MODE stage (formerly first, now depends on DATABASE)
├─ Fast check: Verify BANGUI_WORKERS env var if explicitly set
└─ Authoritative check: Acquire scheduler lock in database
→ If lock held by another instance: Fail with clear error
→ If lock acquired: Continue to GEO_CACHE stage
3. (rest of startup continues as normal)
```
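The ordering above can be illustrated as a dependency graph resolved with the standard-library `graphlib` module. This is only a sketch: the mapping below covers the three stages named in this section, the edge from `GEO_CACHE` to `WORKER_MODE` is inferred from the "Continue to GEO_CACHE stage" step, and `startup_order` is a hypothetical helper rather than the startup DAG implementation in `app/startup.py`.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (assumed edges).
STAGE_DEPENDENCIES = {
    "DATABASE": set(),
    "WORKER_MODE": {"DATABASE"},
    "GEO_CACHE": {"WORKER_MODE"},
}


def startup_order(dependencies):
    """Return the stages in an order that satisfies every dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```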
### Troubleshooting
**Problem:** Startup fails with "Could not acquire scheduler lock"
**Solution:**
1. Verify no other BanGUI instances are running
2. Inspect the lock: `sqlite3 bangui.db "SELECT * FROM scheduler_lock;"`
3. Check who holds the lock (hostname, PID, heartbeat time)
4. If stale (heartbeat older than 60 seconds), clean it:
```bash
sqlite3 bangui.db "DELETE FROM scheduler_lock WHERE (strftime('%s', 'now') - heartbeat_at) > 60;"
```
5. Retry the failed instance
**Problem:** Stale lock after instance crash
BanGUI handles this automatically:
- The next instance to start will detect the stale lock (heartbeat older than 60 seconds)
- It will clean it up and acquire the lock
- The new instance starts the scheduler as normal
No manual intervention is required.
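The heartbeat update and the staleness rule behind this automatic recovery can be sketched as follows. The function names `touch_heartbeat` and `is_stale`, the epoch-second timestamps, and the choice to match on PID and hostname are assumptions for illustration; the real helper is `update_scheduler_lock_heartbeat()` and may differ.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 60  # TTL described in the text above


def touch_heartbeat(conn, pid, hostname, now=None):
    """Refresh the heartbeat; return False if this instance no longer owns the lock."""
    now = time.time() if now is None else now
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? "
            "WHERE id = 1 AND pid = ? AND hostname = ?",
            (now, pid, hostname),
        )
    # rowcount is 0 when the row is missing or belongs to another holder.
    return cursor.rowcount == 1


def is_stale(heartbeat_at, now):
    """A lock is stale when its heartbeat is more than the TTL in the past."""
    return now - heartbeat_at > STALE_AFTER_SECONDS
```

Matching on the holder's identity in the `UPDATE` is a design choice of this sketch: it keeps an instance whose lock was taken over from silently refreshing someone else's row.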
### Environment Variables
- **`BANGUI_WORKERS`** (optional, default: unset)
- If set to `1` or unset: Normal operation. The database lock decides which instance runs the scheduler; an instance that cannot acquire the lock fails at startup.
- If set to a value greater than `1`: Startup fails immediately with an error (fast check)
- Reason: Legacy env var for explicitly forbidding multi-worker deployments
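A minimal sketch of the fast check, assuming a hypothetical helper named `fast_worker_check`. The handling of non-integer values is not specified in this section; rejecting them here is an assumption of the example.

```python
import os


def fast_worker_check(environ=None):
    """Raise RuntimeError if BANGUI_WORKERS is explicitly set above 1."""
    environ = os.environ if environ is None else environ
    raw = environ.get("BANGUI_WORKERS")
    if raw is None or raw.strip() == "":
        return  # unset: defer to the database lock
    try:
        workers = int(raw)
    except ValueError as exc:
        # Assumption: a value that is not an integer is treated as a
        # configuration error rather than silently ignored.
        raise RuntimeError(f"BANGUI_WORKERS must be an integer, got {raw!r}") from exc
    if workers > 1:
        raise RuntimeError(
            f"BANGUI_WORKERS={workers} is not supported; "
            "the scheduler requires a single executor process"
        )
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise in tests.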
### Container Orchestration Examples
**Docker Compose:**
- Single service instance (no scaling) — scheduler runs normally
**Kubernetes:**
- Single Pod replica — scheduler runs normally
- Multiple Pod replicas (during rolling update) — old Pod releases lock on shutdown, new Pod acquires it
- No duplicate jobs, no startup failures
- Health check should allow 30-60 seconds for lock handoff
**systemd / process manager:**
- Single process — scheduler runs normally
- Accidental multi-process restart — lock prevents duplicate jobs, other processes fail to start scheduler
### Future Multi-Worker Support
To safely support multiple workers in the future:
1. **External job store:** Move APScheduler from in-memory to a persistent store (e.g., SQLAlchemy-backed job store with PostgreSQL or Redis).
2. **Distributed locking:** Use a distributed lock (Redis, etcd) to ensure only one worker executes each scheduled job.
2. **Distributed locking:** Use a distributed lock (Redis, etcd) instead of database lock for better performance.
3. **Process coordination:** Implement a process-to-worker pool communication mechanism so the scheduler runs only on one designated worker.
Currently, the single-worker approach is simple, maintainable, and sufficient for BanGUI's operational requirements.
Currently, the single-executor approach is simple, maintainable, and sufficient for BanGUI's operational requirements. The database lock provides reliable enforcement across all deployment scenarios.
---


@@ -1,41 +1,3 @@
## 35) API client sends JSON and CSRF header for every request method
- Where found:
- [frontend/src/api/client.ts](frontend/src/api/client.ts)
- Why this is needed:
- Extra headers on GET increase unnecessary CORS preflights and noise.
- Goal:
- Apply headers by method/body requirements.
- What to do:
- Only set Content-Type for requests with JSON body.
- Send CSRF header for mutating cookie-authenticated requests only.
- Possible traps and issues:
- CSRF protection assumptions must still hold for all mutating paths.
- Docs changes needed:
- Update frontend API client contract and CSRF notes.
- Doc references:
- [backend/app/middleware/csrf.py](backend/app/middleware/csrf.py)
---
## 36) Polling continues when tab is not visible
- Where found:
- [frontend/src/hooks/usePolledData.ts](frontend/src/hooks/usePolledData.ts#L90)
- [frontend/src/hooks/useBlocklistStatus.ts](frontend/src/hooks/useBlocklistStatus.ts)
- Why this is needed:
- Unnecessary backend load and client resource usage in background tabs.
- Goal:
- Pause/reduce polling when page is hidden.
- What to do:
- Add visibility-aware polling strategy and optional backoff.
- Possible traps and issues:
- Data may appear stale immediately after tab restore if refresh is delayed.
- Docs changes needed:
- Add frontend polling lifecycle policy.
- Doc references:
- [Docs/Web-Development.md](Docs/Web-Development.md)
---
## 37) Multi-worker safety check depends on one environment variable
- Where found:
- [backend/app/startup.py](backend/app/startup.py#L61)