Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions
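
A minimal sketch of what a singleton-row lock table could look like, assuming SQLite. The column names and types (pid, hostname, created_at, heartbeat_at) and the helper try_insert() are hypothetical illustrations, not the actual contents of migration 4. The CHECK (id = 1) constraint together with the primary key means at most one row can ever exist, so a second insert fails instead of creating a competing lock.

```python
import sqlite3

# Hypothetical DDL: column names and types are assumptions, not the real migration.
SCHEDULER_LOCK_DDL = """
CREATE TABLE scheduler_lock (
    id           INTEGER PRIMARY KEY CHECK (id = 1),  -- only one row can exist
    pid          INTEGER NOT NULL,
    hostname     TEXT    NOT NULL,
    created_at   REAL    NOT NULL,
    heartbeat_at REAL    NOT NULL
)
"""


def try_insert(conn, pid):
    """Return True if this caller inserted the singleton row, False if it already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO scheduler_lock (id, pid, hostname, created_at, heartbeat_at) "
                "VALUES (1, ?, 'host-a', 0.0, 0.0)",
                (pid,),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: another caller already holds the row.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(SCHEDULER_LOCK_DDL)
first = try_insert(conn, 100)   # first caller inserts the row
second = try_insert(conn, 200)  # second caller is rejected by the primary key
```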

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)
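
The acquire, heartbeat, and release steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in app/utils/scheduler_lock.py: the function names come from this commit message, but the signatures, the `now` parameter, the create_table() helper, and the table columns are hypothetical. The connection is assumed to be opened with isolation_level=None so that the explicit BEGIN IMMEDIATE controls the transaction.

```python
import os
import socket
import sqlite3
import time

LOCK_TTL_SECONDS = 60  # stale-lock expiry stated in the commit message


def create_table(conn):
    # Hypothetical schema; the real migration may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), pid INTEGER NOT NULL, "
        "hostname TEXT NOT NULL, created_at REAL NOT NULL, heartbeat_at REAL NOT NULL)"
    )


def acquire_scheduler_lock(conn, now=None):
    """Take the lock if it is free or stale; return False if a live holder exists."""
    now = time.time() if now is None else now
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT heartbeat_at FROM scheduler_lock WHERE id = 1"
        ).fetchone()
        if row is not None and now - row[0] < LOCK_TTL_SECONDS:
            conn.execute("ROLLBACK")
            return False
        # Free or stale: (re)write the singleton row for this process.
        conn.execute(
            "INSERT OR REPLACE INTO scheduler_lock "
            "(id, pid, hostname, created_at, heartbeat_at) VALUES (1, ?, ?, ?, ?)",
            (os.getpid(), socket.gethostname(), now, now),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


def update_scheduler_lock_heartbeat(conn, now=None):
    """Refresh the heartbeat; return False if this process no longer owns the lock."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scheduler_lock SET heartbeat_at = ? "
        "WHERE id = 1 AND pid = ? AND hostname = ?",
        (now, os.getpid(), socket.gethostname()),
    )
    return cur.rowcount == 1


def release_scheduler_lock(conn):
    """Delete the row only if this process owns it; return True if a row was removed."""
    cur = conn.execute(
        "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ? AND hostname = ?",
        (os.getpid(), socket.gethostname()),
    )
    return cur.rowcount == 1
```

Reading and writing inside one BEGIN IMMEDIATE transaction is what closes the check-then-write window: a second process blocks on the write lock until the first has committed, then sees a fresh heartbeat and backs off.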

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management
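
One way the two-tier check could be structured, as a sketch only. BANGUI_WORKERS is the variable named above; the function name check_single_executor(), its parameters, and the error texts are hypothetical and do not reflect the actual code in app/startup.py. The lock attempt is passed in as a callable so the sketch stays independent of the database layer.

```python
def check_single_executor(env, try_acquire_lock):
    """Raise RuntimeError if this process must not run the scheduler.

    env: mapping of environment variables (for example os.environ).
    try_acquire_lock: zero-argument callable returning True if the
        database lock was acquired (hypothetical interface).
    """
    # Tier 1, fast check: reject an explicit multi-worker configuration early,
    # before touching the database.
    raw = env.get("BANGUI_WORKERS")
    if raw is not None and raw.strip().isdigit() and int(raw) > 1:
        raise RuntimeError(
            f"BANGUI_WORKERS={raw.strip()}: the scheduler requires a single worker"
        )

    # Tier 2, authoritative check: the database lock also catches workers
    # started by a launcher that never set the variable.
    if not try_acquire_lock():
        raise RuntimeError("scheduler lock is held by another live instance")
```

The fast check only fails when the variable is explicitly greater than 1; an unset or unparsable value falls through to the database lock, which stays the source of truth.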

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - 10-second interval leaves margin under the 60-second TTL, so a few
     heartbeats delayed by a load spike do not make a live lock look stale

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID checks are unreliable because
     PIDs get reused)
   - More reliable in rolling deployments

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock is cleaned up once
     the 60-second TTL expires)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:10:53 +02:00
parent 336242ad06
commit 187cd8250d
8 changed files with 768 additions and 82 deletions

@@ -1,41 +1,3 @@
## 35) API client sends JSON and CSRF header for every request method
- Where found:
- [frontend/src/api/client.ts](frontend/src/api/client.ts)
- Why this is needed:
- Extra headers on GET increase unnecessary CORS preflights and noise.
- Goal:
- Apply headers by method/body requirements.
- What to do:
- Only set Content-Type for requests with JSON body.
- Send CSRF header for mutating cookie-authenticated requests only.
- Possible traps and issues:
- CSRF protection assumptions must still hold for all mutating paths.
- Docs changes needed:
- Update frontend API client contract and CSRF notes.
- Doc references:
- [backend/app/middleware/csrf.py](backend/app/middleware/csrf.py)
---
## 36) Polling continues when tab is not visible
- Where found:
- [frontend/src/hooks/usePolledData.ts](frontend/src/hooks/usePolledData.ts#L90)
- [frontend/src/hooks/useBlocklistStatus.ts](frontend/src/hooks/useBlocklistStatus.ts)
- Why this is needed:
- Unnecessary backend load and client resource usage in background tabs.
- Goal:
- Pause/reduce polling when page is hidden.
- What to do:
- Add visibility-aware polling strategy and optional backoff.
- Possible traps and issues:
- Data may appear stale immediately after tab restore if refresh is delayed.
- Docs changes needed:
- Add frontend polling lifecycle policy.
- Doc references:
- [Docs/Web-Development.md](Docs/Web-Development.md)
---
## 37) Multi-worker safety check depends on one environment variable
- Where found:
- [backend/app/startup.py](backend/app/startup.py#L61)