Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a robust database-backed lock mechanism that works reliably in container orchestration environments.

Key changes:

1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
- Atomicity: SQLite transactions prevent TOCTOU race windows
- Container-safe: Works across containers with shared DB volumes
- No NFS/SMB edge cases
- Timestamp-based stale detection (PID reuse is unreliable)
- More reliable in rolling deployments

Benefits:
- Works with any process manager (uvicorn, gunicorn, etc.)
- Handles simultaneous startup attempts correctly
- Automatic failover on instance crash (stale lock cleanup)
- Clear error messages with troubleshooting steps
- No environment variable required (lock is authoritative)
- Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
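
For readers without the full diff, here is a minimal sketch of how a singleton-row lock with TTL-based stale detection might look. It is not the code in app/utils/scheduler_lock.py. The function names, the scheduler_lock table, the id=1 row and the 60-second TTL come from the message above. The column names, the function signatures and the use of the standard-library sqlite3 module are assumptions made for illustration.

```python
import os
import socket
import sqlite3
import time

LOCK_TTL_SECONDS = 60  # stale-lock expiry named in the commit message

# Assumed schema: the CHECK constraint limits the table to the single row id=1.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    pid INTEGER NOT NULL,
    hostname TEXT NOT NULL,
    created_at REAL NOT NULL,
    heartbeat_at REAL NOT NULL
)
"""


def acquire_scheduler_lock(conn, now=None):
    """Return True if this process now holds the lock, False otherwise."""
    now = time.time() if now is None else now
    try:
        # "with conn" wraps both statements in one transaction; it commits on
        # success and rolls back if an exception escapes the block.
        with conn:
            # Drop the row only if its heartbeat is older than the TTL.
            conn.execute(
                "DELETE FROM scheduler_lock WHERE id = 1 AND heartbeat_at < ?",
                (now - LOCK_TTL_SECONDS,),
            )
            # Fails with IntegrityError if a fresh row still occupies id=1.
            conn.execute(
                "INSERT INTO scheduler_lock "
                "(id, pid, hostname, created_at, heartbeat_at) "
                "VALUES (1, ?, ?, ?, ?)",
                (os.getpid(), socket.gethostname(), now, now),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def update_scheduler_lock_heartbeat(conn, now=None):
    """Refresh the heartbeat; return False if this process is not the holder."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? "
            "WHERE id = 1 AND pid = ? AND hostname = ?",
            (now, os.getpid(), socket.gethostname()),
        )
    return cur.rowcount == 1


def release_scheduler_lock(conn):
    """Delete the lock row, but only if this process owns it."""
    with conn:
        conn.execute(
            "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ? AND hostname = ?",
            (os.getpid(), socket.gethostname()),
        )
```

In this sketch, staleness is decided only by the heartbeat timestamp, which matches the rationale above that PID reuse is unreliable. The PID and hostname are used only to check ownership on heartbeat and release. With a 10-second heartbeat and a 60-second TTL, a holder would have to miss several consecutive heartbeats before another instance could take over.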
2026-04-29 20:10:53 +02:00
parent 336242ad06
commit 187cd8250d
8 changed files with 768 additions and 82 deletions
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,41 +1,3 @@
-## 35) API client sends JSON and CSRF header for every request method
- Where found:
-	- [frontend/src/api/client.ts](frontend/src/api/client.ts)
- Why this is needed:
-	- Extra headers on GET increase unnecessary CORS preflights and noise.
- Goal:
-	- Apply headers by method/body requirements.
- What to do:
-	- Only set Content-Type for requests with JSON body.
-	- Send CSRF header for mutating cookie-authenticated requests only.
- Possible traps and issues:
-	- CSRF protection assumptions must still hold for all mutating paths.
- Docs changes needed:
-	- Update frontend API client contract and CSRF notes.
- Doc references:
-	- [backend/app/middleware/csrf.py](backend/app/middleware/csrf.py)
-
---
-
-## 36) Polling continues when tab is not visible
- Where found:
-	- [frontend/src/hooks/usePolledData.ts](frontend/src/hooks/usePolledData.ts#L90)
-	- [frontend/src/hooks/useBlocklistStatus.ts](frontend/src/hooks/useBlocklistStatus.ts)
- Why this is needed:
-	- Unnecessary backend load and client resource usage in background tabs.
- Goal:
-	- Pause/reduce polling when page is hidden.
- What to do:
-	- Add visibility-aware polling strategy and optional backoff.
- Possible traps and issues:
-	- Data may appear stale immediately after tab restore if refresh is delayed.
- Docs changes needed:
-	- Add frontend polling lifecycle policy.
- Doc references:
-	- [Docs/Web-Development.md](Docs/Web-Development.md)
-
---
-
 ## 37) Multi-worker safety check depends on one environment variable
 - Where found:
 	- [backend/app/startup.py](backend/app/startup.py#L61)