BanGUI/Docs/Tasks.md
Lukas 187cd8250d Implement database-backed scheduler lock for multi-worker safety
Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents a live lock from being misdetected as stale during temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID reuse is unreliable)
   - More reliable in rolling deployments
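
The acquire, heartbeat, and release steps listed above can be sketched with the standard `sqlite3` module. This is a minimal illustration, not the code in `app/utils/scheduler_lock.py`: the column names, the function signatures, the `now` parameter, and the `open_db` helper are assumptions made for the example. Only the singleton row (id=1), the stored fields, and the 60-second TTL come from the commit message.

```python
import os
import socket
import sqlite3
import time
from typing import Optional

LOCK_TTL_SECONDS = 60  # stale-lock expiry named in the commit message

# Hypothetical schema: the real migration 4 may use other column names.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    pid INTEGER NOT NULL,
    hostname TEXT NOT NULL,
    created_at REAL NOT NULL,
    heartbeat_at REAL NOT NULL
)
"""


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None disables implicit transactions so that
    # BEGIN IMMEDIATE / COMMIT below are issued explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def acquire_scheduler_lock(conn: sqlite3.Connection, now: Optional[float] = None) -> bool:
    """Return True if this process now owns the lock, False otherwise."""
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock up front, so the stale check
    # and the insert cannot interleave with another writer.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "DELETE FROM scheduler_lock WHERE id = 1 AND heartbeat_at < ?",
            (now - LOCK_TTL_SECONDS,),
        )
        cur = conn.execute(
            "INSERT OR IGNORE INTO scheduler_lock "
            "(id, pid, hostname, created_at, heartbeat_at) VALUES (1, ?, ?, ?, ?)",
            (os.getpid(), socket.gethostname(), now, now),
        )
        acquired = cur.rowcount == 1  # 0 means a live lock row already exists
        conn.execute("COMMIT")
        return acquired
    except Exception:
        conn.execute("ROLLBACK")
        raise


def update_scheduler_lock_heartbeat(conn: sqlite3.Connection, now: Optional[float] = None) -> bool:
    """Refresh the heartbeat; returns False if this process no longer owns the lock."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scheduler_lock SET heartbeat_at = ? "
        "WHERE id = 1 AND pid = ? AND hostname = ?",
        (now, os.getpid(), socket.gethostname()),
    )
    return cur.rowcount == 1


def release_scheduler_lock(conn: sqlite3.Connection) -> bool:
    """Delete the lock row if this process owns it."""
    cur = conn.execute(
        "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ? AND hostname = ?",
        (os.getpid(), socket.gethostname()),
    )
    return cur.rowcount == 1
```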

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:10:53 +02:00


## 37) Multi-worker safety check depends on one environment variable
- Where found:
  - [backend/app/startup.py](backend/app/startup.py#L61)
- Why this is needed:
  - Other process managers can still launch multiple workers without this variable.
- Goal:
  - Enforce scheduler single-executor safety regardless of launcher.
- What to do:
  - Add robust single-run lock/leader mechanism for scheduler ownership.
- Possible traps and issues:
  - Locking strategy must be reliable in container orchestration.
- Docs changes needed:
  - Expand deployment constraints and supported run modes.
- Doc references:
  - [Docs/Architekture.md](Docs/Architekture.md)
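
As a rough sketch of how the environment variable can become a fast pre-check while a lock stays authoritative, the function below first inspects `BANGUI_WORKERS` and then asks a caller-supplied callback whether scheduler ownership could be acquired. The names `check_single_executor`, `MultiWorkerError`, and `try_acquire_lock` are hypothetical and do not describe the actual code in `backend/app/startup.py`.

```python
import os
from typing import Callable, Mapping


class MultiWorkerError(RuntimeError):
    """Raised when this process must not start the scheduler (hypothetical name)."""


def check_single_executor(
    try_acquire_lock: Callable[[], bool],
    env: Mapping[str, str] = os.environ,
) -> None:
    # Tier 1 (fast): reject an explicit multi-worker configuration early.
    # A missing or non-numeric value is not treated as proof of safety;
    # it simply falls through to the authoritative check.
    raw = env.get("BANGUI_WORKERS")
    if raw is not None and raw.isdigit() and int(raw) > 1:
        raise MultiWorkerError(
            f"BANGUI_WORKERS={raw}: the scheduler requires a single executor"
        )

    # Tier 2 (authoritative): only the process that obtains the lock may
    # run the scheduler, whichever launcher started the workers.
    if not try_acquire_lock():
        raise MultiWorkerError("another process already owns the scheduler lock")
```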
---
## 38) History archive query paths may need explicit indexing plan
- Where found:
  - [backend/app/db.py](backend/app/db.py)
  - [backend/app/repositories/history_archive_repo.py](backend/app/repositories/history_archive_repo.py)
- Why this is needed:
  - Large archive datasets can degrade filter/sort performance.
- Goal:
  - Add indexes aligned with real query patterns.
- What to do:
  - Benchmark common history queries.
  - Add migration with targeted indexes.
- Possible traps and issues:
  - Extra indexes increase write cost and DB size.
- Docs changes needed:
  - Add DB performance/indexing section for history.
- Doc references:
  - [Docs/Backend-Development.md](Docs/Backend-Development.md)
  - https://www.sqlite.org/queryplanner.html
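
One way to check whether a candidate index matches a query before writing the migration is to compare the output of SQLite's `EXPLAIN QUERY PLAN` with and without the index. The sketch below does this on an in-memory database. The table layout, column names, query, and index name are hypothetical stand-ins, not the real history archive schema.

```python
import sqlite3
from typing import List, Sequence


def query_plan(conn: sqlite3.Connection, sql: str, params: Sequence = ()) -> List[str]:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail).
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical table: the actual archive schema may differ.
conn.execute(
    "CREATE TABLE history_archive ("
    "id INTEGER PRIMARY KEY, jail TEXT, ip TEXT, timeofban INTEGER)"
)

# A filter-plus-sort query of the kind the task wants benchmarked.
query = (
    "SELECT id FROM history_archive "
    "WHERE jail = ? ORDER BY timeofban DESC LIMIT 50"
)

plan_before = query_plan(conn, query, ("sshd",))

# Composite index: equality column first, then the sort column.
conn.execute(
    "CREATE INDEX idx_history_archive_jail_time "
    "ON history_archive (jail, timeofban)"
)
plan_after = query_plan(conn, query, ("sshd",))

print("before:", plan_before)
print("after: ", plan_after)
```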
---
## 39) No explicit DI container strategy for backend service graph
- Where found:
  - [backend/app/dependencies.py](backend/app/dependencies.py)
  - [backend/app/services](backend/app/services)
- Why this is needed:
  - Dependency construction and lifecycle are partly implicit.
- Goal:
  - Define a clear dependency wiring pattern for services and repositories.
- What to do:
  - Create service composition root pattern and document usage.
- Possible traps and issues:
  - Over-engineering if container abstraction is too heavy for current size.
- Docs changes needed:
  - Add dependency wiring chapter.
- Doc references:
  - [Docs/Architekture.md](Docs/Architekture.md)
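
A composition root does not need a DI framework. The sketch below wires one repository and one service in a single function and hands them out through a frozen dataclass, which keeps construction order and lifetime explicit. All class and function names are hypothetical and only illustrate the pattern; they are not taken from `backend/app/dependencies.py`.

```python
import sqlite3
from dataclasses import dataclass


class HistoryRepository:
    """Hypothetical repository that owns all SQL for one table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]


class HistoryService:
    """Hypothetical service that depends on the repository, not on the connection."""

    def __init__(self, repo: HistoryRepository) -> None:
        self.repo = repo

    def summary(self) -> str:
        return f"{self.repo.count()} archived entries"


@dataclass(frozen=True)
class Container:
    """Plain holder for the wired object graph; no container library involved."""

    conn: sqlite3.Connection
    history_repo: HistoryRepository
    history_service: HistoryService


def build_container(db_path: str) -> Container:
    # The composition root: the only place where concrete classes are
    # instantiated and connected to each other.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS history (id INTEGER PRIMARY KEY)")
    repo = HistoryRepository(conn)
    service = HistoryService(repo)
    return Container(conn=conn, history_repo=repo, history_service=service)
```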
---
## 40) Frontend and backend observability are not aligned
- Where found:
  - [backend/app/main.py](backend/app/main.py)
  - [frontend/src](frontend/src)
- Why this is needed:
  - Backend uses structured logging while frontend error telemetry is mostly local and ad-hoc.
- Goal:
  - Define unified error telemetry and correlation approach.
- What to do:
  - Introduce frontend error reporting pipeline and request correlation IDs.
- Possible traps and issues:
  - PII/sensitive payload leakage risk in client-side telemetry.
- Docs changes needed:
  - Add observability and privacy-safe logging guidelines.
- Doc references:
  - [Docs/Architekture.md](Docs/Architekture.md)
  - [Docs/Web-Development.md](Docs/Web-Development.md)
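
On the backend side, request correlation could be sketched with `contextvars` and a `logging.Filter`, so that every log record emitted while handling a request carries the same ID the frontend sent. The function and variable names are hypothetical, and how the ID travels (for example in a request header) is left open. The ID is restricted to a short opaque token so that client-supplied values cannot smuggle sensitive or malformed data into the logs.

```python
import contextvars
import logging
import re
import uuid
from typing import Optional

# Holds the correlation ID for the request currently being handled.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)

# Accept only short opaque tokens from the client (assumed policy).
_SAFE_ID = re.compile(r"[A-Za-z0-9-]{1,64}")


def bind_request_id(incoming: Optional[str]) -> str:
    """Use the client-supplied ID if it looks safe, else generate a new one."""
    if incoming is not None and _SAFE_ID.fullmatch(incoming):
        request_id = incoming
    else:
        request_id = uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


class RequestIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def configure_logging() -> logging.Logger:
    handler = logging.StreamHandler()
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    logger = logging.getLogger("correlation_demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A request handler would call `bind_request_id(...)` once at the start of each request and return the resulting ID to the client, which lets a frontend error report reference the matching backend log lines.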