BanGUI/Docs/Tasks.md at 187cd8250d7c225d272d3fb62941a97515e31c47

Lukas 187cd8250d Implement database-backed scheduler lock for multi-worker safety

Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - The 10-second interval leaves margin under the 60-second TTL, so a live lock is not flagged as stale during temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)
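
The lock described in changes 1 and 2 can be sketched roughly as follows. This is a minimal illustration using the standard-library `sqlite3` module, not the actual contents of `app/utils/scheduler_lock.py`. The column names, the `open_db` helper, the `now` parameter (added so the TTL can be exercised without waiting), and the synchronous connection are all assumptions. Only the function names, the singleton row with `id=1`, and the 60-second TTL come from the commit message.

```python
import os
import socket
import sqlite3
import time

LOCK_TTL_SECONDS = 60  # stale-lock expiry stated in the commit message

# Assumed column names; the commit only lists what the row stores.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scheduler_lock (
    id           INTEGER PRIMARY KEY CHECK (id = 1),
    pid          INTEGER NOT NULL,
    hostname     TEXT    NOT NULL,
    created_at   REAL    NOT NULL,
    heartbeat_at REAL    NOT NULL
)
"""


def open_db(path):
    # isolation_level=None selects autocommit mode, so the explicit
    # BEGIN IMMEDIATE / COMMIT below are the only transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def acquire_scheduler_lock(conn, now=None):
    """Return True if the lock was taken, False if a live owner holds it."""
    now = time.time() if now is None else now
    # Take the write lock before reading, so check and insert are one step.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT heartbeat_at FROM scheduler_lock WHERE id = 1"
        ).fetchone()
        held = row is not None and now - row[0] < LOCK_TTL_SECONDS
        if not held:
            # No row, or the heartbeat is older than the TTL: claim the row.
            conn.execute(
                "INSERT OR REPLACE INTO scheduler_lock VALUES (1, ?, ?, ?, ?)",
                (os.getpid(), socket.gethostname(), now, now),
            )
        conn.execute("COMMIT")
        return not held
    except Exception:
        conn.execute("ROLLBACK")
        raise


def update_scheduler_lock_heartbeat(conn, now=None):
    """Refresh the heartbeat; return False if this process is not the owner."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE scheduler_lock SET heartbeat_at = ? "
        "WHERE id = 1 AND pid = ? AND hostname = ?",
        (now, os.getpid(), socket.gethostname()),
    )
    return cur.rowcount == 1


def release_scheduler_lock(conn):
    """Delete the lock row, but only if this process owns it."""
    conn.execute(
        "DELETE FROM scheduler_lock WHERE id = 1 AND pid = ? AND hostname = ?",
        (os.getpid(), socket.gethostname()),
    )
```

Staleness is decided purely from `heartbeat_at`, which matches the timestamp-based detection the commit describes. The PID and hostname are used only to make sure that heartbeat and release touch the caller's own lock.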

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID-based checks are unreliable because PIDs are reused)
   - More reliable in rolling deployments

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store
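
The heartbeat and shutdown behaviour behind the failover and handoff points above might look like the sketch below. It uses plain `asyncio` rather than whatever task registry the project actually uses. `heartbeat_loop`, `scheduler_lifespan`, and the two callables passed in are hypothetical names chosen for illustration. Only the 10-second interval and the "release before closing the database" ordering come from the commit message.

```python
import asyncio
import contextlib

HEARTBEAT_INTERVAL_SECONDS = 10  # interval stated in the commit message


async def heartbeat_loop(update_heartbeat, interval=HEARTBEAT_INTERVAL_SECONDS):
    """Call update_heartbeat() every `interval` seconds until cancelled."""
    while True:
        update_heartbeat()
        await asyncio.sleep(interval)


@contextlib.asynccontextmanager
async def scheduler_lifespan(
    update_heartbeat, release_lock, interval=HEARTBEAT_INTERVAL_SECONDS
):
    """Keep the lock alive while the app runs, then hand it back on exit."""
    task = asyncio.create_task(heartbeat_loop(update_heartbeat, interval))
    try:
        yield
    finally:
        # Stop heartbeating first so no update lands after the release.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        # Release while the database connection is still open.
        release_lock()
```

If the process crashes instead of shutting down, `release_lock` never runs and the heartbeat simply stops. The next instance can then take over once the TTL has elapsed.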

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-04-29 20:10:53 +02:00

37) Multi-worker safety check depends on one environment variable

Where found:
- backend/app/startup.py
Why this is needed:
- Other process managers can still launch multiple workers without this variable.
Goal:
- Enforce scheduler single-executor safety regardless of launcher.
What to do:
- Add robust single-run lock/leader mechanism for scheduler ownership.
Possible traps and issues:
- Locking strategy must be reliable in container orchestration.
Docs changes needed:
- Expand deployment constraints and supported run modes.
Doc references:
- Docs/Architekture.md
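
One way to enforce this regardless of launcher is a two-tier startup check: a cheap environment-variable test followed by an authoritative lock attempt. The sketch below is an assumption about shape, not the code in `backend/app/startup.py`. `SchedulerOwnershipError`, `check_single_worker_mode`, and the `try_acquire_lock` callable are hypothetical. `BANGUI_WORKERS` is the variable named in the commit message.

```python
import os


class SchedulerOwnershipError(RuntimeError):
    """Hypothetical error raised when startup must be refused."""


def check_single_worker_mode(try_acquire_lock, environ=None):
    """Raise SchedulerOwnershipError unless this process may run the scheduler."""
    environ = os.environ if environ is None else environ

    # Tier 1 (fast): reject an explicit multi-worker configuration early.
    raw = environ.get("BANGUI_WORKERS", "").strip()
    if raw.isdigit() and int(raw) > 1:
        raise SchedulerOwnershipError(
            f"BANGUI_WORKERS={raw}: the scheduler requires a single worker"
        )

    # Tier 2 (authoritative): works even when the variable is unset or
    # when another launcher started several workers.
    if not try_acquire_lock():
        raise SchedulerOwnershipError(
            "another instance already holds the scheduler lock"
        )
```

The first tier only catches misconfiguration that is visible in the environment. The second tier is what actually guarantees a single executor.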

38) History archive query paths may need explicit indexing plan

Where found:
- backend/app/db.py
- backend/app/repositories/history_archive_repo.py
Why this is needed:
- Large archive datasets can degrade filter/sort performance.
Goal:
- Add indexes aligned with real query patterns.
What to do:
- Benchmark common history queries.
- Add migration with targeted indexes.
Possible traps and issues:
- Extra indexes increase write cost and DB size.
Docs changes needed:
- Add DB performance/indexing section for history.
Doc references:
- Docs/Backend-Development.md
- https://www.sqlite.org/queryplanner.html
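
Before adding a migration, the effect of a candidate index can be checked with `EXPLAIN QUERY PLAN`. The sketch below assumes a hypothetical `history_archive` table with `jail`, `ip`, and `banned_at` columns and a hypothetical filter-and-sort query. The real schema and query patterns should come from the benchmark step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute(
    "CREATE TABLE history_archive ("
    " id INTEGER PRIMARY KEY,"
    " jail TEXT NOT NULL,"
    " ip TEXT NOT NULL,"
    " banned_at INTEGER NOT NULL)"
)

# Hypothetical "filter by jail, newest first" query.
QUERY = (
    "SELECT ip, banned_at FROM history_archive "
    "WHERE jail = ? ORDER BY banned_at DESC LIMIT 50"
)


def query_plan(conn, sql, params):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(conn, QUERY, ("sshd",))

# Composite index matching the filter column, then the sort column.
conn.execute(
    "CREATE INDEX idx_history_archive_jail_banned_at "
    "ON history_archive (jail, banned_at)"
)
plan_after = query_plan(conn, QUERY, ("sshd",))

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans shows whether the planner switches from a table scan to an index search. An index that never shows up in a plan for a real query only adds write cost and database size.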

39) No explicit DI container strategy for backend service graph

Where found:
- backend/app/dependencies.py
- backend/app/services
Why this is needed:
- Dependency construction and lifecycle are partly implicit.
Goal:
- Define a clear dependency wiring pattern for services and repositories.
What to do:
- Create service composition root pattern and document usage.
Possible traps and issues:
- Over-engineering if container abstraction is too heavy for current size.
Docs changes needed:
- Add dependency wiring chapter.
Doc references:
- Docs/Architekture.md
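
A lightweight composition root, without a DI framework, could look like the sketch below. All class and function names are hypothetical and do not reflect the current contents of `backend/app/dependencies.py` or `backend/app/services`. The point is only that object construction and wiring live in one function.

```python
import sqlite3
from dataclasses import dataclass


class HistoryArchiveRepository:
    """Hypothetical repository: owns SQL, knows nothing about services."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def count(self) -> int:
        row = self._conn.execute("SELECT COUNT(*) FROM history_archive").fetchone()
        return row[0]


class HistoryService:
    """Hypothetical service: receives its repository, never constructs it."""

    def __init__(self, repo: HistoryArchiveRepository) -> None:
        self._repo = repo

    def summary(self) -> dict:
        return {"archived_entries": self._repo.count()}


@dataclass(frozen=True)
class ServiceContainer:
    """Plain holder for wired services; no reflection or auto-wiring."""

    history: HistoryService


def build_services(conn: sqlite3.Connection) -> ServiceContainer:
    """Composition root: the only place that constructs and wires objects."""
    repo = HistoryArchiveRepository(conn)
    return ServiceContainer(history=HistoryService(repo))
```

Because the services take their dependencies as constructor arguments, tests can pass an in-memory connection or a fake repository. That keeps the pattern small enough to avoid the over-engineering risk noted above.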

40) Frontend and backend observability are not aligned

Where found:
- backend/app/main.py
- frontend/src
Why this is needed:
- Backend uses structured logging while frontend error telemetry is mostly local and ad-hoc.
Goal:
- Define unified error telemetry and correlation approach.
What to do:
- Introduce frontend error reporting pipeline and request correlation IDs.
Possible traps and issues:
- PII/sensitive payload leakage risk in client-side telemetry.
Docs changes needed:
- Add observability and privacy-safe logging guidelines.
Doc references:
- Docs/Architekture.md
- Docs/Web-Development.md
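
On the backend side, request correlation can be sketched with a context variable and a logging filter. The `X-Request-ID` header name, `resolve_request_id`, and `handle_request` are assumptions made for illustration. The frontend would send the same ID with its error reports so both sides can be joined on it. The ID is opaque, so it carries no user data.

```python
import contextvars
import logging
import uuid

# Holds the current request's ID; "-" marks log lines outside any request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def resolve_request_id(headers):
    """Reuse the client's ID if it sent one, otherwise generate a new one."""
    incoming = headers.get("x-request-id", "").strip()
    return incoming or uuid.uuid4().hex


def handle_request(headers, logger):
    """Hypothetical handler: set the ID, log, and echo the ID back."""
    token = request_id_var.set(resolve_request_id(headers))
    try:
        logger.info("request handled")
        # Echoing the ID lets the frontend attach it to its own reports.
        return {"x-request-id": request_id_var.get()}
    finally:
        request_id_var.reset(token)
```

A formatter that includes `%(request_id)s` then places the ID in every backend log line produced while the request is being handled.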
