3 Commits

Author SHA1 Message Date
0d5882b32f Fix HIGH priority issues: unbounded queries, rate limiting, health checks
Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 21:47:36 +02:00
05c3b564ae Refactor scheduler lock implementation with heartbeat mechanism
- Add heartbeat-based lock renewal in scheduler_lock_heartbeat.py
- Update scheduler_lock.py with improved lock management
- Add comprehensive tests for scheduler lock functionality
- Update deployment and task documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:10:38 +02:00
187cd8250d Implement database-backed scheduler lock for multi-worker safety
Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID reuse is unreliable)
   - More reliable in rolling deployments

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:10:53 +02:00