Fix HIGH priority issues: unbounded queries, rate limiting, health checks

Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 21:47:36 +02:00
parent 1830da496d
commit 0d5882b32f
39 changed files with 2067 additions and 339 deletions
--- a/Docs/Observability.md
+++ b/Docs/Observability.md
@@ -678,6 +678,63 @@ Planned observability improvements:

 ---

+## Scheduler Lock Health Monitoring
+
+The scheduler lock ensures only one instance runs background tasks. Monitoring its health is critical for production reliability.
+
+### Key Metrics
+
+Monitor these log events for scheduler lock health:
+
+| Event | Level | Meaning |
+|-------|-------|---------|
+| `scheduler_lock_acquired` | info | Successfully acquired the scheduler lock |
+| `scheduler_lock_held_by_other_instance` | warning | Another instance holds the lock (expected during normal multi-instance operation) |
+| `scheduler_lock_stale_overwrite` | info | Took over a stale lock from a crashed instance |
+| `scheduler_lock_heartbeat_lost` | warning | Heartbeat update failed; we lost the lock |
+| `scheduler_lock_release_mismatch` | warning | Release attempted but we don't hold the lock |
+
+### Lock Health Check
+
+Query current lock status via `get_lock_health()`:
+
+```python
+from app.utils.scheduler_lock import get_lock_health
+
+health = await get_lock_health(db)
+# Returns: {"locked": bool, "pid": int|None, "hostname": str|None,
+#           "age_seconds": float|None, "is_stale": bool, "ttl_remaining": float|None}
+```
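
One way the returned dictionary might be reduced to a single status for a dashboard or readiness probe is sketched below. The keys are taken from the comment in the example above; the `classify_lock_health` helper, its status strings, and the 10-second `low_ttl_seconds` threshold are hypothetical and not part of `app.utils.scheduler_lock`.

```python
from typing import Any, Mapping


def classify_lock_health(health: Mapping[str, Any], low_ttl_seconds: float = 10.0) -> str:
    """Map a get_lock_health()-style dict to a coarse status (hypothetical helper)."""
    if not health.get("locked"):
        # No lock row is held at the moment.
        return "unlocked"
    if health.get("is_stale"):
        # The holder has stopped heartbeating within its timeout.
        return "stale"
    ttl = health.get("ttl_remaining")
    if ttl is not None and ttl < low_ttl_seconds:
        # The lock is held, but the heartbeat is close to expiring.
        return "expiring"
    return "healthy"


# Sample dict shaped like the documented return value (values are made up).
sample = {"locked": True, "pid": 4242, "hostname": "worker-1",
          "age_seconds": 12.5, "is_stale": False, "ttl_remaining": 47.5}
status = classify_lock_health(sample)
print(status)
```

Checking `is_stale` before `ttl_remaining` keeps a stale lock from being reported as merely "expiring"; the order of the checks is a design choice in this sketch, not documented behavior.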
+
+### Alerting Rules
+
+**Critical alerts:**
+- `scheduler_lock_acquired` not seen for >5 minutes during startup → Instance may not have acquired lock
+- `scheduler_lock_heartbeat_lost` repeated >3 times → Lock keeps being stolen, possible contention issue
+
+**Warning alerts:**
+- `scheduler_lock_held_by_other_instance` every few minutes → Normal if multiple instances, abnormal if single instance
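
The rules above could be evaluated over a batch of collected log event names roughly as follows. This is a minimal sketch: the `evaluate_lock_alerts` function, its parameters, and the alert strings are hypothetical, and the window over which heartbeat losses are counted is simply "the events passed in", since the rules do not specify one.

```python
from datetime import datetime, timedelta
from typing import List


def evaluate_lock_alerts(event_names: List[str], started_at: datetime, now: datetime,
                         instance_count: int = 1,
                         acquire_deadline: timedelta = timedelta(minutes=5),
                         max_heartbeat_losses: int = 3) -> List[str]:
    """Return alert strings for a batch of scheduler lock log event names."""
    alerts = []

    # Critical: the startup deadline passed without an acquire event.
    if "scheduler_lock_acquired" not in event_names and now - started_at > acquire_deadline:
        alerts.append("critical: scheduler_lock_acquired not seen after startup")

    # Critical: heartbeat lost more than max_heartbeat_losses times in this batch.
    if event_names.count("scheduler_lock_heartbeat_lost") > max_heartbeat_losses:
        alerts.append("critical: repeated scheduler_lock_heartbeat_lost")

    # Warning: "held by other instance" is only abnormal in a single-instance deployment.
    if instance_count == 1 and "scheduler_lock_held_by_other_instance" in event_names:
        alerts.append("warning: lock held by another instance in single-instance deployment")

    return alerts


start = datetime(2026, 5, 1, 12, 0, 0)
batch = ["scheduler_lock_heartbeat_lost"] * 4
alerts = evaluate_lock_alerts(batch, started_at=start, now=start + timedelta(minutes=6))
for alert in alerts:
    print(alert)
```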
+
+### Database Query
+
+Check lock state directly in SQLite:
+
+```sql
+SELECT pid, hostname, heartbeat_at, heartbeat_timeout,
+       (strftime('%s', 'now') - heartbeat_at) AS age_seconds
+FROM scheduler_lock WHERE id = 1;
+```
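
For scripted checks, a similar lookup can be done from Python with the standard `sqlite3` module. The sketch below is self-contained and runs against an in-memory table; the column types, the assumption that `heartbeat_at` holds Unix epoch seconds and `heartbeat_timeout` is in seconds, and the `read_lock_row` helper are all assumptions for illustration, not the project's actual schema or API.

```python
import sqlite3
import time

# Assumed schema: heartbeat_at is Unix epoch seconds, heartbeat_timeout is seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduler_lock ("
    " id INTEGER PRIMARY KEY, pid INTEGER, hostname TEXT,"
    " heartbeat_at REAL, heartbeat_timeout REAL)"
)
now = time.time()
# Made-up row: last heartbeat 20 seconds ago, 60-second timeout.
conn.execute(
    "INSERT INTO scheduler_lock VALUES (1, ?, ?, ?, ?)",
    (4242, "worker-1", now - 20.0, 60.0),
)


def read_lock_row(conn: sqlite3.Connection, now: float):
    """Return lock age and staleness for row id = 1, or None if no row exists."""
    row = conn.execute(
        "SELECT pid, hostname, ? - heartbeat_at AS age_seconds,"
        " (? - heartbeat_at) > heartbeat_timeout AS is_stale"
        " FROM scheduler_lock WHERE id = 1",
        (now, now),
    ).fetchone()
    if row is None:
        return None
    pid, hostname, age_seconds, is_stale = row
    return {"pid": pid, "hostname": hostname,
            "age_seconds": age_seconds, "is_stale": bool(is_stale)}


info = read_lock_row(conn, now)
print(info)
```

Passing `now` as a parameter instead of calling SQLite's clock keeps the age calculation reproducible in tests.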
+
+### Common Issues
+
+1. **Lock not acquired on startup**: Check logs for `scheduler_lock_held_by_other_instance`. If another instance holds the lock, verify that the holding instance is healthy.
+
+2. **Background tasks not running**: Use `get_lock_health()` to verify the lock is held. If not held, the instance cannot run scheduled tasks.
+
+3. **Frequent lock steals**: If `scheduler_lock_stale_overwrite` occurs frequently, the heartbeat interval may be too long, or network latency may be causing false staleness detection.
+
+---
+
 ## References

 - [structlog Documentation](https://www.structlog.org/)