Fix HIGH priority issues: unbounded queries, rate limiting, health checks

Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 21:47:36 +02:00
parent 1830da496d
commit 0d5882b32f
39 changed files with 2067 additions and 339 deletions

@@ -678,6 +678,63 @@ Planned observability improvements:
---
## Scheduler Lock Health Monitoring
The scheduler lock ensures only one instance runs background tasks. Monitoring its health is critical for production reliability.
### Key Metrics
Monitor these log events for scheduler lock health:
| Event | Level | Meaning |
|-------|-------|---------|
| `scheduler_lock_acquired` | info | Successfully acquired the scheduler lock |
| `scheduler_lock_held_by_other_instance` | warning | Another instance holds the lock (expected during normal multi-instance operation) |
| `scheduler_lock_stale_overwrite` | info | Took over a stale lock from a crashed instance |
| `scheduler_lock_heartbeat_lost` | warning | Heartbeat update failed; we lost the lock |
| `scheduler_lock_release_mismatch` | warning | Release attempted but we don't hold the lock |
### Lock Health Check
Query current lock status via `get_lock_health()`:
```python
from app.utils.scheduler_lock import get_lock_health
health = await get_lock_health(db)
# Returns: {"locked": bool, "pid": int|None, "hostname": str|None,
# "age_seconds": float|None, "is_stale": bool, "ttl_remaining": float|None}
```
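As a rough illustration of how the returned dictionary might be consumed, the sketch below maps it to a coarse status string. The helper name `summarize_lock_health`, the status labels, and the 10-second `expiring_threshold` are hypothetical; only the dictionary keys come from the comment above, and `ttl_remaining` is assumed to be in seconds.

```python
from typing import Any, Mapping


def summarize_lock_health(
    health: Mapping[str, Any], expiring_threshold: float = 10.0
) -> str:
    """Map a get_lock_health()-style dict to a coarse status label.

    Hypothetical helper: the labels and the threshold are illustrative only.
    """
    if not health.get("locked"):
        return "unlocked"  # nobody holds the lock
    if health.get("is_stale"):
        return "stale"  # holder stopped heartbeating
    ttl = health.get("ttl_remaining")
    if ttl is not None and ttl < expiring_threshold:
        return "expiring"  # held, but close to its timeout
    return "healthy"
```

A caller could log or export the returned label alongside `pid` and `hostname` to see which instance currently owns the lock.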
### Alerting Rules
**Critical alerts:**
- `scheduler_lock_acquired` not seen for >5 minutes during startup → Instance may not have acquired lock
- `scheduler_lock_heartbeat_lost` repeated >3 times → The lock is repeatedly being taken over, which suggests contention between instances
**Warning alerts:**
- `scheduler_lock_held_by_other_instance` every few minutes → Expected when multiple instances are running; abnormal in a single-instance deployment
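
The critical rules above could be evaluated over parsed log events roughly as follows. This is a minimal sketch, not the project's alerting code: the function `evaluate_lock_alerts`, the `(timestamp, event_name)` tuple format, and the choice to count `scheduler_lock_heartbeat_lost` over whatever window of events is passed in are all assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

STARTUP_GRACE_SECONDS = 5 * 60  # "not seen for >5 minutes during startup"
HEARTBEAT_LOST_LIMIT = 3        # "repeated >3 times"


def evaluate_lock_alerts(
    events: Iterable[Tuple[float, str]], started_at: float, now: float
) -> List[str]:
    """Return critical alert messages for a window of (timestamp, event) pairs.

    Hypothetical helper: the caller decides which time window of events to pass.
    """
    counts = Counter(name for _, name in events)
    alerts: List[str] = []

    # Rule 1: no acquisition logged once the startup grace period has passed.
    if counts["scheduler_lock_acquired"] == 0 and now - started_at > STARTUP_GRACE_SECONDS:
        alerts.append("critical: scheduler_lock_acquired not seen within 5 minutes of startup")

    # Rule 2: heartbeat lost more than the allowed number of times.
    if counts["scheduler_lock_heartbeat_lost"] > HEARTBEAT_LOST_LIMIT:
        alerts.append("critical: scheduler_lock_heartbeat_lost repeated more than 3 times")

    return alerts
```

In a multi-instance deployment the first rule would likely need to be scoped to the whole fleet rather than a single instance, since instances that do not hold the lock never log `scheduler_lock_acquired`.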
### Database Query
Check lock state directly in SQLite:
```sql
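SELECT pid, hostname, heartbeat_at, heartbeat_timeout,
       -- heartbeat_at is assumed to hold Unix epoch seconds, so the age is in seconds.
       -- Subtracting two datetime() strings would only compare their leading year digits.
       CAST(strftime('%s', 'now') AS INTEGER) - heartbeat_at AS age_seconds
FROM scheduler_lock WHERE id = 1;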
```
### Common Issues
1. **Lock not acquired on startup**: Check logs for `scheduler_lock_held_by_other_instance`. If another instance holds the lock, verify that the holding instance is healthy.
2. **Background tasks not running**: Use `get_lock_health()` to verify the lock is held. If it is not held, the instance cannot run scheduled tasks.
3. **Frequent lock steals**: If `scheduler_lock_stale_overwrite` occurs frequently, the heartbeat interval may be too long, or network latency may be causing false staleness detection.
---
## References
- [structlog Documentation](https://www.structlog.org/)