Fix HIGH priority issues: unbounded queries, rate limiting, health checks

Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -678,6 +678,63 @@ Planned observability improvements:

---

## Scheduler Lock Health Monitoring

The scheduler lock ensures only one instance runs background tasks. Monitoring its health is critical for production reliability.

### Key Metrics

Monitor these log events for scheduler lock health:

| Event | Level | Meaning |
|-------|-------|---------|
| `scheduler_lock_acquired` | info | Successfully acquired the scheduler lock |
| `scheduler_lock_held_by_other_instance` | warning | Another instance holds the lock (expected during normal multi-instance operation) |
| `scheduler_lock_stale_overwrite` | info | Took over a stale lock from a crashed instance |
| `scheduler_lock_heartbeat_lost` | warning | Heartbeat update failed; we lost the lock |
| `scheduler_lock_release_mismatch` | warning | Release attempted but we don't hold the lock |

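
As a rough illustration of how these events could be tallied, the sketch below counts scheduler lock events in JSON-formatted log lines. It assumes logs are written one JSON object per line with the event name under the `event` key (the layout structlog's JSON renderer produces by default; the project's actual logging configuration is not shown here). `count_lock_events` is a hypothetical helper, not part of the codebase.

```python
import json
from collections import Counter

# Event names taken from the Key Metrics table.
LOCK_EVENTS = {
    "scheduler_lock_acquired",
    "scheduler_lock_held_by_other_instance",
    "scheduler_lock_stale_overwrite",
    "scheduler_lock_heartbeat_lost",
    "scheduler_lock_release_mismatch",
}


def count_lock_events(lines):
    """Count scheduler lock events in an iterable of JSON log lines.

    Assumption: one JSON object per line, event name under "event".
    Lines that are not valid JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except (json.JSONDecodeError, TypeError):
            continue
        if not isinstance(record, dict):
            continue
        event = record.get("event")
        if isinstance(event, str) and event in LOCK_EVENTS:
            counts[event] += 1
    return counts
```

The resulting counter can be fed into whatever alerting mechanism the deployment already uses.
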
### Lock Health Check

Query current lock status via `get_lock_health()`:

```python
from app.utils.scheduler_lock import get_lock_health

# get_lock_health() is a coroutine, so call it from async code.
health = await get_lock_health(db)
# Returns: {"locked": bool, "pid": int|None, "hostname": str|None,
#           "age_seconds": float|None, "is_stale": bool, "ttl_remaining": float|None}
```

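
One way to consume the returned dictionary is to collapse it into a coarse status for a dashboard or health endpoint. The sketch below uses only the keys documented for `get_lock_health()`; the function name `summarize_lock_health`, the status strings, and the `warn_ttl_seconds` threshold are illustrative assumptions rather than part of the application.

```python
def summarize_lock_health(health, warn_ttl_seconds=10.0):
    """Map a get_lock_health()-style dict to a coarse status string.

    The status names and the TTL warning threshold are hypothetical.
    """
    if not health.get("locked"):
        return "unlocked"   # nobody holds the lock
    if health.get("is_stale"):
        return "stale"      # holder stopped heartbeating
    ttl = health.get("ttl_remaining")
    if ttl is not None and ttl < warn_ttl_seconds:
        return "expiring"   # held, but close to timing out
    return "healthy"
```

A status of `"unlocked"` or `"stale"` that persists across several checks suggests that no instance is currently running scheduled tasks.
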
### Alerting Rules

**Critical alerts:**
- `scheduler_lock_acquired` not seen for >5 minutes during startup → Instance may not have acquired lock
- `scheduler_lock_heartbeat_lost` repeated >3 times → Lock keeps being stolen, possible contention issue

**Warning alerts:**
- `scheduler_lock_held_by_other_instance` every few minutes → Normal if multiple instances, abnormal if single instance

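
The rules above can be sketched as a small evaluation function. This is a minimal illustration, assuming event counts have already been aggregated across the deployment (for example from log lines) and that the caller knows how long ago startup happened; `evaluate_lock_alerts` and its parameters are hypothetical names, and the thresholds simply mirror the numbers in the rules.

```python
def evaluate_lock_alerts(counts, seconds_since_start, instance_count=1):
    """Return a list of (severity, message) tuples for the alerting rules.

    counts: mapping of event name -> number of occurrences (assumed to be
            aggregated over all instances in the deployment).
    seconds_since_start: time elapsed since the deployment started.
    instance_count: how many application instances are expected to run.
    """
    alerts = []

    # Critical: no acquisition seen more than 5 minutes after startup.
    if counts.get("scheduler_lock_acquired", 0) == 0 and seconds_since_start > 300:
        alerts.append(("critical", "scheduler lock not acquired within 5 minutes of startup"))

    # Critical: heartbeat lost more than 3 times.
    if counts.get("scheduler_lock_heartbeat_lost", 0) > 3:
        alerts.append(("critical", "scheduler lock heartbeat lost repeatedly"))

    # Warning: lock held elsewhere although only one instance should exist.
    if instance_count == 1 and counts.get("scheduler_lock_held_by_other_instance", 0) > 0:
        alerts.append(("warning", "lock held by another instance in a single-instance deployment"))

    return alerts
```
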
### Database Query

Check lock state directly in SQLite:

```sql
-- heartbeat_at is treated as a Unix epoch timestamp, so age_seconds is in seconds.
-- Subtracting two datetime() strings would not yield an elapsed time in SQLite.
SELECT pid, hostname, heartbeat_at, heartbeat_timeout,
       (strftime('%s', 'now') - heartbeat_at) AS age_seconds
FROM scheduler_lock WHERE id = 1;
```

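
The same check can be scripted with Python's standard `sqlite3` module. The sketch below is built on assumptions: `heartbeat_at` is taken to be a Unix epoch timestamp and `heartbeat_timeout` a duration in seconds, the lock is considered stale once its age exceeds that timeout, and the `CREATE TABLE` statement is a hypothetical schema used only to make the demo self-contained. `read_lock_row` is an illustrative helper, not an application function.

```python
import sqlite3
import time


def read_lock_row(conn, now=None):
    """Read the single scheduler_lock row and compute its age in seconds.

    Assumes heartbeat_at is a Unix epoch timestamp and heartbeat_timeout
    is expressed in seconds. Returns None if no lock row exists.
    """
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT pid, hostname, heartbeat_at, heartbeat_timeout "
        "FROM scheduler_lock WHERE id = 1"
    ).fetchone()
    if row is None:
        return None
    pid, hostname, heartbeat_at, heartbeat_timeout = row
    age = now - heartbeat_at
    return {
        "pid": pid,
        "hostname": hostname,
        "age_seconds": age,
        # Assumed staleness rule: older than the configured timeout.
        "is_stale": age > heartbeat_timeout,
    }


# Demo against an in-memory database with a hypothetical schema.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE scheduler_lock (id INTEGER PRIMARY KEY, pid INTEGER, "
    "hostname TEXT, heartbeat_at REAL, heartbeat_timeout REAL)"
)
demo.execute(
    "INSERT INTO scheduler_lock VALUES (1, 4242, 'worker-1', ?, 30)", (1000.0,)
)
print(read_lock_row(demo, now=1012.5))
```

Passing `now` explicitly keeps the helper deterministic for tests; in production use it can be omitted so the current time is used.
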
### Common Issues

1. **Lock not acquired on startup**: Check logs for `scheduler_lock_held_by_other_instance`. If another instance holds the lock, verify that the instance is healthy.

2. **Background tasks not running**: Use `get_lock_health()` to verify the lock is held. If it is not held, the instance cannot run scheduled tasks.

3. **Frequent lock steals**: If `scheduler_lock_stale_overwrite` occurs frequently, the heartbeat interval may be too long or network latency may be causing false staleness detection.


---

## References

- [structlog Documentation](https://www.structlog.org/)