Fix HIGH priority issues: unbounded queries, rate limiting, health checks

Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
115
Docs/TROUBLESHOOTING.md
Normal file
@@ -0,0 +1,115 @@
# Troubleshooting Guide

## Scheduler Lock Issues

### Lock Held by Crashed Instance (Orphaned Lock)

**Symptom:** Background tasks stop running. Logs show `scheduler_lock_held_by_other_instance` but no other instance is running.

**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
```

If `heartbeat_at` is older than 5 minutes and the PID no longer exists, the lock is orphaned.
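
The orphan test above combines two conditions, so it can be scripted. The sketch below is a minimal illustration, assuming that `heartbeat_at` holds Unix epoch seconds and that the check runs on the host that owns the lock. The function name `lock_is_orphaned` is hypothetical and not part of BanGUI.

```bash
# Hypothetical helper, not part of BanGUI.
# Usage: lock_is_orphaned HEARTBEAT_EPOCH PID [NOW_EPOCH]
# Returns 0 (true) only when the heartbeat is older than 5 minutes
# AND no process with that PID can be signalled on this host.
lock_is_orphaned() {
    heartbeat_at=$1
    pid=$2
    now=${3:-$(date +%s)}
    max_age=300   # 5 minutes, the threshold used in this guide

    # A recent heartbeat means the holder is presumably still alive.
    if [ $(( now - heartbeat_at )) -le "$max_age" ]; then
        return 1
    fi

    # kill -0 delivers no signal; it only tests whether the PID can be signalled.
    # Caveat: it also fails for a live process owned by another user,
    # so run this as root or as the backend's user.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi

    return 0
}

# Example (assumes the lock table holds a single row):
#   set -- $(sqlite3 -separator ' ' /var/lib/bangui/bangui.db \
#       "SELECT heartbeat_at, pid FROM scheduler_lock;")
#   lock_is_orphaned "$1" "$2" && echo "lock is orphaned"
```

Requiring both conditions keeps the check conservative: a stale timestamp alone may just indicate a slow database, and a missing PID alone may mean the lock belongs to another host.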

**Recovery:**
```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```

Restart the backend. It will acquire the lock fresh.
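
A blanket `DELETE` also removes a lock that a healthy instance is still refreshing. As a more conservative variant, the sketch below builds a statement that deletes only rows whose heartbeat is older than a chosen age. It assumes that `heartbeat_at` stores Unix epoch seconds and that the installed SQLite provides `unixepoch()` (version 3.38 or later). `stale_lock_delete_sql` is a hypothetical helper name.

```bash
# Hypothetical helper, not part of BanGUI.
# Usage: stale_lock_delete_sql [MAX_AGE_SECONDS]
# Prints a DELETE statement restricted to stale locks.
stale_lock_delete_sql() {
    max_age=${1:-300}   # default: 5 minutes, as in the diagnosis above

    # Accept only non-negative integers so nothing else ends up in the SQL.
    case $max_age in
        *[!0-9]*)
            echo "max age must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    printf 'DELETE FROM scheduler_lock WHERE heartbeat_at < unixepoch() - %s;\n' "$max_age"
}

# Example:
#   sqlite3 /var/lib/bangui/bangui.db "$(stale_lock_delete_sql 300)"
```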

**Prevention:**
- Monitor `scheduler_lock_heartbeat_lost` events in logs
- If there are more than 3 occurrences per hour, investigate database I/O performance
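
The hourly threshold can be checked mechanically. The following sketch counts log lines on standard input that mention `scheduler_lock_heartbeat_lost` and reports whether the count exceeds a limit. Selecting the last hour of logs depends on the deployment and is left to the caller; the helper name and the one-event-per-line assumption are illustrative, not taken from BanGUI.

```bash
# Hypothetical helper, not part of BanGUI.
# Usage: heartbeat_lost_exceeds [LIMIT] < one-hour-of-logs
# Returns 0 when more than LIMIT log lines mention the event.
heartbeat_lost_exceeds() {
    limit=${1:-3}   # this guide suggests investigating above 3 per hour

    # grep -c prints the number of matching lines; it exits non-zero on
    # zero matches, which "|| true" tolerates.
    count=$(grep -c -F 'scheduler_lock_heartbeat_lost' || true)

    echo "heartbeat_lost events: $count"
    [ "$count" -gt "$limit" ]
}

# Example (last-hour.log is a placeholder for whatever slice of the
# backend log covers the past hour):
#   heartbeat_lost_exceeds 3 < last-hour.log && echo "investigate database I/O"
```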

---

### Two Instances Both Running Scheduler

**Symptom:** Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

**Cause:** Both instances believe they hold the lock.

**Diagnosis:**
1. Check which instance holds the lock: `SELECT pid, hostname FROM scheduler_lock;`
2. Compare with running processes: `ps aux | grep bangui`
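
When the check runs on one of the suspect hosts, the two diagnosis steps can be combined into a single test of whether the recorded lock holder is a live process on that host. This is only a sketch: it assumes the `hostname` column stores the name reported by `uname -n`, and `lock_holder_is_local` is a hypothetical name.

```bash
# Hypothetical helper, not part of BanGUI.
# Usage: lock_holder_is_local LOCK_HOSTNAME LOCK_PID
# Returns 0 when the lock row names this host and a PID that is alive here.
lock_holder_is_local() {
    lock_host=$1
    lock_pid=$2

    # Assumption: the hostname column matches what `uname -n` prints.
    [ "$lock_host" = "$(uname -n)" ] || return 1

    # kill -0 sends no signal; it only tests that the PID can be signalled
    # (it also fails for processes owned by another user).
    kill -0 "$lock_pid" 2>/dev/null
}

# Example (assumes the lock table holds a single row):
#   set -- $(sqlite3 -separator ' ' /var/lib/bangui/bangui.db \
#       "SELECT hostname, pid FROM scheduler_lock;")
#   lock_holder_is_local "$1" "$2" || echo "lock is held elsewhere or by a dead process"
```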

**Solution:**
1. Stop one instance immediately
2. Clear the lock: `DELETE FROM scheduler_lock;`
3. Restart the remaining instance

**Prevention:**
- Start instances one at a time, so that the first has begun its heartbeat before another starts
- Check that `BANGUI_SINGLE_INSTANCE=true` is set if single-instance operation is required

---

### Heartbeat Update Failures

**Symptom:** Logs show `scheduler_lock_heartbeat_lost` repeatedly, then the lock is lost.

**Cause:** Database writes are failing or extremely slow (more than 5 seconds per write).

**Diagnosis:**
```bash
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
```

If this takes longer than 1 second, database I/O is degraded. Note that the command overwrites the stored heartbeat, so a stale lock will look fresh afterwards.

**Solution:**
1. Check database integrity: `sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"`
2. Move the database to faster storage (SSD)
3. Check for other I/O bottlenecks on the host

---

### Lock Not Acquired at Startup

**Symptom:** Instance fails to start with error "Could not acquire scheduler lock".

**Cause:** Another instance already holds the lock and appears healthy.

**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>   # replace <pid> with the pid value returned by the query above
```

**Solution:**
- If the other instance is healthy and should run the scheduler: this instance must wait
- If the other instance has crashed: `DELETE FROM scheduler_lock;` then restart this instance
- If running a single instance: ensure no other instances are running before startup

---

## General Recovery Commands

Clear all locks:
```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```

Check lock status:
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```

Verify database integrity:
```bash
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
```

---

## Getting Help

If issues persist after following this guide:

1. Enable debug logging: `BANGUI_LOG_LEVEL=debug`
2. Collect logs around the failure time
3. Check `Docs/Deployment.md` for configuration guidance
4. Check `Docs/Observability.md` for monitoring setup