Lukas 0d5882b32f Fix HIGH priority issues: unbounded queries, rate limiting, health checks

2026-05-01 21:47:36 +02:00

Troubleshooting Guide

Scheduler Lock Issues

Lock Held by Crashed Instance (Orphaned Lock)

Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"

If heartbeat_at is more than 5 minutes old and the PID no longer exists on the host named in hostname, the lock is orphaned.
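
The two-part check above (stale heartbeat and missing process) can be scripted. The following is a minimal Python sketch, not part of BanGUI. It assumes that heartbeat_at holds Unix epoch seconds, that scheduler_lock contains at most one row, and that the script runs on the host named in hostname, since a PID can only be probed locally. The function names pid_exists, is_orphaned, and lock_status are hypothetical.

```python
import os
import sqlite3
import time

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute threshold used in this guide


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_orphaned(heartbeat_at: float, pid_alive: bool, now: float) -> bool:
    """A lock is orphaned only if the heartbeat is stale AND the process is gone."""
    return (now - heartbeat_at) > STALE_AFTER_SECONDS and not pid_alive


def lock_status(conn: sqlite3.Connection) -> str:
    """Summarise the lock row; assumes heartbeat_at holds Unix epoch seconds."""
    row = conn.execute(
        "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
    ).fetchone()
    if row is None:
        return "no lock held"
    pid, hostname, heartbeat_at = row
    if is_orphaned(float(heartbeat_at), pid_exists(int(pid)), time.time()):
        return f"orphaned (pid {pid} on {hostname})"
    return f"held (pid {pid} on {hostname})"
```

For example, lock_status(sqlite3.connect("/var/lib/bangui/bangui.db")) returns a one-line summary. The sketch only reads the table; clearing the lock remains a deliberate manual step.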

Recovery:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Restart the backend. It will acquire a fresh lock on startup.

Prevention:

Monitor the logs for scheduler_lock_heartbeat_lost events.
If more than 3 occur per hour, investigate database I/O performance.
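
The hourly threshold above can be checked against a log file. The sketch below is illustrative only: the log format is not specified in this guide, so it assumes each line begins with an ISO 8601 timestamp (for example 2026-05-01T21:05:00) followed by the event name. The function noisy_hours is hypothetical, and it counts per calendar hour rather than over a sliding 60-minute window.

```python
from collections import Counter
from datetime import datetime

EVENT = "scheduler_lock_heartbeat_lost"
MAX_PER_HOUR = 3  # threshold from this guide


def noisy_hours(lines, event=EVENT, limit=MAX_PER_HOUR):
    """Return {hour: count} for calendar hours with more than `limit` events."""
    counts = Counter()
    for line in lines:
        if event not in line:
            continue
        stamp = line.split(maxsplit=1)[0]  # assumed: timestamp is the first token
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines whose first token is not a timestamp
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return {hour: n for hour, n in counts.items() if n > limit}
```

Passing an open log file to noisy_hours works as well, because the function only iterates over lines. An empty result means no hour exceeded the threshold.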

Two Instances Both Running Scheduler

Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

Cause: Both instances believe they hold the lock.

Diagnosis:

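Check which instance holds the lock: sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname FROM scheduler_lock;"
Compare with running processes on each host: ps aux | grep '[b]angui' (the brackets keep grep from matching its own process)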

Solution:

1. Stop one instance immediately.
2. Clear the lock: sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
3. Restart the remaining instance so that it acquires the lock cleanly.

Prevention:

Start instances one at a time, so that the first has acquired the lock and begun its heartbeat before the next one starts.
Check that BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required.

Heartbeat Update Failures

Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, and the instance then loses the lock.

Cause: Database writes failing or extremely slow (>5 seconds per write).

Diagnosis:

time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"

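If this takes more than 1 second, database I/O is degraded. Note that this command overwrites heartbeat_at for every row in scheduler_lock, and that unixepoch() requires SQLite 3.38.0 or later.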
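
The timing test can also be run without changing the stored heartbeat. The sketch below is an assumption-laden alternative, not BanGUI code: it assigns heartbeat_at to itself, which should still go through SQLite's write and commit path while leaving the value as it was. The names time_heartbeat_write and is_degraded are hypothetical.

```python
import sqlite3
import time


def time_heartbeat_write(conn: sqlite3.Connection) -> float:
    """Return the seconds taken to write and commit a no-change update."""
    start = time.perf_counter()
    # Self-assignment leaves the stored heartbeat value untouched.
    conn.execute("UPDATE scheduler_lock SET heartbeat_at = heartbeat_at")
    conn.commit()
    return time.perf_counter() - start


def is_degraded(elapsed_seconds: float, threshold_seconds: float = 1.0) -> bool:
    """Apply the 1-second threshold from this guide."""
    return elapsed_seconds > threshold_seconds
```

For example, is_degraded(time_heartbeat_write(sqlite3.connect("/var/lib/bangui/bangui.db"))) gives a single yes/no answer. Running it a few times and looking at the slowest result is more informative than one sample.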

Solution:

Check database integrity: sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
Move database to faster storage (SSD)
Check for other I/O bottlenecks on the host

Lock Not Acquired at Startup

Symptom: Instance fails to start with error "Could not acquire scheduler lock".

Cause: Another instance already holds the lock and appears healthy.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid> -o pid,etime,args

Solution:

If the other instance is healthy and should run the scheduler: this instance must wait.
If the other instance has crashed: run DELETE FROM scheduler_lock; and then restart this instance.
If running a single instance: ensure no other instances are running before startup.

General Recovery Commands

Clear all locks:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Check lock status:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Verify database integrity:

sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
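
The integrity check can also be run from a script, for example as part of a monitoring job. This is a minimal sketch using Python's standard sqlite3 module. The function name integrity_problems is hypothetical. PRAGMA integrity_check returns a single row containing ok when no problems are found, and one row per problem otherwise.

```python
import sqlite3


def integrity_problems(conn: sqlite3.Connection) -> list[str]:
    """Return the problems reported by SQLite, or an empty list if the database is ok."""
    rows = conn.execute("PRAGMA integrity_check").fetchall()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

For example, integrity_problems(sqlite3.connect("/var/lib/bangui/bangui.db")) returns an empty list for a healthy database.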

Getting Help

If issues persist after following this guide:

Enable debug logging: BANGUI_LOG_LEVEL=debug
Collect logs around the failure time
Check Docs/Deployment.md for configuration guidance
Check Docs/Observability.md for monitoring setup
