Troubleshooting Guide

Scheduler Lock Issues

Lock Held by Crashed Instance (Orphaned Lock)

Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"

If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
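Both conditions can be checked together in a short script. This is a minimal sketch, not part of BanGUI: it assumes heartbeat_at holds Unix epoch seconds (as the unixepoch() call elsewhere in this guide suggests), that it runs on the host named in the hostname column, and that the host is POSIX. The function names are illustrative.

```python
import os
import sqlite3
import time
from typing import Callable, List, Optional, Tuple

STALE_AFTER_SECONDS = 5 * 60  # threshold stated in this guide


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (POSIX only)."""
    if pid <= 0:
        return False  # non-positive PIDs address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_orphaned(
    heartbeat_at: float,
    pid: int,
    now: Optional[float] = None,
    alive: Callable[[int], bool] = pid_exists,
) -> bool:
    """A lock is orphaned when its heartbeat is stale AND its owner is gone."""
    now = time.time() if now is None else now
    return (now - heartbeat_at) > STALE_AFTER_SECONDS and not alive(pid)


def orphaned_locks(db_path: str) -> List[Tuple[int, str]]:
    """Return (pid, hostname) for every lock row that looks orphaned."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
        ).fetchall()
    finally:
        conn.close()
    return [(pid, host) for pid, host, hb in rows if lock_is_orphaned(hb, pid)]
```

The PID check is only meaningful on the machine that wrote the lock; if hostname names a different host, run the check there.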

Recovery:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Restart the backend. It acquires a fresh lock on startup.
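An unconditional DELETE also removes a lock that a healthy instance may have acquired in the meantime. A narrower variant removes only rows whose heartbeat is stale. The sketch below assumes heartbeat_at holds Unix epoch seconds; clear_stale_lock is a hypothetical helper, not a BanGUI command.

```python
import sqlite3
import time


def clear_stale_lock(db_path: str, max_age_seconds: int = 300) -> int:
    """Delete lock rows older than max_age_seconds; return how many were removed."""
    cutoff = time.time() - max_age_seconds
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if the DELETE raises
            cur = conn.execute(
                "DELETE FROM scheduler_lock WHERE heartbeat_at < ?", (cutoff,)
            )
            return cur.rowcount
    finally:
        conn.close()
```

A return value of 0 means no stale row was found, so the lock is probably held by a live instance.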

Prevention:

  • Monitor scheduler_lock_heartbeat_lost events in the logs
  • If the event occurs more than 3 times per hour, investigate database I/O performance
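The hourly threshold can be checked with a small log scan. This sketch makes assumptions about the log format that this guide does not confirm: each line starts with an ISO 8601 timestamp, and the event name appears somewhere in the line. It counts per clock hour rather than over a sliding window.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

EVENT = "scheduler_lock_heartbeat_lost"
THRESHOLD_PER_HOUR = 3  # threshold stated in this guide


def noisy_hours(lines: Iterable[str]) -> Dict[str, int]:
    """Return {hour: count} for clock hours with more than THRESHOLD_PER_HOUR events."""
    per_hour: Counter = Counter()
    for line in lines:
        if EVENT not in line:
            continue
        stamp = line.split(maxsplit=1)[0]  # assumed: timestamp is the first token
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # line does not start with a parseable timestamp
        per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
    return {hour: n for hour, n in per_hour.items() if n > THRESHOLD_PER_HOUR}
```

Pass it an open log file, for example noisy_hours(open(path)), where path is wherever your deployment writes its logs.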

Two Instances Both Running Scheduler

Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

Cause: Both instances believe they hold the lock.

Diagnosis:

  1. Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
  2. Compare with running processes: ps aux | grep bangui

Solution:

  1. Stop one instance immediately
  2. Clear lock: DELETE FROM scheduler_lock;
  3. Restart the remaining instance

Prevention:

  • Avoid starting instances simultaneously: let the first instance begin its heartbeat before starting another
  • Check that BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required

Heartbeat Update Failures

Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then the lock is lost.

Cause: Database writes failing or extremely slow (>5 seconds per write).

Diagnosis:

time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"

If this takes more than 1 second, database I/O is degraded. The unixepoch() function requires SQLite 3.38.0 or later.
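The timed UPDATE above overwrites the stored heartbeat, which can make an orphaned lock look fresh. A variant that writes each row's existing value back leaves the data unchanged while still going through a normal write and commit. This is a sketch with a hypothetical helper name; it measures nothing useful if the table is empty.

```python
import sqlite3
import time


def time_lock_write(db_path: str) -> float:
    """Time a committed no-op UPDATE on scheduler_lock; return elapsed seconds."""
    conn = sqlite3.connect(db_path)
    try:
        start = time.perf_counter()
        with conn:  # commit happens when this block exits
            # Writes the current value back, so heartbeat_at is unchanged.
            conn.execute("UPDATE scheduler_lock SET heartbeat_at = heartbeat_at")
        return time.perf_counter() - start
    finally:
        conn.close()
```

Compare the result against the 1 second threshold above, and repeat the measurement a few times, since a single sample can be skewed by caching.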

Solution:

  1. Check database integrity: sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
  2. Move database to faster storage (SSD)
  3. Check for other I/O bottlenecks on the host

Lock Not Acquired at Startup

Symptom: Instance fails to start with error "Could not acquire scheduler lock".

Cause: Another instance already holds the lock and appears healthy.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>

Solution:

  • If the other instance is healthy and should run the scheduler, this instance must wait
  • If the other instance has crashed, run DELETE FROM scheduler_lock; and then restart this instance
  • If you run a single instance, ensure no other instances are running before startup

Rate Limiting

Getting 429 Too Many Requests

Symptom: API returns HTTP 429 with rate_limit_exceeded error code.

Cause: You have exceeded the per-IP rate limit for a specific operation.

Diagnosis:

  1. Check the Retry-After header in the response, which tells you how many seconds to wait
  2. Look for the log event *_rate_limit_exceeded, which shows the bucket and client IP
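A client can honor the Retry-After header automatically. The sketch below uses only the Python standard library and treats the header as a number of seconds, as described above. The helper names and the retry count are assumptions for illustration, not part of the BanGUI API.

```python
import time
import urllib.error
import urllib.request


def retry_after_seconds(headers, default: float = 1.0) -> float:
    """Read Retry-After as seconds; fall back to default if absent or malformed."""
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default


def request_with_retry(req: urllib.request.Request, max_attempts: int = 3) -> bytes:
    """Send req, sleeping for Retry-After seconds whenever the server answers 429."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise  # not a rate limit, or out of attempts
            time.sleep(retry_after_seconds(err.headers))
    raise ValueError("max_attempts must be at least 1")
```

Capping the number of attempts keeps a misbehaving script from retrying forever against a bucket with a long window, such as blocklist:import.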

Rate limit buckets:

| Bucket | Limit | Window | Operations |
|---|---|---|---|
| bans:ban | 100 | 1 minute | Ban IP addresses |
| bans:unban | 100 | 1 minute | Unban IP addresses |
| blocklist:import | 10 | 1 hour | Import blocklists |
| config:update | 50 | 1 minute | Update configuration |
| jail:update | 100 | 1 minute | Update jail config |
| jail:create | 100 | 1 minute | Add log paths, assign filters/actions |
| jail:delete | 100 | 1 minute | Remove log paths, actions |
| jail:activate | 100 | 1 minute | Activate jails |
| jail:deactivate | 100 | 1 minute | Deactivate jails |
| filter:update | 50 | 1 minute | Update filters |
| filter:create | 50 | 1 minute | Create filters |
| filter:delete | 50 | 1 minute | Delete filters |
| action:update | 50 | 1 minute | Update actions |
| action:create | 50 | 1 minute | Create actions |
| action:delete | 50 | 1 minute | Delete actions |

Solution:

  1. Wait for the Retry-After period before retrying
  2. If you hit the limit during legitimate bulk operations, consider batching requests
  3. For blocklist imports (10/hour), ensure automated imports run no more than 10 times per hour

Prevention:

  • Monitor *_rate_limit_exceeded log events
  • Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
  • For bulk operations, implement client-side throttling
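Client-side throttling for bulk operations can be as simple as spacing calls evenly. This is a minimal sketch with a hypothetical Throttle class; how the server counts its windows is not documented here, so choosing a limit somewhat below the bucket's value leaves a safety margin.

```python
import time


class Throttle:
    """Space calls evenly so roughly `limit` calls happen per `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = window_seconds / limit  # minimum gap between calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot one interval after this call's slot.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay
```

For the bans:ban bucket (100 per minute), creating Throttle(limit=90, window_seconds=60) and calling wait() before each request keeps a bulk job under the limit.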

Note: If rate limiting triggers unexpectedly for legitimate use, check for:

  • Internal monitoring scripts hitting endpoints too frequently
  • Multiple users behind the same proxy IP
  • Stale rate limit state after process restart (uses in-memory tracking)

General Recovery Commands

Clear all locks:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Check lock status:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Verify database integrity:

sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
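The same integrity check can be scripted, for example as part of a health probe. This sketch relies on the documented behavior that PRAGMA integrity_check returns a single row containing ok when no problems are found; integrity_problems is an illustrative name.

```python
import sqlite3
from typing import List


def integrity_problems(db_path: str) -> List[str]:
    """Run PRAGMA integrity_check; return reported problems (empty if sound)."""
    conn = sqlite3.connect(db_path)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows
```

An empty list means the database passed the check; any other result lists the problems SQLite reported.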

Getting Help

If issues persist after following this guide:

  1. Enable debug logging: BANGUI_LOG_LEVEL=debug
  2. Collect logs around the failure time
  3. Check Docs/Deployment.md for configuration guidance
  4. Check Docs/Observability.md for monitoring setup