BanGUI/Docs/TROUBLESHOOTING.md
Lukas b631c1c546 feat(backend): implement graceful shutdown for container stop
Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00

Troubleshooting Guide

Scheduler Lock Issues

Lock Held by Crashed Instance (Orphaned Lock)

Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"

If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
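The two conditions above (stale heartbeat, dead PID) can be combined into one check. The following is a minimal sketch, not part of BanGUI. It assumes that heartbeat_at stores Unix epoch seconds and that the script runs on the same host and in the same PID namespace as the backend; the function names are hypothetical.

```python
import os
import sqlite3
import time

STALE_AFTER_SECONDS = 5 * 60  # "older than 5 minutes" from the guide


def pid_exists(pid):
    """Return True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_orphaned(conn, now=None, pid_alive=pid_exists):
    """True if a lock row exists, its heartbeat is stale, and its PID is gone.

    Assumption: heartbeat_at holds Unix epoch seconds.
    """
    row = conn.execute("SELECT pid, heartbeat_at FROM scheduler_lock").fetchone()
    if row is None:
        return False  # no lock row at all, so nothing to clean up
    pid, heartbeat_at = row
    now = time.time() if now is None else now
    stale = (now - heartbeat_at) > STALE_AFTER_SECONDS
    return stale and not pid_alive(pid)
```

Typical use would be `lock_is_orphaned(sqlite3.connect("/var/lib/bangui/bangui.db"))`. If the hostname column shows a different machine, the PID check says nothing about that machine and the result should not be trusted.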

Recovery:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Restart the backend. It will acquire the lock fresh.

Prevention:

  • Monitor scheduler_lock_heartbeat_lost events in logs
  • If >3 occurrences per hour, investigate database I/O performance

Two Instances Both Running Scheduler

Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

Cause: Both instances believe they hold the lock.

Diagnosis:

  1. Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
  2. Compare with running processes: ps aux | grep bangui

Solution:

  1. Stop one instance immediately
  2. Clear lock: DELETE FROM scheduler_lock;
  3. Restart the remaining instance

Prevention:

  • Start instances one at a time, so that the first has acquired the lock and begun its heartbeat before the next one starts
  • Check BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required

Heartbeat Update Failures

Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.

Cause: Database writes failing or extremely slow (>5 seconds per write).

Diagnosis:

time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"

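If this takes >1 second, database I/O is degraded. Note that this statement overwrites the stored heartbeat, so a stale lock will look fresh afterwards, and that unixepoch() requires SQLite 3.38.0 or newer.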

Solution:

  1. Check database integrity (corruption often points to a failing disk): sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
  2. Move database to faster storage (SSD)
  3. Check for other I/O bottlenecks on the host

Lock Not Acquired at Startup

Symptom: Instance fails to start with error "Could not acquire scheduler lock".

Cause: Another instance already holds the lock and appears healthy.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>   # replace <pid> with the pid value returned by the query above

Solution:

  • If the other instance is healthy and should run the scheduler: this instance must wait
  • If the other instance has crashed: DELETE FROM scheduler_lock; then restart this instance
  • If running a single instance: ensure no other instances are running before startup

Rate Limiting

Getting 429 Too Many Requests

Symptom: API returns HTTP 429 with rate_limit_exceeded error code.

Cause: You have exceeded the per-IP rate limit for a specific operation.

Diagnosis:

  1. Check the Retry-After header in the response — this tells you how many seconds to wait
  2. Look for the log event *_rate_limit_exceeded which shows the bucket and client IP

Rate limit buckets:

| Bucket | Limit | Window | Operations |
|---|---|---|---|
| bans:ban | 100 | 1 minute | Ban IP addresses |
| bans:unban | 100 | 1 minute | Unban IP addresses |
| blocklist:import | 10 | 1 hour | Import blocklists |
| config:update | 50 | 1 minute | Update configuration |
| jail:update | 100 | 1 minute | Update jail config |
| jail:create | 100 | 1 minute | Add log paths, assign filters/actions |
| jail:delete | 100 | 1 minute | Remove log paths, actions |
| jail:activate | 100 | 1 minute | Activate jails |
| jail:deactivate | 100 | 1 minute | Deactivate jails |
| filter:update | 50 | 1 minute | Update filters |
| filter:create | 50 | 1 minute | Create filters |
| filter:delete | 50 | 1 minute | Delete filters |
| action:update | 50 | 1 minute | Update actions |
| action:create | 50 | 1 minute | Create actions |
| action:delete | 50 | 1 minute | Delete actions |

Solution:

  1. Wait for the Retry-After period before retrying
  2. If you hit the limit during legitimate bulk operations, consider batching requests
  3. For blocklist imports (10/hour), ensure automated imports are not more frequent
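A client can honor the Retry-After header automatically. The sketch below assumes, as stated above, that the header carries a number of seconds. The `send` callable is hypothetical: it stands for whatever HTTP call the client makes and is expected to return a (status, headers, body) tuple with the header name spelled exactly "Retry-After".

```python
import time


def call_with_retry_after(send, max_attempts=3, sleep=time.sleep):
    """Call send() and, on HTTP 429, wait Retry-After seconds before retrying.

    Returns (status, body) of the last response received.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break  # give up and hand the 429 back to the caller
        # Fall back to 1 second if the header is missing.
        delay = float(headers.get("Retry-After", "1"))
        sleep(delay)
    return status, body
```

The caller still has to decide what to do when the final attempt also returns 429; for the blocklist:import bucket the wait can be up to an hour, so failing fast may be preferable there.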

Prevention:

  • Monitor *_rate_limit_exceeded log events
  • Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
  • For bulk operations, implement client-side throttling
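Client-side throttling for bulk operations could look like the sketch below. It is one possible approach, not BanGUI code: a sliding-window limiter that never lets more than `limit` calls start within any `window` seconds. The class name is hypothetical, and how the server counts its own windows is not known here, so leaving some headroom below the documented limit is advisable.

```python
import time
from collections import deque


class SlidingWindowThrottle:
    """Allow at most `limit` calls in any `window` seconds (client side)."""

    def __init__(self, limit, window, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until one more call fits inside the window, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) >= self.limit:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.window - (now - self._sent[0]))
            now = self._clock()
            self._sent.popleft()
        self._sent.append(now)
```

For example, `throttle = SlidingWindowThrottle(limit=90, window=60)` followed by `throttle.wait()` before each ban request keeps a bulk job under the 100 per minute bans:ban bucket.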

Note: If rate limiting triggers unexpectedly for legitimate use, check for:

  • Internal monitoring scripts hitting endpoints too frequently
  • Multiple users behind the same proxy IP
  • Stale rate limit state after process restart (uses in-memory tracking)

Database Migration Failures

Application Won't Start After Upgrade

Symptom: Application fails to start. Logs show migration errors.

Cause: Migration failed mid-transaction. Database left in inconsistent state.

Diagnosis:

# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

# Check logs for specific error
grep -i migration /var/log/bangui.log
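The first two checks can also be gathered in one step, which is convenient when attaching the state to a bug report. This is a small sketch using only the table and column names shown above; the function name is hypothetical.

```python
import sqlite3


def migration_state(conn):
    """Return (highest applied schema version or None, sorted table names)."""
    tables = sorted(
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        )
    )
    version = None
    if "schema_migrations" in tables:
        # MAX() yields NULL (None) when the table is empty.
        (version,) = conn.execute(
            "SELECT MAX(version) FROM schema_migrations"
        ).fetchone()
    return version, tables
```

Usage: `migration_state(sqlite3.connect("/var/lib/bangui/bangui.db"))`. A table that exists while the version is lower than expected suggests a migration that created the table but was never recorded.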

Solution:

  1. If the migration was rolled back automatically: startup retries the same migration, so start the application again.
  2. If migration keeps failing: Check if table already exists:
    sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
    
    If it exists, manually insert the migration record, replacing <version> with the version number of the failing migration:
    sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
    
  3. Full database reset (development only):
    rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
    

Prevention:

  • Always back up before upgrades, with the backend stopped so that no writes are in flight during the copy: cp bangui.db bangui.db.backup
  • Never manually modify database schema
  • Monitor migrating_database_schema log events during upgrades

Schema Version Mismatch

Symptom: Error: "database schema version X is newer than supported version Y"

Cause: Downgraded to older BanGUI version that doesn't support current schema.

Solution: Upgrade to a version compatible with the current schema, or restore from backup.


502 Bad Gateway Errors

Symptom: Nginx returns 502 Bad Gateway

Cause: The backend container is unreachable — either down, restarting, or not yet healthy.

Diagnosis:

# Check backend container status
docker ps -a | grep bangui-backend

# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health

# Check backend logs
docker logs bangui-backend --tail 50
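To watch the backend come up after a restart, the health endpoint can be polled until it answers 200. The sketch below uses the health URL from the curl command above, which only resolves inside the container network. The function names, the attempt count, and the interval are assumptions.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://bangui-backend:8000/api/v1/health"  # URL used in the guide


def fetch_status(url, timeout=3.0):
    """Return the HTTP status code, or None if the backend is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx answer, for example 503
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


def wait_until_healthy(fetch=fetch_status, url=HEALTH_URL, attempts=10,
                       interval=2.0, sleep=time.sleep):
    """Poll the health endpoint; return the list of observed statuses."""
    seen = []
    for attempt in range(attempts):
        status = fetch(url)
        seen.append(status)
        if status == 200:
            break
        if attempt < attempts - 1:
            sleep(interval)
    return seen
```

A result such as `[None, None, 503, 200]` would indicate a backend that was first unreachable, then up but unhealthy, then healthy. A list that never reaches 200 points to one of the causes in the table that follows.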

Common causes and solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | docker ps shows backend repeatedly restarting | Check health check timing; may need longer start_period |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | start_period: 40s not enough on slow hosts | Increase start_period in compose file |
| Port misconfiguration | expose vs ports mismatch | Ensure backend exposes 8000 and frontend proxies to it |

Prevention:

  • The depends_on: condition: service_healthy ensures the backend is fully started before the frontend proxies requests.
  • The health check returns 503 when fail2ban is offline, triggering container restart automatically.
  • Health check parameters are tuned for typical startup time — adjust start_period if the host is slow or resource-constrained.

Graceful Shutdown Issues

Container Killed Before Tasks Complete

Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.

Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.

Diagnosis:

# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
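The value 137 follows the usual convention that a process killed by signal n exits with code 128 + n, and SIGKILL is signal 9. A small helper (hypothetical, for illustration only) makes the arithmetic explicit:

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short description."""
    if code == 0:
        return "clean exit"
    if code > 128:
        signum = code - 128  # 128 + n convention for "killed by signal n"
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by signal {signum}"
    return f"application error (exit code {code})"
```

On Linux, `describe_exit_code(137)` gives "killed by SIGKILL" and `describe_exit_code(143)` gives "killed by SIGTERM". An exit code of 137 alone does not prove that the grace period expired, since an out-of-memory kill produces the same code; `docker inspect bangui-backend --format '{{.State.OOMKilled}}'` helps to tell the two apart.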

Solution:

  1. Increase stop_grace_period in docker-compose.yml:
    backend:
      stop_grace_period: 60s
    
  2. The Python graceful timeout is 25s and is independent of stop_grace_period; the default pair of 25s and 30s leaves a 5s margin before Docker sends SIGKILL
  3. If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully
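What "handle cancellation gracefully" means in practice is illustrated below. This is a generic asyncio sketch, not the code in backend/app/main.py: `drain` waits for tasks up to a timeout and cancels the rest, and `long_job` shows a task that cleans up when cancelled. All names and the short demo timeouts are assumptions.

```python
import asyncio


async def drain(tasks, timeout):
    """Wait up to `timeout` seconds for tasks, then cancel the stragglers.

    Returns (finished, cancelled) as two sets of tasks.
    """
    if not tasks:
        return set(), set()  # asyncio.wait() rejects an empty collection
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks run their cleanup handlers before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return done, pending


async def long_job(log):
    """A task that cleans up when cancelled instead of stopping abruptly."""
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        log.append("cleaned up")  # release resources, persist progress, ...
        raise  # re-raise so the task is recorded as cancelled


async def main():
    log = []
    quick = asyncio.create_task(asyncio.sleep(0.01))
    slow = asyncio.create_task(long_job(log))
    done, cancelled = await drain({quick, slow}, timeout=0.1)
    return len(done), len(cancelled), log, slow.cancelled()
```

Running `asyncio.run(main())` finishes the quick task, cancels the slow one after the timeout, and still gives the slow task the chance to record its cleanup. A task that swallows CancelledError without re-raising, or that blocks in synchronous code, cannot be stopped this way and will be cut off by SIGKILL instead.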

Scheduler Lock Not Released

Symptom: After container restart, logs show Could not acquire scheduler lock.

Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Solution:

# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container

Prevention:

  • Graceful shutdown releases lock immediately (not waiting for TTL expiry)
  • Monitor logs for scheduler_lock_released on clean shutdown

In-Flight Requests Dropped

Symptom: Client connections closed abruptly during shutdown.

Cause: Too short a graceful timeout, or clients not configured to retry.

Solution:

  1. Ensure clients implement proper retry logic with backoff
  2. For critical operations, use background tasks with status polling
  3. Increase graceful timeout if network latency is high
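Retry logic with backoff, as recommended in step 1, can be sketched as follows. This is an illustration under assumptions, not a BanGUI client: the delays grow exponentially with full jitter, and `operation` is a hypothetical callable that raises ConnectionError when the connection drops.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, retries=5, rng=random.random):
    """Yield `retries` exponentially growing delays with full jitter."""
    for attempt in range(retries):
        ceiling = min(cap, base * factor ** attempt)
        yield rng() * ceiling  # random point between 0 and the ceiling


def retry_on_disconnect(operation, sleep=time.sleep, **backoff):
    """Run operation(); retry after a backoff delay when the connection drops."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted, surface the last error
            sleep(delay)
```

Blind retries are only safe for operations that can be repeated without side effects. For a request that may already have been applied before the connection closed, checking the resulting state first is the more cautious choice.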

General Recovery Commands

Clear all locks:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Check lock status:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Verify database integrity:

sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"

Getting Help

If issues persist after following this guide:

  1. Enable debug logging: BANGUI_LOG_LEVEL=debug
  2. Collect logs around the failure time
  3. Check Docs/Deployment.md for configuration guidance
  4. Check Docs/Observability.md for monitoring setup