Troubleshooting Guide
Scheduler Lock Issues
Lock Held by Crashed Instance (Orphaned Lock)
Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
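The orphan check above can be scripted. The following is a minimal sketch, assuming heartbeat_at is stored as Unix epoch seconds and that the script runs on the same host (and in the same PID namespace) as the instance that wrote the lock. The helper names pid_exists, is_orphaned, and find_orphaned_locks are hypothetical and not part of BanGUI.

```python
import os
import sqlite3
import time
from typing import Callable, List, Optional, Tuple

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute threshold described above


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_orphaned(heartbeat_at: float, pid: int, now: float,
                pid_alive: Callable[[int], bool] = pid_exists) -> bool:
    """A lock is orphaned when its heartbeat is stale AND its owner is gone."""
    stale = (now - heartbeat_at) > STALE_AFTER_SECONDS
    return stale and not pid_alive(pid)


def find_orphaned_locks(conn: sqlite3.Connection, now: Optional[float] = None,
                        pid_alive: Callable[[int], bool] = pid_exists
                        ) -> List[Tuple[int, str]]:
    """Return (pid, hostname) for every lock row that looks orphaned."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
    ).fetchall()
    return [(pid, host) for pid, host, hb in rows
            if is_orphaned(hb, pid, now, pid_alive)]
```

Both conditions are required on purpose: a stale heartbeat alone may only mean slow database I/O, in which case deleting the lock could let two instances run the scheduler.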
Recovery:
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
Restart the backend. It will acquire the lock fresh.
Prevention:
- Monitor scheduler_lock_heartbeat_lost events in logs
- If >3 occurrences per hour, investigate database I/O performance
Two Instances Both Running Scheduler
Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.
Cause: Both instances believe they hold the lock.
Diagnosis:
- Check which instance holds the lock:
  SELECT pid, hostname FROM scheduler_lock;
- Compare with running processes:
  ps aux | grep bangui
Solution:
- Stop one instance immediately
- Clear lock:
  DELETE FROM scheduler_lock;
- Restart the remaining instance
Prevention:
- Ensure only one instance starts before heartbeat begins
- Check that BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required
Heartbeat Update Failures
Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.
Cause: Database writes failing or extremely slow (>5 seconds per write).
Diagnosis:
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
If this takes >1 second, database I/O is degraded.
Solution:
- Check disk health:
  sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
- Move database to faster storage (SSD)
- Check for other I/O bottlenecks on the host
Lock Not Acquired at Startup
Symptom: Instance fails to start with error "Could not acquire scheduler lock".
Cause: Another instance already holds the lock and appears healthy.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps aux | grep <pid>
Solution:
- If other instance is healthy and should run scheduler: this instance must wait
- If other instance is crashed: DELETE FROM scheduler_lock; then restart this instance
- If running single instance: ensure no other instances are running before startup
Rate Limiting
Getting 429 Too Many Requests
Symptom: API returns HTTP 429 with rate_limit_exceeded error code.
Cause: You have exceeded the per-IP rate limit for a specific operation.
Diagnosis:
- Check the Retry-After header in the response; it tells you how many seconds to wait
- Look for the log event *_rate_limit_exceeded, which shows the bucket and client IP
Rate limit buckets:
| Bucket | Limit | Window | Operations |
|---|---|---|---|
| bans:ban | 100 | 1 minute | Ban IP addresses |
| bans:unban | 100 | 1 minute | Unban IP addresses |
| blocklist:import | 10 | 1 hour | Import blocklists |
| config:update | 50 | 1 minute | Update configuration |
| jail:update | 100 | 1 minute | Update jail config |
| jail:create | 100 | 1 minute | Add log paths, assign filters/actions |
| jail:delete | 100 | 1 minute | Remove log paths, actions |
| jail:activate | 100 | 1 minute | Activate jails |
| jail:deactivate | 100 | 1 minute | Deactivate jails |
| filter:update | 50 | 1 minute | Update filters |
| filter:create | 50 | 1 minute | Create filters |
| filter:delete | 50 | 1 minute | Delete filters |
| action:update | 50 | 1 minute | Update actions |
| action:create | 50 | 1 minute | Create actions |
| action:delete | 50 | 1 minute | Delete actions |
Solution:
- Wait for the Retry-After period before retrying
- If you hit the limit during legitimate bulk operations, consider batching requests
- For blocklist imports (10/hour), ensure automated imports are not more frequent
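A client can honor Retry-After automatically. The sketch below is transport-agnostic: send is a hypothetical callable that performs one request and returns (status, headers), and the header is assumed to carry a delay in seconds (the HTTP-date form falls back to a default). Header lookup here is case-sensitive, so adapt it to your HTTP library.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Parse Retry-After given as delay-seconds; use `default` if missing or unparsable."""
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        return default


def call_with_retry(send: Callable[[], Response], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send(); on HTTP 429 wait for the advertised period and try again."""
    status, headers = send()
    for _ in range(max_attempts - 1):
        if status != 429:
            break
        sleep(retry_after_seconds(headers))
        status, headers = send()
    return status, headers
```

The last response is returned even if it is still a 429, so the caller decides whether to give up or queue the operation for later.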
Prevention:
- Monitor *_rate_limit_exceeded log events
- Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
- For bulk operations, implement client-side throttling
Note: If rate limiting triggers unexpectedly for legitimate use, check for:
- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)
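To see why a shared proxy IP and a process restart both affect limiting, it helps to look at how an in-memory per-IP limiter typically works. This guide does not show BanGUI's actual limiter; the sketch below is one common design (a fixed-window counter) and the class name FixedWindowLimiter is hypothetical.

```python
import math
import time
from typing import Dict, Optional, Tuple


class FixedWindowLimiter:
    """Counts requests per (bucket, client IP) inside a fixed time window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        # State lives only in this process: a restart starts from empty counters.
        self._state: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def check(self, bucket: str, client_ip: str,
              now: Optional[float] = None) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds)."""
        now = time.monotonic() if now is None else now
        key = (bucket, client_ip)
        start, count = self._state.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window expired, start a new one
        if count >= self.limit:
            return False, math.ceil(start + self.window - now)
        self._state[key] = (start, count + 1)
        return True, 0
```

Because the key is the client IP, every user behind the same proxy draws from one budget, and because the dictionary is process-local, counters do not survive a restart.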
Database Migration Failures
Application Won't Start After Upgrade
Symptom: Application fails to start. Logs show migration errors.
Cause: Migration failed mid-transaction. Database left in inconsistent state.
Diagnosis:
# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
# Check logs for specific error
grep -i migration /var/log/bangui.log
Solution:
- If migration was auto-rolled back: Startup will retry the same migration. Run application again.
- If migration keeps failing: Check if table already exists:
  sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
  If it exists, manually insert the migration record (replace ? with the version of the failed migration):
  sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (?);"
- Full database reset (development only):
  rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
Prevention:
- Always back up before upgrades: cp bangui.db bangui.db.backup
- Never manually modify database schema
- Monitor migrating_database_schema log events during upgrades
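If the backend is still running when the backup is taken, a plain file copy can miss changes that sit in the bangui.db-wal file. One alternative is SQLite's online backup API, which Python exposes as Connection.backup. This is a minimal sketch; the function name backup_database is hypothetical and the paths are passed in by the caller.

```python
import sqlite3


def backup_database(src_path: str, dest_path: str) -> bool:
    """Copy a live SQLite database and report whether the copy passes integrity_check."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies a consistent snapshot, including committed data still in the WAL.
        src.backup(dest)
        result = dest.execute("PRAGMA integrity_check").fetchone()[0]
        return result == "ok"
    finally:
        dest.close()
        src.close()
```

A call such as backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup") before an upgrade yields a single self-contained file to restore from.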
Schema Version Mismatch
Symptom: Error: "database schema version X is newer than supported version Y"
Cause: Downgraded to older BanGUI version that doesn't support current schema.
Solution: Upgrade to a version compatible with the current schema, or restore from backup.
502 Bad Gateway Errors
Symptom: Nginx returns 502 Bad Gateway
Cause: The backend container is unreachable — either down, restarting, or not yet healthy.
Diagnosis:
# Check backend container status
docker ps -a | grep bangui-backend
# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health
# Check backend logs
docker logs bangui-backend --tail 50
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | docker ps shows backend repeatedly restarting | Check health check timing; may need longer start_period |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | start_period: 40s not enough on slow hosts | Increase start_period in compose file |
| Port misconfiguration | expose vs ports mismatch | Ensure backend exposes 8000 and frontend proxies to it |
Prevention:
- The depends_on: condition: service_healthy setting ensures the backend is fully started before the frontend proxies requests.
- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
- Health check parameters are tuned for typical startup time; adjust start_period if the host is slow or resource-constrained.
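To measure whether start_period is long enough on a given host, one option is to poll the health endpoint and note how long it takes to turn healthy. The sketch below assumes a 200 response means healthy. The function names are hypothetical, and the URL in the usage note is the one from the diagnosis commands above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def backend_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only for an HTTP 200; 503, refused connections and timeouts are False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check: Callable[[], bool], deadline_seconds: float = 40.0,
                       interval: float = 2.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() every `interval` seconds until it passes or the deadline expires."""
    deadline = clock() + deadline_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, wait_until_healthy(lambda: backend_is_healthy("http://bangui-backend:8000/api/v1/health")) returning False with the default 40s deadline suggests start_period needs to be raised.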
Graceful Shutdown Issues
Container Killed Before Tasks Complete
Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.
Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.
Diagnosis:
# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
Solution:
- Increase stop_grace_period in docker-compose.yml:
  backend:
    stop_grace_period: 60s
- The Python graceful timeout is 25s (leaving margin before Docker kill)
- If tasks still time out, check task code; long-running tasks should handle cancellation gracefully
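The drain-then-cancel behavior described in this section can be illustrated with plain asyncio. This is a sketch of the general pattern, not the code in backend/app/main.py; the function name drain_tasks is hypothetical, and the 25s default mirrors the timeout stated above.

```python
import asyncio
from typing import Iterable, Tuple

GRACEFUL_TIMEOUT_SECONDS = 25.0  # must stay below Docker's stop_grace_period


async def drain_tasks(tasks: Iterable["asyncio.Task"],
                      timeout: float = GRACEFUL_TIMEOUT_SECONDS) -> Tuple[int, int]:
    """Wait up to `timeout` for tasks to finish, then cancel the rest.

    Returns (completed, cancelled) counts.
    """
    tasks = list(tasks)
    if not tasks:
        return 0, 0  # asyncio.wait() rejects an empty collection
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks run their cleanup (except/finally) before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)
```

A task handles cancellation gracefully when it catches asyncio.CancelledError only to clean up (close files, roll back a transaction) and then re-raises it. A task that swallows the exception and keeps working is the kind that ends in exit code 137.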
Scheduler Lock Not Released
Symptom: After container restart, logs show Could not acquire scheduler lock.
Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
Solution:
# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container
Prevention:
- Graceful shutdown releases lock immediately (not waiting for TTL expiry)
- Monitor logs for scheduler_lock_released on clean shutdown
In-Flight Requests Dropped
Symptom: Client connections closed abruptly during shutdown.
Cause: Too short a graceful timeout, or clients not configured to retry.
Solution:
- Ensure clients implement proper retry logic with backoff
- For critical operations, use background tasks with status polling
- Increase graceful timeout if network latency is high
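Client-side retry with backoff can be as small as the sketch below. The delays, the attempt count, and the choice of ConnectionError as the retryable error are illustrative assumptions; op stands for whatever call your client makes. Only retry operations that are safe to repeat.

```python
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 10.0, attempts: int = 5) -> Iterator[float]:
    """Yield exponentially growing delays, each capped at `cap` seconds."""
    for n in range(attempts):
        yield min(cap, base * factor ** n)


def retry_with_backoff(op: Callable[[], T], attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep,
                       retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,)
                       ) -> T:
    """Run op(); on a retryable error wait and try again, up to `attempts` calls."""
    delays = backoff_delays(attempts=attempts - 1)
    while True:
        try:
            return op()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
```

With the defaults the waits are 0.5s, 1s, 2s, and 4s, which stays well inside the 25s shutdown window plus a typical container restart. Adding random jitter to each delay is a common refinement when many clients reconnect at once.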
General Recovery Commands
Clear all locks:
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
Check lock status:
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
Verify database integrity:
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
Getting Help
If issues persist after following this guide:
- Enable debug logging: BANGUI_LOG_LEVEL=debug
- Collect logs around the failure time
- Check Docs/Deployment.md for configuration guidance
- Check Docs/Observability.md for monitoring setup