Lukas b631c1c546 feat(backend): implement graceful shutdown for container stop

Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-02 22:47:10 +02:00

Troubleshooting Guide

Scheduler Lock Issues

Lock Held by Crashed Instance (Orphaned Lock)

Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"

If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
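The orphan rule above can be sketched in Python. This is a minimal sketch, not BanGUI's own code: it assumes heartbeat_at is stored as Unix epoch seconds (consistent with the unixepoch() call used elsewhere in this guide), and the helper names pid_alive, lock_is_orphaned and check are hypothetical. The PID check is only meaningful when run on the same host and in the same PID namespace (for example, inside the same container) as the instance that wrote the lock.

```python
import os
import sqlite3
import time
from typing import Optional

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute threshold from the rule above


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_orphaned(heartbeat_at: float, pid: int, now: Optional[float] = None) -> bool:
    """Stale heartbeat AND missing PID means the lock is orphaned."""
    now = time.time() if now is None else now
    return (now - heartbeat_at) > STALE_AFTER_SECONDS and not pid_alive(pid)


def check(db_path: str = "/var/lib/bangui/bangui.db") -> None:
    """Print one line per lock row with an 'orphaned' or 'held' verdict."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
        ).fetchall()
    finally:
        conn.close()
    for pid, hostname, heartbeat_at in rows:
        verdict = "orphaned" if lock_is_orphaned(heartbeat_at, pid) else "held"
        print(pid, hostname, verdict)
```

Calling check() on the host that owns the database prints the verdict for each row. Only delete the lock when the verdict is "orphaned".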

Recovery:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Restart the backend. It will acquire the lock fresh.

Prevention:

Monitor scheduler_lock_heartbeat_lost events in logs
If >3 occurrences per hour, investigate database I/O performance

Two Instances Both Running Scheduler

Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

Cause: Both instances believe they hold the lock.

Diagnosis:

Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
Compare with running processes: ps aux | grep bangui

Solution:

Stop one instance immediately
Clear lock: DELETE FROM scheduler_lock;
Restart the remaining instance

Prevention:

Start instances one at a time, so that the first has begun sending heartbeats before the next one starts
Check BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required

Heartbeat Update Failures

Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.

Cause: Database writes failing or extremely slow (>5 seconds per write).

Diagnosis:

time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"

If this takes >1 second, database I/O is degraded.

Solution:

Check database integrity (corruption can point to failing storage): sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
Move database to faster storage (SSD)
Check for other I/O bottlenecks on the host

Lock Not Acquired at Startup

Symptom: Instance fails to start with error "Could not acquire scheduler lock".

Cause: Another instance already holds the lock and appears healthy.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>

Solution:

If other instance is healthy and should run scheduler: this instance must wait
If other instance is crashed: DELETE FROM scheduler_lock; then restart this instance
If running single instance: ensure no other instances are running before startup

Rate Limiting

Getting 429 Too Many Requests

Symptom: API returns HTTP 429 with rate_limit_exceeded error code.

Cause: You have exceeded the per-IP rate limit for a specific operation.

Diagnosis:

Check the Retry-After header in the response — this tells you how many seconds to wait
Look for the log event *_rate_limit_exceeded which shows the bucket and client IP

Rate limit buckets:

| Bucket | Limit (requests) | Window | Operations |
|---|---|---|---|
| `bans:ban` | 100 | 1 minute | Ban IP addresses |
| `bans:unban` | 100 | 1 minute | Unban IP addresses |
| `blocklist:import` | 10 | 1 hour | Import blocklists |
| `config:update` | 50 | 1 minute | Update configuration |
| `jail:update` | 100 | 1 minute | Update jail config |
| `jail:create` | 100 | 1 minute | Add log paths, assign filters/actions |
| `jail:delete` | 100 | 1 minute | Remove log paths, actions |
| `jail:activate` | 100 | 1 minute | Activate jails |
| `jail:deactivate` | 100 | 1 minute | Deactivate jails |
| `filter:update` | 50 | 1 minute | Update filters |
| `filter:create` | 50 | 1 minute | Create filters |
| `filter:delete` | 50 | 1 minute | Delete filters |
| `action:update` | 50 | 1 minute | Update actions |
| `action:create` | 50 | 1 minute | Create actions |
| `action:delete` | 50 | 1 minute | Delete actions |

Solution:

Wait for the Retry-After period before retrying
If you hit the limit during legitimate bulk operations, consider batching requests
For blocklist imports (10/hour), ensure automated imports are not more frequent
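A client can turn the Retry-After header into a wait time before retrying. The sketch below is an illustration, not part of BanGUI: the function name retry_after_seconds is hypothetical, and it assumes the headers are available as a plain mapping with the exact key "Retry-After". This guide describes the value as a number of seconds. HTTP also allows an HTTP-date in this header, so the sketch handles both.

```python
import email.utils
import time
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait according to Retry-After (0.0 if absent or unparsable)."""
    value = headers.get("Retry-After")
    if value is None:
        return 0.0
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay given directly in seconds
    try:
        # Fallback: the header carries an HTTP-date instead of a delay
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 0.0
    now = time.time() if now is None else now
    return max(0.0, when.timestamp() - now)
```

A caller would sleep for the returned number of seconds after a 429 response and then repeat the request.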

Prevention:

Monitor *_rate_limit_exceeded log events
Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
For bulk operations, implement client-side throttling
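Client-side throttling for bulk operations can be as simple as a sliding-window counter sized to the bucket in use, for example 100 requests per 60 seconds for bans:ban. The class below is a hypothetical sketch. Whether the server counts with a sliding or a fixed window is not stated in this guide, so configuring it with some headroom below the documented limit is the safer choice.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowThrottle:
    """Allow at most `limit` requests in any `window_seconds` interval."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def wait_time(self) -> float:
        """Seconds to wait until one more request fits in the window."""
        now = self.clock()
        # Forget requests that have left the window
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        return self.window - (now - self._sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self._sent.append(self.clock())
```

Typical use: before each request, sleep for wait_time() seconds, send the request, then call record().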

Note: If rate limiting triggers unexpectedly for legitimate use, check for:

Internal monitoring scripts hitting endpoints too frequently
Multiple users behind the same proxy IP
Stale rate limit state after process restart (uses in-memory tracking)

Database Migration Failures

Application Won't Start After Upgrade

Symptom: Application fails to start. Logs show migration errors.

Cause: A migration failed mid-transaction and may have left the database in an inconsistent state.

Diagnosis:

# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

# Check logs for specific error
grep -i migration /var/log/bangui.log

Solution:

If the migration was rolled back automatically: startup retries the same migration, so start the application again.

If the migration keeps failing: check whether the table it creates already exists:

sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"

If it exists, manually insert the migration record, replacing <version> with the number of the failing migration:

sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
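The manual fix above can also be scripted so that the record is only written when the table really exists and the version is not already recorded. This is a sketch under assumptions: the function names are hypothetical, and schema_migrations is assumed to need only the version column, as in the commands above. If the real table has further required columns, the INSERT must be extended.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, table: str) -> bool:
    """Check sqlite_master for a table with this exact name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    return row is not None


def mark_migration_applied(conn: sqlite3.Connection, version: int, table: str) -> bool:
    """Record `version` if `table` exists; return False when the table is missing."""
    if not table_exists(conn, table):
        return False
    with conn:  # commits on success, rolls back if an exception is raised
        already = conn.execute(
            "SELECT 1 FROM schema_migrations WHERE version = ?", (version,)
        ).fetchone()
        if already is None:
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )
    return True
```

Run it against a backup copy first, with the backend stopped.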

Full database reset (development only):

rm -f /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm

Prevention:

Always back up before upgrades, with the backend stopped: cp bangui.db bangui.db.backup
Never manually modify database schema
Monitor migrating_database_schema log events during upgrades

Schema Version Mismatch

Symptom: Error: "database schema version X is newer than supported version Y"

Cause: BanGUI was downgraded to an older version that does not support the current schema.

Solution: Upgrade to a version compatible with the current schema, or restore from backup.

502 Bad Gateway Errors

Symptom: Nginx returns 502 Bad Gateway.

Cause: The backend container is unreachable — either down, restarting, or not yet healthy.

Diagnosis:

# Check backend container status
docker ps -a | grep bangui-backend

# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health

# Check backend logs
docker logs bangui-backend --tail 50

Common causes and solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | `docker ps` shows backend repeatedly restarting | Check health check timing; may need longer `start_period` |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | `start_period: 40s` not enough on slow hosts | Increase `start_period` in compose file |
| Port misconfiguration | `expose` vs `ports` mismatch | Ensure backend exposes 8000 and frontend proxies to it |

Prevention:

The depends_on: condition: service_healthy ensures the backend is fully started before the frontend proxies requests.
The health check returns 503 when fail2ban is offline, triggering container restart automatically.
Health check parameters are tuned for typical startup time — adjust start_period if the host is slow or resource-constrained.

Graceful Shutdown Issues

Container Killed Before Tasks Complete

Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.

Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.

Diagnosis:

# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = 128 + 9 (SIGKILL)
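Exit codes above 128 follow the shell convention 128 + signal number, which is why 137 maps to signal 9 (SIGKILL). The helper below is a hypothetical sketch for decoding such codes in scripts. Note that 137 does not by itself distinguish a stop timeout from an out-of-memory kill; docker inspect also reports .State.OOMKilled for that case.

```python
import signal
from typing import Optional


def signal_from_exit_code(code: int) -> Optional[str]:
    """Return the signal name for a 128+N exit code, or None for ordinary codes."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no known signal with that number on this platform
```

For example, signal_from_exit_code(143) yields "SIGTERM", which indicates the process exited on the initial stop signal rather than being killed.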

Solution:

Increase stop_grace_period in docker-compose.yml:
```
services:
  backend:
    stop_grace_period: 60s
```
The Python graceful timeout is 25s, which leaves a 5s margin before Docker's SIGKILL at the default 30s stop_grace_period
If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully
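The drain-then-cancel pattern and a task that handles cancellation gracefully can be sketched with asyncio. This is not the code in backend/app/main.py. The names drain and long_job are hypothetical, and the timeout is a parameter so the sketch can be tried with a short value instead of 25s.

```python
import asyncio
from typing import Iterable, List, Set


async def drain(tasks: Iterable["asyncio.Task"], timeout: float) -> Set["asyncio.Task"]:
    """Wait up to `timeout` seconds, then cancel what is left. Returns the cancelled tasks."""
    pending_tasks = set(tasks)
    if not pending_tasks:
        return set()  # asyncio.wait() rejects an empty set
    _done, pending = await asyncio.wait(pending_tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks run their cleanup handlers before returning
    await asyncio.gather(*pending, return_exceptions=True)
    return pending


async def long_job(log: List[str]) -> None:
    """A long-running task that cleans up when it is cancelled."""
    try:
        await asyncio.sleep(3600)  # stands in for real work
    except asyncio.CancelledError:
        log.append("cleanup")  # release resources, persist progress
        raise  # re-raise so the task is reported as cancelled
```

The important detail in long_job is the re-raise: swallowing CancelledError makes the task look as if it completed normally and can delay shutdown.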

Scheduler Lock Not Released

Symptom: After container restart, logs show Could not acquire scheduler lock.

Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Solution:

# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart the container
docker restart bangui-backend

Prevention:

Graceful shutdown releases lock immediately (not waiting for TTL expiry)
Monitor logs for scheduler_lock_released on clean shutdown

In-Flight Requests Dropped

Symptom: Client connections closed abruptly during shutdown.

Cause: Too short a graceful timeout, or clients not configured to retry.

Solution:

Ensure clients implement proper retry logic with backoff
For critical operations, use background tasks with status polling
Increase graceful timeout if network latency is high
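For client retries during a backend restart, exponential backoff with jitter keeps many clients from reconnecting at the same instant. The generator below is a hypothetical sketch of the delay schedule only. The base and cap values are illustrative, not BanGUI settings.

```python
import random
from typing import Iterator, Optional


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay per attempt; the upper bound doubles up to `cap`."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)  # "full jitter": anywhere below the ceiling
```

A client would sleep for each yielded delay between attempts and give up once the generator is exhausted.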

General Recovery Commands

Clear all locks:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Check lock status:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Verify database integrity:

sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"

Getting Help

If issues persist after following this guide:

Enable debug logging: BANGUI_LOG_LEVEL=debug
Collect logs around the failure time
Check Docs/Deployment.md for configuration guidance
Check Docs/Observability.md for monitoring setup
