Troubleshooting Guide
Scheduler Lock Issues
Lock Held by Crashed Instance (Orphaned Lock)
Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
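The orphan rule above can be sketched in Python for use in a monitoring script. This is a minimal illustration, not BanGUI's own code: it assumes heartbeat_at holds Unix epoch seconds (as the unixepoch() call elsewhere in this guide suggests), that the script runs on the same host and in the same PID namespace as the instance that wrote the lock, and that the host is POSIX. The function names are hypothetical.

```python
import os
import sqlite3
import time


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def is_lock_orphaned(conn, max_age_s=300, now=None, pid_alive=pid_exists):
    """Stale heartbeat AND dead PID means the lock is orphaned."""
    row = conn.execute(
        "SELECT pid, heartbeat_at FROM scheduler_lock LIMIT 1"
    ).fetchone()
    if row is None:
        return False  # no lock row, nothing to clean up
    pid, heartbeat_at = row
    now = time.time() if now is None else now
    return (now - heartbeat_at) > max_age_s and not pid_alive(pid)
```

A caller would pass sqlite3.connect("/var/lib/bangui/bangui.db") and only run the DELETE shown under Recovery when the function returns True.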
Recovery:
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
Restart the backend. It will acquire the lock fresh.
Prevention:
- Monitor scheduler_lock_heartbeat_lost events in logs
- If >3 occurrences per hour, investigate database I/O performance
Two Instances Both Running Scheduler
Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.
Cause: Both instances believe they hold the lock.
Diagnosis:
- Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
- Compare with running processes: ps aux | grep bangui
Solution:
- Stop one instance immediately
- Clear lock: DELETE FROM scheduler_lock;
- Restart the remaining instance
Prevention:
- Ensure only one instance starts before heartbeat begins
- Check that BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required
Heartbeat Update Failures
Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.
Cause: Database writes failing or extremely slow (>5 seconds per write).
Diagnosis:
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
If this takes >1 second, database I/O is degraded.
Solution:
- Check for database corruption, which can point to failing storage: sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
- Move database to faster storage (SSD)
- Check for other I/O bottlenecks on the host
Lock Not Acquired at Startup
Symptom: Instance fails to start with error "Could not acquire scheduler lock".
Cause: Another instance already holds the lock and appears healthy.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps aux | grep <pid>
Solution:
- If other instance is healthy and should run scheduler: this instance must wait
- If other instance is crashed: DELETE FROM scheduler_lock; then restart this instance
- If running single instance: ensure no other instances are running before startup
Rate Limiting
Getting 429 Too Many Requests
Symptom: API returns HTTP 429 with rate_limit_exceeded error code.
Cause: You have exceeded the per-IP rate limit for a specific operation.
Diagnosis:
- Check the Retry-After header in the response; it tells you how many seconds to wait
- Look for the log event *_rate_limit_exceeded, which shows the bucket and client IP
Rate limit buckets:
| Bucket | Limit | Window | Operations |
|---|---|---|---|
| bans:ban | 100 | 1 minute | Ban IP addresses |
| bans:unban | 100 | 1 minute | Unban IP addresses |
| blocklist:import | 10 | 1 hour | Import blocklists |
| config:update | 50 | 1 minute | Update configuration |
| jail:update | 100 | 1 minute | Update jail config |
| jail:create | 100 | 1 minute | Add log paths, assign filters/actions |
| jail:delete | 100 | 1 minute | Remove log paths, actions |
| jail:activate | 100 | 1 minute | Activate jails |
| jail:deactivate | 100 | 1 minute | Deactivate jails |
| filter:update | 50 | 1 minute | Update filters |
| filter:create | 50 | 1 minute | Create filters |
| filter:delete | 50 | 1 minute | Delete filters |
| action:update | 50 | 1 minute | Update actions |
| action:create | 50 | 1 minute | Create actions |
| action:delete | 50 | 1 minute | Delete actions |
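To make the bucket, limit, and window columns concrete, here is a rough sketch of a per-IP, in-memory sliding-window limiter. It is illustrative only: this guide does not say which algorithm BanGUI uses, and the class name, method name, and the way Retry-After is computed are all assumptions. Only three of the buckets above are included to keep the example short.

```python
import math
import time
from collections import defaultdict, deque

# (limit, window in seconds) for a few buckets from the table above.
LIMITS = {
    "bans:ban": (100, 60),
    "blocklist:import": (10, 3600),
    "config:update": (50, 60),
}


class SlidingWindowLimiter:
    """Hypothetical per-(bucket, client IP) counter held in memory."""

    def __init__(self, limits, clock=time.monotonic):
        self._limits = limits
        self._clock = clock
        self._hits = defaultdict(deque)  # (bucket, ip) -> request timestamps

    def check(self, bucket, ip):
        """Return (allowed, retry_after_seconds)."""
        limit, window = self._limits[bucket]
        now = self._clock()
        hits = self._hits[(bucket, ip)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= window:
            hits.popleft()
        if len(hits) < limit:
            hits.append(now)
            return True, 0
        # The oldest hit leaves the window after this many seconds.
        return False, math.ceil(hits[0] + window - now)
```

Because the state lives in a dictionary inside one process, it is lost on restart and is not shared between processes, which is worth keeping in mind when reading the log events.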
Solution:
- Wait for the Retry-After period before retrying
- If you hit the limit during legitimate bulk operations, consider batching requests
- For blocklist imports (10/hour), ensure automated imports are not more frequent
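On the client side, waiting for Retry-After might look like the sketch below. The send callable is hypothetical: it stands in for whatever HTTP client you use and is assumed to return a (status_code, headers_dict, body) tuple, with the header exposed under the exact key "Retry-After" and its value given in seconds.

```python
import time


def call_with_retry_after(send, max_attempts=5, default_wait_s=1.0,
                          sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status_code, headers, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, body
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except ValueError:
            wait_s = default_wait_s  # header was an HTTP date or malformed
        sleep(max(wait_s, 0.0))
```

The sleep parameter is injectable so the function can be exercised in tests without real delays.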
Prevention:
- Monitor *_rate_limit_exceeded log events
- Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
- For bulk operations, implement client-side throttling
Note: If rate limiting triggers unexpectedly for legitimate use, check for:
- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)
Database Migration Failures
Application Won't Start After Upgrade
Symptom: Application fails to start. Logs show migration errors.
Cause: Migration failed mid-transaction. Database left in inconsistent state.
Diagnosis:
# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
# Check logs for specific error
grep -i migration /var/log/bangui.log
Solution:
- If migration was auto-rolled back: Startup will retry the same migration. Run the application again.
- If migration keeps failing: Check if the table already exists:
  sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
  If it exists, manually insert the migration record, replacing <version> with the failing migration's version number:
  sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
- Full database reset (development only):
  rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
Prevention:
- Always backup before upgrades: cp bangui.db bangui.db.backup
- Never manually modify database schema
- Monitor migrating_database_schema log events during upgrades
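If the application cannot be stopped for the backup, a plain file copy may miss pages that are still in the -wal file. One alternative, sketched here under the assumption that Python 3.7+ is available on the host, is SQLite's online backup API as exposed by the standard sqlite3 module. The function name and paths are illustrative.

```python
import sqlite3


def backup_database(src_path, dest_path):
    """Copy a SQLite database with the online backup API.

    Unlike a plain file copy, this yields a consistent snapshot even when
    the source is in WAL mode and has uncheckpointed pages.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies every page of the main database
    finally:
        dest.close()
        src.close()
```

For example, backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup") before starting the upgrade.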
Schema Version Mismatch
Symptom: Error: "database schema version X is newer than supported version Y"
Cause: Downgraded to older BanGUI version that doesn't support current schema.
Solution: Upgrade to a version compatible with the current schema, or restore from backup.
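The check behind this error is presumably a comparison between the highest recorded migration and the highest version the running code knows about. A minimal sketch of that idea follows; SUPPORTED_VERSION, SchemaTooNewError, and check_schema_version are hypothetical names, and the value 12 is a placeholder rather than a real BanGUI schema version.

```python
import sqlite3

SUPPORTED_VERSION = 12  # hypothetical placeholder, not a real BanGUI value


class SchemaTooNewError(RuntimeError):
    """Raised when the database was written by a newer release."""


def check_schema_version(conn, supported=SUPPORTED_VERSION):
    """Return the current schema version, or raise if it is too new."""
    (current,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_migrations"
    ).fetchone()
    if current > supported:
        raise SchemaTooNewError(
            f"database schema version {current} is newer than "
            f"supported version {supported}"
        )
    return current
```

Running a check like this against a copy of the database before a downgrade shows whether the older release can open it at all.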
502 Bad Gateway Errors
Symptom: Nginx returns 502 Bad Gateway
Cause: The backend container is unreachable — either down, restarting, or not yet healthy.
Diagnosis:
# Check backend container status
docker ps -a | grep bangui-backend
# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health
# Check backend logs
docker logs bangui-backend --tail 50
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | docker ps shows backend repeatedly restarting | Check health check timing; may need longer start_period |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | start_period: 40s not enough on slow hosts | Increase start_period in compose file |
| Port misconfiguration | expose vs ports mismatch | Ensure backend exposes 8000 and frontend proxies to it |
Prevention:
- The depends_on: condition: service_healthy setting ensures the backend is fully started before the frontend proxies requests.
- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
- Health check parameters are tuned for typical startup time; adjust start_period if the host is slow or resource-constrained.
Graceful Shutdown Issues
Container Killed Before Tasks Complete
Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.
Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.
Diagnosis:
# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
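Exit codes above 128 conventionally encode 128 plus the number of the signal that ended the process, which is why 137 maps to signal 9. A small helper for reading them, assuming a Linux host (the function name is made up for this example):

```python
import signal


def describe_exit_code(code):
    """Map a container exit code to a signal name when it encodes one."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return "exited normally" if code == 0 else f"application error {code}"
```

With this, 137 reads as SIGKILL (the grace period ran out) and 143 as SIGTERM (the process stopped on the first signal).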
Solution:
- Increase stop_grace_period in docker-compose.yml:
  backend:
    stop_grace_period: 60s
- The Python graceful timeout is 25s (leaving margin before Docker kill)
- If tasks still time out, check task code; long-running tasks should handle cancellation gracefully
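As an illustration of handling cancellation gracefully, the asyncio sketch below shows a task that cleans up when cancelled and a shutdown helper that waits for a timeout before cancelling. It assumes the backend's tasks are asyncio tasks, which this guide does not state; long_task and shutdown are hypothetical names, and only the 25s figure comes from the text above.

```python
import asyncio

GRACEFUL_TIMEOUT_S = 25  # graceful timeout quoted above


async def long_task(state):
    """Hypothetical long-running task that cleans up when cancelled."""
    try:
        while True:
            await asyncio.sleep(0.01)  # stand-in for one unit of work
            state["units"] += 1
    except asyncio.CancelledError:
        state["cleaned_up"] = True  # release resources, flush partial results
        raise  # re-raise so the caller sees the task as cancelled


async def shutdown(tasks, timeout=GRACEFUL_TIMEOUT_S):
    """Wait up to `timeout` seconds, then cancel whatever is still running."""
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to run their cleanup handlers.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)
```

The important detail is the re-raise: a task that swallows CancelledError keeps running and is eventually killed by Docker without any cleanup.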
Scheduler Lock Not Released
Symptom: After container restart, logs show Could not acquire scheduler lock.
Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
Solution:
# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container
Prevention:
- Graceful shutdown releases lock immediately (not waiting for TTL expiry)
- Monitor logs for scheduler_lock_released on clean shutdown
In-Flight Requests Dropped
Symptom: Client connections closed abruptly during shutdown.
Cause: Too short a graceful timeout, or clients not configured to retry.
Solution:
- Ensure clients implement proper retry logic with backoff
- For critical operations, use background tasks with status polling
- Increase graceful timeout if network latency is high
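Retry logic with backoff could be sketched as follows. This is a generic pattern rather than anything BanGUI ships: op is a hypothetical zero-argument callable that performs the request, and the delays use capped exponential growth with full jitter. Apply it only to operations that are safe to repeat.

```python
import random
import time


def retry_with_backoff(op, max_attempts=5, base_s=0.5, cap_s=30.0,
                       retry_on=(ConnectionError,),
                       sleep=time.sleep, rng=random.random):
    """Run op(), retrying on connection errors with growing delays."""
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Full jitter: wait a random share of the capped exponential delay.
            delay = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * delay)
```

The jitter spreads out reconnect attempts so that many clients dropped by the same shutdown do not all return at the same instant.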
General Recovery Commands
Clear all locks:
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
Check lock status:
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
Verify database integrity:
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
Regex Pattern Rejected
Symptom: Filter or action configuration fails with "Invalid regex" error
Cause: The regex pattern is either syntactically invalid or detected as a ReDoS (Regular Expression Denial of Service) vulnerability.
Diagnosis:
- Check the error message — it indicates whether the pattern is syntactically invalid or flagged as dangerous
- Look for log events: regex_redos_detected or regex_compilation_timeout
Common ReDoS patterns that are rejected:
| Pattern | Problem |
|---|---|
| (a+)+b | Nested quantifiers with overlap |
| ([a-z]+)*d | Quantifier inside quantifier |
| (x+)+y | Nested plus operators |
Solution:
- Rewrite the pattern to avoid nested quantifiers on overlapping groups
- Use atomic groups or possessive quantifiers where possible: (?>a+)+b
- Simplify complex alternations
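Often the nested quantifier can simply be flattened. For the first rejected pattern, (a+)+b accepts the same strings as a+b, and the short script below checks that claim by brute force over every string up to a small length. It is a sanity check for a rewrite, not a proof, and same_matches is a helper written for this example. Note that the rewrite drops the capture group, which matters if the surrounding configuration relies on it.

```python
import itertools
import re

NESTED = re.compile(r"(a+)+b")  # rejected: nested quantifiers
FLAT = re.compile(r"a+b")       # candidate rewrite without nesting


def same_matches(p1, p2, alphabet="ab", max_len=6):
    """Check that two patterns accept exactly the same short strings."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False
    return True
```

Keep max_len small when one of the patterns is the vulnerable original, since long non-matching inputs are exactly what trigger the slow path.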
Prevention:
- Test regex patterns in isolation before deploying
- Avoid patterns with quantified groups inside other quantifiers
- Prefer explicit character classes over .* where possible
- Use regexploit to audit patterns
Getting Help
If issues persist after following this guide:
- Enable debug logging: BANGUI_LOG_LEVEL=debug
- Collect logs around the failure time
- Check Docs/Deployment.md for configuration guidance
- Check Docs/Observability.md for monitoring setup