# Troubleshooting Guide

## Scheduler Lock Issues

### Lock Held by Crashed Instance (Orphaned Lock)

**Symptom:** Background tasks stop running. Logs show `scheduler_lock_held_by_other_instance` but no other instance is running.

**Diagnosis:**

```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
```

If `heartbeat_at` is older than 5 minutes and the PID no longer exists, the lock is orphaned.

**Recovery:**

```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```

Restart the backend. It will acquire the lock fresh.

**Prevention:**

- Monitor `scheduler_lock_heartbeat_lost` events in logs
- If >3 occurrences per hour, investigate database I/O performance

---

### Two Instances Both Running Scheduler

**Symptom:** Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

**Cause:** Both instances believe they hold the lock.

**Diagnosis:**

1. Check which instance holds the lock: `SELECT pid, hostname FROM scheduler_lock;`
2. Compare with running processes: `ps aux | grep bangui`

**Solution:**

1. Stop one instance immediately
2. Clear lock: `DELETE FROM scheduler_lock;`
3. Restart the remaining instance

**Prevention:**

- Ensure only one instance starts before heartbeat begins
- Check `BANGUI_SINGLE_INSTANCE=true` is set if single-instance operation is required

---

### Heartbeat Update Failures

**Symptom:** Logs show `scheduler_lock_heartbeat_lost` repeatedly, then the lock is lost.

**Cause:** Database writes are failing or extremely slow (>5 seconds per write).

**Diagnosis:**

```bash
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
```

If this takes >1 second, database I/O is degraded. Note that this statement rewrites `heartbeat_at` on the existing lock row, and `unixepoch()` requires SQLite 3.38.0 or later.

**Solution:**

1. Check database integrity: `sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"`
2. Move database to faster storage (SSD)
3. Check for other I/O bottlenecks on the host

---

### Lock Not Acquired at Startup

**Symptom:** Instance fails to start with error "Could not acquire scheduler lock".

**Cause:** Another instance already holds the lock and appears healthy.

**Diagnosis:**

```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps aux | grep
```

**Solution:**

- If the other instance is healthy and should run the scheduler: this instance must wait
- If the other instance has crashed: `DELETE FROM scheduler_lock;` then restart this instance
- If running a single instance: ensure no other instances are running before startup

---

## Rate Limiting

### Getting 429 Too Many Requests

**Symptom:** API returns HTTP 429 with `rate_limit_exceeded` error code.

**Cause:** You have exceeded the per-IP rate limit for a specific operation.

**Diagnosis:**

1. Check the `Retry-After` header in the response, which tells you how many seconds to wait
2. Look for the log event `*_rate_limit_exceeded`, which shows the bucket and client IP

**Rate limit buckets:**

| Bucket | Limit | Window | Operations |
|--------|-------|--------|------------|
| `bans:ban` | 100 | 1 minute | Ban IP addresses |
| `bans:unban` | 100 | 1 minute | Unban IP addresses |
| `blocklist:import` | 10 | 1 hour | Import blocklists |
| `config:update` | 50 | 1 minute | Update configuration |
| `jail:update` | 100 | 1 minute | Update jail config |
| `jail:create` | 100 | 1 minute | Add log paths, assign filters/actions |
| `jail:delete` | 100 | 1 minute | Remove log paths, actions |
| `jail:activate` | 100 | 1 minute | Activate jails |
| `jail:deactivate` | 100 | 1 minute | Deactivate jails |
| `filter:update` | 50 | 1 minute | Update filters |
| `filter:create` | 50 | 1 minute | Create filters |
| `filter:delete` | 50 | 1 minute | Delete filters |
| `action:update` | 50 | 1 minute | Update actions |
| `action:create` | 50 | 1 minute | Create actions |
| `action:delete` | 50 | 1 minute | Delete actions |

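
A client can honor the `Retry-After` header mentioned in the diagnosis steps rather than retrying immediately. The following is a minimal Python sketch, not BanGUI client code: `retry_after_seconds`, `call_with_retry`, and the `send` callable (which is assumed to return a `(status, headers)` pair) are hypothetical names introduced for illustration, and the header is assumed to carry a whole number of seconds.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return the wait time from a Retry-After header given in whole seconds."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after", "").strip()
    try:
        return float(max(0, int(raw)))
    except ValueError:
        # Missing header or a non-integer value: fall back to the default.
        return default


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on HTTP 429 wait for Retry-After and try again.

    `send` is a hypothetical callable that performs one request and
    returns (status_code, headers).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for _ in range(max_attempts - 1):
        status, headers = response
        if status != 429:
            break
        sleep(retry_after_seconds(headers))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; in normal use the default `time.sleep` applies.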

**Solution:**

1. Wait for the `Retry-After` period before retrying
2. If you hit the limit during legitimate bulk operations, consider batching requests
3. For blocklist imports (10/hour), ensure automated imports are not more frequent

**Prevention:**

- Monitor `*_rate_limit_exceeded` log events
- Adjust limits via environment variables if needed (see `Docs/CONFIGURATION.md`)
- For bulk operations, implement client-side throttling

**Note:** If rate limiting triggers unexpectedly for legitimate use, check for:

- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)

---

## Database Migration Failures

### Application Won't Start After Upgrade

**Symptom:** Application fails to start. Logs show migration errors.

**Cause:** Migration failed mid-transaction. Database left in inconsistent state.

**Diagnosis:**

```bash
# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

# Check logs for specific error
grep -i migration /var/log/bangui.log
```

**Solution:**

1. **If the migration was auto-rolled back**: Startup will retry the same migration. Start the application again.
2. **If the migration keeps failing**: Check whether the table already exists:
   ```bash
   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='';"
   ```
   If it exists, manually insert the migration record:
   ```bash
   sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (?);"
   ```
3. **Full database reset** (development only):
   ```bash
   rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
   ```

**Prevention:**

- Always back up before upgrades: `cp bangui.db bangui.db.backup`
- Never manually modify the database schema
- Monitor `migrating_database_schema` log events during upgrades

---

### Schema Version Mismatch

**Symptom:** Error: "database schema version X is newer than supported version Y"

**Cause:** Downgraded to an older BanGUI version that doesn't support the current schema.

**Solution:** Upgrade to a version compatible with the current schema, or restore from backup.

---

## 502 Bad Gateway Errors

### Symptom: Nginx returns 502 Bad Gateway

**Cause:** The backend container is unreachable: it is down, restarting, or not yet healthy.

**Diagnosis:**

```bash
# Check backend container status
docker ps -a | grep bangui-backend

# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health

# Check backend logs
docker logs bangui-backend --tail 50
```

**Common causes and solutions:**

| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | `docker ps` shows backend repeatedly restarting | Check health check timing; may need longer `start_period` |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | `start_period: 40s` not enough on slow hosts | Increase `start_period` in compose file |
| Port misconfiguration | `expose` vs `ports` mismatch | Ensure backend exposes 8000 and frontend proxies to it |

**Prevention:**

- The `depends_on: condition: service_healthy` ensures the backend is fully started before the frontend proxies requests.
- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
- Health check parameters are tuned for typical startup time; adjust `start_period` if the host is slow or resource-constrained.

---

## Graceful Shutdown Issues

### Container Killed Before Tasks Complete

**Symptom:** Logs show `pending_tasks_timeout` and tasks are cancelled mid-execution.

**Cause:** Docker's `stop_grace_period` is too short, or tasks take longer than the 25s graceful timeout.

**Diagnosis:**

```bash
# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
```

**Solution:**

1. Increase `stop_grace_period` in `docker-compose.yml`:
   ```yaml
   backend:
     stop_grace_period: 60s
   ```
2. The Python graceful timeout is 25s, so `stop_grace_period` must stay above that value to leave a margin before Docker sends SIGKILL
3. If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully

### Scheduler Lock Not Released

**Symptom:** After container restart, logs show `Could not acquire scheduler lock`.

**Cause:** Previous instance shut down without releasing the lock, or lock TTL hasn't expired.

**Diagnosis:**

```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```

**Solution:**

```bash
# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container
```

**Prevention:**

- Graceful shutdown releases the lock immediately (not waiting for TTL expiry)
- Monitor logs for `scheduler_lock_released` on clean shutdown

### In-Flight Requests Dropped

**Symptom:** Client connections closed abruptly during shutdown.

**Cause:** The graceful timeout is too short, or clients are not configured to retry.

**Solution:**

1. Ensure clients implement proper retry logic with backoff
2. For critical operations, use background tasks with status polling
3. Increase graceful timeout if network latency is high

---

## General Recovery Commands

Clear all locks:

```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```

Check lock status:

```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```

Verify database integrity:

```bash
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
```

---

## Getting Help

If issues persist after following this guide:

1. Enable debug logging: `BANGUI_LOG_LEVEL=debug`
2. Collect logs around the failure time
3. Check `Docs/Deployment.md` for configuration guidance
4. Check `Docs/Observability.md` for monitoring setup
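
When collecting material for a help request, a snapshot of the lock row and the integrity check can sit alongside the logs. The sketch below is one possible way to gather it in Python and is not part of BanGUI: `collect_diagnostics` is a hypothetical helper, and it assumes `heartbeat_at` holds Unix epoch seconds (as the `unixepoch()` diagnostic in this guide suggests). The 5-minute staleness threshold comes from the orphaned-lock section; whether the PID still exists must be checked separately with `ps`.

```python
import sqlite3
import time
from pathlib import Path
from typing import Any, Dict, Optional

# 5 minutes, the orphaned-lock threshold used in this guide.
STALE_AFTER_SECONDS = 5 * 60


def collect_diagnostics(db_path: str, now: Optional[float] = None) -> Dict[str, Any]:
    """Read the scheduler lock rows and the integrity check without writing.

    Assumes heartbeat_at is stored as Unix epoch seconds.
    """
    now = time.time() if now is None else now
    # Open read-only so a mistyped path cannot create an empty database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
        ).fetchall()
        integrity = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()

    locks = []
    for pid, hostname, heartbeat_at in rows:
        age = now - float(heartbeat_at)
        locks.append(
            {
                "pid": pid,
                "hostname": hostname,
                "heartbeat_age_seconds": age,
                "stale": age > STALE_AFTER_SECONDS,
            }
        )
    return {"locks": locks, "integrity_check": integrity}


if __name__ == "__main__":
    print(collect_diagnostics("/var/lib/bangui/bangui.db"))
```

A healthy database reports `["ok"]` under `integrity_check`; a lock entry marked `stale` is only a candidate for removal until the owning process is confirmed gone.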