Lukas b631c1c546 feat(backend): implement graceful shutdown for container stop
Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00


# Troubleshooting Guide
## Scheduler Lock Issues
### Lock Held by Crashed Instance (Orphaned Lock)
**Symptom:** Background tasks stop running. Logs show `scheduler_lock_held_by_other_instance` but no other instance is running.
**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
```
If `heartbeat_at` is older than 5 minutes and the PID no longer exists, the lock is orphaned.
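The orphan rule above can be sketched in Python. This is a minimal illustration, not BanGUI code: the helper names are hypothetical, and it assumes `heartbeat_at` holds Unix epoch seconds and that the lock's `hostname` matches the machine running the check (a PID lookup says nothing about a process on another host).

```python
import os
import time
from typing import Callable, Optional

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute threshold from the guide


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_orphaned(
    heartbeat_at: float,
    pid: int,
    now: Optional[float] = None,
    pid_alive: Callable[[int], bool] = pid_exists,
) -> bool:
    """Orphaned means: heartbeat older than the threshold AND no such process."""
    now = time.time() if now is None else now
    stale = (now - heartbeat_at) > STALE_AFTER_SECONDS
    return stale and not pid_alive(pid)
```

Both conditions must hold before the lock is deleted; a stale heartbeat with a live PID points to slow database writes rather than a crash.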
**Recovery:**
```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```
Restart the backend. It will acquire the lock fresh.
**Prevention:**
- Monitor `scheduler_lock_heartbeat_lost` events in logs
- If the event occurs more than 3 times per hour, investigate database I/O performance
---
### Two Instances Both Running Scheduler
**Symptom:** Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.
**Cause:** Both instances believe they hold the lock.
**Diagnosis:**
1. Check which instance holds the lock: `SELECT pid, hostname FROM scheduler_lock;`
2. Compare with running processes: `ps aux | grep bangui`
**Solution:**
1. Stop one instance immediately
2. Clear lock: `DELETE FROM scheduler_lock;`
3. Restart the remaining instance
**Prevention:**
- Start instances one at a time, so that the first has acquired the lock and begun its heartbeat before the next one starts
- Check `BANGUI_SINGLE_INSTANCE=true` is set if single-instance operation is required
---
### Heartbeat Update Failures
**Symptom:** Logs show `scheduler_lock_heartbeat_lost` repeatedly, then lock is lost.
**Cause:** Database writes failing or extremely slow (>5 seconds per write).
**Diagnosis:**
```bash
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
```
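If this takes >1 second, database I/O is degraded. Note that this command rewrites the heartbeat of the live lock row, and `unixepoch()` requires SQLite 3.38.0 or later.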
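Write latency can also be measured without touching the live lock row. The sketch below times one committed write to a scratch database created in the same directory as the real one; the function name and the `probe` table are hypothetical and exist only for this illustration.

```python
import os
import sqlite3
import tempfile
import time


def time_sqlite_write(directory: str) -> float:
    """Return the seconds taken by one committed UPDATE in a scratch DB in `directory`."""
    fd, path = tempfile.mkstemp(suffix=".db", dir=directory)
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        try:
            conn.execute("CREATE TABLE probe (heartbeat_at INTEGER)")
            conn.execute("INSERT INTO probe VALUES (0)")
            conn.commit()
            start = time.perf_counter()
            conn.execute("UPDATE probe SET heartbeat_at = ?", (int(time.time()),))
            conn.commit()  # the commit is what pushes the change to storage
            return time.perf_counter() - start
        finally:
            conn.close()
    finally:
        os.remove(path)  # always remove the scratch file
```

Calling `time_sqlite_write("/var/lib/bangui")` exercises the same storage as the real database, so the 1-second guideline above still applies to the result.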
**Solution:**
1. Check database integrity: `sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"`
2. Move database to faster storage (SSD)
3. Check for other I/O bottlenecks on the host
---
### Lock Not Acquired at Startup
**Symptom:** Instance fails to start with error "Could not acquire scheduler lock".
**Cause:** Another instance already holds the lock and appears healthy.
**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>   # replace <pid> with the value returned by the query above
```
**Solution:**
- If the other instance is healthy and should run the scheduler, this instance must wait
- If the other instance has crashed, run `DELETE FROM scheduler_lock;` and then restart this instance
- If you run a single instance, ensure no other instances are running before startup
---
## Rate Limiting
### Getting 429 Too Many Requests
**Symptom:** API returns HTTP 429 with `rate_limit_exceeded` error code.
**Cause:** You have exceeded the per-IP rate limit for a specific operation.
**Diagnosis:**
1. Check the `Retry-After` header in the response — this tells you how many seconds to wait
2. Look for the log event `*_rate_limit_exceeded` which shows the bucket and client IP
**Rate limit buckets:**
| Bucket | Limit | Window | Operations |
|--------|-------|--------|------------|
| `bans:ban` | 100 | 1 minute | Ban IP addresses |
| `bans:unban` | 100 | 1 minute | Unban IP addresses |
| `blocklist:import` | 10 | 1 hour | Import blocklists |
| `config:update` | 50 | 1 minute | Update configuration |
| `jail:update` | 100 | 1 minute | Update jail config |
| `jail:create` | 100 | 1 minute | Add log paths, assign filters/actions |
| `jail:delete` | 100 | 1 minute | Remove log paths, actions |
| `jail:activate` | 100 | 1 minute | Activate jails |
| `jail:deactivate` | 100 | 1 minute | Deactivate jails |
| `filter:update` | 50 | 1 minute | Update filters |
| `filter:create` | 50 | 1 minute | Create filters |
| `filter:delete` | 50 | 1 minute | Delete filters |
| `action:update` | 50 | 1 minute | Update actions |
| `action:create` | 50 | 1 minute | Create actions |
| `action:delete` | 50 | 1 minute | Delete actions |
**Solution:**
1. Wait for the `Retry-After` period before retrying
2. If you hit the limit during legitimate bulk operations, consider batching requests
3. For blocklist imports (10/hour), ensure automated imports are not more frequent
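A client can honour `Retry-After` along these lines. This is a sketch using only the standard library; `retry_after_seconds` and `request_with_retry` are hypothetical names, and it assumes the header carries a number of seconds, as described above.

```python
import time
import urllib.error
import urllib.request


def retry_after_seconds(headers, default: float = 1.0) -> float:
    """Read Retry-After as a number of seconds, falling back to `default`."""
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default  # header missing, or given as an HTTP date


def request_with_retry(req, max_attempts: int = 3,
                       sleep=time.sleep, opener=urllib.request.urlopen) -> bytes:
    """Send `req`, waiting out HTTP 429 responses up to `max_attempts` tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only 429 is retried; the final attempt re-raises as well.
            if err.code != 429 or attempt == max_attempts:
                raise
            sleep(retry_after_seconds(err.headers))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` and `opener` parameters exist so the behaviour can be exercised in tests without real waiting or network access.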
**Prevention:**
- Monitor `*_rate_limit_exceeded` log events
- Adjust limits via environment variables if needed (see `Docs/CONFIGURATION.md`)
- For bulk operations, implement client-side throttling
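Client-side throttling can be as simple as spacing calls evenly. The `Throttle` class below is a hypothetical helper, not part of BanGUI; it keeps the long-run rate at or below `limit` calls per `window` seconds.

```python
import time


class Throttle:
    """Space calls evenly so the long-run rate stays at `limit` per `window` seconds."""

    def __init__(self, limit: int, window: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = window / limit  # minimum gap between two calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def wait(self) -> float:
        """Block until the next call is allowed and return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot relative to when this call actually runs.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay
```

For the `bans:ban` bucket (100 per minute), `Throttle(90, 60.0)` leaves some headroom; call `wait()` before each request in the bulk loop.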
**Note:** If rate limiting triggers unexpectedly for legitimate use, check for:
- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)
---
## Database Migration Failures
### Application Won't Start After Upgrade
**Symptom:** Application fails to start. Logs show migration errors.
**Cause:** Migration failed mid-transaction. Database left in inconsistent state.
**Diagnosis:**
```bash
# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
# Check logs for specific error
grep -i migration /var/log/bangui.log
```
**Solution:**
1. **If migration was auto-rolled back**: Startup will retry the same migration. Run application again.
2. **If migration keeps failing**: Check whether the table already exists:
   ```bash
   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
   ```
   If it exists, manually insert the migration record, replacing `<version>` with the version number of the failing migration:
   ```bash
   sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
   ```
3. **Full database reset** (development only; this deletes all data):
   ```bash
   rm -f /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
   ```
**Prevention:**
- Always back up before upgrades: `cp bangui.db bangui.db.backup`
- Never manually modify database schema
- Monitor `migrating_database_schema` log events during upgrades
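When the application is still running, a plain file copy can miss changes that are only in the `-wal` file. As an alternative, the backup can go through SQLite's online backup API; the sketch below uses Python's standard `sqlite3` module, and `backup_database` is a hypothetical helper name.

```python
import sqlite3


def backup_database(src_path: str, dest_path: str) -> str:
    """Copy a SQLite database to dest_path and return the copy's integrity_check result."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # online backup API, available since Python 3.7
        # A healthy copy reports the single value "ok".
        return dest.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dest.close()
        src.close()
```

For example, `backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup")` should return `"ok"` before an upgrade proceeds.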
---
### Schema Version Mismatch
**Symptom:** Error: "database schema version X is newer than supported version Y"
**Cause:** Downgraded to older BanGUI version that doesn't support current schema.
**Solution:** Upgrade to a version compatible with the current schema, or restore from backup.
---
## 502 Bad Gateway Errors
### Nginx Returns 502 Bad Gateway
**Symptom:** Nginx returns 502 Bad Gateway.
**Cause:** The backend container is unreachable — either down, restarting, or not yet healthy.
**Diagnosis:**
```bash
# Check backend container status
docker ps -a | grep bangui-backend
# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health
# Check backend logs
docker logs bangui-backend --tail 50
```
**Common causes and solutions:**
| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | `docker ps` shows backend repeatedly restarting | Check health check timing; may need longer `start_period` |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | `start_period: 40s` not enough on slow hosts | Increase `start_period` in compose file |
| Port misconfiguration | `expose` vs `ports` mismatch | Ensure backend exposes 8000 and frontend proxies to it |
**Prevention:**
- The `depends_on` setting with `condition: service_healthy` ensures the backend is healthy before the frontend starts proxying requests.
- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
- Health check parameters are tuned for typical startup time — adjust `start_period` if the host is slow or resource-constrained.
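Deployment scripts can apply the same idea by polling the health endpoint before sending traffic. The sketch below is illustrative only: `wait_until_healthy` and `http_probe` are hypothetical names, the URL is the one used in the diagnosis commands above, and the 40-second default mirrors the `start_period` from the table.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://bangui-backend:8000/api/v1/health") -> bool:
    """Return True if the health endpoint answers 200; errors propagate as OSError."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200


def wait_until_healthy(probe: Callable[[], bool], timeout: float = 40.0,
                       interval: float = 2.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused, DNS failure, or an HTTP error such as 503
        if clock() >= deadline:
            return False
        sleep(interval)
```

`urllib.error.HTTPError` is a subclass of `OSError`, so a 503 from the health check counts as "not healthy yet" rather than aborting the wait.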
---
## Graceful Shutdown Issues
### Container Killed Before Tasks Complete
**Symptom:** Logs show `pending_tasks_timeout` and tasks are cancelled mid-execution.
**Cause:** Docker's `stop_grace_period` is too short, or tasks take longer than the 25s graceful timeout.
**Diagnosis:**
```bash
# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = 128 + 9, meaning the process was ended by SIGKILL
```
**Solution:**
1. Increase `stop_grace_period` in `docker-compose.yml`:
   ```yaml
   backend:
     stop_grace_period: 60s
   ```
2. The Python graceful timeout is 25s, which leaves a 5s margin before Docker sends SIGKILL at the 30s `stop_grace_period` set in the compose files
3. If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully
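The drain-then-cancel pattern, and a task that handles cancellation gracefully, can be sketched with `asyncio`. This is an illustration under assumptions, not the code in `backend/app/main.py`; `drain` and `long_job` are hypothetical names.

```python
import asyncio
from typing import Iterable

GRACEFUL_TIMEOUT = 25.0  # seconds, the Python timeout quoted in this guide


async def drain(tasks: Iterable[asyncio.Task], timeout: float = GRACEFUL_TIMEOUT) -> int:
    """Wait up to `timeout` for tasks, cancel the rest, and return how many were cancelled."""
    pending_set = set(tasks)
    if not pending_set:
        return 0  # asyncio.wait() rejects an empty set
    done, pending = await asyncio.wait(pending_set, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks run their cleanup handlers before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def long_job(log: list) -> None:
    """A long-running task that cleans up when cancelled, then re-raises."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        log.append("cleaned up")  # release resources or persist progress here
        raise
```

Re-raising `CancelledError` after cleanup matters: swallowing it would make the task look as if it had finished normally.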
### Scheduler Lock Not Released
**Symptom:** After container restart, logs show `Could not acquire scheduler lock`.
**Cause:** Previous instance shut down without releasing the lock, or lock TTL hasn't expired.
**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```
**Solution:**
```bash
# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart the container
docker restart bangui-backend
```
**Prevention:**
- Graceful shutdown releases lock immediately (not waiting for TTL expiry)
- Monitor logs for `scheduler_lock_released` on clean shutdown
### In-Flight Requests Dropped
**Symptom:** Client connections closed abruptly during shutdown.
**Cause:** Too short a graceful timeout, or clients not configured to retry.
**Solution:**
1. Ensure clients implement proper retry logic with backoff
2. For critical operations, use background tasks with status polling
3. Increase graceful timeout if network latency is high
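Retry logic with backoff on the client side might look like the following. This is a sketch with hypothetical names (`backoff_delays`, `call_with_retry`); it uses capped exponential delays with full jitter, which is one common choice rather than anything BanGUI prescribes.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng=random.random):
    """Yield `attempts` delays: a random fraction of min(cap, base * 2**n)."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retry(fn, attempts: int = 5, retry_on=(ConnectionError,),
                    sleep=time.sleep):
    """Call `fn` up to `attempts` times, sleeping with backoff between failures."""
    delays = backoff_delays(attempts - 1)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

Only retry operations that are safe to repeat; for anything else, the background task with status polling suggested above is the better fit.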
---
## General Recovery Commands
Clear all locks:
```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```
Check lock status:
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```
Verify database integrity:
```bash
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
```
---
## Getting Help
If issues persist after following this guide:
1. Enable debug logging: `BANGUI_LOG_LEVEL=debug`
2. Collect logs around the failure time
3. Check `Docs/Deployment.md` for configuration guidance
4. Check `Docs/Observability.md` for monitoring setup