# Troubleshooting Guide

## Scheduler Lock Issues

### Lock Held by Crashed Instance (Orphaned Lock)

**Symptom:** Background tasks stop running. Logs show `scheduler_lock_held_by_other_instance` but no other instance is running.

**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
```

If `heartbeat_at` is older than 5 minutes and the PID no longer exists on the listed host, the lock is orphaned.
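
The two conditions above can be checked together in a short script. This is a minimal sketch, not BanGUI code: it assumes `heartbeat_at` is stored as Unix epoch seconds and that the script runs on the host named in the `hostname` column. The helper names `pid_exists` and `is_lock_orphaned` are hypothetical.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute threshold from the rule above


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_lock_orphaned(heartbeat_at: float, pid: int, now: Optional[float] = None) -> bool:
    """Apply both conditions: a stale heartbeat and a PID that is gone."""
    now = time.time() if now is None else now
    stale = (now - heartbeat_at) > STALE_AFTER_SECONDS
    return stale and not pid_exists(pid)
```

Feed it the `pid` and `heartbeat_at` values returned by the query above. A `True` result means both conditions hold and the recovery steps apply.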

**Recovery:**
```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```

Then restart the backend. It acquires a fresh lock on startup.

**Prevention:**
- Monitor `scheduler_lock_heartbeat_lost` events in logs
- If there are more than 3 occurrences per hour, investigate database I/O performance

---

### Two Instances Both Running Scheduler

**Symptom:** Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

**Cause:** Both instances believe they hold the lock.

**Diagnosis:**
1. Check which instance holds the lock: `SELECT pid, hostname FROM scheduler_lock;`
2. Compare with running processes: `ps aux | grep bangui`

**Solution:**
1. Stop one instance immediately
2. Clear the lock: `DELETE FROM scheduler_lock;`
3. Restart the remaining instance

**Prevention:**
- Avoid starting two instances at the same time; let the first instance begin its heartbeat before another one starts
- Check that `BANGUI_SINGLE_INSTANCE=true` is set if single-instance operation is required

---

### Heartbeat Update Failures

**Symptom:** Logs show `scheduler_lock_heartbeat_lost` repeatedly, then the lock is lost.

**Cause:** Database writes are failing or extremely slow (>5 seconds per write).

**Diagnosis:**
```bash
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
```

If this takes >1 second, database I/O is degraded.
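
The shell command above also refreshes `heartbeat_at`, which can make an orphaned lock look healthy for a while. As an alternative, the sketch below times a committed write that rewrites each heartbeat with its current value. It is an illustration under assumptions, not BanGUI code: `time_heartbeat_write` is a hypothetical helper, and if `scheduler_lock` is empty no row is written, so the timing is not meaningful.

```python
import sqlite3
import time


def time_heartbeat_write(db_path: str) -> float:
    """Return the seconds taken by one committed write to scheduler_lock."""
    conn = sqlite3.connect(db_path, timeout=10)
    try:
        start = time.perf_counter()
        # Rewrite each heartbeat with its own value: this exercises the
        # write and commit path without changing the stored timestamp.
        conn.execute("UPDATE scheduler_lock SET heartbeat_at = heartbeat_at")
        conn.commit()
        return time.perf_counter() - start
    finally:
        conn.close()
```

Compare the returned value with the 1-second threshold above, for example `time_heartbeat_write("/var/lib/bangui/bangui.db") > 1.0`.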

**Solution:**
1. Check database integrity: `sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"`
2. Move the database to faster storage (SSD)
3. Check for other I/O bottlenecks on the host

---

### Lock Not Acquired at Startup

**Symptom:** Instance fails to start with error "Could not acquire scheduler lock".

**Cause:** Another instance already holds the lock and appears healthy.

**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps aux | grep <pid>
```

**Solution:**
- If the other instance is healthy and should run the scheduler: this instance must wait
- If the other instance has crashed: `DELETE FROM scheduler_lock;` then restart this instance
- If running a single instance: ensure no other instances are running before startup

---

## Rate Limiting

### Getting 429 Too Many Requests

**Symptom:** API returns HTTP 429 with `rate_limit_exceeded` error code.

**Cause:** You have exceeded the per-IP rate limit for a specific operation.

**Diagnosis:**
1. Check the `Retry-After` header in the response, which tells you how many seconds to wait
2. Look for the log event `*_rate_limit_exceeded`, which shows the bucket and client IP

**Rate limit buckets:**

| Bucket | Limit | Window | Operations |
|--------|-------|--------|------------|
| `bans:ban` | 100 | 1 minute | Ban IP addresses |
| `bans:unban` | 100 | 1 minute | Unban IP addresses |
| `blocklist:import` | 10 | 1 hour | Import blocklists |
| `config:update` | 50 | 1 minute | Update configuration |
| `jail:update` | 100 | 1 minute | Update jail config |
| `jail:create` | 100 | 1 minute | Add log paths, assign filters/actions |
| `jail:delete` | 100 | 1 minute | Remove log paths, actions |
| `jail:activate` | 100 | 1 minute | Activate jails |
| `jail:deactivate` | 100 | 1 minute | Deactivate jails |
| `filter:update` | 50 | 1 minute | Update filters |
| `filter:create` | 50 | 1 minute | Create filters |
| `filter:delete` | 50 | 1 minute | Delete filters |
| `action:update` | 50 | 1 minute | Update actions |
| `action:create` | 50 | 1 minute | Create actions |
| `action:delete` | 50 | 1 minute | Delete actions |

**Solution:**
1. Wait for the `Retry-After` period before retrying
2. If you hit the limit during legitimate bulk operations, consider batching requests
3. For blocklist imports (10/hour), ensure automated imports run no more often than that
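
Step 1 can be automated on the client side. The sketch below is one possible shape, not part of BanGUI: `call_with_retry_after` is a hypothetical helper, and `send` stands for whatever function performs the request and returns the status code and headers. It assumes `Retry-After` carries a number of seconds and that the header key is spelled exactly that way in the mapping.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def call_with_retry_after(
    send: Callable[[], Response],
    max_attempts: int = 3,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on HTTP 429 wait for Retry-After seconds, then retry."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers = response
        if status != 429:
            break
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait  # header was not a plain number of seconds
        sleep(max(wait, 0.0))
        response = send()
    return response
```

The `sleep` parameter exists so the wait can be replaced in tests. After `max_attempts` tries the last response is returned as is, so the caller still sees a final 429.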

**Prevention:**
- Monitor `*_rate_limit_exceeded` log events
- Adjust limits via environment variables if needed (see `Docs/CONFIGURATION.md`)
- For bulk operations, implement client-side throttling
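
Client-side throttling for bulk operations could look like the following sketch. `SlidingWindowThrottle` is a hypothetical class, and the sliding-window model is an assumption: this guide does not state how the server counts its windows, so choosing a limit somewhat below the bucket limit leaves headroom.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowThrottle:
    """Allow at most `limit` calls within any `window` seconds."""

    def __init__(
        self,
        limit: int,
        window: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.limit = limit
        self.window = window
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of recent calls

    def _drop_expired(self, now: float) -> None:
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()

    def wait(self) -> None:
        """Block until another request may be sent, then record it."""
        now = self._clock()
        self._drop_expired(now)
        if len(self._sent) >= self.limit:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self._sent[0] + self.window - now)
            now = self._clock()
            self._drop_expired(now)
        self._sent.append(now)
```

For the `bans:ban` bucket (100 per minute), a client might create `SlidingWindowThrottle(limit=90, window=60)` and call `wait()` before each request.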

**Note:** If rate limiting triggers unexpectedly for legitimate use, check for:
- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)

---

## Database Migration Failures

### Application Won't Start After Upgrade

**Symptom:** Application fails to start. Logs show migration errors.

**Cause:** A migration failed mid-transaction and left the database in an inconsistent state.

**Diagnosis:**
```bash
# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

# Check logs for specific error
grep -i migration /var/log/bangui.log
```

**Solution:**

1. **If migration was auto-rolled back**: Startup will retry the same migration. Start the application again.
2. **If migration keeps failing**: Check whether the table the migration creates already exists:
   ```bash
   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
   ```
   If it exists, manually insert the migration record, replacing `<version>` with the version of the failing migration:
   ```bash
   sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
   ```
3. **Full database reset** (development only):
   ```bash
   rm -f /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
   ```
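
Step 2 can also be scripted so that the check and the insert always happen together. This is a hedged sketch rather than BanGUI's migration code: `table_exists` and `record_migration_if_applied` are hypothetical helpers, and it assumes `schema_migrations` needs only a `version` value for a new row.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if a table with this name is present in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def record_migration_if_applied(
    conn: sqlite3.Connection, table: str, version: int
) -> bool:
    """Insert the migration record only when the migration's table exists."""
    if not table_exists(conn, table):
        return False
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
        )
    return True
```

Unlike the shell command, the Python `sqlite3` module binds the `?` placeholder to a real value, so table names and versions are passed as parameters instead of being pasted into the SQL string.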

**Prevention:**
- Always back up before upgrades: `cp bangui.db bangui.db.backup`
- Never manually modify the database schema
- Monitor `migrating_database_schema` log events during upgrades
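
A plain `cp` of a database that is still in use can miss changes held in the `-wal` file. If the application cannot be stopped first, SQLite's online backup API is an alternative. The sketch below uses Python's standard `sqlite3.Connection.backup`; `backup_database` is a hypothetical helper and the paths are placeholders.

```python
import sqlite3


def backup_database(source_path: str, backup_path: str) -> None:
    """Copy a SQLite database using the online backup API."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # Copies a consistent snapshot of the source, WAL content included.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

For example: `backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup")`.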

---

### Schema Version Mismatch

**Symptom:** Error: "database schema version X is newer than supported version Y"

**Cause:** BanGUI was downgraded to an older version that does not support the current schema.

**Solution:** Upgrade to a version compatible with the current schema, or restore from backup.

---

## 502 Bad Gateway Errors

### Symptom: Nginx returns 502 Bad Gateway

**Cause:** The backend container is unreachable: it is down, restarting, or not yet healthy.

**Diagnosis:**

```bash
# Check backend container status
docker ps -a | grep bangui-backend

# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health

# Check backend logs
docker logs bangui-backend --tail 50
```

**Common causes and solutions:**

| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | `docker ps` shows backend repeatedly restarting | Check health check timing; may need longer `start_period` |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | `start_period: 40s` is not enough on slow hosts | Increase `start_period` in compose file |
| Port misconfiguration | `expose` vs `ports` mismatch | Ensure backend exposes 8000 and frontend proxies to it |

**Prevention:**

- The `depends_on` setting with `condition: service_healthy` ensures the backend is fully started before the frontend proxies requests.
- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
- Health check parameters are tuned for typical startup time. Adjust `start_period` if the host is slow or resource-constrained.

---

## Graceful Shutdown Issues

### Container Killed Before Tasks Complete

**Symptom:** Logs show `pending_tasks_timeout` and tasks are cancelled mid-execution.

**Cause:** Docker's `stop_grace_period` is too short, or tasks take longer than the 25s graceful timeout.

**Diagnosis:**
```bash
# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
```
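
Exit code 137 follows the shell convention of 128 plus the signal number (SIGKILL is signal 9). The helper below is a small illustration of that arithmetic; `signal_from_exit_code` is a hypothetical name. Note that SIGKILL can also come from the kernel's out-of-memory killer, so 137 alone does not prove the grace period expired.

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name for a 128 + N exit code, or None otherwise."""
    if exit_code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform
```

On Linux, `signal_from_exit_code(137)` gives `"SIGKILL"` and `signal_from_exit_code(143)` gives `"SIGTERM"`, the signal Docker sends first when stopping a container.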

**Solution:**
1. Increase `stop_grace_period` in `docker-compose.yml`:
   ```yaml
   backend:
     stop_grace_period: 60s
   ```
2. Note that the Python graceful timeout is 25s; `stop_grace_period` must stay above it to leave margin before Docker kills the container
3. If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully
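
The pattern behind points 2 and 3 can be sketched with plain `asyncio`. This is not BanGUI's shutdown code: `drain_tasks` and `cancellable_job` are hypothetical names, and only the 25-second default comes from this guide.

```python
import asyncio
from typing import Iterable, List


async def drain_tasks(tasks: Iterable[asyncio.Task], timeout: float = 25.0) -> int:
    """Wait up to `timeout` seconds, then cancel whatever is still running.

    Returns the number of tasks that had to be cancelled.
    """
    tasks = list(tasks)
    if not tasks:
        return 0
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to run their cleanup code.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def cancellable_job(log: List[str]) -> None:
    """A long-running task that handles cancellation gracefully."""
    try:
        await asyncio.sleep(3600)  # stands in for real work
    except asyncio.CancelledError:
        log.append("cleaned up")  # release resources here, then re-raise
        raise
```

A task written like `cancellable_job` finishes its cleanup inside the graceful window, so the process can exit before Docker's `stop_grace_period` runs out.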

### Scheduler Lock Not Released

**Symptom:** After container restart, logs show `Could not acquire scheduler lock`.

**Cause:** The previous instance shut down without releasing the lock, or the lock TTL hasn't expired.

**Diagnosis:**
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```

**Solution:**
```bash
# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container
docker restart bangui-backend
```

**Prevention:**
- Graceful shutdown releases the lock immediately (not waiting for TTL expiry)
- Monitor logs for `scheduler_lock_released` on clean shutdown

### In-Flight Requests Dropped

**Symptom:** Client connections closed abruptly during shutdown.

**Cause:** The graceful timeout is too short, or clients are not configured to retry.

**Solution:**
1. Ensure clients implement proper retry logic with backoff
2. For critical operations, use background tasks with status polling
3. Increase the graceful timeout if network latency is high

---

## General Recovery Commands

Clear all locks:
```bash
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
```

Check lock status:
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
```

Verify database integrity:
```bash
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
```

---

## Regex Pattern Rejected

### Symptom: Filter or action configuration fails with "Invalid regex" error

**Cause:** The regex pattern is either syntactically invalid or detected as a ReDoS (Regular Expression Denial of Service) vulnerability.

**Diagnosis:**
1. Check the error message, which indicates whether the pattern is syntactically invalid or flagged as dangerous
2. Look for log events: `regex_redos_detected` or `regex_compilation_timeout`

**Common ReDoS patterns that are rejected:**

| Pattern | Problem |
|---------|---------|
| `(a+)+b` | Nested quantifiers with overlap |
| `([a-z]+)*d` | Quantifier inside quantifier |
| `(x+)+y` | Nested plus operators |

**Solution:**
1. Rewrite the pattern to avoid nested quantifiers on overlapping groups
2. Use atomic groups or possessive quantifiers where possible: `(?>a+)+b`
3. Simplify complex alternations
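
As an illustration of step 1, the rejected pattern `(a+)+b` matches the same strings as the flat pattern `a+b`, because repeating "one or more `a`" one or more times is still just "one or more `a`". The sketch below checks that equivalence on a few short samples with Python's `re` module. It does not reproduce BanGUI's own ReDoS detection, and the rewrite drops the capture group, which matters if group 1 is used elsewhere.

```python
import re

NESTED = re.compile(r"(a+)+b")  # rejected form: quantified group inside a quantifier
FLAT = re.compile(r"a+b")       # flat rewrite without nesting

# Short inputs only: the nested form backtracks heavily on long non-matches.
samples = ["ab", "aaab", "b", "aaa", "aabb", ""]

agree = all(
    bool(NESTED.fullmatch(s)) == bool(FLAT.fullmatch(s)) for s in samples
)
```

If the atomic-group form from step 2 is preferred, note that Python's `re` module only supports `(?>...)` and possessive quantifiers from version 3.11 onward.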

**Prevention:**
- Test regex patterns in isolation before deploying
- Avoid patterns with quantified groups inside other quantifiers
- Prefer explicit character classes over `.*` where possible
- Use [regexploit](https://github.com/doyensec/regexploit) to audit patterns

---

## Configuration Validation at Startup

BanGUI validates configuration at startup. Errors raised here indicate misconfiguration that must be fixed before the application can start.

### Database Parent Directory Does Not Exist

**Symptom:** Application fails to start with: `Database parent directory does not exist: /path/to/parent`

**Cause:** The parent directory of `BANGUI_DATABASE_PATH` does not exist.

**Solution:**
```bash
mkdir -p /path/to/parent
# Then restart BanGUI
```

---

### Database Parent Directory Not Writable

**Symptom:** Application fails to start with: `Database parent directory not writable: /path/to/parent`

**Cause:** The process cannot write to the database parent directory.

**Solution:**
```bash
chmod 755 /path/to/parent
# 755 grants write access to the owner only, so verify that the user
# running BanGUI owns the directory or otherwise has write access
ls -ld /path/to/parent
```
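
Both directory checks can be reproduced before starting the application. The sketch below is an approximation, not BanGUI's actual validation code: `check_database_parent` is a hypothetical helper that reuses the error wording quoted above, and it must run as the same user as BanGUI for the result to be meaningful.

```python
import os
from pathlib import Path
from typing import Optional


def check_database_parent(database_path: str) -> Optional[str]:
    """Return an error message mirroring the startup checks, or None if OK."""
    parent = Path(database_path).parent
    if not parent.is_dir():
        return f"Database parent directory does not exist: {parent}"
    # Creating files in a directory needs both write and search permission.
    if not os.access(parent, os.W_OK | os.X_OK):
        return f"Database parent directory not writable: {parent}"
    return None
```

For example, `check_database_parent("/var/lib/bangui/bangui.db")` returns `None` when `/var/lib/bangui` exists and is writable by the current user.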

---

### fail2ban Socket Not Readable

**Symptom:** Application fails to start with: `fail2ban socket not readable: /path/to/socket`

**Cause:** The socket file exists but is not readable by the BanGUI process.

**Solution:**
```bash
chmod 644 /path/to/socket
ls -la /path/to/socket
```

---

### fail2ban Config Directory Does Not Exist

**Symptom:** Application fails to start with: `fail2ban config directory does not exist: /path/to/config`

**Cause:** `BANGUI_FAIL2BAN_CONFIG_DIR` points to a directory that does not exist.

**Solution:**
- Mount the fail2ban configuration directory at the expected path
- Or adjust `BANGUI_FAIL2BAN_CONFIG_DIR` to point to the correct location
- In Docker: add a volume mount for the fail2ban config directory

---

### GeoIP Database File Does Not Exist

**Symptom:** Application fails to start with: `GeoIP database file does not exist: /path/to/GeoLite2-Country.mmdb`

**Cause:** `BANGUI_GEOIP_DB_PATH` points to a file that does not exist.

**Solution:**
1. Download the MaxMind GeoLite2-Country database from https://dev.maxmind.com/geoip/geolite2-country
2. Place it at the configured path, or update `BANGUI_GEOIP_DB_PATH` to the correct location
3. Alternatively, set `BANGUI_GEOIP_DB_PATH` to `null` to disable GeoIP lookups

---

### session_secret Too Short or Weak

**Symptom:** Application fails to start with: `session_secret must be at least 32 characters` or `session_secret is too weak`

**Cause:** `BANGUI_SESSION_SECRET` is missing, too short, or contains common weak words.

**Solution:**
```bash
# Generate a new secret
python -c "import secrets; print(secrets.token_hex(32))"
```
Then set it in your `.env` file or environment variables.

---

## Enabling Debug Logs for Third-Party Libraries

BanGUI suppresses verbose DEBUG logs from APScheduler and aiosqlite by default (see `Docs/Observability.md`). When troubleshooting scheduler or database issues, you can temporarily re-enable these logs.

### Quick method (environment variable)

Set `BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false` and ensure `BANGUI_LOG_LEVEL=debug`:

```bash
BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false \
BANGUI_LOG_LEVEL=debug \
python -m uvicorn app.main:create_app --factory
```

This allows APScheduler and aiosqlite to inherit the application log level without editing code.

### Code method (for permanent changes)

If you need to change the level for a specific library only, edit `backend/app/main.py` inside `_configure_logging()`:

```python
logging.getLogger("apscheduler").setLevel(logging.DEBUG)
```

Restart the application. You will see scheduler polling messages such as:
- `Looking for jobs to run`
- `Next wakeup is due at ...`
- `Running job ...`

### Reverting

Remove the environment variable or code change and restart. When suppression is re-enabled, the loggers return to `WARNING` level.
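
The behaviour described in this section maps onto standard `logging` levels: a logger set to `WARNING` drops DEBUG records, while a logger left at `NOTSET` inherits the level of its parent. The sketch below shows that mechanism only; `configure_third_party_logs` is a hypothetical function and not the actual body of `_configure_logging()`.

```python
import logging

THIRD_PARTY_LOGGERS = ("apscheduler", "aiosqlite")


def configure_third_party_logs(suppress: bool) -> None:
    """Pin noisy loggers to WARNING, or let them inherit the app level."""
    for name in THIRD_PARTY_LOGGERS:
        # NOTSET makes the logger defer to its parent's effective level.
        level = logging.WARNING if suppress else logging.NOTSET
        logging.getLogger(name).setLevel(level)
```

To confirm which level is in force at runtime, check `logging.getLogger("apscheduler").getEffectiveLevel()`.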

---

## Plain Text Logs Still Appearing

If `bangui.log` contains plain text lines that are not JSON, a library is bypassing structlog's `ProcessorFormatter`.

**Diagnosis:**

1. Identify the logger name in the plain text line (usually at the start of the line).
2. Check whether the logger is listed in `backend/app/main.py::_configure_logging()` under the third-party overrides.
3. Verify that `structlog.stdlib.ProcessorFormatter` is attached to all handlers:
   ```python
   for handler in handlers:
       handler.setFormatter(formatter)
   ```
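
For step 3, a quick way to see which formatter each handler actually uses is to inspect the logger at runtime. The sketch below relies only on the standard `logging` module; `describe_handlers` is a hypothetical helper and makes no assumption about which formatter class BanGUI installs.

```python
import logging
from typing import List, Tuple


def describe_handlers(logger_name: str = "") -> List[Tuple[str, str]]:
    """List (handler class, formatter class) pairs for one logger."""
    logger = logging.getLogger(logger_name)  # "" selects the root logger
    report = []
    for handler in logger.handlers:
        formatter = handler.formatter
        formatter_name = type(formatter).__name__ if formatter else "None (default)"
        report.append((type(handler).__name__, formatter_name))
    return report
```

Run it for the root logger and for the logger name found in step 1. Any handler whose formatter is not the expected JSON formatter is a candidate source of the plain text lines.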

**Common causes:**

| Cause | Fix |
|-------|-----|
| Library initializes its own handler after startup | Add `logging.getLogger("library_name").setLevel(logging.WARNING)` in `_configure_logging()`. |
| Custom handler added outside `_configure_logging()` | Ensure all handlers use `structlog.stdlib.ProcessorFormatter`. |
| Log emitted before `_configure_logging()` is called | Move logging configuration earlier in the lifespan or app factory. |

---

## Getting Help

If issues persist after following this guide:

1. Enable debug logging: `BANGUI_LOG_LEVEL=debug`
2. Collect logs around the failure time
3. Check `Docs/Deployment.md` for configuration guidance
4. Check `Docs/Observability.md` for monitoring setup