Troubleshooting Guide
Scheduler Lock Issues
Lock Held by Crashed Instance (Orphaned Lock)
Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
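Both conditions can be checked in one step. The following is a minimal sketch, not BanGUI code: it assumes heartbeat_at holds Unix epoch seconds (as the unixepoch() call later in this guide suggests) and that it runs on the host named in the lock row. The names find_orphaned_lock and pid_exists are hypothetical.

```python
import os
import sqlite3
import time

STALE_AFTER_SECONDS = 5 * 60  # the 5-minute threshold from this guide


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_orphaned_lock(conn, now=None, pid_check=pid_exists):
    """Return the lock row (pid, hostname, heartbeat_at) if it looks orphaned, else None."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
    ).fetchone()
    if row is None:
        return None  # no lock is held at all
    pid, _hostname, heartbeat_at = row
    # Assumption: heartbeat_at is stored as Unix epoch seconds.
    stale = now - heartbeat_at > STALE_AFTER_SECONDS
    if stale and not pid_check(pid):
        return row
    return None
```

Calling find_orphaned_lock(sqlite3.connect("/var/lib/bangui/bangui.db")) returns a row only when the heartbeat is stale and the PID is gone, which is the case where the recovery step below is safe.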
Recovery:
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
Restart the backend. It will acquire the lock fresh.
Prevention:
- Monitor scheduler_lock_heartbeat_lost events in logs
- If there are more than 3 occurrences per hour, investigate database I/O performance
Two Instances Both Running Scheduler
Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.
Cause: Both instances believe they hold the lock.
Diagnosis:
- Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
- Compare with running processes: ps aux | grep bangui
Solution:
- Stop one instance immediately
- Clear lock: DELETE FROM scheduler_lock;
- Restart the remaining instance
Prevention:
- Ensure only one instance starts before heartbeat begins
- Check that BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required
Heartbeat Update Failures
Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.
Cause: Database writes failing or extremely slow (>5 seconds per write).
Diagnosis:
time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"
If this takes >1 second, database I/O is degraded.
Solution:
- Check database integrity: sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
- Move database to faster storage (SSD)
- Check for other I/O bottlenecks on the host
Lock Not Acquired at Startup
Symptom: Instance fails to start with error "Could not acquire scheduler lock".
Cause: Another instance already holds the lock and appears healthy.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps aux | grep <pid>
Solution:
- If the other instance is healthy and should run the scheduler: this instance must wait
- If the other instance has crashed: DELETE FROM scheduler_lock; then restart this instance
- If running a single instance: ensure no other instances are running before startup
Rate Limiting
Getting 429 Too Many Requests
Symptom: API returns HTTP 429 with rate_limit_exceeded error code.
Cause: You have exceeded the per-IP rate limit for a specific operation.
Diagnosis:
- Check the Retry-After header in the response; it tells you how many seconds to wait
- Look for the log event *_rate_limit_exceeded, which shows the bucket and client IP
Rate limit buckets:
| Bucket | Limit | Window | Operations |
|---|---|---|---|
| bans:ban | 100 | 1 minute | Ban IP addresses |
| bans:unban | 100 | 1 minute | Unban IP addresses |
| blocklist:import | 10 | 1 hour | Import blocklists |
| config:update | 50 | 1 minute | Update configuration |
| jail:update | 100 | 1 minute | Update jail config |
| jail:create | 100 | 1 minute | Add log paths, assign filters/actions |
| jail:delete | 100 | 1 minute | Remove log paths, actions |
| jail:activate | 100 | 1 minute | Activate jails |
| jail:deactivate | 100 | 1 minute | Deactivate jails |
| filter:update | 50 | 1 minute | Update filters |
| filter:create | 50 | 1 minute | Create filters |
| filter:delete | 50 | 1 minute | Delete filters |
| action:update | 50 | 1 minute | Update actions |
| action:create | 50 | 1 minute | Create actions |
| action:delete | 50 | 1 minute | Delete actions |
Solution:
- Wait for the Retry-After period before retrying
- If you hit the limit during legitimate bulk operations, consider batching requests
- For blocklist imports (10/hour), ensure automated imports do not run more often than that
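Waiting for the Retry-After period can be automated on the client. The sketch below is illustrative only: send is a hypothetical callable supplied by the caller that performs one request and returns (status, headers, body), and call_with_retry_after is not part of BanGUI. It assumes Retry-After carries a number of seconds, as described above.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def call_with_retry_after(
    send: Callable[[], Response],
    max_attempts: int = 3,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on HTTP 429, wait for Retry-After seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 429:
            break
        # Assumption: headers is a mapping keyed exactly as "Retry-After";
        # real HTTP client libraries usually offer case-insensitive lookup.
        raw = headers.get("Retry-After", "")
        try:
            wait = float(raw)
        except ValueError:
            wait = default_wait  # header missing or not a plain number
        sleep(max(wait, 0.0))
        response = send()
    return response
```

The last response is returned even if it is still a 429, so the caller can decide whether to give up or schedule the operation for later.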
Prevention:
- Monitor *_rate_limit_exceeded log events
- Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
- For bulk operations, implement client-side throttling
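One simple form of client-side throttling is to space calls evenly, so that a bucket of 100 per minute becomes one call every 0.6 seconds. This is a sketch under that assumption; the Throttle class is hypothetical and says nothing about how the server counts requests, so leave some headroom below the documented limit.

```python
import time


class Throttle:
    """Space calls so that at most `limit` calls start per `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic, sleep=time.sleep):
        self.interval = window / limit  # minimum gap between two calls
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval
```

A bulk job would create one Throttle(limit=100, window=60) per bucket and call wait() before each request.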
Note: If rate limiting triggers unexpectedly for legitimate use, check for:
- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)
Database Migration Failures
Application Won't Start After Upgrade
Symptom: Application fails to start. Logs show migration errors.
Cause: Migration failed mid-transaction. Database left in inconsistent state.
Diagnosis:
# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
# Check logs for specific error
grep -i migration /var/log/bangui.log
Solution:
- If migration was auto-rolled back: Startup will retry the same migration. Run the application again.
- If migration keeps failing: Check if the table already exists:
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
If it exists, manually insert the migration record (replace ? with the version number of the failing migration):
sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (?);"
- Full database reset (development only):
rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
Prevention:
- Always back up before upgrades: cp bangui.db bangui.db.backup
- Never manually modify database schema
- Monitor migrating_database_schema log events during upgrades
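If the application is still running while the backup is taken, a plain file copy may miss changes that are still in the -wal file. A hedged alternative is SQLite's online backup API, available in Python's standard library as sqlite3.Connection.backup. The function name backup_database and the paths are illustrative.

```python
import sqlite3


def backup_database(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies a consistent snapshot of the source database into dest.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

For example, backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup") before an upgrade. Stopping the backend first and then copying the file remains the simplest option.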
Schema Version Mismatch
Symptom: Error: "database schema version X is newer than supported version Y"
Cause: Downgraded to older BanGUI version that doesn't support current schema.
Solution: Upgrade to a version compatible with the current schema, or restore from backup.
502 Bad Gateway Errors
Symptom: Nginx returns 502 Bad Gateway
Cause: The backend container is unreachable — either down, restarting, or not yet healthy.
Diagnosis:
# Check backend container status
docker ps -a | grep bangui-backend
# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health
# Check backend logs
docker logs bangui-backend --tail 50
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Backend restarting | docker ps shows backend repeatedly restarting | Check health check timing; may need longer start_period |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | start_period: 40s not enough on slow hosts | Increase start_period in compose file |
| Port misconfiguration | expose vs ports mismatch | Ensure backend exposes 8000 and frontend proxies to it |
Prevention:
- The depends_on: condition: service_healthy setting ensures the backend is fully started before the frontend proxies requests.
- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
- Health check parameters are tuned for typical startup time; adjust start_period if the host is slow or resource-constrained.
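When tuning start_period, it helps to measure how long the backend actually takes to become healthy. The sketch below polls the health endpoint used in the diagnosis commands above. The helper names are hypothetical, and the URL only resolves from inside the container network.

```python
import time
import urllib.error
import urllib.request


def backend_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or a non-2xx status such as 503


def wait_until_healthy(check, timeout=60.0, interval=2.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example (run from a container on the same network):
# started = time.monotonic()
# ok = wait_until_healthy(
#     lambda: backend_is_healthy("http://bangui-backend:8000/api/v1/health"))
# print(ok, round(time.monotonic() - started, 1), "seconds")
```

If the measured time is close to or above the configured start_period, that is a hint to raise it.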
Graceful Shutdown Issues
Container Killed Before Tasks Complete
Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.
Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.
Diagnosis:
# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
Solution:
- Increase stop_grace_period in docker-compose.yml:
backend:
  stop_grace_period: 60s
- The Python graceful timeout is 25s (leaving margin before Docker kill)
- If tasks still time out, check task code; long-running tasks should handle cancellation gracefully
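Handling cancellation gracefully in an asyncio task usually means working in small steps, cleaning up when asyncio.CancelledError arrives, and re-raising it. This is a generic sketch, not BanGUI's task code. The function names and the state dictionary are illustrative.

```python
import asyncio


async def long_running_task(state: dict) -> None:
    """Do work in small steps and clean up if cancelled during shutdown."""
    try:
        while True:
            state["steps"] = state.get("steps", 0) + 1
            # Each await is a point where a pending cancellation is delivered.
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        state["cleaned_up"] = True  # e.g. persist progress, release resources
        raise  # re-raise so the task is reported as cancelled


async def main() -> dict:
    state: dict = {}
    task = asyncio.create_task(long_running_task(state))
    await asyncio.sleep(0.05)  # let the task make some progress
    task.cancel()              # what a graceful shutdown would do
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state
```

A task that does one long blocking call without awaiting cannot be interrupted this way, and that is the kind of task most likely to hit the timeout.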
Scheduler Lock Not Released
Symptom: After container restart, logs show Could not acquire scheduler lock.
Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.
Diagnosis:
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
Solution:
# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container
Prevention:
- Graceful shutdown releases lock immediately (not waiting for TTL expiry)
- Monitor logs for scheduler_lock_released on clean shutdown
In-Flight Requests Dropped
Symptom: Client connections closed abruptly during shutdown.
Cause: Too short a graceful timeout, or clients not configured to retry.
Solution:
- Ensure clients implement proper retry logic with backoff
- For critical operations, use background tasks with status polling
- Increase graceful timeout if network latency is high
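Retry with backoff on the client side can be as small as a generator of delays. This is a sketch with illustrative defaults, and backoff_delays is a hypothetical helper. The jitter parameter adds up to that fraction of each delay at random, so that many clients do not retry at the same moment.

```python
import random


def backoff_delays(attempts, base=0.5, factor=2.0, cap=30.0,
                   jitter=0.0, rng=random.random):
    """Yield the delay before each retry: base * factor**n, capped, plus optional jitter."""
    for n in range(attempts):
        delay = min(base * factor ** n, cap)
        yield delay + delay * jitter * rng()
```

A client would sleep for each yielded delay between attempts and stop at the first success. For example, list(backoff_delays(5, base=1.0, cap=5.0)) gives [1.0, 2.0, 4.0, 5.0, 5.0].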
General Recovery Commands
Clear all locks:
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
Check lock status:
sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
Verify database integrity:
sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
Regex Pattern Rejected
Symptom: Filter or action configuration fails with "Invalid regex" error
Cause: The regex pattern is either syntactically invalid or detected as a ReDoS (Regular Expression Denial of Service) vulnerability.
Diagnosis:
- Check the error message — it indicates whether the pattern is syntactically invalid or flagged as dangerous
- Look for log events: regex_redos_detected or regex_compilation_timeout
Common ReDoS patterns that are rejected:
| Pattern | Problem |
|---|---|
| (a+)+b | Nested quantifiers with overlap |
| ([a-z]+)*d | Quantifier inside quantifier |
| (x+)+y | Nested plus operators |
Solution:
- Rewrite the pattern to avoid nested quantifiers on overlapping groups
- Use atomic groups or possessive quantifiers where possible: (?>a+)+b
- Simplify complex alternations
Prevention:
- Test regex patterns in isolation before deploying
- Avoid patterns with quantified groups inside other quantifiers
- Prefer explicit character classes over .* where possible
- Use regexploit to audit patterns
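In many rejected patterns the nested group adds nothing. For example, (a+)+b accepts exactly the same strings as a+b. The sketch below checks that equivalence by brute force on short inputs. It is kept short deliberately, because the nested form backtracks exponentially on long non-matching input. The helper same_matches is illustrative.

```python
import itertools
import re

NESTED = re.compile(r"(a+)+b")  # rejected form: a quantified group inside a quantifier
FLAT = re.compile(r"a+b")       # same set of matching strings, no nesting


def same_matches(p1, p2, alphabet="ab", max_len=8):
    """Return True if both patterns fully match exactly the same short strings."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False
    return True
```

By the same reasoning, ([a-z]+)*d can usually be written as [a-z]*d, provided nothing depends on the capture group.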
Configuration Validation at Startup
BanGUI validates configuration at startup. Errors raised here indicate misconfiguration that must be fixed before the application can start.
Database Parent Directory Does Not Exist
Symptom: Application fails to start with: Database parent directory does not exist: /path/to/parent
Cause: The parent directory of BANGUI_DATABASE_PATH does not exist.
Solution:
mkdir -p /path/to/parent
# Then restart BanGUI
Database Parent Directory Not Writable
Symptom: Application fails to start with: Database parent directory not writable: /path/to/parent
Cause: The process cannot write to the database parent directory.
Solution:
chmod 755 /path/to/parent
# Verify the user running BanGUI owns the directory or has write access
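The two directory checks can be reproduced before starting the application, for example in a deployment script. This is a sketch of an equivalent pre-flight check, not BanGUI's actual validation code. The function name is hypothetical, and the messages simply mirror the ones quoted above.

```python
import os


def check_database_parent(db_path: str) -> list:
    """Return a list of problems with the parent directory of db_path (empty if none)."""
    parent = os.path.dirname(os.path.abspath(db_path))
    if not os.path.isdir(parent):
        return [f"Database parent directory does not exist: {parent}"]
    # Creating files in a directory needs both write and search (execute) permission.
    if not os.access(parent, os.W_OK | os.X_OK):
        return [f"Database parent directory not writable: {parent}"]
    return []
```

Run it as the same user that runs BanGUI, since os.access reports permissions for the current process.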
fail2ban Socket Not Readable
Symptom: Application fails to start with: fail2ban socket not readable: /path/to/socket
Cause: The socket file exists but is not readable by the BanGUI process.
Solution:
chmod 644 /path/to/socket
ls -la /path/to/socket
fail2ban Config Directory Does Not Exist
Symptom: Application fails to start with: fail2ban config directory does not exist: /path/to/config
Cause: BANGUI_FAIL2BAN_CONFIG_DIR points to a directory that does not exist.
Solution:
- Mount the fail2ban configuration directory at the expected path
- Or adjust BANGUI_FAIL2BAN_CONFIG_DIR to point to the correct location
- In Docker: add a volume mount for the fail2ban config directory
GeoIP Database File Does Not Exist
Symptom: Application fails to start with: GeoIP database file does not exist: /path/to/GeoLite2-Country.mmdb
Cause: BANGUI_GEOIP_DB_PATH points to a file that does not exist.
Solution:
- Download the MaxMind GeoLite2-Country database from https://dev.maxmind.com/geoip/geolite2-country
- Place it at the configured path, or update BANGUI_GEOIP_DB_PATH to the correct location
- Alternatively, set BANGUI_GEOIP_DB_PATH to null to disable GeoIP lookups
session_secret Too Short or Weak
Symptom: Application fails to start with: session_secret must be at least 32 characters or session_secret is too weak
Cause: BANGUI_SESSION_SECRET is missing, too short, or contains common weak words.
Solution:
# Generate a new secret
python -c "import secrets; print(secrets.token_hex(32))"
Then set it in your .env file or environment variables.
Enabling Debug Logs for Third-Party Libraries
BanGUI suppresses verbose DEBUG logs from APScheduler and aiosqlite by default (see Docs/Observability.md). When troubleshooting scheduler or database issues, you can temporarily re-enable these logs.
Quick method (environment variable)
Set BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false and ensure BANGUI_LOG_LEVEL=debug:
BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false \
BANGUI_LOG_LEVEL=debug \
python -m uvicorn app.main:create_app --factory
This allows APScheduler and aiosqlite to inherit the application log level without editing code.
Code method (for permanent changes)
If you need to change the level for a specific library only, edit backend/app/main.py inside _configure_logging():
logging.getLogger("apscheduler").setLevel(logging.DEBUG)
Restart the application. You will see scheduler polling messages such as:
Looking for jobs to run
Next wakeup is due at ...
Running job ...
Reverting
Remove the environment variable or code change and restart. When suppression is re-enabled, the loggers return to WARNING level.
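To confirm which level is actually in effect after a change, the standard logging module can report it. This is a small sketch, assuming the libraries log under the logger names apscheduler and aiosqlite. The helper effective_levels is illustrative.

```python
import logging


def effective_levels(names=("apscheduler", "aiosqlite")):
    """Return the effective level name for each logger, e.g. {'apscheduler': 'WARNING'}."""
    return {
        name: logging.getLevelName(logging.getLogger(name).getEffectiveLevel())
        for name in names
    }
```

A logger with no level of its own inherits from its parent, so with suppression disabled the result should match the application log level.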
Plain Text Logs Still Appearing
If bangui.log contains plain text lines that are not JSON, a library is bypassing structlog's ProcessorFormatter.
Diagnosis:
- Identify the logger name in the plain text line (usually at the start of the line).
- Check whether the logger is listed in backend/app/main.py::_configure_logging() under the third-party overrides.
- Verify that structlog.stdlib.ProcessorFormatter is attached to all handlers:
for handler in handlers:
    handler.setFormatter(formatter)
Common causes:
| Cause | Fix |
|---|---|
| Library initializes its own handler after startup | Add logging.getLogger("library_name").setLevel(logging.WARNING) in _configure_logging(). |
| Custom handler added outside _configure_logging() | Ensure all handlers use structlog.stdlib.ProcessorFormatter. |
| Log emitted before _configure_logging() is called | Move logging configuration earlier in the lifespan or app factory. |
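Finding the offending lines by eye is tedious in a large log. The sketch below lists every non-empty line that does not parse as a JSON object. The helper non_json_lines is illustrative, and the log path in the usage comment is the one used elsewhere in this guide.

```python
import json


def non_json_lines(lines):
    """Yield (line_number, text) for each non-empty line that is not a JSON object."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            yield number, text
            continue
        if not isinstance(parsed, dict):
            yield number, text  # valid JSON, but not a structured log record


# Example:
# with open("/var/log/bangui.log", encoding="utf-8") as f:
#     for number, text in non_json_lines(f):
#         print(number, text)
```

The logger name at the start of each reported line points to the library that needs an override.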
Getting Help
If issues persist after following this guide:
- Enable debug logging: BANGUI_LOG_LEVEL=debug
- Collect logs around the failure time
- Check Docs/Deployment.md for configuration guidance
- Check Docs/Observability.md for monitoring setup