BanGUI/Docs/TROUBLESHOOTING.md
Lukas 7ec80fdeec refactor(logging): replace structlog with stdlib logging compat layer
- Remove structlog dependency from backend/pyproject.toml
- Add app.utils.logging_compat shim for keyword-arg logging API
- Add app.utils.json_formatter for JSON log output with extra fields
- Update all backend modules to use logging_compat.get_logger()
- Update docstrings in log_sanitizer.py and json_formatter.py
- Update test comment in test_async_utils.py
- Record 406 failing tests in Docs/Tasks.md for tracking
2026-05-10 13:37:54 +02:00


Troubleshooting Guide

Scheduler Lock Issues

Lock Held by Crashed Instance (Orphaned Lock)

Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"

If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
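The orphan rule above can be expressed as a small check. This is only a sketch, not BanGUI's own logic: it assumes heartbeat_at is stored as Unix epoch seconds (as the unixepoch() call elsewhere in this guide suggests), that it runs on the host named in the lock row, and that the host is Unix-like. The function names are hypothetical.

```python
import os
import time


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists on the local host (Unix only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_lock_orphaned(pid, heartbeat_at, now=None, max_age=300.0, pid_alive=pid_exists):
    """Apply the rule above: heartbeat older than max_age seconds AND the PID is gone."""
    now = time.time() if now is None else now
    return (now - heartbeat_at) > max_age and not pid_alive(pid)
```

Both conditions must hold: a stale heartbeat alone may only mean slow I/O, and a missing PID alone is meaningless if the lock belongs to another host.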

Recovery:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Restart the backend. It will acquire the lock fresh.

Prevention:

  • Monitor scheduler_lock_heartbeat_lost events in logs
  • If >3 occurrences per hour, investigate database I/O performance

Two Instances Both Running Scheduler

Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

Cause: Both instances believe they hold the lock.

Diagnosis:

  1. Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
  2. Compare with running processes: ps aux | grep bangui

Solution:

  1. Stop one instance immediately
  2. Clear lock: DELETE FROM scheduler_lock;
  3. Restart the remaining instance

Prevention:

  • Start instances one at a time, so that the first has acquired the lock and begun its heartbeat before another starts
  • Check BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required

Heartbeat Update Failures

Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.

Cause: Database writes failing or extremely slow (>5 seconds per write).

Diagnosis:

time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"

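If this takes more than 1 second, database I/O is degraded. Note that the statement also refreshes heartbeat_at, so an orphaned lock will look fresh afterwards, and that unixepoch() requires SQLite 3.38 or later.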

Solution:

  1. Check database integrity: sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
  2. Move database to faster storage (SSD)
  3. Check for other I/O bottlenecks on the host

Lock Not Acquired at Startup

Symptom: Instance fails to start with error "Could not acquire scheduler lock".

Cause: Another instance already holds the lock and appears healthy.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>

Solution:

  • If other instance is healthy and should run scheduler: this instance must wait
  • If other instance is crashed: DELETE FROM scheduler_lock; then restart this instance
  • If running single instance: ensure no other instances are running before startup

Rate Limiting

Getting 429 Too Many Requests

Symptom: API returns HTTP 429 with rate_limit_exceeded error code.

Cause: You have exceeded the per-IP rate limit for a specific operation.

Diagnosis:

  1. Check the Retry-After header in the response — this tells you how many seconds to wait
  2. Look for the log event *_rate_limit_exceeded which shows the bucket and client IP

Rate limit buckets:

| Bucket | Limit | Window | Operations |
| --- | --- | --- | --- |
| bans:ban | 100 | 1 minute | Ban IP addresses |
| bans:unban | 100 | 1 minute | Unban IP addresses |
| blocklist:import | 10 | 1 hour | Import blocklists |
| config:update | 50 | 1 minute | Update configuration |
| jail:update | 100 | 1 minute | Update jail config |
| jail:create | 100 | 1 minute | Add log paths, assign filters/actions |
| jail:delete | 100 | 1 minute | Remove log paths, actions |
| jail:activate | 100 | 1 minute | Activate jails |
| jail:deactivate | 100 | 1 minute | Deactivate jails |
| filter:update | 50 | 1 minute | Update filters |
| filter:create | 50 | 1 minute | Create filters |
| filter:delete | 50 | 1 minute | Delete filters |
| action:update | 50 | 1 minute | Update actions |
| action:create | 50 | 1 minute | Create actions |
| action:delete | 50 | 1 minute | Delete actions |

Solution:

  1. Wait for the Retry-After period before retrying
  2. If you hit the limit during legitimate bulk operations, consider batching requests
  3. For blocklist imports (10/hour), ensure automated imports are not more frequent
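A client can honour Retry-After along the following lines. This is a minimal sketch under stated assumptions: send is a hypothetical callable that performs one request and returns (status_code, headers_dict), the header key is spelled exactly "Retry-After", and its value is a number of seconds as described above.

```python
import time


def retry_after_seconds(headers, default=1.0):
    """Parse a Retry-After header given in seconds; use default if absent or invalid."""
    raw = headers.get("Retry-After")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return default


def call_with_429_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 status or max_attempts is reached.

    send is a hypothetical callable returning (status_code, headers_dict).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status
        sleep(retry_after_seconds(headers))  # wait as long as the server asked
```

The sleep function is injectable so that the retry logic can be exercised without real waiting.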

Prevention:

  • Monitor *_rate_limit_exceeded log events
  • Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
  • For bulk operations, implement client-side throttling
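Client-side throttling for bulk operations could look like the sliding-window sketch below. The class name and parameters are hypothetical; choosing a limit somewhat below the server-side bucket (for example 90 per 60 seconds against a bucket of 100 per minute) leaves headroom for other clients behind the same IP.

```python
import time
from collections import deque


class ClientThrottle:
    """Allow at most `limit` calls per `window` seconds by delaying the caller."""

    def __init__(self, limit, window, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of calls still inside the window

    def _evict(self, now):
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()

    def wait(self):
        """Block until one more call fits in the window, then record it."""
        now = self.clock()
        self._evict(now)
        while len(self.sent) >= self.limit:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            self._evict(now)
        self.sent.append(now)
```

Calling wait() before each request keeps a bulk job under the chosen rate without relying on 429 responses.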

Note: If rate limiting triggers unexpectedly for legitimate use, check for:

  • Internal monitoring scripts hitting endpoints too frequently
  • Multiple users behind the same proxy IP
  • Stale rate limit state after process restart (uses in-memory tracking)

Database Migration Failures

Application Won't Start After Upgrade

Symptom: Application fails to start. Logs show migration errors.

Cause: Migration failed mid-transaction. Database left in inconsistent state.

Diagnosis:

# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

# Check logs for specific error
grep -i migration /var/log/bangui.log

Solution:

  1. If the migration was rolled back automatically: startup retries the same migration, so start the application again.
  2. If migration keeps failing: Check if table already exists:
    sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
    
    If it exists, manually insert the migration record, replacing <version> with the number of the migration that already ran:
    sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
    
  3. Full database reset (development only; stop the application first, and note that this deletes all data):
    rm -f /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
    

Prevention:

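  • Always back up before upgrades. Stop the application first, then copy the file: cp bangui.db bangui.db.backup (copying a database that is being written to can produce an inconsistent backup)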
  • Never manually modify database schema
  • Monitor migrating_database_schema log events during upgrades
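If stopping the application before a backup is not practical, SQLite's online backup API is an alternative to a plain file copy. The sketch below uses Python's standard sqlite3 module; the function name is hypothetical and the paths are whatever the deployment uses.

```python
import sqlite3


def backup_database(src_path, dest_path):
    """Copy a SQLite database with the online backup API, which is safe while it is in use."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # sqlite3.Connection.backup, available since Python 3.7
    finally:
        dest.close()
        src.close()
```

For example, backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup") would write a consistent snapshot next to the live database.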

Schema Version Mismatch

Symptom: Error: "database schema version X is newer than supported version Y"

Cause: BanGUI was downgraded to an older version that does not support the current schema.

Solution: Upgrade to a version compatible with the current schema, or restore from backup.


502 Bad Gateway Errors

Symptom: Nginx returns 502 Bad Gateway.

Cause: The backend container is unreachable — either down, restarting, or not yet healthy.

Diagnosis:

# Check backend container status
docker ps -a | grep bangui-backend

# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health

# Check backend logs
docker logs bangui-backend --tail 50

Common causes and solutions:

| Cause | Diagnosis | Solution |
| --- | --- | --- |
| Backend restarting | docker ps shows backend repeatedly restarting | Check health check timing; may need longer start_period |
| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
| Startup too slow | start_period: 40s not enough on slow hosts | Increase start_period in compose file |
| Port misconfiguration | expose vs ports mismatch | Ensure backend exposes 8000 and frontend proxies to it |

Prevention:

  • The depends_on: condition: service_healthy ensures the backend is fully started before the frontend proxies requests.
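  • The health check returns 503 when fail2ban is offline, which marks the container unhealthy. Docker restart policies react to container exit rather than health status, so an automatic restart depends on how the deployment handles unhealthy containers.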
  • Health check parameters are tuned for typical startup time — adjust start_period if the host is slow or resource-constrained.

Graceful Shutdown Issues

Container Killed Before Tasks Complete

Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.

Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.

Diagnosis:

# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
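Exit code 137 follows the common 128 + N convention, where N is the signal number (9 for SIGKILL). A small helper, shown here as a sketch with a hypothetical name, can decode other codes the same way:

```python
import signal


def describe_exit_code(code):
    """Map an exit code to a signal name when it follows the 128 + N convention."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None  # not a known signal number on this platform
    return None
```

An exit code of 143 would decode to SIGTERM, which indicates that the process stopped on the initial stop signal rather than being force-killed.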

Solution:

  1. Increase stop_grace_period in docker-compose.yml:
    backend:
      stop_grace_period: 60s
    
  2. Keep stop_grace_period longer than the Python graceful timeout of 25s, so that tasks can finish before Docker sends SIGKILL (Docker's default stop timeout is 10s)
  3. If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully
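Handling cancellation gracefully in an asyncio task usually means catching asyncio.CancelledError, cleaning up, and re-raising. The sketch below is illustrative only; long_running_task and the state dictionary are hypothetical and do not reflect BanGUI's actual tasks.

```python
import asyncio


async def long_running_task(state):
    """Hypothetical background task that cleans up when it is cancelled."""
    try:
        while True:
            state["iterations"] += 1
            await asyncio.sleep(0.01)  # each await is a point where cancellation can arrive
    except asyncio.CancelledError:
        state["cleaned_up"] = True  # release resources or persist progress here
        raise  # re-raise so the caller sees the task as cancelled


async def run_and_cancel():
    """Start the task, cancel it shortly afterwards, and return the final state."""
    state = {"iterations": 0, "cleaned_up": False}
    task = asyncio.create_task(long_running_task(state))
    await asyncio.sleep(0.05)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state
```

A task that loops without awaiting, or that swallows CancelledError, cannot be stopped within the graceful timeout and will be cut off when the container is killed.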

Scheduler Lock Not Released

Symptom: After container restart, logs show Could not acquire scheduler lock.

Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Solution:

# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container

Prevention:

  • Graceful shutdown releases lock immediately (not waiting for TTL expiry)
  • Monitor logs for scheduler_lock_released on clean shutdown

In-Flight Requests Dropped

Symptom: Client connections closed abruptly during shutdown.

Cause: Too short a graceful timeout, or clients not configured to retry.

Solution:

  1. Ensure clients implement proper retry logic with backoff
  2. For critical operations, use background tasks with status polling
  3. Increase graceful timeout if network latency is high
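Retry with backoff on dropped connections can be sketched as follows. The names are hypothetical, the operation is assumed to raise ConnectionError when the connection is closed, and retrying is only appropriate for operations that are safe to repeat.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter: a random delay up to min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation, attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `operation` on ConnectionError, sleeping between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap, rng))
    raise ValueError("attempts must be at least 1")
```

The jitter spreads retries from many clients over time, so they do not all reconnect at the same moment after a restart.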

General Recovery Commands

Clear all locks:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Check lock status:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Verify database integrity:

sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"

Regex Pattern Rejected

Symptom: Filter or action configuration fails with an "Invalid regex" error.

Cause: The regex pattern is either syntactically invalid or detected as a ReDoS (Regular Expression Denial of Service) vulnerability.

Diagnosis:

  1. Check the error message — it indicates whether the pattern is syntactically invalid or flagged as dangerous
  2. Look for log events: regex_redos_detected or regex_compilation_timeout

Common ReDoS patterns that are rejected:

| Pattern | Problem |
| --- | --- |
| `(a+)+b` | Nested quantifiers with overlap |
| `([a-z]+)*d` | Quantifier inside quantifier |
| `(x+)+y` | Nested plus operators |

Solution:

  1. Rewrite the pattern to avoid nested quantifiers on overlapping groups
  2. Use atomic groups or possessive quantifiers where possible: (?>a+)+b (Python's re module supports these from version 3.11)
  3. Simplify complex alternations
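As an illustration of the first step, the rejected pattern (a+)+b accepts exactly the same strings as a+b, which has no nested quantifier. The sketch below compares the two on short samples using Python's re module; the helper name is hypothetical, and the samples are kept short on purpose because the nested form can backtrack exponentially on long non-matching input.

```python
import re

NESTED = re.compile(r"(a+)+b")  # nested quantifier: one of the rejected shapes above
FLAT = re.compile(r"a+b")       # same language, single quantifier

# Short inputs only: see the note on backtracking in the paragraph above.
SAMPLES = ["ab", "aaab", "b", "aaa", "", "aabb", "ba"]


def same_matches(p1, p2, samples):
    """Return True if both patterns agree on every sample (full-match semantics)."""
    return all(bool(p1.fullmatch(s)) == bool(p2.fullmatch(s)) for s in samples)
```

A sample-based comparison like this does not prove equivalence, but it catches accidental changes in behaviour when a pattern is flattened.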

Prevention:

  • Test regex patterns in isolation before deploying
  • Avoid patterns with quantified groups inside other quantifiers
  • Prefer explicit character classes over .* where possible
  • Use regexploit to audit patterns

Configuration Validation at Startup

BanGUI validates configuration at startup. Errors raised here indicate misconfiguration that must be fixed before the application can start.

Database Parent Directory Does Not Exist

Symptom: Application fails to start with: Database parent directory does not exist: /path/to/parent

Cause: The parent directory of BANGUI_DATABASE_PATH does not exist.

Solution:

mkdir -p /path/to/parent
# Then restart BanGUI

Database Parent Directory Not Writable

Symptom: Application fails to start with: Database parent directory not writable: /path/to/parent

Cause: The process cannot write to the database parent directory.

Solution:

chmod 755 /path/to/parent
# Verify the user running BanGUI owns the directory or has write access
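The two database directory checks above can be reproduced before starting BanGUI, for example in a deployment script. This is a sketch of equivalent checks, not BanGUI's actual validation code; the function name is hypothetical and the messages simply mirror the ones quoted in this section.

```python
import os
from pathlib import Path


def check_database_parent(database_path):
    """Return a list of error messages for the database parent directory (empty if usable)."""
    parent = Path(database_path).parent
    if not parent.is_dir():
        return [f"Database parent directory does not exist: {parent}"]
    # Creating files in a directory needs both write and search (execute) permission.
    if not os.access(parent, os.W_OK | os.X_OK):
        return [f"Database parent directory not writable: {parent}"]
    return []
```

Run it as the same user that runs BanGUI, since os.access reports permissions for the calling process.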

fail2ban Socket Not Readable

Symptom: Application fails to start with: fail2ban socket not readable: /path/to/socket

Cause: The socket file exists but is not readable by the BanGUI process.

Solution:

chmod 644 /path/to/socket
ls -la /path/to/socket

fail2ban Config Directory Does Not Exist

Symptom: Application fails to start with: fail2ban config directory does not exist: /path/to/config

Cause: BANGUI_FAIL2BAN_CONFIG_DIR points to a directory that does not exist.

Solution:

  • Mount the fail2ban configuration directory at the expected path
  • Or adjust BANGUI_FAIL2BAN_CONFIG_DIR to point to the correct location
  • In Docker: add a volume mount for the fail2ban config directory

GeoIP Database File Does Not Exist

Symptom: Application fails to start with: GeoIP database file does not exist: /path/to/GeoLite2-Country.mmdb

Cause: BANGUI_GEOIP_DB_PATH points to a file that does not exist.

Solution:

  1. Download the MaxMind GeoLite2-Country database from https://dev.maxmind.com/geoip/geolite2-country
  2. Place it at the configured path, or update BANGUI_GEOIP_DB_PATH to the correct location
  3. Alternatively, set BANGUI_GEOIP_DB_PATH to null to disable GeoIP lookups

session_secret Too Short or Weak

Symptom: Application fails to start with: session_secret must be at least 32 characters or session_secret is too weak

Cause: BANGUI_SESSION_SECRET is missing, too short, or contains common weak words.

Solution:

# Generate a new secret
python -c "import secrets; print(secrets.token_hex(32))"

Then set it in your .env file or environment variables.


Enabling Debug Logs for Third-Party Libraries

BanGUI suppresses verbose DEBUG logs from APScheduler and aiosqlite by default (see Docs/Observability.md). When troubleshooting scheduler or database issues, you can temporarily re-enable these logs.

Quick method (environment variable)

Set BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false and ensure BANGUI_LOG_LEVEL=debug:

BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false \
BANGUI_LOG_LEVEL=debug \
python -m uvicorn app.main:create_app --factory

This allows APScheduler and aiosqlite to inherit the application log level without editing code.

Code method (for permanent changes)

If you need to change the level for a specific library only, edit backend/app/main.py inside _configure_logging():

logging.getLogger("apscheduler").setLevel(logging.DEBUG)

Restart the application. You will see scheduler polling messages such as:

  • Looking for jobs to run
  • Next wakeup is due at ...
  • Running job ...

Reverting

Remove the environment variable or code change and restart. When suppression is re-enabled, the loggers return to WARNING level.
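The effect of the suppression switch can be pictured with the standard logging module: a logger pinned to WARNING ignores the application level, while a logger reset to NOTSET inherits it. The sketch below assumes that behaviour and uses a hypothetical function name; it is not the code in _configure_logging().

```python
import logging

THIRD_PARTY_LOGGERS = ("apscheduler", "aiosqlite")


def apply_third_party_suppression(suppress):
    """Pin noisy loggers to WARNING, or reset them so they inherit the application level."""
    level = logging.WARNING if suppress else logging.NOTSET
    for name in THIRD_PARTY_LOGGERS:
        logging.getLogger(name).setLevel(level)
```

Child loggers such as apscheduler.scheduler follow their parent, so setting the level on the top-level name is enough.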


Plain Text Logs Still Appearing

If bangui.log contains plain text lines that are not JSON, a library is bypassing structlog's ProcessorFormatter.

Diagnosis:

  1. Identify the logger name in the plain text line (usually at the start of the line).
  2. Check whether the logger is listed in backend/app/main.py::_configure_logging() under the third-party overrides.
  3. Verify that structlog.stdlib.ProcessorFormatter is attached to all handlers:
    for handler in handlers:
        handler.setFormatter(formatter)
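The check in step 3 can also be done at runtime by listing handlers whose formatter is not of the expected class. This is a sketch using only the standard logging module; the function name is hypothetical, and expected_cls stands for whichever formatter class the deployment attaches.

```python
import logging


def handlers_without_formatter(expected_cls, logger=None):
    """Return handlers on `logger` (root by default) whose formatter is not an expected_cls instance."""
    if logger is None:
        logger = logging.getLogger()
    return [h for h in logger.handlers if not isinstance(h.formatter, expected_cls)]
```

Any handler returned by this function will emit lines in its own format, which is one way plain text ends up in the log file.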
    

Common causes:

| Cause | Fix |
| --- | --- |
| Library initializes its own handler after startup | Add logging.getLogger("library_name").setLevel(logging.WARNING) in _configure_logging(). |
| Custom handler added outside _configure_logging() | Ensure all handlers use structlog.stdlib.ProcessorFormatter. |
| Log emitted before _configure_logging() is called | Move logging configuration earlier in the lifespan or app factory. |

Getting Help

If issues persist after following this guide:

  1. Enable debug logging: BANGUI_LOG_LEVEL=debug
  2. Collect logs around the failure time
  3. Check Docs/Deployment.md for configuration guidance
  4. Check Docs/Observability.md for monitoring setup