Lukas 7ec80fdeec refactor(logging): replace structlog with stdlib logging compat layer

- Remove structlog dependency from backend/pyproject.toml
- Add app.utils.logging_compat shim for keyword-arg logging API
- Add app.utils.json_formatter for JSON log output with extra fields
- Update all backend modules to use logging_compat.get_logger()
- Update docstrings in log_sanitizer.py and json_formatter.py
- Update test comment in test_async_utils.py
- Record 406 failing tests in Docs/Tasks.md for tracking

2026-05-10 13:37:54 +02:00


Troubleshooting Guide

Scheduler Lock Issues

Lock Held by Crashed Instance (Orphaned Lock)

Symptom: Background tasks stop running. Logs show scheduler_lock_held_by_other_instance but no other instance is running.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"

If heartbeat_at is older than 5 minutes and the PID no longer exists, the lock is orphaned.
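
The two checks above can be combined in a small script. This is a minimal sketch, assuming heartbeat_at stores Unix epoch seconds (as the unixepoch() call used elsewhere in this guide suggests) and that the script runs in the same PID namespace as the instance that wrote the lock. The helper names pid_exists and is_lock_orphaned are hypothetical and not part of BanGUI.

```python
import os
import sqlite3
import time
from contextlib import closing

STALE_AFTER_SECONDS = 5 * 60  # threshold from this guide: 5 minutes


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_lock_orphaned(db_path, now=None, pid_alive=pid_exists) -> bool:
    """True if the lock row has a stale heartbeat and its PID is gone."""
    now = time.time() if now is None else now
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT pid, hostname, heartbeat_at FROM scheduler_lock"
        ).fetchone()
    if row is None:
        return False  # no lock row, nothing to clean up
    pid, _hostname, heartbeat_at = row
    heartbeat_is_stale = (now - heartbeat_at) > STALE_AFTER_SECONDS
    return heartbeat_is_stale and not pid_alive(pid)
```

The PID check is only meaningful on the host named in the hostname column; when the lock was written inside a container, run the check inside that container.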

Recovery:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Restart the backend. It will acquire the lock fresh.

Prevention:

Monitor scheduler_lock_heartbeat_lost events in logs
If the event occurs more than 3 times per hour, investigate database I/O performance
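
The hourly threshold above can be checked with a short log scan. This is a rough sketch, assuming each JSON log line carries the event name in an "event" field and an ISO 8601 time in a "timestamp" field; both field names are assumptions, so adjust them to the actual log format.

```python
import json
from collections import Counter
from datetime import datetime

EVENT = "scheduler_lock_heartbeat_lost"
THRESHOLD_PER_HOUR = 3  # from this guide: investigate above 3 per hour


def heartbeat_losses_per_hour(lines):
    """Count EVENT occurrences per clock hour, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain text line
        if not isinstance(record, dict) or record.get("event") != EVENT:
            continue
        stamp = record.get("timestamp")
        if not isinstance(stamp, str):
            continue
        try:
            # A trailing "Z" is only accepted by fromisoformat on Python 3.11+.
            moment = datetime.fromisoformat(stamp)
        except ValueError:
            continue
        counts[moment.strftime("%Y-%m-%d %H:00")] += 1
    return counts


def hours_over_threshold(lines):
    """Return the hours in which EVENT occurred more than THRESHOLD_PER_HOUR times."""
    counts = heartbeat_losses_per_hour(lines)
    return sorted(hour for hour, n in counts.items() if n > THRESHOLD_PER_HOUR)
```

For example, `hours_over_threshold(open("/var/log/bangui.log"))` lists the hours worth investigating.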

Two Instances Both Running Scheduler

Symptom: Duplicate blocklist imports, duplicate geo cache cleanups, or duplicate history syncs.

Cause: Both instances believe they hold the lock.

Diagnosis:

Check which instance holds the lock: SELECT pid, hostname FROM scheduler_lock;
Compare with running processes: ps aux | grep bangui

Solution:

Stop one instance immediately
Clear lock: DELETE FROM scheduler_lock;
Restart the remaining instance

Prevention:

Start instances one at a time, so that the first has begun its heartbeat before the next one starts
Check BANGUI_SINGLE_INSTANCE=true is set if single-instance operation is required

Heartbeat Update Failures

Symptom: Logs show scheduler_lock_heartbeat_lost repeatedly, then lock is lost.

Cause: Database writes failing or extremely slow (>5 seconds per write).

Diagnosis:

time sqlite3 /var/lib/bangui/bangui.db "UPDATE scheduler_lock SET heartbeat_at = unixepoch();"

If this takes >1 second, database I/O is degraded.

Solution:

Check database integrity: sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"
Move database to faster storage (SSD)
Check for other I/O bottlenecks on the host

Lock Not Acquired at Startup

Symptom: Instance fails to start with error "Could not acquire scheduler lock".

Cause: Another instance already holds the lock and appears healthy.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT pid, hostname, heartbeat_at FROM scheduler_lock;"
ps -p <pid>

Solution:

If the other instance is healthy and should run the scheduler: this instance must wait
If the other instance has crashed: DELETE FROM scheduler_lock; then restart this instance
If running a single instance: ensure no other instances are running before startup

Rate Limiting

Getting 429 Too Many Requests

Symptom: API returns HTTP 429 with rate_limit_exceeded error code.

Cause: You have exceeded the per-IP rate limit for a specific operation.

Diagnosis:

Check the Retry-After header in the response — this tells you how many seconds to wait
Look for the log event *_rate_limit_exceeded which shows the bucket and client IP

Rate limit buckets:

Bucket	Limit	Window	Operations
`bans:ban`	100	1 minute	Ban IP addresses
`bans:unban`	100	1 minute	Unban IP addresses
`blocklist:import`	10	1 hour	Import blocklists
`config:update`	50	1 minute	Update configuration
`jail:update`	100	1 minute	Update jail config
`jail:create`	100	1 minute	Add log paths, assign filters/actions
`jail:delete`	100	1 minute	Remove log paths, actions
`jail:activate`	100	1 minute	Activate jails
`jail:deactivate`	100	1 minute	Deactivate jails
`filter:update`	50	1 minute	Update filters
`filter:create`	50	1 minute	Create filters
`filter:delete`	50	1 minute	Delete filters
`action:update`	50	1 minute	Update actions
`action:create`	50	1 minute	Create actions
`action:delete`	50	1 minute	Delete actions

Solution:

Wait for the Retry-After period before retrying
If you hit the limit during legitimate bulk operations, consider batching requests
For blocklist imports (10/hour), ensure automated imports are not more frequent
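
A client can honor Retry-After automatically. The sketch below is one way to do it, assuming the header carries a number of seconds (as described above). The send callable is a hypothetical stand-in for whatever HTTP client is in use; it must return the status code and a header mapping keyed by "Retry-After".

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (HTTP status code, response headers)


def call_with_retry_after(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(), waiting for Retry-After seconds after each 429 response."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for _ in range(max_attempts - 1):
        status, headers = response
        if status != 429:
            break
        try:
            delay = float(headers.get("Retry-After", "1"))
        except ValueError:
            delay = 1.0  # header was not a plain number of seconds
        sleep(max(delay, 0.0))
        response = send()
    return response  # last response, which may still be a 429
```

Capping the attempts keeps a misbehaving client from retrying forever; the caller decides what to do if the final response is still a 429.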

Prevention:

Monitor *_rate_limit_exceeded log events
Adjust limits via environment variables if needed (see Docs/CONFIGURATION.md)
For bulk operations, implement client-side throttling
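
Client-side throttling for bulk operations can be as simple as a sliding-window limiter. This is an illustrative sketch, not BanGUI's server-side algorithm (which this guide does not describe); SlidingWindowThrottle is a hypothetical name.

```python
import time
from collections import deque


class SlidingWindowThrottle:
    """Allow at most `limit` calls in any `window_seconds` span, sleeping when needed."""

    def __init__(self, limit, window_seconds, clock=time.monotonic, sleep=time.sleep):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of calls inside the current window

    def _discard_expired(self, now):
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()

    def wait_for_slot(self):
        """Block until another request may be sent, then record it."""
        now = self._clock()
        self._discard_expired(now)
        while len(self._sent) >= self._limit:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self._sent[0] + self._window - now)
            now = self._clock()
            self._discard_expired(now)
        self._sent.append(now)
```

Call `wait_for_slot()` before each request. Choosing a limit somewhat below the server's (for example 90 per 60 seconds against a 100 per minute bucket) leaves headroom for other clients behind the same IP.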

Note: If rate limiting triggers unexpectedly for legitimate use, check for:

Internal monitoring scripts hitting endpoints too frequently
Multiple users behind the same proxy IP
Stale rate limit state after process restart (uses in-memory tracking)

Database Migration Failures

Application Won't Start After Upgrade

Symptom: Application fails to start. Logs show migration errors.

Cause: A migration failed mid-transaction, leaving the database in an inconsistent state.

Diagnosis:

# Check current schema version
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

# List all tables
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

# Check logs for specific error
grep -i migration /var/log/bangui.log

Solution:

If the migration was automatically rolled back: startup will retry the same migration. Start the application again.

If migration keeps failing: Check if table already exists:

sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"

If it exists, manually insert the migration record:

sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"

Full database reset (development only):

rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
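
For the case where a migration keeps failing because its table already exists, the check-then-record steps can be scripted so the version is bound as a parameter rather than typed into SQL by hand. This is a sketch under the assumption that schema_migrations only requires the version column; record_migration is a hypothetical helper, and the database should be backed up first.

```python
import sqlite3
from contextlib import closing


def record_migration(db_path: str, table: str, version: int) -> bool:
    """Record `version` as applied if `table` exists and it is not recorded yet.

    Returns True when a row was inserted, False otherwise.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        table_exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (table,)
        ).fetchone() is not None
        already_recorded = conn.execute(
            "SELECT 1 FROM schema_migrations WHERE version=?", (version,)
        ).fetchone() is not None
        if not table_exists or already_recorded:
            return False
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )
    return True
```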

Prevention:

Always back up before upgrades, with the application stopped so the copy is consistent: cp bangui.db bangui.db.backup
Never manually modify database schema
Monitor migrating_database_schema log events during upgrades
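
When the application cannot be stopped for a backup, Python's sqlite3 module offers an online backup API that takes a consistent snapshot even while the database is in use. A minimal sketch, with backup_database as a hypothetical helper name:

```python
import os
import sqlite3
from contextlib import closing


def backup_database(source_path: str, backup_path: str) -> bool:
    """Copy a SQLite database with the online backup API and verify the copy."""
    if not os.path.isfile(source_path):
        # sqlite3.connect() would otherwise silently create an empty database.
        raise FileNotFoundError(source_path)
    with closing(sqlite3.connect(source_path)) as source, \
            closing(sqlite3.connect(backup_path)) as target:
        source.backup(target)  # consistent snapshot of the source database
        result = target.execute("PRAGMA integrity_check;").fetchone()[0]
    return result == "ok"
```

For example, `backup_database("/var/lib/bangui/bangui.db", "/var/lib/bangui/bangui.db.backup")` returns True when the copy passes the integrity check.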

Schema Version Mismatch

Symptom: Error: "database schema version X is newer than supported version Y"

Cause: BanGUI was downgraded to an older version that does not support the current schema.

Solution: Upgrade to a version compatible with the current schema, or restore from backup.

502 Bad Gateway Errors

Symptom: Nginx returns 502 Bad Gateway

Cause: The backend container is unreachable — either down, restarting, or not yet healthy.

Diagnosis:

# Check backend container status
docker ps -a | grep bangui-backend

# Check if backend is responding directly (on the container network)
docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health

# Check backend logs
docker logs bangui-backend --tail 50

Common causes and solutions:

Cause	Diagnosis	Solution
Backend restarting	`docker ps` shows backend repeatedly restarting	Check health check timing; may need longer `start_period`
Health check failing	Backend log shows socket errors	Verify fail2ban container is healthy before backend starts
Startup too slow	`start_period: 40s` not enough on slow hosts	Increase `start_period` in compose file
Port misconfiguration	`expose` vs `ports` mismatch	Ensure backend exposes 8000 and frontend proxies to it

Prevention:

The depends_on: condition: service_healthy ensures the backend is fully started before the frontend proxies requests.
The health check returns 503 when fail2ban is offline, triggering container restart automatically.
Health check parameters are tuned for typical startup time — adjust start_period if the host is slow or resource-constrained.

Graceful Shutdown Issues

Container Killed Before Tasks Complete

Symptom: Logs show pending_tasks_timeout and tasks are cancelled mid-execution.

Cause: Docker's stop_grace_period is too short, or tasks take longer than the 25s graceful timeout.

Diagnosis:

# Check if container was killed by SIGKILL
docker inspect bangui-backend --format '{{.State.ExitCode}}'
# Exit code 137 = SIGKILL
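
Exit codes above 128 conventionally mean "terminated by signal (code - 128)", which is why 137 maps to SIGKILL (signal 9). The following sketch decodes any such code on a POSIX host; describe_exit_code is a hypothetical helper.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short description."""
    if code == 0:
        return "clean exit"
    if code > 128:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (no matching signal)"
    return f"exited with error code {code}"
```

A code of 143 decodes to SIGTERM, which indicates the process stopped on the initial stop signal rather than being killed after the grace period.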

Solution:

Increase stop_grace_period in docker-compose.yml:
```
backend:
  stop_grace_period: 60s
```
The Python graceful timeout is 25s (leaving margin before Docker kill)
If tasks still timeout, check task code — long-running tasks should handle cancellation gracefully
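
Handling cancellation gracefully in an asyncio task usually means catching asyncio.CancelledError, cleaning up, and re-raising. The sketch below illustrates the pattern with hypothetical names (long_running_task, shutdown); it is not BanGUI's shutdown code.

```python
import asyncio


async def long_running_task(state: dict) -> None:
    """Work in small steps so that cancellation is noticed quickly."""
    try:
        while True:
            await asyncio.sleep(0.01)  # stands in for one unit of work
            state["processed"] = state.get("processed", 0) + 1
    except asyncio.CancelledError:
        state["cleaned_up"] = True  # release resources, persist progress
        raise  # re-raise so the task is reported as cancelled


async def shutdown(task: asyncio.Task, timeout: float) -> bool:
    """Cancel the task and wait up to `timeout` seconds; True if it finished in time."""
    task.cancel()
    done, _pending = await asyncio.wait({task}, timeout=timeout)
    return task in done
```

Keeping each unit of work short, with an await between units, is what lets the task react well inside the 25s graceful timeout.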

Scheduler Lock Not Released

Symptom: After container restart, logs show Could not acquire scheduler lock.

Cause: Previous instance shut down without releasing the lock, or lock TTL hasn't expired.

Diagnosis:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Solution:

# Clear stale lock
sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
# Restart container

Prevention:

Graceful shutdown releases lock immediately (not waiting for TTL expiry)
Monitor logs for scheduler_lock_released on clean shutdown

In-Flight Requests Dropped

Symptom: Client connections closed abruptly during shutdown.

Cause: Too short a graceful timeout, or clients not configured to retry.

Solution:

Ensure clients implement proper retry logic with backoff
For critical operations, use background tasks with status polling
Increase graceful timeout if network latency is high

General Recovery Commands

Clear all locks:

sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"

Check lock status:

sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"

Verify database integrity:

sqlite3 /var/lib/bangui/bangui.db "PRAGMA integrity_check;"

Regex Pattern Rejected

Symptom: Filter or action configuration fails with "Invalid regex" error

Cause: The regex pattern is either syntactically invalid or detected as a ReDoS (Regular Expression Denial of Service) vulnerability.

Diagnosis:

Check the error message — it indicates whether the pattern is syntactically invalid or flagged as dangerous
Look for log events: regex_redos_detected or regex_compilation_timeout

Common ReDoS patterns that are rejected:

Pattern	Problem
`(a+)+b`	Nested quantifiers with overlap
`([a-z]+)*d`	Quantifier inside quantifier
`(x+)+y`	Nested plus operators

Solution:

Rewrite the pattern to avoid nested quantifiers on overlapping groups
Use atomic groups or possessive quantifiers where the regex engine supports them (Python's re module does from version 3.11 onward): (?>a+)+b
Simplify complex alternations
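
As an illustration of such a rewrite, the (a+)+b pattern listed above accepts exactly the same strings as a+b, but only the nested form backtracks heavily on input that almost matches. The sketch below compares the two in Python's re module; the helper names are hypothetical.

```python
import re
import time

VULNERABLE = re.compile(r"(a+)+b")
REWRITTEN = re.compile(r"a+b")  # same accepted strings, no nested quantifier


def same_verdict(samples) -> bool:
    """True if both patterns accept or reject every sample identically."""
    return all(
        bool(VULNERABLE.fullmatch(s)) == bool(REWRITTEN.fullmatch(s))
        for s in samples
    )


def time_match(pattern, text) -> float:
    """Return the seconds spent on a single match attempt."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start
```

Timing both patterns on a run of "a" characters with no trailing "b" (for example `"a" * 22`) shows the difference; increase the length with care, since the time for the nested form grows very quickly with each added character.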

Prevention:

Test regex patterns in isolation before deploying
Avoid patterns with quantified groups inside other quantifiers
Prefer explicit character classes over .* where possible
Use regexploit to audit patterns

Configuration Validation at Startup

BanGUI validates configuration at startup. Errors raised here indicate misconfiguration that must be fixed before the application can start.

Database Parent Directory Does Not Exist

Symptom: Application fails to start with: Database parent directory does not exist: /path/to/parent

Cause: The parent directory of BANGUI_DATABASE_PATH does not exist.

Solution:

mkdir -p /path/to/parent
# Then restart BanGUI

Database Parent Directory Not Writable

Symptom: Application fails to start with: Database parent directory not writable: /path/to/parent

Cause: The process cannot write to the database parent directory.

Solution:

chmod 755 /path/to/parent
# Verify the user running BanGUI owns the directory or has write access
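
Both directory problems can be checked before starting BanGUI. This is a rough preflight sketch that mirrors the two error messages quoted above; check_database_parent is a hypothetical helper, not BanGUI's own validator, and it should be run as the same user as the BanGUI process.

```python
import os


def check_database_parent(database_path: str) -> list[str]:
    """Return a list of problems with the database's parent directory."""
    parent = os.path.dirname(os.path.abspath(database_path))
    if not os.path.isdir(parent):
        return [f"Database parent directory does not exist: {parent}"]
    # Creating files in a directory needs both write and search (execute) permission.
    if not os.access(parent, os.W_OK | os.X_OK):
        return [f"Database parent directory not writable: {parent}"]
    return []
```

An empty list means the directory exists and the current user can create files in it.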

fail2ban Socket Not Readable

Symptom: Application fails to start with: fail2ban socket not readable: /path/to/socket

Cause: The socket file exists but is not readable by the BanGUI process.

Solution:

chmod 644 /path/to/socket
ls -la /path/to/socket

fail2ban Config Directory Does Not Exist

Symptom: Application fails to start with: fail2ban config directory does not exist: /path/to/config

Cause: BANGUI_FAIL2BAN_CONFIG_DIR points to a directory that does not exist.

Solution:

Mount the fail2ban configuration directory at the expected path
Or adjust BANGUI_FAIL2BAN_CONFIG_DIR to point to the correct location
In Docker: add a volume mount for the fail2ban config directory

GeoIP Database File Does Not Exist

Symptom: Application fails to start with: GeoIP database file does not exist: /path/to/GeoLite2-Country.mmdb

Cause: BANGUI_GEOIP_DB_PATH points to a file that does not exist.

Solution:

Download the MaxMind GeoLite2-Country database from https://dev.maxmind.com/geoip/geolite2-country
Place it at the configured path, or update BANGUI_GEOIP_DB_PATH to the correct location
Alternatively, set BANGUI_GEOIP_DB_PATH to null to disable GeoIP lookups

session_secret Too Short or Weak

Symptom: Application fails to start with: session_secret must be at least 32 characters or session_secret is too weak

Cause: BANGUI_SESSION_SECRET is missing, too short, or contains common weak words.

Solution:

# Generate a new secret
python -c "import secrets; print(secrets.token_hex(32))"

Then set it in your .env file or environment variables.

Enabling Debug Logs for Third-Party Libraries

BanGUI suppresses verbose DEBUG logs from APScheduler and aiosqlite by default (see Docs/Observability.md). When troubleshooting scheduler or database issues, you can temporarily re-enable these logs.

Quick method (environment variable)

Set BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false and ensure BANGUI_LOG_LEVEL=debug:

BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false \
BANGUI_LOG_LEVEL=debug \
python -m uvicorn app.main:create_app --factory

This allows APScheduler and aiosqlite to inherit the application log level without editing code.

Code method (for permanent changes)

If you need to change the level for a specific library only, edit backend/app/main.py inside _configure_logging():

logging.getLogger("apscheduler").setLevel(logging.DEBUG)

Restart the application. You will see scheduler polling messages such as:

Looking for jobs to run
Next wakeup is due at ...
Running job ...
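
Setting the level on the "apscheduler" logger is enough because child loggers without an explicit level inherit it from their parent. The sketch below demonstrates that inheritance with the standard library only; the child logger names are examples and the helper names are hypothetical.

```python
import logging


def enable_library_debug(name: str) -> None:
    """Set a library's top-level logger to DEBUG."""
    logging.getLogger(name).setLevel(logging.DEBUG)


def effective_level_name(name: str) -> str:
    """Return the level a logger actually uses, after inheritance."""
    return logging.getLevelName(logging.getLogger(name).getEffectiveLevel())


if __name__ == "__main__":
    enable_library_debug("apscheduler")
    for child in ("apscheduler.scheduler", "apscheduler.executors.default"):
        print(child, effective_level_name(child))
```

Handlers have their own levels, so DEBUG records still need a handler that accepts them before they appear in the log.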

Reverting

Remove the environment variable or code change and restart. When suppression is re-enabled, the loggers return to WARNING level.

Plain Text Logs Still Appearing

If bangui.log contains plain text lines that are not JSON, a library is bypassing the JSON formatter (app.utils.json_formatter).

Diagnosis:

Identify the logger name in the plain text line (usually at the start of the line).
Check whether the logger is listed in backend/app/main.py::_configure_logging() under the third-party overrides.
Verify that the JSON formatter is attached to all handlers:
```
for handler in handlers:
    handler.setFormatter(formatter)
```

Common causes:

Cause	Fix
Library initializes its own handler after startup	Add `logging.getLogger("library_name").setLevel(logging.WARNING)` in `_configure_logging()`.
Custom handler added outside `_configure_logging()`	Ensure all handlers use the JSON formatter from `app.utils.json_formatter`.
Log emitted before `_configure_logging()` is called	Move logging configuration earlier in the lifespan or app factory.
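
To locate the offending lines quickly, the log file can be scanned for anything that does not parse as a JSON object. A minimal sketch, with find_plain_text_lines as a hypothetical helper:

```python
import json


def find_plain_text_lines(lines):
    """Return (line_number, text) for every non-empty line that is not a JSON object."""
    offenders = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            offenders.append((number, text))
            continue
        if not isinstance(parsed, dict):
            offenders.append((number, text))  # valid JSON, but not a log record
    return offenders
```

For example, `find_plain_text_lines(open("/var/log/bangui.log"))` returns the lines whose logger names are worth checking against the table above.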

Getting Help

If issues persist after following this guide:

Enable debug logging: BANGUI_LOG_LEVEL=debug
Collect logs around the failure time
Check Docs/Deployment.md for configuration guidance
Check Docs/Observability.md for monitoring setup
