Lukas eb339efcfd Add Kubernetes liveness/readiness probes and middleware order validation

- Split /health into /health/live (liveness) and /health/ready (readiness)
  following Kubernetes conventions. Combined /health retained for backward
  compatibility with existing Docker HEALTHCHECK definitions.
- Add ReadyCheck and ReadyResponse models for structured readiness output.
- Add _assert_middleware_order() startup check enforcing:
  RateLimit → Csrf → CorrelationId middleware chain.
- Register CorrelationIdMiddleware, CsrfMiddleware, RateLimitMiddleware
  in create_app() with documented required order (reverse of processing).
- Add correlation.py, csrf.py, rate_limit.py middleware modules.
- Add health probe tests in test_health_probes.py.
- Update test_main.py with middleware order assertion tests.
- Update frontend useFetchData hook tests.
- Docs: update Deployment.md with Kubernetes probe config examples.

2026-05-04 02:42:09 +02:00

Deployment Guide

Graceful Shutdown

BanGUI implements graceful shutdown to ensure in-flight operations complete before the process exits. This prevents:

Incomplete blocklist imports leaving stale data
Interrupted ban requests
Corrupted background job states
Unclean database connection closures

How It Works

SIGTERM received — Docker sends SIGTERM when docker stop is called
Uvicorn catches SIGTERM — Notifies the FastAPI lifespan handler
Lifespan shutdown begins — Scheduler stops accepting new jobs
In-flight tasks drain — Up to 25 seconds for running jobs to complete
Resources cleaned up — HTTP session, external logging, scheduler lock, DB connection

Docker Configuration

backend:
  stop_grace_period: 30s  # Give lifespan 30s to complete before SIGKILL

The stop_grace_period of 30s gives the Python code a 25s graceful timeout, leaving a 5s safety margin before Docker sends SIGKILL.

Shutdown Sequence

Step	Action	Timeout
1	Scheduler stops accepting new jobs	Immediate
2	Wait for pending background tasks	25s max
3	Close HTTP session	Immediate
4	Flush external logging handler	Immediate
5	Release scheduler lock	Immediate
6	Close database connection	Immediate

Background Tasks That Drain

Blocklist imports
Geo IP cache resolutions
History sync operations
Geo cache cleanup
Geo cache flush
Session cleanup
Rate limiter cleanup
Scheduler lock heartbeat
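
The drain step (step 2 of the shutdown sequence) can be sketched with plain asyncio. This is a minimal illustration, not BanGUI's actual lifespan code: the function name `drain_tasks` and its return value are assumptions made for the example, and only the 25-second default comes from this guide.

```python
import asyncio


async def drain_tasks(tasks, timeout=25.0):
    """Wait up to `timeout` seconds for `tasks`; cancel whatever is still running.

    Returns (completed_count, cancelled_count). Hypothetical helper.
    """
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to run their cleanup handlers.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)
```

A non-zero cancelled count corresponds to the `pending_tasks_timeout cancelled_count=...` log line shown under Monitoring Shutdown.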

Monitoring Shutdown

Logs during shutdown:

bangui_shutting_down timeout_seconds=25.0
scheduler_stopped_accepting_jobs
waiting_for_pending_tasks count=3 timeout_seconds=25.0
pending_tasks_completed
http_session_closed
external_logging_shutdown_complete
scheduler_lock_released
bangui_shut_down

If tasks exceed the timeout:

pending_tasks_timeout cancelled_count=3

Rolling Deployments

During rolling deployments:

Old instance releases scheduler lock immediately on shutdown
New instance acquires lock without waiting for TTL expiry
Zero downtime for background job execution

Health Checks

The backend container includes three health check endpoints:

Combined Health Check — `GET /api/v1/health`

Reports application and component status for Docker HEALTHCHECK and legacy monitoring integration:

HTTP 200 with {"status": "ok", ...} — all components healthy
HTTP 200 with {"status": "degraded", ...} — some components unhealthy (e.g., database error) but fail2ban reachable
HTTP 503 with {"status": "unavailable", ...} — fail2ban is unreachable (backend will restart)

Component checks performed:

Component	Check	Notes
fail2ban	Socket ping via cached status	Returns 503 when offline
database	Opens and closes a test connection	Returns degraded when failing
scheduler	`scheduler.running` attribute	Returns degraded when stopped
cache	Session cache presence	Returns degraded when not initialised
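
The status rules above reduce to a small decision function. The sketch below assumes each component check has already been collapsed to a boolean; the function name and the dictionary shape are illustrative and not taken from the BanGUI source.

```python
def combined_health(components):
    """Map component results, e.g. {"fail2ban": True, ...}, to (http_code, status)."""
    if not components.get("fail2ban", False):
        return 503, "unavailable"  # fail2ban unreachable overrides everything else
    if all(components.values()):
        return 200, "ok"
    return 200, "degraded"  # fail2ban reachable, another component failing
```

For example, `combined_health({"fail2ban": True, "database": False, "scheduler": True, "cache": True})` evaluates to `(200, "degraded")`.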

Kubernetes Probes — Liveness and Readiness

Two separate probes following Kubernetes conventions:

Endpoint	Purpose	HTTP Code	Kubernetes Action
`GET /api/v1/health/live`	Process alive	Always 200	Restart container if non-2xx
`GET /api/v1/health/ready`	All subsystems ready	200 (all pass) / 503 (any fail)	Stop routing traffic if non-2xx

/health/live — Liveness probe: Returns 200 when the Python process and event loop are responsive. No subsystem checks are performed — this endpoint is always fast. Use for Kubernetes livenessProbe.

/health/ready — Readiness probe: Verifies all critical sub-systems are reachable before routing traffic. Returns 200 only when all pass; returns 503 with a JSON body listing every failed check otherwise.

Subsystem	Check	Timeout
database	Opens and closes a test connection	2 s
fail2ban	Socket reachability via cached server status	N/A (instant)
config_dir	Config directory read access (`os.R_OK`)	2 s
scheduler	`scheduler.running` attribute	N/A (instant)

Readiness response example (all healthy — HTTP 200):

{
  "status": "ok",
  "checks": [
    {"name": "database", "healthy": true},
    {"name": "fail2ban", "healthy": true},
    {"name": "config_dir", "healthy": true},
    {"name": "scheduler", "healthy": true}
  ],
  "failed_count": 0
}

Readiness response example (fail2ban offline — HTTP 503):

{
  "status": "error",
  "checks": [
    {"name": "database", "healthy": true},
    {"name": "fail2ban", "healthy": false, "message": "Socket not reachable"},
    {"name": "config_dir", "healthy": true},
    {"name": "scheduler", "healthy": true}
  ],
  "failed_count": 1
}
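
One way to produce a response of this shape is to run every check under a timeout and aggregate the results. The following is a sketch under stated assumptions: the helper names `run_check` and `readiness` are hypothetical, checks are modelled as zero-argument coroutine functions that raise on failure, and a single timeout is applied to all of them (the guide lists 2 s only for the database and config_dir checks).

```python
import asyncio


async def run_check(name, check, timeout):
    """Run one check coroutine and convert its outcome into a result dict."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return {"name": name, "healthy": True}
    except asyncio.TimeoutError:
        return {"name": name, "healthy": False, "message": f"Timed out after {timeout} s"}
    except Exception as exc:  # any failure marks the subsystem unhealthy
        return {"name": name, "healthy": False, "message": str(exc)}


async def readiness(checks, timeout=2.0):
    """Return (http_code, body) for a mapping of check name -> coroutine function."""
    results = await asyncio.gather(
        *(run_check(name, check, timeout) for name, check in checks.items())
    )
    failed = sum(1 for result in results if not result["healthy"])
    body = {
        "status": "ok" if failed == 0 else "error",
        "checks": list(results),
        "failed_count": failed,
    }
    return (200 if failed == 0 else 503), body
```

Running the checks concurrently keeps the worst-case probe latency close to the single longest timeout rather than the sum of all of them.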

Why separate liveness and readiness? Liveness (/health/live) must be cheap — a slow or hanging liveness probe causes Kubernetes to restart a perfectly healthy container. Readiness (/health/ready) can afford to check sub-systems because traffic is only held back temporarily while a pod recovers.

Docker Health Check:

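The Dockerfile includes a HEALTHCHECK that queries the combined /api/v1/health endpoint. Docker treats HTTP 503 as a failed check and marks the container unhealthy after 3 consecutive failures (about 90 seconds with a 30-second interval). Docker Engine on its own only reports the unhealthy state; the restart is carried out by the orchestrator or restart tooling in use (Docker Swarm, for example, replaces unhealthy tasks).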

Why 503 for offline fail2ban?

If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy, which masks the infrastructure failure. Returning 503 when fail2ban is unreachable lets orchestration tools (Docker Swarm, Kubernetes) detect the failure and restart the backend container until fail2ban recovers.

Docker Compose health check parameters:

Parameter	Value	Rationale
`interval`	30s	Balance between responsiveness and load
`timeout`	10s	Allows for slow probe on busy system
`retries`	3	~90 seconds before restart (3 × 30s)
`start_period`	40s	Allows app and fail2ban to fully start

Rate Limiting

Rate limiting is enforced at two levels:

Global middleware — Per-IP request rate limit across all endpoints (default: 200 requests/minute per IP)

Per-bucket limits — Stricter limits on specific operations:

Bucket	Limit	Window	Purpose
`bans:ban`	100/min	60s	Ban operations
`bans:unban`	100/min	60s	Unban operations
`blocklist:import`	10/hour	3600s	Import operations
`config:update`	50/min	60s	Config write operations
`jail:*`	100/min	60s	Jail management
`filter:*`	50/min	60s	Filter management
`action:*`	50/min	60s	Action management
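
Resolving a bucket name against this table, including the wildcard rows, might look like the sketch below. The limits are copied from the table; the function name `limit_for`, the example bucket `jail:start`, and the fallback to the global default for unknown buckets are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# (max_requests, window_seconds) per bucket pattern, copied from the table above.
BUCKET_LIMITS = {
    "bans:ban": (100, 60),
    "bans:unban": (100, 60),
    "blocklist:import": (10, 3600),
    "config:update": (50, 60),
    "jail:*": (100, 60),
    "filter:*": (50, 60),
    "action:*": (50, 60),
}
GLOBAL_LIMIT = (200, 60)  # default global per-IP limit


def limit_for(bucket):
    """Return the limit for `bucket`, falling back to the global default."""
    if bucket in BUCKET_LIMITS:  # exact names win over wildcard patterns
        return BUCKET_LIMITS[bucket]
    for pattern, limit in BUCKET_LIMITS.items():
        if fnmatchcase(bucket, pattern):
            return limit
    return GLOBAL_LIMIT
```

With these definitions, a hypothetical `jail:start` bucket resolves through the `jail:*` row to 100 requests per 60 seconds.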

Process-Local Scope

Current implementation is process-local. Each worker maintains independent in-memory counters. In a multi-worker deployment (N workers), an attacker can send up to N × limit requests before any single worker triggers a block — effectively multiplying the allowed request rate by the number of workers.
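
The N × limit effect is easy to reproduce with a toy sliding-window limiter. The class below is an illustrative stand-in for the in-process limiter, not its actual code: the class name, constructor arguments, and injectable `clock` are assumptions, and only the `check_allowed()` method name is taken from this guide.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory, per-process sliding-window limiter (illustrative only)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def check_allowed(self, ip):
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        hits = self._hits[ip]
        while hits and now - hits[0] >= self._window:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= self._max:
            return False, self._window - (now - hits[0])
        hits.append(now)
        return True, 0.0


# Two workers means two independent limiters: the client gets 2 x the limit.
workers = [SlidingWindowLimiter(3, 60, clock=lambda: 0.0) for _ in range(2)]
allowed = sum(w.check_allowed("203.0.113.7")[0] for w in workers for _ in range(5))
```

Here each simulated worker accepts 3 of its 5 requests, so the same client gets 6 requests through a limit of 3.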

Short-term mitigation: The scheduler lock enforces single-worker mode. The startup warning log (rate_limiting_process_local_only) documents this constraint. Deploy with one worker.

Long-term solution: Replace the in-process GlobalRateLimiter with a Redis-backed adapter. The check_allowed() and check_allowed_for_bucket() interfaces are designed for a drop-in replacement using atomic INCR + EXPIRE semantics — no changes needed in middleware or router code.

Redis Migration (Future)

When migrating to Redis, replace the in-memory deque store with:

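# Fixed-window counter with expiry (sketch; `redis` is assumed to be a redis-py client)
def check_allowed(redis, ip, max_requests, window_seconds):
    key = f"rl:{ip}"
    count = redis.incr(key)
    if count == 1:  # First request in this window: start the expiry clock
        redis.expire(key, window_seconds)
    if count > max_requests:
        return False, max(redis.ttl(key), 0)  # Seconds until the window resets
    return True, 0

The bucket variants use INCR + EXPIRE on rl:{bucket}:{ip} keys. Note that INCR + EXPIRE implements a fixed-window counter, which approximates rather than preserves the in-memory sliding window: a client can send up to 2 × limit requests across a window boundary. In exchange, all workers share one counter.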

Monitoring

Check logs for these events:

global_rate_limit_exceeded — Global middleware blocked a request (WARNING)
rate_limiting_process_local_only — Startup warning about multi-worker limitation (WARNING)
rate_limiter_cleanup — Periodic cleanup of expired entries (DEBUG)

CORS Configuration

Cross-Origin Resource Sharing (CORS) must be explicitly configured when the frontend and backend are served from different origins.

Development

By default, the backend allows requests from common localhost development origins:

http://localhost:5173
http://127.0.0.1:5173
https://localhost:5173
https://127.0.0.1:5173

No additional configuration is needed for local development — just run the frontend and backend normally.

Production

In production, override the default with your actual frontend origin(s):

Docker Compose:

environment:
  BANGUI_CORS_ALLOWED_ORIGINS: "https://example.com,https://www.example.com"

Environment File (.env):

BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://www.example.com

Multiple Origins: Separate multiple allowed origins with commas (no spaces):

BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://app.example.com,https://admin.example.com

Disable CORS: To disable CORS entirely (e.g., when the frontend is served from the same origin as the backend):

BANGUI_CORS_ALLOWED_ORIGINS=
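
Parsing this variable could be sketched as follows. The function name is hypothetical, and two behaviours are assumptions rather than documented BanGUI behaviour: surrounding whitespace is tolerated, and a wildcard entry is rejected outright in line with the security guidance in this section.

```python
def parse_allowed_origins(raw):
    """Split a comma-separated origin list; an empty value disables CORS."""
    origins = [item.strip() for item in raw.split(",")]
    origins = [origin for origin in origins if origin]
    if "*" in origins:
        # Assumption: refuse wildcards, since credentials mode needs exact origins.
        raise ValueError("wildcard origins are not allowed; list exact origins")
    return origins
```

An empty list means no origin is allowed, which matches the "Disable CORS" setting above.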

Security Considerations

Always specify exact origins — never use wildcard * in production, especially with allow_credentials=true (credentials mode is required for the session cookie).
Use HTTPS in production — the backend enforces the Secure cookie flag, which requires HTTPS (or localhost for development).
Validate in reverse proxy — if using Nginx or a CDN reverse proxy, validate the Origin header before forwarding requests to ensure only legitimate origins reach the backend.

Troubleshooting

Symptom	Cause	Solution
`Access-Control-Allow-Origin` header missing from response	CORS not configured or origin not whitelisted	Check `BANGUI_CORS_ALLOWED_ORIGINS` and ensure your frontend origin is included
Browser blocks requests with CORS error	Credentials mode enabled but origin not exactly whitelisted	Ensure `BANGUI_CORS_ALLOWED_ORIGINS` includes the exact origin (protocol + domain + port) of your frontend
Works in development but fails in production	Default localhost origins used instead of production frontend domain	Override `BANGUI_CORS_ALLOWED_ORIGINS` in production environment

Scheduler Lock

In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring only one instance runs the scheduler at a time.

How It Works

The lock is stored in the SQLite database and enforced via:

Lock Acquisition — At startup, each instance tries to insert a lock record. Only one succeeds; others reject startup with a clear error message.
Heartbeat — The lock-holding instance sends a heartbeat every 5 seconds to prove it's still alive.
Stale Lock Cleanup — On startup, any lock older than 60 seconds (without a heartbeat) is automatically deleted, allowing recovery from instance crashes.

Configuration

Parameter	Value	Rationale
Heartbeat Interval	5 seconds	Allows ~12 missed heartbeats before lock expires
Lock TTL	60 seconds	Time before a lock without heartbeat is considered abandoned
Min Safe Ratio	12x (TTL / interval)	Robust protection against temporary delays or high load

With a 60-second TTL and 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This provides strong protection against false positives while still detecting genuine crashes.
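
The acquire, heartbeat, and stale-cleanup steps can be sketched against SQLite directly. This is an illustration under assumptions, not the contents of scheduler_lock.py: the table name comes from this guide, but the column layout, function names, and the explicit `now` parameter (used here to keep the example deterministic) are hypothetical.

```python
import sqlite3

HEARTBEAT_INTERVAL_SECONDS = 5
LOCK_TTL_SECONDS = 60  # 12 x the heartbeat interval


def init_lock_table(conn):
    # Hypothetical schema: a single row (id = 1) identifies the current holder.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "owner TEXT NOT NULL, heartbeat_at REAL NOT NULL)"
    )


def try_acquire(conn, owner, now):
    """Delete a stale lock, then try to insert our own. True if we hold it."""
    with conn:  # one transaction: cleanup + insert
        conn.execute(
            "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
            (now - LOCK_TTL_SECONDS,),
        )
        cur = conn.execute(
            "INSERT OR IGNORE INTO scheduler_lock (id, owner, heartbeat_at) "
            "VALUES (1, ?, ?)",
            (owner, now),
        )
        return cur.rowcount == 1


def heartbeat(conn, owner, now):
    """Refresh the timestamp; False means the lock is no longer ours."""
    with conn:
        cur = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? WHERE owner = ?",
            (now, owner),
        )
        return cur.rowcount == 1
```

The primary-key constraint is what makes acquisition exclusive: a second instance's insert is ignored while a fresh row exists, and only succeeds once the existing row is older than the TTL.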

Monitoring

Check logs for these key events:

scheduler_lock_acquired — Lock successfully acquired at startup (INFO)
scheduler_lock_heartbeat_updated — Heartbeat successfully updated (DEBUG)
scheduler_lock_heartbeat_failed — Heartbeat update failed; lock may be lost (WARNING)
scheduler_lock_heartbeat_timeout — Heartbeat exceeded 5-second timeout (ERROR)
scheduler_lock_held_by_other_instance — Another instance holds the lock (WARNING at startup)

Troubleshooting: "Blocklist import runs twice"

Symptom: Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption.

Cause: The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running.

Solution:

Check heartbeat timing — Ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk).
Verify database health — Run SELECT * FROM scheduler_lock; to see if a stale lock exists. If present, delete it: DELETE FROM scheduler_lock;
Review logs — Look for scheduler_lock_heartbeat_failed or scheduler_lock_heartbeat_timeout errors in the time window when duplication occurred.
Increase resource limits — If the backend is memory/CPU constrained, increase limits in docker-compose.yml to prevent slowdowns that trigger false lock timeouts.
Check database performance — Slow database queries can delay heartbeat updates. Run PRAGMA integrity_check; to check for corruption.

If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability.

Troubleshooting: "Scheduler stops completely"

Symptom: Background tasks (blocklist import, geo cache cleanup, history sync, session cleanup) stop running. No errors in logs but tasks don't execute.

Cause: Instance holding the scheduler lock crashed without releasing it, or heartbeat is failing silently.

Diagnosis:

Check if lock exists: SELECT * FROM scheduler_lock;
If lock exists with a PID that no longer runs, it's orphaned
Check logs for scheduler_lock_heartbeat_lost warnings

Solution:

Clear the orphaned lock: DELETE FROM scheduler_lock;
Restart the instance that should hold the lock
Verify lock acquisition: grep "scheduler_lock_acquired" logs
If heartbeat keeps failing, check database latency (SQLite heartbeats should be <100ms)

Prevention:

Monitor scheduler_lock_heartbeat_lost events — more than 3 in an hour indicates a problem
Ensure database I/O is not bottlenecked (SSD recommended for SQLite)
Consider reducing heartbeat interval if network latency causes false timeouts

Advanced: Migrating to Redis

For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis:

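Why: Redis executes each command atomically and expires keys server-side, so lock acquisition and TTL handling no longer depend on SQLite write latency or on comparing timestamps written by different instances.
How: Install redlock-py or redis-py (whose redis.asyncio module replaces the deprecated aioredis package), replace scheduler_lock.py with a Redis implementation, and update the heartbeat interval to 2-3 seconds.
Trade-off: Adds a Redis dependency but eliminates database lock contention and makes lock operations atomic.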

This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs.

Resource Limits

All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:

Isolation: A misbehaving container cannot crash others or the host
Predictability: Reservations guarantee minimum resources even under load
Efficiency: Unused reserved capacity can be borrowed by other containers

Container Resource Limits

Container	Limit CPU	Limit Memory	Reserved CPU	Reserved Memory	Purpose
fail2ban	0.5	128M	0.1	64M	Monitors logs, bans IPs—typically idle
backend	2.0	512M	1.0	256M	Core app: database, fail2ban API, config management
frontend	0.5	128M	0.25	64M	Nginx: serves SPA + API proxy

Rationale

fail2ban: Lightweight log monitoring. Occasionally CPU spikes during ban processing but memory usage is minimal.
backend: Heavy lifting—Python runtime, SQLite database, background jobs. May need extra memory for large blocklists. Reservation of 1.0 CPU ensures responsive API even when frontend is busy.
frontend: Nginx is efficient. Limit of 0.5 CPU and 128M memory is more than sufficient for reverse proxy duties.

Memory Considerations

Backend Memory Requirements

The backend typically runs in 256–512M under normal load. Memory usage depends on:

Blocklist size: Large blocklists (>1M entries) require more heap space
Cache warmth: First query after startup may require more memory as caches fill
Concurrent connections: Each active user session uses a small amount of memory

Tuning: If you see OOM kills in logs, increase backend limits and reservations (e.g., 1024M limit). Test under realistic load before finalizing.

Frontend Memory Usage

Nginx is typically <50M. If you see memory pressure on frontend, check for:

Misconfigured cache headers on static assets
Large log volumes (nginx access logs)

Docker Swarm & Kubernetes

For production deployments using orchestration platforms:

Docker Swarm

The deploy sections in docker-compose.yml are compatible with docker stack deploy:

docker stack deploy -c Docker/docker-compose.yml bangui

Swarm respects the same limits and reservations fields.

Kubernetes

For Kubernetes, translate resource constraints to equivalent resources fields in your deployment manifests:

containers:
  - name: backend
    image: git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "1"
        memory: "256Mi"

Kubernetes equivalent mappings:

Docker deploy.resources.limits → Kubernetes resources.limits
Docker deploy.resources.reservations → Kubernetes resources.requests

Monitoring Resource Usage

Docker Compose (Development)

docker stats

Shows real-time CPU and memory usage for all running containers.

Production (Docker Swarm / Kubernetes)

Use native monitoring:

Docker Swarm: Prometheus + Grafana
Kubernetes: Metrics Server + dashboard or Prometheus

Configuration

All runtime settings are documented in CONFIGURATION.md, including database, session, fail2ban, HTTP client, geolocation, CORS, logging, rate limiting, and observability options.

Environment Variables

Resource limits are configured in Docker/docker-compose.yml and cannot be overridden via environment variables. To adjust limits:

Edit Docker/docker-compose.yml
Modify the deploy.resources.limits and deploy.resources.reservations sections
Restart containers: make down && make up

Troubleshooting

Issue	Symptom	Solution
Backend OOM kills	"Exit code 137" in logs	Increase backend `memory` limit
Throttling	CPU at 100%, requests slow	Increase CPU limit or optimize code
Service startup timeout	Containers not becoming "healthy"	Increase reservation to guarantee capacity at startup
Host unresponsive	System-wide lag	Reduce container limits to prevent host starvation

Disaster Recovery

Database Migration Failures

If a migration fails mid-transaction, the application refuses to start. This is intentional — it prevents inconsistent schema states.

Diagnosis:

Check current schema version:

sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"

Check which tables exist:

sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"

Check application logs for the specific error.

Recovery Options:

Automatic rollback: Next startup re-applies the same migration from scratch

Manual completion: Apply the migration manually, then insert the version record:

# Run everything in a single sqlite3 session so BEGIN and COMMIT share one connection
sqlite3 /var/lib/bangui/bangui.db <<'SQL'
BEGIN IMMEDIATE;
-- Run your SQL here
INSERT INTO schema_migrations (version) VALUES (?);  -- replace ? with the migration's version number
COMMIT;
SQL

Full reset (development only): rm bangui.db bangui.db-wal bangui.db-shm
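
The all-or-nothing behaviour that makes a failed migration safe to retry can be sketched with Python's sqlite3 module. This is a minimal illustration, not the project's migration runner: the function name and signature are assumptions, and only the schema_migrations table and its version column come from this guide.

```python
import sqlite3


def apply_migration(conn, version, statements):
    """Run `statements` and record `version` in one transaction, or roll back."""
    conn.isolation_level = None  # take manual control of transactions
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute(
            "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # schema changes and version record are undone
        raise
```

Because SQLite treats schema changes as transactional, a failure halfway through leaves neither the new tables nor the version record behind, so the next attempt starts from a clean state.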

Prevention:

Never modify bangui.db manually while an instance is running
Always backup before major migrations
Monitor startup logs for migrating_database_schema events

Orphaned WAL Files

After a crash, SQLite in WAL mode may leave behind -wal and -shm files. The database recovers automatically on the next open. If you see WAL-related errors:

# Check for orphaned WAL files
ls -la /var/lib/bangui/bangui.db*

# Force checkpoint to merge WAL into main database
sqlite3 /var/lib/bangui/bangui.db "PRAGMA wal_checkpoint(FULL);"

See Docs/DATABASE_MIGRATIONS.md for full recovery procedures.

Next Steps

Development: Run make up to start with default limits
Staging: Test with realistic data volumes and monitor resource usage
Production: Adjust limits based on observed usage patterns, then commit changes

Security Best Practices

Secrets Management

Never hard-code secrets. All secrets must be injected at runtime via environment variables.

Secret	Purpose	Generation
`BANGUI_SESSION_SECRET`	Signs session cookies	`python -c 'import secrets; print(secrets.token_hex(32))'`
fail2ban credentials	jail config access	From fail2ban configuration

Store secrets in a secrets manager (e.g., Docker secrets, Kubernetes Secrets, HashiCorp Vault)
Rotate BANGUI_SESSION_SECRET periodically — sessions become invalid, users must re-login
Never log or expose session secrets

Container Security Hardening

Non-root user: Backend runs as bangui:bangui (UID 1000). Frontend runs as nginx default. This limits container breakout damage.

Filesystem permissions:

# Data directory (SQLite DB) — only bangui user rw
chmod 700 /data
chown 1000:1000 /data

# Config directory — read-only for backend (it reads fail2ban config)
# Write access only for config management operations via BanGUI
chmod 755 /config

Capabilities: fail2ban container requires NET_ADMIN and NET_RAW for raw socket manipulation and iptables interaction. No additional capabilities needed for app containers.

No privileged mode: BanGUI containers must not run --privileged. The fail2ban container needs only specific capabilities, not full host access.

Network Security

Internal network only: All BanGUI containers communicate on bangui-net. Only the frontend port (default 8080) is exposed to the host.
fail2ban socket: Mounted read-only (ro) from host — backend reads status only
fail2ban config: Mounted read-write — BanGUI modifies jail configurations as requested

Drop traffic between containers: Use Docker network isolation to prevent lateral movement:

networks:
  bangui-net:
    driver: bridge
    internal: false  # Default: containers on this network can reach external networks

TLS / HTTPS

BanGUI does not terminate TLS. Handle TLS at the reverse proxy or load balancer level:

Nginx (existing frontend container):

server {
    listen 443 ssl http2;
    server_name bangui.example.com;

    ssl_certificate     /etc/ssl/certs/bangui.crt;
    ssl_certificate_key /etc/ssl/private/bangui.key;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

    # Proxy to existing frontend container
    location / {
        proxy_pass http://bangui-frontend:80;
        ...
    }
}

Security headers (already in nginx.conf):

CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy
Uncomment HSTS header when HTTPS is fully configured

HTTP to HTTPS redirect: Add in your TLS terminator:

server {
    listen 80;
    server_name bangui.example.com;
    return 301 https://$host$request_uri;
}

Dependency Scanning

Scan base images for vulnerabilities regularly:

# Trivy (Docker/Podman compatible)
trivy image python:3.12-slim
trivy image nginx:1.27-alpine
trivy image node:22-alpine

# CI integration
trivy image --exit-code 1 --severity HIGH,CRITICAL git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest

Update base images quarterly or when CVEs are published.

Rate Limiting at Deployment Level

The application-level rate limiter (BANGUI_RATE_LIMIT_* env vars) handles API requests. Add deployment-level protection:

Nginx (existing reverse proxy):

# Limit concurrent connections per IP
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
server {
    limit_conn conn_limit 100;
}

Fail2ban (already running):

BanGUI manages fail2ban jails
Additional deployment-level rate limits should target infrastructure endpoints (SSH, management UIs), not BanGUI itself

Audit Logging

All authentication events are logged via structlog:

Event	Log Key	Severity
Login success	`auth_login_success`	INFO
Login failure	`auth_login_failure`	WARNING
Session created	`session_created`	INFO
Session destroyed	`session_destroyed`	INFO
Session expired	`session_expired`	INFO

Forward these logs to a SIEM or log aggregator for security monitoring. See Structured Logging below.
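
Before a full SIEM pipeline is in place, the same events can be inspected with a few lines of standard-library Python. The sketch assumes one JSON object per log line with an `event` field, as described under Structured Logging; the function name is hypothetical.

```python
import json


def count_events(log_lines, event_name):
    """Count JSON log lines whose `event` field equals `event_name`."""
    count = 0
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks
        if isinstance(record, dict) and record.get("event") == event_name:
            count += 1
    return count
```

Feeding it the backend's log lines with `event_name="auth_login_failure"` gives a quick count of failed logins in that capture.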

Performance Tuning

SQLite Performance

SQLite is single-writer. Under write-heavy load (blocklist imports, history writes), writes may queue.

WAL mode (default, do not disable):

PRAGMA journal_mode=WAL;  -- Already enabled by default

Synchronous mode for production:

PRAGMA synchronous=NORMAL;  -- Balanced (not FULL, not OFF)

This survives process crashes without corruption while maintaining good write performance.

Cache size (increase for production):

-- In-memory cache: 64 MiB (adjust based on available RAM)
PRAGMA cache_size=-65536;  -- negative = KB

temp_store for large sorts:

PRAGMA temp_store=MEMORY;
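
Of these settings, only journal_mode=WAL is stored in the database file; synchronous, cache_size, and temp_store apply per connection and must be set each time a connection is opened. The sketch below shows one way to apply and verify them from Python. The helper names are illustrative, and the temporary database exists only for the demonstration.

```python
import os
import sqlite3
import tempfile


def open_tuned(path):
    """Open `path` and apply the PRAGMAs recommended above."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # persistent, stored in the file
    conn.execute("PRAGMA synchronous=NORMAL")  # per connection
    conn.execute("PRAGMA cache_size=-65536")   # per connection, 64 MiB
    conn.execute("PRAGMA temp_store=MEMORY")   # per connection
    return conn


def current_settings(conn):
    """Read the effective values back from SQLite."""
    names = ("journal_mode", "synchronous", "cache_size", "temp_store")
    return {name: conn.execute(f"PRAGMA {name}").fetchone()[0] for name in names}


# Demonstration against a throwaway database file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_conn = open_tuned(os.path.join(tmp_dir, "demo.db"))
    settings = current_settings(demo_conn)
    demo_conn.close()
```

SQLite reports synchronous and temp_store as integers when read back (1 for NORMAL, 2 for MEMORY), which is worth knowing when comparing values in a startup check.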

Read performance:

Most reads are point queries by IP or jail name — indexes handle this efficiently
Large history scans (dashboard) — paginate, use LIMIT/OFFSET
Avoid SELECT * on large tables — always specify needed columns

Gzip Compression

Already enabled in nginx.conf. Verify effective compression:

curl -H "Accept-Encoding: gzip" -I http://localhost:8080/api/v1/dashboard/status
# Should show: Content-Encoding: gzip

Backend Performance

Startup warm-up: On first request after start, caches are cold. First blocklist query may be slower. This is normal — subsequent requests hit cache.

Memory tuning:

# docker-compose.yml — increase if OOM
backend:
  deploy:
    resources:
      limits:
        memory: 1024M  # Up from 512M for large blocklists

Single worker enforced: The session cache is process-local. Multiple workers would cause random logouts. This is intentional — scale horizontally via orchestration, not vertically via workers.

Single-Worker Requirement

BanGUI enforces single-worker mode at startup. It fails immediately with a clear error if more than one worker is configured.

Why this matters:

In-memory session cache — each worker has its own cache copy. A session cached in worker A is invisible to worker B. A user validated by A may be rejected by B.
Rate-limit windows — per-IP counters are process-local. With 4 workers, a client hitting different workers gets 4× the intended rate limit.
Runtime state — fail2ban status, pending recovery records, and jail service capability flags are all per-process. Dashboard queries to different workers return inconsistent data.
Background scheduler — the database lock ensures only one instance runs scheduled jobs, but each worker's scheduler still fires. With multi-worker, the same job runs N times.

Detection:

The check runs at application startup in create_app():

WEB_CONCURRENCY env var — read by gunicorn and uvicorn as the default worker count when no explicit worker option is given
BANGUI_WORKERS env var — explicit override (discouraged)

If either is set to a value > 1, RuntimeError is raised with instructions and a reference to this document.

Test mode:

The check is automatically skipped when TESTING=1 is set. This allows the test suite to run with an arbitrary number of workers.
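
The detection logic described above might look roughly like this. It is a sketch, not the code in create_app(): the function name, the error text, and the choice to ignore unparseable values are assumptions, while the variable names and the TESTING=1 bypass come from this section.

```python
def assert_single_worker(env):
    """Raise RuntimeError if the environment requests more than one worker."""
    if env.get("TESTING") == "1":
        return  # test mode skips the check
    for name in ("WEB_CONCURRENCY", "BANGUI_WORKERS"):
        raw = (env.get(name) or "").strip()
        if not raw:
            continue
        try:
            workers = int(raw)
        except ValueError:
            continue  # sketch only: unparseable values are ignored
        if workers > 1:
            raise RuntimeError(
                f"{name}={workers}: BanGUI requires a single worker; "
                "scale with additional containers instead."
            )
```

Passing the environment as a mapping (for example `os.environ`) keeps the check easy to exercise in unit tests without touching the real process environment.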

What to do instead of multi-worker:

Scale horizontally via container orchestration — run multiple containers behind a load balancer. Each container runs a single worker. The database lock ensures only one container runs background jobs at a time.

Frontend Performance

Static asset caching (already configured):

location /assets/ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

Bundle size: Production build uses esbuild minification. Monitor with:

du -sh frontend/dist/
ls -lh frontend/dist/assets/*.js

Database Maintenance

Periodic checkpoint (production, monthly or after large blocklist imports):

sqlite3 /data/bangui.db "PRAGMA wal_checkpoint(FULL);"

Analyze for query planner (after bulk inserts/deletes):

sqlite3 /data/bangui.db "ANALYZE;"

Monitoring Setup

Health Check Endpoints

Combined health check — GET /api/v1/health — primary monitoring target for Docker HEALTHCHECK.

Status	HTTP Code	Meaning
`ok`	200	All components healthy
`degraded`	200	Some components unhealthy — investigate
`unavailable`	503	fail2ban unreachable — container will be restarted

Kubernetes probes:

GET /api/v1/health/live — Liveness probe. Always returns 200 if the process is alive.

GET /api/v1/health/ready — Readiness probe. Returns 200 when all subsystems pass, 503 otherwise.

Probe	URL	Success	Failure
Liveness	`/api/v1/health/live`	200	Non-2xx → restart
Readiness	`/api/v1/health/ready`	200	Non-2xx → stop traffic

Structured Logging

All logs are structured (JSON via structlog). Key fields:

Log field	Description
`event`	Event name (e.g., `auth_login_success`)
`request_id`	Per-request correlation ID
`user_id`	Session user (if authenticated)
`duration_ms`	Request duration
`component`	Component name (e.g., `scheduler`, `database`)

Log levels:

Level	Use
DEBUG	Detailed debugging (query SQL, cache hits)
INFO	Operational events (startup, shutdown, login, ban action)
WARNING	Recoverable issues (cache miss, lock contention)
ERROR	Failures requiring attention (DB error, fail2ban offline)

Configure via env:

BANGUI_LOG_LEVEL=info   # debug, info, warning, error

Log Aggregation

Docker Compose — cap local log size with the json-file driver (rotation only; forwarding requires a different driver or a log shipper):

services:
  backend:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

External aggregators:

# Fluentd example
services:
  backend:
    logging:
      driver: fluentd
      options:
        fluentd-address: "localhost:24224"
        tag: "bangui-backend"

ELK Stack — send JSON logs directly to Logstash or via Filebeat.

Metrics to Monitor

Metric	Source	Alert Threshold
Health check failures	`/api/v1/health`	3 consecutive → container restart
Backend memory	`docker stats`	>450M (of 512M limit)
Backend CPU	`docker stats`	>80% sustained
Disk usage (`/data`)	`df -h`	>80%
fail2ban container restarts	`docker ps`	>2/hour
Backend container restarts	`docker ps`	>2/hour
Database file size	`ls -lh /data/bangui.db`	Grows >10MB/day indicates issue
Session count	`/api/v1/sessions`	Sudden drop indicates cache issue
Blocklist import duration	Logs (`blocklist_import_completed`)	>5 minutes may indicate performance issue

Uptime Monitoring

External checks:

Monitor https://your-domain.com/api/v1/health from multiple geographic locations
Use services: Better Uptime, UptimeRobot, Pingdom
Alert on: HTTP 503, HTTP 200 + degraded status, connection timeout

Alerting

Critical (PagerDuty / immediate):

Health check HTTP 503 for >30 seconds
Backend OOM kill (exit code 137)
fail2ban offline for >5 minutes

Warning (Slack / email):

Health check returns degraded
Disk usage >80%
Memory usage >450M
Backend restarts >2/hour
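
For the health endpoint, the severity rules above can be expressed as a pure function that an external poller calls for each sample. The function name and the handling of unexpected responses are assumptions; duration conditions such as "for >30 seconds" are left to the alerting system that consumes the result.

```python
def classify_health(http_code, status=None):
    """Map one poll of /api/v1/health to "ok", "warning", or "critical".

    `http_code` is None when the request timed out or the connection failed.
    """
    if http_code is None or http_code == 503:
        return "critical"
    if http_code == 200 and status == "degraded":
        return "warning"
    if http_code == 200 and status == "ok":
        return "ok"
    return "warning"  # unexpected response: surface it rather than ignore it
```

Keeping the classification free of I/O makes it straightforward to test independently of whichever uptime service performs the polling.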

Scaling Guidelines

Horizontal Scaling

BanGUI is designed for horizontal scaling via container orchestration (not multiple workers):

┌─────────────────────────────────────────────────┐
│              Load Balancer                      │
│         (nginx, HAProxy, Traefik)               │
└──────────────────┬─────────────────────────────┘
                   │
      ┌─────────────┼─────────────┐
      ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Backend  │ │ Backend  │ │ Backend  │
│ (inst 1) │ │ (inst 2) │ │ (inst 3) │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │
     └────────────┼────────────┘
                  ▼
         ┌───────────────┐
         │  Scheduler    │
         │  Lock (DB)    │ ← Only one instance runs jobs
         └───────────────┘
                  │
                  ▼
         ┌───────────────┐
         │    SQLite     │
         │  (shared fs)  │
         └───────────────┘

How it works:

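Scheduler lock ensures only one instance runs background jobs
Session cache is per-instance: use sticky sessions at the load balancer, or set BANGUI_SESSION_CACHE=redis for shared sessions
SQLite on shared storage: every instance must see the same database file, via a network file system (NFS, GlusterFS) or block storage (AWS EBS). SQLite's WAL mode requires all processes to run on the same host, and file locking on network file systems is often unreliable, so verify locking behaviour on your storage before running several instances against one file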
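
The scheduler lock in the first point can be pictured as a lease stored in a single database row. The sketch below is not BanGUI's implementation: the `scheduler_lock` table, its columns, and the `try_acquire` helper are hypothetical, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3
import time


def init_lock_table(conn):
    # Hypothetical schema: a single row (id = 1) holds the current lease.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn, owner, ttl=60.0, now=None):
    """Take or renew the lease; return True if `owner` holds it afterwards."""
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            """
            INSERT INTO scheduler_lock (id, owner, expires_at)
            VALUES (1, ?, ?)
            ON CONFLICT(id) DO UPDATE
               SET owner = excluded.owner, expires_at = excluded.expires_at
             WHERE scheduler_lock.expires_at <= ?
                OR scheduler_lock.owner = excluded.owner
            """,
            (owner, now + ttl, now),
        )
        row = conn.execute(
            "SELECT owner FROM scheduler_lock WHERE id = 1"
        ).fetchone()
    return row[0] == owner
```

An instance would call `try_acquire` before each scheduler tick and skip its jobs when the call returns False. Because the lease expires, a crashed holder is replaced after at most `ttl` seconds.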

Stateless Design

For true stateless scaling without sticky sessions, migrate the session cache to Redis:

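# docker-compose.yml
backend:
  environment:
    - BANGUI_SESSION_CACHE=redis
    - BANGUI_REDIS_URL=redis://redis:6379/0
  depends_on:
    redis:
      condition: service_healthy

redis:
  image: docker.io/library/redis:7-alpine
  # condition: service_healthy above requires a healthcheck on this service
  # (interval, timeout and retries are illustrative values)
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 3s
    retries: 5
  deploy:
    resources:
      limits:
        cpus: '0.5'
        memory: 256M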

Benefits:

Sessions shared across all instances → no sticky sessions needed
Load balancer can distribute freely
Scales linearly

Trade-offs:

Redis is another dependency to monitor
Redis persistence required for session survival across Redis restarts
Redis failure causes mass logouts

Database Scaling

SQLite does not support read replicas. Scaling reads is limited.

Read scaling (if needed):

Cache aggressively — BanGUI caches blocklist data in-memory
Add read-only views for dashboard queries
Consider periodic snapshot exports to separate read-optimized store

Write scaling:

Single writer only — SQLite WAL helps but doesn't parallelize writes
If write throughput becomes a bottleneck, consider:
- Periodic batching (already used for blocklist imports)
- Sharding by jail (separate DB per jail) — architectural change
- Migration to PostgreSQL — significant effort
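
Periodic batching helps because each SQLite transaction pays a fixed commit cost, so grouping many rows into one transaction keeps the single writer busy for less time. A minimal sketch, assuming a hypothetical `blocklist_entries (ip, source)` table and an arbitrary batch size of 500; BanGUI's real import code may differ:

```python
import sqlite3


def _flush(conn, batch):
    """Write one batch inside a single transaction."""
    with conn:  # commits once per batch, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO blocklist_entries (ip, source) VALUES (?, ?)",
            batch,
        )
    return len(batch)


def insert_batched(conn, rows, batch_size=500):
    """Submit (ip, source) rows in batches; return how many rows were submitted."""
    batch = []
    submitted = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            submitted += _flush(conn, batch)
            batch = []
    if batch:  # remaining partial batch
        submitted += _flush(conn, batch)
    return submitted
```

Smaller batches release the write lock more often, which matters when other requests need to write while an import is running.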

CDN for Static Assets

For large-scale deployments, serve /assets/ from a CDN:

# Replace /assets/ proxy with CDN origin
location /assets/ {
    proxy_pass https://your-cdn.cloudfront.net/assets/;
    proxy_cache_valid 1y;
    add_header Cache-Control "public, immutable";
}

Benefits:

Reduces frontend container load
Assets served from edge locations close to users
Reduces bandwidth costs

Autoscaling

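Docker Swarm: Use the labels + update_config pattern for rolling updates. Swarm has no built-in autoscaler, so autoscaling requires external tooling that watches metrics (for example Prometheus) and adjusts replica counts with docker service scale.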

Kubernetes: HorizontalPodAutoscaler (HPA) based on CPU/memory:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bangui-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bangui-backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
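
To reason about how this manifest behaves, it helps to work through the replica calculation that the Kubernetes documentation describes: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (10% by default) of 1. The function below is an illustrative sketch of that formula, not Kubernetes code. Its defaults mirror the `minReplicas` and `maxReplicas` values above.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds declared in the HPA spec
    return max(min_replicas, min(max_replicas, desired))
```

For example, 2 replicas averaging 105% CPU against the 70% target scale to 3. With several metrics, as in the manifest above, the HPA evaluates each one and uses the largest result.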

Load Balancer Configuration

Health check:

# HAProxy example (backend name is illustrative)
backend bangui_backend
    option httpchk GET /api/v1/health
    http-check expect status 200
    server backend1 backend:8000 check

Sticky sessions (if NOT using Redis):

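# HAProxy 1.6+ (appsession was removed; a stick table keyed on the session cookie replaces it)
stick-table type string len 64 size 100k expire 24h
stick store-response res.cook(_SESSION_ID)
stick match req.cook(_SESSION_ID)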

Connection limits:

# Per-backend limit to prevent overload
server backend1 backend:8000 maxconn 50
