Deployment Guide
Graceful Shutdown
BanGUI implements graceful shutdown to ensure in-flight operations complete before the process exits. This prevents:
- Incomplete blocklist imports leaving stale data
- Interrupted ban requests
- Corrupted background job states
- Unclean database connection closures
How It Works
- SIGTERM received: Docker sends SIGTERM when docker stop is called
- Uvicorn catches SIGTERM: notifies the FastAPI lifespan handler
- Lifespan shutdown begins: scheduler stops accepting new jobs
- In-flight tasks drain: up to 25 seconds for running jobs to complete
- Resources cleaned up: HTTP session, external logging, scheduler lock, DB connection
Docker Configuration
backend:
  stop_grace_period: 30s  # Give lifespan 30s to complete before SIGKILL
The stop_grace_period of 30s gives the Python code a 25s graceful timeout, leaving a 5s safety margin before Docker sends SIGKILL.
Shutdown Sequence
| Step | Action | Timeout |
|---|---|---|
| 1 | Scheduler stops accepting new jobs | Immediate |
| 2 | Wait for pending background tasks | 25s max |
| 3 | Close HTTP session | Immediate |
| 4 | Flush external logging handler | Immediate |
| 5 | Release scheduler lock | Immediate |
| 6 | Close database connection | Immediate |
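Step 2 of the sequence can be sketched in a few lines. This is a minimal illustration, assuming pending work is tracked as a set of asyncio tasks; drain_tasks is a hypothetical helper name, not BanGUI's actual code.

```python
import asyncio


async def drain_tasks(tasks: set, timeout: float = 25.0) -> int:
    """Wait up to `timeout` seconds for tasks, cancel the rest.

    Returns the number of tasks that had to be cancelled.
    """
    if not tasks:
        return 0
    # asyncio.wait returns once all tasks finish or the timeout elapses.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks run their cleanup; swallow CancelledError results.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def demo() -> int:
    """Drain one fast and one slow task with a short timeout."""
    fast = asyncio.create_task(asyncio.sleep(0.01))
    slow = asyncio.create_task(asyncio.sleep(10))
    return await drain_tasks({fast, slow}, timeout=0.2)
```

Calling asyncio.run(demo()) cancels only the slow task. The cancelled count corresponds to the cancelled_count value in the pending_tasks_timeout log line.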
Background Tasks That Drain
- Blocklist imports
- Geo IP cache resolutions
- History sync operations
- Geo cache cleanup
- Geo cache flush
- Session cleanup
- Rate limiter cleanup
- Scheduler lock heartbeat
Monitoring Shutdown
Logs during shutdown:
bangui_shutting_down timeout_seconds=25.0
scheduler_stopped_accepting_jobs
waiting_for_pending_tasks count=3 timeout_seconds=25.0
pending_tasks_completed
http_session_closed
external_logging_shutdown_complete
scheduler_lock_released
bangui_shut_down
If tasks exceed the timeout:
pending_tasks_timeout cancelled_count=3
Rolling Deployments
During rolling deployments:
- Old instance releases scheduler lock immediately on shutdown
- New instance acquires lock without waiting for TTL expiry
- Zero downtime for background job execution
Health Checks
The backend container includes a health check endpoint at GET /api/v1/health that reports application and component status:
- HTTP 200 with {"status": "ok", ...}: all components healthy
- HTTP 200 with {"status": "degraded", ...}: some components unhealthy (e.g., database error) but fail2ban reachable
- HTTP 503 with {"status": "unavailable", ...}: fail2ban is unreachable (backend will restart)
Component checks performed:
| Component | Check | Notes |
|---|---|---|
| fail2ban | Socket ping via cached status | Returns 503 when offline |
| database | Opens and closes a test connection | Returns degraded when failing |
| scheduler | scheduler.running attribute | Returns degraded when stopped |
| cache | Session cache presence | Returns degraded when not initialised |
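The mapping from component checks to the overall response can be expressed compactly. The sketch below assumes each check has already been reduced to a boolean; overall_health is a hypothetical name, and the real endpoint may aggregate differently.

```python
def overall_health(components: dict) -> tuple:
    """Map per-component booleans to (http_status, status_string).

    fail2ban being down is the only condition that yields 503;
    any other failing component yields 200 with "degraded".
    """
    if not components.get("fail2ban", False):
        return 503, "unavailable"
    if all(components.values()):
        return 200, "ok"
    return 200, "degraded"
```

For example, a failing database with fail2ban reachable gives (200, "degraded"), while an unreachable fail2ban gives (503, "unavailable") regardless of the other components.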
Docker Health Check:
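The Dockerfile includes a HEALTHCHECK that queries the endpoint. An HTTP 503 response fails the probe, and after 3 consecutive failures (about 90 seconds with the 30s interval) Docker marks the container as unhealthy. Docker Swarm replaces unhealthy tasks automatically; a standalone Docker Engine only reports the unhealthy state, so a restart there requires an external watcher.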
Why 503 for offline fail2ban?
If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This masks infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) automatically restart the backend container until fail2ban recovers.
Docker Compose health check parameters:
| Parameter | Value | Rationale |
|---|---|---|
| interval | 30s | Balance between responsiveness and load |
| timeout | 10s | Allows for slow probe on busy system |
| retries | 3 | ~90 seconds before restart (3 × 30s) |
| start_period | 40s | Allows app and fail2ban to fully start |
CORS Configuration
Cross-Origin Resource Sharing (CORS) must be explicitly configured when the frontend and backend are served from different origins.
Development
By default, the backend allows requests from common localhost development origins:
- http://localhost:5173
- http://127.0.0.1:5173
- https://localhost:5173
- https://127.0.0.1:5173
No additional configuration is needed for local development — just run the frontend and backend normally.
Production
In production, override the default with your actual frontend origin(s):
Docker Compose:
environment:
  BANGUI_CORS_ALLOWED_ORIGINS: "https://example.com,https://www.example.com"
Environment File (.env):
BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://www.example.com
Multiple Origins: Separate multiple allowed origins with commas (no spaces):
BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://app.example.com,https://admin.example.com
Disable CORS: To disable CORS entirely (e.g., when the frontend is served from the same origin as the backend):
BANGUI_CORS_ALLOWED_ORIGINS=
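To make the value format concrete, here is a minimal sketch of how a comma-separated origin list could be parsed. parse_allowed_origins is a hypothetical helper; how the backend actually reads the variable is not shown in this guide, and the wildcard rejection is an assumption based on the security notes in this section.

```python
def parse_allowed_origins(raw: str) -> list:
    """Split a comma-separated origin list; an empty string yields []."""
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    # Assumption: a wildcard is rejected because credentialed requests
    # (the session cookie) require exact origins.
    if "*" in origins:
        raise ValueError("wildcard origin '*' is not allowed; list exact origins")
    return origins
```

An empty result corresponds to the "Disable CORS" case above, where no cross-origin access is granted.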
Security Considerations
- Always specify exact origins: never use the wildcard * in production, especially with allow_credentials=true (credentials mode is required for the session cookie).
- Use HTTPS in production: the backend enforces the Secure cookie flag, which requires HTTPS (or localhost for development).
- Validate in reverse proxy: if using Nginx or a CDN reverse proxy, validate the Origin header before forwarding requests to ensure only legitimate origins reach the backend.
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| Access-Control-Allow-Origin header missing from response | CORS not configured or origin not whitelisted | Check BANGUI_CORS_ALLOWED_ORIGINS and ensure your frontend origin is included |
| Browser blocks requests with CORS error | Credentials mode enabled but origin not exactly whitelisted | Ensure BANGUI_CORS_ALLOWED_ORIGINS includes the exact origin (protocol + domain + port) of your frontend |
| Works in development but fails in production | Default localhost origins used instead of production frontend domain | Override BANGUI_CORS_ALLOWED_ORIGINS in production environment |
In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring only one instance runs the scheduler at a time.
How It Works
The lock is stored in the SQLite database and enforced via:
- Lock Acquisition — At startup, each instance tries to insert a lock record. Only one succeeds; others reject startup with a clear error message.
- Heartbeat — The lock-holding instance sends a heartbeat every 5 seconds to prove it's still alive.
- Stale Lock Cleanup — On startup, any lock older than 60 seconds (without a heartbeat) is automatically deleted, allowing recovery from instance crashes.
Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Heartbeat Interval | 5 seconds | Allows ~12 missed heartbeats before lock expires |
| Lock TTL | 60 seconds | Time before a lock without heartbeat is considered abandoned |
| Min Safe Ratio | 12x (TTL / interval) | Robust protection against temporary delays or high load |
With a 60-second TTL and 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This provides strong protection against false positives while still detecting genuine crashes.
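The acquire, heartbeat, and stale-cleanup steps can be illustrated with the standard sqlite3 module. This is a sketch under stated assumptions: the column layout of scheduler_lock (id, holder, heartbeat_at) and the function names are hypothetical, and only the table name and the 60-second TTL come from this guide.

```python
import sqlite3
import time

LOCK_TTL_SECONDS = 60.0  # from the configuration table above


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical layout: a single-row table enforced by CHECK (id = 1).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "holder TEXT NOT NULL, heartbeat_at REAL NOT NULL)"
    )


def try_acquire_lock(conn: sqlite3.Connection, holder: str, now=None) -> bool:
    """Delete a stale lock, then try to insert our own row."""
    now = time.time() if now is None else now
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute(
                "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
                (now - LOCK_TTL_SECONDS,),
            )
            conn.execute(
                "INSERT INTO scheduler_lock (id, holder, heartbeat_at) "
                "VALUES (1, ?, ?)",
                (holder, now),
            )
    except sqlite3.IntegrityError:
        return False  # another instance holds a fresh lock
    return True


def heartbeat(conn: sqlite3.Connection, holder: str, now=None) -> bool:
    """Refresh the timestamp; False means the lock is no longer ours."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? "
            "WHERE id = 1 AND holder = ?",
            (now, holder),
        )
    return cur.rowcount == 1
```

The primary-key constraint is what makes "only one succeeds" hold: a second insert fails with an integrity error instead of silently creating a second lock row.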
Monitoring
Check logs for these key events:
- scheduler_lock_acquired: Lock successfully acquired at startup (INFO)
- scheduler_lock_heartbeat_updated: Heartbeat successfully updated (DEBUG)
- scheduler_lock_heartbeat_failed: Heartbeat update failed; lock may be lost (WARNING)
- scheduler_lock_heartbeat_timeout: Heartbeat exceeded 5-second timeout (ERROR)
- scheduler_lock_held_by_other_instance: Another instance holds the lock (WARNING at startup)
Troubleshooting: "Blocklist import runs twice"
Symptom: Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption.
Cause: The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running.
Solution:
- Check heartbeat timing: ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk).
- Verify database health: run SELECT * FROM scheduler_lock; to see if a stale lock exists. If present, delete it: DELETE FROM scheduler_lock;
- Review logs: look for scheduler_lock_heartbeat_failed or scheduler_lock_heartbeat_timeout errors in the time window when duplication occurred.
- Increase resource limits: if the backend is memory/CPU constrained, increase limits in docker-compose.yml to prevent slowdowns that trigger false lock timeouts.
- Check database performance: slow database queries can delay heartbeat updates. Run PRAGMA integrity_check; to check for corruption.
If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability.
Troubleshooting: "Scheduler stops completely"
Symptom: Background tasks (blocklist import, geo cache cleanup, history sync, session cleanup) stop running. No errors in logs but tasks don't execute.
Cause: Instance holding the scheduler lock crashed without releasing it, or heartbeat is failing silently.
Diagnosis:
- Check if lock exists: SELECT * FROM scheduler_lock;
- If lock exists with a PID that no longer runs, it's orphaned
- Check logs for scheduler_lock_heartbeat_lost warnings
Solution:
- Clear the orphaned lock: DELETE FROM scheduler_lock;
- Restart the instance that should hold the lock
- Verify lock acquisition: grep "scheduler_lock_acquired" logs
- If heartbeat keeps failing, check database latency (SQLite heartbeats should be <100ms)
Prevention:
- Monitor scheduler_lock_heartbeat_lost events: more than 3 in an hour indicates a problem
- Ensure database I/O is not bottlenecked (SSD recommended for SQLite)
- Consider reducing heartbeat interval if network latency causes false timeouts
Advanced: Migrating to Redis
For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis:
- Why: Redis is single-threaded and atomic by design; clock skew and timeout issues are eliminated.
- How: Install redlock-py or aioredis, replace scheduler_lock.py with a Redis implementation, update heartbeat interval to 2-3 seconds.
- Trade-off: Adds a Redis dependency but eliminates database lock contention and provides microsecond-precision atomicity.
This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs.
All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:
- Isolation: A misbehaving container cannot crash others or the host
- Predictability: Reservations guarantee minimum resources even under load
- Efficiency: Unused reserved capacity can be borrowed by other containers
Container Resource Limits
| Container | Limit CPU | Limit Memory | Reserved CPU | Reserved Memory | Purpose |
|---|---|---|---|---|---|
| fail2ban | 0.5 | 128M | 0.1 | 64M | Monitors logs, bans IPs—typically idle |
| backend | 2.0 | 512M | 1.0 | 256M | Core app: database, fail2ban API, config management |
| frontend | 0.5 | 128M | 0.25 | 64M | Nginx: serves SPA + API proxy |
Rationale
- fail2ban: Lightweight log monitoring. Occasionally CPU spikes during ban processing but memory usage is minimal.
- backend: Heavy lifting—Python runtime, SQLite database, background jobs. May need extra memory for large blocklists. Reservation of 1.0 CPU ensures responsive API even when frontend is busy.
- frontend: Nginx is efficient. Limit of 0.5 CPU and 128M memory is more than sufficient for reverse proxy duties.
Memory Considerations
Backend Memory Requirements
The backend typically runs in 256–512M under normal load. Memory usage depends on:
- Blocklist size: Large blocklists (>1M entries) require more heap space
- Cache warmth: First query after startup may require more memory as caches fill
- Concurrent connections: Each active user session uses a small amount of memory
Tuning: If you see OOM kills in logs, increase backend limits and reservations (e.g., 1024M limit). Test under realistic load before finalizing.
Frontend Memory Usage
Nginx is typically <50M. If you see memory pressure on frontend, check for:
- Misconfigured cache headers on static assets
- Large log volumes (nginx access logs)
Docker Swarm & Kubernetes
For production deployments using orchestration platforms:
Docker Swarm
The deploy sections in docker-compose.yml are compatible with docker stack deploy:
docker stack deploy -c Docker/docker-compose.yml bangui
Swarm respects the same limits and reservations fields.
Kubernetes
For Kubernetes, translate resource constraints to equivalent resources fields in your deployment manifests:
containers:
  - name: backend
    image: git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "1"
        memory: "256Mi"
Kubernetes equivalent mappings:
- Docker deploy.resources.limits → Kubernetes resources.limits
- Docker deploy.resources.reservations → Kubernetes resources.requests
Monitoring Resource Usage
Docker Compose (Development)
docker stats
Shows real-time CPU and memory usage for all running containers.
Production (Docker Swarm / Kubernetes)
Use native monitoring:
- Docker Swarm: Prometheus + Grafana
- Kubernetes: Metrics Server + dashboard or Prometheus
Configuration
All runtime settings are documented in CONFIGURATION.md, including database, session, fail2ban, HTTP client, geolocation, CORS, logging, rate limiting, and observability options.
Environment Variables
Resource limits are configured in Docker/docker-compose.yml and cannot be overridden via environment variables. To adjust limits:
- Edit Docker/docker-compose.yml
- Modify the limits and reservations under each service's deploy.resources section
- Restart containers: make down && make up
Troubleshooting
| Issue | Symptom | Solution |
|---|---|---|
| Backend OOM kills | "Exit code 137" in logs | Increase backend memory limit |
| Throttling | CPU at 100%, requests slow | Increase CPU limit or optimize code |
| Service startup timeout | Containers not becoming "healthy" | Increase reservation to guarantee capacity at startup |
| Host unresponsive | System-wide lag | Reduce container limits to prevent host starvation |
Disaster Recovery
Database Migration Failures
If a migration fails mid-transaction, the application refuses to start. This is intentional — it prevents inconsistent schema states.
Diagnosis:
- Check current schema version:
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
- Check which tables exist:
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
- Check application logs for the specific error.
Recovery Options:
- Automatic rollback: Next startup re-applies the same migration from scratch
- Manual completion: Apply the migration manually and insert the version record in a single sqlite3 session. Separate sqlite3 invocations do not share a transaction, so BEGIN and COMMIT must run in the same session. Replace <version> with the migration number:
sqlite3 /var/lib/bangui/bangui.db <<'SQL'
BEGIN IMMEDIATE;
-- Run your SQL here
INSERT INTO schema_migrations (version) VALUES (<version>);
COMMIT;
SQL
- Full reset (development only):
rm bangui.db bangui.db-wal bangui.db-shm
Prevention:
- Never modify bangui.db manually while an instance is running
- Always back up before major migrations
- Monitor startup logs for migrating_database_schema events
Orphaned WAL Files
After crashes, SQLite WAL mode may leave orphaned -wal and -shm files next to the database. The database auto-recovers on next open. If you see WAL-related errors:
# Check for orphaned WAL files
ls -la /var/lib/bangui/bangui.db*
# Force checkpoint to merge WAL into main database
sqlite3 /var/lib/bangui/bangui.db "PRAGMA wal_checkpoint(FULL);"
See Docs/DATABASE_MIGRATIONS.md for full recovery procedures.
Next Steps
- Development: Run make up to start with default limits
- Staging: Test with realistic data volumes and monitor resource usage
- Production: Adjust limits based on observed usage patterns, then commit changes
Security Best Practices
Secrets Management
Never hard-code secrets. All secrets must be injected at runtime via environment variables.
| Secret | Purpose | Generation |
|---|---|---|
| BANGUI_SESSION_SECRET | Signs session cookies | python -c 'import secrets; print(secrets.token_hex(32))' |
| fail2ban credentials | jail config access | From fail2ban configuration |
- Store secrets in a secrets manager (e.g., Docker secrets, Kubernetes Secrets, HashiCorp Vault)
- Rotate BANGUI_SESSION_SECRET periodically; existing sessions become invalid and users must log in again
- Never log or expose session secrets
Container Security Hardening
Non-root user: Backend runs as bangui:bangui (UID 1000). Frontend runs as nginx default. This limits container breakout damage.
Filesystem permissions:
# Data directory (SQLite DB) — only bangui user rw
chmod 700 /data
chown 1000:1000 /data
# Config directory — read-only for backend (it reads fail2ban config)
# Write access only for config management operations via BanGUI
chmod 755 /config
Capabilities: fail2ban container requires NET_ADMIN and NET_RAW for raw socket manipulation and iptables interaction. No additional capabilities needed for app containers.
No privileged mode: BanGUI containers must not run --privileged. The fail2ban container needs only specific capabilities, not full host access.
Network Security
- Internal network only: All BanGUI containers communicate on bangui-net. Only the frontend port (default 8080) is exposed to the host.
- fail2ban socket: Mounted read-only (ro) from host; backend reads status only
- fail2ban config: Mounted read-write; BanGUI modifies jail configurations as requested
- Drop traffic between containers: Use Docker network isolation to prevent lateral movement:
networks:
  bangui-net:
    driver: bridge
    internal: false  # Allow external only for frontend
TLS / HTTPS
BanGUI does not terminate TLS. Handle TLS at the reverse proxy or load balancer level:
Nginx (existing frontend container):
server {
    listen 443 ssl http2;
    server_name bangui.example.com;
    ssl_certificate /etc/ssl/certs/bangui.crt;
    ssl_certificate_key /etc/ssl/private/bangui.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    # Proxy to existing frontend container
    location / {
        proxy_pass http://bangui-frontend:80;
        ...
    }
}
Security headers (already in nginx.conf):
- CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy
- Uncomment HSTS header when HTTPS is fully configured
HTTP to HTTPS redirect: Add in your TLS terminator:
server {
    listen 80;
    server_name bangui.example.com;
    return 301 https://$host$request_uri;
}
Dependency Scanning
Scan base images for vulnerabilities regularly:
# Trivy (Docker/Podman compatible)
trivy image python:3.12-slim
trivy image nginx:1.27-alpine
trivy image node:22-alpine
# CI integration
trivy image --exit-code 1 --severity HIGH,CRITICAL git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
Update base images quarterly or when CVEs are published.
Rate Limiting at Deployment Level
The application-level rate limiter (BANGUI_RATE_LIMIT_* env vars) handles API requests. Add deployment-level protection:
Nginx (existing reverse proxy):
# Limit concurrent connections per IP
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
server {
    limit_conn conn_limit 100;
}
Fail2ban (already running):
- BanGUI manages fail2ban jails
- Additional deployment-level rate limits should target infrastructure endpoints (SSH, management UIs), not BanGUI itself
Audit Logging
All authentication events are logged via structlog:
| Event | Log Key | Severity |
|---|---|---|
| Login success | auth_login_success | INFO |
| Login failure | auth_login_failure | WARNING |
| Session created | session_created | INFO |
| Session destroyed | session_destroyed | INFO |
| Session expired | session_expired | INFO |
Forward these logs to a SIEM or log aggregator for security monitoring. See Structured Logging below.
Performance Tuning
SQLite Performance
SQLite is single-writer. Under write-heavy load (blocklist imports, history writes), writes may queue.
WAL mode (default, do not disable):
PRAGMA journal_mode=WAL; -- Already enabled by default
Synchronous mode for production:
PRAGMA synchronous=NORMAL; -- Balanced (not FULL, not OFF)
This survives process crashes without corruption while maintaining good write performance.
Cache size (increase for production):
-- In-memory cache: 64 MiB (adjust based on available RAM)
PRAGMA cache_size=-65536; -- negative value = size in KiB
temp_store for large sorts:
PRAGMA temp_store=MEMORY;
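Apart from journal_mode=WAL, which is stored in the database file, these pragmas apply per connection and need to be issued each time a connection is opened. The sketch below shows one way to do that with the standard sqlite3 module; open_tuned is a hypothetical helper, not BanGUI's connection code.

```python
import sqlite3


def open_tuned(path: str) -> sqlite3.Connection:
    """Open a connection and apply the pragmas recommended above."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # persisted in the database file
    conn.execute("PRAGMA synchronous=NORMAL")  # per connection
    conn.execute("PRAGMA cache_size=-65536")   # per connection, 64 MiB
    conn.execute("PRAGMA temp_store=MEMORY")   # per connection
    return conn
```

Reading a pragma back without a value (for example PRAGMA synchronous) is a quick way to confirm the setting took effect on a given connection.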
Read performance:
- Most reads are point queries by IP or jail name — indexes handle this efficiently
- Large history scans (dashboard): paginate, use LIMIT/OFFSET
- Avoid SELECT * on large tables; always specify needed columns
Gzip Compression
Already enabled in nginx.conf. Verify effective compression:
curl -H "Accept-Encoding: gzip" -I http://localhost:8080/api/v1/dashboard/status
# Should show: Content-Encoding: gzip
Backend Performance
Startup warm-up: On first request after start, caches are cold. First blocklist query may be slower. This is normal — subsequent requests hit cache.
Memory tuning:
# docker-compose.yml — increase if OOM
backend:
  deploy:
    resources:
      limits:
        memory: 1024M  # Up from 512M for large blocklists
Single worker enforced: The session cache is process-local. Multiple workers would cause random logouts. This is intentional — scale horizontally via orchestration, not vertically via workers.
Single-Worker Requirement
BanGUI enforces single-worker mode at startup. It fails immediately with a clear error if more than one worker is configured.
Why this matters:
- In-memory session cache — each worker has its own cache copy. A session cached in worker A is invisible to worker B. A user validated by A may be rejected by B.
- Rate-limit windows — per-IP counters are process-local. With 4 workers, a client hitting different workers gets 4× the intended rate limit.
- Runtime state — fail2ban status, pending recovery records, and jail service capability flags are all per-process. Dashboard queries to different workers return inconsistent data.
- Background scheduler — the database lock ensures only one instance runs scheduled jobs, but each worker's scheduler still fires. With multi-worker, the same job runs N times.
Detection:
The check runs at application startup in create_app():
- WEB_CONCURRENCY env var: set by gunicorn, and by uvicorn in recent versions when --workers N is passed
- BANGUI_WORKERS env var: explicit override (discouraged)
If either is set to a value > 1, RuntimeError is raised with instructions and a reference to this document.
Test mode:
The check is automatically skipped when TESTING=1 is set. This allows the test suite to run with an arbitrary number of workers.
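The startup guard described above could look roughly like the following. This is a sketch, not the code in create_app(): the function name, the error text, and the choice to ignore non-numeric values are assumptions.

```python
import os
from typing import Mapping, Optional


def enforce_single_worker(environ: Optional[Mapping[str, str]] = None) -> None:
    """Raise RuntimeError if more than one worker is configured."""
    env = os.environ if environ is None else environ
    if env.get("TESTING") == "1":
        return  # the check is skipped in test mode
    for name in ("WEB_CONCURRENCY", "BANGUI_WORKERS"):
        raw = env.get(name, "").strip()
        if not raw:
            continue
        try:
            workers = int(raw)
        except ValueError:
            continue  # assumption: non-numeric values are ignored
        if workers > 1:
            raise RuntimeError(
                f"{name}={workers} is not supported: session cache, "
                "rate-limit windows and runtime state are process-local. "
                "Run one worker per container and scale horizontally."
            )
```

Passing the environment as a parameter keeps the check easy to exercise in tests without mutating os.environ.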
What to do instead of multi-worker:
Scale horizontally via container orchestration — run multiple containers behind a load balancer. Each container runs a single worker. The database lock ensures only one container runs background jobs at a time.
Frontend Performance
Static asset caching (already configured):
location /assets/ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}
Bundle size: Production build uses esbuild minification. Monitor with:
du -sh frontend/dist/
ls -lh frontend/dist/assets/*.js
Database Maintenance
Periodic checkpoint (production, monthly or after large blocklist imports):
sqlite3 /data/bangui.db "PRAGMA wal_checkpoint(FULL);"
Analyze for query planner (after bulk inserts/deletes):
sqlite3 /data/bangui.db "ANALYZE;"
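If the maintenance steps are scheduled from Python rather than from a shell, the same two statements can be run through the standard sqlite3 module. run_maintenance is a hypothetical helper; the path is whatever your deployment uses.

```python
import sqlite3


def run_maintenance(path: str) -> tuple:
    """Checkpoint the WAL and refresh planner statistics.

    Returns the (busy, log_frames, checkpointed_frames) row reported by
    PRAGMA wal_checkpoint; busy == 0 means the checkpoint was not blocked.
    """
    conn = sqlite3.connect(path)
    try:
        row = conn.execute("PRAGMA wal_checkpoint(FULL)").fetchone()
        conn.execute("ANALYZE")
        return tuple(row)
    finally:
        conn.close()
```

A non-zero first value indicates that a reader or writer blocked the checkpoint, in which case retrying during a quieter period is usually enough.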
Monitoring Setup
Health Check Endpoint
GET /api/v1/health — primary monitoring target.
| Status | HTTP Code | Meaning |
|---|---|---|
| ok | 200 | All components healthy |
| degraded | 200 | Some components unhealthy; investigate |
| unavailable | 503 | fail2ban unreachable; container will be restarted |
Structured Logging
All logs are structured (JSON via structlog). Key fields:
| Log field | Description |
|---|---|
| event | Event name (e.g., auth_login_success) |
| request_id | Per-request correlation ID |
| user_id | Session user (if authenticated) |
| duration_ms | Request duration |
| component | Component name (e.g., scheduler, database) |
Log levels:
| Level | Use |
|---|---|
| DEBUG | Detailed debugging (query SQL, cache hits) |
| INFO | Operational events (startup, shutdown, login, ban action) |
| WARNING | Recoverable issues (cache miss, lock contention) |
| ERROR | Failures requiring attention (DB error, fail2ban offline) |
Configure via env:
BANGUI_LOG_LEVEL=info # debug, info, warning, error
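Because each log line is a JSON object, simple analysis needs nothing more than the standard library. The sketch below counts events per name from an iterable of log lines. count_events is a hypothetical helper, and the level key is an assumption, since it is not listed in the field table above.

```python
import json
from collections import Counter
from typing import Iterable, Optional


def count_events(lines: Iterable[str], level: Optional[str] = None) -> Counter:
    """Count log records per `event`, optionally filtered by level."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks
        if not isinstance(record, dict):
            continue
        # Assumption: the severity is stored under a "level" key.
        if level and str(record.get("level", "")).lower() != level.lower():
            continue
        counts[record.get("event", "<unknown>")] += 1
    return counts
```

For instance, feeding it the backend's log output and reading counts["auth_login_failure"] gives a quick view of failed logins without a full aggregation stack.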
Log Aggregation
Docker Compose — forward container logs to aggregator:
services:
  backend:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
External aggregators:
# Fluentd example
services:
  backend:
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: bangui-backend
ELK Stack — send JSON logs directly to Logstash or via Filebeat.
Metrics to Monitor
| Metric | Source | Alert Threshold |
|---|---|---|
| Health check failures | /api/v1/health | 3 consecutive → container restart |
| Backend memory | docker stats | >450M (of 512M limit) |
| Backend CPU | docker stats | >80% sustained |
| Disk usage (/data) | df -h | >80% |
| fail2ban container restarts | docker ps | >2/hour |
| Backend container restarts | docker ps | >2/hour |
| Database file size | ls -lh /data/bangui.db | Grows >10MB/day indicates issue |
| Session count | /api/v1/sessions | Sudden drop indicates cache issue |
| Blocklist import duration | Logs (blocklist_import_completed) | >5 minutes may indicate performance issue |
Uptime Monitoring
External checks:
- Monitor https://your-domain.com/api/v1/health from multiple geographic locations
- Use services: Better Uptime, UptimeRobot, Pingdom
- Alert on: HTTP 503, HTTP 200 + degraded status, connection timeout
Alerting
Critical (PagerDuty / immediate):
- Health check HTTP 503 for >30 seconds
- Backend OOM kill (exit code 137)
- fail2ban offline for >5 minutes
Warning (Slack / email):
- Health check returns degraded
- Disk usage >80%
- Memory usage >450M
- Backend restarts >2/hour
Scaling Guidelines
Horizontal Scaling
BanGUI is designed for horizontal scaling via container orchestration (not multiple workers):
┌─────────────────────────────────────────────────┐
│                  Load Balancer                  │
│            (nginx, HAProxy, Traefik)            │
└────────────────────────┬────────────────────────┘
                         │
           ┌─────────────┼─────────────┐
           ▼             ▼             ▼
      ┌──────────┐  ┌──────────┐  ┌──────────┐
      │ Backend  │  │ Backend  │  │ Backend  │
      │ (inst 1) │  │ (inst 2) │  │ (inst 3) │
      └────┬─────┘  └────┬─────┘  └────┬─────┘
           │             │             │
           └─────────────┼─────────────┘
                         ▼
                 ┌───────────────┐
                 │   Scheduler   │
                 │   Lock (DB)   │  ← Only one instance runs jobs
                 └───────────────┘
                         │
                         ▼
                 ┌───────────────┐
                 │    SQLite     │
                 │  (shared fs)  │
                 └───────────────┘
How it works:
- Scheduler lock ensures only one instance runs background jobs
- Session cache is per-instance: use sticky sessions at load balancer, OR configure BANGUI_SESSION_CACHE=redis for shared sessions
- SQLite on shared storage: use network file system (NFS, GlusterFS) or block storage (AWS EBS)
Stateless Design
For true stateless scaling without sticky sessions, migrate session cache to Redis:
# docker-compose.yml
backend:
  environment:
    - BANGUI_SESSION_CACHE=redis
    - BANGUI_REDIS_URL=redis://redis:6379/0
  depends_on:
    redis:
      condition: service_healthy
redis:
  image: docker.io/library/redis:7-alpine
  deploy:
    resources:
      limits:
        cpus: '0.5'
        memory: 256M
Benefits:
- Sessions shared across all instances → no sticky sessions needed
- Load balancer can distribute freely
- Scales linearly
Trade-offs:
- Redis is another dependency to monitor
- Redis persistence required for session survival across Redis restarts
- Redis failure causes mass logouts
Database Scaling
SQLite does not support read replicas. Scaling reads is limited.
Read scaling (if needed):
- Cache aggressively — BanGUI caches blocklist data in-memory
- Add read-only views for dashboard queries
- Consider periodic snapshot exports to separate read-optimized store
Write scaling:
- Single writer only — SQLite WAL helps but doesn't parallelize writes
- If write throughput becomes a bottleneck, consider:
- Periodic batching (already used for blocklist imports)
- Sharding by jail (separate DB per jail) — architectural change
- Migration to PostgreSQL — significant effort
CDN for Static Assets
For large-scale deployments, serve /assets/ from a CDN:
# Replace /assets/ proxy with CDN origin
location /assets/ {
    proxy_pass https://your-cdn.cloudfront.net/assets/;
    proxy_cache_valid 1y;
    add_header Cache-Control "public, immutable";
}
Benefits:
- Reduces frontend container load
- Assets served from edge locations close to users
- Reduces bandwidth costs
Autoscaling
Docker Swarm: Use the labels + update_config pattern for rolling updates. Autoscaling requires external metrics (Prometheus + VPA or similar).
Kubernetes: HorizontalPodAutoscaler (HPA) based on CPU/memory:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bangui-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bangui-backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Load Balancer Configuration
Health check:
# HAProxy example
backend bangui_backend
    option httpchk GET /api/v1/health
    http-check expect status 200
Sticky sessions (if NOT using Redis):
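# HAProxy 1.5 and earlier (the appsession directive was removed in HAProxy 1.6)
appsession _SESSION_ID len 64 timeout 24h
# HAProxy 1.6+: use cookie-based persistence instead, e.g.
# cookie SERVERID insert indirect nocache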
Connection limits:
# Per-backend limit to prevent overload
server backend1 backend:8000 maxconn 50