Deployment Guide
Health Checks
The backend container includes a health check endpoint at GET /api/health that reports application and fail2ban daemon status:
- HTTP 200 with {"status": "ok", "fail2ban": "online"}: backend is healthy and fail2ban is reachable
- HTTP 503 with {"status": "unavailable", "fail2ban": "offline"}: fail2ban is unreachable (backend will restart)
Docker Health Check:
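The Dockerfile includes a HEALTHCHECK that queries the endpoint. An HTTP 503 response makes the check fail, and Docker marks the container unhealthy after 3 consecutive failures (90 seconds at the default 30-second interval). Docker Engine on its own only reports the unhealthy state; the restart is performed by an orchestrator, such as Docker Swarm, that acts on it.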
Why 503 for offline fail2ban?
If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This can mask infrastructure failures. By returning 503 when fail2ban is unreachable, the backend lets orchestration tools (Docker Swarm, or Kubernetes with a liveness probe on the same endpoint) detect the failure and restart the backend container until fail2ban recovers.
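The status mapping described above can be sketched in a few lines of Python. This is an illustration only: the function names `health_response` and `is_healthy` are hypothetical and are not taken from the backend's source; only the status codes and JSON bodies come from this guide.

```python
from typing import Any, Dict, Tuple


def health_response(fail2ban_online: bool) -> Tuple[int, Dict[str, Any]]:
    """Return (HTTP status code, JSON body) for GET /api/health.

    Hypothetical helper: the real handler may be structured differently.
    """
    if fail2ban_online:
        return 200, {"status": "ok", "fail2ban": "online"}
    # 503 signals "unhealthy" to any probe that treats non-2xx as failure.
    return 503, {"status": "unavailable", "fail2ban": "offline"}


def is_healthy(status_code: int) -> bool:
    """Mirror a typical HTTP probe: only 2xx responses count as healthy."""
    return 200 <= status_code < 300
```

With this mapping, a probe sees a failure as soon as fail2ban becomes unreachable, even though the backend process itself is still answering requests.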
CORS Configuration
Cross-Origin Resource Sharing (CORS) must be explicitly configured when the frontend and backend are served from different origins.
Development
By default, the backend allows requests from common localhost development origins:
- http://localhost:5173
- http://127.0.0.1:5173
- https://localhost:5173
- https://127.0.0.1:5173
No additional configuration is needed for local development — just run the frontend and backend normally.
Production
In production, override the default with your actual frontend origin(s):
Docker Compose:
environment:
  BANGUI_CORS_ALLOWED_ORIGINS: "https://example.com,https://www.example.com"
Environment File (.env):
BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://www.example.com
Multiple Origins: Separate multiple allowed origins with commas (no spaces):
BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://app.example.com,https://admin.example.com
Disable CORS: To disable CORS entirely (e.g., when the frontend is served from the same origin as the backend):
BANGUI_CORS_ALLOWED_ORIGINS=
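To make the three cases concrete (variable unset, comma-separated list, empty value), here is a minimal Python sketch of how such a value could be parsed. It is an assumption about behaviour, not the backend's actual parser: the names `parse_allowed_origins` and `DEFAULT_DEV_ORIGINS` are hypothetical, and the sketch tolerates stray spaces even though the guide asks for none.

```python
from typing import List, Optional

# Development defaults listed earlier in this guide.
DEFAULT_DEV_ORIGINS = [
    "http://localhost:5173",
    "http://127.0.0.1:5173",
    "https://localhost:5173",
    "https://127.0.0.1:5173",
]


def parse_allowed_origins(raw: Optional[str]) -> List[str]:
    """Turn a BANGUI_CORS_ALLOWED_ORIGINS value into a list of origins.

    None  (variable unset) -> development defaults (assumed behaviour)
    ""    (variable empty) -> no origins, i.e. CORS disabled
    "a,b" (comma list)     -> ["a", "b"]
    """
    if raw is None:
        return list(DEFAULT_DEV_ORIGINS)
    # Strip whitespace defensively and drop empty items such as a trailing comma.
    return [item.strip() for item in raw.split(",") if item.strip()]
```

For example, `parse_allowed_origins(os.environ.get("BANGUI_CORS_ALLOWED_ORIGINS"))` would distinguish an unset variable from one that is explicitly set to an empty string.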
Security Considerations
- Always specify exact origins: never use wildcard * in production, especially with allow_credentials=true (credentials mode is required for the session cookie).
- Use HTTPS in production: the backend enforces the Secure cookie flag, which requires HTTPS (or localhost for development).
- Validate in reverse proxy: if using Nginx or a CDN reverse proxy, validate the Origin header before forwarding requests to ensure only legitimate origins reach the backend.
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| Access-Control-Allow-Origin header missing from response | CORS not configured or origin not whitelisted | Check BANGUI_CORS_ALLOWED_ORIGINS and ensure your frontend origin is included |
| Browser blocks requests with CORS error | Credentials mode enabled but origin not exactly whitelisted | Ensure BANGUI_CORS_ALLOWED_ORIGINS includes the exact origin (protocol + domain + port) of your frontend |
| Works in development but fails in production | Default localhost origins used instead of production frontend domain | Override BANGUI_CORS_ALLOWED_ORIGINS in production environment |
Scheduler Lock
In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring only one instance runs the scheduler at a time.
How It Works
The lock is stored in the SQLite database and enforced via:
- Lock Acquisition: At startup, each instance tries to insert a lock record. Only one succeeds; the others abort startup with a clear error message.
- Heartbeat: The lock-holding instance updates a heartbeat every 5 seconds to prove it is still alive.
- Stale Lock Cleanup: On startup, any lock whose last heartbeat is older than 60 seconds is automatically deleted, allowing recovery from instance crashes.
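The three mechanisms above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions, not the project's `scheduler_lock.py`: only the table name `scheduler_lock` appears in this guide, while the columns (`id`, `owner`, `heartbeat_at`) and the function names are hypothetical.

```python
import sqlite3
import time
from typing import Optional

HEARTBEAT_INTERVAL = 5  # seconds, per the configuration in this guide
LOCK_TTL = 60           # seconds


def init_lock_table(conn: sqlite3.Connection) -> None:
    # A single-row table: the CHECK constraint allows only id = 1,
    # so a second INSERT always violates the primary key.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " owner TEXT NOT NULL,"
        " heartbeat_at REAL NOT NULL)"
    )


def try_acquire(conn: sqlite3.Connection, owner: str,
                now: Optional[float] = None) -> bool:
    """Delete a stale lock, then try to insert our own. True on success."""
    now = time.time() if now is None else now
    with conn:  # commits on exit, rolls back if an exception escapes
        conn.execute(
            "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
            (now - LOCK_TTL,),
        )
        try:
            conn.execute(
                "INSERT INTO scheduler_lock (id, owner, heartbeat_at)"
                " VALUES (1, ?, ?)",
                (owner, now),
            )
        except sqlite3.IntegrityError:
            return False  # another instance holds a live lock
    return True


def heartbeat(conn: sqlite3.Connection, owner: str,
              now: Optional[float] = None) -> bool:
    """Refresh the heartbeat. False means this instance no longer owns the lock."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ?"
            " WHERE id = 1 AND owner = ?",
            (now, owner),
        )
    return cur.rowcount == 1
```

The design relies on the database's uniqueness constraint rather than a read-then-write check, so two instances starting at the same moment cannot both succeed.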
Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Heartbeat Interval | 5 seconds | Allows ~12 missed heartbeats before lock expires |
| Lock TTL | 60 seconds | Time before a lock without heartbeat is considered abandoned |
| Min Safe Ratio | 12x (TTL / interval) | Robust protection against temporary delays or high load |
With a 60-second TTL and 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This provides strong protection against false positives while still detecting genuine crashes.
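The figures in this paragraph follow from simple arithmetic, shown here as a short Python sketch. The constant and function names are illustrative; the worst case assumes the instance stalls just before its next heartbeat is due, so 5 of the 60 seconds have already elapsed.

```python
HEARTBEAT_INTERVAL_S = 5
LOCK_TTL_S = 60

# Ratio of TTL to heartbeat interval (the "Min Safe Ratio" in the table).
safety_ratio = LOCK_TTL_S / HEARTBEAT_INTERVAL_S

# Worst case: the stall begins one full interval after the last heartbeat,
# leaving TTL minus interval before the lock is considered abandoned.
worst_case_stall_s = LOCK_TTL_S - HEARTBEAT_INTERVAL_S


def lock_expired(last_heartbeat_s: float, now_s: float,
                 ttl_s: float = LOCK_TTL_S) -> bool:
    """A lock is stale once more than ttl_s has passed since its heartbeat."""
    return now_s - last_heartbeat_s > ttl_s
```

Lowering the TTL or raising the interval shrinks both numbers, which is why the two values should be tuned together.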
Monitoring
Check logs for these key events:
- scheduler_lock_acquired: Lock successfully acquired at startup (INFO)
- scheduler_lock_heartbeat_updated: Heartbeat successfully updated (DEBUG)
- scheduler_lock_heartbeat_failed: Heartbeat update failed; lock may be lost (WARNING)
- scheduler_lock_heartbeat_timeout: Heartbeat exceeded 5-second timeout (ERROR)
- scheduler_lock_held_by_other_instance: Another instance holds the lock (WARNING at startup)
Troubleshooting: "Blocklist import runs twice"
Symptom: Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption.
Cause: The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running.
Solution:
- Check heartbeat timing: Ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk).
- Verify database health: Run SELECT * FROM scheduler_lock; to see if a stale lock exists. If present, delete it: DELETE FROM scheduler_lock;
- Review logs: Look for scheduler_lock_heartbeat_failed or scheduler_lock_heartbeat_timeout errors in the time window when duplication occurred.
- Increase resource limits: If the backend is memory/CPU constrained, increase limits in docker-compose.yml to prevent slowdowns that trigger false lock timeouts.
- Check database performance: Slow database queries can delay heartbeat updates. Run PRAGMA integrity_check; to check for corruption.
If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability.
Troubleshooting: "Scheduler stops completely"
Symptom: Background tasks (blocklist import, geo cache cleanup, history sync, session cleanup) stop running. No errors in logs but tasks don't execute.
Cause: Instance holding the scheduler lock crashed without releasing it, or heartbeat is failing silently.
Diagnosis:
- Check if lock exists: SELECT * FROM scheduler_lock;
- If lock exists with a PID that no longer runs, it's orphaned
- Check logs for scheduler_lock_heartbeat_lost warnings
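Checking whether a recorded PID still runs can be scripted. The sketch below is a generic POSIX technique, not part of the backend; `pid_is_running` is a hypothetical helper. It only gives a meaningful answer when run in the same PID namespace (that is, the same container or host) as the process that wrote the lock.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 performs the existence and permission checks
        # without delivering any signal to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the lock is orphaned
    except PermissionError:
        return True   # process exists but belongs to another user
    return True
```

If the function returns False for the PID stored in the lock row, the lock can be treated as orphaned and cleared as described in the solution steps.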
Solution:
- Clear the orphaned lock: DELETE FROM scheduler_lock;
- Restart the instance that should hold the lock
- Verify lock acquisition: grep "scheduler_lock_acquired" logs
- If heartbeat keeps failing, check database latency (SQLite heartbeats should be <100ms)
Prevention:
- Monitor scheduler_lock_heartbeat_lost events: more than 3 in an hour indicates a problem
- Ensure database I/O is not bottlenecked (SSD recommended for SQLite)
- Consider reducing heartbeat interval if network latency causes false timeouts
Advanced: Migrating to Redis
For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis:
- Why: Redis is single-threaded and atomic by design; clock skew and timeout issues are eliminated.
- How: Install redlock-py or aioredis, replace scheduler_lock.py with a Redis implementation, update heartbeat interval to 2-3 seconds.
- Trade-off: Adds a Redis dependency but eliminates database lock contention and provides microsecond-precision atomicity.
This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs.
Resource Limits
All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:
- Isolation: A misbehaving container cannot crash others or the host
- Predictability: Reservations guarantee minimum resources even under load
- Efficiency: Unused reserved capacity can be borrowed by other containers
Container Resource Limits
| Container | Limit CPU | Limit Memory | Reserved CPU | Reserved Memory | Purpose |
|---|---|---|---|---|---|
| fail2ban | 0.5 | 128M | 0.1 | 64M | Monitors logs, bans IPs—typically idle |
| backend | 2.0 | 512M | 1.0 | 256M | Core app: database, fail2ban API, config management |
| frontend | 0.5 | 128M | 0.25 | 64M | Nginx: serves SPA + API proxy |
Rationale
- fail2ban: Lightweight log monitoring. Occasionally CPU spikes during ban processing but memory usage is minimal.
- backend: Heavy lifting—Python runtime, SQLite database, background jobs. May need extra memory for large blocklists. Reservation of 1.0 CPU ensures responsive API even when frontend is busy.
- frontend: Nginx is efficient. Limit of 0.5 CPU and 128M memory is more than sufficient for reverse proxy duties.
Memory Considerations
Backend Memory Requirements
The backend typically runs in 256–512M under normal load. Memory usage depends on:
- Blocklist size: Large blocklists (>1M entries) require more heap space
- Cache warmth: First query after startup may require more memory as caches fill
- Concurrent connections: Each active user session uses a small amount of memory
Tuning: If you see OOM kills in logs, increase backend limits and reservations (e.g., 1024M limit). Test under realistic load before finalizing.
Frontend Memory Usage
Nginx is typically <50M. If you see memory pressure on frontend, check for:
- Misconfigured cache headers on static assets
- Large log volumes (nginx access logs)
Docker Swarm & Kubernetes
For production deployments using orchestration platforms:
Docker Swarm
The deploy sections in docker-compose.yml are compatible with docker stack deploy:
docker stack deploy -c Docker/docker-compose.yml bangui
Swarm respects the same limits and reservations fields.
Kubernetes
For Kubernetes, translate resource constraints to equivalent resources fields in your deployment manifests:
containers:
  - name: backend
    image: git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "1"
        memory: "256Mi"
Kubernetes equivalent mappings:
- Docker deploy.resources.limits → Kubernetes resources.limits
- Docker deploy.resources.reservations → Kubernetes resources.requests
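The mapping can be expressed as a small conversion helper. This is a sketch, not a supported tool: the function names are hypothetical, and the input shape assumes Compose-style `cpus` and `memory` keys. Docker reads suffixes such as M as binary units (MiB), so the Kubernetes equivalent is Mi rather than M, which Kubernetes would interpret as 10^6 bytes.

```python
from typing import Dict

# Docker size suffixes are binary; Kubernetes spells binary units Ki/Mi/Gi.
_SUFFIXES = {"k": "Ki", "m": "Mi", "g": "Gi"}


def to_k8s_memory(value: str) -> str:
    """Convert a Docker memory string such as '512M' to '512Mi'."""
    text = value.strip().lower()
    if text.endswith("b"):  # accept '512mb', '1gb', '2b'
        text = text[:-1]
    if text and text[-1] in _SUFFIXES:
        return text[:-1] + _SUFFIXES[text[-1]]
    return text  # no suffix: plain bytes


def to_k8s_resources(limits: Dict[str, str],
                     reservations: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Map Compose limits/reservations to a Kubernetes resources block."""
    def convert(section: Dict[str, str]) -> Dict[str, str]:
        return {
            # format(..., 'g') drops a trailing '.0', so '2.0' becomes '2'.
            "cpu": format(float(section["cpus"]), "g"),
            "memory": to_k8s_memory(section["memory"]),
        }
    return {"limits": convert(limits), "requests": convert(reservations)}
```

Applied to the backend row of the resource table, `to_k8s_resources({"cpus": "2.0", "memory": "512M"}, {"cpus": "1.0", "memory": "256M"})` yields the same values as the manifest above.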
Monitoring Resource Usage
Docker Compose (Development)
docker stats
Shows real-time CPU and memory usage for all running containers.
Production (Docker Swarm / Kubernetes)
Use native monitoring:
- Docker Swarm: Prometheus + Grafana
- Kubernetes: Metrics Server + dashboard or Prometheus
Environment Variables
Resource limits are configured in Docker/docker-compose.yml and cannot be overridden via environment variables. To adjust limits:
- Edit Docker/docker-compose.yml
- Modify the deploy.resources.limits and deploy.resources.reservations sections
- Restart containers: make down && make up
Troubleshooting
| Issue | Symptom | Solution |
|---|---|---|
| Backend OOM kills | "Exit code 137" in logs | Increase backend memory limit |
| Throttling | CPU at 100%, requests slow | Increase CPU limit or optimize code |
| Service startup timeout | Containers not becoming "healthy" | Increase reservation to guarantee capacity at startup |
| Host unresponsive | System-wide lag | Reduce container limits to prevent host starvation |
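Exit code 137 in the first row is not arbitrary: by shell convention, a process killed by a signal reports 128 plus the signal number, and 137 corresponds to SIGKILL (9), the signal the kernel's OOM killer sends. The sketch below decodes such codes on a POSIX system; `describe_exit_code` is a hypothetical helper. SIGKILL can also come from other sources, such as a manual docker kill, so treat 137 as a strong hint rather than proof of an OOM kill.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error {code}"
```

For instance, `describe_exit_code(143)` reports SIGTERM, which usually indicates an ordinary stop request rather than memory pressure.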
Next Steps
- Development: Run make up to start with default limits
- Staging: Test with realistic data volumes and monitor resource usage
- Production: Adjust limits based on observed usage patterns, then commit changes