BanGUI/Docs/Deployment.md
Lukas b631c1c546 feat(backend): implement graceful shutdown for container stop
Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00


Deployment Guide

Graceful Shutdown

BanGUI implements graceful shutdown to ensure in-flight operations complete before the process exits. This prevents:

  • Incomplete blocklist imports leaving stale data
  • Interrupted ban requests
  • Corrupted background job states
  • Unclean database connection closures

How It Works

  1. SIGTERM received — Docker sends SIGTERM when docker stop is called
  2. Uvicorn catches SIGTERM — Notifies the FastAPI lifespan handler
  3. Lifespan shutdown begins — Scheduler stops accepting new jobs
  4. In-flight tasks drain — Up to 25 seconds for running jobs to complete
  5. Resources cleaned up — HTTP session, external logging, scheduler lock, DB connection

Docker Configuration

backend:
  stop_grace_period: 30s  # Docker waits 30s after SIGTERM before sending SIGKILL

The backend's own drain timeout is 25s, which leaves a 5s safety margin inside the 30s stop_grace_period before Docker sends SIGKILL. Without this setting, Docker's default grace period of 10s would cut the drain short.

Shutdown Sequence

| Step | Action | Timeout |
|------|--------|---------|
| 1 | Scheduler stops accepting new jobs | Immediate |
| 2 | Wait for pending background tasks | 25s max |
| 3 | Close HTTP session | Immediate |
| 4 | Flush external logging handler | Immediate |
| 5 | Release scheduler lock | Immediate |
| 6 | Close database connection | Immediate |
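
The drain in step 2 can be sketched with asyncio.wait. This is a minimal illustration, not the code in backend/app/main.py: the function name drain_tasks and the idea of tracking background tasks in a set are assumptions made for the example.

```python
import asyncio


async def drain_tasks(tasks: set[asyncio.Task], timeout: float = 25.0) -> int:
    """Wait up to `timeout` seconds for tasks to finish, then cancel the rest.

    Returns the number of tasks that had to be cancelled.
    """
    if not tasks:
        return 0
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to handle CancelledError before
    # shared resources (HTTP session, database) are closed.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)
```

In this sketch the return value plays the role of the cancelled_count field shown in the pending_tasks_timeout log line; a return value of 0 would correspond to pending_tasks_completed.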

Background Tasks That Drain

  • Blocklist imports
  • Geo IP cache resolutions
  • History sync operations
  • Geo cache cleanup
  • Geo cache flush
  • Session cleanup
  • Rate limiter cleanup
  • Scheduler lock heartbeat

Monitoring Shutdown

Logs during shutdown:

bangui_shutting_down timeout_seconds=25.0
scheduler_stopped_accepting_jobs
waiting_for_pending_tasks count=3 timeout_seconds=25.0
pending_tasks_completed
http_session_closed
external_logging_shutdown_complete
scheduler_lock_released
bangui_shut_down

If tasks exceed the timeout:

pending_tasks_timeout cancelled_count=3

Rolling Deployments

During rolling deployments:

  1. Old instance releases scheduler lock immediately on shutdown
  2. New instance acquires lock without waiting for TTL expiry
  3. Zero downtime for background job execution

Health Checks

The backend container includes a health check endpoint at GET /api/v1/health that reports application and component status:

  • HTTP 200 with {"status": "ok", ...} — all components healthy
  • HTTP 200 with {"status": "degraded", ...} — some components unhealthy (e.g., database error) but fail2ban reachable
  • HTTP 503 with {"status": "unavailable", ...} — fail2ban is unreachable (backend will restart)

Component checks performed:

| Component | Check | Notes |
|-----------|-------|-------|
| fail2ban | Socket ping via cached status | Returns 503 when offline |
| database | Opens and closes a test connection | Returns degraded when failing |
| scheduler | scheduler.running attribute | Returns degraded when stopped |
| cache | Session cache presence | Returns degraded when not initialised |
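
The mapping from component results to the three responses can be sketched as a pure function. This is an illustration under stated assumptions: the function name health_response and the "components" key in the body are hypothetical, and only the status codes and the "status" values come from the description above.

```python
def health_response(components: dict[str, bool]) -> tuple[int, dict]:
    """Map per-component health flags to (HTTP status code, JSON body)."""
    if not components.get("fail2ban", False):
        # fail2ban unreachable: the only case that yields a non-200 response.
        code, status = 503, "unavailable"
    elif all(components.values()):
        code, status = 200, "ok"
    else:
        # fail2ban is reachable but another component failed its check.
        code, status = 200, "degraded"
    body = {
        "status": status,
        # Hypothetical detail section, one entry per component.
        "components": {
            name: "ok" if healthy else "error"
            for name, healthy in components.items()
        },
    }
    return code, body
```

Treating fail2ban as the only hard dependency keeps a database or cache problem from failing the probe while still surfacing it in the response body.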

Docker Health Check:

The Dockerfile includes a HEALTHCHECK that queries the endpoint. An HTTP 503 makes the probe fail, and after 3 consecutive failures (about 90 seconds at the 30s interval) Docker marks the container as unhealthy. Standalone Docker only reports that status; the restart itself is performed by whatever acts on it, such as Docker Swarm or a Kubernetes probe pointed at the same endpoint.

Why 503 for offline fail2ban?

If fail2ban goes offline but the backend always returns 200, the container is treated as healthy, which masks infrastructure failures. Returning 503 when fail2ban is unreachable lets orchestration tools (Docker Swarm, Kubernetes) detect the failure and restart the backend container until fail2ban recovers.

Docker Compose health check parameters:

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| interval | 30s | Balance between responsiveness and load |
| timeout | 10s | Allows for slow probe on busy system |
| retries | 3 | ~90 seconds before restart (3 × 30s) |
| start_period | 40s | Allows app and fail2ban to fully start |
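
The 90-second figure assumes each failing probe returns quickly. As a rough sketch of the arithmetic, assuming Docker schedules each probe one interval after the previous probe completes (as the Docker documentation describes), the detection time can be bounded as follows. The function name time_to_unhealthy is made up for this example.

```python
def time_to_unhealthy(interval_s: float, timeout_s: float, retries: int) -> tuple[float, float]:
    """Return (typical, worst_case) seconds of continuous failure before 'unhealthy'."""
    # Typical case: every probe fails almost immediately (e.g. a fast HTTP 503).
    typical = retries * interval_s
    # Worst case: every probe hangs until it hits the timeout.
    worst_case = retries * (interval_s + timeout_s)
    return typical, worst_case


typical, worst_case = time_to_unhealthy(interval_s=30, timeout_s=10, retries=3)
```

With the values from the table this gives roughly 90 seconds when probes fail fast and up to about 120 seconds when every probe times out.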

CORS Configuration

Cross-Origin Resource Sharing (CORS) must be explicitly configured when the frontend and backend are served from different origins.

Development

By default, the backend allows requests from common localhost development origins:

  • http://localhost:5173
  • http://127.0.0.1:5173
  • https://localhost:5173
  • https://127.0.0.1:5173

No additional configuration is needed for local development — just run the frontend and backend normally.

Production

In production, override the default with your actual frontend origin(s):

Docker Compose:

environment:
  BANGUI_CORS_ALLOWED_ORIGINS: "https://example.com,https://www.example.com"

Environment File (.env):

BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://www.example.com

Multiple Origins: Separate multiple allowed origins with commas (no spaces):

BANGUI_CORS_ALLOWED_ORIGINS=https://example.com,https://app.example.com,https://admin.example.com

Disable CORS: To disable CORS entirely (e.g., when the frontend is served from the same origin as the backend):

BANGUI_CORS_ALLOWED_ORIGINS=
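
The way a comma-separated value turns into an allow-list can be sketched as follows. This is a hedged illustration, not the backend's actual settings code: the helper names parse_allowed_origins and is_origin_allowed are hypothetical, and the sketch tolerates stray whitespace even though the documented format has none.

```python
def parse_allowed_origins(raw: str) -> list[str]:
    """Split a BANGUI_CORS_ALLOWED_ORIGINS-style value into a list of origins.

    An empty string yields an empty list, i.e. no cross-origin access.
    """
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_origin_allowed(origin: str, allowed: list[str]) -> bool:
    """Exact match on scheme + host + port; no wildcard or prefix matching."""
    return origin in allowed


allowed = parse_allowed_origins("https://example.com,https://www.example.com")
```

Exact string matching is why a production origin must be listed with its precise protocol and port: https://example.com and http://example.com are different origins.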

Security Considerations

  • Always specify exact origins — never use wildcard * in production, especially with allow_credentials=true (credentials mode is required for the session cookie).
  • Use HTTPS in production — the backend enforces the Secure cookie flag, which requires HTTPS (or localhost for development).
  • Validate in reverse proxy — if using Nginx or a CDN reverse proxy, validate the Origin header before forwarding requests to ensure only legitimate origins reach the backend.

Troubleshooting

| Symptom | Cause | Solution |
|---------|-------|----------|
| Access-Control-Allow-Origin header missing from response | CORS not configured or origin not whitelisted | Check BANGUI_CORS_ALLOWED_ORIGINS and ensure your frontend origin is included |
| Browser blocks requests with CORS error | Credentials mode enabled but origin not exactly whitelisted | Ensure BANGUI_CORS_ALLOWED_ORIGINS includes the exact origin (protocol + domain + port) of your frontend |
| Works in development but fails in production | Default localhost origins used instead of production frontend domain | Override BANGUI_CORS_ALLOWED_ORIGINS in production environment |

Scheduler Lock

In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring only one instance runs the scheduler at a time.

How It Works

The lock is stored in the SQLite database and enforced via:

  1. Lock Acquisition — At startup, each instance tries to insert a lock record. Only one succeeds; others reject startup with a clear error message.
  2. Heartbeat — The lock-holding instance sends a heartbeat every 5 seconds to prove it's still alive.
  3. Stale Lock Cleanup — On startup, any lock older than 60 seconds (without a heartbeat) is automatically deleted, allowing recovery from instance crashes.
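
The three mechanisms above can be sketched with the standard sqlite3 module. This is an illustration only: the table name scheduler_lock comes from the troubleshooting queries in this guide, but the column names (id, holder, heartbeat_at) and both function names are assumptions, and the real scheduler_lock.py may differ.

```python
import sqlite3
import time
from typing import Optional

LOCK_TTL_SECONDS = 60.0


def try_acquire_lock(conn: sqlite3.Connection, holder: str, now: Optional[float] = None) -> bool:
    """Delete a stale lock, then try to insert our own; True if we now hold it."""
    now = time.time() if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"  # at most one row can exist
        " holder TEXT NOT NULL,"
        " heartbeat_at REAL NOT NULL)"
    )
    try:
        with conn:  # cleanup and insert commit (or roll back) together
            conn.execute(
                "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
                (now - LOCK_TTL_SECONDS,),
            )
            conn.execute(
                "INSERT INTO scheduler_lock (id, holder, heartbeat_at) VALUES (1, ?, ?)",
                (holder, now),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # a live lock row already exists


def heartbeat(conn: sqlite3.Connection, holder: str, now: Optional[float] = None) -> bool:
    """Refresh the heartbeat; False means this instance no longer holds the lock."""
    now = time.time() if now is None else now
    with conn:
        cursor = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ? WHERE holder = ?",
            (now, holder),
        )
    return cursor.rowcount == 1
```

The single-row primary key makes the INSERT itself the mutual-exclusion step, so two instances racing at startup cannot both succeed.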

Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Heartbeat Interval | 5 seconds | Allows ~12 missed heartbeats before lock expires |
| Lock TTL | 60 seconds | Time before a lock without heartbeat is considered abandoned |
| Min Safe Ratio | 12x (TTL / interval) | Robust protection against temporary delays or high load |

With a 60-second TTL and 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This provides strong protection against false positives while still detecting genuine crashes.

Monitoring

Check logs for these key events:

  • scheduler_lock_acquired — Lock successfully acquired at startup (INFO)
  • scheduler_lock_heartbeat_updated — Heartbeat successfully updated (DEBUG)
  • scheduler_lock_heartbeat_failed — Heartbeat update failed; lock may be lost (WARNING)
  • scheduler_lock_heartbeat_timeout — Heartbeat exceeded 5-second timeout (ERROR)
  • scheduler_lock_held_by_other_instance — Another instance holds the lock (WARNING at startup)

Troubleshooting: "Blocklist import runs twice"

Symptom: Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption.

Cause: The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running.

Solution:

  1. Check heartbeat timing — Ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk).
  2. Verify database health — Run SELECT * FROM scheduler_lock; to see if a stale lock exists. If present, delete it: DELETE FROM scheduler_lock;
  3. Review logs — Look for scheduler_lock_heartbeat_failed or scheduler_lock_heartbeat_timeout errors in the time window when duplication occurred.
  4. Increase resource limits — If the backend is memory/CPU constrained, increase limits in docker-compose.yml to prevent slowdowns that trigger false lock timeouts.
  5. Check database performance — Slow database queries can delay heartbeat updates. Run PRAGMA integrity_check; to check for corruption.

If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability.

Troubleshooting: "Scheduler stops completely"

Symptom: Background tasks (blocklist import, geo cache cleanup, history sync, session cleanup) stop running. No errors in logs but tasks don't execute.

Cause: Instance holding the scheduler lock crashed without releasing it, or heartbeat is failing silently.

Diagnosis:

  1. Check if lock exists: SELECT * FROM scheduler_lock;
  2. If lock exists with a PID that no longer runs, it's orphaned
  3. Check logs for scheduler_lock_heartbeat_lost warnings

Solution:

  1. Clear the orphaned lock: DELETE FROM scheduler_lock;
  2. Restart the instance that should hold the lock
  3. Verify lock acquisition: grep "scheduler_lock_acquired" logs
  4. If heartbeat keeps failing, check database latency (SQLite heartbeats should be <100ms)

Prevention:

  • Monitor scheduler_lock_heartbeat_lost events — more than 3 in an hour indicates a problem
  • Ensure database I/O is not bottlenecked (SSD recommended for SQLite)
  • Consider reducing heartbeat interval if network latency causes false timeouts

Advanced: Migrating to Redis

For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis:

  • Why: Redis executes each command atomically and expires keys server-side, so acquiring the lock and timing it out no longer depend on SQLite write latency.
  • How: Install a Redis client such as redlock-py or redis-py (which absorbed the former aioredis package), replace scheduler_lock.py with a Redis implementation, and update the heartbeat interval to 2-3 seconds.
  • Trade-off: Adds a Redis dependency but eliminates database lock contention.

This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs.


Resource Limits

All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:

  • Isolation: A misbehaving container cannot crash others or the host
  • Predictability: Reservations guarantee minimum resources even under load
  • Efficiency: Unused reserved capacity can be borrowed by other containers

Container Resource Limits

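| Container | Limit CPU | Limit Memory | Reserved CPU | Reserved Memory | Purpose |
|-----------|-----------|--------------|--------------|-----------------|---------|
| fail2ban | 0.5 | 128M | 0.1 | 64M | Monitors logs, bans IPs; typically idle |
| backend | 2.0 | 512M | 1.0 | 256M | Core app: database, fail2ban API, config management |
| frontend | 0.5 | 128M | 0.25 | 64M | Nginx: serves SPA + API proxy |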

Rationale

  • fail2ban: Lightweight log monitoring. CPU occasionally spikes during ban processing, but memory usage stays minimal.
  • backend: Heavy lifting—Python runtime, SQLite database, background jobs. May need extra memory for large blocklists. Reservation of 1.0 CPU ensures responsive API even when frontend is busy.
  • frontend: Nginx is efficient. Limit of 0.5 CPU and 128M memory is more than sufficient for reverse proxy duties.

Memory Considerations

Backend Memory Requirements

The backend typically runs in 256-512M under normal load. Memory usage depends on:

  • Blocklist size: Large blocklists (>1M entries) require more heap space
  • Cache warmth: First query after startup may require more memory as caches fill
  • Concurrent connections: Each active user session uses a small amount of memory

Tuning: If you see OOM kills in logs, increase backend limits and reservations (e.g., 1024M limit). Test under realistic load before finalizing.

Frontend Memory Usage

Nginx typically uses less than 50M. If you see memory pressure on the frontend, check for:

  • Misconfigured cache headers on static assets
  • Large log volumes (nginx access logs)

Docker Swarm & Kubernetes

For production deployments using orchestration platforms:

Docker Swarm

The deploy sections in docker-compose.yml are compatible with docker stack deploy:

docker stack deploy -c Docker/docker-compose.yml bangui

Swarm respects the same limits and reservations fields.

Kubernetes

For Kubernetes, translate resource constraints to equivalent resources fields in your deployment manifests:

containers:
  - name: backend
    image: git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "1"
        memory: "256Mi"

Kubernetes equivalent mappings:

  • Compose deploy.resources.limits → Kubernetes resources.limits
  • Compose deploy.resources.reservations → Kubernetes resources.requests
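
The mapping can also be expressed as a small conversion helper. This is a sketch under assumptions: the function names are invented for the example, the input mirrors the limits/reservations mapping of a Compose service, and Compose memory units are treated as binary (so 512M becomes 512Mi), which matches the manifest shown above.

```python
_BINARY_SUFFIXES = {"k": "Ki", "m": "Mi", "g": "Gi"}


def to_k8s_memory(value: str) -> str:
    """Convert a Compose byte value such as '512M' or '1gb' to a Kubernetes quantity."""
    text = value.strip().lower().removesuffix("b")
    if text and text[-1] in _BINARY_SUFFIXES:
        return text[:-1] + _BINARY_SUFFIXES[text[-1]]
    return text  # no unit suffix: plain bytes


def compose_to_k8s_resources(resources: dict) -> dict:
    """Translate Compose limits/reservations into Kubernetes limits/requests."""

    def convert(section: dict) -> dict:
        out = {}
        if "cpus" in section:
            out["cpu"] = f"{float(section['cpus']):g}"  # '2.0' -> '2'
        if "memory" in section:
            out["memory"] = to_k8s_memory(section["memory"])
        return out

    return {
        "limits": convert(resources.get("limits", {})),
        "requests": convert(resources.get("reservations", {})),
    }


backend = compose_to_k8s_resources({
    "limits": {"cpus": "2.0", "memory": "512M"},
    "reservations": {"cpus": "1.0", "memory": "256M"},
})
```

The unit conversion matters because Kubernetes reads a bare M suffix as decimal megabytes, which is slightly less memory than the same number in Mi.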

Monitoring Resource Usage

Docker Compose (Development)

docker stats

Shows real-time CPU and memory usage for all running containers.

Production (Docker Swarm / Kubernetes)

Use native monitoring:

  • Docker Swarm: Prometheus + Grafana
  • Kubernetes: Metrics Server + dashboard or Prometheus

Environment Variables

Resource limits are configured in Docker/docker-compose.yml and cannot be overridden via environment variables. To adjust limits:

  1. Edit Docker/docker-compose.yml
  2. Modify the deploy.resources.limits and deploy.resources.reservations sections
  3. Restart containers: make down && make up

Troubleshooting

| Issue | Symptom | Solution |
|-------|---------|----------|
| Backend OOM kills | "Exit code 137" in logs | Increase backend memory limit |
| Throttling | CPU at 100%, requests slow | Increase CPU limit or optimize code |
| Service startup timeout | Containers not becoming "healthy" | Increase reservation to guarantee capacity at startup |
| Host unresponsive | System-wide lag | Reduce container limits to prevent host starvation |
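
Exit code 137 follows the shell convention of 128 plus the signal number, here 9 (SIGKILL), which is the signal the kernel OOM killer uses. The small helper below decodes such codes; the name describe_exit_code is made up for this example, and the signal names resolve as shown on Linux.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        number = code - 128
        try:
            return f"killed by {signal.Signals(number).name}"
        except ValueError:
            return f"killed by unknown signal {number}"
    return "exited normally" if code == 0 else f"exited with error status {code}"
```

A 137 alone does not prove an out-of-memory kill, because a container that overruns its stop grace period is also ended with SIGKILL. The OOMKilled field in the State section of docker inspect output distinguishes the two cases.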

Disaster Recovery

Database Migration Failures

If a migration fails mid-transaction, the application refuses to start. This is intentional — it prevents inconsistent schema states.

Diagnosis:

  1. Check current schema version:

    sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
    
  2. Check which tables exist:

    sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
    
  3. Check application logs for the specific error.

Recovery Options:

  • Automatic rollback: Next startup re-applies the same migration from scratch
  • Manual completion: Apply the migration manually, then insert the version record:
    sqlite3 /var/lib/bangui/bangui.db <<'SQL'
    BEGIN IMMEDIATE;
    -- Run your migration SQL here
    -- Replace ? with the version number of the migration you applied
    INSERT INTO schema_migrations (version) VALUES (?);
    COMMIT;
    SQL
    
  • Full reset (development only): rm bangui.db bangui.db-wal bangui.db-shm
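
Manual completion only helps if the migration statements and the version record land in the same transaction, which requires a single database connection. A minimal sketch with Python's sqlite3 module is shown below; the function name apply_migration_manually is hypothetical, and the only schema assumption is the schema_migrations(version) table used in the queries above.

```python
import sqlite3


def apply_migration_manually(db_path: str, statements: list[str], version: int) -> None:
    """Run migration statements and record the version atomically."""
    # isolation_level=None disables implicit transactions so that
    # BEGIN / COMMIT / ROLLBACK are issued explicitly.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        try:
            for statement in statements:
                conn.execute(statement)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)",
                (version,),
            )
            conn.execute("COMMIT")
        except Exception:
            if conn.in_transaction:
                conn.execute("ROLLBACK")  # leave the schema untouched on failure
            raise
    finally:
        conn.close()
```

Stop the backend and take a backup before running anything like this against a real database, since it competes with the application for the write lock.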

Prevention:

  • Never modify bangui.db manually while an instance is running
  • Always back up the database before major migrations
  • Monitor startup logs for migrating_database_schema events

Orphaned WAL Files

After a crash, SQLite in WAL mode may leave bangui.db-wal and bangui.db-shm files behind. The database recovers automatically the next time it is opened. If you see WAL-related errors:

# Check for orphaned WAL files
ls -la /var/lib/bangui/bangui.db*

# Force checkpoint to merge WAL into main database
sqlite3 /var/lib/bangui/bangui.db "PRAGMA wal_checkpoint(FULL);"

See Docs/DATABASE_MIGRATIONS.md for full recovery procedures.


Next Steps

  • Development: Run make up to start with default limits
  • Staging: Test with realistic data volumes and monitor resource usage
  • Production: Adjust limits based on observed usage patterns, then commit changes