docs: Add security best practices to Deployment.md

- Secrets management via environment variables
- Container security hardening (non-root user, filesystem permissions, capabilities)
- Network security and TLS termination guidance
- Prune obsolete task tracking from Tasks.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 19:48:52 +02:00
parent 624f869f5b
commit 4d09d2538d
2 changed files with 484 additions and 782 deletions
--- a/Docs/Deployment.md
+++ b/Docs/Deployment.md
@@ -437,3 +437,486 @@ See `Docs/DATABASE_MIGRATIONS.md` for full recovery procedures.
 - **Development**: Run `make up` to start with default limits
 - **Staging**: Test with realistic data volumes and monitor resource usage
 - **Production**: Adjust limits based on observed usage patterns, then commit changes
+
+---
+
+## Security Best Practices
+
+### Secrets Management
+
+**Never hard-code secrets.** All secrets must be injected at runtime via environment variables.
+
+| Secret | Purpose | Generation |
+|--------|---------|------------|
+| `BANGUI_SESSION_SECRET` | Signs session cookies | `python -c 'import secrets; print(secrets.token_hex(32))'` |
+| fail2ban credentials | jail config access | From fail2ban configuration |
+
+- Store secrets in a secrets manager (e.g., Docker secrets, Kubernetes Secrets, HashiCorp Vault)
+- Rotate `BANGUI_SESSION_SECRET` periodically — sessions become invalid, users must re-login
+- Never log or expose session secrets
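+
+A minimal sketch of runtime injection with Docker Compose, assuming the backend service is named `backend` as in the compose fragments elsewhere in this document. The `${VAR:?message}` form is standard Compose interpolation: it aborts `docker compose up` when the variable is unset or empty, so a missing secret fails loudly instead of starting with a blank value.
+
+```yaml
+# docker-compose.yml (sketch; the service name is an assumption)
+services:
+  backend:
+    environment:
+      # Value comes from the shell or an untracked .env file, never from the repository
+      BANGUI_SESSION_SECRET: "${BANGUI_SESSION_SECRET:?BANGUI_SESSION_SECRET must be set}"
+```
+
+If a `.env` file is used, keep it out of version control (for example via `.gitignore`).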
+
+### Container Security Hardening
+
+**Non-root user**: Backend runs as `bangui:bangui` (UID 1000). Frontend runs as nginx default. This limits container breakout damage.
+
+**Filesystem permissions**:
+```bash
+# Data directory (SQLite DB) — only bangui user rw
+chmod 700 /data
+chown 1000:1000 /data
+
+# Config directory — read-only for backend (it reads fail2ban config)
+# Write access only for config management operations via BanGUI
+chmod 755 /config
+```
+
+**Capabilities**: fail2ban container requires `NET_ADMIN` and `NET_RAW` for raw socket manipulation and iptables interaction. No additional capabilities needed for app containers.
+
+**No privileged mode**: BanGUI containers must not run `--privileged`. The fail2ban container needs only specific capabilities, not full host access.
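+
+The hardening rules above could be expressed in Compose roughly as follows. This is a sketch, not the project's actual compose file: the service names `backend` and `fail2ban` are assumptions, and dropping all capabilities on the backend should be verified against the real image before use.
+
+```yaml
+# docker-compose.yml (sketch; service names are assumptions)
+services:
+  backend:
+    user: "1000:1000"            # bangui:bangui
+    cap_drop:
+      - ALL                      # app containers need no extra capabilities
+    security_opt:
+      - no-new-privileges:true   # block privilege escalation via setuid binaries
+  fail2ban:
+    privileged: false            # specific capabilities only, never full host access
+    cap_add:
+      - NET_ADMIN
+      - NET_RAW
+```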
+
+### Network Security
+
+- **Internal network only**: All BanGUI containers communicate on `bangui-net`. Only the frontend port (default 8080) is exposed to the host.
+- **fail2ban socket**: Mounted read-only (`ro`) from host — backend reads status only
+- **fail2ban config**: Mounted read-write — BanGUI modifies jail configurations as requested
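+- **Restrict traffic between containers**: Keep BanGUI on its own bridge network so containers from other stacks cannot reach it. Note that `internal: false` is the Docker default and isolates nothing by itself; exposure is limited by publishing only the frontend port:
+  ```yaml
+  networks:
+    bangui-net:
+      driver: bridge
+      internal: false  # Default: network keeps external connectivity; only the frontend publishes a port
+  ```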
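+
+For stricter isolation, one option is to split the stack across two networks so that only the frontend is attached to a network with outside connectivity. The sketch below assumes service names `frontend` and `backend`; the network name `bangui-edge` is hypothetical. An internal network also removes outbound access, so this layout only fits if the backend does not need to reach the internet (blocklist imports may).
+
+```yaml
+# docker-compose.yml (sketch; names other than bangui-net are assumptions)
+services:
+  frontend:
+    networks: [bangui-edge, bangui-net]
+    ports:
+      - "8080:80"
+  backend:
+    networks: [bangui-net]
+
+networks:
+  bangui-edge:
+    driver: bridge
+  bangui-net:
+    driver: bridge
+    internal: true   # no route to or from networks outside the Docker host
+```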
+
+### TLS / HTTPS
+
+BanGUI does not terminate TLS. Handle TLS at the reverse proxy or load balancer level:
+
+**Nginx (existing frontend container)**:
+```nginx
+server {
+    listen 443 ssl;
+    http2 on;  # the "http2" parameter of "listen" is deprecated since nginx 1.25.1
+    server_name bangui.example.com;
+
+    ssl_certificate     /etc/ssl/certs/bangui.crt;
+    ssl_certificate_key /etc/ssl/private/bangui.key;
+    ssl_protocols       TLSv1.2 TLSv1.3;
+    ssl_ciphers         ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
+
+    # Proxy to existing frontend container
+    location / {
+        proxy_pass http://bangui-frontend:80;
+        ...
+    }
+}
+```
+
+**Security headers** (already in nginx.conf):
+- CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy
+- Uncomment HSTS header when HTTPS is fully configured
+
+**HTTP to HTTPS redirect**: Add in your TLS terminator:
+```nginx
+server {
+    listen 80;
+    server_name bangui.example.com;
+    return 301 https://$host$request_uri;
+}
+```
+
+### Dependency Scanning
+
+Scan base images for vulnerabilities regularly:
+
+```bash
+# Trivy (Docker/Podman compatible)
+trivy image python:3.12-slim
+trivy image nginx:1.27-alpine
+trivy image node:22-alpine
+
+# CI integration
+trivy image --exit-code 1 --severity HIGH,CRITICAL git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
+```
+
+Update base images quarterly or when CVEs are published.
+
+### Rate Limiting at Deployment Level
+
+The application-level rate limiter (`BANGUI_RATE_LIMIT_*` env vars) handles API requests. Add deployment-level protection:
+
+**Nginx** (existing reverse proxy):
+```nginx
+# Limit concurrent connections per IP
+limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
+server {
+    limit_conn conn_limit 100;
+}
+```
+
+**Fail2ban** (already running):
+- BanGUI manages fail2ban jails
+- Additional deployment-level rate limits should target infrastructure endpoints (SSH, management UIs), not BanGUI itself
+
+### Audit Logging
+
+All authentication events are logged via structlog:
+
+| Event | Log Key | Severity |
+|-------|---------|----------|
+| Login success | `auth_login_success` | INFO |
+| Login failure | `auth_login_failure` | WARNING |
+| Session created | `session_created` | INFO |
+| Session destroyed | `session_destroyed` | INFO |
+| Session expired | `session_expired` | INFO |
+
+Forward these logs to a SIEM or log aggregator for security monitoring. See [Structured Logging](#structured-logging) below.
+
+---
+
+## Performance Tuning
+
+### SQLite Performance
+
+SQLite is single-writer. Under write-heavy load (blocklist imports, history writes), writes may queue.
+
+**WAL mode** (default, do not disable):
+```
+PRAGMA journal_mode=WAL;  -- Already enabled by default
+```
+
+**Synchronous mode** for production:
+```
+PRAGMA synchronous=NORMAL;  -- Balanced (not FULL, not OFF)
+```
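+In WAL mode, `NORMAL` still protects the database against corruption after a process crash while keeping write performance good. The trade-off is durability: a power loss or OS crash may roll back the most recently committed transactions.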
+
+**Cache size** (increase for production):
+```sql
+-- In-memory page cache: 64 MiB (adjust based on available RAM)
+PRAGMA cache_size=-65536;  -- negative value = size in KiB
+```
+
+**temp_store** for large sorts:
+```
+PRAGMA temp_store=MEMORY;
+```
+
+**Read performance**:
+- Most reads are point queries by IP or jail name — indexes handle this efficiently
+- Large history scans (dashboard) — paginate, use `LIMIT/OFFSET`
+- Avoid `SELECT *` on large tables — always specify needed columns
+
+### Gzip Compression
+
+Already enabled in nginx.conf. Verify effective compression:
+```bash
+curl -H "Accept-Encoding: gzip" -I http://localhost:8080/api/v1/dashboard/status
+# Should show: Content-Encoding: gzip
+```
+
+### Backend Performance
+
+**Startup warm-up**: On first request after start, caches are cold. First blocklist query may be slower. This is normal — subsequent requests hit cache.
+
+**Memory tuning**:
+```yaml
+# docker-compose.yml: raise the limit if the backend is OOM-killed
+backend:
+  deploy:
+    resources:  # limits must be nested under deploy.resources
+      limits:
+        memory: 1024M  # Up from 512M for large blocklists
+```
+
+**Single worker enforced**: The session cache is process-local. Multiple workers would cause random logouts. This is intentional — scale horizontally via orchestration, not vertically via workers.
+
+### Frontend Performance
+
+**Static asset caching** (already configured):
+```nginx
+location /assets/ {
+    expires 1y;
+    add_header Cache-Control "public, immutable";
+}
+```
+
+**Bundle size**: Production build uses esbuild minification. Monitor with:
+```bash
+du -sh frontend/dist/
+ls -lh frontend/dist/assets/*.js
+```
+
+### Database Maintenance
+
+**Periodic checkpoint** (production, monthly or after large blocklist imports):
+```bash
+sqlite3 /data/bangui.db "PRAGMA wal_checkpoint(FULL);"
+```
+
+**Analyze for query planner** (after bulk inserts/deletes):
+```bash
+sqlite3 /data/bangui.db "ANALYZE;"
+```
+
+---
+
+## Monitoring Setup
+
+### Health Check Endpoint
+
+`GET /api/v1/health` — primary monitoring target.
+
+| Status | HTTP Code | Meaning |
+|--------|-----------|---------|
+| `ok` | 200 | All components healthy |
+| `degraded` | 200 | Some components unhealthy — investigate |
+| `unavailable` | 503 | fail2ban unreachable — container will be restarted |
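+
+A container-level health check against this endpoint might look like the sketch below (the project's compose file may already define one). It assumes the backend listens on port 8000 and that the image ships a `python` interpreter; `urlopen` raises on HTTP 503, which makes the command exit non-zero. Docker itself only marks the container `unhealthy`, so the restart mentioned above depends on whatever supervises the container (Swarm, another orchestrator, or a watchdog).
+
+```yaml
+# docker-compose.yml (sketch; port and interpreter are assumptions)
+services:
+  backend:
+    healthcheck:
+      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/v1/health', timeout=3)"]
+      interval: 30s
+      timeout: 5s
+      retries: 3        # three consecutive failures mark the container unhealthy
+      start_period: 15s
+```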
+
+### Structured Logging
+
+All logs are structured (JSON via structlog). Key fields:
+
+| Log field | Description |
+|-----------|-------------|
+| `event` | Event name (e.g., `auth_login_success`) |
+| `request_id` | Per-request correlation ID |
+| `user_id` | Session user (if authenticated) |
+| `duration_ms` | Request duration |
+| `component` | Component name (e.g., `scheduler`, `database`) |
+
+**Log levels**:
+
+| Level | Use |
+|-------|-----|
+| DEBUG | Detailed debugging (query SQL, cache hits) |
+| INFO | Operational events (startup, shutdown, login, ban action) |
+| WARNING | Recoverable issues (cache miss, lock contention) |
+| ERROR | Failures requiring attention (DB error, fail2ban offline) |
+
+**Configure via env**:
+```
+BANGUI_LOG_LEVEL=info   # debug, info, warning, error
+```
+
+### Log Aggregation
+
+**Docker Compose**: keep container logs on the host with the `json-file` driver and cap their size with rotation:
+```yaml
+services:
+  backend:
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "10m"
+        max-file: "3"
+```
+
+**External aggregators**:
+```yaml
+# Fluentd example
+services:
+  backend:
+    logging:
+      driver: fluentd
+      options:
+        fluentd-address: localhost:24224
+        tag: bangui-backend
+```
+
+**ELK Stack** — send JSON logs directly to Logstash or via Filebeat.
+
+### Metrics to Monitor
+
+| Metric | Source | Alert Threshold |
+|--------|--------|----------------|
+| Health check failures | `/api/v1/health` | 3 consecutive → container restart |
+| Backend memory | `docker stats` | >450M (of 512M limit) |
+| Backend CPU | `docker stats` | >80% sustained |
+| Disk usage (`/data`) | `df -h` | >80% |
+| fail2ban container restarts | `docker ps` | >2/hour |
+| Backend container restarts | `docker ps` | >2/hour |
+| Database file size | `ls -lh /data/bangui.db` | Grows >10MB/day indicates issue |
+| Session count | `/api/v1/sessions` | Sudden drop indicates cache issue |
+| Blocklist import duration | Logs (`blocklist_import_completed`) | >5 minutes may indicate performance issue |
+
+### Uptime Monitoring
+
+**External checks**:
+- Monitor `https://your-domain.com/api/v1/health` from multiple geographic locations
+- Use services: Better Uptime, UptimeRobot, Pingdom
+- Alert on: HTTP 503, HTTP 200 + `degraded` status, connection timeout
+
+### Alerting
+
+**Critical (PagerDuty / immediate)**:
+- Health check HTTP 503 for >30 seconds
+- Backend OOM kill (exit code 137)
+- fail2ban offline for >5 minutes
+
+**Warning (Slack / email)**:
+- Health check returns `degraded`
+- Disk usage >80%
+- Memory usage >450M
+- Backend restarts >2/hour
+
+---
+
+## Scaling Guidelines
+
+### Horizontal Scaling
+
+BanGUI is **designed for horizontal scaling** via container orchestration (not multiple workers):
+
+```
+┌─────────────────────────────────────────────────┐
+│              Load Balancer                      │
+│         (nginx, HAProxy, Traefik)               │
+└──────────────────┬─────────────────────────────┘
+                   │
+      ┌─────────────┼─────────────┐
+      ▼            ▼            ▼
+┌──────────┐ ┌──────────┐ ┌──────────┐
+│ Backend  │ │ Backend  │ │ Backend  │
+│ (inst 1) │ │ (inst 2) │ │ (inst 3) │
+└────┬─────┘ └────┬─────┘ └────┬─────┘
+     │            │            │
+     └────────────┼────────────┘
+                  ▼
+         ┌───────────────┐
+         │  Scheduler    │
+         │  Lock (DB)    │ ← Only one instance runs jobs
+         └───────────────┘
+                  │
+                  ▼
+         ┌───────────────┐
+         │    SQLite     │
+         │  (shared fs)  │
+         └───────────────┘
+```
+
+**How it works**:
+- Scheduler lock ensures only one instance runs background jobs
+- Session cache is per-instance — use sticky sessions at load balancer, OR configure `BANGUI_SESSION_CACHE=redis` for shared sessions
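+- SQLite on shared storage: every instance must open the same database file. SQLite's WAL mode keeps its index in shared memory and does not work over network file systems such as NFS or GlusterFS, so with WAL enabled all instances must run on one host and share a local volume (block storage such as AWS EBS attached to that host works)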
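+
+On Kubernetes, sticky sessions can be approximated at the Service level with client-IP affinity. This is a sketch: the `app: bangui-backend` label and port 8000 are assumptions, and client-IP affinity is coarse (users behind one NAT share an instance, and traffic arriving through an ingress controller may all appear to come from the controller's address).
+
+```yaml
+# Kubernetes Service with client-IP session affinity (sketch)
+apiVersion: v1
+kind: Service
+metadata:
+  name: bangui-backend
+spec:
+  selector:
+    app: bangui-backend        # hypothetical pod label
+  ports:
+    - port: 8000
+      targetPort: 8000
+  sessionAffinity: ClientIP
+  sessionAffinityConfig:
+    clientIP:
+      timeoutSeconds: 86400    # 24h, the maximum Kubernetes accepts
+```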
+
+### Stateless Design
+
+For true stateless scaling without sticky sessions, migrate session cache to Redis:
+
+```yaml
+# docker-compose.yml
+services:
+  backend:
+    environment:
+      - BANGUI_SESSION_CACHE=redis
+      - BANGUI_REDIS_URL=redis://redis:6379/0
+    depends_on:
+      redis:
+        condition: service_healthy
+
+  redis:
+    image: docker.io/library/redis:7-alpine
+    healthcheck:  # required by "condition: service_healthy" above
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 10s
+    deploy:
+      resources:
+        limits:
+          cpus: '0.5'
+          memory: 256M
+```
+
+Benefits:
+- Sessions shared across all instances → no sticky sessions needed
+- Load balancer can distribute freely
+- Scales linearly
+
+Trade-offs:
+- Redis is another dependency to monitor
+- Redis persistence required for session survival across Redis restarts
+- Redis failure causes mass logouts
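+
+To address the persistence trade-off, Redis can be started with its append-only file enabled and a named volume for its data directory. A sketch, assuming the official image (which stores data under `/data`); the volume name `redis-data` is hypothetical:
+
+```yaml
+# docker-compose.yml (sketch)
+services:
+  redis:
+    image: docker.io/library/redis:7-alpine
+    command: ["redis-server", "--appendonly", "yes"]  # enable AOF persistence
+    volumes:
+      - redis-data:/data   # survives container re-creation
+
+volumes:
+  redis-data:
+```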
+
+### Database Scaling
+
+SQLite does not support read replicas. Scaling reads is limited.
+
+**Read scaling** (if needed):
+- Cache aggressively — BanGUI caches blocklist data in-memory
+- Add read-only views for dashboard queries
+- Consider periodic snapshot exports to separate read-optimized store
+
+**Write scaling**:
+- Single writer only — SQLite WAL helps but doesn't parallelize writes
+- If write throughput becomes a bottleneck, consider:
+  - Periodic batching (already used for blocklist imports)
+  - Sharding by jail (separate DB per jail) — architectural change
+  - Migration to PostgreSQL — significant effort
+
+### CDN for Static Assets
+
+For large-scale deployments, serve `/assets/` from a CDN:
+
+```nginx
+# Replace /assets/ proxy with CDN origin
+location /assets/ {
+    proxy_pass https://your-cdn.cloudfront.net/assets/;
+    proxy_cache_valid 1y;
+    add_header Cache-Control "public, immutable";
+}
+```
+
+Benefits:
+- Reduces frontend container load
+- Assets served from edge locations close to users
+- Reduces bandwidth costs
+
+### Autoscaling
+
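+**Docker Swarm**: Use the `labels` + `update_config` pattern for rolling updates. Swarm has no built-in autoscaler, so autoscaling needs an external controller that watches metrics (e.g., from Prometheus) and adjusts replicas with `docker service scale`.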
+
+**Kubernetes**: HorizontalPodAutoscaler (HPA) based on CPU/memory:
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: bangui-backend
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: bangui-backend
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+  - type: Resource
+    resource:
+      name: memory
+      target:
+        type: Utilization
+        averageUtilization: 80
+```
+
+### Load Balancer Configuration
+
+**Health check**:
+```haproxy
+# HAProxy example (backend name is illustrative)
+backend bangui_backend
+    option httpchk GET /api/v1/health
+    http-check expect status 200
+    server backend1 backend:8000 check
+```
+
+**Sticky sessions** (if NOT using Redis):
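+```haproxy
+# HAProxy: cookie-based persistence (`appsession` was removed in HAProxy 1.6)
+backend bangui_backend
+    cookie SERVERID insert indirect nocache
+    server backend1 backend:8000 check cookie backend1
+```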
+
+**Connection limits**:
+```haproxy
+# Per-backend limit to prevent overload
+server backend1 backend:8000 maxconn 50
+```
+
+---
+
+## Next Steps