docs: Add security best practices to Deployment.md

- Secrets management via environment variables
- Container security hardening (non-root user, filesystem permissions, capabilities)
- Network security and TLS termination guidance
- Prune obsolete task tracking from Tasks.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 19:48:52 +02:00
parent 624f869f5b
commit 4d09d2538d
2 changed files with 484 additions and 782 deletions


@@ -437,3 +437,486 @@ See `Docs/DATABASE_MIGRATIONS.md` for full recovery procedures.
- **Development**: Run `make up` to start with default limits
- **Staging**: Test with realistic data volumes and monitor resource usage
- **Production**: Adjust limits based on observed usage patterns, then commit changes
---
## Security Best Practices
### Secrets Management
**Never hard-code secrets.** All secrets must be injected at runtime via environment variables.
| Secret | Purpose | Generation |
|--------|---------|------------|
| `BANGUI_SESSION_SECRET` | Signs session cookies | `python -c 'import secrets; print(secrets.token_hex(32))'` |
| fail2ban credentials | jail config access | From fail2ban configuration |
- Store secrets in a secrets manager (e.g., Docker secrets, Kubernetes Secrets, HashiCorp Vault)
- Rotate `BANGUI_SESSION_SECRET` periodically — sessions become invalid, users must re-login
- Never log or expose session secrets
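As a minimal sketch of the runtime-injection rule above, a startup check could refuse to boot when the secret is missing or too short. The helper name `load_session_secret` and the 64-character minimum (the length produced by `secrets.token_hex(32)`) are assumptions for illustration, not BanGUI's actual code.

```python
import os


def load_session_secret(env=os.environ, min_chars=64):
    """Return the session secret from the environment or raise.

    Hypothetical helper: the name and the length check are illustrative.
    """
    secret = env.get("BANGUI_SESSION_SECRET", "")
    if len(secret) < min_chars:
        # Never include the secret itself in the error message.
        raise RuntimeError(
            f"BANGUI_SESSION_SECRET is missing or shorter than {min_chars} characters"
        )
    return secret
```

Failing fast at startup keeps a misconfigured deployment from silently signing cookies with a weak or empty key.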
### Container Security Hardening
**Non-root user**: Backend runs as `bangui:bangui` (UID 1000). Frontend runs as nginx default. This limits container breakout damage.
**Filesystem permissions**:
```bash
# Data directory (SQLite DB) — only bangui user rw
chmod 700 /data
chown 1000:1000 /data
# Config directory — read-only for backend (it reads fail2ban config)
# Write access only for config management operations via BanGUI
chmod 755 /config
```
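The permission rule for `/data` can also be verified programmatically, for example from a container entrypoint. The sketch below only uses the standard library; the function name `data_dir_is_private` is hypothetical.

```python
import os
import stat


def data_dir_is_private(path):
    """Return True if group and others have no access to `path` (0700 or stricter)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any group/other permission bit means the directory is not private.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A deployment script might call `data_dir_is_private("/data")` and abort when it returns `False`.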
**Capabilities**: fail2ban container requires `NET_ADMIN` and `NET_RAW` for raw socket manipulation and iptables interaction. No additional capabilities needed for app containers.
**No privileged mode**: BanGUI containers must not run `--privileged`. The fail2ban container needs only specific capabilities, not full host access.
### Network Security
- **Internal network only**: All BanGUI containers communicate on `bangui-net`. Only the frontend port (default 8080) is exposed to the host.
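- **fail2ban socket**: Mounted read-only (`ro`) from the host. A read-only mount protects the socket file itself, but a process that can connect to the socket can still send commands through it, so restrict access with file ownership and permissions as well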
- **fail2ban config**: Mounted read-write — BanGUI modifies jail configurations as requested
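- **Network isolation**: Keep all BanGUI containers on a dedicated user-defined bridge network so that unrelated containers on the host cannot reach them and lateral movement is limited. Host exposure is controlled by published ports, so publish only the frontend port:
```yaml
networks:
  bangui-net:
    driver: bridge
    internal: false  # Compose default: outbound access stays enabled
```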
### TLS / HTTPS
BanGUI does not terminate TLS. Handle TLS at the reverse proxy or load balancer level:
**Nginx (existing frontend container)**:
```nginx
server {
    listen 443 ssl;
    http2 on;  # nginx 1.25.1+; on older versions use "listen 443 ssl http2;"
    server_name bangui.example.com;
    ssl_certificate /etc/ssl/certs/bangui.crt;
    ssl_certificate_key /etc/ssl/private/bangui.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
    # Proxy to existing frontend container
    location / {
        proxy_pass http://bangui-frontend:80;
        ...
    }
}
```
**Security headers** (already in nginx.conf):
- CSP, X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Permissions-Policy
- Uncomment HSTS header when HTTPS is fully configured
**HTTP to HTTPS redirect**: Add in your TLS terminator:
```nginx
server {
    listen 80;
    server_name bangui.example.com;
    return 301 https://$host$request_uri;
}
```
### Dependency Scanning
Scan base images for vulnerabilities regularly:
```bash
# Trivy (Docker/Podman compatible)
trivy image python:3.12-slim
trivy image nginx:1.27-alpine
trivy image node:22-alpine
# CI integration
trivy image --exit-code 1 --severity HIGH,CRITICAL git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
```
Update base images quarterly or when CVEs are published.
### Rate Limiting at Deployment Level
The application-level rate limiter (`BANGUI_RATE_LIMIT_*` env vars) handles API requests. Add deployment-level protection:
**Nginx** (existing reverse proxy):
```nginx
# Limit concurrent connections per IP (limit_conn_zone belongs in the http context)
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
server {
    limit_conn conn_limit 100;
}
```
**Fail2ban** (already running):
- BanGUI manages fail2ban jails
- Additional deployment-level rate limits should target infrastructure endpoints (SSH, management UIs), not BanGUI itself
### Audit Logging
All authentication events are logged via structlog:
| Event | Log Key | Severity |
|-------|---------|----------|
| Login success | `auth_login_success` | INFO |
| Login failure | `auth_login_failure` | WARNING |
| Session created | `session_created` | INFO |
| Session destroyed | `session_destroyed` | INFO |
| Session expired | `session_expired` | INFO |
Forward these logs to a SIEM or log aggregator for security monitoring. See [Structured Logging](#structured-logging) below.
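Before a SIEM is in place, the same events can be tallied straight from the JSON log stream. This is a minimal sketch that assumes one JSON object per line with the event name in the `event` field; the function name `count_auth_events` is hypothetical.

```python
import json
from collections import Counter

# Event names taken from the audit table above.
WATCHED_EVENTS = {
    "auth_login_success",
    "auth_login_failure",
    "session_created",
    "session_destroyed",
    "session_expired",
}


def count_auth_events(lines):
    """Count authentication and session events in an iterable of JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks
        if not isinstance(record, dict):
            continue
        event = record.get("event")
        if isinstance(event, str) and event in WATCHED_EVENTS:
            counts[event] += 1
    return counts
```

A spike in `auth_login_failure` relative to `auth_login_success` is the kind of signal worth alerting on.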
---
## Performance Tuning
### SQLite Performance
SQLite is single-writer. Under write-heavy load (blocklist imports, history writes), writes may queue.
**WAL mode** (default, do not disable):
```
PRAGMA journal_mode=WAL; -- Already enabled by default
```
**Synchronous mode** for production:
```
PRAGMA synchronous=NORMAL; -- Balanced (not FULL, not OFF)
```
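In WAL mode this setting keeps the database consistent across process crashes and power loss while maintaining good write performance. The trade-off is durability: the most recent commits may be rolled back after a power failure or OS crash.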
**Cache size** (increase for production):
```
-- In-memory page cache: 64 MiB (adjust based on available RAM)
PRAGMA cache_size=-65536; -- negative value = size in KiB
```
**temp_store** for large sorts:
```
PRAGMA temp_store=MEMORY;
```
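The pragmas above can be applied from Python's standard `sqlite3` module. Note that `journal_mode=WAL` is stored in the database file, whereas `synchronous`, `cache_size`, and `temp_store` are per-connection settings and must be reissued on every new connection. The function name `open_tuned` is hypothetical, and this is a sketch rather than how BanGUI itself opens its database.

```python
import sqlite3


def open_tuned(path):
    """Open a SQLite database and apply the pragmas described above."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # persistent: stored in the DB file
    conn.execute("PRAGMA synchronous=NORMAL")  # per-connection
    conn.execute("PRAGMA cache_size=-65536")   # per-connection, 64 MiB
    conn.execute("PRAGMA temp_store=MEMORY")   # per-connection
    return conn
```

Reading a pragma back (for example `conn.execute("PRAGMA journal_mode").fetchone()`) is a cheap way to confirm that a setting took effect.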
**Read performance**:
- Most reads are point queries by IP or jail name — indexes handle this efficiently
- Large history scans (dashboard) — paginate, use `LIMIT/OFFSET`
- Avoid `SELECT *` on large tables — always specify needed columns
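The pagination and column-selection advice might look like this in practice. The table `ban_history` and its columns are hypothetical stand-ins for whatever the real schema uses.

```python
import sqlite3


def fetch_history_page(conn, page, page_size=50):
    """Return one page of history rows, newest first, selecting named columns only."""
    offset = (page - 1) * page_size
    return conn.execute(
        # Hypothetical table and column names.
        "SELECT id, ip, jail FROM ban_history "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
```

Binding `LIMIT` and `OFFSET` as parameters keeps the statement text constant, and ordering by the primary key lets SQLite walk the index instead of sorting.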
### Gzip Compression
Already enabled in nginx.conf. Verify effective compression:
```bash
curl -H "Accept-Encoding: gzip" -I http://localhost:8080/api/v1/dashboard/status
# Should show: Content-Encoding: gzip
```
### Backend Performance
**Startup warm-up**: On first request after start, caches are cold. First blocklist query may be slower. This is normal — subsequent requests hit cache.
**Memory tuning**:
```yaml
# docker-compose.yml: increase if the backend is OOM-killed
services:
  backend:
    deploy:
      resources:
        limits:
          memory: 1024M  # Up from 512M for large blocklists
```
**Single worker enforced**: The session cache is process-local. Multiple workers would cause random logouts. This is intentional — scale horizontally via orchestration, not vertically via workers.
### Frontend Performance
**Static asset caching** (already configured):
```
location /assets/ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}
```
**Bundle size**: Production build uses esbuild minification. Monitor with:
```bash
du -sh frontend/dist/
ls -lh frontend/dist/assets/*.js
```
### Database Maintenance
**Periodic checkpoint** (production, monthly or after large blocklist imports):
```bash
sqlite3 /data/bangui.db "PRAGMA wal_checkpoint(FULL);"
```
**Analyze for query planner** (after bulk inserts/deletes):
```bash
sqlite3 /data/bangui.db "ANALYZE;"
```
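Both maintenance commands can be scripted for a scheduled job. The sketch below uses only the standard library; `run_maintenance` is a hypothetical name.

```python
import sqlite3


def run_maintenance(path):
    """Checkpoint the WAL, then refresh query planner statistics.

    Returns the (busy, wal_frames, checkpointed_frames) row reported by
    PRAGMA wal_checkpoint.
    """
    conn = sqlite3.connect(path)
    try:
        result = conn.execute("PRAGMA wal_checkpoint(FULL)").fetchone()
        conn.execute("ANALYZE")
        conn.commit()
        return result
    finally:
        conn.close()
```

A non-zero first value in the returned row means the checkpoint was blocked by another connection and did not complete, so the job should be retried later.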
---
## Monitoring Setup
### Health Check Endpoint
`GET /api/v1/health` — primary monitoring target.
| Status | HTTP Code | Meaning |
|--------|-----------|---------|
| `ok` | 200 | All components healthy |
| `degraded` | 200 | Some components unhealthy — investigate |
| `unavailable` | 503 | fail2ban unreachable — container will be restarted |
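Because `degraded` also returns HTTP 200, a monitor has to inspect the response body as well as the status code. The sketch below assumes the body is a JSON object with a top-level `status` field; that shape, and the name `classify_health`, are assumptions rather than a documented contract.

```python
import json


def classify_health(http_code, body):
    """Map a /api/v1/health response to 'ok', 'warning', or 'critical'."""
    if http_code != 200:
        return "critical"  # 503 or any unexpected code
    try:
        status = json.loads(body).get("status")  # assumed field name
    except (ValueError, TypeError, AttributeError):
        return "critical"  # body missing or not a JSON object
    if status == "ok":
        return "ok"
    if status == "degraded":
        return "warning"
    return "critical"
```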
### Structured Logging
All logs are structured (JSON via structlog). Key fields:
| Log field | Description |
|-----------|-------------|
| `event` | Event name (e.g., `auth_login_success`) |
| `request_id` | Per-request correlation ID |
| `user_id` | Session user (if authenticated) |
| `duration_ms` | Request duration |
| `component` | Component name (e.g., `scheduler`, `database`) |
**Log levels**:
| Level | Use |
|-------|-----|
| DEBUG | Detailed debugging (query SQL, cache hits) |
| INFO | Operational events (startup, shutdown, login, ban action) |
| WARNING | Recoverable issues (cache miss, lock contention) |
| ERROR | Failures requiring attention (DB error, fail2ban offline) |
**Configure via env**:
```
BANGUI_LOG_LEVEL=info # debug, info, warning, error
```
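For illustration, this is one way an application could translate the variable into a standard-library logging level. The default of `info` and the case-insensitive matching are assumptions, and `resolve_log_level` is a hypothetical name, not BanGUI's actual configuration code.

```python
import logging
import os

_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def resolve_log_level(env=os.environ):
    """Translate BANGUI_LOG_LEVEL into a logging level (assumed default: INFO)."""
    name = env.get("BANGUI_LOG_LEVEL", "info").strip().lower()
    if name not in _LEVELS:
        raise ValueError(f"unsupported BANGUI_LOG_LEVEL: {name!r}")
    return _LEVELS[name]
```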
### Log Aggregation
**Docker Compose**: the `json-file` driver with rotation keeps container logs on the host, bounded in size, where a log shipper can collect them:
```yaml
services:
  backend:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```
**External aggregators**:
```yaml
# Fluentd example
services:
  backend:
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: bangui-backend
```
**ELK Stack** — send JSON logs directly to Logstash or via Filebeat.
### Metrics to Monitor
| Metric | Source | Alert Threshold |
|--------|--------|----------------|
| Health check failures | `/api/v1/health` | 3 consecutive → container restart |
| Backend memory | `docker stats` | >450M (of 512M limit) |
| Backend CPU | `docker stats` | >80% sustained |
| Disk usage (`/data`) | `df -h` | >80% |
| fail2ban container restarts | `docker ps` | >2/hour |
| Backend container restarts | `docker ps` | >2/hour |
| Database file size | `ls -lh /data/bangui.db` | Grows >10MB/day indicates issue |
| Session count | `/api/v1/sessions` | Sudden drop indicates cache issue |
| Blocklist import duration | Logs (`blocklist_import_completed`) | >5 minutes may indicate performance issue |
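As an example of turning one row of the table into a check, the disk-usage threshold can be evaluated with the standard library. The function names are hypothetical, and the result may differ slightly from `df`, which excludes reserved blocks from its percentage.

```python
import shutil


def disk_usage_percent(path):
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alert(path, threshold=80.0):
    """True when usage exceeds the threshold from the metrics table."""
    return disk_usage_percent(path) > threshold
```

In a deployment this would be called as `disk_alert("/data")` from a cron job or sidecar.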
### Uptime Monitoring
**External checks**:
- Monitor `https://your-domain.com/api/v1/health` from multiple geographic locations
- Use services: Better Uptime, UptimeRobot, Pingdom
- Alert on: HTTP 503, HTTP 200 + `degraded` status, connection timeout
### Alerting
**Critical (PagerDuty / immediate)**:
- Health check HTTP 503 for >30 seconds
- Backend OOM kill (exit code 137)
- fail2ban offline for >5 minutes
**Warning (Slack / email)**:
- Health check returns `degraded`
- Disk usage >80%
- Memory usage >450M
- Backend restarts >2/hour
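Duration-based rules such as "HTTP 503 for >30 seconds" need a small amount of state so that a single failed probe does not page anyone. A minimal sketch, assuming the monitor keeps a list of `(unix_time, http_code)` samples, oldest first; `sustained_503` is a hypothetical name.

```python
def sustained_503(samples, min_seconds=30):
    """True if the latest unbroken run of 503 responses spans more than min_seconds."""
    run_start = None
    for timestamp, code in reversed(samples):
        if code != 503:
            break
        run_start = timestamp
    if run_start is None:
        return False  # no samples, or the latest probe succeeded
    return samples[-1][0] - run_start > min_seconds
```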
---
## Scaling Guidelines
### Horizontal Scaling
BanGUI is **designed for horizontal scaling** via container orchestration (not multiple workers):
```
┌──────────────────────────────────────────────┐
│                Load Balancer                 │
│          (nginx, HAProxy, Traefik)           │
└──────────────────────┬───────────────────────┘
         ┌─────────────┼─────────────┐
         ▼             ▼             ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │ Backend  │  │ Backend  │  │ Backend  │
    │ (inst 1) │  │ (inst 2) │  │ (inst 3) │
    └────┬─────┘  └────┬─────┘  └────┬─────┘
         │             │             │
         └─────────────┼─────────────┘
               ┌───────────────┐
               │   Scheduler   │
               │   Lock (DB)   │ ← Only one instance runs jobs
               └───────────────┘
               ┌───────────────┐
               │    SQLite     │
               │  (shared fs)  │
               └───────────────┘
```
**How it works**:
- Scheduler lock ensures only one instance runs background jobs
- Session cache is per-instance — use sticky sessions at load balancer, OR configure `BANGUI_SESSION_CACHE=redis` for shared sessions
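- SQLite on shared storage: every instance must open the same database file. Treat this with caution: SQLite's documentation states that WAL mode requires all processes to run on the same host and does not work over a network filesystem, so NFS or GlusterFS mounts are not safe for a WAL database, and block storage such as AWS EBS normally attaches to one host at a time. In practice this points to running the instances on one host with a shared local volume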
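To make the scheduler-lock idea concrete, here is one common way to build a database-backed lease. This is a sketch under stated assumptions, not BanGUI's actual implementation: the table `scheduler_lock`, the function `try_acquire_lock`, and the 60-second lease are all hypothetical. It relies on SQLite's UPSERT syntax (SQLite 3.24 or newer).

```python
import sqlite3
import time


def try_acquire_lock(conn, owner, ttl=60, now=None):
    """Try to take or renew a single-row lease. Returns True if `owner` holds it."""
    now = time.time() if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    # Insert the lease, or take it over only if it has expired or is already ours.
    conn.execute(
        "INSERT INTO scheduler_lock (id, owner, expires_at) VALUES (1, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET "
        " owner = excluded.owner, expires_at = excluded.expires_at "
        "WHERE scheduler_lock.expires_at <= ? "
        " OR scheduler_lock.owner = excluded.owner",
        (owner, now + ttl, now),
    )
    conn.commit()
    row = conn.execute("SELECT owner FROM scheduler_lock WHERE id = 1").fetchone()
    return row[0] == owner
```

Each instance would call this before a job run and skip the run when it returns `False`. Because the lease expires, a crashed holder cannot block the others forever.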
### Stateless Design
For true stateless scaling without sticky sessions, migrate session cache to Redis:
```yaml
# docker-compose.yml
services:
  backend:
    environment:
      - BANGUI_SESSION_CACHE=redis
      - BANGUI_REDIS_URL=redis://redis:6379/0
    depends_on:
      redis:
        condition: service_healthy
  redis:
    image: docker.io/library/redis:7-alpine
    healthcheck:  # required by "condition: service_healthy" above
      test: ["CMD", "redis-cli", "ping"]
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 256M
```
Benefits:
- Sessions shared across all instances → no sticky sessions needed
- Load balancer can distribute freely
- Scales linearly
Trade-offs:
- Redis is another dependency to monitor
- Redis persistence required for session survival across Redis restarts
- Redis failure causes mass logouts
### Database Scaling
SQLite does not support read replicas. Scaling reads is limited.
**Read scaling** (if needed):
- Cache aggressively — BanGUI caches blocklist data in-memory
- Add read-only views for dashboard queries
- Consider periodic snapshot exports to separate read-optimized store
**Write scaling**:
- Single writer only — SQLite WAL helps but doesn't parallelize writes
- If write throughput becomes a bottleneck, consider:
- Periodic batching (already used for blocklist imports)
- Sharding by jail (separate DB per jail) — architectural change
- Migration to PostgreSQL — significant effort
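The batching option reduces write pressure by committing once per batch rather than once per row. A minimal sketch follows; the table `blocklist_entries` and the function `insert_batched` are hypothetical and do not reflect BanGUI's real import code.

```python
import sqlite3


def insert_batched(conn, rows, batch_size=1000):
    """Insert (ip, source) rows in batches. Returns the number of rows submitted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocklist_entries ("
        " ip TEXT PRIMARY KEY, source TEXT)"
    )
    submitted = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One transaction per batch: commits on success, rolls back on error.
        with conn:
            conn.executemany(
                "INSERT OR IGNORE INTO blocklist_entries (ip, source) VALUES (?, ?)",
                batch,
            )
        submitted += len(batch)
    return submitted
```

Smaller batches release the write lock more often, which lets other writers make progress between batches at the cost of more commits.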
### CDN for Static Assets
For large-scale deployments, serve `/assets/` from a CDN:
```nginx
# Replace /assets/ proxy with CDN origin
location /assets/ {
    proxy_pass https://your-cdn.cloudfront.net/assets/;
    proxy_cache_valid 1y;
    add_header Cache-Control "public, immutable";
}
```
Benefits:
- Reduces frontend container load
- Assets served from edge locations close to users
- Reduces bandwidth costs
### Autoscaling
**Docker Swarm**: Use the `labels` + `update_config` pattern for rolling updates. Swarm has no built-in autoscaler, so autoscaling requires an external controller driven by metrics (e.g., Prometheus).
**Kubernetes**: HorizontalPodAutoscaler (HPA) based on CPU/memory:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bangui-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bangui-backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
### Load Balancer Configuration
**Health check**:
```haproxy
# HAProxy example
backend bangui_backend
    option httpchk GET /api/v1/health
    http-check expect status 200
    server backend1 backend:8000 check
```
**Sticky sessions** (if NOT using Redis):
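```haproxy
# HAProxy: cookie-based persistence (appsession was removed in HAProxy 1.6)
backend bangui_backend
    cookie SERVERID insert indirect nocache  # cookie name is an example
    server backend1 backend:8000 check cookie backend1
```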
**Connection limits**:
```haproxy
# Per-backend limit to prevent overload
server backend1 backend:8000 maxconn 50
```
---
## Next Steps