Lukas 94d6352d1d Fix health check endpoint to return 503 when fail2ban is offline

The health check endpoint now properly indicates service unavailability:
- Returns HTTP 200 when fail2ban is online
- Returns HTTP 503 when fail2ban is offline

This allows Docker and other orchestration tools to correctly detect when
fail2ban is unreachable and automatically restart the backend container,
preventing the situation where Docker treats the container as healthy
despite fail2ban being down.

Changes:
- Update GET /api/health to return 503 on fail2ban offline
- Return appropriate JSON response bodies for each state
- Update tests to verify both online (200) and offline (503) scenarios
- Update Dockerfile HEALTHCHECK documentation
- Add Health Checks section to Deployment.md documentation

All tests pass with 100% coverage on health.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-04-30 21:56:42 +02:00

Deployment Guide

Health Checks

The backend container includes a health check endpoint at GET /api/health that reports application and fail2ban daemon status:

HTTP 200 with {"status": "ok", "fail2ban": "online"} — backend is healthy and fail2ban is reachable
HTTP 503 with {"status": "unavailable", "fail2ban": "offline"} — fail2ban is unreachable (backend will restart)
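The two states above can be sketched as a small pure function. This is a minimal illustration, not the project's health.py; the function name `health_response` is hypothetical, and only the status codes and JSON bodies come from this guide.

```python
def health_response(fail2ban_online: bool) -> tuple[int, dict[str, str]]:
    """Return (HTTP status, JSON body) for GET /api/health.

    Hypothetical helper: the real handler lives in the backend and may
    be structured differently.
    """
    if fail2ban_online:
        # Backend is up and fail2ban answered.
        return 200, {"status": "ok", "fail2ban": "online"}
    # Backend is up but fail2ban could not be reached.
    return 503, {"status": "unavailable", "fail2ban": "offline"}
```

Keeping the mapping in one place makes it easy to test both scenarios without a running fail2ban daemon.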

Docker Health Check:

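The Dockerfile includes a HEALTHCHECK that queries the endpoint. The check fails when the endpoint returns HTTP 503, and Docker marks the container unhealthy after 3 consecutive failures (about 90 seconds with the default 30-second interval). Docker Engine on its own only reports the unhealthy status; the restart is carried out by an orchestrator such as Docker Swarm or by a separate watchdog.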
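A health probe of this kind boils down to "exit 0 on HTTP 200, non-zero otherwise". The sketch below shows that logic with the standard library only. It is an assumption about how such a probe could look, not the command used in the project's Dockerfile, and the URL in the usage note is a placeholder.

```python
import http.client
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return 0 if the endpoint answers HTTP 200, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 1
    except (OSError, http.client.HTTPException):
        # urllib raises HTTPError for 503 (a subclass of OSError);
        # refused connections and timeouts are OSError as well.
        return 1


def seconds_until_unhealthy(interval_s: int = 30, retries: int = 3) -> int:
    """Approximate time until the container is flagged unhealthy."""
    return interval_s * retries
```

A container health command could call `sys.exit(probe("http://localhost:<port>/api/health"))`, where the port is whatever the backend listens on. With the default interval and retry count, `seconds_until_unhealthy()` gives the 90 seconds mentioned above.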

Why 503 for offline fail2ban?

If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy, which can mask infrastructure failures. By returning 503 when fail2ban is unreachable, the backend exposes the failure to whatever supervises the container (Docker health status, Docker Swarm, Kubernetes probes), so the backend container can be restarted automatically until fail2ban recovers.

Resource Allocation

All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:

Isolation: A misbehaving container cannot crash others or the host
Predictability: Reservations guarantee minimum resources even under load
Efficiency: Unused reserved capacity can be borrowed by other containers

Container Resource Limits

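| Container | CPU limit | Memory limit | Reserved CPU | Reserved memory | Purpose |
|-----------|-----------|--------------|--------------|-----------------|---------|
| fail2ban | 0.5 | 128M | 0.1 | 64M | Monitors logs and bans IPs; typically idle |
| backend | 2.0 | 512M | 1.0 | 256M | Core app: database, fail2ban API, config management |
| frontend | 0.5 | 128M | 0.25 | 64M | Nginx: serves SPA + API proxy |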
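The table can be sanity-checked with a few lines of code: every reservation should fit under its limit, and the summed reservations tell you the minimum capacity the host must provide. The sketch below hard-codes the values from the table; the `Resources` class and helper names are illustrative, and CPU is expressed in millicores to avoid floating-point sums.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    limit_mcpu: int      # CPU limit in millicores (0.5 CPU = 500)
    limit_mem_m: int     # memory limit in M
    reserved_mcpu: int   # CPU reservation in millicores
    reserved_mem_m: int  # memory reservation in M


# Values copied from the table above.
CONTAINERS = {
    "fail2ban": Resources(500, 128, 100, 64),
    "backend": Resources(2000, 512, 1000, 256),
    "frontend": Resources(500, 128, 250, 64),
}


def over_reserved(containers: dict[str, Resources]) -> list[str]:
    """Names of containers whose reservation exceeds their limit."""
    return [
        name
        for name, r in containers.items()
        if r.reserved_mcpu > r.limit_mcpu or r.reserved_mem_m > r.limit_mem_m
    ]


def totals(containers: dict[str, Resources]) -> dict[str, int]:
    """Sum limits and reservations across all containers."""
    rows = containers.values()
    return {
        "limit_mcpu": sum(r.limit_mcpu for r in rows),
        "limit_mem_m": sum(r.limit_mem_m for r in rows),
        "reserved_mcpu": sum(r.reserved_mcpu for r in rows),
        "reserved_mem_m": sum(r.reserved_mem_m for r in rows),
    }
```

For the defaults this gives 1.35 CPU and 384M reserved in total, with a worst case of 3.0 CPU and 768M if every container reaches its limit at once.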

Rationale

fail2ban: Lightweight log monitoring. CPU occasionally spikes during ban processing, but memory usage stays minimal.
backend: Heavy lifting—Python runtime, SQLite database, background jobs. May need extra memory for large blocklists. Reservation of 1.0 CPU ensures responsive API even when frontend is busy.
frontend: Nginx is efficient. Limit of 0.5 CPU and 128M memory is more than sufficient for reverse proxy duties.

Memory Considerations

Backend Memory Requirements

The backend typically runs in 256–512M under normal load. Memory usage depends on:

Blocklist size: Large blocklists (>1M entries) require more heap space
Cache warmth: First query after startup may require more memory as caches fill
Concurrent connections: Each active user session uses a small amount of memory

Tuning: If you see OOM kills in logs, increase backend limits and reservations (e.g., 1024M limit). Test under realistic load before finalizing.

Frontend Memory Usage

Nginx typically uses less than 50M of memory. If you see memory pressure on the frontend, check for:

Misconfigured cache headers on static assets
Large log volumes (nginx access logs)

Docker Swarm & Kubernetes

For production deployments using orchestration platforms:

Docker Swarm

The deploy sections in docker-compose.yml are compatible with docker stack deploy:

docker stack deploy -c Docker/docker-compose.yml bangui

Swarm respects the same limits and reservations fields.

Kubernetes

For Kubernetes, translate resource constraints to equivalent resources fields in your deployment manifests:

containers:
  - name: backend
    image: git.lpl-mind.de/lukas.pupkalipinski/bangui/backend:latest
    resources:
      limits:
        cpu: "2"
        memory: "512Mi"
      requests:
        cpu: "1"
        memory: "256Mi"

Kubernetes equivalent mappings:

Docker deploy.resources.limits → Kubernetes resources.limits
Docker deploy.resources.reservations → Kubernetes resources.requests
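The mapping can be automated with a small converter. The sketch below is illustrative only: the function names are hypothetical, and it assumes that Compose memory suffixes such as `M` are binary units, so `512M` becomes `512Mi`.

```python
def to_k8s_cpu(cpus: str) -> str:
    """Convert a Compose CPU value ("0.5", "2.0") to Kubernetes notation."""
    millicores = round(float(cpus) * 1000)
    if millicores % 1000 == 0:
        return str(millicores // 1000)  # whole cores: "2"
    return f"{millicores}m"             # fractions: "500m"


def to_k8s_memory(mem: str) -> str:
    """Convert "512M" to "512Mi" (assumes binary units on the Docker side)."""
    suffixes = {"K": "Ki", "M": "Mi", "G": "Gi"}
    unit = mem[-1].upper()
    if unit not in suffixes:
        raise ValueError(f"unsupported memory value: {mem!r}")
    return f"{int(mem[:-1])}{suffixes[unit]}"


def to_k8s_resources(limits: dict[str, str], reservations: dict[str, str]) -> dict:
    """Build a Kubernetes `resources` mapping from Compose limits/reservations."""
    return {
        "limits": {
            "cpu": to_k8s_cpu(limits["cpus"]),
            "memory": to_k8s_memory(limits["memory"]),
        },
        "requests": {
            "cpu": to_k8s_cpu(reservations["cpus"]),
            "memory": to_k8s_memory(reservations["memory"]),
        },
    }
```

Feeding in the backend values (`{"cpus": "2.0", "memory": "512M"}` and `{"cpus": "1.0", "memory": "256M"}`) reproduces the `resources` block from the manifest above.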

Monitoring Resource Usage

Docker Compose (Development)

docker stats

Shows real-time CPU and memory usage for all running containers.
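For scripted checks, `docker stats --no-stream --format '{{json .}}'` prints one JSON object per container instead of the live table. The sketch below flags containers whose memory usage is close to their limit. It assumes the `Name` and `MemPerc` fields in that output; the sample rows and the 80% threshold are made up for illustration.

```python
import json

# Illustrative sample, not real measurements.
SAMPLE = """
{"Name": "backend", "CPUPerc": "12.00%", "MemPerc": "91.50%"}
{"Name": "frontend", "CPUPerc": "0.50%", "MemPerc": "20.10%"}
"""


def containers_over_threshold(stats_output: str, mem_threshold_pct: float = 80.0) -> list[str]:
    """Return names of containers at or above the memory threshold."""
    flagged = []
    for line in stats_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        row = json.loads(line)
        mem_pct = float(row["MemPerc"].rstrip("%"))
        if mem_pct >= mem_threshold_pct:
            flagged.append(row["Name"])
    return flagged
```

A container that stays above the threshold under normal load is a candidate for a higher memory limit.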

Production (Docker Swarm / Kubernetes)

Use the monitoring stack that fits your platform:

Docker Swarm: Prometheus + Grafana
Kubernetes: Metrics Server + dashboard or Prometheus

Environment Variables

Resource limits are configured in Docker/docker-compose.yml and cannot be overridden via environment variables. To adjust limits:

1. Edit Docker/docker-compose.yml
2. Modify the deploy.resources.limits and deploy.resources.reservations sections
3. Restart containers: make down && make up

Troubleshooting

| Issue | Symptom | Solution |
|-------|---------|----------|
| Backend OOM kills | "Exit code 137" in logs | Increase backend `memory` limit |
| Throttling | CPU at 100%, requests slow | Increase CPU limit or optimize code |
| Service startup timeout | Containers not becoming "healthy" | Increase reservation to guarantee capacity at startup |
| Host unresponsive | System-wide lag | Reduce container limits to prevent host starvation |
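Exit code 137 follows the shell convention of 128 plus the signal number, so it means the process was killed by signal 9 (SIGKILL). The kernel's OOM killer uses that signal, but so does `docker kill`, so it is worth confirming with the `OOMKilled` field in `docker inspect` before raising limits. The helper below is a small illustrative decoder for POSIX systems; its name is hypothetical.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short description."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # shell convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number unknown on this platform
        return f"killed by {name}"
    return f"exited with error code {code}"
```

On Linux, `describe_exit_code(137)` reports SIGKILL, while 143 (a regular `docker stop`) reports SIGTERM.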

Next Steps

Development: Run make up to start with default limits
Staging: Test with realistic data volumes and monitor resource usage
Production: Adjust limits based on observed usage patterns, then commit changes
