feat(backend): implement graceful shutdown for container stop

Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00
parent f6c3c02183
commit b631c1c546
10 changed files with 383 additions and 20 deletions
--- a/Docs/Deployment.md
+++ b/Docs/Deployment.md
@@ -1,11 +1,97 @@
 # Deployment Guide

+## Graceful Shutdown
+
+BanGUI implements graceful shutdown to ensure in-flight operations complete before the process exits. This prevents:
+- Incomplete blocklist imports leaving stale data
+- Interrupted ban requests
+- Corrupted background job states
+- Unclean database connection closures
+
+### How It Works
+
+1. **SIGTERM received** — Docker sends SIGTERM when `docker stop` is called
+2. **Uvicorn catches SIGTERM** — Notifies the FastAPI lifespan handler
+3. **Lifespan shutdown begins** — Scheduler stops accepting new jobs
+4. **In-flight tasks drain** — Up to 25 seconds for running jobs to complete
+5. **Resources cleaned up** — HTTP session, external logging, scheduler lock, DB connection
+
+### Docker Configuration
+
+```yaml
+backend:
+  stop_grace_period: 30s  # Docker waits 30s after SIGTERM before sending SIGKILL
+```
+
+Docker waits 30s (`stop_grace_period`) after SIGTERM before it sends SIGKILL. The Python drain timeout is 25s, which leaves a 5s safety margin for the remaining cleanup steps.
+
+### Shutdown Sequence
+
+| Step | Action | Timeout |
+|------|--------|---------|
+| 1 | Scheduler stops accepting new jobs | Immediate |
+| 2 | Wait for pending background tasks | 25s max |
+| 3 | Close HTTP session | Immediate |
+| 4 | Flush external logging handler | Immediate |
+| 5 | Release scheduler lock | Immediate |
+| 6 | Close database connection | Immediate |
+
+### Background Tasks That Drain
+
+- Blocklist imports
+- Geo IP cache resolutions
+- History sync operations
+- Geo cache cleanup
+- Geo cache flush
+- Session cleanup
+- Rate limiter cleanup
+- Scheduler lock heartbeat
+
+### Monitoring Shutdown
+
+Logs during shutdown:
+
+```
+bangui_shutting_down timeout_seconds=25.0
+scheduler_stopped_accepting_jobs
+waiting_for_pending_tasks count=3 timeout_seconds=25.0
+pending_tasks_completed
+http_session_closed
+external_logging_shutdown_complete
+scheduler_lock_released
+bangui_shut_down
+```
+
+If tasks exceed the timeout:
+```
+pending_tasks_timeout cancelled_count=3
+```
+
+### Rolling Deployments
+
+During rolling deployments:
+1. Old instance releases scheduler lock immediately on shutdown
+2. New instance acquires lock without waiting for TTL expiry
+3. Zero downtime for background job execution
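
The handover above depends on the lock being released explicitly rather than left to expire. The sketch below shows that idea with an in-memory SQLite table; the table name `demo_scheduler_lock`, its columns, and both helper functions are assumptions for illustration and do not reflect BanGUI's actual `scheduler_lock` schema.

```python
import sqlite3
import time


def init_lock_table(conn):
    # Single-row table: the CHECK constraint allows only id = 1.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS demo_scheduler_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "owner TEXT NOT NULL, "
        "expires_at REAL NOT NULL)"
    )


def try_acquire(conn, owner, ttl, now=None):
    """Take the lock if it is free or its TTL has expired."""
    now = time.time() if now is None else now
    with conn:  # one transaction: purge expired lock, then try to insert
        conn.execute("DELETE FROM demo_scheduler_lock WHERE expires_at <= ?", (now,))
        cur = conn.execute(
            "INSERT OR IGNORE INTO demo_scheduler_lock (id, owner, expires_at) "
            "VALUES (1, ?, ?)",
            (owner, now + ttl),
        )
        return cur.rowcount == 1


def release(conn, owner):
    """Release the lock immediately, as a graceful shutdown would."""
    with conn:
        conn.execute("DELETE FROM demo_scheduler_lock WHERE owner = ?", (owner,))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_lock_table(conn)
    print(try_acquire(conn, "old-instance", ttl=60, now=0))
    print(try_acquire(conn, "new-instance", ttl=60, now=1))
    release(conn, "old-instance")
    print(try_acquire(conn, "new-instance", ttl=60, now=2))
```

Without the explicit `release`, the new instance in this sketch would have to wait until the old lock's `expires_at` has passed.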
+
+---
+
 ## Health Checks

-The backend container includes a health check endpoint at `GET /api/health` that reports application and fail2ban daemon status:
+The backend container includes a health check endpoint at `GET /api/v1/health` that reports application and component status:

- **HTTP 200** with `{"status": "ok", "fail2ban": "online"}` — backend is healthy and fail2ban is reachable
- **HTTP 503** with `{"status": "unavailable", "fail2ban": "offline"}` — fail2ban is unreachable (backend will restart)
+- **HTTP 200** with `{"status": "ok", ...}` — all components healthy
+- **HTTP 200** with `{"status": "degraded", ...}` — some components unhealthy (e.g., database error) but fail2ban reachable
+- **HTTP 503** with `{"status": "unavailable", ...}` — fail2ban is unreachable (backend will restart)
+
+**Component checks performed:**
+
+| Component | Check | Notes |
+|---|---|---|
+| fail2ban | Socket ping via cached status | Returns 503 when offline |
+| database | Opens and closes a test connection | Returns degraded when failing |
+| scheduler | `scheduler.running` attribute | Returns degraded when stopped |
+| cache | Session cache presence | Returns degraded when not initialised |
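
The status rules and the component table above reduce to a small decision function. The sketch below is one way to express them; the function name `summarize_health` and the boolean-per-component input are assumptions, not the endpoint's real implementation.

```python
from typing import Dict, Tuple


def summarize_health(components: Dict[str, bool]) -> Tuple[int, str]:
    """Map per-component health flags to (HTTP status code, status string)."""
    # fail2ban is the only component whose failure yields a 503.
    if not components.get("fail2ban", False):
        return 503, "unavailable"
    # Everything healthy, including fail2ban.
    if all(components.values()):
        return 200, "ok"
    # fail2ban reachable, but database, scheduler or cache is unhealthy.
    return 200, "degraded"


if __name__ == "__main__":
    print(summarize_health(
        {"fail2ban": True, "database": False, "scheduler": True, "cache": True}
    ))
```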

 **Docker Health Check:**

@@ -13,7 +99,16 @@ The Dockerfile includes a HEALTHCHECK that queries the endpoint. Docker interpre

 **Why 503 for offline fail2ban?**

-If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This can mask infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) will automatically restart the backend container until fail2ban recovers.
+If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This masks infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) automatically restart the backend container until fail2ban recovers.
+
+**Docker Compose health check parameters:**
+
+| Parameter | Value | Rationale |
+|---|---|---|
+| `interval` | 30s | Balance between responsiveness and load |
+| `timeout` | 10s | Allows for slow probe on busy system |
+| `retries` | 3 | ~90 seconds before restart (3 × 30s) |
+| `start_period` | 40s | Allows app and fail2ban to fully start |

 ---

@@ -277,6 +372,60 @@ Resource limits are configured in `Docker/docker-compose.yml` and cannot be over

 ---

+## Disaster Recovery
+
+### Database Migration Failures
+
+If a migration fails mid-transaction, the application refuses to start. This is intentional — it prevents inconsistent schema states.
+
+**Diagnosis:**
+
+1. Check current schema version:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
+   ```
+
+2. Check which tables exist:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
+   ```
+
+3. Check application logs for the specific error.
+
+**Recovery Options:**
+
+- **Automatic rollback**: Next startup re-applies the same migration from scratch
+- **Manual completion**: Apply the migration manually, then insert the version record:
+  ```bash
+  # Run everything in one sqlite3 session; separate invocations do not share a transaction
+  sqlite3 /var/lib/bangui/bangui.db <<'SQL'
+  BEGIN IMMEDIATE;
+  -- Run your SQL here
+  INSERT INTO schema_migrations (version) VALUES (<version>);  -- replace <version> with the migration number
+  COMMIT;
+  SQL
+  ```
+- **Full reset** (development only): `rm bangui.db bangui.db-wal bangui.db-shm`
+
+**Prevention:**
+
+- Never modify `bangui.db` manually while an instance is running
+- Always back up before major migrations
+- Monitor startup logs for `migrating_database_schema` events
+
+### Orphaned WAL Files
+
+After crashes, SQLite WAL mode may leave orphaned `-wal` and `-shm` files next to the database. The database auto-recovers on next open. If you see WAL-related errors:
+
+```bash
+# Check for orphaned WAL files
+ls -la /var/lib/bangui/bangui.db*
+
+# Force checkpoint to merge WAL into main database
+sqlite3 /var/lib/bangui/bangui.db "PRAGMA wal_checkpoint(FULL);"
+```
+
+See `Docs/DATABASE_MIGRATIONS.md` for full recovery procedures.
+
+---
+
 ## Next Steps

 - **Development**: Run `make up` to start with default limits
--- a/Docs/TROUBLESHOOTING.md
+++ b/Docs/TROUBLESHOOTING.md
@@ -134,6 +134,154 @@ ps aux | grep <pid>

 ---

+## Database Migration Failures
+
+### Application Won't Start After Upgrade
+
+**Symptom:** Application fails to start. Logs show migration errors.
+
+**Cause:** Migration failed mid-transaction. Database left in inconsistent state.
+
+**Diagnosis:**
+```bash
+# Check current schema version
+sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
+
+# List all tables
+sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
+
+# Check logs for specific error
+grep -i migration /var/log/bangui.log
+```
+
+**Solution:**
+
+1. **If migration was auto-rolled back**: Startup will retry the same migration. Start the application again.
+2. **If migration keeps failing**: Check if table already exists:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
+   ```
+   If it exists, manually insert the migration record:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
+   ```
+3. **Full database reset** (development only):
+   ```bash
+   rm /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
+   ```
+
+**Prevention:**
+- Always back up before upgrades: `cp bangui.db bangui.db.backup`
+- Never manually modify database schema
+- Monitor `migrating_database_schema` log events during upgrades
+
+---
+
+### Schema Version Mismatch
+
+**Symptom:** Error: "database schema version X is newer than supported version Y"
+
+**Cause:** Downgraded to older BanGUI version that doesn't support current schema.
+
+**Solution:** Upgrade to a version compatible with the current schema, or restore from backup.
+
+---
+
+## 502 Bad Gateway Errors
+
+### Symptom: Nginx returns 502 Bad Gateway
+
+**Cause:** The backend container is unreachable — either down, restarting, or not yet healthy.
+
+**Diagnosis:**
+
+```bash
+# Check backend container status
+docker ps -a | grep bangui-backend
+
+# Check if backend is responding directly (on the container network)
+docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health
+
+# Check backend logs
+docker logs bangui-backend --tail 50
+```
+
+**Common causes and solutions:**
+
+| Cause | Diagnosis | Solution |
+|---|---|---|
+| Backend restarting | `docker ps` shows backend repeatedly restarting | Check health check timing; may need longer `start_period` |
+| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
+| Startup too slow | `start_period: 40s` not enough on slow hosts | Increase `start_period` in compose file |
+| Port misconfiguration | `expose` vs `ports` mismatch | Ensure backend exposes 8000 and frontend proxies to it |
+
+**Prevention:**
+
+- The `depends_on: condition: service_healthy` ensures the backend is fully started before the frontend proxies requests.
+- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
+- Health check parameters are tuned for typical startup time — adjust `start_period` if the host is slow or resource-constrained.
+
+---
+
+## Graceful Shutdown Issues
+
+### Container Killed Before Tasks Complete
+
+**Symptom:** Logs show `pending_tasks_timeout` and tasks are cancelled mid-execution.
+
+**Cause:** Docker's `stop_grace_period` is too short, or tasks take longer than the 25s graceful timeout.
+
+**Diagnosis:**
+```bash
+# Check if container was killed by SIGKILL
+docker inspect bangui-backend --format '{{.State.ExitCode}}'
+# Exit code 137 = SIGKILL
+```
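
Exit code 137 follows the common convention of 128 plus the signal number (SIGKILL is 9). The helper below decodes such codes; the name `describe_exit_code` is made up for this example, and it assumes a POSIX host where Python's `signal` module knows the signal numbers.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a signal name where possible."""
    if code > 128:
        signum = code - 128  # convention: 128 + signal number
        try:
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit status {code} (not a signal)"


if __name__ == "__main__":
    for code in (0, 137, 143):
        print(code, describe_exit_code(code))
```

A result of `SIGTERM` (143) suggests the process stopped on the initial signal without a clean exit path, while `SIGKILL` (137) suggests it was still running when the grace period ended.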
+
+**Solution:**
+1. Increase `stop_grace_period` in `docker-compose.yml`:
+   ```yaml
+   backend:
+     stop_grace_period: 60s
+   ```
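+2. Note that the Python graceful timeout is 25s, so tasks are still cancelled after 25s unless that timeout is raised as well; keep it below `stop_grace_period` to preserve the margin before Docker's SIGKILL
+3. If tasks still time out, check task code; long-running tasks should handle cancellation gracefully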
+
+### Scheduler Lock Not Released
+
+**Symptom:** After container restart, logs show `Could not acquire scheduler lock`.
+
+**Cause:** Previous instance shut down without releasing the lock, or lock TTL hasn't expired.
+
+**Diagnosis:**
+```bash
+sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
+```
+
+**Solution:**
+```bash
+# Clear stale lock
+sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
+# Restart container
+```
+
+**Prevention:**
+- Graceful shutdown releases lock immediately (not waiting for TTL expiry)
+- Monitor logs for `scheduler_lock_released` on clean shutdown
+
+### In-Flight Requests Dropped
+
+**Symptom:** Client connections closed abruptly during shutdown.
+
+**Cause:** Too short a graceful timeout, or clients not configured to retry.
+
+**Solution:**
+1. Ensure clients implement proper retry logic with backoff
+2. For critical operations, use background tasks with status polling
+3. Increase graceful timeout if network latency is high
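
Client-side retry with backoff (item 1 above) might look like the sketch below. It is a generic illustration: the function name, its defaults, and the choice of `ConnectionError` as the retryable error are assumptions, and the `sleep` parameter exists only so the delays can be replaced in tests.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, retry_on=(ConnectionError,)):
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff, capped at max_delay: 0.5s, 1s, 2s, ...
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out clients that all lost their connection at once.
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("backend restarting")
        return "ok"

    print(retry_with_backoff(flaky, sleep=lambda seconds: None))
```

Retrying is only safe for idempotent requests; for anything else, the background task with status polling from item 2 is the safer route.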
+
+---
+
 ## General Recovery Commands

 Clear all locks: