feat(backend): implement graceful shutdown for container stop

Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00
parent f6c3c02183
commit b631c1c546
10 changed files with 383 additions and 20 deletions
--- a/Docs/Deployment.md
+++ b/Docs/Deployment.md
@@ -1,11 +1,97 @@
 # Deployment Guide

+## Graceful Shutdown
+
+BanGUI implements graceful shutdown to ensure in-flight operations complete before the process exits. This prevents:
+- Incomplete blocklist imports leaving stale data
+- Interrupted ban requests
+- Corrupted background job states
+- Unclean database connection closures
+
+### How It Works
+
+1. **SIGTERM received** — Docker sends SIGTERM when `docker stop` is called
+2. **Uvicorn catches SIGTERM** — Notifies the FastAPI lifespan handler
+3. **Lifespan shutdown begins** — Scheduler stops accepting new jobs
+4. **In-flight tasks drain** — Up to 25 seconds for running jobs to complete
+5. **Resources cleaned up** — HTTP session, external logging, scheduler lock, DB connection
+
+### Docker Configuration
+
+```yaml
+backend:
+  stop_grace_period: 30s  # Docker waits 30s after SIGTERM before sending SIGKILL
+```
+
+With a `stop_grace_period` of 30s and a 25s drain timeout in the Python code, a 5s safety margin remains before Docker sends SIGKILL.
+
+### Shutdown Sequence
+
+| Step | Action | Timeout |
+|------|--------|---------|
+| 1 | Scheduler stops accepting new jobs | Immediate |
+| 2 | Wait for pending background tasks | 25s max |
+| 3 | Close HTTP session | Immediate |
+| 4 | Flush external logging handler | Immediate |
+| 5 | Release scheduler lock | Immediate |
+| 6 | Close database connection | Immediate |
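Step 2 of the table above (wait, then cancel what is left) can be sketched roughly as follows. This is a minimal illustration, not the code in `backend/app/main.py`: the helper name `drain_pending_tasks` and the idea of tracking background tasks in a set are assumptions, and only the 25s timeout is taken from the documentation.

```python
import asyncio

DRAIN_TIMEOUT_SECONDS = 25.0  # value from the shutdown sequence table


async def drain_pending_tasks(tasks: set, timeout: float) -> int:
    """Wait up to `timeout` seconds for `tasks`, then cancel the stragglers.

    Hypothetical helper; returns the number of tasks that were cancelled.
    """
    if not tasks:
        return 0
    # asyncio.wait does not raise on timeout; it returns the unfinished tasks.
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to handle CancelledError before cleanup continues.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def demo() -> tuple:
    """Drain one batch that finishes in time and one that does not."""
    fast = {asyncio.create_task(asyncio.sleep(0.01)) for _ in range(3)}
    slow = {asyncio.create_task(asyncio.sleep(10))}
    cancelled_fast = await drain_pending_tasks(fast, timeout=1.0)
    cancelled_slow = await drain_pending_tasks(slow, timeout=0.05)
    return cancelled_fast, cancelled_slow
```

In a real lifespan handler the call would presumably be `await drain_pending_tasks(tasks, DRAIN_TIMEOUT_SECONDS)`, placed after the scheduler stops accepting jobs and before the resource cleanup in steps 3 to 6.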
+
+### Background Tasks That Drain
+
+- Blocklist imports
+- Geo IP cache resolutions
+- History sync operations
+- Geo cache cleanup
+- Geo cache flush
+- Session cleanup
+- Rate limiter cleanup
+- Scheduler lock heartbeat
+
+### Monitoring Shutdown
+
+Logs during shutdown:
+
+```
+bangui_shutting_down timeout_seconds=25.0
+scheduler_stopped_accepting_jobs
+waiting_for_pending_tasks count=3 timeout_seconds=25.0
+pending_tasks_completed
+http_session_closed
+external_logging_shutdown_complete
+scheduler_lock_released
+bangui_shut_down
+```
+
+If tasks exceed the timeout:
+```
+pending_tasks_timeout cancelled_count=3
+```
+
+### Rolling Deployments
+
+During rolling deployments:
+1. Old instance releases the scheduler lock explicitly during shutdown, after pending tasks have drained
+2. New instance acquires the lock without waiting for TTL expiry
+3. Background job execution resumes without a TTL-length gap
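The difference between an explicit release and waiting for TTL expiry can be illustrated with a small in-memory model. The `TtlLock` class below is entirely hypothetical (the real scheduler lock is not shown in this change) and the clock is passed in as a plain number so the behaviour is easy to follow.

```python
from typing import Optional


class TtlLock:
    """Hypothetical in-memory stand-in for a scheduler lock with a TTL."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.owner: Optional[str] = None
        self.expires_at = 0.0

    def acquire(self, owner: str, now: float) -> bool:
        """Take the lock unless another owner holds it and the TTL has not expired."""
        held_by_other = self.owner is not None and self.owner != owner
        if held_by_other and now < self.expires_at:
            return False
        self.owner = owner
        self.expires_at = now + self.ttl
        return True

    def release(self, owner: str) -> None:
        """Explicit release, as done by the old instance during shutdown."""
        if self.owner == owner:
            self.owner = None
```

With a 60s TTL, a new instance that starts 10s after the old one took the lock is refused until the old instance calls `release`, after which it can acquire immediately. Without the release it would have to wait until the 60s mark.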
+
+---
+
 ## Health Checks

-The backend container includes a health check endpoint at `GET /api/health` that reports application and fail2ban daemon status:
+The backend container includes a health check endpoint at `GET /api/v1/health` that reports application and component status:

- **HTTP 200** with `{"status": "ok", "fail2ban": "online"}` — backend is healthy and fail2ban is reachable
- **HTTP 503** with `{"status": "unavailable", "fail2ban": "offline"}` — fail2ban is unreachable (backend will restart)
+- **HTTP 200** with `{"status": "ok", ...}` — all components healthy
+- **HTTP 200** with `{"status": "degraded", ...}` — some components unhealthy (e.g., database error) but fail2ban reachable
+- **HTTP 503** with `{"status": "unavailable", ...}` — fail2ban is unreachable (backend will restart)
+
+**Component checks performed:**
+
+| Component | Check | Notes |
+|---|---|---|
+| fail2ban | Socket ping via cached status | Returns 503 when offline |
+| database | Opens and closes a test connection | Returns degraded when failing |
+| scheduler | `scheduler.running` attribute | Returns degraded when stopped |
+| cache | Session cache presence | Returns degraded when not initialised |
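The mapping from component results to the three responses listed above can be sketched as a small pure function. This is an assumption about how the aggregation might look, not the endpoint's actual code; the function name `overall_health` and the boolean-per-component input are illustrative.

```python
def overall_health(components: dict) -> tuple:
    """Map per-component booleans to (HTTP status code, status string).

    Hypothetical helper: fail2ban being down is the only condition that
    yields 503; any other failing component only degrades the status.
    """
    if not components.get("fail2ban", False):
        return 503, "unavailable"
    if all(components.values()):
        return 200, "ok"
    return 200, "degraded"
```

For example, `overall_health({"fail2ban": True, "database": False})` gives `(200, "degraded")`, so a database error alone does not cause the health check to fail.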

 **Docker Health Check:**

@@ -13,7 +99,16 @@ The Dockerfile includes a HEALTHCHECK that queries the endpoint. Docker interpre

 **Why 503 for offline fail2ban?**

-If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This can mask infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) will automatically restart the backend container until fail2ban recovers.
+If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This masks infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) automatically restart the backend container until fail2ban recovers.
+
+**Docker Compose health check parameters:**
+
+| Parameter | Value | Rationale |
+|---|---|---|
+| `interval` | 30s | Balance between responsiveness and load |
+| `timeout` | 10s | Allows for slow probe on busy system |
+| `retries` | 3 | ~90 seconds before restart (3 × 30s) |
+| `start_period` | 40s | Allows app and fail2ban to fully start |
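The "~90 seconds" figure in the table can be checked with a short calculation. The sketch below assumes that Docker starts each probe `interval` seconds after the previous probe finishes, so probes that hang until `timeout` stretch the total. The function name is hypothetical and the start period is ignored.

```python
def seconds_until_unhealthy(interval: float, timeout: float, retries: int) -> tuple:
    """Return (fast_failure, slow_failure) estimates in seconds.

    fast_failure: every probe fails immediately.
    slow_failure: every probe runs into the timeout before failing.
    """
    fast_failure = retries * interval
    slow_failure = retries * (interval + timeout)
    return fast_failure, slow_failure


# Values from the table: interval=30s, timeout=10s, retries=3
estimate = seconds_until_unhealthy(30.0, 10.0, 3)
```

Under these assumptions the estimate is 90s when probes fail quickly and 120s when each probe times out.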

 ---

@@ -277,6 +372,60 @@ Resource limits are configured in `Docker/docker-compose.yml` and cannot be over

 ---

+## Disaster Recovery
+
+### Database Migration Failures
+
+If a migration fails mid-transaction, the application refuses to start. This is intentional — it prevents inconsistent schema states.
+
+**Diagnosis:**
+
+1. Check current schema version:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
+   ```
+
+2. Check which tables exist:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
+   ```
+
+3. Check application logs for the specific error.
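The first two diagnosis steps can also be run from Python with the standard `sqlite3` module, which is convenient inside the backend container. This is a minimal sketch; the function name `describe_schema` is hypothetical, and the queries are the same ones shown above.

```python
import sqlite3
from typing import Optional


def describe_schema(conn: sqlite3.Connection) -> tuple:
    """Return (highest applied migration version or None, sorted table names)."""
    version: Optional[int] = conn.execute(
        "SELECT MAX(version) FROM schema_migrations"
    ).fetchone()[0]
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    return version, tables


# Usage against the path from this guide (not executed here):
#   conn = sqlite3.connect("/var/lib/bangui/bangui.db")
#   print(describe_schema(conn))
```

If `schema_migrations` does not exist yet, the first query raises `sqlite3.OperationalError`, which is itself a useful hint that no migration has completed.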
+
+**Recovery Options:**
+
+- **Automatic rollback**: Next startup re-applies the same migration from scratch
+- **Manual completion**: Apply the migration manually, then insert the version record:
+  ```bash
+  sqlite3 /var/lib/bangui/bangui.db <<'SQL'
+  BEGIN IMMEDIATE;
+  -- Run your migration SQL here
+  INSERT INTO schema_migrations (version) VALUES (<version>);  -- replace <version> with the migration number
+  COMMIT;
+  SQL
+- **Full reset** (development only): `rm bangui.db bangui.db-wal bangui.db-shm`
+
+**Prevention:**
+
+- Never modify `bangui.db` manually while an instance is running
+- Always back up the database before major migrations
+- Monitor startup logs for `migrating_database_schema` events
+
+### Orphaned WAL Files
+
+After a crash, SQLite in WAL mode may leave `-wal` and `-shm` files next to the database. The database auto-recovers on next open. If you see WAL-related errors:
+
+```bash
+# Check for orphaned WAL files
+ls -la /var/lib/bangui/bangui.db*
+
+# Force checkpoint to merge WAL into main database
+sqlite3 /var/lib/bangui/bangui.db "PRAGMA wal_checkpoint(FULL);"
+```
+
+See `Docs/DATABASE_MIGRATIONS.md` for full recovery procedures.
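The checkpoint command above can be scripted with Python's `sqlite3` module when the result needs to be inspected. The sketch below is an illustration only and the helper name `checkpoint_wal` is hypothetical. `PRAGMA wal_checkpoint` returns one row with three integers: a busy flag, the number of frames in the WAL, and the number of frames checkpointed.

```python
import sqlite3


def checkpoint_wal(db_path: str) -> tuple:
    """Run a FULL checkpoint and return (busy, wal_frames, checkpointed_frames)."""
    conn = sqlite3.connect(db_path)
    try:
        busy, wal_frames, checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(FULL)"
        ).fetchone()
        return busy, wal_frames, checkpointed
    finally:
        conn.close()


# Usage against the path from this guide (not executed here):
#   busy, wal_frames, checkpointed = checkpoint_wal("/var/lib/bangui/bangui.db")
```

A busy flag of 0 with equal frame counts indicates that the whole WAL was merged into the main database file. A non-zero busy flag means another connection blocked the checkpoint.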
+
+---
+
 ## Next Steps

 - **Development**: Run `make up` to start with default limits