feat(backend): implement graceful shutdown for container stop

Graceful shutdown ensures in-flight operations complete before the process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00
parent f6c3c02183
commit b631c1c546
10 changed files with 383 additions and 20 deletions


@@ -1,11 +1,97 @@
# Deployment Guide
## Graceful Shutdown
BanGUI implements graceful shutdown to ensure in-flight operations complete before the process exits. This prevents:
- Incomplete blocklist imports leaving stale data
- Interrupted ban requests
- Corrupted background job states
- Unclean database connection closures
### How It Works
1. **SIGTERM received** — Docker sends SIGTERM when `docker stop` is called
2. **Uvicorn catches SIGTERM** — Notifies the FastAPI lifespan handler
3. **Lifespan shutdown begins** — Scheduler stops accepting new jobs
4. **In-flight tasks drain** — Up to 25 seconds for running jobs to complete
5. **Resources cleaned up** — HTTP session, external logging, scheduler lock, DB connection
### Docker Configuration
```yaml
backend:
  stop_grace_period: 30s  # Docker waits 30s after SIGTERM before sending SIGKILL
```
With `stop_grace_period: 30s`, Docker waits 30s after SIGTERM before sending SIGKILL. The Python shutdown handler limits itself to 25s, which leaves a 5s safety margin.
### Shutdown Sequence
| Step | Action | Timeout |
|------|--------|---------|
| 1 | Scheduler stops accepting new jobs | Immediate |
| 2 | Wait for pending background tasks | 25s max |
| 3 | Close HTTP session | Immediate |
| 4 | Flush external logging handler | Immediate |
| 5 | Release scheduler lock | Immediate |
| 6 | Close database connection | Immediate |
### Background Tasks That Drain
- Blocklist imports
- Geo IP cache resolutions
- History sync operations
- Geo cache cleanup
- Geo cache flush
- Session cleanup
- Rate limiter cleanup
- Scheduler lock heartbeat
### Monitoring Shutdown
Logs during shutdown:
```
bangui_shutting_down timeout_seconds=25.0
scheduler_stopped_accepting_jobs
waiting_for_pending_tasks count=3 timeout_seconds=25.0
pending_tasks_completed
http_session_closed
external_logging_shutdown_complete
scheduler_lock_released
bangui_shut_down
```
If tasks exceed the timeout:
```
pending_tasks_timeout cancelled_count=3
```
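For automated checks, the events above could be classified with a small helper like the one below. This is only a sketch: `shutdown_outcome` is a hypothetical name, and it assumes the event names appear verbatim somewhere in each log line, which may not hold for every log format.

```python
from typing import Iterable


def shutdown_outcome(log_lines: Iterable[str]) -> str:
    """Classify a shutdown from its log lines (hypothetical helper).

    Returns "timed_out" if tasks were cancelled after the timeout,
    "clean" if the final shutdown event was logged without a timeout,
    and "incomplete" if the final event is missing.
    """
    lines = list(log_lines)
    if any("pending_tasks_timeout" in line for line in lines):
        return "timed_out"
    if any("bangui_shut_down" in line for line in lines):
        return "clean"
    return "incomplete"
```

An "incomplete" result would suggest the process was killed before the handler finished, for example by SIGKILL after the grace period.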
### Rolling Deployments
During rolling deployments:
1. Old instance releases scheduler lock immediately on shutdown
2. New instance acquires lock without waiting for TTL expiry
3. Zero downtime for background job execution
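The handover described above relies on an explicit release rather than TTL expiry. The in-memory class below is a simplified stand-in to illustrate the idea. `TtlLock` and its methods are hypothetical and do not reflect how BanGUI actually stores the scheduler lock.

```python
import time
from typing import Optional


class TtlLock:
    """In-memory stand-in for a scheduler lock with a time-to-live (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.owner: Optional[str] = None
        self.expires_at = 0.0

    def acquire(self, instance_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        held_by_other = (
            self.owner is not None
            and self.owner != instance_id
            and now < self.expires_at
        )
        if held_by_other:
            return False
        # Free, expired, or already ours: take (or refresh) the lock.
        self.owner = instance_id
        self.expires_at = now + self.ttl_seconds
        return True

    def release(self, instance_id: str) -> None:
        # Only the current owner may release; the lock is freed immediately.
        if self.owner == instance_id:
            self.owner = None
            self.expires_at = 0.0
```

Without the `release` call, a new instance in this sketch would have to wait until `expires_at` passes before `acquire` succeeds.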
---
## Health Checks
The backend container includes a health check endpoint at `GET /api/health` that reports application and fail2ban daemon status:
The backend container includes a health check endpoint at `GET /api/v1/health` that reports application and component status:
- **HTTP 200** with `{"status": "ok", "fail2ban": "online"}` — backend is healthy and fail2ban is reachable
- **HTTP 503** with `{"status": "unavailable", "fail2ban": "offline"}` fail2ban is unreachable (backend will restart)
- **HTTP 200** with `{"status": "ok", ...}` — all components healthy
- **HTTP 200** with `{"status": "degraded", ...}` — some components unhealthy (e.g., database error) but fail2ban reachable
- **HTTP 503** with `{"status": "unavailable", ...}` — fail2ban is unreachable (backend will restart)
**Component checks performed:**
| Component | Check | Notes |
|---|---|---|
| fail2ban | Socket ping via cached status | Returns 503 when offline |
| database | Opens and closes a test connection | Returns degraded when failing |
| scheduler | `scheduler.running` attribute | Returns degraded when stopped |
| cache | Session cache presence | Returns degraded when not initialised |
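The mapping from component results to the overall response can be sketched as below. The function name `aggregate_health` and the boolean-per-component input are assumptions for illustration. The real endpoint may carry richer detail per component.

```python
from typing import Dict, Tuple


def aggregate_health(components: Dict[str, bool]) -> Tuple[int, str]:
    """Map per-component results to (HTTP status code, overall status).

    Hypothetical helper: `components` maps a component name to True when healthy.
    """
    # fail2ban being unreachable is the only condition that yields 503.
    if not components.get("fail2ban", False):
        return 503, "unavailable"
    if all(components.values()):
        return 200, "ok"
    # fail2ban is reachable but another component is failing.
    return 200, "degraded"
```

For example, a failing database with fail2ban online gives `(200, "degraded")`, while fail2ban offline gives `(503, "unavailable")` regardless of the other components.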
**Docker Health Check:**
@@ -13,7 +99,16 @@ The Dockerfile includes a HEALTHCHECK that queries the endpoint. Docker interpre
**Why 503 for offline fail2ban?**
If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This can mask infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) will automatically restart the backend container until fail2ban recovers.
If fail2ban goes offline but the backend always returns 200, Docker treats the container as healthy. This masks infrastructure failures. By returning 503 when fail2ban is unreachable, orchestration tools (Docker, Kubernetes, Docker Swarm) automatically restart the backend container until fail2ban recovers.
**Docker Compose health check parameters:**
| Parameter | Value | Rationale |
|---|---|---|
| `interval` | 30s | Balance between responsiveness and load |
| `timeout` | 10s | Allows for slow probe on busy system |
| `retries` | 3 | ~90 seconds before restart (3 × 30s) |
| `start_period` | 40s | Allows app and fail2ban to fully start |
---
@@ -277,6 +372,60 @@ Resource limits are configured in `Docker/docker-compose.yml` and cannot be over
---
## Disaster Recovery
### Database Migration Failures
If a migration fails mid-transaction, the application refuses to start. This is intentional — it prevents inconsistent schema states.
**Diagnosis:**
1. Check current schema version:
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
```
2. Check which tables exist:
```bash
sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
```
3. Check application logs for the specific error.
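The first two diagnosis steps can also be run from Python with the standard `sqlite3` module. The helper below is a sketch: `inspect_schema` is a hypothetical name, and it assumes the `schema_migrations` table has a `version` column as in the queries above. Note that `sqlite3.connect` creates an empty database file if the path does not exist.

```python
import sqlite3
from typing import List, Optional, Tuple


def inspect_schema(db_path: str) -> Tuple[Optional[int], List[str]]:
    """Return (highest applied migration version, sorted table names).

    Hypothetical helper; the version is None when schema_migrations
    is missing or empty.
    """
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        version = None
        if "schema_migrations" in tables:
            version = conn.execute(
                "SELECT MAX(version) FROM schema_migrations"
            ).fetchone()[0]
        return version, tables
    finally:
        conn.close()
```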
**Recovery Options:**
- **Automatic rollback**: Next startup re-applies the same migration from scratch
- **Manual completion**: Apply the migration manually, then insert the version record:
```bash
# Run everything in ONE sqlite3 session; separate invocations cannot share a transaction.
sqlite3 /var/lib/bangui/bangui.db <<'SQL'
BEGIN IMMEDIATE;
-- Run the migration's SQL statements here
INSERT INTO schema_migrations (version) VALUES (?);  -- replace ? with the migration's version number
COMMIT;
SQL
```
- **Full reset** (development only): `rm bangui.db bangui.db-wal bangui.db-shm`
**Prevention:**
- Never modify `bangui.db` manually while an instance is running
- Always back up the database before major migrations
- Monitor startup logs for `migrating_database_schema` events
### Orphaned WAL Files
After a crash, SQLite in WAL mode may leave `-wal` and `-shm` files next to the database. SQLite recovers from them automatically on the next open. If you see WAL-related errors:
```bash
# Check for orphaned WAL files
ls -la /var/lib/bangui/bangui.db*
# Force checkpoint to merge WAL into main database
sqlite3 /var/lib/bangui/bangui.db "PRAGMA wal_checkpoint(FULL);"
```
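The same checkpoint can be triggered from Python. This is a sketch with a hypothetical helper name, `checkpoint_wal`. It returns the row SQLite reports for the pragma, whose first column is non-zero when the checkpoint could not complete because the database was busy.

```python
import sqlite3
from typing import Tuple


def checkpoint_wal(db_path: str) -> Tuple[int, int, int]:
    """Run a FULL WAL checkpoint (hypothetical helper).

    Returns SQLite's result row: (busy flag, WAL frames, frames checkpointed).
    """
    conn = sqlite3.connect(db_path)
    try:
        busy, log_frames, checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(FULL)"
        ).fetchone()
        return busy, log_frames, checkpointed
    finally:
        conn.close()
```

A busy flag of 0 indicates that the checkpoint ran to completion.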
See `Docs/DATABASE_MIGRATIONS.md` for full recovery procedures.
---
## Next Steps
- **Development**: Run `make up` to start with default limits