feat(backend): implement graceful shutdown for container stop

Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00
parent f6c3c02183
commit b631c1c546
10 changed files with 383 additions and 20 deletions
--- a/Docs/TROUBLESHOOTING.md
+++ b/Docs/TROUBLESHOOTING.md
@@ -134,6 +134,154 @@ ps aux | grep <pid>

 ---

+## Database Migration Failures
+
+### Application Won't Start After Upgrade
+
+**Symptom:** Application fails to start. Logs show migration errors.
+
+**Cause:** Migration failed mid-transaction. Database left in inconsistent state.
+
+**Diagnosis:**
+```bash
+# Check current schema version
+sqlite3 /var/lib/bangui/bangui.db "SELECT MAX(version) FROM schema_migrations;"
+
+# List all tables
+sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
+
+# Check logs for specific error
+grep -i migration /var/log/bangui.log
+```
+
+**Solution:**
+
+1. **If migration was auto-rolled back**: Startup will retry the same migration. Start the application again.
+2. **If migration keeps failing**: Check if table already exists:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "SELECT name FROM sqlite_master WHERE type='table' AND name='<table>';"
+   ```
+   If it exists, manually insert the migration record, replacing `<version>` with the number of the migration that keeps failing:
+   ```bash
+   sqlite3 /var/lib/bangui/bangui.db "INSERT INTO schema_migrations (version) VALUES (<version>);"
+   ```
+3. **Full database reset** (development only):
+   ```bash
+   rm -f /var/lib/bangui/bangui.db /var/lib/bangui/bangui.db-wal /var/lib/bangui/bangui.db-shm
+   ```
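Reviewer note: step 2 above can also be done from Python with bound parameters instead of a hand-edited SQL string. This is a minimal sketch, assuming `schema_migrations` has a `version` column as in the queries above. The helper name `mark_migration_applied` and its arguments are hypothetical and not part of BanGUI.

```python
import sqlite3


def mark_migration_applied(conn: sqlite3.Connection, version: int, table: str) -> bool:
    """Record `version` as applied, but only if `table` already exists.

    Hypothetical helper: assumes schema_migrations(version) as queried above.
    Returns True if a record was inserted, False otherwise.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?",
        (table,),
    ).fetchone()
    if exists is None:
        return False  # the migration never created the table; do not mark it applied

    already = conn.execute(
        "SELECT 1 FROM schema_migrations WHERE version=?", (version,)
    ).fetchone()
    if already is not None:
        return False  # the version is already recorded

    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
    return True
```

Checking for the table first mirrors the manual procedure: the record is only inserted when the schema change itself is already in place.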
+
+**Prevention:**
+- Always back up before upgrades: `cp bangui.db bangui.db.backup`
+- Never manually modify database schema
+- Monitor `migrating_database_schema` log events during upgrades
+
+---
+
+### Schema Version Mismatch
+
+**Symptom:** Error: "database schema version X is newer than supported version Y"
+
+**Cause:** Downgraded to older BanGUI version that doesn't support current schema.
+
+**Solution:** Upgrade to a version compatible with the current schema, or restore from backup.
+
+---
+
+## 502 Bad Gateway Errors
+
+### Symptom: Nginx returns 502 Bad Gateway
+
+**Cause:** The backend container is unreachable — either down, restarting, or not yet healthy.
+
+**Diagnosis:**
+
+```bash
+# Check backend container status
+docker ps -a | grep bangui-backend
+
+# Check if backend is responding directly (on the container network)
+docker exec bangui-frontend curl -f http://bangui-backend:8000/api/v1/health
+
+# Check backend logs
+docker logs bangui-backend --tail 50
+```
+
+**Common causes and solutions:**
+
+| Cause | Diagnosis | Solution |
+|---|---|---|
+| Backend restarting | `docker ps` shows backend repeatedly restarting | Check health check timing; may need longer `start_period` |
+| Health check failing | Backend log shows socket errors | Verify fail2ban container is healthy before backend starts |
+| Startup too slow | `start_period: 40s` not enough on slow hosts | Increase `start_period` in compose file |
+| Port misconfiguration | `expose` vs `ports` mismatch | Ensure backend exposes 8000 and frontend proxies to it |
+
+**Prevention:**
+
+- The `depends_on: condition: service_healthy` ensures the backend is fully started before the frontend proxies requests.
+- The health check returns 503 when fail2ban is offline, triggering container restart automatically.
+- Health check parameters are tuned for typical startup time — adjust `start_period` if the host is slow or resource-constrained.
+
+---
+
+## Graceful Shutdown Issues
+
+### Container Killed Before Tasks Complete
+
+**Symptom:** Logs show `pending_tasks_timeout` and tasks are cancelled mid-execution.
+
+**Cause:** Docker's `stop_grace_period` is too short, or tasks take longer than the 25s graceful timeout.
+
+**Diagnosis:**
+```bash
+# Check if container was killed by SIGKILL
+docker inspect bangui-backend --format '{{.State.ExitCode}}'
+# Exit code 137 = SIGKILL
+```
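Reviewer note: the value 137 follows the convention of reporting 128 plus the signal number when a process is killed by a signal, and SIGKILL is signal 9. The sketch below shows that arithmetic with the standard `signal` module. The function name is illustrative, and the signal names assume a POSIX host.

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name for a 128+N exit code, or None if it is not one."""
    if exit_code <= 128:
        return None  # ordinary exit status, not a signal death
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


print(signal_from_exit_code(137))  # 137 - 128 = 9
print(signal_from_exit_code(143))  # 143 - 128 = 15
```

On Linux these correspond to SIGKILL and SIGTERM, so 143 usually indicates the process exited on the initial stop signal rather than being force-killed.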
+
+**Solution:**
+1. Increase `stop_grace_period` in `docker-compose.yml`:
+   ```yaml
+   backend:
+     stop_grace_period: 60s
+   ```
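+2. Note that the Python graceful timeout stays at 25s, so tasks are still cancelled after 25s; a longer `stop_grace_period` only widens the margin before Docker sends SIGKILL
+3. If tasks still time out, check the task code: long-running tasks should handle cancellation gracefully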
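Reviewer note: the drain-then-cancel behaviour and the "handle cancellation gracefully" advice above can be sketched with plain `asyncio`. This is not the code in `backend/app/main.py`. The names `drain_tasks`, `long_job`, and `GRACEFUL_TIMEOUT_S` are hypothetical, and the sketch only shows the general pattern of waiting up to a deadline and then cancelling whatever is left.

```python
import asyncio

GRACEFUL_TIMEOUT_S = 25.0  # assumed name; the value matches the 25s timeout above


async def drain_tasks(tasks: set, timeout: float = GRACEFUL_TIMEOUT_S) -> int:
    """Wait up to `timeout` seconds for tasks, then cancel the rest.

    Returns the number of tasks that had to be cancelled.
    """
    if not tasks:
        return 0
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    if pending:
        # Let cancelled tasks run their cleanup handlers before returning.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def long_job(log: list) -> None:
    """A long-running task that handles cancellation gracefully."""
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        log.append("cleaned up")  # release resources, persist progress, etc.
        raise  # re-raise so the task is marked as cancelled
```

A task that swallows `CancelledError` without re-raising can keep running past the deadline, which is one way to end up at the SIGKILL described above.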
+
+### Scheduler Lock Not Released
+
+**Symptom:** After container restart, logs show `Could not acquire scheduler lock`.
+
+**Cause:** Previous instance shut down without releasing the lock, or lock TTL hasn't expired.
+
+**Diagnosis:**
+```bash
+sqlite3 /var/lib/bangui/bangui.db "SELECT * FROM scheduler_lock;"
+```
+
+**Solution:**
+```bash
+# Clear stale lock
+sqlite3 /var/lib/bangui/bangui.db "DELETE FROM scheduler_lock;"
+# Restart the container so the new instance can acquire the lock
+docker restart bangui-backend
+```
+
+**Prevention:**
+- Graceful shutdown releases lock immediately (not waiting for TTL expiry)
+- Monitor logs for `scheduler_lock_released` on clean shutdown
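Reviewer note: a blanket `DELETE FROM scheduler_lock` also removes a lock that a live instance still holds. The sketch below removes only expired rows. It rests on an assumption that is not confirmed by this diff: that `scheduler_lock` has an `expires_at` column holding a Unix timestamp. Inspect the real schema with the `SELECT` above before adapting it. The function name is hypothetical.

```python
import sqlite3
import time
from typing import Optional


def clear_stale_lock(conn: sqlite3.Connection, now: Optional[float] = None) -> int:
    """Delete lock rows whose TTL has expired; return how many were removed.

    ASSUMPTION: scheduler_lock has an `expires_at` column (Unix seconds).
    """
    if now is None:
        now = time.time()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "DELETE FROM scheduler_lock WHERE expires_at <= ?", (now,)
        )
    return cur.rowcount
```

A return value of 0 means the existing lock has not expired yet, in which case another instance may still be running.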
+
+### In-Flight Requests Dropped
+
+**Symptom:** Client connections closed abruptly during shutdown.
+
+**Cause:** Too short a graceful timeout, or clients not configured to retry.
+
+**Solution:**
+1. Ensure clients implement proper retry logic with backoff
+2. For critical operations, use background tasks with status polling
+3. Increase graceful timeout if network latency is high
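Reviewer note: the retry advice in step 1 can be sketched as a small client-side helper. This is a generic pattern, not a BanGUI API. The name `retry_with_backoff`, its defaults, and the choice to retry only on `ConnectionError` are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(call, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` is exhausted.

    Retries only on ConnectionError. The delay doubles after each failure,
    is capped at `max_delay`, and gets random jitter so that many clients
    do not retry in lockstep while the backend restarts.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

Only idempotent requests should be wrapped this way. For operations that must not run twice, the background-task-with-polling approach in step 2 is the safer choice.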
+
+---
+
 ## General Recovery Commands

 Clear all locks: