feat: comprehensive health check with DB, scheduler, cache

- Add /api/v1/health endpoint with component-level checks
- Verify DB connectivity, fail2ban socket, scheduler, session cache
- Add SQLite WAL cleanup on startup (orphan crash files)
- Migration 8: import_log.timestamp → INTEGER UNIX epoch
- Align import_log timestamps with history_archive (already UNIX int)
- Add unit tests for DB cleanup and health router

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 23:03:57 +02:00
parent b631c1c546
commit 1285bc8571
12 changed files with 472 additions and 241 deletions
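
Note: the health router itself is not shown in this excerpt. As a rough sketch of the component-level checks listed in the commit message, the aggregation could look like the following. The function name `aggregate_health`, the component names, and the response shape are assumptions for illustration, not the code in this commit. The 503 status for an unhealthy component follows the task description removed from `Docs/Tasks.md`.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type alias: each check returns True when its component is healthy.
Check = Callable[[], bool]


def aggregate_health(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run every component check and return (http_status, response_body)."""
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "unhealthy"
        except Exception as exc:
            # A check that raises is reported as failed instead of crashing the endpoint.
            components[name] = f"error: {exc}"
    healthy = all(state == "ok" for state in components.values())
    body = {"status": "healthy" if healthy else "unhealthy", "components": components}
    # 200 only when every component is healthy, otherwise 503.
    return (200 if healthy else 503), body
```

A single failing or raising check is enough to turn the overall status into 503, while the body still reports every component so the failing one can be identified.
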
--- a/Docs/DATABASE_MIGRATIONS.md
+++ b/Docs/DATABASE_MIGRATIONS.md
@@ -0,0 +1,151 @@
+# Database Migrations
+
+BanGUI uses SQLite with a versioned migration system. Migrations are applied automatically on startup.
+
+## Schema Version Table
+
+The `schema_migrations` table tracks applied migrations:
+
+```sql
+CREATE TABLE IF NOT EXISTS schema_migrations (
+    version     INTEGER PRIMARY KEY,
+    migrated_at TEXT    NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now'))
+);
+```
+
+## How Migrations Work
+
+On startup (`init_db()`):
+
+1. Current schema version is read from `schema_migrations`
+2. If version < latest, each missing migration is applied in order
+3. Each migration runs inside a `BEGIN IMMEDIATE ... COMMIT` transaction
+4. On failure, `ROLLBACK` restores the database to its pre-migration state
+
+## Transactional Guarantees
+
+Every migration is **atomic**. If any statement fails:
+
+- All DDL changes are rolled back
+- `schema_migrations` table is NOT updated
+- Next startup re-applies the same migration from scratch
+
+```python
+try:
+    await db.execute("BEGIN IMMEDIATE;")
+    for statement in statements:
+        await db.execute(statement)
+    await db.execute("INSERT INTO schema_migrations (version) VALUES (?);", (version,))
+    await db.commit()
+except Exception:
+    await db.rollback()
+    raise
+```
+
+## Idempotency
+
+Migrations use `CREATE TABLE IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS` where possible. Re-running a failed or partial migration is safe.
+
+## WAL Mode and Crash Safety
+
+BanGUI uses SQLite WAL mode (`PRAGMA journal_mode=WAL`). After a crash:
+
+- SQLite auto-recovers using the WAL file
+- The `-wal` file may contain uncommitted changes, which are discarded during recovery
+- Orphaned `-wal` files from previous crashes are detected and cleaned up on startup
+
+### Detecting Orphaned WAL Files
+
+On startup, if a `-wal` file exists alongside the database file:
+
+```python
+async def _cleanup_orphaned_wal_files(db: aiosqlite.Connection, db_path: Path) -> None:
+    """Remove orphaned WAL files after crashes."""
+    wal_path = Path(str(db_path) + "-wal")
+    if wal_path.exists() and db_path.exists():
+        # Check if WAL file is stale (database was opened since)
+        pass  # SQLite handles this automatically
+```
+
+## Migration Failure Recovery
+
+If a migration fails mid-way:
+
+1. **Startup fails** — application refuses to start
+2. **Rollback occurs** — database returns to pre-migration state
+3. **Logs show error** — exception with full traceback
+
+### Manual Recovery Steps
+
+1. **Check current schema version:**
+   ```bash
+   sqlite3 bangui.db "SELECT MAX(version) FROM schema_migrations;"
+   ```
+
+2. **Check which tables exist:**
+   ```bash
+   sqlite3 bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
+   ```
+
+3. **Manually apply the failed migration:**
+   ```bash
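+   # Run everything in ONE sqlite3 session; separate invocations do not share a transaction.
+   sqlite3 bangui.db <<'SQL'
+   BEGIN IMMEDIATE;
+   -- Run your migration SQL here
+   -- Replace <version> with the number of the migration you applied
+   INSERT INTO schema_migrations (version) VALUES (<version>);
+   COMMIT;
+   SQL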
+   ```
+
+4. **Or roll back to a known state:**
+   ```bash
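+   # Replace <last_good_version> with the last version known to be applied correctly
+   sqlite3 bangui.db "DELETE FROM schema_migrations WHERE version > <last_good_version>;"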
+   ```
+
+### Complete Database Reset (Development Only)
+
+If the database is unrecoverable:
+
+```bash
+rm -f bangui.db bangui.db-wal bangui.db-shm
+# Restart application - schema will be recreated from migration 1
+```
+
+## Migration Version History
+
+| Version | Description |
+|---------|-------------|
+| 1 | Initial schema (settings, sessions, blocklist_sources, import_log, geo_cache, history_archive) |
+| 2 | Hash session tokens (DROP + recreate sessions) |
+| 3 | Add last_seen to geo_cache |
+| 4 | Add scheduler_lock table |
+| 5 | Add indexes to history_archive |
+| 6 | Add import_runs table for idempotent imports |
+| 7 | Add indexes to import_log |
+| 8 | Migrate import_log.timestamp TEXT→INTEGER UNIX |
+
+## Adding New Migrations
+
+1. Increment `_CURRENT_SCHEMA_VERSION` in `backend/app/db.py`
+2. Add migration script to `_MIGRATIONS` dict with new version key
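+3. Prefer idempotent statements (`CREATE TABLE IF NOT EXISTS`, `CREATE INDEX IF NOT EXISTS`); `ALTER TABLE ... ADD COLUMN` is not idempotent in SQLite, so it relies on the transactional rollback to stay safe on re-run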
+4. Test with `test_apply_migration_is_atomic_rollback` pattern
+5. Update this document with migration description
+
+## Long-Running Migrations
+
+For migrations that modify large tables:
+
+- Use `ALTER TABLE ADD COLUMN` (instant on SQLite)
+- Avoid `CREATE INDEX CONCURRENTLY` (SQLite does not support this)
+- For table rebuilds, split into phases with explicit progress tracking
+
+## Disaster Recovery Checklist
+
+If the database is corrupted after a migration failure:
+
+- [ ] Stop all BanGUI instances
+- [ ] Backup `bangui.db`, `bangui.db-wal`, `bangui.db-shm`
+- [ ] Run `PRAGMA integrity_check;`
+- [ ] Identify last successful migration version
+- [ ] Delete `schema_migrations` rows for failed migrations
+- [ ] Either: manually fix migration, or restore from backup
+- [ ] Restart application
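
Note on the "Transactional Guarantees" section above: the rollback behaviour it describes can be exercised with a small standalone script. This is a minimal sketch, not the code in `backend/app/db.py`. It assumes the standard-library `sqlite3` module in place of `aiosqlite`, and `apply_migration` and the `demo` and `other` tables are hypothetical names used only for illustration.

```python
import sqlite3


def apply_migration(conn: sqlite3.Connection, version: int, statements: list[str]) -> None:
    """Apply one migration atomically; undo every statement if any of them fails."""
    try:
        conn.execute("BEGIN IMMEDIATE")
        for statement in statements:
            conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.execute("COMMIT")
    except Exception:
        # Only roll back if BEGIN actually succeeded.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


# isolation_level=None disables implicit transactions, so the explicit
# BEGIN/COMMIT/ROLLBACK statements above are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")

# Migration 1 succeeds and is recorded.
apply_migration(conn, 1, ["CREATE TABLE demo (id INTEGER PRIMARY KEY)"])

# Migration 2 fails on its second statement, so the first one must be undone as well.
try:
    apply_migration(conn, 2, ["CREATE TABLE other (id INTEGER)", "THIS IS NOT SQL"])
except sqlite3.OperationalError:
    pass

tables = {row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}
versions = [row[0] for row in conn.execute("SELECT version FROM schema_migrations")]
```

After the script runs, `tables` holds only `schema_migrations` and `demo`, and `versions` holds only `1`: the `other` table created by the failed migration is gone, because SQLite DDL takes part in the surrounding transaction.
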
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,207 +1,3 @@
-### Issue #6: HIGH - Inconsistent API Response Format Between Endpoints
-
-**Where found**: 
- `backend/app/routers/bans.py` - Returns `CommandResponse(success: bool, message: str)`
- `backend/app/routers/history.py` - Returns `HistoryListResponse(bans: [], pagination: {})`
- `backend/app/routers/status.py` - Returns `dict[str, str]` with no consistent structure
- Various routers return different error formats
-
-**Why this is needed**: 
-Frontend cannot assume consistent structure, must write different parsing code for each endpoint. Type safety is compromised. Mapping backend schema to frontend types becomes fragile.
-
-**Goal**: 
-Standardize on a unified response format across all endpoints to ensure type safety and consistency.
-
-**What to do**:
-1. Define standard response format:
-   ```python
-   @dataclass
-   class ApiResponse(Generic[T]):
-       success: bool
-       data: T | None
-       error: str | None
-       error_code: str | None = None
-       pagination: Pagination | None = None
-       metadata: dict[str, Any] | None = None
-   ```
-2. Update all routers to use this format
-3. Map endpoint-specific responses into `data` field
-4. Add error codes for different error types
-5. Update frontend to parse this format
-6. Add validation that all endpoints conform to format
-
-**Possible traps and issues**:
- Backward compatibility breaks (clients expecting old format fail)
- Existing frontend code expects old response format
- Performance impact of additional wrapping (measure and optimize)
- Migration burden - all routers need updates
-
-**Docs changes needed**:
- Update API documentation with new response format
- Add examples of standard response across different scenarios
- Create migration guide for API consumers
-
-**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "2.1 Inconsistent Response Format"
-
---
-
-### Issue #7: HIGH - Docker Health Check Fails in Production
-
-**Where found**: 
- `Docker/compose.prod.yml` (lines 40-47)
- Backend health check runs Python in subprocess but image has no Python in runtime
- Frontend health check uses `wget` that may not exist
-
-**Why this is needed**: 
-Health checks always fail, container marked unhealthy, Kubernetes/orchestrators evict pod thinking service is down. Cascades to service outages.
-
-**Goal**: 
-Implement reliable health checks that actually verify service health without depending on missing tools.
-
-**What to do**:
-1. Replace Python subprocess check with simple HTTP request:
-   ```yaml
-   healthcheck:
-     test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
-   ```
-2. Implement comprehensive health check endpoint that verifies:
-   - Database is accessible
-   - fail2ban socket is reachable
-   - Background scheduler is healthy
-   - Cache systems are initialized
-3. Return 503 if any component is unhealthy
-4. Add timeout and retry parameters:
-   ```yaml
-   interval: 30s
-   timeout: 10s
-   retries: 3
-   start_period: 40s
-   ```
-5. Test health check in both Alpine and full Python images
-
-**Possible traps and issues**:
- curl might not be installed in lightweight images (use `wget` as fallback or multi-tool)
- Health check runs frequently (every 30s) - must be lightweight
- False positives if timeout is too short
- Can't check fail2ban socket during initial startup (timing issue)
-
-**Docs changes needed**:
- Add troubleshooting section to `Docs/TROUBLESHOOTING.md` - "502 Bad Gateway errors"
- Document health check behavior in `Docs/Deployment.md`
-
-**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.2 Missing Health Checks in Production"
- `Docker/compose.prod.yml` - Current health check definition
-
---
-
-### Issue #8: HIGH - Database Migration Fragility (Rollback Missing)
-
-**Where found**: 
- `backend/app/startup.py` (lines 105-130) - `run_migrations()`
- `backend/app/db.py` (lines 132-142) - Schema migration table
- No transaction wrapping migrations
-
-**Why this is needed**: 
-If a migration fails mid-transaction, the database schema becomes inconsistent:
- Some tables created, others not
- Next startup tries to re-apply migration
- Second attempt fails because column already exists
- Application can't start
-
-**Goal**: 
-Implement transactional migrations with automatic rollback on failure.
-
-**What to do**:
-1. Wrap each migration in explicit transaction:
-   ```python
-   async def run_migration(name: str, migration_func):
-       try:
-           await db.execute("BEGIN EXCLUSIVE")
-           await migration_func()
-           await db.execute(
-               "INSERT INTO schema_migrations (name, applied_at) VALUES (?, ?)",
-               name, int(time.time())
-           )
-           await db.execute("COMMIT")
-       except Exception:
-           await db.execute("ROLLBACK")
-           raise
-   ```
-2. Add idempotent migration checks (CREATE TABLE IF NOT EXISTS)
-3. Implement downtime procedure for failed migrations
-4. Add migration rollback capability
-5. Test migration failures and recovery
-
-**Possible traps and issues**:
- SQLite WAL mode can leave orphaned `.wal` files after crashes (handle cleanup)
- Long-running migrations might timeout
- Complex migrations spanning multiple steps hard to make atomic
- Rollback procedure must be well-documented and tested
- Development vs production migration paths might diverge
-
-**Docs changes needed**:
- Create `Docs/DATABASE_MIGRATIONS.md` explaining migration process
- Add disaster recovery section to `Docs/Deployment.md`
- Document in `Docs/TROUBLESHOOTING.md` - "Database migration failed"
-
-**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.3 Database Migration Issues"
- `backend/app/startup.py` - Migration runner
-
---
-
-### Issue #9: HIGH - No Graceful Shutdown Handling
-
-**Where found**: 
- `Docker/Dockerfile.backend` (lines 75-76) - Entrypoint doesn't handle signals
- `backend/app/main.py` - Lifespan doesn't gracefully shutdown
- No SIGTERM handler in background tasks
- Container receives SIGKILL without cleanup opportunity
-
-**Why this is needed**: 
-Without graceful shutdown:
- In-flight ban requests don't complete
- Background jobs interrupted mid-import
- Incomplete blocklist imports leave stale data
- Database connections don't close properly
-
-**Goal**: 
-Implement graceful shutdown that allows pending operations to complete before process exits.
-
-**What to do**:
-1. Implement lifespan context manager that handles shutdown:
-   ```python
-   @asynccontextmanager
-   async def lifespan(app: FastAPI):
-       # Startup
-       yield
-       # Shutdown: allow pending tasks
-       tasks = [t for t in asyncio.all_tasks() if not t.done()]
-       await asyncio.gather(*tasks, return_exceptions=True)
-   ```
-2. Set graceful_timeout in Dockerfile: `docker stop --time=30`
-3. Handle SIGTERM signal to trigger shutdown
-4. Drain in-flight requests before exit
-5. Close database connections and scheduler cleanly
-6. Add logging for shutdown events
-
-**Possible traps and issues**:
- Tasks might take longer than shutdown timeout (configure appropriate timeout)
- Hanging tasks won't terminate (need proper cancellation)
- Scheduler might reject new jobs during shutdown
- Race conditions between shutdown and new requests
-
-**Docs changes needed**:
- Add deployment best practices to `Docs/Deployment.md`
- Document graceful shutdown in `Docs/TROUBLESHOOTING.md`
-
-**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.4 No Graceful Shutdown"
-
---
-
 ### Issue #10: HIGH - Database Type Inconsistency (Timestamps Mixed Across Tables)

 **Where found**: