feat: comprehensive health check with DB, scheduler, cache

- Add /api/v1/health endpoint with component-level checks
- Verify DB connectivity, fail2ban socket, scheduler, session cache
- Add SQLite WAL cleanup on startup (orphan crash files)
- Migration 8: import_log.timestamp → INTEGER UNIX epoch
- Align import_log timestamps with history_archive (already UNIX int)
- Add unit tests for DB cleanup and health router

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
151
Docs/DATABASE_MIGRATIONS.md
Normal file
@@ -0,0 +1,151 @@
# Database Migrations

BanGUI uses SQLite with a versioned migration system. Migrations are applied automatically on startup.

## Schema Version Table

The `schema_migrations` table tracks applied migrations:

```sql
CREATE TABLE IF NOT EXISTS schema_migrations (
    version INTEGER PRIMARY KEY,
    migrated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now'))
);
```

## How Migrations Work

On startup (`init_db()`):

1. Current schema version is read from `schema_migrations`
2. If version < latest, each missing migration is applied in order
3. Each migration runs inside a `BEGIN IMMEDIATE ... COMMIT` transaction
4. On failure, `ROLLBACK` restores database to pre-migration state
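
The version lookup in steps 1 and 2 can be sketched with the standard-library `sqlite3` module (BanGUI itself uses `aiosqlite`). The helper names `current_version` and `pending_versions` and the sample `MIGRATIONS` dict are hypothetical and exist only for illustration; this is a minimal sketch, not the project's `init_db()`.

```python
import sqlite3

# Hypothetical sample migrations keyed by version (not BanGUI's real scripts).
MIGRATIONS: dict[int, str] = {
    1: "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT);",
    2: "CREATE TABLE IF NOT EXISTS sessions (id INTEGER PRIMARY KEY, token TEXT);",
    3: "CREATE INDEX IF NOT EXISTS idx_sessions_token ON sessions (token);",
}


def current_version(conn: sqlite3.Connection) -> int:
    """Return the highest applied version, or 0 for a fresh database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "version INTEGER PRIMARY KEY, "
        "migrated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')))"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    return row[0] if row[0] is not None else 0


def pending_versions(conn: sqlite3.Connection, migrations: dict[int, str]) -> list[int]:
    """Return the versions that still need to be applied, in ascending order."""
    applied = current_version(conn)
    return sorted(v for v in migrations if v > applied)
```

A runner would then loop over `pending_versions(...)` and apply each script in its own transaction.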

## Transactional Guarantees

Every migration is **atomic**. If any statement fails:

- All DDL changes are rolled back
- `schema_migrations` table is NOT updated
- Next startup re-applies the same migration from scratch

```python
try:
    await db.execute("BEGIN IMMEDIATE;")
    for statement in statements:
        await db.execute(statement)
    await db.execute("INSERT INTO schema_migrations (version) VALUES (?);", (version,))
    await db.commit()
except Exception:
    await db.rollback()
    raise
```
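
To see the rollback guarantee in isolation, the following sketch reproduces the same pattern with the synchronous standard-library `sqlite3` module rather than `aiosqlite`. The function names `apply_migration` and `open_autocommit` are assumptions made for this example. The connection is opened with `isolation_level=None` so that only the explicit `BEGIN IMMEDIATE`, `COMMIT` and `ROLLBACK` statements control the transaction.

```python
import sqlite3


def apply_migration(conn: sqlite3.Connection, version: int, statements: list[str]) -> None:
    """Apply one migration atomically; undo everything if any statement fails."""
    try:
        conn.execute("BEGIN IMMEDIATE")
        for statement in statements:
            conn.execute(statement)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.execute("COMMIT")
    except Exception:
        # Only roll back if the transaction was actually opened.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


def open_autocommit(path: str) -> sqlite3.Connection:
    """Open a connection whose transactions are controlled only by explicit SQL."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
    return conn
```

If the second statement of a migration fails, the table created by the first statement disappears again and no row is written to `schema_migrations`, so the next attempt starts from a clean state.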

## Idempotency

Migrations use `CREATE TABLE IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS` where possible, so re-running those statements is harmless. Statements without an `IF NOT EXISTS` form, such as the `ALTER TABLE` steps in migration 8, are not idempotent on their own; retrying them is safe only because a failed migration is rolled back in full.

## WAL Mode and Crash Safety

BanGUI uses SQLite WAL mode (`PRAGMA journal_mode=WAL`). After a crash:

- SQLite auto-recovers using the WAL file
- The `-wal` file may contain uncommitted changes, which are discarded during recovery
- Leftover `-wal` and `-shm` files from previous crashes are detected and cleaned up on startup

### Detecting Orphaned WAL Files

On startup, `open_db()` calls `_cleanup_wal_files()` before connecting. It removes any `-wal` and `-shm` files left next to the database:

```python
async def _cleanup_wal_files(db_path: str) -> None:
    """Remove orphaned WAL files after crashes."""
    for suffix in ("-wal", "-shm"):
        path = Path(db_path + suffix)
        if path.exists():
            try:
                path.unlink()
                log.warning("orphaned_sqlite_file_removed", path=str(path))
            except OSError:
                pass  # File in use or permission denied
```
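
A `-wal` file left behind by a crash can still hold committed transactions that have not yet been checkpointed into the main database file. A more conservative alternative to deleting it is to open the database, which lets SQLite replay the WAL, and then checkpoint. The sketch below illustrates that idea with the standard-library `sqlite3` module; `recover_and_checkpoint` is a hypothetical helper name, not part of BanGUI.

```python
import sqlite3


def recover_and_checkpoint(db_path: str) -> tuple[int, int, int]:
    """Open the database (SQLite replays the WAL), then fold the WAL back in.

    Returns the (busy, log_frames, checkpointed_frames) row reported by SQLite.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        # TRUNCATE checkpoints every frame and resets the WAL file to zero length.
        row = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        return (row[0], row[1], row[2])
    finally:
        conn.close()
```

A `busy` value of `0` means the checkpoint ran to completion; a non-zero value means another connection blocked it.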

## Migration Failure Recovery

If a migration fails mid-way:

1. **Startup fails:** the application refuses to start
2. **Rollback occurs:** the database returns to its pre-migration state
3. **Logs show the error:** the exception is logged with a full traceback

### Manual Recovery Steps

1. **Check current schema version:**
```bash
sqlite3 bangui.db "SELECT MAX(version) FROM schema_migrations;"
```

2. **Check which tables exist:**
```bash
sqlite3 bangui.db "SELECT name FROM sqlite_master WHERE type='table';"
```

3. **Manually apply the failed migration:**
```bash
# Run everything in one sqlite3 session; separate invocations do not share a transaction.
# Replace <version> with the number of the migration being applied.
sqlite3 bangui.db <<'SQL'
BEGIN IMMEDIATE;
-- Run your migration SQL here
INSERT INTO schema_migrations (version) VALUES (<version>);
COMMIT;
SQL
```

4. **Or roll back to a known state:**
```bash
# Replace <version> with the last migration known to have succeeded.
sqlite3 bangui.db "DELETE FROM schema_migrations WHERE version > <version>;"
```
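
Steps 1 and 2 can also be combined into one small script, which is handy when several database files need checking. This is a sketch using the standard-library `sqlite3` module; `describe_schema_state` is a hypothetical helper name and not part of BanGUI.

```python
import sqlite3


def describe_schema_state(db_path: str) -> dict[str, object]:
    """Report the highest applied migration and the tables that exist."""
    conn = sqlite3.connect(db_path)
    try:
        has_table = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name='schema_migrations'"
        ).fetchone()
        version = None
        if has_table:
            version = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        return {"version": version, "tables": tables}
    finally:
        conn.close()
```

A `version` of `None` means that no migration has been recorded yet.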

### Complete Database Reset (Development Only)

If the database is unrecoverable:

```bash
rm bangui.db bangui.db-wal bangui.db-shm
# Restart application - schema will be recreated from migration 1
```

## Migration Version History

| Version | Description |
|---------|-------------|
| 1 | Initial schema (settings, sessions, blocklist_sources, import_log, geo_cache, history_archive) |
| 2 | Hash session tokens (DROP + recreate sessions) |
| 3 | Add last_seen to geo_cache |
| 4 | Add scheduler_lock table |
| 5 | Add indexes to history_archive |
| 6 | Add import_runs table for idempotent imports |
| 7 | Add indexes to import_log |
| 8 | Migrate import_log.timestamp TEXT→INTEGER UNIX |
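
Migration 8 relies on SQLite's `strftime('%s', ...)` to turn an ISO 8601 string into UNIX seconds. The conversion can be tried on its own with the sketch below, which uses the standard-library `sqlite3` module; `iso_to_unix` is a hypothetical helper written for this example.

```python
import sqlite3


def iso_to_unix(iso_text: str) -> int:
    """Convert an ISO 8601 UTC timestamp to UNIX seconds via SQLite's strftime('%s')."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute(
            "SELECT CAST(strftime('%s', ?) AS INTEGER)", (iso_text,)
        ).fetchone()
    finally:
        conn.close()
    if row[0] is None:
        # strftime returns NULL when it cannot parse the input.
        raise ValueError(f"SQLite could not parse timestamp: {iso_text!r}")
    return row[0]
```

For example, `2024-06-15T13:45:00.000Z` converts to `1718459100`. Because unparseable values come back as NULL, it is worth checking the new column for NULLs before the original `timestamp` column is dropped.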

## Adding New Migrations

1. Increment `_CURRENT_SCHEMA_VERSION` in `backend/app/db.py`
2. Add migration script to `_MIGRATIONS` dict with new version key
3. Prefer `CREATE TABLE IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS` so the statement is idempotent; `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form in SQLite and relies on the transactional rollback instead
4. Test with `test_apply_migration_is_atomic_rollback` pattern
5. Update this document with migration description
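
Steps 1 and 2 are easy to get out of sync, for example by bumping the version constant without adding the script. A small consistency check, sketched below, can catch that in a unit test. The function `validate_migrations` is a hypothetical helper; only the idea of a version constant plus a dict keyed by version comes from `backend/app/db.py`.

```python
def validate_migrations(current_version: int, migrations: dict[int, str]) -> None:
    """Raise ValueError unless the keys run from 1 to current_version with no gaps."""
    expected = list(range(1, current_version + 1))
    actual = sorted(migrations)
    if actual != expected:
        raise ValueError(f"expected migration versions {expected}, found {actual}")
```

Calling it with the module's version constant and migration dict at import time or in a test fails fast when a version is skipped or the constant is stale.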

## Long-Running Migrations

For migrations that modify large tables:

- Use `ALTER TABLE ADD COLUMN` (a schema-only change in SQLite, so its cost does not depend on the number of rows)
- Avoid `CREATE INDEX CONCURRENTLY` (SQLite does not support this)
- For table rebuilds, split into phases with explicit progress tracking
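
One way to split a large backfill into phases is to update a bounded number of rows per transaction and use the still-empty target column as the progress marker. The sketch below shows this for the `timestamp` to `timestamp_unix` conversion from migration 8, using the standard-library `sqlite3` module. `backfill_in_batches` and its `batch_size` parameter are assumptions for illustration. Running batches in separate transactions gives up the single-transaction atomicity described earlier, so this is a trade-off rather than a drop-in replacement.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill timestamp_unix from the TEXT timestamp column, one committed batch at a time."""
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE import_log
            SET timestamp_unix = CAST(strftime('%s', timestamp) AS INTEGER)
            WHERE id IN (
                SELECT id FROM import_log
                WHERE timestamp_unix IS NULL
                  AND strftime('%s', timestamp) IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # each batch is its own short transaction
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
```

Rows that are already converted are skipped, so an interrupted run can simply be started again.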

## Disaster Recovery Checklist

If the database is corrupted after a migration failure:

- [ ] Stop all BanGUI instances
- [ ] Back up `bangui.db`, `bangui.db-wal`, `bangui.db-shm`
- [ ] Run `PRAGMA integrity_check;`
- [ ] Identify last successful migration version
- [ ] Delete `schema_migrations` rows for failed migrations
- [ ] Either: manually fix migration, or restore from backup
- [ ] Restart application
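
The integrity check from the list above can be scripted so that it runs the same way every time. This is a minimal sketch using the standard-library `sqlite3` module; `integrity_errors` is a hypothetical helper name.

```python
import sqlite3


def integrity_errors(db_path: str) -> list[str]:
    """Return problems found by PRAGMA integrity_check; an empty list means healthy."""
    conn = sqlite3.connect(db_path)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    # SQLite reports a single row containing 'ok' when nothing is wrong.
    return [] if rows == ["ok"] else rows
```

Run it against a copy of the database files rather than the originals, so the backup taken earlier in the checklist stays untouched.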
204
Docs/Tasks.md
@@ -1,207 +1,3 @@
### Issue #6: HIGH - Inconsistent API Response Format Between Endpoints

**Where found**:
- `backend/app/routers/bans.py` - Returns `CommandResponse(success: bool, message: str)`
- `backend/app/routers/history.py` - Returns `HistoryListResponse(bans: [], pagination: {})`
- `backend/app/routers/status.py` - Returns `dict[str, str]` with no consistent structure
- Various routers return different error formats

**Why this is needed**:
The frontend cannot assume a consistent structure and must write different parsing code for each endpoint. Type safety is compromised, and mapping the backend schema to frontend types becomes fragile.

**Goal**:
Standardize on a unified response format across all endpoints to ensure type safety and consistency.

**What to do**:
1. Define standard response format:
   ```python
   from __future__ import annotations

   from dataclasses import dataclass
   from typing import Any, Generic, TypeVar

   T = TypeVar("T")


   @dataclass
   class ApiResponse(Generic[T]):
       success: bool
       data: T | None
       error: str | None
       error_code: str | None = None
       pagination: Pagination | None = None  # Pagination is assumed to be defined elsewhere
       metadata: dict[str, Any] | None = None
   ```
2. Update all routers to use this format
3. Map endpoint-specific responses into `data` field
4. Add error codes for different error types
5. Update frontend to parse this format
6. Add validation that all endpoints conform to format
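
As an illustration of steps 2 to 4, the sketch below wraps a success payload and an error in the same envelope. It is a simplified, self-contained variant of the proposed `ApiResponse`: the `pagination` field is left out, and the helper names `ok` and `fail` and the `BAN_NOT_FOUND` code are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class ApiResponse(Generic[T]):
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None
    error_code: Optional[str] = None
    metadata: Optional[dict[str, Any]] = None


def ok(data: T) -> ApiResponse[T]:
    """Wrap an endpoint-specific payload in the standard envelope."""
    return ApiResponse(success=True, data=data)


def fail(message: str, code: str) -> ApiResponse[Any]:
    """Build an error envelope with a machine-readable error code."""
    return ApiResponse(success=False, error=message, error_code=code)
```

`asdict(ok({"bans": []}))` and `asdict(fail("No such ban", "BAN_NOT_FOUND"))` then produce dictionaries with identical keys, which is the property the frontend needs.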

**Possible traps and issues**:
- Backward compatibility breaks (clients expecting old format fail)
- Existing frontend code expects old response format
- Performance impact of additional wrapping (measure and optimize)
- Migration burden - all routers need updates

**Docs changes needed**:
- Update API documentation with new response format
- Add examples of standard response across different scenarios
- Create migration guide for API consumers

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "2.1 Inconsistent Response Format"

---

### Issue #7: HIGH - Docker Health Check Fails in Production

**Where found**:
- `Docker/compose.prod.yml` (lines 40-47)
- Backend health check runs Python in subprocess but image has no Python in runtime
- Frontend health check uses `wget` that may not exist

**Why this is needed**:
Health checks always fail, so the container is marked unhealthy and Kubernetes or another orchestrator evicts the pod on the assumption that the service is down. This cascades into service outages.

**Goal**:
Implement reliable health checks that actually verify service health without depending on missing tools.

**What to do**:
1. Replace Python subprocess check with simple HTTP request:
   ```yaml
   healthcheck:
     test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
   ```
2. Implement comprehensive health check endpoint that verifies:
   - Database is accessible
   - fail2ban socket is reachable
   - Background scheduler is healthy
   - Cache systems are initialized
3. Return 503 if any component is unhealthy
4. Add timeout and retry parameters:
   ```yaml
   interval: 30s
   timeout: 10s
   retries: 3
   start_period: 40s
   ```
5. Test health check in both Alpine and full Python images
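
The health router added in this commit derives the overall status as follows: `unavailable` with HTTP 503 when fail2ban is offline, `degraded` with HTTP 200 when another component is unhealthy, and `ok` otherwise. The sketch below isolates that decision as a pure function; `overall_health` is a hypothetical name used for illustration, not a function in the codebase.

```python
from typing import Literal

OverallStatus = Literal["ok", "degraded", "unavailable"]


def overall_health(
    fail2ban_online: bool, unhealthy_components: list[str]
) -> tuple[OverallStatus, int]:
    """Map component results to an overall status and HTTP status code."""
    if not fail2ban_online:
        return "unavailable", 503
    if unhealthy_components:
        return "degraded", 200
    return "ok", 200
```

With this mapping, a Docker health check that only inspects the HTTP status treats a degraded service as healthy, which differs from step 3 above.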

**Possible traps and issues**:
- curl might not be installed in lightweight images (use `wget` as fallback or multi-tool)
- Health check runs frequently (every 30s) - must be lightweight
- Spurious failures if the timeout is too short
- Can't check fail2ban socket during initial startup (timing issue)

**Docs changes needed**:
- Add troubleshooting section to `Docs/TROUBLESHOOTING.md` - "502 Bad Gateway errors"
- Document health check behavior in `Docs/Deployment.md`

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.2 Missing Health Checks in Production"
- `Docker/compose.prod.yml` - Current health check definition

---

### Issue #8: HIGH - Database Migration Fragility (Rollback Missing)

**Where found**:
- `backend/app/startup.py` (lines 105-130) - `run_migrations()`
- `backend/app/db.py` (lines 132-142) - Schema migration table
- No transaction wrapping migrations

**Why this is needed**:
If a migration fails partway through, the database schema becomes inconsistent:
- Some tables created, others not
- Next startup tries to re-apply migration
- Second attempt fails because column already exists
- Application can't start

**Goal**:
Implement transactional migrations with automatic rollback on failure.

**What to do**:
1. Wrap each migration in explicit transaction:
   ```python
   import time

   # `db` is assumed to be an open aiosqlite connection in the enclosing scope.
   async def run_migration(name: str, migration_func):
       try:
           await db.execute("BEGIN EXCLUSIVE")
           await migration_func()
           await db.execute(
               "INSERT INTO schema_migrations (name, applied_at) VALUES (?, ?)",
               (name, int(time.time())),  # parameters are passed as one sequence
           )
           await db.execute("COMMIT")
       except Exception:
           await db.execute("ROLLBACK")
           raise
   ```
2. Add idempotent migration checks (CREATE TABLE IF NOT EXISTS)
3. Implement downtime procedure for failed migrations
4. Add migration rollback capability
5. Test migration failures and recovery

**Possible traps and issues**:
- SQLite WAL mode can leave orphaned `-wal` files after crashes (handle cleanup)
- Long-running migrations might time out
- Complex migrations spanning multiple steps are hard to make atomic
- Rollback procedure must be well-documented and tested
- Development vs production migration paths might diverge

**Docs changes needed**:
- Create `Docs/DATABASE_MIGRATIONS.md` explaining migration process
- Add disaster recovery section to `Docs/Deployment.md`
- Document in `Docs/TROUBLESHOOTING.md` - "Database migration failed"

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.3 Database Migration Issues"
- `backend/app/startup.py` - Migration runner

---

### Issue #9: HIGH - No Graceful Shutdown Handling

**Where found**:
- `Docker/Dockerfile.backend` (lines 75-76) - Entrypoint doesn't handle signals
- `backend/app/main.py` - Lifespan doesn't gracefully shutdown
- No SIGTERM handler in background tasks
- Container receives SIGKILL without cleanup opportunity

**Why this is needed**:
Without graceful shutdown:
- In-flight ban requests don't complete
- Background jobs interrupted mid-import
- Incomplete blocklist imports leave stale data
- Database connections don't close properly

**Goal**:
Implement graceful shutdown that allows pending operations to complete before process exits.

**What to do**:
1. Implement lifespan context manager that handles shutdown:
   ```python
   import asyncio
   from contextlib import asynccontextmanager

   from fastapi import FastAPI


   @asynccontextmanager
   async def lifespan(app: FastAPI):
       # Startup
       yield
       # Shutdown: wait for pending tasks, excluding this one to avoid awaiting itself
       current = asyncio.current_task()
       tasks = [t for t in asyncio.all_tasks() if t is not current and not t.done()]
       await asyncio.gather(*tasks, return_exceptions=True)
   ```
2. Allow a grace period when stopping the container, e.g. `docker stop --time=30` (an option of `docker stop`, not a Dockerfile setting)
3. Handle SIGTERM signal to trigger shutdown
4. Drain in-flight requests before exit
5. Close database connections and scheduler cleanly
6. Add logging for shutdown events
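
Waiting for tasks without a deadline can hang the shutdown, so a bounded wait followed by cancellation is one way to combine the steps above. The sketch below uses only `asyncio` from the standard library. `drain_tasks` and its `timeout` parameter are hypothetical names, and the snippet is an illustration rather than BanGUI's lifespan code.

```python
import asyncio


async def drain_tasks(timeout: float) -> int:
    """Wait up to `timeout` seconds for other tasks, then cancel stragglers.

    Returns the number of tasks that had to be cancelled.
    """
    current = asyncio.current_task()
    pending = {t for t in asyncio.all_tasks() if t is not current and not t.done()}
    if not pending:
        return 0
    _, still_running = await asyncio.wait(pending, timeout=timeout)
    for task in still_running:
        task.cancel()
    # Let cancelled tasks run their cleanup handlers before returning.
    await asyncio.gather(*still_running, return_exceptions=True)
    return len(still_running)
```

The timeout should stay below the container's stop grace period so that cleanup finishes before the process receives SIGKILL.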

**Possible traps and issues**:
- Tasks might take longer than shutdown timeout (configure appropriate timeout)
- Hanging tasks won't terminate (need proper cancellation)
- Scheduler might reject new jobs during shutdown
- Race conditions between shutdown and new requests

**Docs changes needed**:
- Add deployment best practices to `Docs/Deployment.md`
- Document graceful shutdown in `Docs/TROUBLESHOOTING.md`

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.4 No Graceful Shutdown"

---

### Issue #10: HIGH - Database Type Inconsistency (Timestamps Mixed Across Tables)

**Where found**:

@@ -9,6 +9,10 @@ The fail2ban database is separate and is accessed read-only by the history
|
||||
and ban services.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import aiosqlite
|
||||
import structlog
|
||||
|
||||
@@ -107,7 +111,7 @@ _SCHEMA_STATEMENTS: list[str] = [
|
||||
_CREATE_HISTORY_ARCHIVE,
|
||||
]
|
||||
|
||||
_CURRENT_SCHEMA_VERSION: int = 7
|
||||
_CURRENT_SCHEMA_VERSION: int = 8
|
||||
|
||||
_MIGRATIONS: dict[int, str] = {
|
||||
1: "\n".join(_SCHEMA_STATEMENTS),
|
||||
@@ -201,6 +205,17 @@ CREATE INDEX IF NOT EXISTS idx_import_log_id_desc
|
||||
-- Composite index for source_id + id DESC ordering (filtered pagination)
|
||||
CREATE INDEX IF NOT EXISTS idx_import_log_source_id_desc
|
||||
ON import_log (source_id, id DESC);
|
||||
""",
|
||||
8: """
|
||||
-- Migration 8: Migrate import_log.timestamp from TEXT ISO 8601 to INTEGER UNIX epoch.
|
||||
-- Standardizes all BanGUI timestamps on INTEGER UNIX (seconds since epoch).
|
||||
-- This aligns import_log with history_archive which already uses INTEGER timeofban.
|
||||
-- TEXT ISO 8601: "2024-06-15T13:45:00.000Z"
|
||||
-- INTEGER UNIX: 1718459100
|
||||
ALTER TABLE import_log ADD COLUMN timestamp_unix INTEGER;
|
||||
UPDATE import_log SET timestamp_unix = strftime('%s', timestamp);
|
||||
ALTER TABLE import_log DROP COLUMN timestamp;
|
||||
ALTER TABLE import_log RENAME COLUMN timestamp_unix TO timestamp;
|
||||
""",
|
||||
}
|
||||
|
||||
@@ -218,6 +233,31 @@ async def _configure_connection(db: aiosqlite.Connection) -> None:
|
||||
await db.execute("PRAGMA busy_timeout=5000;")


async def _cleanup_wal_files(db_path: str) -> None:
    """Remove orphaned WAL files after crashes.

    When SQLite crashes in WAL mode, it may leave behind stale .wal and .shm
    files that prevent the database from opening properly. This function removes
    them if they exist and are not in use by any connection.

    The actual recovery is done by SQLite automatically when opening the database.
    This just cleans up orphaned files from previous crashes.

    Args:
        db_path: Path to the database file.
    """
    wal_path = Path(db_path + "-wal")
    shm_path = Path(db_path + "-shm")

    for path in (wal_path, shm_path):
        if path.exists():
            try:
                path.unlink()
                log.warning("orphaned_sqlite_file_removed", path=str(path))
            except OSError:
                pass  # File in use or permission denied


async def _get_current_schema_version(db: aiosqlite.Connection) -> int:
|
||||
"""Return the highest applied schema version for the given database."""
|
||||
await db.execute(_CREATE_SCHEMA_MIGRATIONS)
|
||||
@@ -380,6 +420,7 @@ async def open_db(database_path: str) -> aiosqlite.Connection:
|
||||
Returns:
|
||||
A configured :class:`aiosqlite.Connection` instance.
|
||||
"""
|
||||
await _cleanup_wal_files(database_path)
|
||||
db = await aiosqlite.connect(database_path)
|
||||
db.row_factory = aiosqlite.Row
|
||||
await _configure_connection(db)
|
||||
|
||||
@@ -64,7 +64,7 @@ class ImportLogEntry(BanGuiBaseModel):
|
||||
id: int
|
||||
source_id: int | None
|
||||
source_url: str
|
||||
timestamp: str
|
||||
timestamp: int
|
||||
ips_imported: int
|
||||
ips_skipped: int
|
||||
errors: str | None
|
||||
|
||||
@@ -328,37 +328,87 @@ class ErrorResponse(BanGuiBaseModel):
|
||||
)
|
||||
|
||||
|
||||
class ComponentHealth(BanGuiBaseModel):
|
||||
"""Health status of a single application component.
|
||||
|
||||
Fields:
|
||||
name: Human-readable component name.
|
||||
healthy: True when the component is operational.
|
||||
message: Optional detail message (e.g., error description).
|
||||
"""
|
||||
|
||||
name: str = Field(..., description="Component name.")
|
||||
healthy: bool = Field(..., description="True when the component is operational.")
|
||||
message: str | None = Field(
|
||||
default=None,
|
||||
description="Optional detail message, e.g. error description.",
|
||||
)
|
||||
|
||||
|
||||
class HealthResponse(BanGuiBaseModel):
|
||||
"""Standardized response for the health check endpoint.
|
||||
|
||||
Fields:
|
||||
status: Application health status — 'ok' when healthy, 'unavailable' otherwise.
|
||||
status: Application health status — 'ok' when all components are healthy,
|
||||
'degraded' when some components are unhealthy but the service can still
|
||||
handle requests, 'unavailable' when fail2ban is offline.
|
||||
fail2ban: fail2ban daemon status — 'online' or 'offline'.
|
||||
database: Database connectivity — 'ok' or 'error'.
|
||||
scheduler: Background scheduler status — 'running', 'stopped', or 'unknown'.
|
||||
cache: Cache initialization status — 'initialised' or 'uninitialised'.
|
||||
components: Per-component health detail list (empty when all healthy).
|
||||
|
||||
Example:
|
||||
```python
|
||||
# Healthy (HTTP 200)
|
||||
{
|
||||
"status": "ok",
|
||||
"fail2ban": "online"
|
||||
"fail2ban": "online",
|
||||
"database": "ok",
|
||||
"scheduler": "running",
|
||||
"cache": "initialised",
|
||||
"components": []
|
||||
}
|
||||
|
||||
# Unhealthy (HTTP 503)
|
||||
{
|
||||
"status": "unavailable",
|
||||
"fail2ban": "offline"
|
||||
"fail2ban": "offline",
|
||||
"database": "ok",
|
||||
"scheduler": "running",
|
||||
"cache": "initialised",
|
||||
"components": [{"name": "fail2ban", "healthy": false, "message": "Socket not reachable"}]
|
||||
}
|
||||
```
|
||||
"""
|
||||
|
||||
status: Literal["ok", "unavailable"] = Field(
|
||||
status: Literal["ok", "degraded", "unavailable"] = Field(
|
||||
...,
|
||||
description="Application health status: 'ok' when healthy, 'unavailable' otherwise.",
|
||||
description=(
|
||||
"Application health status: 'ok' when healthy, 'degraded' when some "
|
||||
"components are unhealthy, 'unavailable' when fail2ban is offline."
|
||||
),
|
||||
)
|
||||
fail2ban: Literal["online", "offline"] = Field(
|
||||
...,
|
||||
description="fail2ban daemon status: 'online' when reachable, 'offline' otherwise.",
|
||||
)
|
||||
database: Literal["ok", "error"] = Field(
|
||||
...,
|
||||
description="Database connectivity: 'ok' when accessible, 'error' when not.",
|
||||
)
|
||||
scheduler: Literal["running", "stopped", "unknown"] = Field(
|
||||
...,
|
||||
description="Background scheduler status: 'running', 'stopped', or 'unknown'.",
|
||||
)
|
||||
cache: Literal["initialised", "uninitialised"] = Field(
|
||||
...,
|
||||
description="Cache initialization status: 'initialised' when ready, 'uninitialised' when not.",
|
||||
)
|
||||
components: list[ComponentHealth] = Field(
|
||||
default_factory=list,
|
||||
description="Per-component health detail list. Empty when status is 'ok'.",
|
||||
)
|
||||
|
||||
|
||||
class FlushLogsResponse(BanGuiBaseModel):
|
||||
|
||||
@@ -50,12 +50,15 @@ async def add_log(
|
||||
Returns:
|
||||
Primary key of the inserted row.
|
||||
"""
|
||||
import time
|
||||
|
||||
timestamp_unix: int = int(time.time())
|
||||
cursor = await db.execute(
|
||||
"""
|
||||
INSERT INTO import_log (source_id, source_url, ips_imported, ips_skipped, errors)
|
||||
VALUES (?, ?, ?, ?, ?)
|
||||
INSERT INTO import_log (source_id, source_url, timestamp, ips_imported, ips_skipped, errors)
|
||||
VALUES (?, ?, ?, ?, ?, ?)
|
||||
""",
|
||||
(source_id, source_url, ips_imported, ips_skipped, errors),
|
||||
(source_id, source_url, timestamp_unix, ips_imported, ips_skipped, errors),
|
||||
)
|
||||
await db.commit()
|
||||
return int(cursor.lastrowid) # type: ignore[arg-type]
|
||||
|
||||
@@ -1,43 +1,135 @@
|
||||
"""Health check router.
|
||||
|
||||
A lightweight ``GET /api/health`` endpoint that verifies the application
|
||||
A lightweight ``GET /api/v1/health`` endpoint that verifies the application
|
||||
is running and can serve requests. Also reports the cached fail2ban liveness
|
||||
state so monitoring tools and Docker health checks can observe daemon status
|
||||
without probing the socket directly.
|
||||
|
||||
Comprehensive checks performed:
|
||||
- Database connectivity
|
||||
- fail2ban socket reachability (via cached server_status)
|
||||
- Background scheduler health
|
||||
- Session cache initialization
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Annotated, Literal
|
||||
|
||||
import structlog
|
||||
from fastapi import APIRouter, status
|
||||
from fastapi.responses import JSONResponse
|
||||
|
||||
from app.dependencies import ServerStatusDep
|
||||
from app.models.response import HealthResponse
|
||||
from app.dependencies import AppStateDep, ServerStatusDep
|
||||
from app.models.response import ComponentHealth, HealthResponse
|
||||
|
||||
router: APIRouter = APIRouter(prefix="/api/v1/health", tags=["Health"])
|
||||
|
||||
log: structlog.stdlib.BoundLogger = structlog.get_logger()
|
||||
|
||||
|
||||
@router.get("", summary="Application health check", response_model=HealthResponse)
|
||||
async def health_check(server_status: ServerStatusDep) -> JSONResponse:
|
||||
"""Return application and fail2ban status.
|
||||
async def health_check(
|
||||
app_state: AppStateDep,
|
||||
server_status: ServerStatusDep,
|
||||
) -> JSONResponse:
|
||||
"""Return application and component status.
|
||||
|
||||
Returns HTTP 200 if fail2ban is online, HTTP 503 if offline.
|
||||
Docker health checks interpret 503 as unhealthy and restart the container
|
||||
if fail2ban remains unreachable, ensuring the backend only runs when
|
||||
fail2ban is available.
|
||||
Performs lightweight checks on key application components and returns
|
||||
HTTP 200 if all healthy, HTTP 503 if fail2ban is offline.
|
||||
|
||||
Docker/orchestration health checks interpret 503 as unhealthy and restart
|
||||
the container if fail2ban remains unreachable.
|
||||
|
||||
Args:
|
||||
app_state: Injected application state containing runtime components.
|
||||
server_status: Injected cached server status snapshot.
|
||||
|
||||
Returns:
|
||||
HTTP 200 with :class:`~app.models.response.HealthResponse` when healthy,
|
||||
HTTP 503 with :class:`~app.models.response.HealthResponse` when fail2ban is offline.
|
||||
HTTP 503 with :class:`~app.models.response.HealthResponse` when fail2ban
|
||||
is offline.
|
||||
"""
|
||||
if not server_status.online:
|
||||
return JSONResponse(
|
||||
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
|
||||
content=HealthResponse(status="unavailable", fail2ban="offline").model_dump(),
|
||||
components: list[ComponentHealth] = []
|
||||
|
||||
# --- Database check ---
|
||||
db_healthy: bool = True
|
||||
try:
|
||||
|
||||
from app.config import Settings
|
||||
from app.db import open_db
|
||||
|
||||
effective_settings: Settings = (
|
||||
app_state.runtime_settings if app_state.runtime_settings is not None else app_state.settings
|
||||
)
|
||||
test_db = await open_db(effective_settings.database_path)
|
||||
await test_db.close()
|
||||
except Exception as exc: # pragma: no cover - defensive, all paths logged
|
||||
log.warning("health_check_db_failed", error=str(exc))
|
||||
db_healthy = False
|
||||
components.append(
|
||||
ComponentHealth(name="database", healthy=False, message="Connection failed"),
|
||||
)
|
||||
|
||||
return JSONResponse(
|
||||
status_code=status.HTTP_200_OK,
|
||||
content=HealthResponse(status="ok", fail2ban="online").model_dump(),
|
||||
# --- Scheduler check ---
|
||||
scheduler_state: Literal["running", "stopped", "unknown"] = "unknown"
|
||||
try:
|
||||
scheduler = app_state.scheduler
|
||||
if scheduler is not None and getattr(scheduler, "running", False):
|
||||
scheduler_state = "running"
|
||||
elif scheduler is not None:
|
||||
scheduler_state = "stopped"
|
||||
else:
|
||||
scheduler_state = "unknown"
|
||||
components.append(
|
||||
ComponentHealth(name="scheduler", healthy=False, message="Not initialised"),
|
||||
)
|
||||
except Exception: # pragma: no cover - defensive
|
||||
scheduler_state = "unknown"
|
||||
components.append(
|
||||
ComponentHealth(name="scheduler", healthy=False, message="Not accessible"),
|
||||
)
|
||||
|
||||
# --- Cache check ---
|
||||
cache_state: Literal["initialised", "uninitialised"] = "initialised"
|
||||
try:
|
||||
if app_state.session_cache is not None:
|
||||
cache_state = "initialised"
|
||||
else:
|
||||
cache_state = "uninitialised"
|
||||
components.append(
|
||||
ComponentHealth(name="cache", healthy=False, message="Not initialised"),
|
||||
)
|
||||
except Exception: # pragma: no cover - defensive
|
||||
cache_state = "uninitialised"
|
||||
|
||||
# --- fail2ban ---
|
||||
fail2ban_online: bool = server_status.online
|
||||
if not fail2ban_online:
|
||||
components.append(
|
||||
ComponentHealth(name="fail2ban", healthy=False, message="Socket not reachable"),
|
||||
)
|
||||
|
||||
# --- Overall status ---
|
||||
overall_status: Literal["ok", "degraded", "unavailable"]
|
||||
if not fail2ban_online:
|
||||
overall_status = "unavailable"
|
||||
http_status: int = status.HTTP_503_SERVICE_UNAVAILABLE
|
||||
elif components:
|
||||
overall_status = "degraded"
|
||||
http_status = status.HTTP_200_OK
|
||||
else:
|
||||
overall_status = "ok"
|
||||
http_status = status.HTTP_200_OK
|
||||
|
||||
return JSONResponse(
|
||||
status_code=http_status,
|
||||
content=HealthResponse(
|
||||
status=overall_status,
|
||||
fail2ban="online" if fail2ban_online else "offline",
|
||||
database="ok" if db_healthy else "error",
|
||||
scheduler=scheduler_state,
|
||||
cache=cache_state,
|
||||
components=components,
|
||||
).model_dump(),
|
||||
)
|
||||
|
||||
@@ -62,12 +62,19 @@ async def client(test_settings: Settings) -> AsyncClient: # type: ignore[misc]
|
||||
Yields:
|
||||
An :class:`httpx.AsyncClient` with ``base_url="http://test"``.
|
||||
"""
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
app = create_app(settings=test_settings)
|
||||
|
||||
# Ensure fail2ban is reported as online for tests (mock socket is not
|
||||
# actually connected so we need to set the cached status manually).
|
||||
app.state.server_status = ServerStatus(online=True)
|
||||
|
||||
# Mock scheduler for health check tests (lifespan not run in ASGITransport tests)
|
||||
mock_scheduler = MagicMock()
|
||||
mock_scheduler.running = True
|
||||
app.state.scheduler = mock_scheduler
|
||||
|
||||
# Bootstrap the database schema before making requests. ASGITransport
|
||||
# does not run the application lifespan, so we create the test SQLite file
|
||||
# directly rather than relying on startup logic.
|
||||
|
||||
@@ -7,6 +7,7 @@ import pytest
|
||||
|
||||
from app.db import (
|
||||
_apply_migration,
|
||||
_cleanup_wal_files,
|
||||
_parse_migration_statements,
|
||||
init_db,
|
||||
open_db,
|
||||
@@ -241,3 +242,32 @@ async def test_init_db_idempotent(tmp_path: Path) -> None:
|
||||
finally:
|
||||
await db.close()
|
||||
|
||||
|
||||
async def test_cleanup_wal_files_removes_orphaned_files(tmp_path: Path) -> None:
|
||||
"""Test that _cleanup_wal_files removes orphaned WAL and SHM files."""
|
||||
db_path = str(tmp_path / "test_wal.db")
|
||||
wal_path = Path(db_path + "-wal")
|
||||
shm_path = Path(db_path + "-shm")
|
||||
|
||||
# Create the orphaned files
|
||||
wal_path.write_text("orphan")
|
||||
shm_path.write_text("orphan")
|
||||
|
||||
assert wal_path.exists()
|
||||
assert shm_path.exists()
|
||||
|
||||
# Run cleanup
|
||||
await _cleanup_wal_files(db_path)
|
||||
|
||||
# Both files should be removed
|
||||
assert not wal_path.exists()
|
||||
assert not shm_path.exists()
|
||||
|
||||
|
||||
async def test_cleanup_wal_files_handles_missing_files(tmp_path: Path) -> None:
|
||||
"""Test that _cleanup_wal_files handles non-existent files gracefully."""
|
||||
db_path = str(tmp_path / "nonexistent.db")
|
||||
|
||||
# Should not raise
|
||||
await _cleanup_wal_files(db_path)
|
||||
|
||||
|
||||
@@ -8,15 +8,14 @@ from app.models.server import ServerStatus
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_health_check_returns_200_when_online(client: AsyncClient) -> None:
|
||||
"""``GET /api/health`` must return HTTP 200 when fail2ban is online."""
|
||||
client._transport.app.state.server_status = ServerStatus(online=True)
|
||||
"""``GET /api/v1/health`` must return HTTP 200 when fail2ban is online."""
|
||||
response = await client.get("/api/v1/health")
|
||||
assert response.status_code == 200
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_health_check_returns_503_when_offline(client: AsyncClient) -> None:
|
||||
"""``GET /api/health`` must return HTTP 503 when fail2ban is offline."""
|
||||
"""``GET /api/v1/health`` must return HTTP 503 when fail2ban is offline."""
|
||||
client._transport.app.state.server_status = ServerStatus(online=False)
|
||||
response = await client.get("/api/v1/health")
|
||||
assert response.status_code == 503
|
||||
@@ -24,27 +23,84 @@ async def test_health_check_returns_503_when_offline(client: AsyncClient) -> Non
 
 @pytest.mark.asyncio
 async def test_health_check_returns_ok_status_when_online(client: AsyncClient) -> None:
-    """``GET /api/health`` must contain ``status: ok`` when fail2ban is online."""
-    client._transport.app.state.server_status = ServerStatus(online=True)
+    """``GET /api/v1/health`` must contain ``status: ok`` when fail2ban is online."""
     response = await client.get("/api/v1/health")
-    data: dict[str, str] = response.json()
+    data: dict[str, object] = response.json()
     assert data["status"] == "ok"
     assert data["fail2ban"] == "online"
 
 
 @pytest.mark.asyncio
 async def test_health_check_returns_unavailable_when_offline(client: AsyncClient) -> None:
-    """``GET /api/health`` must contain ``status: unavailable`` when fail2ban is offline."""
+    """``GET /api/v1/health`` must contain ``status: unavailable`` when fail2ban is offline."""
     client._transport.app.state.server_status = ServerStatus(online=False)
     response = await client.get("/api/v1/health")
-    data: dict[str, str] = response.json()
+    data: dict[str, object] = response.json()
     assert data["status"] == "unavailable"
     assert data["fail2ban"] == "offline"
 
 
 @pytest.mark.asyncio
 async def test_health_check_content_type_is_json(client: AsyncClient) -> None:
-    """``GET /api/health`` must set the ``Content-Type`` header to JSON."""
+    """``GET /api/v1/health`` must set the ``Content-Type`` header to JSON."""
     response = await client.get("/api/v1/health")
     assert "application/json" in response.headers.get("content-type", "")
+
+
+@pytest.mark.asyncio
+async def test_health_check_includes_database_status(client: AsyncClient) -> None:
+    """``GET /api/v1/health`` must include database status field."""
+    response = await client.get("/api/v1/health")
+    data: dict[str, object] = response.json()
+    assert "database" in data
+    assert data["database"] in ("ok", "error")
+
+
+@pytest.mark.asyncio
+async def test_health_check_includes_scheduler_status(client: AsyncClient) -> None:
+    """``GET /api/v1/health`` must include scheduler status field."""
+    response = await client.get("/api/v1/health")
+    data: dict[str, object] = response.json()
+    assert "scheduler" in data
+    assert data["scheduler"] in ("running", "stopped", "unknown")
+
+
+@pytest.mark.asyncio
+async def test_health_check_includes_cache_status(client: AsyncClient) -> None:
+    """``GET /api/v1/health`` must include cache status field."""
+    response = await client.get("/api/v1/health")
+    data: dict[str, object] = response.json()
+    assert "cache" in data
+    assert data["cache"] in ("initialised", "uninitialised")
+
+
+@pytest.mark.asyncio
+async def test_health_check_includes_components_list(client: AsyncClient) -> None:
+    """``GET /api/v1/health`` must include components list."""
+    response = await client.get("/api/v1/health")
+    data: dict[str, object] = response.json()
+    assert "components" in data
+    assert isinstance(data["components"], list)
+
+
+@pytest.mark.asyncio
+async def test_health_check_offline_adds_fail2ban_to_components(
+    client: AsyncClient,
+) -> None:
+    """When fail2ban is offline, it must appear in the components list."""
+    client._transport.app.state.server_status = ServerStatus(online=False)
+    response = await client.get("/api/v1/health")
+    data: dict[str, object] = response.json()
+    assert data["status"] == "unavailable"
+    components: list[dict[str, object]] = data["components"]  # type: ignore[assignment]
+    assert any(c.get("name") == "fail2ban" and c.get("healthy") is False for c in components)
+
+
+@pytest.mark.asyncio
+async def test_health_check_online_returns_empty_components(client: AsyncClient) -> None:
+    """When all components are healthy, components list must be empty."""
+    response = await client.get("/api/v1/health")
+    data: dict[str, object] = response.json()
+    assert data["status"] == "ok"
+    assert data["components"] == []
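Taken together, the health tests above describe the response contract: a top-level `status`, one string field per component, and a `components` list that names only the unhealthy parts. The sketch below shows one way such a payload could be assembled. `HealthInputs` and `build_health_payload` are hypothetical names, and the choice to let a database error also produce a 503 is an assumption, since the tests only exercise the fail2ban-offline case.

```python
from dataclasses import dataclass


@dataclass
class HealthInputs:
    """Hypothetical snapshot of the individual component probes."""

    fail2ban_online: bool
    database_ok: bool
    scheduler: str = "running"  # one of "running", "stopped", "unknown"
    cache_initialised: bool = True


def build_health_payload(inputs: HealthInputs) -> tuple[int, dict[str, object]]:
    """Return ``(http_status, json_body)`` matching the fields the tests check."""
    # Only unhealthy components are listed, so a healthy system yields [].
    components: list[dict[str, object]] = []
    if not inputs.fail2ban_online:
        components.append({"name": "fail2ban", "healthy": False})
    if not inputs.database_ok:
        # Assumption: a database failure is reported the same way as fail2ban.
        components.append({"name": "database", "healthy": False})

    healthy = not components
    body: dict[str, object] = {
        "status": "ok" if healthy else "unavailable",
        "fail2ban": "online" if inputs.fail2ban_online else "offline",
        "database": "ok" if inputs.database_ok else "error",
        "scheduler": inputs.scheduler,
        "cache": "initialised" if inputs.cache_initialised else "uninitialised",
        "components": components,
    }
    return (200 if healthy else 503), body
```

In this sketch the scheduler and cache fields are informational only and do not affect the status code; whether the real endpoint treats them that way is not visible in this diff.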
@@ -3,6 +3,11 @@ import { ArrowClockwiseRegular } from "@fluentui/react-icons";
 import { useCommonSectionStyles } from "../../components/commonStyles";
 import { useImportLog } from "../../hooks/useImportLog";
 import { useBlocklistStyles } from "./blocklistStyles";
+import { formatDate } from "../../utils/formatDate";
+
+function formatUnixTimestamp(unixTs: number, timezone: string | null | undefined): string {
+  return formatDate(new Date(unixTs * 1000).toISOString(), timezone);
+}
 
 export function BlocklistImportLogSection(): React.JSX.Element {
   const styles = useBlocklistStyles();
@@ -52,7 +57,7 @@ export function BlocklistImportLogSection(): React.JSX.Element {
 <TableRow key={entry.id} className={entry.errors ? styles.errorRow : undefined}>
   <TableCell>
     <TableCellLayout>
-      <span className={styles.mono}>{entry.timestamp}</span>
+      <span className={styles.mono}>{formatUnixTimestamp(entry.timestamp, undefined)}</span>
     </TableCellLayout>
   </TableCell>
   <TableCell>
@@ -41,7 +41,7 @@ export interface ImportLogEntry {
   id: number;
   source_id: number | null;
   source_url: string;
-  timestamp: string;
+  timestamp: number;
   ips_imported: number;
   ips_skipped: number;
   errors: string | null;