feat: comprehensive health check with DB, scheduler, cache

- Add /api/v1/health endpoint with component-level checks
- Verify DB connectivity, fail2ban socket, scheduler, session cache
- Add SQLite WAL cleanup on startup (orphan crash files)
- Migration 8: import_log.timestamp → INTEGER UNIX epoch
- Align import_log timestamps with history_archive (already UNIX int)
- Add unit tests for DB cleanup and health router

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
204
Docs/Tasks.md
@@ -1,207 +1,3 @@
### Issue #6: HIGH - Inconsistent API Response Format Between Endpoints

**Where found**:
- `backend/app/routers/bans.py` - Returns `CommandResponse(success: bool, message: str)`
- `backend/app/routers/history.py` - Returns `HistoryListResponse(bans: [], pagination: {})`
- `backend/app/routers/status.py` - Returns `dict[str, str]` with no consistent structure
- Various routers return different error formats

**Why this is needed**:
The frontend cannot assume a consistent structure and must write different parsing code for each endpoint. Type safety is compromised, and mapping the backend schema to frontend types becomes fragile.

**Goal**:
Standardize on a unified response format across all endpoints to ensure type safety and consistency.

**What to do**:
1. Define standard response format:
   ```python
   from __future__ import annotations  # defer evaluation of the annotations below

   from dataclasses import dataclass
   from typing import Any, Generic, TypeVar

   T = TypeVar("T")

   @dataclass
   class ApiResponse(Generic[T]):
       success: bool
       data: T | None
       error: str | None
       error_code: str | None = None
       pagination: Pagination | None = None  # Pagination: the project's own type
       metadata: dict[str, Any] | None = None
   ```
2. Update all routers to use this format
3. Map endpoint-specific responses into `data` field
4. Add error codes for different error types
5. Update frontend to parse this format
6. Add validation that all endpoints conform to format
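
Steps 2 to 4 could look roughly like the sketch below, in which two small helpers wrap either an endpoint-specific payload or an error in the envelope. The helper names `ok` and `fail`, the `BAN_NOT_FOUND` code, and the plain-dict stand-in for `Pagination` are illustrative assumptions, not existing project code.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any, Generic, TypeVar

T = TypeVar("T")


@dataclass
class ApiResponse(Generic[T]):
    success: bool
    data: T | None
    error: str | None
    error_code: str | None = None
    pagination: dict[str, Any] | None = None  # stand-in for the project's Pagination type
    metadata: dict[str, Any] | None = None


def ok(data: T, pagination: dict[str, Any] | None = None) -> ApiResponse[T]:
    """Wrap an endpoint-specific payload in the standard envelope (hypothetical helper)."""
    return ApiResponse(success=True, data=data, error=None, pagination=pagination)


def fail(message: str, error_code: str) -> ApiResponse[None]:
    """Build an error envelope with a machine-readable code (hypothetical helper)."""
    return ApiResponse(success=False, data=None, error=message, error_code=error_code)


# A history-style payload and an error, both in the same shape.
history = ok({"bans": []}, pagination={"page": 1, "page_size": 50})
missing = fail("Ban not found", error_code="BAN_NOT_FOUND")
print(asdict(history))
print(asdict(missing))
```

Because every router would return the same top-level keys, the frontend can branch on `success` once and treat `data` as the only endpoint-specific part.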

**Possible traps and issues**:
- Backward compatibility breaks (clients expecting old format fail)
- Existing frontend code expects old response format
- Performance impact of additional wrapping (measure and optimize)
- Migration burden - all routers need updates

**Docs changes needed**:
- Update API documentation with new response format
- Add examples of standard response across different scenarios
- Create migration guide for API consumers

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "2.1 Inconsistent Response Format"

---

### Issue #7: HIGH - Docker Health Check Fails in Production

**Where found**:
- `Docker/compose.prod.yml` (lines 40-47)
- Backend health check runs Python in a subprocess, but the runtime image has no Python
- Frontend health check uses `wget`, which may not exist in the image

**Why this is needed**:
Health checks always fail, so the container is marked unhealthy and Kubernetes or other orchestrators evict the pod, assuming the service is down. This cascades into service outages.

**Goal**:
Implement reliable health checks that actually verify service health without depending on missing tools.

**What to do**:
1. Replace Python subprocess check with simple HTTP request:
   ```yaml
   healthcheck:
     test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
   ```
2. Implement comprehensive health check endpoint that verifies:
   - Database is accessible
   - fail2ban socket is reachable
   - Background scheduler is healthy
   - Cache systems are initialized
3. Return 503 if any component is unhealthy
4. Add timeout and retry parameters:
   ```yaml
   interval: 30s
   timeout: 10s
   retries: 3
   start_period: 40s
   ```
5. Test health check in both Alpine and full Python images
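
The aggregation in steps 2 and 3 can be sketched independently of the web framework: run each component check, treat an exception as a failure, and derive the status code from the results. The function name `run_health_checks`, the component names, and the response shape are assumptions for illustration; the real checks would probe the database, the fail2ban socket, the scheduler, and the cache.

```python
from typing import Callable


def run_health_checks(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run each component check and derive the HTTP status code."""
    components: dict[str, str] = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "unhealthy"
        except Exception as exc:  # a crashing check counts as unhealthy
            components[name] = f"unhealthy: {exc}"
    healthy = all(state == "ok" for state in components.values())
    body = {"status": "healthy" if healthy else "unhealthy", "components": components}
    # 503 tells the orchestrator the service is not ready to serve traffic.
    return (200 if healthy else 503), body


def socket_down() -> bool:
    """Simulated failing probe standing in for the fail2ban socket check."""
    raise ConnectionRefusedError("fail2ban socket not reachable")


code, body = run_health_checks({
    "database": lambda: True,
    "fail2ban": socket_down,
    "scheduler": lambda: True,
    "cache": lambda: True,
})
print(code, body["components"])
```

Keeping each probe a cheap callable also makes it easier to keep the whole check lightweight, since it runs on every health check interval.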

**Possible traps and issues**:
- curl might not be installed in lightweight images (use `wget` as fallback or multi-tool)
- Health check runs frequently (every 30s) - must be lightweight
- False positives if timeout is too short
- Can't check fail2ban socket during initial startup (timing issue)

**Docs changes needed**:
- Add troubleshooting section to `Docs/TROUBLESHOOTING.md` - "502 Bad Gateway errors"
- Document health check behavior in `Docs/Deployment.md`

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.2 Missing Health Checks in Production"
- `Docker/compose.prod.yml` - Current health check definition

---

### Issue #8: HIGH - Database Migration Fragility (Rollback Missing)

**Where found**:
- `backend/app/startup.py` (lines 105-130) - `run_migrations()`
- `backend/app/db.py` (lines 132-142) - Schema migration table
- No transaction wrapping migrations

**Why this is needed**:
If a migration fails partway through, the database schema becomes inconsistent:
- Some tables created, others not
- Next startup tries to re-apply migration
- Second attempt fails because column already exists
- Application can't start

**Goal**:
Implement transactional migrations with automatic rollback on failure.

**What to do**:
1. Wrap each migration in explicit transaction:
   ```python
   import time

   async def run_migration(name: str, migration_func):
       # `db` is the application's open async SQLite connection
       try:
           await db.execute("BEGIN EXCLUSIVE")
           await migration_func()
           await db.execute(
               "INSERT INTO schema_migrations (name, applied_at) VALUES (?, ?)",
               (name, int(time.time())),  # bind parameters as one sequence
           )
           await db.execute("COMMIT")
       except Exception:
           # Undo the schema changes and the bookkeeping row together
           await db.execute("ROLLBACK")
           raise
   ```
2. Add idempotent migration checks (CREATE TABLE IF NOT EXISTS)
3. Implement downtime procedure for failed migrations
4. Add migration rollback capability
5. Test migration failures and recovery
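
Step 5 can be exercised with the standard-library `sqlite3` module. The sketch below is a synchronous analogue of the runner from step 1, not the project's async code: it applies a deliberately broken migration and checks that neither the half-created table nor the bookkeeping row survives. The migration name `0001_broken` and the `demo` table are made up for the example.

```python
import sqlite3
import time


def run_migration(conn: sqlite3.Connection, name: str, migration_func) -> None:
    """Apply one migration and record it, or roll everything back."""
    try:
        conn.execute("BEGIN EXCLUSIVE")
        migration_func(conn)
        conn.execute(
            "INSERT INTO schema_migrations (name, applied_at) VALUES (?, ?)",
            (name, int(time.time())),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def broken_migration(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY)")
    # Fails with OperationalError because the table does not exist.
    conn.execute("ALTER TABLE missing_table ADD COLUMN x INTEGER")


# isolation_level=None stops sqlite3 from opening transactions implicitly,
# so the explicit BEGIN/COMMIT/ROLLBACK above are the only ones in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE schema_migrations (name TEXT PRIMARY KEY, applied_at INTEGER)")

try:
    run_migration(conn, "0001_broken", broken_migration)
except sqlite3.OperationalError as exc:
    print("migration failed and was rolled back:", exc)

tables = {row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
applied = conn.execute("SELECT COUNT(*) FROM schema_migrations").fetchone()[0]
print(sorted(tables), applied)
```

SQLite treats schema changes such as `CREATE TABLE` as part of the surrounding transaction, which is what allows the rollback to remove the `demo` table here.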

**Possible traps and issues**:
- SQLite WAL mode can leave orphaned `.wal` files after crashes (handle cleanup)
- Long-running migrations might timeout
- Complex migrations spanning multiple steps hard to make atomic
- Rollback procedure must be well-documented and tested
- Development vs production migration paths might diverge

**Docs changes needed**:
- Create `Docs/DATABASE_MIGRATIONS.md` explaining migration process
- Add disaster recovery section to `Docs/Deployment.md`
- Document in `Docs/TROUBLESHOOTING.md` - "Database migration failed"

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.3 Database Migration Issues"
- `backend/app/startup.py` - Migration runner

---

### Issue #9: HIGH - No Graceful Shutdown Handling

**Where found**:
- `Docker/Dockerfile.backend` (lines 75-76) - Entrypoint doesn't handle signals
- `backend/app/main.py` - Lifespan doesn't gracefully shutdown
- No SIGTERM handler in background tasks
- Container receives SIGKILL without cleanup opportunity

**Why this is needed**:
Without graceful shutdown:
- In-flight ban requests don't complete
- Background jobs interrupted mid-import
- Incomplete blocklist imports leave stale data
- Database connections don't close properly

**Goal**:
Implement graceful shutdown that allows pending operations to complete before process exits.

**What to do**:
1. Implement lifespan context manager that handles shutdown:
   ```python
   import asyncio
   from contextlib import asynccontextmanager

   from fastapi import FastAPI

   # Hypothetical registry of tasks the app itself started
   background_tasks: set[asyncio.Task] = set()

   @asynccontextmanager
   async def lifespan(app: FastAPI):
       # Startup
       yield
       # Shutdown: wait only for the app's own tasks. asyncio.all_tasks() also
       # returns this task and the server's tasks, so awaiting it can hang.
       pending = [t for t in background_tasks if not t.done()]
       await asyncio.gather(*pending, return_exceptions=True)
   ```
2. Set the stop grace period when stopping the container: `docker stop --time=30` (or `stop_grace_period: 30s` in Compose)
3. Handle SIGTERM signal to trigger shutdown
4. Drain in-flight requests before exit
5. Close database connections and scheduler cleanly
6. Add logging for shutdown events
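
One way to combine draining with a bounded wait is sketched below: give tracked tasks a fixed window to finish, then cancel whatever is still running so the process exits before the container runtime sends SIGKILL. The `drain` helper and the two simulated tasks are assumptions for illustration, not project code.

```python
import asyncio


async def drain(tasks: set[asyncio.Task], timeout: float) -> tuple[int, int]:
    """Wait up to `timeout` seconds for tasks, then cancel the stragglers.

    Returns (completed, cancelled) counts.
    """
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    # Let cancelled tasks run their cleanup before the process exits.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def demo() -> tuple[int, int]:
    quick = asyncio.create_task(asyncio.sleep(0.01))  # stands in for an in-flight request
    stuck = asyncio.create_task(asyncio.sleep(60))    # stands in for a hung import job
    return await drain({quick, stuck}, timeout=0.2)


completed, cancelled = asyncio.run(demo())
print(completed, cancelled)
```

The drain timeout should stay below the container's stop grace period so that cancellation and connection cleanup still fit inside it.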

**Possible traps and issues**:
- Tasks might take longer than shutdown timeout (configure appropriate timeout)
- Hanging tasks won't terminate (need proper cancellation)
- Scheduler might reject new jobs during shutdown
- Race conditions between shutdown and new requests

**Docs changes needed**:
- Add deployment best practices to `Docs/Deployment.md`
- Document graceful shutdown in `Docs/TROUBLESHOOTING.md`

**Doc references**:
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "6.4 No Graceful Shutdown"

---

### Issue #10: HIGH - Database Type Inconsistency (Timestamps Mixed Across Tables)

**Where found**: