The health check endpoint now properly indicates service unavailability: - Returns HTTP 200 when fail2ban is online - Returns HTTP 503 when fail2ban is offline This allows Docker and other orchestration tools to correctly detect when fail2ban is unreachable and automatically restart the backend container, preventing the situation where Docker treats the container as healthy despite fail2ban being down. Changes: - Update GET /api/health to return 503 on fail2ban offline - Return appropriate JSON response bodies for each state - Update tests to verify both online (200) and offline (503) scenarios - Update Dockerfile HEALTHCHECK documentation - Add Health Checks section to Deployment.md documentation All tests pass with 100% coverage on health.py. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
595 lines
14 KiB
Markdown
595 lines
14 KiB
Markdown
## [CRITICAL] Health check endpoint returns wrong status code
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/routers/health.py` — always returns 200, even when fail2ban offline
|
|
|
|
**Why this is needed**
|
|
|
|
Docker health checks interpret 200 as "healthy". If fail2ban offline but backend returns 200, Docker thinks container healthy and doesn't restart it.
|
|
|
|
**Goal**
|
|
|
|
Return 503 Service Unavailable when fail2ban is offline.
|
|
|
|
**What to do**
|
|
|
|
1. Change health endpoint to return 503 when offline:
|
|
```python
|
|
if not server_status.online:
|
|
return JSONResponse(
|
|
status_code=503,
|
|
content={"status": "unavailable", "fail2ban": "offline"}
|
|
)
|
|
```
|
|
|
|
2. Update Docker health check to expect 503 as "unhealthy"
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Returning 503 causes orchestration tools to restart container
|
|
- If fail2ban restarts frequently, health check becomes flaky
|
|
- Consider gradual degradation
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docker/Dockerfile.backend` health check documentation
|
|
- Update `Docs/Deployment.md` § Health Checks
|
|
|
|
**Doc references**
|
|
|
|
- `backend/app/routers/health.py`
|
|
- `Docker/Dockerfile.backend`
|
|
|
|
---
|
|
|
|
## [IMPORTANT] Database transactions lack explicit isolation
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/repositories/session_repo.py:40-60` — multiple queries without `BEGIN TRANSACTION`
|
|
- Similar pattern in multi-step operations across repositories
|
|
|
|
**Why this is needed**
|
|
|
|
Without explicit boundaries, concurrent requests can race: Thread A checks if exists → not found, Thread B checks same → not found, Thread A inserts → succeeds, Thread B inserts → duplicate error or silent overwrite.
|
|
|
|
**Goal**
|
|
|
|
Wrap all multi-step operations in explicit transactions with appropriate isolation level.
|
|
|
|
**What to do**
|
|
|
|
1. Use explicit `BEGIN IMMEDIATE` transaction:
|
|
```python
|
|
await db.execute("BEGIN IMMEDIATE")
|
|
try:
|
|
await db.execute("INSERT INTO sessions ...")
|
|
await db.commit()
|
|
except Exception:
|
|
await db.rollback()
|
|
raise
|
|
```
|
|
|
|
2. Use `IMMEDIATE` mode to lock immediately for writes
|
|
3. Document transaction boundaries clearly
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Nested transactions (SAVEPOINTs) may be needed
|
|
- Locks held too long cause contention
|
|
- Deadlocks possible with concurrent writers
|
|
|
|
**Docs changes needed**
|
|
|
|
- Add section in `Docs/Backend-Development.md` § Database Transactions
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Backend-Development.md` (database design)
|
|
|
|
---
|
|
|
|
## [IMPORTANT] Scheduler lock race condition
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/utils/scheduler_lock.py:56-58` — heartbeat interval 10 seconds
|
|
|
|
**Why this is needed**
|
|
|
|
Current design: Process A acquires lock, heartbeat misses, lock expires, Process B acquires lock, both running simultaneously → duplicate work, data corruption.
|
|
|
|
**Goal**
|
|
|
|
Implement robust distributed locking that prevents concurrent execution.
|
|
|
|
**What to do**
|
|
|
|
**Option A (Strengthen heartbeat):**
|
|
- Reduce interval to 5s (half of timeout)
|
|
- Use database advisory locks
|
|
- Monitor heartbeat failures
|
|
|
|
**Option B (Migrate to Redis):**
|
|
- Use `redlock-py` or `aioredis`
|
|
- Simpler, more reliable than database-backed
|
|
|
|
**Current code improvements:**
|
|
- Log when heartbeat fails
|
|
- Add metric for lock contention
|
|
- Test multi-process scenario
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Database locks don't scale under high contention
|
|
- Redis adds new dependency
|
|
- Clock skew breaks timestamp-based expiry
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docs/Deployment.md` § Scheduler Lock
|
|
- Add troubleshooting: "Blocklist import runs twice"
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Deployment.md` (scheduler)
|
|
- `backend/app/utils/scheduler_lock.py` (lock implementation)
|
|
|
|
---
|
|
|
|
## [IMPORTANT] API pagination doesn't return metadata
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/routers/history.py` — returns bare list, no pagination metadata
|
|
- All paginated routers have same issue
|
|
|
|
**Why this is needed**
|
|
|
|
Frontend receives bare list, cannot determine: total results, whether more pages exist, last page number. Must guess or re-query.
|
|
|
|
**Goal**
|
|
|
|
Return pagination metadata with every paginated response.
|
|
|
|
**What to do**
|
|
|
|
1. Create response wrapper:
|
|
```python
|
|
class PaginatedResponse(BaseModel):
|
|
data: list[Item]
|
|
pagination: PaginationMetadata
|
|
```
|
|
|
|
2. Update all paginated routers to return this wrapper
|
|
3. Update frontend to use metadata for UI
|
|
|
|
**Possible traps and issues**
|
|
|
|
- `SELECT COUNT(*)` is slow on large tables
|
|
- Response shape change — old frontend may not handle
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update API documentation § Pagination
|
|
|
|
**Doc references**
|
|
|
|
- `backend/app/utils/pagination.py`
|
|
|
|
---
|
|
|
|
## [IMPORTANT] Error response schema inconsistent
|
|
|
|
**Where found**
|
|
|
|
- Different handlers return different response shapes
|
|
- Fail2Ban errors: `{ "error_code": "...", "detail": "..." }`
|
|
- Validation errors: `{ "detail": [...] }`
|
|
- Not found errors: `{ "detail": "...", "error_code": "..." }`
|
|
|
|
**Why this is needed**
|
|
|
|
Frontend must normalize multiple shapes, making error handling fragile and error-prone.
|
|
|
|
**Goal**
|
|
|
|
Unify all error responses to single schema.
|
|
|
|
**What to do**
|
|
|
|
1. Define canonical error response:
|
|
```python
|
|
class ErrorResponse(BaseModel):
|
|
error_code: str
|
|
message: str
|
|
status: int
|
|
details: dict | None = None
|
|
```
|
|
|
|
2. Update all handlers to return this format
|
|
3. Update frontend to expect unified schema
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Backward compatibility with old clients
|
|
- FastAPI's built-in handlers may override custom
|
|
- Rich detail structures need accommodation
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update API documentation with unified error schema
|
|
- Add error code reference table
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/API.md` (error codes)
|
|
- `backend/app/main.py` (exception handlers)
|
|
|
|
---
|
|
|
|
## [IMPORTANT] Provider ordering fragility (Frontend)
|
|
|
|
**Where found**
|
|
|
|
- `frontend/src/App.tsx` — 10-level deep provider nesting
|
|
- `frontend/src/providers/PROVIDER_ORDER.md` — documents order, no compile-time enforcement
|
|
|
|
**Why this is needed**
|
|
|
|
Provider order (ThemeProvider → AppContents → FluentProvider → ...) enforced only at runtime. Accidental reorder caught only after deploy.
|
|
|
|
**Goal**
|
|
|
|
Add compile-time validation of provider ordering.
|
|
|
|
**What to do**
|
|
|
|
1. Create provider composition utility enforcing order
|
|
2. Use TypeScript discriminated unions
|
|
3. Add ESLint rule to check provider wrapping
|
|
|
|
**Possible traps and issues**
|
|
|
|
- TypeScript doesn't easily enforce ordering
|
|
- May be overkill — improve runtime error messages instead
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docs/Architekture.md` § 3.2 (Providers)
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Architekture.md` § 3.2 (Providers)
|
|
- `frontend/src/providers/PROVIDER_ORDER.md`
|
|
|
|
---
|
|
|
|
## [IMPORTANT] Promise cancellation not checked in .then()/.catch() chains
|
|
|
|
**Where found**
|
|
|
|
- `frontend/src/components/blocklist/BlocklistSourcesSection.tsx:84-88`
|
|
- `frontend/src/components/blocklist/BlocklistScheduleSection.tsx:49-58`
|
|
- Multiple components use this pattern
|
|
|
|
**Why this is needed**
|
|
|
|
When user navigates away, `.then()` chains don't check if cancelled. State updated on unmounted component → React warnings, memory leak, notification shows wrong context.
|
|
|
|
**Goal**
|
|
|
|
Check for cancellation in all `.then()/.catch()` chains.
|
|
|
|
**What to do**
|
|
|
|
1. Replace `.then()/.catch()` with `async/await` and cancellation check
|
|
2. Or use wrapper hook to hide logic
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Checking `signal.aborted` after `await` introduces race conditions
|
|
- Better: let AbortError propagate, catch it in catch block
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docs/Web-Development.md` § Async Patterns
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Web-Development.md` (async patterns)
|
|
|
|
---
|
|
|
|
## [MEDIUM] Inefficient database pagination uses OFFSET
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/utils/pagination.py` — uses `OFFSET (page-1) * page_size`
|
|
|
|
**Why this is needed**
|
|
|
|
OFFSET scans and discards N rows to fetch N+limit. Last page on 10M row table: 15 seconds ⚠️
|
|
|
|
**Goal**
|
|
|
|
Implement keyset pagination (cursor-based) for large result sets.
|
|
|
|
**What to do**
|
|
|
|
1. **Short-term:** Add database indexes on sort columns
|
|
2. **Long-term:** Implement cursor-based pagination using WHERE instead of OFFSET
|
|
3. Frontend sends cursor (last row ID) instead of page number
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Cursor must be deterministic
|
|
- API contract changes
|
|
- Cursor format must be opaque to client
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docs/Backend-Development.md` § Database Performance
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Backend-Development.md` (database performance)
|
|
|
|
---
|
|
|
|
## [MEDIUM] Session secret rotation not implemented
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/config.py` — single `session_secret` with no rotation support
|
|
|
|
**Why this is needed**
|
|
|
|
If secret leaks, all sessions compromised. No way to invalidate old sessions.
|
|
|
|
**Goal**
|
|
|
|
Support gradual secret rotation without forcing logout.
|
|
|
|
**What to do**
|
|
|
|
1. Store multiple secrets: current and previous
|
|
2. Accept tokens signed with either key
|
|
3. Re-sign tokens with current secret on validation
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Rotation strategy must be documented
|
|
- Metrics needed to track secret usage
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docs/Backend-Development.md` § Session Management
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Backend-Development.md`
|
|
|
|
---
|
|
|
|
## [MEDIUM] No CORS configuration
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/main.py` — no CORS middleware added
|
|
|
|
**Why this is needed**
|
|
|
|
If frontend on different origin, cross-origin requests blocked without CORS configuration.
|
|
|
|
**Goal**
|
|
|
|
Add CORS middleware with proper origin whitelisting.
|
|
|
|
**What to do**
|
|
|
|
1. Add CORS middleware with specific origin whitelist
|
|
2. Make configurable via environment variable
|
|
3. Default to localhost for development
|
|
|
|
**Possible traps and issues**
|
|
|
|
- `allow_origins=["*"]` defeats CORS security
|
|
- Credentials require specific origins, not wildcard
|
|
- Missing config silently fails in browser
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update `Docs/Deployment.md` § CORS Configuration
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Deployment.md`
|
|
|
|
---
|
|
|
|
## [MEDIUM] Input validation missing for regex patterns (ReDoS)
|
|
|
|
**Where found**
|
|
|
|
- `backend/app/routers/config.py` — regex validation accepts arbitrary patterns without timeout
|
|
|
|
**Why this is needed**
|
|
|
|
Malicious regex causes catastrophic backtracking (ReDoS). Attacker sends pattern → compilation hangs → DoS.
|
|
|
|
**Goal**
|
|
|
|
Add timeout and complexity limits to regex validation.
|
|
|
|
**What to do**
|
|
|
|
1. Add timeout to regex compilation (2 seconds recommended)
|
|
2. Add length limit (reject patterns > 1000 characters)
|
|
3. Use `signal.alarm()` (Unix) or timeout library
|
|
|
|
**Possible traps and issues**
|
|
|
|
- `signal.alarm()` Unix-only
|
|
- Some valid complex regexes may timeout
|
|
- Frontend should also validate (defense in depth)
|
|
|
|
**Docs changes needed**
|
|
|
|
- Update API docs to document regex validation limits
|
|
|
|
**Doc references**
|
|
|
|
- `backend/app/routers/config.py`
|
|
|
|
---
|
|
|
|
## [MEDIUM] No structured logging to external system
|
|
|
|
**Where found**
|
|
|
|
- Logs only go to stdout/file, no external aggregation
|
|
|
|
**Why this is needed**
|
|
|
|
Can't search across instances, historical logs lost on instance recycle.
|
|
|
|
**Goal**
|
|
|
|
Ship logs to centralized logging platform.
|
|
|
|
**What to do**
|
|
|
|
1. **Short-term:** Ensure `structlog` JSON output is valid (already done)
|
|
2. **Long-term:** Ship to logging platform (ELK, Datadog, Papertrail)
|
|
|
|
**Possible traps and issues**
|
|
|
|
- External logging adds latency
|
|
- Sensitive data must not be logged
|
|
- Log volume can be massive
|
|
|
|
**Docs changes needed**
|
|
|
|
- Add `Docs/Observability.md` section on logging
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Observability.md` (new)
|
|
|
|
---
|
|
|
|
## [MEDIUM] No Application Performance Monitoring (APM)
|
|
|
|
**Where found**
|
|
|
|
- Backend: no metrics collection, latency tracking
|
|
- Frontend: no error tracking, performance metrics
|
|
- No observability into request performance
|
|
|
|
**Why this is needed**
|
|
|
|
Without metrics, blind in production: API slow? Unknown. Which endpoints fail most? Unknown.
|
|
|
|
**Goal**
|
|
|
|
Add comprehensive metrics collection and monitoring.
|
|
|
|
**What to do**
|
|
|
|
1. **Backend metrics:**
|
|
- Add Prometheus metrics: request count, latency, active requests
|
|
- Expose `/metrics` endpoint
|
|
|
|
2. **Frontend metrics:**
|
|
- Page load time, FCP, LCP using `web-vitals`
|
|
- API error rates and latencies
|
|
|
|
3. **Aggregation:**
|
|
- Prometheus + Grafana, or Datadog/NewRelic
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Metrics collection has performance cost
|
|
- Cardinality explosion with tags
|
|
- PII in metrics
|
|
|
|
**Docs changes needed**
|
|
|
|
- Add `Docs/Observability.md`
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Observability.md` (new)
|
|
|
|
---
|
|
|
|
## [LOW] Frontend charts not memoized
|
|
|
|
**Where found**
|
|
|
|
- `frontend/src/components/TopCountriesPieChart.tsx`
|
|
- `frontend/src/components/TopCountriesBarChart.tsx`
|
|
|
|
**Why this is needed**
|
|
|
|
Charts re-render on every parent update, Recharts reprocesses 5000+ points.
|
|
|
|
**Goal**
|
|
|
|
Memoize chart components.
|
|
|
|
**What to do**
|
|
|
|
1. Wrap with `React.memo` with custom comparison
|
|
2. Ensure data objects are stable
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Shallow comparison might not be enough
|
|
- Memoization has memory cost
|
|
|
|
**Docs changes needed**
|
|
|
|
- No documentation changes
|
|
|
|
**Doc references**
|
|
|
|
- `frontend/src/components/TopCountriesChart.tsx`
|
|
|
|
---
|
|
|
|
## [LOW] No request deduplication on frontend
|
|
|
|
**Where found**
|
|
|
|
- `frontend/src/hooks/useFetchData.ts` — each call launches new request
|
|
- User clicks "Refresh" twice → two identical requests
|
|
|
|
**Why this is needed**
|
|
|
|
Duplicates waste bandwidth, cause race conditions (response 2 arrives first, then response 1 overwrites with stale data).
|
|
|
|
**Goal**
|
|
|
|
Deduplicate identical in-flight requests.
|
|
|
|
**What to do**
|
|
|
|
1. Implement request cache
|
|
2. Clear cache entry when response received
|
|
3. Use in `useFetchData`
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Cache must be cleared on data mutation
|
|
- Stale data in cache possible if not careful
|
|
|
|
**Docs changes needed**
|
|
|
|
- No documentation changes
|
|
|
|
**Doc references**
|
|
|
|
- `frontend/src/hooks/useFetchData.ts` |