## [Backend] Model package cross-imports — models are not true leaf nodes **Where found** - `backend/app/models/jail.py:8` — imports `BantimeEscalation` from `app.models.config` - `backend/app/models/history.py:12` — imports `TimeRange` from `app.models.ban` - `backend/app/models/config.py:12-14` — imports `get_settings` from `app.config` and `validate_log_path` from `app.utils.path_utils` at module level **Why this is needed** Pydantic models are supposed to be pure data classes with no inter-package dependencies. When `jail.py` imports from `config.py`, any change to `config.py` can silently break the `jail` model class. This creates invisible coupling across the models layer. **Goal** Make all model modules true leaf nodes in the dependency graph. No model should import from another model package or from application-layer modules (config, utils, services). **What to do** 1. Move `BantimeEscalation` from `app/models/config.py` to `app/models/jail.py` (or a shared `app/models/_shared.py`) 2. Move `TimeRange` from `app/models/ban.py` to `app/models/history.py` (or a shared `_shared.py`) 3. Remove the `from app.config import get_settings` and `from app.utils.path_utils import validate_log_path` imports from `app/models/config.py` 4. Add `from __future__ import annotations` if not present, use string annotations for cross-type references **Possible traps and issues** - `BantimeEscalation` may be imported elsewhere — search all of `backend/app/` for usages before moving - Moving types may break downstream imports in `routers/`, `services/`, or `mappers/` - Python's `from __future__ import annotations` turns all annotations into strings at parse time, avoiding forward-reference issues **Docs changes needed** - Update `Docs/Architekture.md` § 2.1 Project Structure — note that models must not import from other model packages - Add note in `Docs/Backend-Development.md` about model layering rules **Doc references** - `Docs/Architekture.md` § 2.1 (models should be leaf nodes) - `Docs/Backend-Development.md` (model layering rules) --- ## [Backend] Pydantic validators execute at import time **Where found** - `backend/app/models/config.py:12-14` — module-level imports of `get_settings` and `validate_log_path` - Pydantic `Field()` definitions that call runtime-dependent functions at class-definition time **Why this is needed** When `Field()` default values or validators call `get_settings()` at import time, those calls happen the moment Python imports the module. This means loading `app.models.config` requires `app.config` to be importable first, creating hidden circular dependencies. **Goal** Remove all side-effects from Pydantic model definitions. Validators and field defaults must be pure data — no I/O, no config access, no external state. **What to do** 1. Identify all `Field()` definitions that call `get_settings()`, `validate_log_path()`, or runtime-dependent functions 2. Remove those calls from field definitions 3. Use `model_validator` that runs at instance creation time: ```python @model_validator(mode="after") def validate_range(self): from app.config import get_settings settings = get_settings() if not (1 <= self.log_max_age_days <= settings.max_log_max_age_days): raise ValueError(...) return self ``` 4. Consider moving complex validation logic to the service layer where settings are already available **Possible traps and issues** - Some validators may depend on settings not yet loaded at model instantiation — test in unit tests without full app context - Changing from field-level to `model_validator` changes error message format slightly - Check if `app.models.response.BanGuiBaseModel` has validator side-effects **Docs changes needed** - Update `Docs/Architekture.md` § 2.1 (models layer) — validators must not perform I/O - Add note in `Docs/Backend-Development.md` about validator design constraint **Doc references** - `Docs/Architekture.md` § 2.1 - `Docs/Backend-Development.md` (Pydantic model conventions) --- ## [Backend] Exception handler overlap — broad handlers catching everything **Where found** - `backend/app/main.py:182-200` — `_get_error_code()` accepts any `Exception` and falls back to snake_case conversion - Multiple handlers (lines 329-466) accept `Exception` as parameter **Why this is needed** Broad exception handlers create fragility: adding a new `DomainError` subclass without explicitly registering a handler silently falls through, producing generic error codes instead of specific ones. The fallback chain is not explicitly documented. **Goal** Make the exception handler registration explicit and documented. Every exception type that can bubble up should have a clear path to a handler. **What to do** 1. Audit all exception handlers and confirm they are registered with the most specific base class 2. Add a comment block documenting the fallback chain 3. Ensure every custom `DomainError` subclass has `error_code` and `get_error_metadata()` implemented 4. Add a catch-all `Exception` handler as the absolute last resort **Possible traps and issues** - If a new `DomainError` subclass is added without handler registration, it silently returns wrong status code - `ValueError` handler may catch Pydantic `ValidationError` subclasses **Docs changes needed** - Update `Docs/Architekture.md` § 2.2 (Application Entry Point) — document exception handler hierarchy - Add section in `Docs/Backend-Development.md` on exception taxonomy **Doc references** - `Docs/Architekture.md` § 2.2 (Application Entry Point) - `Docs/Backend-Development.md` (exception conventions) --- ## [Backend] Login rate limiter — penalty sleep does not block the request **Where found** - `backend/app/routers/auth.py:82-107` — rate limiter check happens before password verification, penalty sleep happens after **Why this is needed** The current design means attackers who stay under 5 requests/minute get no penalty at all. The `asyncio.sleep` only fires after the rate limit is already exceeded, significantly weakening the limiter's effectiveness. **Goal** Ensure the rate limiter blocks requests **before** the password check is attempted. Each wrong password should incur a progressive delay. **What to do** 1. Remove the `acquire`/`release` pattern 2. Change flow so `record_failure` is called on every wrong password and `is_allowed` returns False when limit exceeded 3. Implement exponential backoff: `penalty = min(base_delay * (2 ** failure_count), max_delay)` 4. Consider using a token bucket rather than sliding window 5. Ensure `is_allowed` uses the failure count atomically **Possible traps and issues** - If `asyncio.sleep` is called before password check, legitimate users experience latency on response - Keep maximum penalty reasonable (2-5 seconds) - `record_failure` counter must be stored durably (in-memory is fine for single-worker) **Docs changes needed** - Update `Docs/Architekture.md` § 2.2 (auth router) — reflect rate limiting behavior - Add note in `Docs/Backend-Development.md` about rate limiter design **Doc references** - `Docs/Architekture.md` § 2.2 (auth router) - `backend/app/routers/auth.py` (login endpoint) --- ## [Backend] Module-level imports inside dependency provider functions **Where found** - `backend/app/dependencies.py:151` — `from app.db import open_db` inside `get_db()` - `backend/app/dependencies.py:258, 270, 282, 294, 307, 319, 332` — local imports inside each `get_()` provider - Similar patterns in service provider functions **Why this is needed** FastAPI calls dependency provider functions on every request. While Python caches modules, the import statement still has overhead. More importantly, the pattern is inconsistent — some providers have module-level imports while others have local imports, making it unclear which providers are safe to call at high frequency. **Goal** Move all repository and service imports to module level in `dependencies.py`. **What to do** 1. Move all repository imports to module level near the top 2. Similarly move `from app.services import health_service` to module level 3. Keep `from app.db import open_db` as local import (only needed within `get_db()`) 4. Move `from app.services import auth_service` to module level with `TYPE_CHECKING` guard if circular deps prevent it **Possible traps and issues** - Moving imports to module level may expose hidden circular imports - Current local-import pattern was likely chosen to avoid circular deps — verify no circular dependencies before making change **Docs changes needed** - Update `Docs/Architekture.md` § 2.3 (Dependency Wiring) — repository and service imports should be at module level **Doc references** - `Docs/Architekture.md` § 2.3 (Dependency Wiring) - `backend/app/dependencies.py` (composition root) --- ## [Backend] `get_password_hash` lives in `setup_service` but is used by `auth_service` **Where found** - `backend/app/services/auth_service.py:27` — imports `get_password_hash` from `setup_service` **Why this is needed** `auth_service.py` handles all authentication concerns and is the natural home for password hash retrieval. Having it import from `setup_service` creates incorrect semantic dependency and unnecessary coupling. **Goal** Move `get_password_hash` to `auth_service.py` where it belongs, or to a shared `app/utils/crypto.py` module. **What to do** 1. Move the `get_password_hash` function body from `app/services/setup_service.py` to `app/services/auth_service.py` 2. Update the import in `auth_service.py` to use the local function 3. Search all of `backend/app/` for usages and update imports 4. Remove the function from `setup_service.py` **Possible traps and issues** - Search thoroughly for all usages before removing — may be imported in tests, other services, setup router - If used during initial setup flow, update that usage to import from `auth_service` instead **Docs changes needed** - No documentation changes required — internal code reorganization **Doc references** - `backend/app/services/auth_service.py` - `backend/app/services/setup_service.py` --- ## [Backend] `re` module imported inside function body **Where found** - `backend/app/main.py:198-199` — `import re` inside `_get_error_code()` **Why this is needed** Importing inside a function is a code smell. Standard practice is to import modules at the top of the file. **Goal** Move `import re` to the module-level imports at the top of `main.py`. **What to do** 1. Add `import re` to existing imports section at top of `backend/app/main.py` 2. Remove `import re` line from inside `_get_error_code()` **Possible traps and issues** - None — straightforward refactoring with no behavioral change **Docs changes needed** - No documentation changes needed **Doc references** - `backend/app/main.py` --- ## [Frontend] AuthProvider sessionStorage not synchronized across tabs **Where found** - `frontend/src/providers/AuthProvider.tsx` — uses `sessionStorage` to persist `isAuthenticated` across page refreshes - `sessionStorage` is **tab-scoped** — not shared across browser tabs **Why this is needed** If a user logs out in Tab A, Tab B's `sessionStorage` still holds `isAuthenticated: true`. Tab B continues showing authenticated UI until next full page refresh or 401 error. **Goal** Ensure logout state is propagated across all open tabs immediately. **What to do** 1. Add `storage` event listener in `AuthProvider` to receive logout events from other tabs 2. When `logout()` is called, clear `sessionStorage` explicitly 3. Consider also using `BroadcastChannel` API for real-time cross-tab sync **Possible traps and issues** - `storage` event only fires when `sessionStorage` changed **in another tab** - `StorageEvent` doesn't fire if cookies are sole authentication mechanism - Very old browsers don't support `StorageEvent` — fallback to full-page refresh **Docs changes needed** - Update `Docs/Architekture.md` § 3.2 (AuthProvider) — document cross-tab synchronization - Add note in `Docs/Web-Development.md` about authentication state management **Doc references** - `Docs/Architekture.md` § 3.2 (Providers section) - `frontend/src/providers/AuthProvider.tsx` --- ## [Frontend] usePolledData — setInterval without drift correction **Where found** - `frontend/src/hooks/usePolledData.ts` — uses `setInterval` without tracking expected next poll time **Why this is needed** With fixed `setInterval`, if `refetch` takes time (e.g., 2 seconds for slow API) and `pollInterval` is 5 seconds, the actual polling pattern accumulates drift. Effective polling interval becomes > 5 seconds, wasting bandwidth and CPU. **Goal** Implement drift-corrected polling that schedules next poll relative to **completion** of previous poll, not its start. **What to do** 1. Replace `setInterval` with self-scheduling timeout 2. Track elapsed time between poll start and completion 3. Schedule next poll with compensation: `delay = Math.max(0, pollInterval - elapsed)` 4. Alternatively use `use-sse` or custom hook `useDriftCorrectedPolling` **Possible traps and issues** - Cancellation check must happen both before and after `await refetch()` to prevent race conditions - Coordinate cancellation with `AbortController` already used by `useFetchData` - If `pollInterval` changes, hook must restart polling loop **Docs changes needed** - Update `frontend/src/hooks/README.md` — document drift-corrected polling behavior **Doc references** - `frontend/src/hooks/usePolledData.ts` - `frontend/src/hooks/README.md` --- ## [Frontend] ErrorBoundary — non-standard `componentStack` property **Where found** - `frontend/src/components/ErrorBoundary.tsx` — accesses `errorInfo.componentStack` - Property not in `React.ErrorInfo` public type definitions **Why this is needed** Accessing `errorInfo.componentStack` produces TypeScript error under strict mode. Property is React DevTools implementation detail, may change without notice. **Goal** Remove dependency on `errorInfo.componentStack`, use alternative for stack information. **What to do** 1. Check what `recordCritical` does with `componentStack` parameter 2. Consider using standard `Error.prototype.stack` instead 3. Or cast `errorInfo` to `any` with comment explaining trade-off: ```typescript const stack = (errorInfo as any).componentStack as string | undefined; // TODO: Remove when React types include componentStack ``` 4. Handle case where `componentStack` is `undefined` **Possible traps and issues** - `Error.prototype.stack` contains JavaScript call stack, not React component stack - Casting to `any` suppresses type checking — ensure it's documented - In production builds `componentStack` may be empty or absent **Docs changes needed** - Update `Docs/Architekture.md` § 3.2 (Components) — note that `ErrorBoundary` uses React-internal error info **Doc references** - `frontend/src/components/ErrorBoundary.tsx` - `Docs/Architekture.md` § 3.2 (Components section) --- ## [Frontend] No loading skeleton for Login page **Where found** - `frontend/src/pages/LoginPage.tsx` — renders password form immediately without checking session validation state **Why this is needed** `AuthProvider` validates session on mount. During validation (network round-trip), `LoginPage` renders form. If user already has valid session, they briefly see login page before redirect — "flash" of wrong page. **Goal** Show loading state (spinner or skeleton) while session validation in progress. **What to do** 1. Add `useSessionValidation` call to `LoginPage` to get `isValidating` flag 2. If `isValidating`, return `` 3. Alternatively, have `AuthProvider` expose `isValidating` state that `LoginPage` can read 4. Ensure redirect to dashboard happens automatically via `useEffect` **Possible traps and issues** - `useSessionValidation` may not expose `isValidating` — check hook implementation - Ensure loading spinner is consistent with Fluent UI design language - Redirect logic must handle `?next=` query parameter **Docs changes needed** - No documentation changes needed — UX improvement to existing component **Doc references** - `frontend/src/pages/LoginPage.tsx` - `frontend/src/providers/AuthProvider.tsx` --- ## [CRITICAL] Backend session cache not cluster-safe **Where found** - `backend/app/startup.py:66-91` — `verify_bangui_workers()` validates `BANGUI_WORKERS` env var but only if explicitly set - `backend/app/utils/session_cache.py` — in-memory cache stored on `app.state` (process-local) - Multi-worker deployments use separate worker processes with separate `app.state` **Why this is needed** Multi-worker deployments (e.g., `gunicorn -w 4`) create separate worker processes. Each has its own in-memory `app.state`, so sessions cached in Worker A are invisible to Workers B, C, D. Request landing on different worker shows session invalid — users randomly logged out. **Goal** Ensure multi-worker deployments either prevented at startup or migrate to shared session store (Redis, PostgreSQL). **What to do** **Option A (Strict single-worker):** - Add runtime detection of actual worker count (not just env var checking) - Fail startup if multi-worker scenario detected - Document single-worker requirement prominently **Option B (Support multi-worker):** - Migrate session cache to Redis or PostgreSQL - Update `app/utils/session_cache.py` to use distributed backend - Add `BANGUI_REDIS_URL` or similar to config **Immediate mitigation:** - Document in deployment guide that `BANGUI_WORKERS=1` is mandatory - Add validation that fails loudly if `BANGUI_WORKERS != 1` **Possible traps and issues** - `uvicorn --workers N` creates N processes, each with separate `app.state` - Environment variable validation easily bypassed using command-line flag - Detecting actual worker count at runtime is tricky — consider `os.getpid()` and shared lock file **Docs changes needed** - Update `Docs/Deployment.md` § Deployment Constraints — explicitly document single-worker requirement - Add troubleshooting entry: "Why am I randomly logged out?" → Check `BANGUI_WORKERS` - Update `Docker/docker-compose.yml` with comment explaining requirement **Doc references** - `Docs/Deployment.md` (deployment constraints) - `backend/app/startup.py` (worker validation) - `backend/app/utils/session_cache.py` (session cache) --- ## [CRITICAL] Frontend-Backend type generation drift **Where found** - `frontend/src/types/` — manually maintained Pydantic model transcriptions - `backend/app/models/` — source Pydantic models - No build-time validation or code generation linking them **Why this is needed** Types manually copied from Python models. When backend model changes, frontend types not updated automatically. Frontend uses stale types, causing runtime errors, type errors at build time, or silent UI bugs. **Goal** Implement automated type synchronization from backend schema to frontend types, validated at build time. **What to do** 1. **Recommended:** Generate TypeScript types from OpenAPI schema ```bash openapi-typescript http://localhost:8000/openapi.json -o src/types/generated.ts ``` 2. **Setup:** - Ensure backend exposes OpenAPI schema at `/openapi.json` (FastAPI has this built-in) - Add `openapi-typescript` to `frontend/package.json` devDependencies - Generate types in pre-build step: `npm run generate:types` - Import types from generated file, not hand-written - Add CI check: fail build if generated types don't match committed types 3. **Alternative:** Use `typed-rest-client` or `msw` with type generation **Possible traps and issues** - OpenAPI schema must be kept up-to-date — CI validation must enforce - Generated types may have different names than hand-written types — migration needed - Private/internal model fields should be excluded from schema **Docs changes needed** - Update `Docs/Web-Development.md` § Type Generation — document workflow - Add pre-commit hook documentation - Update `Docs/Backend-Development.md` to note API changes must keep schema in sync **Doc references** - `Docs/Web-Development.md` (type generation) - `Docs/Backend-Development.md` (API changes) --- ## [CRITICAL] Docker containers lack resource limits **Where found** - `Docker/docker-compose.yml` — no `deploy.limits` or `deploy.reservations` sections **Why this is needed** Without resource limits, single container can consume all host CPU, memory, disk. "Noisy neighbor" scenario where backend memory leak → uses 100% RAM → OOM kill → host unresponsive. **Goal** Set hard and soft resource limits for all containers. **What to do** 1. Add resource limits to `docker-compose.yml`: ```yaml backend: deploy: limits: cpus: '2' memory: 512M reservations: cpus: '1' memory: 256M ``` 2. Document these limits in `Docs/Deployment.md` 3. For Kubernetes, add equivalent `resources.limits` and `resources.requests` **Possible traps and issues** - Limits set too low → OOM kill or throttling - Backend may need more memory for large blocklists - Test under expected load before finalizing - Different environments may need different limits **Docs changes needed** - Update `Docker/docker-compose.yml` with `deploy` sections - Add section in `Docs/Deployment.md` § Resource Allocation **Doc references** - `Docker/docker-compose.yml` - `Docs/Deployment.md` (resource allocation) --- ## [CRITICAL] Global rate limiting missing **Where found** - `backend/app/routers/auth.py` — only `/api/auth/login` has rate limiting - All other routers have no rate limiting **Why this is needed** Without rate limiting, attackers can spam endpoints to cause CPU spike, database overload, or network bandwidth exhaustion. **Goal** Implement global per-IP rate limiting on all endpoints. **What to do** 1. Add rate limiting middleware to `backend/app/main.py`: ```python from slowapi import Limiter limiter = Limiter(key_func=get_remote_address, default_limits=["200 per minute"]) app.state.limiter = limiter ``` 2. Apply to all routers with appropriate limits per endpoint 3. Return proper HTTP 429 with `Retry-After` header 4. Document limits in API docs **Possible traps and issues** - Limits set too low block legitimate users - Distributed deployments need shared limiter state (Redis-backed) - Different endpoints may need different limits - Trusted IPs should bypass limiting **Docs changes needed** - Add section in `Docs/Backend-Development.md` § Rate Limiting - Document default limits in deployment guide **Doc references** - `Docs/Backend-Development.md` (rate limiting) - `backend/app/main.py` (middleware setup) --- ## [CRITICAL] Missing security headers (CSP, X-Frame-Options, etc.) **Where found** - Backend does not set `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options` headers - Frontend HTML served without CSP meta tags **Why this is needed** Without security headers, browsers won't protect against XSS, clickjacking, MIME-sniffing, referrer leakage attacks. **Goal** Add security headers to all HTTP responses. **What to do** 1. Add security headers middleware to `backend/app/main.py`: ```python @app.middleware("http") async def add_security_headers(request, call_next): response = await call_next(request) response.headers["Content-Security-Policy"] = "default-src 'self'" response.headers["X-Frame-Options"] = "DENY" response.headers["X-Content-Type-Options"] = "nosniff" return response ``` 2. In frontend `index.html`, add CSP meta tag 3. Test with browser DevTools Security tab **Possible traps and issues** - CSP `'unsafe-inline'` defeats security — avoid if possible - CDN resources may need explicit allowlist - Too restrictive CSP breaks functionality; too loose defeats security **Docs changes needed** - Add section in `Docs/Security.md` § HTTP Security Headers **Doc references** - `Docs/Security.md` (security headers) --- ## [CRITICAL] Background tasks lack timeout protection **Where found** - `backend/app/tasks/blocklist_import.py` — no timeout - `backend/app/tasks/health_check.py` — no timeout - All task functions lack timeout wrapper **Why this is needed** If task hangs (API unreachable, network partition), task runs forever. Never completes → lock never released → duplicate work, resource exhaustion. **Goal** Ensure all background tasks complete within bounded time or fail gracefully. **What to do** 1. Wrap all task functions with `asyncio.wait_for(task, timeout)`: ```python await asyncio.wait_for(blocklist_service.import_all(...), timeout=300) ``` 2. Set appropriate timeouts per task: - Blocklist import: 300s (5 min) - Health probe: 10s - Geo cache flush: 60s 3. Log timeout events and trigger alerts **Possible traps and issues** - Timeout too short → legitimate tasks killed prematurely - Timeout too long → resource leak if many tasks hang - Killing task mid-operation may leave inconsistent state **Docs changes needed** - Add section in `Docs/Backend-Development.md` § Background Tasks **Doc references** - `Docs/Backend-Development.md` (background tasks) - `backend/app/tasks/` (task modules) --- ## [CRITICAL] Background tasks not idempotent **Where found** - `backend/app/tasks/blocklist_import.py` — bans applied without checking if already banned - `backend/app/tasks/geo_cache_flush.py` — cache entries written without transaction - Multi-step operations not wrapped in transaction **Why this is needed** If task crashes mid-execution, partial state remains. On retry: bans applied again → duplicates, cache entries written twice → corruption. **Goal** Make all background tasks idempotent — retrying produces same result as running once. **What to do** 1. Use operation IDs to deduplicate: ```python operation_id = f"import_{source.id}_{datetime.now().date().isoformat()}" if await import_log_repo.get_by_operation_id(operation_id): return # Already done ``` 2. Use transactions for multi-step operations 3. Store operation state before execution **Possible traps and issues** - Idempotency keys must be unique but deterministic - Transactions require database support - State machine (pending → completed/failed) must be enforced **Docs changes needed** - Update `Docs/Backend-Development.md` § Task Idempotency **Doc references** - `Docs/Backend-Development.md` (task design) - `backend/app/tasks/` (task implementations) --- ## [CRITICAL] Health check endpoint returns wrong status code **Where found** - `backend/app/routers/health.py` — always returns 200, even when fail2ban offline **Why this is needed** Docker health checks interpret 200 as "healthy". If fail2ban offline but backend returns 200, Docker thinks container healthy and doesn't restart it. **Goal** Return 503 Service Unavailable when fail2ban is offline. **What to do** 1. Change health endpoint to return 503 when offline: ```python if not server_status.online: return JSONResponse( status_code=503, content={"status": "unavailable", "fail2ban": "offline"} ) ``` 2. Update Docker health check to expect 503 as "unhealthy" **Possible traps and issues** - Returning 503 causes orchestration tools to restart container - If fail2ban restarts frequently, health check becomes flaky - Consider gradual degradation **Docs changes needed** - Update `Docker/Dockerfile.backend` health check documentation - Update `Docs/Deployment.md` § Health Checks **Doc references** - `backend/app/routers/health.py` - `Docker/Dockerfile.backend` --- ## [IMPORTANT] Database transactions lack explicit isolation **Where found** - `backend/app/repositories/session_repo.py:40-60` — multiple queries without `BEGIN TRANSACTION` - Similar pattern in multi-step operations across repositories **Why this is needed** Without explicit boundaries, concurrent requests can race: Thread A checks if exists → not found, Thread B checks same → not found, Thread A inserts → succeeds, Thread B inserts → duplicate error or silent overwrite. **Goal** Wrap all multi-step operations in explicit transactions with appropriate isolation level. **What to do** 1. Use explicit `BEGIN IMMEDIATE` transaction: ```python await db.execute("BEGIN IMMEDIATE") try: await db.execute("INSERT INTO sessions ...") await db.commit() except Exception: await db.rollback() raise ``` 2. Use `IMMEDIATE` mode to lock immediately for writes 3. Document transaction boundaries clearly **Possible traps and issues** - Nested transactions (SAVEPOINTs) may be needed - Locks held too long cause contention - Deadlocks possible with concurrent writers **Docs changes needed** - Add section in `Docs/Backend-Development.md` § Database Transactions **Doc references** - `Docs/Backend-Development.md` (database design) --- ## [IMPORTANT] Scheduler lock race condition **Where found** - `backend/app/utils/scheduler_lock.py:56-58` — heartbeat interval 10 seconds **Why this is needed** Current design: Process A acquires lock, heartbeat misses, lock expires, Process B acquires lock, both running simultaneously → duplicate work, data corruption. **Goal** Implement robust distributed locking that prevents concurrent execution. **What to do** **Option A (Strengthen heartbeat):** - Reduce interval to 5s (half of timeout) - Use database advisory locks - Monitor heartbeat failures **Option B (Migrate to Redis):** - Use `redlock-py` or `aioredis` - Simpler, more reliable than database-backed **Current code improvements:** - Log when heartbeat fails - Add metric for lock contention - Test multi-process scenario **Possible traps and issues** - Database locks don't scale under high contention - Redis adds new dependency - Clock skew breaks timestamp-based expiry **Docs changes needed** - Update `Docs/Deployment.md` § Scheduler Lock - Add troubleshooting: "Blocklist import runs twice" **Doc references** - `Docs/Deployment.md` (scheduler) - `backend/app/utils/scheduler_lock.py` (lock implementation) --- ## [IMPORTANT] API pagination doesn't return metadata **Where found** - `backend/app/routers/history.py` — returns bare list, no pagination metadata - All paginated routers have same issue **Why this is needed** Frontend receives bare list, cannot determine: total results, whether more pages exist, last page number. Must guess or re-query. **Goal** Return pagination metadata with every paginated response. **What to do** 1. Create response wrapper: ```python class PaginatedResponse(BaseModel): data: list[Item] pagination: PaginationMetadata ``` 2. Update all paginated routers to return this wrapper 3. Update frontend to use metadata for UI **Possible traps and issues** - `SELECT COUNT(*)` is slow on large tables - Response shape change — old frontend may not handle **Docs changes needed** - Update API documentation § Pagination **Doc references** - `backend/app/utils/pagination.py` --- ## [IMPORTANT] Error response schema inconsistent **Where found** - Different handlers return different response shapes - Fail2Ban errors: `{ "error_code": "...", "detail": "..." }` - Validation errors: `{ "detail": [...] }` - Not found errors: `{ "detail": "...", "error_code": "..." }` **Why this is needed** Frontend must normalize multiple shapes, making error handling fragile and error-prone. **Goal** Unify all error responses to single schema. **What to do** 1. Define canonical error response: ```python class ErrorResponse(BaseModel): error_code: str message: str status: int details: dict | None = None ``` 2. Update all handlers to return this format 3. Update frontend to expect unified schema **Possible traps and issues** - Backward compatibility with old clients - FastAPI's built-in handlers may override custom - Rich detail structures need accommodation **Docs changes needed** - Update API documentation with unified error schema - Add error code reference table **Doc references** - `Docs/API.md` (error codes) - `backend/app/main.py` (exception handlers) --- ## [IMPORTANT] Provider ordering fragility (Frontend) **Where found** - `frontend/src/App.tsx` — 10-level deep provider nesting - `frontend/src/providers/PROVIDER_ORDER.md` — documents order, no compile-time enforcement **Why this is needed** Provider order (ThemeProvider → AppContents → FluentProvider → ...) enforced only at runtime. Accidental reorder caught only after deploy. **Goal** Add compile-time validation of provider ordering. **What to do** 1. Create provider composition utility enforcing order 2. Use TypeScript discriminated unions 3. Add ESLint rule to check provider wrapping **Possible traps and issues** - TypeScript doesn't easily enforce ordering - May be overkill — improve runtime error messages instead **Docs changes needed** - Update `Docs/Architekture.md` § 3.2 (Providers) **Doc references** - `Docs/Architekture.md` § 3.2 (Providers) - `frontend/src/providers/PROVIDER_ORDER.md` --- ## [IMPORTANT] Promise cancellation not checked in .then()/.catch() chains **Where found** - `frontend/src/components/blocklist/BlocklistSourcesSection.tsx:84-88` - `frontend/src/components/blocklist/BlocklistScheduleSection.tsx:49-58` - Multiple components use this pattern **Why this is needed** When user navigates away, `.then()` chains don't check if cancelled. State updated on unmounted component → React warnings, memory leak, notification shows wrong context. **Goal** Check for cancellation in all `.then()/.catch()` chains. **What to do** 1. Replace `.then()/.catch()` with `async/await` and cancellation check 2. Or use wrapper hook to hide logic **Possible traps and issues** - Checking `signal.aborted` after `await` introduces race conditions - Better: let AbortError propagate, catch it in catch block **Docs changes needed** - Update `Docs/Web-Development.md` § Async Patterns **Doc references** - `Docs/Web-Development.md` (async patterns) --- ## [MEDIUM] Inefficient database pagination uses OFFSET **Where found** - `backend/app/utils/pagination.py` — uses `OFFSET (page-1) * page_size` **Why this is needed** OFFSET scans and discards N rows to fetch N+limit. Last page on 10M row table: 15 seconds ⚠️ **Goal** Implement keyset pagination (cursor-based) for large result sets. **What to do** 1. **Short-term:** Add database indexes on sort columns 2. **Long-term:** Implement cursor-based pagination using WHERE instead of OFFSET 3. Frontend sends cursor (last row ID) instead of page number **Possible traps and issues** - Cursor must be deterministic - API contract changes - Cursor format must be opaque to client **Docs changes needed** - Update `Docs/Backend-Development.md` § Database Performance **Doc references** - `Docs/Backend-Development.md` (database performance) --- ## [MEDIUM] Session secret rotation not implemented **Where found** - `backend/app/config.py` — single `session_secret` with no rotation support **Why this is needed** If secret leaks, all sessions compromised. No way to invalidate old sessions. **Goal** Support gradual secret rotation without forcing logout. **What to do** 1. Store multiple secrets: current and previous 2. Accept tokens signed with either key 3. Re-sign tokens with current secret on validation **Possible traps and issues** - Rotation strategy must be documented - Metrics needed to track secret usage **Docs changes needed** - Update `Docs/Backend-Development.md` § Session Management **Doc references** - `Docs/Backend-Development.md` --- ## [MEDIUM] No CORS configuration **Where found** - `backend/app/main.py` — no CORS middleware added **Why this is needed** If frontend on different origin, cross-origin requests blocked without CORS configuration. **Goal** Add CORS middleware with proper origin whitelisting. **What to do** 1. Add CORS middleware with specific origin whitelist 2. Make configurable via environment variable 3. Default to localhost for development **Possible traps and issues** - `allow_origins=["*"]` defeats CORS security - Credentials require specific origins, not wildcard - Missing config silently fails in browser **Docs changes needed** - Update `Docs/Deployment.md` § CORS Configuration **Doc references** - `Docs/Deployment.md` --- ## [MEDIUM] Input validation missing for regex patterns (ReDoS) **Where found** - `backend/app/routers/config.py` — regex validation accepts arbitrary patterns without timeout **Why this is needed** Malicious regex causes catastrophic backtracking (ReDoS). Attacker sends pattern → compilation hangs → DoS. **Goal** Add timeout and complexity limits to regex validation. **What to do** 1. Add timeout to regex compilation (2 seconds recommended) 2. Add length limit (reject patterns > 1000 characters) 3. Use `signal.alarm()` (Unix) or timeout library **Possible traps and issues** - `signal.alarm()` Unix-only - Some valid complex regexes may timeout - Frontend should also validate (defense in depth) **Docs changes needed** - Update API docs to document regex validation limits **Doc references** - `backend/app/routers/config.py` --- ## [MEDIUM] No structured logging to external system **Where found** - Logs only go to stdout/file, no external aggregation **Why this is needed** Can't search across instances, historical logs lost on instance recycle. **Goal** Ship logs to centralized logging platform. **What to do** 1. **Short-term:** Ensure `structlog` JSON output is valid (already done) 2. **Long-term:** Ship to logging platform (ELK, Datadog, Papertrail) **Possible traps and issues** - External logging adds latency - Sensitive data must not be logged - Log volume can be massive **Docs changes needed** - Add `Docs/Observability.md` section on logging **Doc references** - `Docs/Observability.md` (new) --- ## [MEDIUM] No Application Performance Monitoring (APM) **Where found** - Backend: no metrics collection, latency tracking - Frontend: no error tracking, performance metrics - No observability into request performance **Why this is needed** Without metrics, blind in production: API slow? Unknown. Which endpoints fail most? Unknown. **Goal** Add comprehensive metrics collection and monitoring. **What to do** 1. **Backend metrics:** - Add Prometheus metrics: request count, latency, active requests - Expose `/metrics` endpoint 2. **Frontend metrics:** - Page load time, FCP, LCP using `web-vitals` - API error rates and latencies 3. **Aggregation:** - Prometheus + Grafana, or Datadog/NewRelic **Possible traps and issues** - Metrics collection has performance cost - Cardinality explosion with tags - PII in metrics **Docs changes needed** - Add `Docs/Observability.md` **Doc references** - `Docs/Observability.md` (new) --- ## [LOW] Frontend charts not memoized **Where found** - `frontend/src/components/TopCountriesPieChart.tsx` - `frontend/src/components/TopCountriesBarChart.tsx` **Why this is needed** Charts re-render on every parent update, Recharts reprocesses 5000+ points. **Goal** Memoize chart components. **What to do** 1. Wrap with `React.memo` with custom comparison 2. Ensure data objects are stable **Possible traps and issues** - Shallow comparison might not be enough - Memoization has memory cost **Docs changes needed** - No documentation changes **Doc references** - `frontend/src/components/TopCountriesChart.tsx` --- ## [LOW] No request deduplication on frontend **Where found** - `frontend/src/hooks/useFetchData.ts` — each call launches new request - User clicks "Refresh" twice → two identical requests **Why this is needed** Duplicates waste bandwidth, cause race conditions (response 2 arrives first, then response 1 overwrites with stale data). **Goal** Deduplicate identical in-flight requests. **What to do** 1. Implement request cache 2. Clear cache entry when response received 3. Use in `useFetchData` **Possible traps and issues** - Cache must be cleared on data mutation - Stale data in cache possible if not careful **Docs changes needed** - No documentation changes **Doc references** - `frontend/src/hooks/useFetchData.ts`