## [CRITICAL] Missing security headers (CSP, X-Frame-Options, etc.)

**Where found**

- Backend does not set `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options` headers
- Frontend HTML served without CSP meta tags

**Why this is needed**

Without these headers, browsers do not apply their built-in defenses against XSS, clickjacking, MIME sniffing, and referrer leakage.

**Goal**

Add security headers to all HTTP responses.

**What to do**

1. Add security headers middleware to `backend/app/main.py`:
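   ```python
   from fastapi import Request

   @app.middleware("http")
   async def add_security_headers(request: Request, call_next):
       # Add the headers to the response produced by the route handler
       response = await call_next(request)
       response.headers["Content-Security-Policy"] = "default-src 'self'"
       response.headers["X-Frame-Options"] = "DENY"  # clickjacking
       response.headers["X-Content-Type-Options"] = "nosniff"  # MIME sniffing
       # Suggested default for the referrer leakage noted above; adjust as needed
       response.headers["Referrer-Policy"] = "strict-origin-when-cross-origin"
       return response
   ```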

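2. In frontend `index.html`, add a CSP meta tag as a fallback (`frame-ancestors` is ignored in meta tags, so clickjacking protection must come from the response headers)
3. Verify the response headers in the browser DevTools Network tab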
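
Alongside the manual DevTools check, the presence of the headers can be asserted in an automated test. The following is a minimal sketch, assuming the three headers listed under "Where found" are the required set; `missing_security_headers` is a hypothetical helper name, not existing project code.

```python
REQUIRED_SECURITY_HEADERS = (
    "Content-Security-Policy",
    "X-Frame-Options",
    "X-Content-Type-Options",
)


def missing_security_headers(headers: dict[str, str]) -> list[str]:
    """Return the required security headers that a response lacks.

    HTTP header names are case-insensitive, so the comparison is done
    in lowercase.
    """
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_SECURITY_HEADERS if h.lower() not in present]
```

A test could pass the response headers of any endpoint to this function and assert that the returned list is empty.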

**Possible traps and issues**

- `'unsafe-inline'` in CSP largely negates its XSS protection; avoid if possible (prefer nonces or hashes)
- CDN resources may need explicit allowlist
- Too restrictive CSP breaks functionality; too loose defeats security

**Docs changes needed**

- Add section in `Docs/Security.md` § HTTP Security Headers

**Doc references**

- `Docs/Security.md` (security headers)

---

## [CRITICAL] Background tasks lack timeout protection

**Where found**

- `backend/app/tasks/blocklist_import.py` — no timeout
- `backend/app/tasks/health_check.py` — no timeout
- All task functions lack timeout wrapper

**Why this is needed**

If task hangs (API unreachable, network partition), task runs forever. Never completes → lock never released → duplicate work, resource exhaustion.

**Goal**

Ensure all background tasks complete within bounded time or fail gracefully.

**What to do**

1. Wrap all task functions with `asyncio.wait_for(task, timeout)`:
   ```python
   await asyncio.wait_for(blocklist_service.import_all(...), timeout=300)
   ```

2. Set appropriate timeouts per task:
   - Blocklist import: 300s (5 min)
   - Health probe: 10s
   - Geo cache flush: 60s

3. Log timeout events and trigger alerts
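
The steps above could be combined into one wrapper. This is a sketch under stated assumptions: `run_with_timeout` and the keys of `TASK_TIMEOUTS` are hypothetical names, and alerting is reduced to an error-level log line.

```python
import asyncio
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger(__name__)

# Timeouts in seconds, taken from the list above; the keys are placeholders.
TASK_TIMEOUTS = {
    "blocklist_import": 300,
    "health_probe": 10,
    "geo_cache_flush": 60,
}


async def run_with_timeout(
    name: str,
    task: Callable[[], Awaitable[object]],
    timeout: float,
) -> bool:
    """Run a task under a timeout.

    Returns True if the task finished in time, False if it timed out.
    Other exceptions raised by the task propagate unchanged.
    """
    try:
        await asyncio.wait_for(task(), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the task at this point.
        logger.error("task %s exceeded its timeout of %ss", name, timeout)
        return False
    return True
```

A scheduler entry would then call something like `await run_with_timeout("health_probe", probe_fn, TASK_TIMEOUTS["health_probe"])`, where `probe_fn` stands for the existing task function.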

**Possible traps and issues**

- Timeout too short → legitimate tasks killed prematurely
- Timeout too long → resource leak if many tasks hang
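- Cancelling a task mid-operation may leave inconsistent state (`asyncio.wait_for` cancels the wrapped awaitable on timeout, so cleanup belongs in `finally` blocks)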

**Docs changes needed**

- Add section in `Docs/Backend-Development.md` § Background Tasks

**Doc references**

- `Docs/Backend-Development.md` (background tasks)
- `backend/app/tasks/` (task modules)

---

## [CRITICAL] Background tasks not idempotent

**Where found**

- `backend/app/tasks/blocklist_import.py` — bans applied without checking if already banned
- `backend/app/tasks/geo_cache_flush.py` — cache entries written without transaction
- Multi-step operations not wrapped in transaction

**Why this is needed**

If task crashes mid-execution, partial state remains. On retry: bans applied again → duplicates, cache entries written twice → corruption.

**Goal**

Make all background tasks idempotent — retrying produces same result as running once.

**What to do**

1. Use operation IDs to deduplicate:
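   ```python
   from datetime import datetime, timezone

   # Deterministic key: one import per source per UTC day (not local time)
   today = datetime.now(timezone.utc).date().isoformat()
   operation_id = f"import_{source.id}_{today}"
   if await import_log_repo.get_by_operation_id(operation_id):
       return  # Already done
   ```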

2. Use transactions for multi-step operations
3. Store operation state before execution
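
Steps 1 to 3 can be sketched together as a small state machine. The example below assumes SQLite via the standard-library `sqlite3` module (synchronous, for brevity) and a hypothetical `operations` table; none of these names come from the existing code. Claiming the operation ID relies on the primary-key constraint rather than on a separate check-then-insert, so two workers cannot both claim the same ID.

```python
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS operations ("
        "operation_id TEXT PRIMARY KEY, "
        "state TEXT NOT NULL CHECK (state IN ('pending', 'completed', 'failed')))"
    )
    conn.commit()


def claim_operation(conn: sqlite3.Connection, operation_id: str) -> bool:
    """Record the operation as pending. Returns False if it is already
    pending or completed; a previously failed operation may be retried."""
    with conn:  # commits on success, rolls back on exception
        retried = conn.execute(
            "UPDATE operations SET state = 'pending' "
            "WHERE operation_id = ? AND state = 'failed'",
            (operation_id,),
        )
        if retried.rowcount == 1:
            return True
        inserted = conn.execute(
            "INSERT OR IGNORE INTO operations (operation_id, state) "
            "VALUES (?, 'pending')",
            (operation_id,),
        )
        return inserted.rowcount == 1


def _set_state(conn: sqlite3.Connection, operation_id: str, state: str) -> None:
    with conn:
        conn.execute(
            "UPDATE operations SET state = ? WHERE operation_id = ?",
            (state, operation_id),
        )


def run_once(conn: sqlite3.Connection, operation_id: str,
             work: Callable[[], None]) -> bool:
    """Run `work` at most once per operation ID. Returns True if it ran."""
    if not claim_operation(conn, operation_id):
        return False
    try:
        work()
    except Exception:
        _set_state(conn, operation_id, "failed")
        raise
    _set_state(conn, operation_id, "completed")
    return True
```

One limitation of this sketch: a process that crashes while the state is `pending` leaves the operation blocked, so a real implementation would also need a staleness rule for pending rows.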

**Possible traps and issues**

- Idempotency keys must be unique but deterministic
- Transactions require database support
- State machine (pending → completed/failed) must be enforced

**Docs changes needed**

- Update `Docs/Backend-Development.md` § Task Idempotency

**Doc references**

- `Docs/Backend-Development.md` (task design)
- `backend/app/tasks/` (task implementations)

---

## [CRITICAL] Health check endpoint returns wrong status code

**Where found**

- `backend/app/routers/health.py` — always returns 200, even when fail2ban offline

**Why this is needed**

Docker health checks treat a successful probe (HTTP 200) as "healthy". If fail2ban is offline but the backend still returns 200, Docker reports the container as healthy and nothing restarts it or raises an alert.

**Goal**

Return 503 Service Unavailable when fail2ban is offline.

**What to do**

1. Change health endpoint to return 503 when offline:
   ```python
   if not server_status.online:
       return JSONResponse(
           status_code=503,
           content={"status": "unavailable", "fail2ban": "offline"}
       )
   ```

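2. Make sure the Docker health check command exits non-zero on a 503 (e.g. `curl -f` fails on HTTP status 400 and above), so the container is marked "unhealthy"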
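
If the image does not ship `curl`, the probe could be a short Python script instead. This is a sketch only: `HEALTH_URL` is a hypothetical address and must be adjusted to the real host, port, and path of the health endpoint.

```python
import sys
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/api/health"  # hypothetical, adjust as needed


def probe(url: str, timeout: float = 5.0) -> int:
    """Return a process exit code: 0 if the endpoint answers 200, else 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx such as
        # 503; connection failures and timeouts are caught here as well.
        return 1


def main() -> None:
    sys.exit(probe(HEALTH_URL))
```

Calling `main()` from a script entry point gives Docker the exit code it needs: 0 marks the container healthy, 1 marks it unhealthy.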

**Possible traps and issues**

- Returning 503 causes orchestration tools to restart container
- If fail2ban restarts frequently, health check becomes flaky
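- Consider gradual degradation, e.g. tolerate short outages through the health check's `--retries` and `--interval` settings before the container is marked unhealthy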

**Docs changes needed**

- Update `Docker/Dockerfile.backend` health check documentation
- Update `Docs/Deployment.md` § Health Checks

**Doc references**

- `backend/app/routers/health.py`
- `Docker/Dockerfile.backend`

---

## [IMPORTANT] Database transactions lack explicit isolation

**Where found**

- `backend/app/repositories/session_repo.py:40-60` — multiple queries without `BEGIN TRANSACTION`
- Similar pattern in multi-step operations across repositories

**Why this is needed**

Without explicit boundaries, concurrent requests can race: Thread A checks if exists → not found, Thread B checks same → not found, Thread A inserts → succeeds, Thread B inserts → duplicate error or silent overwrite.

**Goal**

Wrap all multi-step operations in explicit transactions with appropriate isolation level.

**What to do**

1. Use explicit `BEGIN IMMEDIATE` transaction:
   ```python
   await db.execute("BEGIN IMMEDIATE")
   try:
       await db.execute("INSERT INTO sessions ...")
       await db.commit()
   except Exception:
       await db.rollback()
       raise
   ```

2. Use `IMMEDIATE` mode to lock immediately for writes
3. Document transaction boundaries clearly
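
The race described above involves a check followed by an insert, so both steps need to sit inside the same transaction. Below is a minimal sketch using the standard-library `sqlite3` module; the `sessions` schema and the function names are assumptions for illustration, not the project's actual repository code.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None turns off the sqlite3 module's implicit
    # transaction handling, so BEGIN/COMMIT/ROLLBACK are fully explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "token TEXT PRIMARY KEY, user TEXT NOT NULL)"
    )
    return conn


def create_session_if_absent(conn: sqlite3.Connection, token: str, user: str) -> bool:
    """Check-then-insert as one atomic unit. Returns True if a row was inserted."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT 1 FROM sessions WHERE token = ?", (token,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO sessions (token, user) VALUES (?, ?)", (token, user)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row is None
```

Because the write lock is taken before the `SELECT`, a second writer cannot run its own check between this transaction's check and insert.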

**Possible traps and issues**

- Nested transactions (SAVEPOINTs) may be needed
- Locks held too long cause contention
- Concurrent writers contend for the write lock; in SQLite this surfaces as `SQLITE_BUSY` ("database is locked") unless a busy timeout is set

**Docs changes needed**

- Add section in `Docs/Backend-Development.md` § Database Transactions

**Doc references**

- `Docs/Backend-Development.md` (database design)

---

## [IMPORTANT] Scheduler lock race condition

**Where found**

- `backend/app/utils/scheduler_lock.py:56-58` — heartbeat interval 10 seconds

**Why this is needed**

Current design: Process A acquires lock, heartbeat misses, lock expires, Process B acquires lock, both running simultaneously → duplicate work, data corruption.

**Goal**

Implement robust distributed locking that prevents concurrent execution.

**What to do**

**Option A (Strengthen heartbeat):**
- Reduce interval to 5s (half of timeout)
- Use database advisory locks
- Monitor heartbeat failures

**Option B (Migrate to Redis):**
- Use `redlock-py` or `redis-py`'s asyncio client (`redis.asyncio`, which superseded the standalone `aioredis` package)
- Simpler, more reliable than database-backed

**Current code improvements:**
- Log when heartbeat fails
- Add metric for lock contention
- Test multi-process scenario
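
For the database-backed variant, one way to make heartbeat failures visible is to have acquire and renew each be a single conditional statement whose row count reports success. The sketch below assumes SQLite and a hypothetical `scheduler_lock` table; the current time is passed in explicitly so the logic can be tested without sleeping. It is not the implementation in `scheduler_lock.py`.

```python
import sqlite3


def init_lock_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        "name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn: sqlite3.Connection, name: str, owner: str,
                now: float, ttl: float) -> bool:
    """Take the lock if it is free or its lease has expired."""
    with conn:
        taken_over = conn.execute(
            "UPDATE scheduler_lock SET owner = ?, expires_at = ? "
            "WHERE name = ? AND expires_at <= ?",
            (owner, now + ttl, name, now),
        )
        if taken_over.rowcount == 1:
            return True
        created = conn.execute(
            "INSERT OR IGNORE INTO scheduler_lock (name, owner, expires_at) "
            "VALUES (?, ?, ?)",
            (name, owner, now + ttl),
        )
        return created.rowcount == 1


def heartbeat(conn: sqlite3.Connection, name: str, owner: str,
              now: float, ttl: float) -> bool:
    """Extend the lease only if this owner still holds an unexpired lock.

    A False result means the lease was lost: the caller should log it
    and stop working rather than continue alongside a new owner.
    """
    with conn:
        renewed = conn.execute(
            "UPDATE scheduler_lock SET expires_at = ? "
            "WHERE name = ? AND owner = ? AND expires_at > ?",
            (now + ttl, name, owner, now),
        )
    return renewed.rowcount == 1
```

This narrows the window for double execution but does not close it: a holder that stalls past its lease can still overlap with the next owner until its next heartbeat fails.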

**Possible traps and issues**

- Database locks don't scale under high contention
- Redis adds new dependency
- Clock skew breaks timestamp-based expiry

**Docs changes needed**

- Update `Docs/Deployment.md` § Scheduler Lock
- Add troubleshooting: "Blocklist import runs twice"

**Doc references**

- `Docs/Deployment.md` (scheduler)
- `backend/app/utils/scheduler_lock.py` (lock implementation)

---

## [IMPORTANT] API pagination doesn't return metadata

**Where found**

- `backend/app/routers/history.py` — returns bare list, no pagination metadata
- All paginated routers have same issue

**Why this is needed**

Frontend receives bare list, cannot determine: total results, whether more pages exist, last page number. Must guess or re-query.

**Goal**

Return pagination metadata with every paginated response.

**What to do**

1. Create response wrapper:
   ```python
   class PaginatedResponse(BaseModel):
       data: list[Item]
       pagination: PaginationMetadata
   ```

2. Update all paginated routers to return this wrapper
3. Update frontend to use metadata for UI
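
The wrapper above references `PaginationMetadata` without defining it. One possible shape is sketched below using a standard-library dataclass so that it runs without dependencies; in the project it would presumably be a Pydantic model, and the field names are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PaginationMetadata:
    page: int
    page_size: int
    total: int
    total_pages: int
    has_next: bool
    has_prev: bool


def build_pagination(page: int, page_size: int, total: int) -> PaginationMetadata:
    """Derive the metadata the frontend needs from page, size, and row count."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total_pages = math.ceil(total / page_size)
    return PaginationMetadata(
        page=page,
        page_size=page_size,
        total=total,
        total_pages=total_pages,
        has_next=page < total_pages,
        has_prev=page > 1,
    )
```

With this, the frontend can read total results, last page number, and whether more pages exist directly from the response.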

**Possible traps and issues**

- `SELECT COUNT(*)` is slow on large tables
- Response shape change — old frontend may not handle

**Docs changes needed**

- Update API documentation § Pagination

**Doc references**

- `backend/app/utils/pagination.py`

---

## [IMPORTANT] Error response schema inconsistent

**Where found**

- Different handlers return different response shapes
- Fail2Ban errors: `{ "error_code": "...", "detail": "..." }`
- Validation errors: `{ "detail": [...] }`
- Not found errors: `{ "detail": "...", "error_code": "..." }`

**Why this is needed**

Frontend must normalize multiple shapes, making error handling fragile and error-prone.

**Goal**

Unify all error responses to single schema.

**What to do**

1. Define canonical error response:
   ```python
   class ErrorResponse(BaseModel):
       error_code: str
       message: str
       status: int
       details: dict | None = None
   ```

2. Update all handlers to return this format
3. Update frontend to expect unified schema
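
During migration, the legacy shapes listed under "Where found" could be mapped onto the canonical schema by one helper. This is a sketch with a dataclass standing in for the Pydantic model; the function name and the fallback codes `validation_error` and `unknown_error` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ErrorResponse:
    error_code: str
    message: str
    status: int
    details: Optional[dict] = None


def to_error_response(payload: dict[str, Any], status: int) -> ErrorResponse:
    """Map a legacy error payload onto the unified schema."""
    detail = payload.get("detail")
    if isinstance(detail, list):
        # Validation errors: keep the per-field entries under `details`.
        return ErrorResponse(
            error_code="validation_error",
            message="Request validation failed",
            status=status,
            details={"errors": detail},
        )
    return ErrorResponse(
        error_code=payload.get("error_code", "unknown_error"),
        message=str(detail or ""),
        status=status,
    )
```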

**Possible traps and issues**

- Backward compatibility with old clients
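- FastAPI's built-in handlers (e.g. for `RequestValidationError` and `HTTPException`) keep returning `{ "detail": ... }` unless explicitly overridden with `@app.exception_handler(...)`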
- Rich detail structures need accommodation

**Docs changes needed**

- Update API documentation with unified error schema
- Add error code reference table

**Doc references**

- `Docs/API.md` (error codes)
- `backend/app/main.py` (exception handlers)

---

## [IMPORTANT] Provider ordering fragility (Frontend)

**Where found**

- `frontend/src/App.tsx` — 10-level deep provider nesting
- `frontend/src/providers/PROVIDER_ORDER.md` — documents order, no compile-time enforcement

**Why this is needed**

Provider order (ThemeProvider → AppContents → FluentProvider → ...) enforced only at runtime. Accidental reorder caught only after deploy.

**Goal**

Add compile-time validation of provider ordering.

**What to do**

1. Create provider composition utility enforcing order
2. Use TypeScript discriminated unions
3. Add ESLint rule to check provider wrapping

**Possible traps and issues**

- TypeScript doesn't easily enforce ordering
- May be overkill — improve runtime error messages instead

**Docs changes needed**

- Update `Docs/Architekture.md` § 3.2 (Providers)

**Doc references**

- `Docs/Architekture.md` § 3.2 (Providers)
- `frontend/src/providers/PROVIDER_ORDER.md`

---

## [IMPORTANT] Promise cancellation not checked in .then()/.catch() chains

**Where found**

- `frontend/src/components/blocklist/BlocklistSourcesSection.tsx:84-88`
- `frontend/src/components/blocklist/BlocklistScheduleSection.tsx:49-58`
- Multiple components use this pattern

**Why this is needed**

When user navigates away, `.then()` chains don't check if cancelled. State updated on unmounted component → React warnings, memory leak, notification shows wrong context.

**Goal**

Check for cancellation in all `.then()/.catch()` chains.

**What to do**

1. Replace `.then()/.catch()` with `async/await` and cancellation check
2. Or use wrapper hook to hide logic

**Possible traps and issues**

- Checking `signal.aborted` after `await` introduces race conditions
- Better: let AbortError propagate, catch it in catch block

**Docs changes needed**

- Update `Docs/Web-Development.md` § Async Patterns

**Doc references**

- `Docs/Web-Development.md` (async patterns)

---

## [MEDIUM] Inefficient database pagination uses OFFSET

**Where found**

- `backend/app/utils/pagination.py` — uses `OFFSET (page-1) * page_size`

**Why this is needed**

With `OFFSET N`, the database scans and discards N rows before returning the next `page_size` rows, so cost grows with page depth. Last page on 10M row table: 15 seconds ⚠️

**Goal**

Implement keyset pagination (cursor-based) for large result sets.

**What to do**

1. **Short-term:** Add database indexes on sort columns
2. **Long-term:** Implement cursor-based pagination using WHERE instead of OFFSET
3. Frontend sends cursor (last row ID) instead of page number
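
A keyset query for step 2 might look like the sketch below. It assumes SQLite, a hypothetical `history(id, ip)` table, and positive integer IDs used as the sort key; the cursor is returned as a raw ID here, whereas a real API would likely encode it to keep it opaque.

```python
import sqlite3
from typing import Optional


def fetch_page(conn: sqlite3.Connection, after_id: Optional[int], limit: int):
    """Return (rows, next_cursor), ordered by id.

    next_cursor is the id of the last returned row, or None on the last page.
    """
    # Fetch one extra row to learn whether another page exists.
    rows = conn.execute(
        "SELECT id, ip FROM history WHERE id > ? ORDER BY id LIMIT ?",
        (after_id if after_id is not None else 0, limit + 1),
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = rows[-1][0] if has_more else None
    return rows, next_cursor
```

The `WHERE id > ?` condition lets the database seek straight to the start of the page through the primary-key index instead of scanning and discarding earlier rows.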

**Possible traps and issues**

- Sort order must be deterministic: include a unique tie-breaker column (e.g. the row ID) in the cursor
- API contract changes
- Cursor format must be opaque to client

**Docs changes needed**

- Update `Docs/Backend-Development.md` § Database Performance

**Doc references**

- `Docs/Backend-Development.md` (database performance)

---

## [MEDIUM] Session secret rotation not implemented

**Where found**

- `backend/app/config.py` — single `session_secret` with no rotation support

**Why this is needed**

If secret leaks, all sessions compromised. No way to invalidate old sessions.

**Goal**

Support gradual secret rotation without forcing logout.

**What to do**

1. Store multiple secrets: current and previous
2. Accept tokens signed with either key
3. Re-sign tokens with current secret on validation
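
The three steps can be illustrated with HMAC-signed tokens. The token format (`payload.signature`) and the function names below are assumptions made for the sketch; the project's actual session token format is not described here.

```python
import hashlib
import hmac
from typing import Optional


def sign(payload: str, secret: bytes) -> str:
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify_and_refresh(token: str, current: bytes,
                       previous: Optional[bytes]) -> Optional[str]:
    """Return a token signed with the current secret, or None if invalid.

    Tokens signed with the previous secret are accepted and re-signed,
    so sessions migrate to the current secret as they are used.
    """
    payload, sep, mac = token.rpartition(".")
    if not sep:
        return None
    for secret in (current, previous):
        if secret is None:
            continue
        expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
        # Compare as bytes in constant time.
        if hmac.compare_digest(mac.encode(), expected.encode()):
            return sign(payload, current)
    return None
```

Passing `previous=None` ends the grace period: tokens signed with the old secret are then rejected.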

**Possible traps and issues**

- Rotation strategy must be documented, including when the previous secret is retired (immediately, if it leaked)
- Metrics needed to track secret usage

**Docs changes needed**

- Update `Docs/Backend-Development.md` § Session Management

**Doc references**

- `Docs/Backend-Development.md`

---

## [MEDIUM] No CORS configuration

**Where found**

- `backend/app/main.py` — no CORS middleware added

**Why this is needed**

If frontend on different origin, cross-origin requests blocked without CORS configuration.

**Goal**

Add CORS middleware with proper origin whitelisting.

**What to do**

1. Add CORS middleware with specific origin whitelist
2. Make configurable via environment variable
3. Default to localhost for development
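
Steps 2 and 3 amount to parsing a comma-separated environment variable with a development fallback. A minimal sketch follows; the default origin and the function name are hypothetical, and the wildcard is rejected outright to avoid the trap noted below.

```python
from typing import Optional

# Hypothetical development default; adjust to the frontend dev server in use.
DEFAULT_DEV_ORIGINS = ["http://localhost:5173"]


def parse_allowed_origins(raw: Optional[str]) -> list[str]:
    """Parse a comma-separated origin list, falling back to the dev default."""
    if raw is None or not raw.strip():
        return list(DEFAULT_DEV_ORIGINS)
    # Origins carry no trailing slash, so strip it if present.
    origins = [o.strip().rstrip("/") for o in raw.split(",") if o.strip()]
    if "*" in origins:
        raise ValueError("wildcard origin is not allowed; list explicit origins")
    return origins
```

The resulting list would be passed as `allow_origins` when adding FastAPI's `CORSMiddleware` in `backend/app/main.py`.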

**Possible traps and issues**

- `allow_origins=["*"]` defeats CORS security
- Credentials require specific origins, not wildcard
- Missing config fails only in the browser (CORS error in the console); the backend reports nothing

**Docs changes needed**

- Update `Docs/Deployment.md` § CORS Configuration

**Doc references**

- `Docs/Deployment.md`

---

## [MEDIUM] Input validation missing for regex patterns (ReDoS)

**Where found**

- `backend/app/routers/config.py` — regex validation accepts arbitrary patterns without timeout

**Why this is needed**

Malicious regex causes catastrophic backtracking (ReDoS). Attacker sends pattern → evaluation hangs → DoS. The blow-up occurs when the pattern is matched against input, not when it is compiled.

**Goal**

Add timeout and complexity limits to regex validation.

**What to do**

1. Add timeout to regex compilation and to any matching performed with the pattern (2 seconds recommended)
2. Add length limit (reject patterns > 1000 characters)
3. Use `signal.alarm()` (Unix) or timeout library
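
The cheap checks can run before any timeout machinery is involved. The sketch below covers only the length limit from step 2 and a syntax check; it does not implement the timeout from steps 1 and 3. `validate_pattern` is a hypothetical helper name.

```python
import re

MAX_PATTERN_LENGTH = 1000  # limit from step 2


def validate_pattern(pattern: str) -> re.Pattern:
    """Reject over-long or syntactically invalid patterns.

    Returns the compiled pattern so callers do not compile it twice.
    """
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError(f"pattern exceeds {MAX_PATTERN_LENGTH} characters")
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regex: {exc}") from exc
```

A short pattern can still backtrack catastrophically, so this check complements the timeout rather than replacing it.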

**Possible traps and issues**

- `signal.alarm()` is Unix-only, and signal handlers can only be installed from the main thread
- Some valid complex regexes may timeout
- Frontend should also validate (defense in depth)

**Docs changes needed**

- Update API docs to document regex validation limits

**Doc references**

- `backend/app/routers/config.py`

---

## [MEDIUM] No structured logging to external system

**Where found**

- Logs only go to stdout/file, no external aggregation

**Why this is needed**

Can't search across instances, historical logs lost on instance recycle.

**Goal**

Ship logs to centralized logging platform.

**What to do**

1. **Short-term:** Ensure `structlog` JSON output is valid (already done)
2. **Long-term:** Ship to logging platform (ELK, Datadog, Papertrail)

**Possible traps and issues**

- External logging adds latency
- Sensitive data must not be logged
- Log volume can be massive

**Docs changes needed**

- Add `Docs/Observability.md` section on logging

**Doc references**

- `Docs/Observability.md` (new)

---

## [MEDIUM] No Application Performance Monitoring (APM)

**Where found**

- Backend: no metrics collection, latency tracking
- Frontend: no error tracking, performance metrics
- No observability into request performance

**Why this is needed**

Without metrics, blind in production: API slow? Unknown. Which endpoints fail most? Unknown.

**Goal**

Add comprehensive metrics collection and monitoring.

**What to do**

1. **Backend metrics:**
   - Add Prometheus metrics: request count, latency, active requests
   - Expose `/metrics` endpoint

2. **Frontend metrics:**
   - Page load time, FCP, LCP using `web-vitals`
   - API error rates and latencies

3. **Aggregation:**
   - Prometheus + Grafana, or Datadog/New Relic

**Possible traps and issues**

- Metrics collection has performance cost
- Cardinality explosion with tags
- PII in metrics

**Docs changes needed**

- Add `Docs/Observability.md`

**Doc references**

- `Docs/Observability.md` (new)

---

## [LOW] Frontend charts not memoized

**Where found**

- `frontend/src/components/TopCountriesPieChart.tsx`
- `frontend/src/components/TopCountriesBarChart.tsx`

**Why this is needed**

Charts re-render on every parent update, Recharts reprocesses 5000+ points.

**Goal**

Memoize chart components.

**What to do**

1. Wrap with `React.memo` with custom comparison
2. Ensure data objects are stable

**Possible traps and issues**

- Shallow comparison might not be enough
- Memoization has memory cost

**Docs changes needed**

- No documentation changes

**Doc references**

- `frontend/src/components/TopCountriesPieChart.tsx`
- `frontend/src/components/TopCountriesBarChart.tsx`

---

## [LOW] No request deduplication on frontend

**Where found**

- `frontend/src/hooks/useFetchData.ts` — each call launches new request
- User clicks "Refresh" twice → two identical requests

**Why this is needed**

Duplicates waste bandwidth, cause race conditions (response 2 arrives first, then response 1 overwrites with stale data).

**Goal**

Deduplicate identical in-flight requests.

**What to do**

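1. Implement a cache of in-flight requests, keyed by URL and parameters, so concurrent callers share one pending request
2. Clear cache entry when the response settles (success or error)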
3. Use in `useFetchData`

**Possible traps and issues**

- Cache must be cleared on data mutation
- Stale data in cache possible if not careful

**Docs changes needed**

- No documentation changes

**Doc references**

- `frontend/src/hooks/useFetchData.ts`