BanGUI/Docs/Tasks.md at d1316ca66e50f6505fe2d6ea409374e59f44d39a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-04-30 21:05:00 +02:00

[CRITICAL] Docker containers lack resource limits

Where found

Docker/docker-compose.yml — no deploy.resources.limits or deploy.resources.reservations sections

Why this is needed

Without resource limits, a single container can consume all host CPU, memory, or disk. Typical "noisy neighbor" scenario: backend memory leak → uses 100% RAM → OOM kill → host unresponsive.

Goal

Set hard and soft resource limits for all containers.

What to do

Add resource limits to docker-compose.yml:

backend:
  deploy:
    resources:  # limits and reservations must sit under deploy.resources
      limits:
        cpus: '2'
        memory: 512M
      reservations:
        cpus: '1'
        memory: 256M

Document these limits in Docs/Deployment.md
For Kubernetes, add equivalent resources.limits and resources.requests

Possible traps and issues

Limits set too low → OOM kill or throttling
Backend may need more memory for large blocklists
Test under expected load before finalizing
Different environments may need different limits

Docs changes needed

Update Docker/docker-compose.yml with deploy sections
Add section in Docs/Deployment.md § Resource Allocation

Doc references

Docker/docker-compose.yml
Docs/Deployment.md (resource allocation)

[CRITICAL] Global rate limiting missing

Where found

backend/app/routers/auth.py — only /api/auth/login has rate limiting
All other routers have no rate limiting

Why this is needed

Without rate limiting, attackers can spam endpoints to cause CPU spike, database overload, or network bandwidth exhaustion.

Goal

Implement global per-IP rate limiting on all endpoints.

What to do

Add rate limiting middleware to backend/app/main.py:

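from slowapi import Limiter
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address
limiter = Limiter(key_func=get_remote_address, default_limits=["200 per minute"])
app.state.limiter = limiter
app.add_middleware(SlowAPIMiddleware)  # needed so default_limits apply to every route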

Apply to all routers with appropriate limits per endpoint
Return proper HTTP 429 with Retry-After header
Document limits in API docs
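
To make the 429 and Retry-After requirement concrete, here is a minimal fixed-window per-IP counter in plain Python. It is an illustration only and does not reflect how slowapi works internally. The class name `FixedWindowLimiter` and its parameters are hypothetical, and the state lives in process memory, so it would not be shared between workers.

```python
import math
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FixedWindowLimiter:
    """Hypothetical in-memory per-key limiter (illustration only)."""

    limit: int = 200       # max requests per window
    window: float = 60.0   # window length in seconds
    _hits: dict = field(default_factory=dict)  # key -> (window_start, count)

    def check(self, key: str, now: Optional[float] = None) -> tuple:
        """Return (allowed, retry_after_seconds) for one request from `key`."""
        now = time.monotonic() if now is None else now
        start, count = self._hits.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired: start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            # Over the limit: report how long until the window resets.
            retry_after = math.ceil(self.window - (now - start))
            return False, max(retry_after, 1)
        self._hits[key] = (start, count + 1)
        return True, 0
```

A request handler would answer a `(False, n)` result with HTTP 429 and a `Retry-After: n` header.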

Possible traps and issues

Limits set too low block legitimate users
Distributed deployments need shared limiter state (Redis-backed)
Different endpoints may need different limits
Trusted IPs should bypass limiting

Docs changes needed

Add section in Docs/Backend-Development.md § Rate Limiting
Document default limits in deployment guide

Doc references

Docs/Backend-Development.md (rate limiting)
backend/app/main.py (middleware setup)

[CRITICAL] Missing security headers (CSP, X-Frame-Options, etc.)

Where found

Backend does not set Content-Security-Policy, X-Frame-Options, X-Content-Type-Options headers
Frontend HTML served without CSP meta tags

Why this is needed

Without security headers, browsers cannot apply their built-in mitigations for XSS, clickjacking, MIME sniffing, and referrer leakage.

Goal

Add security headers to all HTTP responses.

What to do

Add security headers middleware to backend/app/main.py:

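@app.middleware("http")
async def add_security_headers(request, call_next):
    response = await call_next(request)
    # Restrict all resource loads to the application's own origin
    response.headers["Content-Security-Policy"] = "default-src 'self'"
    # Forbid embedding in frames (clickjacking)
    response.headers["X-Frame-Options"] = "DENY"
    # Disable MIME-type sniffing
    response.headers["X-Content-Type-Options"] = "nosniff"
    # Limit referrer leakage to other origins
    response.headers["Referrer-Policy"] = "strict-origin-when-cross-origin"
    return response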

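In frontend index.html, add CSP meta tag (frame-ancestors, report-uri, and sandbox are ignored in meta tags, so clickjacking protection must come from the response headers)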
Test with browser DevTools Security tab

Possible traps and issues

CSP 'unsafe-inline' defeats security — avoid if possible
CDN resources may need explicit allowlist
Too restrictive CSP breaks functionality; too loose defeats security

Docs changes needed

Add section in Docs/Security.md § HTTP Security Headers

Doc references

Docs/Security.md (security headers)

[CRITICAL] Background tasks lack timeout protection

Where found

backend/app/tasks/blocklist_import.py — no timeout
backend/app/tasks/health_check.py — no timeout
All task functions lack timeout wrapper

Why this is needed

If task hangs (API unreachable, network partition), task runs forever. Never completes → lock never released → duplicate work, resource exhaustion.

Goal

Ensure all background tasks complete within bounded time or fail gracefully.

What to do

Wrap all task functions with asyncio.wait_for(task, timeout):

await asyncio.wait_for(blocklist_service.import_all(...), timeout=300)

Set appropriate timeouts per task:
- Blocklist import: 300s (5 min)
- Health probe: 10s
- Geo cache flush: 60s
Log timeout events and trigger alerts

Possible traps and issues

Timeout too short → legitimate tasks killed prematurely
Timeout too long → resource leak if many tasks hang
Killing task mid-operation may leave inconsistent state

Docs changes needed

Add section in Docs/Backend-Development.md § Background Tasks

Doc references

Docs/Backend-Development.md (background tasks)
backend/app/tasks/ (task modules)

[CRITICAL] Background tasks not idempotent

Where found

backend/app/tasks/blocklist_import.py — bans applied without checking if already banned
backend/app/tasks/geo_cache_flush.py — cache entries written without transaction
Multi-step operations not wrapped in transaction

Why this is needed

If task crashes mid-execution, partial state remains. On retry: bans applied again → duplicates, cache entries written twice → corruption.

Goal

Make all background tasks idempotent — retrying produces same result as running once.

What to do

Use operation IDs to deduplicate:

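from datetime import datetime, timezone

# Use the UTC date so the key is the same on every host, whatever its local timezone
operation_id = f"import_{source.id}_{datetime.now(timezone.utc).date().isoformat()}"
if await import_log_repo.get_by_operation_id(operation_id):
    return  # Already done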

Use transactions for multi-step operations
Store operation state before execution
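
The three steps above (deterministic ID, state stored before execution, skip when already completed) can be sketched with an in-memory log. `InMemoryImportLog` is a hypothetical stand-in for a database-backed repository; it is not safe for concurrent callers, which would need the check and the state write in one transaction.

```python
from datetime import date
from enum import Enum


class State(str, Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


def operation_id(source_id: int, day: date) -> str:
    """Deterministic key: the same source and day always map to one ID."""
    return f"import_{source_id}_{day.isoformat()}"


class InMemoryImportLog:
    """Hypothetical stand-in for a database-backed import log."""

    def __init__(self) -> None:
        self._states = {}

    def run_once(self, op_id: str, work) -> bool:
        """Run `work` unless `op_id` already completed. Returns True if it ran."""
        if self._states.get(op_id) is State.COMPLETED:
            return False  # already done, so a retry is a no-op
        self._states[op_id] = State.PENDING  # record state before executing
        try:
            work()
        except Exception:
            self._states[op_id] = State.FAILED  # a later retry may run again
            raise
        self._states[op_id] = State.COMPLETED
        return True
```

Calling `run_once` twice with the same ID runs the work once; after a failure the operation stays retryable.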

Possible traps and issues

Idempotency keys must be unique but deterministic
Transactions require database support
State machine (pending → completed/failed) must be enforced

Docs changes needed

Update Docs/Backend-Development.md § Task Idempotency

Doc references

Docs/Backend-Development.md (task design)
backend/app/tasks/ (task implementations)

[CRITICAL] Health check endpoint returns wrong status code

Where found

backend/app/routers/health.py — always returns 200, even when fail2ban offline

Why this is needed

Docker health checks treat a successful probe (HTTP 200) as "healthy". If fail2ban is offline but the backend still returns 200, Docker reports the container as healthy and orchestration tools never restart or replace it.

Goal

Return 503 Service Unavailable when fail2ban is offline.

What to do

Change health endpoint to return 503 when offline:

if not server_status.online:
    return JSONResponse(
        status_code=503,
        content={"status": "unavailable", "fail2ban": "offline"}
    )

Make sure the Docker health check command exits non-zero on 503 (e.g. curl -f fails on HTTP 4xx/5xx responses) so the container is marked "unhealthy"

Possible traps and issues

Returning 503 causes orchestration tools to restart container
If fail2ban restarts frequently, health check becomes flaky
Consider gradual degradation

Docs changes needed

Update Docker/Dockerfile.backend health check documentation
Update Docs/Deployment.md § Health Checks

Doc references

backend/app/routers/health.py
Docker/Dockerfile.backend

[IMPORTANT] Database transactions lack explicit isolation

Where found

backend/app/repositories/session_repo.py:40-60 — multiple queries without BEGIN TRANSACTION
Similar pattern in multi-step operations across repositories

Why this is needed

Without explicit boundaries, concurrent requests can race: Thread A checks if exists → not found, Thread B checks same → not found, Thread A inserts → succeeds, Thread B inserts → duplicate error or silent overwrite.

Goal

Wrap all multi-step operations in explicit transactions with appropriate isolation level.

What to do

Use explicit BEGIN IMMEDIATE transaction:

await db.execute("BEGIN IMMEDIATE")
try:
    await db.execute("INSERT INTO sessions ...")
    await db.commit()
except Exception:
    await db.rollback()
    raise

Use IMMEDIATE mode to lock immediately for writes
Document transaction boundaries clearly
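
As a self-contained illustration of the check-then-insert race being closed, the sketch below uses the standard-library `sqlite3` module rather than the project's async driver. The `sessions` table layout and the function name are assumptions made for the example.

```python
import sqlite3


def create_session_if_absent(conn, token: str, username: str) -> bool:
    """Insert a session row unless one with the same token exists.

    The check and the insert run inside one BEGIN IMMEDIATE transaction, so a
    second writer cannot slip in between them.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        exists = conn.execute(
            "SELECT 1 FROM sessions WHERE token = ?", (token,)
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO sessions (token, username) VALUES (?, ?)",
                (token, username),
            )
        conn.execute("COMMIT")
        return exists is None
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts sqlite3 in autocommit mode, so the explicit
# BEGIN/COMMIT statements above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sessions (token TEXT PRIMARY KEY, username TEXT)")
```

The first call for a token inserts and returns True; a repeat call finds the row and returns False without touching it.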

Possible traps and issues

Nested transactions (SAVEPOINTs) may be needed
Locks held too long cause contention
Deadlocks or "database is locked" (SQLITE_BUSY) errors possible with concurrent writers; set a busy timeout

Docs changes needed

Add section in Docs/Backend-Development.md § Database Transactions

Doc references

Docs/Backend-Development.md (database design)

[IMPORTANT] Scheduler lock race condition

Where found

backend/app/utils/scheduler_lock.py:56-58 — heartbeat interval 10 seconds

Why this is needed

Current design: Process A acquires lock, heartbeat misses, lock expires, Process B acquires lock, both running simultaneously → duplicate work, data corruption.

Goal

Implement robust distributed locking that prevents concurrent execution.

What to do

Option A (Strengthen heartbeat):

Reduce interval to 5s (half of timeout)
Use database advisory locks
Monitor heartbeat failures

Option B (Migrate to Redis):

Use redlock-py or redis-py's asyncio client (redis.asyncio; the standalone aioredis package is deprecated)
Simpler, more reliable than database-backed

Current code improvements:

Log when heartbeat fails
Add metric for lock contention
Test multi-process scenario
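
For the multi-process test, it helps to have a lock whose acquire step is a single atomic statement. The sketch below is a lease row in SQLite, not an advisory lock and not the existing `scheduler_lock.py` logic. Table name, columns, and `try_acquire` are hypothetical, and `now` is passed in so expiry can be tested without sleeping.

```python
import sqlite3


def try_acquire(conn, name: str, owner: str, now: float, ttl: float) -> bool:
    """Acquire or renew the lease `name` for `owner`; return True on success.

    A single UPSERT makes check-and-take atomic: the row is only overwritten
    when the lease has expired or already belongs to `owner` (heartbeat).
    """
    conn.execute(
        """
        INSERT INTO scheduler_lock (name, owner, expires_at)
        VALUES (?, ?, ?)
        ON CONFLICT(name) DO UPDATE
            SET owner = excluded.owner, expires_at = excluded.expires_at
            WHERE scheduler_lock.expires_at <= ?
               OR scheduler_lock.owner = excluded.owner
        """,
        (name, owner, now + ttl, now),
    )
    conn.commit()
    row = conn.execute(
        "SELECT owner FROM scheduler_lock WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row[0] == owner


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduler_lock "
    "(name TEXT PRIMARY KEY, owner TEXT, expires_at REAL)"
)
```

A heartbeat is simply another `try_acquire` call by the current owner. If it returns False, the process has lost the lease and should stop its work.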

Possible traps and issues

Database locks don't scale under high contention
Redis adds new dependency
Clock skew breaks timestamp-based expiry

Docs changes needed

Update Docs/Deployment.md § Scheduler Lock
Add troubleshooting: "Blocklist import runs twice"

Doc references

Docs/Deployment.md (scheduler)
backend/app/utils/scheduler_lock.py (lock implementation)

[IMPORTANT] API pagination doesn't return metadata

Where found

backend/app/routers/history.py — returns bare list, no pagination metadata
All paginated routers have same issue

Why this is needed

Frontend receives bare list, cannot determine: total results, whether more pages exist, last page number. Must guess or re-query.

Goal

Return pagination metadata with every paginated response.

What to do

Create response wrapper:

class PaginatedResponse(BaseModel):
    data: list[Item]
    pagination: PaginationMetadata

Update all paginated routers to return this wrapper
Update frontend to use metadata for UI
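
The wrapper above leaves `PaginationMetadata` undefined. One possible shape, derived from the three things the frontend is said to be missing (total results, whether more pages exist, last page number), is sketched here with a plain dataclass. The field names and `build_metadata` are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PaginationMetadata:
    page: int         # 1-based current page
    page_size: int
    total: int        # total matching rows
    total_pages: int  # last page number
    has_next: bool


def build_metadata(page: int, page_size: int, total: int) -> PaginationMetadata:
    """Derive the fields the frontend needs from one row count."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total_pages = math.ceil(total / page_size)
    return PaginationMetadata(
        page=page,
        page_size=page_size,
        total=total,
        total_pages=total_pages,
        has_next=page < total_pages,
    )
```

For example, 51 rows at 25 per page give three pages, and page 2 reports `has_next=True`.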

Possible traps and issues

SELECT COUNT(*) is slow on large tables
Response shape change — old frontend may not handle

Docs changes needed

Update API documentation § Pagination

Doc references

backend/app/utils/pagination.py

[IMPORTANT] Error response schema inconsistent

Where found

Different handlers return different response shapes
Fail2Ban errors: { "error_code": "...", "detail": "..." }
Validation errors: { "detail": [...] }
Not found errors: { "detail": "...", "error_code": "..." }

Why this is needed

Frontend must normalize multiple shapes, making error handling fragile and error-prone.

Goal

Unify all error responses to single schema.

What to do

Define canonical error response:

class ErrorResponse(BaseModel):
    error_code: str
    message: str
    status: int
    details: dict | None = None

Update all handlers to return this format
Update frontend to expect unified schema
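
During migration, the three legacy shapes listed under "Where found" could be mapped onto the canonical schema by one function. This is a sketch with a dataclass standing in for the Pydantic model. `normalize_error` and the fallback codes `validation_error` and `unknown_error` are assumptions, not existing identifiers.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ErrorResponse:
    error_code: str
    message: str
    status: int
    details: Optional[dict] = None


def normalize_error(status: int, body: dict) -> ErrorResponse:
    """Map a legacy error body onto the unified shape."""
    detail: Any = body.get("detail")
    code = body.get("error_code")
    if isinstance(detail, list):
        # Validation errors carry a list of per-field problems.
        return ErrorResponse(
            error_code=code or "validation_error",  # fallback is an assumption
            message="Request validation failed",
            status=status,
            details={"errors": detail},
        )
    return ErrorResponse(
        error_code=code or "unknown_error",  # fallback is an assumption
        message=str(detail) if detail is not None else "",
        status=status,
    )
```

Wrapping the validation list under `details["errors"]` is one way to accommodate rich detail structures without changing the top-level schema.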

Possible traps and issues

Backward compatibility with old clients
FastAPI's built-in handlers may override custom
Rich detail structures need accommodation

Docs changes needed

Update API documentation with unified error schema
Add error code reference table

Doc references

Docs/API.md (error codes)
backend/app/main.py (exception handlers)

[IMPORTANT] Provider ordering fragility (Frontend)

Where found

frontend/src/App.tsx — 10-level deep provider nesting
frontend/src/providers/PROVIDER_ORDER.md — documents order, no compile-time enforcement

Why this is needed

Provider order (ThemeProvider → AppContents → FluentProvider → ...) enforced only at runtime. Accidental reorder caught only after deploy.

Goal

Add compile-time validation of provider ordering.

What to do

Create provider composition utility enforcing order
Use TypeScript discriminated unions
Add ESLint rule to check provider wrapping

Possible traps and issues

TypeScript doesn't easily enforce ordering
May be overkill — improve runtime error messages instead

Docs changes needed

Update Docs/Architekture.md § 3.2 (Providers)

Doc references

Docs/Architekture.md § 3.2 (Providers)
frontend/src/providers/PROVIDER_ORDER.md

[IMPORTANT] Promise cancellation not checked in .then()/.catch() chains

Where found

frontend/src/components/blocklist/BlocklistSourcesSection.tsx:84-88
frontend/src/components/blocklist/BlocklistScheduleSection.tsx:49-58
Multiple components use this pattern

Why this is needed

When user navigates away, .then() chains don't check if cancelled. State updated on unmounted component → React warnings, memory leak, notification shows wrong context.

Goal

Check for cancellation in all .then()/.catch() chains.

What to do

Replace .then()/.catch() with async/await and cancellation check
Or use wrapper hook to hide logic

Possible traps and issues

Checking signal.aborted after await introduces race conditions
Better: let AbortError propagate, catch it in catch block

Docs changes needed

Update Docs/Web-Development.md § Async Patterns

Doc references

Docs/Web-Development.md (async patterns)

[MEDIUM] Inefficient database pagination uses OFFSET

Where found

backend/app/utils/pagination.py — uses OFFSET (page-1) * page_size

Why this is needed

OFFSET N reads and discards N rows before returning the next `limit` rows, so cost grows with page depth. Last page on 10M row table: 15 seconds ⚠️

Goal

Implement keyset pagination (cursor-based) for large result sets.

What to do

Short-term: Add database indexes on sort columns
Long-term: Implement cursor-based pagination using WHERE instead of OFFSET
Frontend sends cursor (last row ID) instead of page number
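
A minimal sketch of the long-term approach, using the standard-library `sqlite3` module. The `history` table layout is assumed, and the cursor is base64-encoded JSON, which makes it opaque by convention but not tamper-proof. It also assumes `limit >= 1` and a strictly increasing integer `id`.

```python
import base64
import json
import sqlite3
from typing import Optional


def encode_cursor(last_id: int) -> str:
    """Wrap the last seen ID in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"id": last_id}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return int(json.loads(base64.urlsafe_b64decode(cursor.encode()))["id"])


def fetch_page(conn, cursor: Optional[str], limit: int):
    """Return (rows, next_cursor) using WHERE id > ? instead of OFFSET."""
    last_id = decode_cursor(cursor) if cursor else 0
    # Fetch one extra row to learn whether another page exists.
    rows = conn.execute(
        "SELECT id, ip FROM history WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit + 1),
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = encode_cursor(rows[-1][0]) if has_more else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, ip TEXT)")
conn.executemany(
    "INSERT INTO history (id, ip) VALUES (?, ?)",
    [(i, f"10.0.0.{i}") for i in range(1, 6)],
)
```

Each page is an index seek on `id`, so the last page costs about the same as the first. Sorting by a non-unique column would need a tie-breaker such as `(timestamp, id)` in the cursor.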

Possible traps and issues

Cursor must be deterministic
API contract changes
Cursor format must be opaque to client

Docs changes needed

Update Docs/Backend-Development.md § Database Performance

Doc references

Docs/Backend-Development.md (database performance)

[MEDIUM] Session secret rotation not implemented

Where found

backend/app/config.py — single session_secret with no rotation support

Why this is needed

If secret leaks, all sessions compromised. No way to invalidate old sessions.

Goal

Support gradual secret rotation without forcing logout.

What to do

Store multiple secrets: current and previous
Accept tokens signed with either key
Re-sign tokens with current secret on validation
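
The accept-either-key and re-sign steps can be sketched with the standard-library `hmac` module. The `payload.signature` token format and the function names are assumptions for illustration. The real session tokens may be structured differently.

```python
import hashlib
import hmac
from typing import Optional, Sequence


def sign(payload: str, secret: bytes) -> str:
    """Return `payload.signature` using HMAC-SHA256."""
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify(token: str, secrets: Sequence[bytes]) -> Optional[tuple]:
    """Check `token` against each secret, current secret first.

    Returns (payload, fresh_token) on success, where fresh_token is always
    signed with secrets[0], or None if no secret matches.
    """
    payload, _, sig = token.rpartition(".")
    for secret in secrets:
        expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(sig.encode(), expected.encode()):
            return payload, sign(payload, secrets[0])
    return None
```

Once the previous secret is dropped from the list, tokens that were never re-signed stop validating, which is the point at which old sessions are invalidated.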

Possible traps and issues

Rotation strategy must be documented
Metrics needed to track secret usage

Docs changes needed

Update Docs/Backend-Development.md § Session Management

Doc references

Docs/Backend-Development.md

[MEDIUM] No CORS configuration

Where found

backend/app/main.py — no CORS middleware added

Why this is needed

If frontend on different origin, cross-origin requests blocked without CORS configuration.

Goal

Add CORS middleware with proper origin whitelisting.

What to do

Add CORS middleware with specific origin whitelist
Make configurable via environment variable
Default to localhost for development
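
Reading the origin list from the environment could look like the sketch below. The variable name `CORS_ALLOWED_ORIGINS`, the default `http://localhost:5173`, and `parse_allowed_origins` are assumptions, not existing configuration.

```python
import os


def parse_allowed_origins(raw: str) -> list:
    """Split a comma-separated origin list and reject the wildcard."""
    origins = [o.strip().rstrip("/") for o in raw.split(",") if o.strip()]
    if "*" in origins:
        raise ValueError("wildcard origin is not allowed; list origins explicitly")
    for origin in origins:
        if not origin.startswith(("http://", "https://")):
            raise ValueError(f"origin must include a scheme: {origin!r}")
    return origins


# Hypothetical variable name; the default targets a local dev server.
ALLOWED_ORIGINS = parse_allowed_origins(
    os.environ.get("CORS_ALLOWED_ORIGINS", "http://localhost:5173")
)
```

The resulting list can be passed as `allow_origins` to FastAPI's `CORSMiddleware`. Failing at startup on a bad value avoids the silent browser-side failure mentioned under traps.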

Possible traps and issues

allow_origins=["*"] defeats CORS security
Credentials require specific origins, not wildcard
Missing config silently fails in browser

Docs changes needed

Update Docs/Deployment.md § CORS Configuration

Doc references

Docs/Deployment.md

[MEDIUM] Input validation missing for regex patterns (ReDoS)

Where found

backend/app/routers/config.py — regex validation accepts arbitrary patterns without timeout

Why this is needed

Malicious regex causes catastrophic backtracking (ReDoS). Attacker sends pattern → pattern is matched against input → match hangs → DoS. Compilation itself is usually fast; the cost is in matching.

Goal

Add timeout and complexity limits to regex validation.

What to do

Add timeout to regex compilation and any test matching (2 seconds recommended)
Add length limit (reject patterns > 1000 characters)
Use signal.alarm() (Unix) or timeout library

Possible traps and issues

signal.alarm() is Unix-only, and signal handlers can only be installed from the main thread
Some valid complex regexes may timeout
Frontend should also validate (defense in depth)

Docs changes needed

Update API docs to document regex validation limits

Doc references

backend/app/routers/config.py

[MEDIUM] No structured logging to external system

Where found

Logs only go to stdout/file, no external aggregation

Why this is needed

Can't search across instances, historical logs lost on instance recycle.

Goal

Ship logs to centralized logging platform.

What to do

Short-term: Ensure structlog JSON output is valid (already done)
Long-term: Ship to logging platform (ELK, Datadog, Papertrail)

Possible traps and issues

External logging adds latency
Sensitive data must not be logged
Log volume can be massive

Docs changes needed

Add Docs/Observability.md section on logging

Doc references

Docs/Observability.md (new)

[MEDIUM] No Application Performance Monitoring (APM)

Where found

Backend: no metrics collection, latency tracking
Frontend: no error tracking, performance metrics
No observability into request performance

Why this is needed

Without metrics, blind in production: API slow? Unknown. Which endpoints fail most? Unknown.

Goal

Add comprehensive metrics collection and monitoring.

What to do

Backend metrics:
- Add Prometheus metrics: request count, latency, active requests
- Expose /metrics endpoint
Frontend metrics:
- Page load time, FCP, LCP using web-vitals
- API error rates and latencies
Aggregation:
- Prometheus + Grafana, or Datadog/NewRelic

Possible traps and issues

Metrics collection has performance cost
Cardinality explosion with tags
PII in metrics

Docs changes needed

Add Docs/Observability.md

Doc references

Docs/Observability.md (new)

[LOW] Frontend charts not memoized

Where found

frontend/src/components/TopCountriesPieChart.tsx
frontend/src/components/TopCountriesBarChart.tsx

Why this is needed

Charts re-render on every parent update, and Recharts reprocesses 5000+ points each time.

Goal

Memoize chart components.

What to do

Wrap with React.memo with custom comparison
Ensure data objects are stable

Possible traps and issues

Shallow comparison might not be enough
Memoization has memory cost

Docs changes needed

No documentation changes

Doc references

frontend/src/components/TopCountriesPieChart.tsx
frontend/src/components/TopCountriesBarChart.tsx

[LOW] No request deduplication on frontend

Where found

frontend/src/hooks/useFetchData.ts — each call launches new request
User clicks "Refresh" twice → two identical requests

Why this is needed

Duplicates waste bandwidth, cause race conditions (response 2 arrives first, then response 1 overwrites with stale data).

Goal

Deduplicate identical in-flight requests.

What to do

Implement request cache
Clear cache entry when response received
Use in useFetchData

Possible traps and issues

Cache must be cleared on data mutation
Stale data in cache possible if not careful

Docs changes needed

No documentation changes

Doc references

frontend/src/hooks/useFetchData.ts
