Refactor rate limiting with exponential backoff strategy

- Update rate limiter to use exponential backoff instead of fixed limit
- Implement progressive delays for failed login attempts (0.5s, 1s, 2s, 4s, 5s max)
- Update auth router documentation and endpoint docs
- Refactor test suite to match new rate limiting behavior
- Update backend development documentation
- Clean up unused tasks documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:58:09 +02:00
parent 2db635ae19
commit 277f2a467c
6 changed files with 165 additions and 208 deletions
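
For orientation while reading the diff, here is a minimal sketch of the backoff behavior this commit documents. It is not the code in `app.utils.rate_limiter.RateLimiter`: the class name `BackoffLimiter`, its methods, and the per-IP state record are hypothetical, and the documented implementation keeps a `dict[str, deque[float]]` of failure timestamps instead. The defaults follow the schedule in the changed docs (1 × 2ⁿ seconds, capped at 10 seconds, reset after 60 seconds without failures). The commit message lists a different schedule (0.5s up to a 5s cap), so the base and cap are left as parameters.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class _FailureState:
    count: int = 0              # consecutive failed logins
    last_failure: float = 0.0   # clock value of the latest failure
    blocked_until: float = 0.0  # clock value when the next attempt is allowed


class BackoffLimiter:
    """Hypothetical per-IP exponential backoff tracker (illustration only)."""

    def __init__(
        self,
        base: float = 1.0,          # assumed from the docs: 1 x 2^n seconds
        cap: float = 10.0,          # assumed from the docs: 10 second maximum
        reset_after: float = 60.0,  # assumed from the docs: 60 second window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.base = base
        self.cap = cap
        self.reset_after = reset_after
        self.clock = clock
        self._state: dict[str, _FailureState] = {}

    def delay_for(self, failures: int) -> float:
        """Backoff in seconds after the given number of consecutive failures."""
        return min(self.base * 2 ** failures, self.cap)

    def retry_after(self, ip: str) -> float:
        """Seconds the client must still wait; 0.0 means the attempt may proceed."""
        state = self._state.get(ip)
        if state is None:
            return 0.0
        now = self.clock()
        if now - state.last_failure >= self.reset_after:
            # Quiet for a full window: forget the IP so memory stays bounded.
            del self._state[ip]
            return 0.0
        return max(0.0, state.blocked_until - now)

    def record_failure(self, ip: str) -> float:
        """Register a failed login and return the new backoff in seconds."""
        now = self.clock()
        state = self._state.get(ip)
        if state is None or now - state.last_failure >= self.reset_after:
            state = _FailureState()
            self._state[ip] = state
        state.count += 1
        state.last_failure = now
        delay = self.delay_for(state.count)
        state.blocked_until = now + delay
        return delay
```

In a login handler, a non-zero `retry_after(ip)` would map to an HTTP 429 response with that value, rounded up, in the `Retry-After` header, and `record_failure(ip)` would be called only after a credential check fails. Injecting the clock keeps a test suite from having to sleep through real delays.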
--- a/Docs/Architekture.md
+++ b/Docs/Architekture.md
@@ -1170,6 +1170,7 @@ The `setup_completed = "1"` key is still written for backward compatibility with
 - **GeoCache** — `GeoCache` instance is created at startup with a configurable `allow_http_fallback` flag and stored on `app.state.geo_cache`. It implements a primary + fallback resolution strategy: (1) try local MaxMind GeoLite2-Country MMDB database (primary, encrypted, no network traffic), (2) if unavailable/no result and allowed, fall back to ip-api.com HTTP API (unencrypted, disabled by default for security). Encapsulates in-memory lookup cache, negative cache for unresolvable IPs (5-minute TTL), dirty set for persistence, and thread-safe async locking. Cache is loaded from the `geo_cache` SQLite table on startup. New resolutions are accumulated in memory and periodically flushed to the database by the `geo_cache_flush` background task. Stale entries are re-resolved by the `geo_re_resolve` task. Injected into routes and tasks via FastAPI's dependency system. See Backend-Development.md § IP Geolocation Resolution for setup and security details.
 - **Runtime state** (`RuntimeState` in `app.utils.runtime_state`) — stores mutable application state: `server_status` (fail2ban online/offline), `last_activation` (jail activation tracking), `pending_recovery` (crash detection), `runtime_settings` (effective configuration), and service-specific state holders like `jail_service_state` (`JailServiceState` for jail capability detection cache). RuntimeState fields are managed through dedicated functions (e.g., `record_activation()`, `clear_pending_recovery()`) and via dependency injection to services. Service-specific state (like `JailServiceState`) is nested within `RuntimeState` to keep all mutable state in one controlled location. **⚠️  RuntimeState is process-local and only safe when BanGUI runs as a single asyncio worker.** Mutations must not span `await` points (cooperative scheduling within a single event loop is safe). In multi-worker deployments, each process has its own copy — logouts from worker A don't affect worker B's cache, health status updates are per-worker, and activation tracking is unreliable. BanGUI enforces single-worker mode (TASK-002) to prevent this issue. For future multi-worker support, replace RuntimeState with a shared coordination backend (Redis, shared memory, database). See `app/utils/runtime_state.py` module docstring for details.
 - **Setup-completion flag** — once `is_setup_complete()` returns `True`, the result is stored in `app.state._setup_complete_cached`. The `SetupRedirectMiddleware` skips the DB query on all subsequent requests, removing 1 SQL query per request for the common post-setup case. The completion flag is only written after the runtime database is successfully initialized and all initial setup settings are persisted, preventing a failed setup from permanently bypassing the setup wizard.
+- **Login Rate Limiting** — the `/api/auth/login` endpoint employs exponential backoff to defend against brute-force attacks. Each failed login attempt is recorded per client IP, and subsequent attempts within the backoff window return HTTP 429 Too Many Requests. The penalty grows exponentially with each consecutive failure (2s, 4s, 8s, up to 10s max), ensuring attackers face rapidly increasing delays. This is complemented by bcrypt password hashing (≈100ms per attempt), which adds computational resistance without blocking legitimate users. The backoff counter resets after 60 seconds without additional failures. The rate limiter is process-local and tracks failures in memory via `app.utils.rate_limiter.RateLimiter`, stored on `app.state.login_rate_limiter`. Client IP detection respects proxy headers (`X-Forwarded-For`, `X-Real-IP`) only from configured trusted proxies, preventing header spoofing attacks. In multi-worker deployments, each worker has independent rate limit counters; BanGUI enforces single-worker mode (TASK-002) to prevent attackers from bypassing limits by distributing requests across workers.

 ### 8.1 CSRF Protection

--- a/Docs/Backend-Development.md
+++ b/Docs/Backend-Development.md
@@ -2169,17 +2169,22 @@ await config_service.update_global_config(socket_path, update)  # Validates agai

 ### Login Rate Limiting

-The login endpoint (`POST /api/auth/login`) is protected against brute-force attacks using an in-memory rate limiter.
+The login endpoint (`POST /api/auth/login`) is protected against brute-force attacks using an in-memory exponential backoff rate limiter.

 **Design:**
- Uses a `dict[str, deque[float]]` keyed by client IP, storing login attempt timestamps within a time window.
- Attempts outside the window are automatically removed during validation checks.
+- Uses a `dict[str, deque[float]]` keyed by client IP, storing failed login timestamps within a time window.
+- Old failures outside the time window are automatically pruned during validation checks.
 - Expired IP entries are cleaned up to prevent unbounded memory growth.

 **Rate Limit Rules:**
- **5 attempts per 60 seconds** per IP address.
- Requests exceeding the limit return **HTTP 429 Too Many Requests** with a `Retry-After` header.
- Each failed login triggers a progressive server-side delay (exponential back-off, 1–10 seconds) to further slow attacks, on top of bcrypt hashing (~100ms). The penalty grows with consecutive failures and resets after the rate-limit window expires. Concurrency protection caps the delay when multiple penalty tasks are already running for the same IP.
+- **Exponential backoff:** Each failed login attempt incurs a progressively longer delay before the next attempt is allowed:
+  - 1st failure: 1 × 2¹ = 2 seconds
+  - 2nd failure: 1 × 2² = 4 seconds
+  - 3rd failure: 1 × 2³ = 8 seconds
+  - 4th+ failures: capped at 10 seconds (max)
+- Failed attempts that arrive during the backoff period return **HTTP 429 Too Many Requests** with a `Retry-After` header indicating the remaining wait time.
+- Each failed login is also accompanied by bcrypt password hashing (~100ms), providing additional computational resistance.
+- The backoff counter resets after the rate-limit window (60 seconds by default) expires with no new failures.

 **IP Extraction (Proxy Safety):**
 - When behind nginx, the rate limiter reads the real client IP from `X-Forwarded-For` or `X-Real-IP` headers.
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,42 +1,3 @@
-## [Backend] Exception handler overlap — broad handlers catching everything
-
-**Where found**
-
- `backend/app/main.py:182-200` — `_get_error_code()` accepts any `Exception` and falls back to snake_case conversion
- Multiple handlers (lines 329-466) accept `Exception` as parameter
-
-**Why this is needed**
-
-Broad exception handlers create fragility: adding a new `DomainError` subclass without explicitly registering a handler silently falls through, producing generic error codes instead of specific ones. The fallback chain is not explicitly documented.
-
-**Goal**
-
-Make the exception handler registration explicit and documented. Every exception type that can bubble up should have a clear path to a handler.
-
-**What to do**
-
-1. Audit all exception handlers and confirm they are registered with the most specific base class
-2. Add a comment block documenting the fallback chain
-3. Ensure every custom `DomainError` subclass has `error_code` and `get_error_metadata()` implemented
-4. Add a catch-all `Exception` handler as the absolute last resort
-
-**Possible traps and issues**
-
- If a new `DomainError` subclass is added without handler registration, it silently returns wrong status code
- `ValueError` handler may catch Pydantic `ValidationError` subclasses
-
-**Docs changes needed**
-
- Update `Docs/Architekture.md` § 2.2 (Application Entry Point) — document exception handler hierarchy
- Add section in `Docs/Backend-Development.md` on exception taxonomy
-
-**Doc references**
-
- `Docs/Architekture.md` § 2.2 (Application Entry Point)
- `Docs/Backend-Development.md` (exception conventions)
-
---
-
 ## [Backend] Login rate limiter — penalty sleep does not block the request

 **Where found**