Refactor rate limiting with exponential backoff strategy

- Update rate limiter to use exponential backoff instead of fixed limit
- Implement progressive delays for failed login attempts (0.5s, 1s, 2s, 4s, 5s max)
- Update auth router documentation and endpoint docs
- Refactor test suite to match new rate limiting behavior
- Update backend development documentation
- Clean up unused tasks documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:58:09 +02:00
parent 2db635ae19
commit 277f2a467c
6 changed files with 165 additions and 208 deletions
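
The delay schedule listed in the commit message (0.5s, 1s, 2s, 4s, 5s max) can be reproduced with a short sketch. The constant values below are assumptions inferred from that schedule; the project's real values live in its constants module, which this diff does not show. `penalty_for` is a hypothetical helper used only for illustration and is not part of the commit.

```python
# Assumed values, inferred from the schedule in the commit message.
LOGIN_PENALTY_BASE_SECONDS = 0.5
LOGIN_PENALTY_MULTIPLIER = 2
LOGIN_PENALTY_MAX_SECONDS = 5.0


def penalty_for(failure_count: int) -> float:
    """Return the delay in seconds after `failure_count` failures (hypothetical helper)."""
    if failure_count <= 0:
        return 0.0
    # The first failure gets the base delay; each further failure doubles it.
    raw = LOGIN_PENALTY_BASE_SECONDS * LOGIN_PENALTY_MULTIPLIER ** (failure_count - 1)
    # The cap keeps the delay bounded no matter how many failures accumulate.
    return min(raw, LOGIN_PENALTY_MAX_SECONDS)


schedule = [penalty_for(n) for n in range(1, 7)]
print(schedule)  # [0.5, 1.0, 2.0, 4.0, 5.0, 5.0]
```

With these assumed values the uncapped fifth delay would be 8 seconds, so the cap takes effect from the fifth failure onward.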
--- a/backend/app/utils/rate_limiter.py
+++ b/backend/app/utils/rate_limiter.py
@@ -1,26 +1,39 @@
 """In-memory rate limiter for IP-based request throttling.

-Tracks login attempts per IP address and enforces a configurable limit.
-Uses a dictionary of deques (per IP) storing timestamps of recent attempts.
+Implements exponential backoff for failed login attempts using failure tracking.
+Each wrong password attempt increments the failure count for that IP, and subsequent
+attempts are blocked for a duration that grows exponentially up to a maximum.
+
+Uses a dictionary of deques (per IP) storing timestamps of recent failures.
 Old entries are cleaned up by a background task to prevent unbounded growth.

 Process-local implementation — in multi-worker setups, each worker has
 independent counters. This constraint limits the blast radius of brute-force
 attacks to a single worker.

-The penalty strategy for failed login attempts is also managed here:
-record_failure() records a failure timestamp and returns the penalty delay
-to apply, enabling progressive back-off without exhausting request capacity.
+**How It Works:**

-Operational Notes
-----------------
+1. A successful login resets the failure counter for that IP.
+2. Each failed login (wrong password) calls record_failure() and increments the counter.
+3. is_allowed() checks if enough time has passed since the last failure based on
+   the current failure count. The delay grows exponentially with each consecutive failure:

-**Cleanup Lifecycle**: The rate limiter state (_attempts, _failures, _lock_counts)
-grows as IPs interact with the system. To prevent unbounded memory growth during
-long runtimes, a scheduled background task (rate_limiter_cleanup) calls the
-cleanup_expired() method every 30 minutes. This is safe because:
+   - 1st failure: 0.5 second penalty (0.5 * 2^0)
+   - 2nd failure: 1 second penalty (0.5 * 2^1)
+   - 3rd failure: 2 second penalty (0.5 * 2^2)
+   - 4th failure: 4 second penalty (0.5 * 2^3)
+   - ... up to the configured maximum (default 5 seconds)

- cleanup_expired() only removes IPs with no recent attempts (all timestamps
+4. Penalties are not summed: only the delay for the current failure count applies,
+   measured from the most recent failure. After 5 failed attempts, an attacker waits
+   one capped delay of 5 seconds, not the total of all earlier delays.
+
+**Cleanup Lifecycle**: The rate limiter state (_failures) grows as IPs interact
+with the system. To prevent unbounded memory growth during long runtimes, a
+scheduled background task (rate_limiter_cleanup) calls cleanup_expired() every
+30 minutes. This is safe because:
+
+- cleanup_expired() only removes IPs with no recent failures (all timestamps
  outside the rate-limit window), so active IPs are never disrupted.
 - The cleanup is non-blocking and logged for observability.
 - Individual requests already prune old timestamps from each IP's deque during
@@ -70,48 +83,57 @@ class RateLimiter:

        Args:
            max_attempts: Maximum attempts allowed within the window.
+                (Deprecated: stored but no longer used to enforce a limit.)
            window_seconds: Time window (seconds) for rate limit.
        """
        self.max_attempts: int = max_attempts
        self.window_seconds: int = window_seconds
-        self._attempts: dict[str, deque[float]] = {}
        self._failures: dict[str, deque[float]] = {}
-        self._lock_counts: dict[str, int] = {}

    def is_allowed(self, ip_address: str) -> bool:
        """Check if a request from *ip_address* is allowed.

-        If allowed, the current timestamp is recorded. Old entries (outside
-        the window) are removed before checking.
+        Checks if the IP has accumulated failures that would currently block
+        the attempt due to penalty backoff. Does NOT record a new attempt;
+        failures are recorded by record_failure() after a wrong password.

        Args:
            ip_address: The client IP address to rate-limit.

        Returns:
-            ``True`` if the request is allowed, ``False`` if the limit is exceeded.
+            ``True`` if the request is allowed (past penalty period), ``False``
+            if currently blocked by exponential backoff.
        """
        now = time()
+
+        if ip_address not in self._failures:
+            self._failures[ip_address] = deque()
+
+        failures = self._failures[ip_address]
        cutoff = now - self.window_seconds

-        if ip_address not in self._attempts:
-            self._attempts[ip_address] = deque()
+        # Remove old failures outside the window
+        while failures and failures[0] < cutoff:
+            failures.popleft()

-        attempts = self._attempts[ip_address]
+        # If no recent failures, request is allowed
+        if not failures:
+            return True

-        # Remove old attempts outside the window
-        while attempts and attempts[0] < cutoff:
-            attempts.popleft()
+        # Calculate the current penalty: how much time must pass after the last
+        # failure before the next attempt is allowed (1st failure -> base delay)
+        failure_count = len(failures)
+        penalty = min(
+            LOGIN_PENALTY_BASE_SECONDS * (LOGIN_PENALTY_MULTIPLIER ** (failure_count - 1)),
+            LOGIN_PENALTY_MAX_SECONDS,
+        )

-        # Check if the limit is exceeded
-        if len(attempts) >= self.max_attempts:
-            return False
-
-        # Record this attempt
-        attempts.append(now)
-        return True
+        # Check if enough time has passed since the last failure
+        time_since_last_failure = now - failures[-1]
+        return time_since_last_failure >= penalty

    def cleanup_expired(self) -> None:
-        """Remove all IPs with no recent attempts (cleanup task).
+        """Remove all IPs with no recent failures (cleanup task).

        Called periodically by the background task to prevent unbounded
        growth of the tracking dictionary.
@@ -120,119 +142,67 @@ class RateLimiter:
        cutoff = now - self.window_seconds

        ips_to_remove = []
-        for ip_address, attempts in self._attempts.items():
-            # Remove old attempts
-            while attempts and attempts[0] < cutoff:
-                attempts.popleft()
-            # Mark IP for removal if no attempts remain
-            if not attempts:
+        for ip_address, failures in self._failures.items():
+            # Remove old failures
+            while failures and failures[0] < cutoff:
+                failures.popleft()
+            # Mark IP for removal if no failures remain
+            if not failures:
                ips_to_remove.append(ip_address)

        for ip_address in ips_to_remove:
-            del self._attempts[ip_address]
+            del self._failures[ip_address]

        if ips_to_remove:
            log.debug("rate_limiter_cleanup", removed_ips=len(ips_to_remove))

    def get_state(self) -> Mapping[str, int]:
-        """Return a read-only view of current attempt counts per IP.
+        """Return a read-only view of current failure counts per IP.

        For debugging and monitoring.

        Returns:
-            A mapping of IP addresses to their attempt counts.
+            A mapping of IP addresses to their failure counts.
        """
        now = time()
        cutoff = now - self.window_seconds
        result = {}
-        for ip_address, attempts in self._attempts.items():
-            # Count non-expired attempts
-            count = sum(1 for ts in attempts if ts >= cutoff)
+        for ip_address, failures in self._failures.items():
+            # Count non-expired failures
+            count = sum(1 for ts in failures if ts >= cutoff)
            if count > 0:
                result[ip_address] = count
        return result

    def reset(self) -> None:
-        """Clear all tracked attempts (for testing)."""
-        self._attempts.clear()
+        """Clear all tracked failures (for testing)."""
        self._failures.clear()
-        self._lock_counts.clear()

    # ---------------------------------------------------------------------------
    # Penalty strategy for failed login attempts
    # ---------------------------------------------------------------------------

-    def record_failure(self, ip_address: str) -> float:
-        """Record a failed login attempt and return the penalty delay in seconds.
+    def record_failure(self, ip_address: str) -> None:
+        """Record a failed login attempt.

-        Tracks consecutive failures per IP. Penalty grows exponentially with
-        each failure, bounded by :data:`~app.utils.constants.LOGIN_PENALTY_MAX_SECONDS`,
-        then resets the failure counter. This provides brute-force resistance
-        without exhausting request capacity.
-
-        A concurrency guard (``_lock_counts``) prevents a single IP from
-        accumulating many concurrent penalty tasks.
+        Tracks failures per IP to enable exponential backoff in is_allowed().
+        The penalty delay is automatically calculated in is_allowed() based on
+        the failure count, providing transparent brute-force resistance.

        Args:
            ip_address: The client IP address whose login attempt failed.
-
-        Returns:
-            The penalty delay in seconds to apply.
        """
        now = time()

        if ip_address not in self._failures:
            self._failures[ip_address] = deque()
-        if ip_address not in self._lock_counts:
-            self._lock_counts[ip_address] = 0

        failures = self._failures[ip_address]
-        lock_count = self._lock_counts[ip_address]
-
-        # Reset if last failure is outside the window
        cutoff = now - self.window_seconds
+
+        # Remove old failures outside the window
        while failures and failures[0] < cutoff:
            failures.popleft()

-        consecutive = len(failures)
-        penalty = min(
-            LOGIN_PENALTY_BASE_SECONDS * (LOGIN_PENALTY_MULTIPLIER ** consecutive),
-            LOGIN_PENALTY_MAX_SECONDS,
-        )
-
+        # Record this failure
        failures.append(now)
-
-        # Concurrency protection: if too many concurrent sleeps are already
-        # running for this IP, cap the penalty to avoid thread exhaustion.
-        if lock_count >= 3:
-            penalty = min(penalty, LOGIN_PENALTY_BASE_SECONDS)
-
-        return penalty
-
-    def acquire(self, ip_address: str) -> bool:
-        """Acquire a concurrency slot for a penalty task.
-
-        Args:
-            ip_address: The client IP address.
-
-        Returns:
-            ``True`` if the slot was acquired, ``False`` if the IP already has
-            the maximum number of concurrent penalty tasks running.
-        """
-        if ip_address not in self._lock_counts:
-            self._lock_counts[ip_address] = 0
-
-        if self._lock_counts[ip_address] >= 3:
-            return False
-
-        self._lock_counts[ip_address] += 1
-        return True
-
-    def release(self, ip_address: str) -> None:
-        """Release a concurrency slot when a penalty task completes.
-
-        Args:
-            ip_address: The client IP address.
-        """
-        if ip_address in self._lock_counts and self._lock_counts[ip_address] > 0:
-            self._lock_counts[ip_address] -= 1
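
To illustrate how a caller might combine `is_allowed()` and `record_failure()` after this change, here is a minimal, self-contained sketch. It is not the project's code: `SketchLimiter`, `attempt_login`, and the per-IP `clear()` method are hypothetical, the constants are assumed values, and the clock is passed in explicitly so the example is deterministic. The exponent follows the schedule documented in the module docstring (first failure: 0.5 seconds).

```python
from collections import deque

# Assumed values; the real ones come from the project's constants module.
BASE, MULTIPLIER, MAX_DELAY = 0.5, 2, 5.0


class SketchLimiter:
    """Simplified stand-in for RateLimiter that takes the time as an argument."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._failures: dict[str, deque[float]] = {}

    def _recent(self, ip: str, now: float) -> deque[float]:
        # Fetch the deque for this IP and drop failures older than the window.
        failures = self._failures.setdefault(ip, deque())
        while failures and failures[0] < now - self.window_seconds:
            failures.popleft()
        return failures

    def is_allowed(self, ip: str, now: float) -> bool:
        failures = self._recent(ip, now)
        if not failures:
            return True
        penalty = min(BASE * MULTIPLIER ** (len(failures) - 1), MAX_DELAY)
        # Only the delay since the most recent failure matters.
        return now - failures[-1] >= penalty

    def record_failure(self, ip: str, now: float) -> None:
        self._recent(ip, now).append(now)

    def clear(self, ip: str) -> None:
        # Hypothetical per-IP reset used after a successful login.
        self._failures.pop(ip, None)


def attempt_login(limiter: SketchLimiter, ip: str, password_ok: bool, now: float) -> str:
    if not limiter.is_allowed(ip, now):
        return "blocked"
    if password_ok:
        limiter.clear(ip)
        return "ok"
    limiter.record_failure(ip, now)
    return "failed"


limiter = SketchLimiter()
ip = "203.0.113.7"  # documentation-range address
results = [
    attempt_login(limiter, ip, False, now=0.0),  # first failure
    attempt_login(limiter, ip, False, now=0.2),  # inside the 0.5 s delay
    attempt_login(limiter, ip, False, now=0.6),  # second failure
    attempt_login(limiter, ip, True, now=1.0),   # inside the 1 s delay
    attempt_login(limiter, ip, True, now=1.7),   # delay has elapsed
]
print(results)  # ['failed', 'blocked', 'failed', 'blocked', 'ok']
```

Note that a blocked request is rejected before the password is checked, so it does not add a failure or extend the delay in this sketch.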