TASK-032: Implement geo_cache retention policy and cleanup

Add automatic cleanup of stale geolocation cache entries to prevent unbounded database growth. Resolves the issue where unique IP addresses accumulated indefinitely in the geo_cache table, degrading query performance.

## Changes

### Database Schema (Migration 3)
- Add 'last_seen' column to geo_cache table tracking last reference time
- Existing entries default to current timestamp

### Repository Layer (geo_cache_repo.py)
- Update upsert_entry() to set/refresh last_seen on insert/update
- Update upsert_neg_entry() to set/refresh last_seen on negative cache hits
- Update bulk_upsert_entries() to set/refresh last_seen in batch operations
- Add delete_stale_entries(db, cutoff_iso) -> int for purging old entries

### Background Task (geo_cache_cleanup.py)
- New APScheduler task that runs nightly (24-hour interval)
- Calculates cutoff as 90 days ago from current time (UTC)
- Deletes all entries with last_seen older than cutoff
- Logs operation results (info when deleted > 0, debug when 0 deleted)
- Configurable retention period via GEO_CACHE_RETENTION_DAYS constant

### Application Startup (startup.py)
- Register geo_cache_cleanup task in scheduler during app startup
- Placed after geo_cache_flush in task registration order

### Tests
- Add delete_stale_entries test cases covering:
  * Removal of old entries beyond cutoff
  * No deletion when all entries are recent
  * Empty table edge case
- Update existing test fixtures to include last_seen column
- Add full test suite for cleanup task registration and execution

### Documentation
- Architekture.md: Document cleanup task, update schema/diagram
- Backend-Development.md: Add retention policy documentation

## Behavior

When an IP is accessed, its last_seen is refreshed. After 90 days of no access, an IP is purged by the nightly cleanup. On next encounter, the IP is re-resolved from MaxMind MMDB or ip-api.com (if configured).

This is acceptable because:
1. Stale geolocation data may become inaccurate over time
2. Re-resolution cost is minimal compared to unbounded storage growth
3. Active IPs maintain fresh data through their last_seen updates

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
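The code changes to geo_cache_repo.py and geo_cache_cleanup.py are not shown in this excerpt, so the following is only a minimal sketch of the purge step described above, not the project's actual implementation. It assumes the synchronous standard-library `sqlite3` module (the real repository layer may use an async driver), a `geo_cache` table reduced to three columns, and ISO 8601 UTC timestamps in `last_seen`. The names `delete_stale_entries`, `cutoff_iso`, and `GEO_CACHE_RETENTION_DAYS` come from the commit message; `compute_cutoff` and `build_demo_db` are hypothetical helpers added for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Constant name taken from the commit message; 90 days is the stated default.
GEO_CACHE_RETENTION_DAYS = 90


def compute_cutoff(now: datetime, retention_days: int = GEO_CACHE_RETENTION_DAYS) -> str:
    """Hypothetical helper: ISO 8601 timestamp `retention_days` before `now`."""
    return (now - timedelta(days=retention_days)).isoformat(timespec="seconds")


def delete_stale_entries(db: sqlite3.Connection, cutoff_iso: str) -> int:
    """Delete rows whose last_seen is older than cutoff_iso; return the row count.

    Comparing the timestamps as strings is only valid because every value is
    assumed to be written in the same ISO 8601 format and the same UTC offset.
    """
    cur = db.execute("DELETE FROM geo_cache WHERE last_seen < ?", (cutoff_iso,))
    db.commit()
    return cur.rowcount


def build_demo_db(now: datetime) -> sqlite3.Connection:
    """Hypothetical fixture: one entry last seen 120 days ago, one 5 days ago."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE geo_cache ("
        "ip TEXT PRIMARY KEY, country_code TEXT, last_seen TEXT NOT NULL)"
    )
    rows = [
        ("192.0.2.1", "DE", (now - timedelta(days=120)).isoformat(timespec="seconds")),
        ("192.0.2.2", "FR", (now - timedelta(days=5)).isoformat(timespec="seconds")),
    ]
    db.executemany("INSERT INTO geo_cache VALUES (?, ?, ?)", rows)
    db.commit()
    return db


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = build_demo_db(now)
    deleted = delete_stale_entries(demo, compute_cutoff(now))
    print(f"purged {deleted} stale geo_cache entries")
```

Passing the cutoff in as a precomputed string, as the `delete_stale_entries(db, cutoff_iso)` signature in the commit message suggests, keeps the repository function free of clock access, which makes the three listed test cases (old entries removed, nothing removed when all entries are recent, empty table) straightforward to write with fixed timestamps.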
2026-04-26 19:24:34 +02:00
parent 32aad186c3
commit e2560f5db0
9 changed files with 405 additions and 89 deletions
--- a/Docs/Architekture.md
+++ b/Docs/Architekture.md
@@ -133,7 +133,8 @@ backend/
 │  │   ├── geo_cache_repo.py  #   IP geolocation cache persistence│   │   └── import_log_repo.py #   Import run history records
 │   ├── tasks/                 # APScheduler background jobs
 │   │   ├── blocklist_import.py#   Scheduled blocklist download and application
-│   │   ├── geo_cache_flush.py #   Periodic geo cache persistence (dirty-set flush to SQLite)│  │   ├── geo_re_resolve.py  #   Periodic re-resolution of stale geo cache records│   │   └── health_check.py   #   Periodic fail2ban connectivity probe
+│   │   ├── geo_cache_flush.py #   Periodic geo cache persistence (dirty-set flush to SQLite)│  │   ├── geo_cache_cleanup.py #   Periodic purge of stale geo cache entries
+│   │   ├── geo_re_resolve.py  #   Periodic re-resolution of stale geo cache records│   │   └── health_check.py   #   Periodic fail2ban connectivity probe
 │   └── utils/                 # Helpers, constants, shared types
 │       ├── fail2ban_client.py #   Async wrapper around the fail2ban socket protocol
 │       ├── fail2ban_response.py #   Canonical response parsing: ok(), to_dict(), ensure_list(), is_not_found_error()
@@ -241,6 +242,7 @@ APScheduler background jobs that run on a schedule without user interaction.
 | Task | Purpose |
 |---|---|
 | `blocklist_import.py` | Downloads all enabled blocklist sources, validates entries, applies bans, records results in the import log |
+| `geo_cache_cleanup.py` | Periodically removes entries from the `geo_cache` table that have not been referenced in the configured retention period (default: 90 days). Prevents unbounded database growth. |
 | `geo_cache_flush.py` | Periodically flushes newly resolved IPs from the in-memory dirty set to the `geo_cache` SQLite table (default: every 60 seconds). GET requests populate only the in-memory cache; this task persists them without blocking any request. |
 | `geo_re_resolve.py` | Periodically re-resolves stale entries in `geo_cache` to keep geolocation data fresh |
 | `health_check.py` | Periodically pings the fail2ban socket and updates the cached server status so the frontend always has fresh data |
@@ -649,7 +651,7 @@ BanGUI maintains its **own SQLite database** (separate from the fail2ban databas
 |---|---|
 | `settings` | Key-value store for application configuration (master password hash, fail2ban socket path, database path, timezone, session duration) |
 | `sessions` | Active session token hashes with expiry timestamps. Tokens are stored as one-way SHA256 hashes to prevent token hijacking if the database is exposed. |
-| `geo_cache` | Resolved IP geolocation results (ip, country_code, country_name, asn, org, cached_at). Loaded into memory at startup via `load_cache_from_db()`; new entries are flushed back by the `geo_cache_flush` background task. |
+| `geo_cache` | Resolved IP geolocation results (ip, country_code, country_name, asn, org, cached_at, last_seen). `last_seen` records the last time each IP address was referenced, which enables the retention policy: entries not referenced within the retention period (default: 90 days) are purged by the `geo_cache_cleanup` task to prevent unbounded growth. Loaded into memory at startup via `load_cache_from_db()`; new entries are flushed back by the `geo_cache_flush` background task. |
 | `blocklist_sources` | Registered blocklist URLs (id, name, url, enabled, created_at, updated_at) |
 | `import_logs` | Record of every blocklist import run (id, source_id, timestamp, ips_imported, ips_skipped, errors, status) |

@@ -701,6 +703,7 @@ APScheduler 4.x (async mode) manages recurring background tasks.
 │  (async, in-process) │
 ├──────────────────────┤
 │  blocklist_import    │  ── runs on configured schedule (default: daily 03:00)
+│  geo_cache_cleanup   │  ── runs every 24 hours (nightly)
 │  geo_cache_flush     │  ── runs every 60 seconds
 │  health_check        │  ── runs every 30 seconds
 └──────────────────────┘
--- a/Docs/Backend-Development.md
+++ b/Docs/Backend-Development.md
@@ -979,6 +979,16 @@ BANGUI_GEOIP_ALLOW_HTTP_FALLBACK="true"                    # Enable HTTP fallbac
 - Failed lookups are cached for 5 minutes to avoid hammering external APIs.
 - The background `geo_cache_flush` task (runs every 60 seconds) persists newly resolved entries to the database.
 - The background `geo_re_resolve` task (configurable schedule) periodically re-resolves stale entries to keep data fresh.
+- The background `geo_cache_cleanup` task (runs nightly) removes entries not referenced in the configured retention period (default: 90 days) to prevent unbounded database growth and maintain query performance.
+
+**Retention & Cleanup:**
+
+The `geo_cache` table tracks the last time each IP was referenced via a `last_seen` timestamp. Over time, as unique IPs accumulate, the table can grow very large, degrading query performance on every geo lookup. To manage this:
+
+- The `geo_cache_cleanup` background task runs nightly, on a 24-hour interval.
+- It removes all entries where `last_seen` is older than the configured retention period (default: 90 days).
+- If a purged IP is encountered again after cleanup, it will be re-resolved from the MaxMind database or ip-api.com (if configured).
+- The retention period is controlled by the constant `GEO_CACHE_RETENTION_DAYS` in `backend/app/tasks/geo_cache_cleanup.py`.

 ### API Documentation Configuration

--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,84 +1,3 @@
-## TASK-030 — ip-api.com geo lookups use plain HTTP — IP addresses sent unencrypted
-
-**Severity:** Medium
-
-### Where found
-`backend/app/services/geo_cache.py` lines ~41–46:
-```python
-_API_URL = "http://ip-api.com/json/{ip}?fields=..."
-_BATCH_API_URL = "http://ip-api.com/batch?fields=..."
-```
-
-### Why this is needed
-All banned and monitored IP addresses are transmitted to ip-api.com in cleartext over HTTP. These are potentially sensitive data (PII under GDPR/CCPA — IP addresses identify users). Any network path between the BanGUI server and ip-api.com's servers can observe or modify the traffic. Forged responses would corrupt the geo database silently.
-
-### Goal
-Use encrypted transport for all geo API calls, or switch to a local resolver.
-
-### What to do
-ip-api.com's free tier does not support HTTPS. The recommended approach:
-1. Promote the existing `geoip_db_path` setting (MaxMind GeoLite2-Country MMDB) to the **primary** resolver.
-2. Use ip-api.com as a secondary fallback only when the MMDB is unavailable or returns no result.
-3. Add documentation and compose file examples for downloading and mounting the GeoLite2 MMDB.
-4. If ip-api.com HTTP is retained as a fallback, add a config flag `BANGUI_GEOIP_ALLOW_HTTP_FALLBACK` (default `false`) and warn clearly at startup when enabled.
-
-### Possible traps and issues
- The MaxMind GeoLite2 database requires a free account and a license key to download — document the setup process.
- The GeoLite2-Country MMDB does not include ASN or organisation data — these fields will be `null` when using the local resolver. The `GeoInfo` model must handle nullable `asn` and `org`.
-
-### Docs changes needed
- `Features.md` — document the geo resolution mechanism and MMDB setup.
- `Architekture.md` — update the external API dependency section.
- `Backend-Development.md` — configuration for `geoip_db_path`.
-
-### Doc references
- [Features.md](Features.md) — geolocation
- [Architekture.md](Architekture.md) — external API dependencies
-
---
-
-## TASK-031 — bcrypt 72-byte truncation not enforced — long passwords silently equivalent to their prefix
-
-**Severity:** Medium
-
-### Where found
-`backend/app/models/auth.py` — `LoginRequest.password: str = Field(...)` (no `max_length`). `backend/app/models/setup.py` — `SetupRequest.master_password` has `min_length=8` but no `max_length`.
-
-### Why this is needed
-bcrypt silently truncates all input at 72 bytes before hashing. A user who sets a 100-character password can be authenticated by supplying only the first 72 characters. The extra characters provide no additional security. An attacker who has reduced the search space to 72 characters can brute-force the password more efficiently than the user intended.
-
-### Goal
-Enforce a maximum password length of 72 bytes, or pre-hash before bcrypt to remove the limit entirely.
-
-### What to do
-**Option A (simple):**
-1. Add `max_length=72` to `SetupRequest.master_password` and `LoginRequest.password`.
-2. Update the setup wizard UI to reflect the 72-character maximum.
-
-**Option B (removes the 72-byte limit entirely):**
-1. Pre-hash the password with HMAC-SHA256 using the `session_secret` as the key before passing to bcrypt:
-   ```python
-   pre_hashed = hmac.new(secret.encode(), password.encode(), hashlib.sha256).digest()
-   bcrypt.hashpw(pre_hashed, bcrypt.gensalt())
-   ```
-2. Apply consistently in both `run_setup()` and `_check_password()`.
-
-Option A is recommended as the simpler, lower-risk fix. Option B is architecturally cleaner but requires a stored hash migration.
-
-### Possible traps and issues
- Option A: Users who already have passwords longer than 72 characters will need to reset. For a single-admin app this is acceptable.
- Option B: If the `session_secret` changes, all stored password hashes become invalid (since the pre-hash key changes). This is a hidden coupling — document it explicitly.
-
-### Docs changes needed
- `Features.md` — document the password length constraint.
- `Backend-Development.md` — bcrypt usage notes.
-
-### Doc references
- [Features.md](Features.md) — authentication and setup
- [Backend-Development.md](Backend-Development.md) — password hashing
-
---
-
 ## TASK-032 — `geo_cache` table grows unboundedly — no eviction or purge

 **Severity:** Medium