TASK-032: Implement geo_cache retention policy and cleanup

Add automatic cleanup of stale geolocation cache entries to prevent
unbounded database growth. Resolves the issue where unique IP addresses
accumulated indefinitely in the geo_cache table, degrading query performance.

## Changes

### Database Schema (Migration 3)
- Add 'last_seen' column to geo_cache table tracking last reference time
- Existing entries default to current timestamp

### Repository Layer (geo_cache_repo.py)
- Update upsert_entry() to set/refresh last_seen on insert/update
- Update upsert_neg_entry() to set/refresh last_seen on negative cache hits
- Update bulk_upsert_entries() to set/refresh last_seen in batch operations
- Add delete_stale_entries(db, cutoff_iso) -> int for purging old entries

### Background Task (geo_cache_cleanup.py)
- New APScheduler task that runs nightly (24-hour interval)
- Calculates cutoff as 90 days ago from current time (UTC)
- Deletes all entries with last_seen older than cutoff
- Logs operation results (info when deleted > 0, debug when 0 deleted)
- Configurable retention period via GEO_CACHE_RETENTION_DAYS constant

### Application Startup (startup.py)
- Register geo_cache_cleanup task in scheduler during app startup
- Placed after geo_cache_flush in task registration order

### Tests
- Add delete_stale_entries test cases covering:
  * Removal of old entries beyond cutoff
  * No deletion when all entries are recent
  * Empty table edge case
- Update existing test fixtures to include last_seen column
- Add full test suite for cleanup task registration and execution

### Documentation
- Architekture.md: Document cleanup task, update schema/diagram
- Backend-Development.md: Add retention policy documentation

## Behavior

When an IP is accessed, its last_seen is refreshed. After 90 days of no
access, an IP is purged by the nightly cleanup. On next encounter, the IP
is re-resolved from MaxMind MMDB or ip-api.com (if configured).

This is acceptable because:
1. Stale geolocation data may become inaccurate over time
2. Re-resolution cost is minimal compared to unbounded storage growth
3. Active IPs maintain fresh data through their last_seen updates

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-04-26 19:24:34 +02:00
parent 32aad186c3
commit e2560f5db0
9 changed files with 405 additions and 89 deletions

View File

@@ -133,7 +133,8 @@ backend/
│ │ ├── geo_cache_repo.py # IP geolocation cache persistence│ │ └── import_log_repo.py # Import run history records
│ ├── tasks/ # APScheduler background jobs
│ │ ├── blocklist_import.py# Scheduled blocklist download and application
│ │ ├── geo_cache_flush.py # Periodic geo cache persistence (dirty-set flush to SQLite)│ │ ├── geo_re_resolve.py # Periodic re-resolution of stale geo cache records│ │ └── health_check.py # Periodic fail2ban connectivity probe
│ │ ├── geo_cache_flush.py # Periodic geo cache persistence (dirty-set flush to SQLite)│ │ ├── geo_cache_cleanup.py # Periodic purge of stale geo cache entries
│ │ ├── geo_re_resolve.py # Periodic re-resolution of stale geo cache records│ │ └── health_check.py # Periodic fail2ban connectivity probe
│ └── utils/ # Helpers, constants, shared types
│ ├── fail2ban_client.py # Async wrapper around the fail2ban socket protocol
│ ├── fail2ban_response.py # Canonical response parsing: ok(), to_dict(), ensure_list(), is_not_found_error()
@@ -241,6 +242,7 @@ APScheduler background jobs that run on a schedule without user interaction.
| Task | Purpose |
|---|---|
| `blocklist_import.py` | Downloads all enabled blocklist sources, validates entries, applies bans, records results in the import log |
| `geo_cache_cleanup.py` | Periodically removes entries from the `geo_cache` table that have not been referenced in the configured retention period (default: 90 days). Prevents unbounded database growth. |
| `geo_cache_flush.py` | Periodically flushes newly resolved IPs from the in-memory dirty set to the `geo_cache` SQLite table (default: every 60 seconds). GET requests populate only the in-memory cache; this task persists them without blocking any request. |
| `geo_re_resolve.py` | Periodically re-resolves stale entries in `geo_cache` to keep geolocation data fresh |
| `health_check.py` | Periodically pings the fail2ban socket and updates the cached server status so the frontend always has fresh data |
@@ -649,7 +651,7 @@ BanGUI maintains its **own SQLite database** (separate from the fail2ban databas
|---|---|
| `settings` | Key-value store for application configuration (master password hash, fail2ban socket path, database path, timezone, session duration) |
| `sessions` | Active session token hashes with expiry timestamps. Tokens are stored as one-way SHA256 hashes to prevent token hijacking if the database is exposed. |
| `geo_cache` | Resolved IP geolocation results (ip, country_code, country_name, asn, org, cached_at). Loaded into memory at startup via `load_cache_from_db()`; new entries are flushed back by the `geo_cache_flush` background task. |
| `geo_cache` | Resolved IP geolocation results (ip, country_code, country_name, asn, org, cached_at, last_seen). Tracks the last time each IP address was referenced to enable retention policies. Entries older than 90 days are automatically purged by the `geo_cache_cleanup` task to prevent unbounded growth. Loaded into memory at startup via `load_cache_from_db()`; new entries are flushed back by the `geo_cache_flush` background task. |
| `blocklist_sources` | Registered blocklist URLs (id, name, url, enabled, created_at, updated_at) |
| `import_logs` | Record of every blocklist import run (id, source_id, timestamp, ips_imported, ips_skipped, errors, status) |
@@ -701,6 +703,7 @@ APScheduler 4.x (async mode) manages recurring background tasks.
│ (async, in-process) │
├──────────────────────┤
│ blocklist_import │ ── runs on configured schedule (default: daily 03:00)
│ geo_cache_cleanup │ ── runs every 24 hours (nightly)
│ geo_cache_flush │ ── runs every 60 seconds
│ health_check │ ── runs every 30 seconds
└──────────────────────┘