BanGUI/Docs/Tasks.md
Lukas e2560f5db0 TASK-032: Implement geo_cache retention policy and cleanup
Add automatic cleanup of stale geolocation cache entries to prevent
unbounded database growth. Resolves the issue where unique IP addresses
accumulated indefinitely in the geo_cache table, degrading query performance.

## Changes

### Database Schema (Migration 3)
- Add 'last_seen' column to geo_cache table tracking last reference time
- Existing entries default to current timestamp
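
A minimal sketch of what such a migration could look like. It uses the standard-library `sqlite3` module as a synchronous stand-in for the project's aiosqlite connection; the helper name `add_last_seen_column` and the `ip` / `country_code` columns are hypothetical. One detail worth checking against the real migration: SQLite rejects a non-constant default such as `CURRENT_TIMESTAMP` in `ALTER TABLE ... ADD COLUMN`, so the sketch adds the column with a constant default and backfills existing rows afterwards.

```python
import sqlite3
from datetime import datetime, timezone


def add_last_seen_column(db: sqlite3.Connection) -> None:
    """Add geo_cache.last_seen and backfill existing rows with the current UTC time."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in db.execute("PRAGMA table_info(geo_cache)")}
    if "last_seen" in columns:
        return  # already migrated, nothing to do

    now_iso = datetime.now(timezone.utc).isoformat()
    # A constant default is required here; CURRENT_TIMESTAMP is not accepted
    # by ALTER TABLE ... ADD COLUMN in SQLite.
    db.execute("ALTER TABLE geo_cache ADD COLUMN last_seen TEXT NOT NULL DEFAULT ''")
    # Backfill so that existing entries start their retention window now.
    db.execute("UPDATE geo_cache SET last_seen = ?", (now_iso,))
    db.commit()


# Usage against an in-memory database with a hypothetical pre-migration schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE geo_cache (ip TEXT PRIMARY KEY, country_code TEXT)")
db.execute("INSERT INTO geo_cache (ip, country_code) VALUES ('203.0.113.7', 'DE')")
add_last_seen_column(db)
```

The column check makes the helper safe to call twice, which may or may not matter depending on how the project tracks applied migrations.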

### Repository Layer (geo_cache_repo.py)
- Update upsert_entry() to set/refresh last_seen on insert/update
- Update upsert_neg_entry() to set/refresh last_seen on negative cache hits
- Update bulk_upsert_entries() to set/refresh last_seen in batch operations
- Add delete_stale_entries(db, cutoff_iso) -> int for purging old entries
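
The purge function described above can be sketched as follows. This is an illustration only: it uses synchronous `sqlite3` instead of the project's aiosqlite connection, and the two-column table is a simplified, hypothetical version of `geo_cache`.

```python
import sqlite3


def delete_stale_entries(db: sqlite3.Connection, cutoff_iso: str) -> int:
    """Delete rows whose last_seen is older than cutoff_iso and return the count."""
    # TEXT comparison is lexicographic, so this is only correct when every
    # stored timestamp uses the same ISO 8601 format and UTC offset as the cutoff.
    cursor = db.execute("DELETE FROM geo_cache WHERE last_seen < ?", (cutoff_iso,))
    db.commit()
    return cursor.rowcount


# Usage: one stale row and one recent row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE geo_cache (ip TEXT PRIMARY KEY, last_seen TEXT NOT NULL)")
db.executemany(
    "INSERT INTO geo_cache (ip, last_seen) VALUES (?, ?)",
    [
        ("198.51.100.1", "2025-01-01T00:00:00+00:00"),  # older than the cutoff
        ("198.51.100.2", "2026-04-01T00:00:00+00:00"),  # newer than the cutoff
    ],
)
deleted = delete_stale_entries(db, "2026-01-26T00:00:00+00:00")
```

Returning the row count lets the caller choose the log level without running a second query.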

### Background Task (geo_cache_cleanup.py)
- New APScheduler task that runs nightly (24-hour interval)
- Calculates cutoff as 90 days ago from current time (UTC)
- Deletes all entries with last_seen older than cutoff
- Logs operation results (info when deleted > 0, debug when 0 deleted)
- Configurable retention period via GEO_CACHE_RETENTION_DAYS constant
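
The cutoff calculation and logging behavior can be sketched like this. Only `GEO_CACHE_RETENTION_DAYS` comes from the description above; `compute_cutoff_iso`, `run_cleanup`, and the injected `delete_stale` callable are hypothetical names chosen to keep the example free of database and scheduler dependencies.

```python
import logging
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

GEO_CACHE_RETENTION_DAYS = 90

log = logging.getLogger("geo_cache_cleanup")


def compute_cutoff_iso(now: Optional[datetime] = None) -> str:
    """Return the ISO 8601 timestamp GEO_CACHE_RETENTION_DAYS before now (UTC)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=GEO_CACHE_RETENTION_DAYS)).isoformat()


def run_cleanup(
    delete_stale: Callable[[str], int], now: Optional[datetime] = None
) -> int:
    """Purge entries older than the cutoff and log the outcome."""
    cutoff_iso = compute_cutoff_iso(now)
    deleted = delete_stale(cutoff_iso)
    if deleted > 0:
        log.info("geo_cache cleanup removed %d entries older than %s", deleted, cutoff_iso)
    else:
        log.debug("geo_cache cleanup found no entries older than %s", cutoff_iso)
    return deleted


# Usage with a stand-in delete function that reports nothing to delete.
deleted = run_cleanup(lambda cutoff_iso: 0)
```

Passing `now` explicitly keeps the cutoff deterministic in tests.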

### Application Startup (startup.py)
- Register geo_cache_cleanup task in scheduler during app startup
- Placed after geo_cache_flush in task registration order

### Tests
- Add delete_stale_entries test cases covering:
  * Removal of old entries beyond cutoff
  * No deletion when all entries are recent
  * Empty table edge case
- Update existing test fixtures to include last_seen column
- Add full test suite for cleanup task registration and execution

### Documentation
- Architekture.md: Document cleanup task, update schema/diagram
- Backend-Development.md: Add retention policy documentation

## Behavior

When an IP is accessed, its last_seen is refreshed. After 90 days of no
access, an IP is purged by the nightly cleanup. On next encounter, the IP
is re-resolved from MaxMind MMDB or ip-api.com (if configured).

This is acceptable because:
1. Stale geolocation data may become inaccurate over time
2. Re-resolution cost is minimal compared to unbounded storage growth
3. Active IPs maintain fresh data through their last_seen updates

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-26 19:24:34 +02:00


## TASK-032 — `geo_cache` table grows unboundedly — no eviction or purge
**Severity:** Medium
### Where found
`backend/app/repositories/geo_cache_repo.py` has `upsert_entry`, `bulk_upsert_entries`, and `upsert_neg_entry`, but **no DELETE functions**. In `backend/app/db.py`, the `geo_cache` table has no `last_seen` or `created_at` column.
### Why this is needed
Every unique IP address ever seen by fail2ban gets a row in `geo_cache`. The table is never trimmed. A BanGUI instance monitoring a busy server can accumulate millions of rows over months, increasing the DB file size and degrading query performance on every geo lookup.
### Goal
Implement a retention policy that prunes geo cache entries not referenced recently.
### What to do
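1. Add a migration (`_MIGRATIONS[2]`) that adds a `last_seen TEXT NOT NULL` column to `geo_cache`, with existing rows set to the current timestamp. Note that SQLite rejects `DEFAULT CURRENT_TIMESTAMP` in `ALTER TABLE ... ADD COLUMN` (non-constant default), so add the column with a constant default and backfill it with an `UPDATE`.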
2. Update `upsert_entry`, `bulk_upsert_entries`, and `upsert_neg_entry` to set `last_seen = CURRENT_TIMESTAMP` on every upsert.
3. Add `delete_stale_entries(db: aiosqlite.Connection, cutoff_iso: str) -> int` to `geo_cache_repo.py`.
4. Create `backend/app/tasks/geo_cache_cleanup.py` — a nightly task that calls `delete_stale_entries` with a 90-day cutoff.
5. Register the task in `startup_shared_resources`.
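
As an illustration of step 2, an upsert that refreshes `last_seen` on both insert and update might look like the sketch below. It uses synchronous `sqlite3` rather than aiosqlite, and the `ip` and `country_code` columns and the explicit `now_iso` parameter are assumptions. The timestamp is passed in as an ISO 8601 string instead of using `CURRENT_TIMESTAMP`, because SQLite's `CURRENT_TIMESTAMP` produces `YYYY-MM-DD HH:MM:SS`, which does not compare consistently as text against a `T`-separated `cutoff_iso`.

```python
import sqlite3


def upsert_entry(
    db: sqlite3.Connection, ip: str, country_code: str, now_iso: str
) -> None:
    """Insert or update a geo_cache row, always refreshing last_seen."""
    db.execute(
        """
        INSERT INTO geo_cache (ip, country_code, last_seen)
        VALUES (?, ?, ?)
        ON CONFLICT(ip) DO UPDATE SET
            country_code = excluded.country_code,
            last_seen = excluded.last_seen
        """,
        (ip, country_code, now_iso),
    )
    db.commit()


# Usage: the second call refreshes last_seen on the existing row.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE geo_cache ("
    "ip TEXT PRIMARY KEY, country_code TEXT, last_seen TEXT NOT NULL)"
)
upsert_entry(db, "192.0.2.10", "DE", "2026-01-01T00:00:00+00:00")
upsert_entry(db, "192.0.2.10", "DE", "2026-04-26T00:00:00+00:00")
```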
### Possible traps and issues
- Adding a column requires a migration. Coordinate with TASK-023 (migration atomicity) and TASK-022 (session hash migration) — all three migrations must be sequenced correctly as `_MIGRATIONS[2]`, `[3]`, etc.
- IPs that have not been seen in 90 days will lose their geo data — on their next appearance they will be re-resolved from ip-api.com or the MMDB. This is acceptable.
### Docs changes needed
- `Architekture.md` — update the `geo_cache` table description and add the cleanup task.
- `Backend-Development.md` — document the geo cache retention policy.
### Doc references
- [Architekture.md](Architekture.md) — application database schema
- [Backend-Development.md](Backend-Development.md) — background tasks
---
## TASK-033 — Session token returned in JSON body alongside HttpOnly cookie
**Severity:** Medium
### Where found
In `backend/app/routers/auth.py`, `login()` returns `LoginResponse(token=signed_token, expires_at=expires_at)` in the JSON body **and** sets the HttpOnly cookie. The `LoginResponse.token` field is defined in `backend/app/models/auth.py`.
### Why this is needed
The `LoginResponse` JSON body contains the full signed session token. JavaScript running on the page (including third-party analytics scripts or a future XSS injection) can read the response body from a `fetch()` call and store the token in `localStorage` or a non-HttpOnly cookie. The Bearer-header authentication path (`Authorization: Bearer <token>`) then allows using that extracted token, completely bypassing the protections provided by the HttpOnly cookie.
### Goal
Prevent the session token from being accessible to JavaScript when using cookie-based authentication.
### What to do
1. For browser SPA consumers: Remove the `token` field from `LoginResponse`. The HttpOnly cookie is the only token the browser needs.
2. If an API-first (non-browser) token flow is required, create a separate endpoint `POST /api/auth/token` that returns a token in the body and does **not** set a cookie. Document this endpoint as "for programmatic API clients only, not for browser use".
3. Update the frontend — verify that `AuthProvider` does not use `response.token` (confirmed: it currently does not).
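
A framework-independent sketch of the response shape after step 1: the JSON body carries only `expires_at`, and the token travels exclusively in an HttpOnly cookie. The cookie name `session`, the helper `build_login_response`, and the `Secure` / `SameSite` attributes are assumptions, not taken from the existing code; a plain dataclass stands in for the real `LoginResponse` model.

```python
from dataclasses import asdict, dataclass
from http.cookies import SimpleCookie
from typing import Dict, Tuple


@dataclass
class LoginResponse:
    # The token field is intentionally absent; only the expiry is exposed to JavaScript.
    expires_at: str


def build_login_response(signed_token: str, expires_at: str) -> Tuple[Dict[str, str], str]:
    """Return the JSON body and the Set-Cookie header value for a successful login."""
    cookie = SimpleCookie()
    cookie["session"] = signed_token  # hypothetical cookie name
    cookie["session"]["httponly"] = True  # not readable from page scripts
    cookie["session"]["secure"] = True
    cookie["session"]["samesite"] = "Strict"
    cookie["session"]["path"] = "/"
    body = asdict(LoginResponse(expires_at=expires_at))
    return body, cookie["session"].OutputString()


body, set_cookie = build_login_response("signed.token.value", "2026-04-27T19:24:34+00:00")
```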
### Possible traps and issues
- Any existing API client that relies on the token in the `LoginResponse` body will break. Check tests.
- The `expires_at` field in `LoginResponse` is useful for the frontend to know when to prompt for re-login — this can remain.
- The Bearer-token path in `require_auth` (`Authorization: Bearer`) remains functional for programmatic clients using the dedicated token endpoint.
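
For reference, a dual-path token lookup of the kind `require_auth` is described as supporting could be sketched as below. The function name `extract_token`, the cookie name `session`, and the precedence (Bearer header first, then cookie) are assumptions; the real dependency may differ, and real frameworks treat header names case-insensitively, which a plain dict does not.

```python
from typing import Mapping, Optional


def extract_token(
    headers: Mapping[str, str], cookies: Mapping[str, str]
) -> Optional[str]:
    """Return the session token from a Bearer header or, failing that, the cookie."""
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()  # programmatic API client
    return cookies.get("session")  # browser client with HttpOnly cookie


# Usage: one programmatic client, one browser client.
api_token = extract_token({"Authorization": "Bearer abc123"}, {})
browser_token = extract_token({}, {"session": "cookie-token"})
```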
### Docs changes needed
- `Features.md` — document the authentication flow (cookie for browser, token endpoint for API clients).
- `Backend-Development.md` — authentication endpoint design.
- `Web-Development.md` — document that the frontend uses only the HttpOnly cookie.
### Doc references
- [Features.md](Features.md) — authentication
- [Backend-Development.md](Backend-Development.md) — auth router design
- [Web-Development.md](Web-Development.md) — AuthProvider