BanGUI/Docs/Tasks.md
Lukas d931e8c6a3 Reduce per-request DB overhead (Task 4)
- Cache setup_completed flag in app.state._setup_complete_cached after
  first successful is_setup_complete() call; all subsequent API requests
  skip the DB query entirely (one-way transition, cleared on restart).
- Add in-memory session token TTL cache (10 s) in require_auth; the second
  request with the same token within the window skips session_repo.get_session.
- Call invalidate_session_cache() on logout so revoked tokens are evicted
  immediately rather than waiting for TTL expiry.
- Add clear_session_cache() for test isolation.
- 5 new tests covering the cached fast-path for both optimisations.
- 460 tests pass, 83% coverage, zero ruff/mypy warnings.
2026-03-10 19:16:00 +01:00

# BanGUI — Task List
This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.
---
## Issue: World Map Loading Time — Architecture Fix
### Problem Summary
The `GET /api/dashboard/bans/by-country` endpoint is extremely slow on first load. A single request with ~5,200 unique IPs produces **10,400 SQLite commits** and **6,000 INSERT statements** against the app database — all during a read-only GET request. The log shows 21,000+ lines of SQL trace for just 18 HTTP requests.
Root causes (ordered by impact):
1. **Per-IP commit during geo cache writes**: `geo_service._persist_entry()` and `_persist_neg_entry()` each call `await db.commit()` after every single INSERT. With 5,200 uncached IPs this means 5,200+ individual commits, each forcing an `fsync`. This is the dominant bottleneck.
2. **DB writes on a GET request** — The bans/by-country endpoint passes `app_db` to `geo_service.lookup_batch()`, which triggers INSERT+COMMIT for every resolved IP. A GET request should never produce database writes/commits. Users do not expect loading a map page to mutate the database.
3. **Same pattern exists in other endpoints** — The following GET endpoints also trigger geo cache commits: `/api/dashboard/bans`, `/api/bans/active`, `/api/history`, `/api/history/{ip}`, `/api/geo/lookup/{ip}`.
### Evidence from `log.log`
- Log line count: **21,117 lines** for 18 HTTP requests
- `INSERT INTO geo_cache`: **6,000** executions
- `db.commit()`: **10,400** calls (each INSERT + its commit = 2 ops per IP)
- `geo_batch_lookup_start`: reports `total=5200` uncached IPs
- The bans/by-country response is at line **21,086** out of 21,117 — the entire log is essentially one request's geo persist work
- Other requests (`/api/dashboard/status`, `/api/blocklists/schedule`, `/api/config/map-color-thresholds`) interleave with the geo persist loop because they share the same single async DB connection
---
### Task 1: Batch geo cache writes — eliminate per-IP commits ✅ DONE
**File:** `backend/app/services/geo_service.py`
**What to change:**
The functions `_persist_entry()` and `_persist_neg_entry()` each call `await db.commit()` after every INSERT. Instead, the commit should happen once after the entire batch is processed.
1. Remove `await db.commit()` from both `_persist_entry()` and `_persist_neg_entry()`.
2. In `lookup_batch()`, after the loop over all chunks is complete and all `_persist_entry()` / `_persist_neg_entry()` calls have been made, issue a single `await db.commit()` if `db is not None`.
3. Wrap the single commit in a try/except to handle any errors gracefully.
**Expected impact:** Reduces commits from ~5,200 to **1** per request. This alone should cut the endpoint response time dramatically.
**Testing:** Existing tests in `test_services/test_ban_service.py` and `test_services/test_geo_service.py` should continue to pass. Verify the geo_cache table still gets populated after a batch lookup by checking the DB contents in an integration test.
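The change described in Task 1 can be sketched as follows. This is a minimal illustration, not the project's code: it uses the standard-library `sqlite3` module synchronously so it runs without dependencies (the real service awaits `aiosqlite` calls), and the `geo_cache` columns `ip` and `country_code` as well as the simplified function signatures are assumptions.

```python
import sqlite3
from typing import Optional


def persist_entry(db: sqlite3.Connection, ip: str, country: str) -> None:
    """Queue one row in the open transaction; deliberately no commit here."""
    db.execute(
        "INSERT OR REPLACE INTO geo_cache (ip, country_code) VALUES (?, ?)",
        (ip, country),
    )


def lookup_batch(db: Optional[sqlite3.Connection], resolved: dict) -> int:
    """Persist every resolved IP, then commit once for the whole batch."""
    if db is None:
        return 0
    for ip, country in resolved.items():
        persist_entry(db, ip, country)
    try:
        db.commit()  # the single commit for the entire batch
    except sqlite3.Error:
        db.rollback()  # a failed commit must not break the request
        return 0
    return len(resolved)


# Assumed schema, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE geo_cache (ip TEXT PRIMARY KEY, country_code TEXT)")

written = lookup_batch(db, {"192.0.2.1": "DE", "192.0.2.2": "FR"})
row_count = db.execute("SELECT COUNT(*) FROM geo_cache").fetchone()[0]
```

Because all INSERTs share one transaction, the disk is synced once per batch rather than once per IP.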
---
### Task 2: Do not write geo cache during GET requests ✅ DONE
**Files:** `backend/app/routers/dashboard.py`, `backend/app/routers/bans.py`, `backend/app/routers/history.py`, `backend/app/routers/geo.py`
**What to change:**
GET endpoints should not pass `app_db` (or equivalent) to geo_service functions. The geo resolution should still populate the in-memory cache (which is fast, free, and ephemeral), but should NOT write to SQLite during a read request.
For each of these GET endpoints:
- `GET /api/dashboard/bans/by-country` in `dashboard.py` — stop passing `app_db=db` to `bans_by_country()`; pass `app_db=None` instead.
- `GET /api/dashboard/bans` in `dashboard.py` — stop passing `app_db=db` to `list_bans()`; pass `app_db=None` instead.
- `GET /api/bans/active` in `bans.py` — the enricher callback should not pass `db` to `geo_service.lookup()`.
- `GET /api/history` and `GET /api/history/{ip}` in `history.py` — same: enricher should not pass `db`.
- `GET /api/geo/lookup/{ip}` in `geo.py` — do not pass `db` to `geo_service.lookup()`.
The persistent geo cache should only be written during explicit write operations:
- `POST /api/geo/re-resolve` (already a POST — this is correct)
- Blocklist import tasks (`blocklist_service.py`)
- Application startup via `load_cache_from_db()`
**Expected impact:** GET requests become truly read-only. No commits, no `fsync`, no write contention on the app DB during map loads.
**Testing:** Run the full test suite. Verify that:
1. The bans/by-country endpoint still returns correct country data (from in-memory cache).
2. The `geo_cache` table is still populated when `POST /api/geo/re-resolve` is called or after blocklist import.
3. After a server restart, geo data is still available (because `load_cache_from_db()` warms memory from previously persisted data).
---
### Task 3: Periodically persist the in-memory geo cache (background task) ✅ DONE
**Files:** `backend/app/services/geo_service.py`, `backend/app/tasks/` (new task file)
**What to change:**
After Task 2, GET requests no longer write to the DB. But newly resolved IPs during GET requests only live in the in-memory cache and would be lost on restart. To avoid this, add a background task that periodically flushes new in-memory entries to the `geo_cache` table.
1. In `geo_service.py`, add a module-level `_dirty: set[str]` that tracks IPs added to `_cache` but not yet persisted. When `_store()` adds an entry, also add the IP to `_dirty`.
2. Add a new function `flush_dirty(db: aiosqlite.Connection) -> int` that:
   - Takes the current `_dirty` set and clears it atomically.
   - Uses `executemany()` to batch-INSERT all dirty entries in one transaction.
   - Calls `db.commit()` once.
   - Returns the count of flushed entries.
3. Register a periodic task (using the existing APScheduler setup) that calls `flush_dirty()` every 60 seconds (configurable). This is similar to how other background tasks already run.
**Expected impact:** Geo data is persisted without blocking any request. If the server restarts, at most 60 seconds of new geo data is lost (and it will simply be re-fetched from ip-api.com on the next request).
**Testing:** Write a test that:
- Calls `lookup_batch()` without a DB reference.
- Verifies IPs are in `_dirty`.
- Calls `flush_dirty(db)`.
- Verifies the geo_cache table contains the entries and `_dirty` is empty.
---
### Task 4: Reduce redundant SQL queries per request (settings / auth) ✅ DONE
**Files:** `backend/app/dependencies.py`, `backend/app/main.py`, `backend/app/repositories/settings_repo.py`
**What to change:**
The log shows that every single HTTP request executes at least 2 separate SQL queries before reaching the actual endpoint logic:
1. `SELECT value FROM settings WHERE key = 'setup_completed'` (SetupRedirectMiddleware)
2. `SELECT id, token, ... FROM sessions WHERE token = ?` (require_auth dependency)
When multiple requests arrive concurrently (as seen in the log — 3 parallel requests trigger 3× setup_completed + 3× session token queries), this adds unnecessary DB contention.
Options (implement one or both):
- **Cache `setup_completed` in memory:** Once setup is complete, it never goes back to incomplete. Cache the result in `app.state` and skip the DB query on subsequent requests. Set it on first `True` read and clear it only if the app restarts.
- **Cache session validation briefly:** Use a short TTL in-memory cache (e.g., 5-10 seconds) for validated session tokens. This reduces repeated DB lookups for the same token across near-simultaneous requests.
**Expected impact:** Reduces per-request overhead from 2+ SQL queries to 0 for the common case (setup done, session recently validated).
**Testing:** Existing auth and setup tests must continue to pass. Add a test that validates the cached path (second request skips DB).
---
### Task 5: Audit and verify — run full test suite ✅ DONE
After Tasks 1-4 are implemented, run:
```bash
cd backend && python -m pytest tests/ -x -q
```
Verify:
- All tests pass (460 passing, up from 443 baseline).
- `ruff check backend/app/` passes.
- `mypy --strict` passes on all changed files.
- 83% overall coverage (above the 80% threshold).
---
## Developer Notes — Learnings & Gotchas
These notes capture non-obvious findings from the investigation. Read them before you start coding.
### Architecture Overview
BanGUI has **two separate SQLite databases**:
1. **fail2ban DB** — owned by fail2ban, opened read-only (`?mode=ro`) via `aiosqlite.connect(f"file:{path}?mode=ro", uri=True)`. Path is discovered at runtime by asking the fail2ban daemon (`get dbfile` via Unix socket). Contains the `bans` table.
2. **App DB** (`bangui.db`) — BanGUI's own database. Holds `settings`, `sessions`, `geo_cache`, `blocklist_sources`, `import_log`. This is the one being hammered by commits during GET requests.
There is a **single shared app DB connection** living at `request.app.state.db`. All concurrent requests share it. This means long-running writes (like 5,200 sequential INSERT+COMMIT loops) block other requests that need the same connection. The log confirms this: `setup_completed` checks and session lookups from parallel requests interleave with the geo persist loop.
### The Geo Resolution Pipeline
`geo_service.py` implements a two-tier cache:
1. **In-memory dict** (`_cache: dict[str, GeoInfo]`) — module-level, lives for the process lifetime. Fast, no I/O.
2. **SQLite `geo_cache` table** — survives restarts. Loaded into `_cache` at startup via `load_cache_from_db()`.
There is also a **negative cache** (`_neg_cache: dict[str, float]`) for failed lookups with a 5-minute TTL. Failed IPs are not retried within that window.
The batch resolution flow in `lookup_batch()`:
1. Check `_cache` and `_neg_cache` for each IP → split into cached vs uncached.
2. Send uncached IPs to `ip-api.com/batch` in chunks of 100.
3. For each resolved IP: update `_cache` (fast) AND call `_persist_entry(db, ip, info)` (slow — INSERT + COMMIT).
4. For failed IPs: try MaxMind GeoLite2 local DB fallback (`_geoip_lookup()`). If that also fails, add to `_neg_cache` and call `_persist_neg_entry()`.
**Critical insight:** Step 3 is where the bottleneck lives. The `_persist_entry` function issues a separate `await db.commit()` after each INSERT. With 5,200 IPs, that's 5,200 `fsync` calls — each one waits for the disk.
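Steps 1 and 2 of the flow above (splitting cached from uncached IPs, then chunking) can be illustrated with a small sketch. The helper names `split_uncached()` and `chunks()` are hypothetical, and the sketch assumes `_neg_cache` stores the time at which a lookup failed; the actual module may store an expiry time instead.

```python
from typing import Iterator

NEG_TTL = 300.0   # negative-cache lifetime: 5 minutes
CHUNK_SIZE = 100  # IPs per batch request

_cache: dict = {}      # ip -> geo info
_neg_cache: dict = {}  # ip -> time of the failed lookup (assumed meaning)


def split_uncached(ips: list, now: float) -> tuple:
    """Return (cached, uncached); recent failures count as cached."""
    cached, uncached = [], []
    for ip in dict.fromkeys(ips):  # de-duplicate while keeping order
        failed_at = _neg_cache.get(ip)
        recently_failed = failed_at is not None and now - failed_at < NEG_TTL
        if ip in _cache or recently_failed:
            cached.append(ip)
        else:
            uncached.append(ip)
    return cached, uncached


def chunks(items: list, size: int = CHUNK_SIZE) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


_cache["192.0.2.1"] = "DE"
_neg_cache["192.0.2.2"] = 0.0  # lookup failed at t = 0

ips = ["192.0.2.1", "192.0.2.2"] + [f"198.51.100.{i}" for i in range(250)]
cached_early, uncached_early = split_uncached(ips, now=100.0)
cached_late, uncached_late = split_uncached(ips, now=400.0)
chunk_sizes = [len(c) for c in chunks(uncached_early)]
```

With 250 uncached addresses, the batch endpoint would be called three times (100, 100, and 50 IPs); the failed IP is retried only once the 5-minute window has passed.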
### Specific File Locations You Need
| File | Key functions | Notes |
|------|--------------|-------|
| `backend/app/services/geo_service.py` L231-260 | `_persist_entry()` | The INSERT + COMMIT per IP; **this is the hot path** |
| `backend/app/services/geo_service.py` L262-280 | `_persist_neg_entry()` | Same pattern for failed lookups |
| `backend/app/services/geo_service.py` L374-460 | `lookup_batch()` | Main batch function; calls `_persist_entry` in a loop |
| `backend/app/services/geo_service.py` L130-145 | `_store()` | Updates the in-memory `_cache` dict; fast, no I/O |
| `backend/app/services/geo_service.py` L202-230 | `load_cache_from_db()` | Startup warm-up, reads entire `geo_cache` table into memory |
| `backend/app/services/ban_service.py` L326-430 | `bans_by_country()` | Calls `lookup_batch()` with `db=app_db` |
| `backend/app/services/ban_service.py` L130-210 | `list_bans()` | Also calls `lookup_batch()` with `app_db` |
| `backend/app/routers/dashboard.py` | `get_bans_by_country()` | Passes `app_db=db` — this is where db gets threaded through |
| `backend/app/routers/bans.py` | `get_active_bans()` | Uses single-IP `lookup()` via enricher callback with `db` |
| `backend/app/routers/history.py` | `get_history()`, `get_ip_history()` | Same enricher-with-db pattern |
| `backend/app/routers/geo.py` | `lookup_ip()` | Single IP lookup, passes `db` |
| `backend/app/main.py` L268-306 | `SetupRedirectMiddleware` | Runs `get_setting(db, "setup_completed")` on every request |
| `backend/app/dependencies.py` L54-100 | `require_auth()` | Runs session token SELECT on every authenticated request |
| `backend/app/repositories/settings_repo.py` | `get_setting()` | Individual SELECT per key; `get_all_settings()` exists but is unused in middleware |
### Endpoints That Commit During GET Requests
All of these GET endpoints currently write to the app DB via geo_service:
| Endpoint | How | Commit count per request |
|----------|-----|--------------------------|
| `GET /api/dashboard/bans/by-country` | `bans_by_country()` → `lookup_batch()` → `_persist_entry()` per IP | Up to N (N = uncached IPs, can be thousands) |
| `GET /api/dashboard/bans` | `list_bans()` → `lookup_batch()` → `_persist_entry()` per IP | Up to page_size (max 500) |
| `GET /api/bans/active` | enricher → `lookup()` → `_persist_entry()` per IP | 1 per ban in response |
| `GET /api/history` | enricher → `lookup()` → `_persist_entry()` per IP | 1 per row |
| `GET /api/history/{ip}` | enricher → `lookup()` → `_persist_entry()` | 1 |
| `GET /api/geo/lookup/{ip}` | `lookup()` → `_persist_entry()` | 1 |
The only endpoint that **should** write geo data is `POST /api/geo/re-resolve` (already a POST).
### Concurrency / Connection Sharing Issue
The app DB connection (`app.state.db`) is a single `aiosqlite.Connection`. aiosqlite serialises operations through a background thread, so concurrent `await db.execute()` calls from different request handlers are queued. This is visible in the log: while the geo persist loop runs its 5,200 INSERT+COMMITs, other requests' `setup_completed` and session-token queries get interleaved between commits. They all complete, but everything is slower because they wait in the queue.
This is not a bug to fix right now, but keep it in mind: if you batch the commits (Task 1) and stop writing on GETs (Task 2), the contention problem largely goes away because the long-running write loop no longer exists.
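The queueing effect can be reproduced in miniature. The sketch below is only an analogy: a `ThreadPoolExecutor` with one worker stands in for the single background thread behind the shared connection, and the operation names are placeholders. It shows that a small read submitted after a run of writes has to wait its turn behind them.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

order: list = []  # records the order in which the worker ran each operation


def run_op(name: str) -> None:
    """Placeholder for one DB operation executed on the worker thread."""
    order.append(name)


async def main() -> None:
    loop = asyncio.get_running_loop()
    # One worker thread: every operation is processed strictly in FIFO order.
    with ThreadPoolExecutor(max_workers=1) as worker:
        writes = [
            loop.run_in_executor(worker, run_op, f"insert+commit {i}")
            for i in range(3)
        ]
        # A cheap read from another request lands behind the queued writes.
        read = loop.run_in_executor(worker, run_op, "select setup_completed")
        await asyncio.gather(*writes, read)


asyncio.run(main())
```

With thousands of queued INSERT+COMMIT pairs instead of three, that trailing read is what makes unrelated requests feel slow.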
### Test Infrastructure
- **443 tests** currently passing, **82% coverage**.
- Tests use `pytest` + `pytest-asyncio` + `httpx.AsyncClient`.
- External dependencies (fail2ban socket, ip-api.com) are fully mocked in tests.
- Run with: `cd backend && python -m pytest tests/ -x -q`
- Lint: `ruff check backend/app/`
- Types: `mypy --strict` on changed files
- All code must follow rules in `Docs/Backend-Development.md`.
### What NOT to Do
1. **Do not add a second DB connection** to "fix" the concurrency issue. The single-connection model is intentional for SQLite (WAL mode notwithstanding). Batching commits is the correct fix.
2. **Do not remove the SQLite geo_cache entirely.** It serves a real purpose: surviving process restarts without re-fetching thousands of IPs from ip-api.com.
3. **Do not cache geo data in Redis or add a new dependency.** The two-tier cache (in-memory dict + SQLite) is the right architecture for this app's scale. The problem is purely commit frequency.
4. **Do not change the `_cache` dict to an LRU or TTL cache.** The current eviction strategy (flush at 50,000 entries) is fine. The issue is the persistent layer, not the in-memory layer.
5. **Do not skip writing test cases.** The project enforces >80% coverage. Every change needs tests.