- Cache setup_completed flag in app.state._setup_complete_cached after first successful is_setup_complete() call; all subsequent API requests skip the DB query entirely (one-way transition, cleared on restart).
- Add in-memory session token TTL cache (10 s) in require_auth; the second request with the same token within the window skips session_repo.get_session.
- Call invalidate_session_cache() on logout so revoked tokens are evicted immediately rather than waiting for TTL expiry.
- Add clear_session_cache() for test isolation.
- 5 new tests covering the cached fast-path for both optimisations.
- 460 tests pass, 83% coverage, zero ruff/mypy warnings.

# BanGUI — Task List

This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.

---

## Issue: World Map Loading Time — Architecture Fix

### Problem Summary

The `GET /api/dashboard/bans/by-country` endpoint is extremely slow on first load. A single request with ~5,200 unique IPs produces **10,400 SQLite commits** and **6,000 INSERT statements** against the app database — all during a read-only GET request. The log shows 21,000+ lines of SQL trace for just 18 HTTP requests.

Root causes (ordered by impact):

1. **Per-IP commit during geo cache writes** — `geo_service._persist_entry()` and `_persist_neg_entry()` each call `await db.commit()` after every single INSERT. With 5,200 uncached IPs this means 5,200+ individual commits, each forcing an `fsync`. This is the dominant bottleneck.
2. **DB writes on a GET request** — The bans/by-country endpoint passes `app_db` to `geo_service.lookup_batch()`, which triggers INSERT+COMMIT for every resolved IP. A GET request should never produce database writes/commits. Users do not expect loading a map page to mutate the database.
3. **Same pattern exists in other endpoints** — The following GET endpoints also trigger geo cache commits: `/api/dashboard/bans`, `/api/bans/active`, `/api/history`, `/api/history/{ip}`, `/api/geo/lookup/{ip}`.

### Evidence from `log.log`

- Log line count: **21,117 lines** for 18 HTTP requests
- `INSERT INTO geo_cache`: **6,000** executions
- `db.commit()`: **10,400** calls (each INSERT + its commit = 2 ops per IP)
- `geo_batch_lookup_start`: reports `total=5200` uncached IPs
- The bans/by-country response is at line **21,086** out of 21,117 — the entire log is essentially one request's geo persist work
- Other requests (`/api/dashboard/status`, `/api/blocklists/schedule`, `/api/config/map-color-thresholds`) interleave with the geo persist loop because they share the same single async DB connection

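
Tallies like the ones above can be reproduced with a short script. The following is a minimal sketch, not part of the BanGUI codebase: `count_markers` is a hypothetical helper, and the sample lines are invented because the exact trace format of `log.log` is not shown in this document. It assumes the relevant lines contain the literal substrings listed in `MARKERS`.

```python
from collections import Counter

# Substrings to tally. These markers are assumptions about the trace format.
MARKERS = ("INSERT INTO geo_cache", "commit", "geo_batch_lookup_start")


def count_markers(lines):
    """Return how many lines contain each marker (case-insensitive)."""
    counts = Counter({marker: 0 for marker in MARKERS})
    for line in lines:
        lowered = line.lower()
        for marker in MARKERS:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts


# Invented sample lines. For the real file, pass open("log.log", encoding="utf-8").
sample = [
    "geo_batch_lookup_start total=2",
    "executing INSERT INTO geo_cache (ip, country_code) VALUES (?, ?)",
    "executing COMMIT",
    "executing INSERT INTO geo_cache (ip, country_code) VALUES (?, ?)",
    "executing COMMIT",
]
counts = count_markers(sample)
```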
---

### Task 1: Batch geo cache writes — eliminate per-IP commits ✅ DONE

**File:** `backend/app/services/geo_service.py`

**What to change:**

The functions `_persist_entry()` and `_persist_neg_entry()` each call `await db.commit()` after every INSERT. Instead, the commit should happen once after the entire batch is processed.

1. Remove `await db.commit()` from both `_persist_entry()` and `_persist_neg_entry()`.
2. In `lookup_batch()`, after the loop over all chunks is complete and all `_persist_entry()` / `_persist_neg_entry()` calls have been made, issue a single `await db.commit()` if `db is not None`.
3. Wrap the single commit in a try/except to handle any errors gracefully.

**Expected impact:** Reduces commits from ~5,200 to **1** per request. This alone should cut the endpoint response time dramatically.

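
The three steps above can be sketched as follows. This is a simplified illustration, not the actual `geo_service` code: it uses the synchronous standard-library `sqlite3` module in place of `aiosqlite`, the two-column `geo_cache` schema is an assumption, and `persist_entry` / `lookup_batch` are stand-ins that take already-resolved data instead of calling ip-api.com. Rolling back on a failed commit is one possible reading of "handle errors gracefully".

```python
from __future__ import annotations

import sqlite3


def persist_entry(db: sqlite3.Connection, ip: str, country_code: str) -> None:
    """Queue one row in the open transaction. Deliberately no commit here."""
    db.execute(
        "INSERT OR REPLACE INTO geo_cache (ip, country_code) VALUES (?, ?)",
        (ip, country_code),
    )


def lookup_batch(resolved: dict[str, str], db: sqlite3.Connection | None = None) -> int:
    """Persist every resolved IP, then commit once for the whole batch."""
    for ip, country_code in resolved.items():
        if db is not None:
            persist_entry(db, ip, country_code)
    if db is not None:
        try:
            db.commit()  # single commit instead of one per IP
        except sqlite3.Error:
            db.rollback()  # a failed cache write should not fail the lookup
    return len(resolved)


# Demo against an in-memory database with an assumed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE geo_cache (ip TEXT PRIMARY KEY, country_code TEXT)")
db.commit()
lookup_batch({"192.0.2.1": "DE", "192.0.2.2": "US", "198.51.100.7": "FR"}, db)
row_count = db.execute("SELECT COUNT(*) FROM geo_cache").fetchone()[0]
```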
**Testing:** Existing tests in `test_services/test_ban_service.py` and `test_services/test_geo_service.py` should continue to pass. Verify the geo_cache table still gets populated after a batch lookup by checking the DB contents in an integration test.

---

### Task 2: Do not write geo cache during GET requests ✅ DONE

**Files:** `backend/app/routers/dashboard.py`, `backend/app/routers/bans.py`, `backend/app/routers/history.py`, `backend/app/routers/geo.py`

**What to change:**

GET endpoints should not pass `app_db` (or equivalent) to geo_service functions. The geo resolution should still populate the in-memory cache (which is fast, free, and ephemeral), but should NOT write to SQLite during a read request.

For each of these GET endpoints:
- `GET /api/dashboard/bans/by-country` in `dashboard.py` — stop passing `app_db=db` to `bans_by_country()`; pass `app_db=None` instead.
- `GET /api/dashboard/bans` in `dashboard.py` — stop passing `app_db=db` to `list_bans()`; pass `app_db=None` instead.
- `GET /api/bans/active` in `bans.py` — the enricher callback should not pass `db` to `geo_service.lookup()`.
- `GET /api/history` and `GET /api/history/{ip}` in `history.py` — same: enricher should not pass `db`.
- `GET /api/geo/lookup/{ip}` in `geo.py` — do not pass `db` to `geo_service.lookup()`.

The persistent geo cache should only be written during explicit write operations:
- `POST /api/geo/re-resolve` (already a POST — this is correct)
- Blocklist import tasks (`blocklist_service.py`)
- Application startup via `load_cache_from_db()`

**Expected impact:** GET requests become truly read-only. No commits, no `fsync`, no write contention on the app DB during map loads.

**Testing:** Run the full test suite. Verify that:
1. The bans/by-country endpoint still returns correct country data (from in-memory cache).
2. The `geo_cache` table is still populated when `POST /api/geo/re-resolve` is called or after blocklist import.
3. After a server restart, geo data is still available (because `load_cache_from_db()` warms memory from previously persisted data).

---

### Task 3: Periodically persist the in-memory geo cache (background task) ✅ DONE

**Files:** `backend/app/services/geo_service.py`, `backend/app/tasks/` (new task file)

**What to change:**

After Task 2, GET requests no longer write to the DB. But newly resolved IPs during GET requests only live in the in-memory cache and would be lost on restart. To avoid this, add a background task that periodically flushes new in-memory entries to the `geo_cache` table.

1. In `geo_service.py`, add a module-level `_dirty: set[str]` that tracks IPs added to `_cache` but not yet persisted. When `_store()` adds an entry, also add the IP to `_dirty`.
2. Add a new function `flush_dirty(db: aiosqlite.Connection) -> int` that:
   - Takes the current `_dirty` set and clears it atomically.
   - Uses `executemany()` to batch-INSERT all dirty entries in one transaction.
   - Calls `db.commit()` once.
   - Returns the count of flushed entries.
3. Register a periodic task (using the existing APScheduler setup) that calls `flush_dirty()` every 60 seconds (configurable). This is similar to how other background tasks already run.

**Expected impact:** Geo data is persisted without blocking any request. If the server restarts, at most 60 seconds of new geo data is lost (and it will simply be re-fetched from ip-api.com on the next request).

**Testing:** Write a test that:
- Calls `lookup_batch()` without a DB reference.
- Verifies IPs are in `_dirty`.
- Calls `flush_dirty(db)`.
- Verifies the geo_cache table contains the entries and `_dirty` is empty.

---

### Task 4: Reduce redundant SQL queries per request (settings / auth) ✅ DONE

**Files:** `backend/app/dependencies.py`, `backend/app/main.py`, `backend/app/repositories/settings_repo.py`

**What to change:**

The log shows that every single HTTP request executes at least 2 separate SQL queries before reaching the actual endpoint logic:
1. `SELECT value FROM settings WHERE key = 'setup_completed'` (SetupRedirectMiddleware)
2. `SELECT id, token, ... FROM sessions WHERE token = ?` (require_auth dependency)

When multiple requests arrive concurrently (as seen in the log — 3 parallel requests trigger 3× setup_completed + 3× session token queries), this adds unnecessary DB contention.

Options (implement one or both):
- **Cache `setup_completed` in memory:** Once setup is complete, it never goes back to incomplete. Cache the result in `app.state` and skip the DB query on subsequent requests. Set it on first `True` read and clear it only if the app restarts.
- **Cache session validation briefly:** Use a short TTL in-memory cache (e.g., 5–10 seconds) for validated session tokens. This reduces repeated DB lookups for the same token across near-simultaneous requests.

**Expected impact:** Reduces per-request overhead from 2+ SQL queries to 0 for the common case (setup done, session recently validated).

**Testing:** Existing auth and setup tests must continue to pass. Add a test that validates the cached path (second request skips DB).

---

### Task 5: Audit and verify — run full test suite ✅ DONE

After tasks 1–4 are implemented, run:

```bash
cd backend && python -m pytest tests/ -x -q
```

Verify:
- All tests pass (460 passing, up from 443 baseline).
- `ruff check backend/app/` passes.
- `mypy --strict` passes on all changed files.
- 83% overall coverage (above the 80% threshold).

---

## Developer Notes — Learnings & Gotchas

These notes capture non-obvious findings from the investigation. Read them before you start coding.

### Architecture Overview

BanGUI has **two separate SQLite databases**:

1. **fail2ban DB** — owned by fail2ban, opened read-only (`?mode=ro`) via `aiosqlite.connect(f"file:{path}?mode=ro", uri=True)`. Path is discovered at runtime by asking the fail2ban daemon (`get dbfile` via Unix socket). Contains the `bans` table.
2. **App DB** (`bangui.db`) — BanGUI's own database. Holds `settings`, `sessions`, `geo_cache`, `blocklist_sources`, `import_log`. This is the one being hammered by commits during GET requests.

There is a **single shared app DB connection** living at `request.app.state.db`. All concurrent requests share it. This means long-running writes (like 5,200 sequential INSERT+COMMIT loops) block other requests that need the same connection. The log confirms this: `setup_completed` checks and session lookups from parallel requests interleave with the geo persist loop.

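
To illustrate what the `?mode=ro` URI buys, here is a small sketch using the standard-library `sqlite3` module, which accepts the same `uri=True` keyword that the `aiosqlite.connect()` call above relies on. The file name and the two-column `bans` table are invented for the demo; the real fail2ban schema is not described in this document.

```python
import os
import sqlite3
import tempfile

# Create a throwaway database that stands in for the fail2ban DB.
path = os.path.join(tempfile.mkdtemp(), "fail2ban-demo.sqlite3")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE bans (ip TEXT, jail TEXT)")  # assumed columns
rw.execute("INSERT INTO bans VALUES ('192.0.2.1', 'sshd')")
rw.commit()
rw.close()

# Re-open it read-only through a file: URI.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
ban_count = ro.execute("SELECT COUNT(*) FROM bans").fetchone()[0]
try:
    ro.execute("INSERT INTO bans VALUES ('192.0.2.2', 'sshd')")
    write_rejected = False
except sqlite3.OperationalError:
    # SQLite refuses writes on a connection opened with mode=ro.
    write_rejected = True
finally:
    ro.close()
```

Reads succeed and the write attempt is rejected by SQLite itself, so the read-only guarantee does not depend on application code being careful.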
### The Geo Resolution Pipeline

`geo_service.py` implements a two-tier cache:

1. **In-memory dict** (`_cache: dict[str, GeoInfo]`) — module-level, lives for the process lifetime. Fast, no I/O.
2. **SQLite `geo_cache` table** — survives restarts. Loaded into `_cache` at startup via `load_cache_from_db()`.

There is also a **negative cache** (`_neg_cache: dict[str, float]`) for failed lookups with a 5-minute TTL. Failed IPs are not retried within that window.

The batch resolution flow in `lookup_batch()`:
1. Check `_cache` and `_neg_cache` for each IP → split into cached vs uncached.
2. Send uncached IPs to `ip-api.com/batch` in chunks of 100.
3. For each resolved IP: update `_cache` (fast) AND call `_persist_entry(db, ip, info)` (slow — INSERT + COMMIT).
4. For failed IPs: try MaxMind GeoLite2 local DB fallback (`_geoip_lookup()`). If that also fails, add to `_neg_cache` and call `_persist_neg_entry()`.

**Critical insight:** Step 3 is where the bottleneck lives. The `_persist_entry` function issues a separate `await db.commit()` after each INSERT. With 5,200 IPs, that's 5,200 `fsync` calls — each one waits for the disk.

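
Steps 1 and 2 of the flow above (cache split and chunking) can be sketched without any network or database code. The helper names `split_uncached` and `chunks` are hypothetical, the cache values are simplified to country-code strings, and storing the failure time (rather than an expiry time) in `_neg_cache` is an assumption about how the 5-minute TTL is tracked.

```python
from __future__ import annotations

import time

NEG_TTL = 300.0  # 5-minute negative-cache window, in seconds
CHUNK_SIZE = 100  # batch size sent per request

_cache: dict[str, str] = {}  # ip -> country code (simplified GeoInfo)
_neg_cache: dict[str, float] = {}  # ip -> time of the failed lookup (assumed)


def split_uncached(ips: list[str], now: float | None = None):
    """Return (cached results, IPs that still need a remote lookup)."""
    now = time.monotonic() if now is None else now
    cached: dict[str, str] = {}
    uncached: list[str] = []
    for ip in dict.fromkeys(ips):  # de-duplicate while keeping order
        if ip in _cache:
            cached[ip] = _cache[ip]
        elif ip in _neg_cache and now - _neg_cache[ip] < NEG_TTL:
            continue  # failed recently: do not retry yet
        else:
            uncached.append(ip)
    return cached, uncached


def chunks(items: list[str], size: int = CHUNK_SIZE) -> list[list[str]]:
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


_cache["192.0.2.1"] = "DE"
_neg_cache["192.0.2.2"] = 0.0  # failed at t=0
ips = ["192.0.2.1", "192.0.2.2"] + [f"10.0.{i // 256}.{i % 256}" for i in range(250)]

cached, uncached = split_uncached(ips, now=60.0)  # inside the TTL window
batch_sizes = [len(batch) for batch in chunks(uncached)]
_, uncached_later = split_uncached(ips, now=400.0)  # TTL has elapsed
```

With 250 uncached addresses the chunking yields batches of 100, 100 and 50, and the previously failed address is only retried once the 300-second window has passed.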
### Specific File Locations You Need

| File | Key functions | Notes |
|------|--------------|-------|
| `backend/app/services/geo_service.py` L231–260 | `_persist_entry()` | The INSERT + COMMIT per IP — **this is the hot path** |
| `backend/app/services/geo_service.py` L262–280 | `_persist_neg_entry()` | Same pattern for failed lookups |
| `backend/app/services/geo_service.py` L374–460 | `lookup_batch()` | Main batch function — calls `_persist_entry` in a loop |
| `backend/app/services/geo_service.py` L130–145 | `_store()` | Updates the in-memory `_cache` dict — fast, no I/O |
| `backend/app/services/geo_service.py` L202–230 | `load_cache_from_db()` | Startup warm-up, reads entire `geo_cache` table into memory |
| `backend/app/services/ban_service.py` L326–430 | `bans_by_country()` | Calls `lookup_batch()` with `db=app_db` |
| `backend/app/services/ban_service.py` L130–210 | `list_bans()` | Also calls `lookup_batch()` with `app_db` |
| `backend/app/routers/dashboard.py` | `get_bans_by_country()` | Passes `app_db=db` — this is where db gets threaded through |
| `backend/app/routers/bans.py` | `get_active_bans()` | Uses single-IP `lookup()` via enricher callback with `db` |
| `backend/app/routers/history.py` | `get_history()`, `get_ip_history()` | Same enricher-with-db pattern |
| `backend/app/routers/geo.py` | `lookup_ip()` | Single IP lookup, passes `db` |
| `backend/app/main.py` L268–306 | `SetupRedirectMiddleware` | Runs `get_setting(db, "setup_completed")` on every request |
| `backend/app/dependencies.py` L54–100 | `require_auth()` | Runs session token SELECT on every authenticated request |
| `backend/app/repositories/settings_repo.py` | `get_setting()` | Individual SELECT per key; `get_all_settings()` exists but is unused in middleware |

### Endpoints That Commit During GET Requests

All of these GET endpoints currently write to the app DB via geo_service:

| Endpoint | How | Commit count per request |
|----------|-----|--------------------------|
| `GET /api/dashboard/bans/by-country` | `bans_by_country()` → `lookup_batch()` → `_persist_entry()` per IP | Up to N (N = uncached IPs, can be thousands) |
| `GET /api/dashboard/bans` | `list_bans()` → `lookup_batch()` → `_persist_entry()` per IP | Up to page_size (max 500) |
| `GET /api/bans/active` | enricher → `lookup()` → `_persist_entry()` per IP | 1 per ban in response |
| `GET /api/history` | enricher → `lookup()` → `_persist_entry()` per IP | 1 per row |
| `GET /api/history/{ip}` | enricher → `lookup()` → `_persist_entry()` | 1 |
| `GET /api/geo/lookup/{ip}` | `lookup()` → `_persist_entry()` | 1 |

The only endpoint that **should** write geo data is `POST /api/geo/re-resolve` (already a POST).

### Concurrency / Connection Sharing Issue

The app DB connection (`app.state.db`) is a single `aiosqlite.Connection`. aiosqlite serialises operations through a background thread, so concurrent `await db.execute()` calls from different request handlers are queued. This is visible in the log: while the geo persist loop runs its 5,200 INSERT+COMMITs, other requests' `setup_completed` and session-token queries get interleaved between commits. They all complete, but everything is slower because they wait in the queue.

This is not a bug to fix right now, but keep it in mind: if you batch the commits (Task 1) and stop writing on GETs (Task 2), the contention problem largely goes away because the long-running write loop no longer exists.

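
The queueing effect can be modelled with a single-worker executor. This is only an analogy for the behaviour described above, not aiosqlite's actual implementation: one worker thread drains a FIFO queue, so a short query submitted in the middle of a long write loop has to wait for everything queued before it. The labels are invented.

```python
from concurrent.futures import ThreadPoolExecutor

completed: list[str] = []  # order in which the single worker ran the jobs


def run(label: str) -> str:
    """Pretend to execute one SQL operation on the shared connection."""
    completed.append(label)
    return label


# One worker thread models the single shared connection.
with ThreadPoolExecutor(max_workers=1) as worker:
    futures = [worker.submit(run, f"geo-insert-{i}") for i in range(3)]
    # Another request arrives while the geo persist loop is still queued.
    futures.append(worker.submit(run, "session-lookup"))
    futures.append(worker.submit(run, "geo-insert-3"))
    for future in futures:
        future.result()
```

The session lookup runs fourth, after the three inserts queued ahead of it. Shortening the write loop (Tasks 1 and 2) shrinks that wait without changing the connection model.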
### Test Infrastructure

- **443 tests** passing at baseline (before Tasks 1–4), **82% coverage**.
- Tests use `pytest` + `pytest-asyncio` + `httpx.AsyncClient`.
- External dependencies (fail2ban socket, ip-api.com) are fully mocked in tests.
- Run with: `cd backend && python -m pytest tests/ -x -q`
- Lint: `ruff check backend/app/`
- Types: `mypy --strict` on changed files
- All code must follow rules in `Docs/Backend-Development.md`.

### What NOT to Do

1. **Do not add a second DB connection** to "fix" the concurrency issue. The single-connection model is intentional for SQLite (WAL mode notwithstanding). Batching commits is the correct fix.
2. **Do not remove the SQLite geo_cache entirely.** It serves a real purpose: surviving process restarts without re-fetching thousands of IPs from ip-api.com.
3. **Do not cache geo data in Redis or add a new dependency.** The two-tier cache (in-memory dict + SQLite) is the right architecture for this app's scale. The problem is purely commit frequency.
4. **Do not change the `_cache` dict to an LRU or TTL cache.** The current eviction strategy (flush at 50,000 entries) is fine. The issue is the persistent layer, not the in-memory layer.
5. **Do not skip writing test cases.** The project enforces >80% coverage. Every change needs tests.