Reduce per-request DB overhead (Task 4)
- Cache the `setup_completed` flag in `app.state._setup_complete_cached` after the first successful `is_setup_complete()` call; all subsequent API requests skip the DB query entirely (one-way transition, cleared on restart).
- Add an in-memory session-token TTL cache (10 s) in `require_auth`; a second request with the same token within the window skips `session_repo.get_session`.
- Call `invalidate_session_cache()` on logout so revoked tokens are evicted immediately rather than waiting for TTL expiry.
- Add `clear_session_cache()` for test isolation.
- 5 new tests covering the cached fast path for both optimisations.
- 460 tests pass, 83% coverage, zero ruff/mypy warnings.
@@ -111,6 +111,15 @@ backend/

- Group endpoints into routers by feature domain (`routers/jails.py`, `routers/bans.py`, …).
- Use appropriate HTTP status codes: `201` for creation, `204` for deletion with no body, `404` for not found, etc.
- Use **HTTPException** or custom exception handlers — never return error dicts manually.
- **GET endpoints are read-only — never call `db.commit()` or execute INSERT/UPDATE/DELETE inside a GET handler.** If a GET path produces side effects (e.g., caching resolved data), that write belongs in a background task, a scheduled flush, or a separate POST endpoint. Users and HTTP caches assume GET is idempotent and non-mutating.

```python
# Good — pass db=None on GET so geo_service never commits
result = await geo_service.lookup_batch(ips, http_session, db=None)

# Bad — triggers INSERT + COMMIT per IP inside a GET handler
result = await geo_service.lookup_batch(ips, http_session, db=app_db)
```

```python
from fastapi import APIRouter, Depends, HTTPException, status
```
@@ -156,6 +165,26 @@ class BanResponse(BaseModel):

- Use `aiohttp.ClientSession` for HTTP calls, `aiosqlite` for database access.
- Use `asyncio.TaskGroup` (Python 3.11+) when you need to run independent coroutines concurrently.
- Long-running startup/shutdown logic goes into the **FastAPI lifespan** context manager.
- **Never call `db.commit()` inside a loop.** With aiosqlite, every commit serialises through a background thread and forces an `fsync`. N rows × 1 commit = N fsyncs. Accumulate all writes in the loop, then issue a single `db.commit()` once after the loop ends. The difference between 5,000 commits and 1 commit can be seconds vs milliseconds.

```python
# Good — one commit for the whole batch
for ip, info in results.items():
    await db.execute(INSERT_SQL, (ip, info.country_code, ...))
await db.commit()  # ← single fsync

# Bad — one fsync per row
for ip, info in results.items():
    await db.execute(INSERT_SQL, (ip, info.country_code, ...))
    await db.commit()  # ← fsync on every iteration
```

- **Prefer `executemany()` over calling `execute()` in a loop** when inserting or updating multiple rows with the same SQL template. aiosqlite passes the entire batch to SQLite in one call, reducing Python↔thread overhead on top of the single-commit saving.

```python
# Good
await db.executemany(INSERT_SQL, [(ip, info.country_code, ...) for ip, info in results.items()])
await db.commit()
```

- Shared resources (DB connections, HTTP sessions) are created once during startup and closed during shutdown — never inside request handlers.
@@ -427,4 +456,7 @@ class SqliteBanRepository:

| Handle errors with custom exceptions | Use bare `except:` |
| Keep routers thin, logic in services | Put business logic in routers |
| Use `datetime.now(UTC)` | Use naive datetimes |
| Run ruff + mypy before committing | Push code that doesn't pass linting |
| Keep GET endpoints read-only (no `db.commit()`) | Call `db.commit()` / INSERT inside GET handlers |
| Batch DB writes; issue one `db.commit()` after the loop | Commit inside a loop (1 fsync per row) |
| Use `executemany()` for bulk inserts | Call `execute()` + `commit()` per row in a loop |
Docs/Tasks.md (106 lines changed)
@@ -29,8 +29,6 @@ Root causes (ordered by impact):

### Task 1: Batch geo cache writes — eliminate per-IP commits ✅ DONE

**Summary:** Removed `await db.commit()` from `_persist_entry()` and `_persist_neg_entry()`. Added a single `await db.commit()` (wrapped in try/except) at the end of `lookup_batch()` after all chunk processing, and after each `_persist_entry` / `_persist_neg_entry` call in `lookup()`. Reduces commits from ~5,200 to **1** per batch request.

**File:** `backend/app/services/geo_service.py`

**What to change:**

@@ -49,8 +47,6 @@ The functions `_persist_entry()` and `_persist_neg_entry()` each call `await db.

### Task 2: Do not write geo cache during GET requests ✅ DONE

**Summary:** Removed `db` dependency injection from `GET /api/dashboard/bans` and `GET /api/dashboard/bans/by-country` in `dashboard.py`. Both now pass `app_db=None` to their respective service calls. The other GET endpoints (`/api/bans/active`, `/api/history`, `/api/history/{ip}`, `/api/geo/lookup/{ip}`) already did not pass `db` to geo lookups — confirmed correct.

**Files:** `backend/app/routers/dashboard.py`, `backend/app/routers/bans.py`, `backend/app/routers/history.py`, `backend/app/routers/geo.py`

**What to change:**

@@ -80,8 +76,6 @@ The persistent geo cache should only be written during explicit write operations

### Task 3: Periodically persist the in-memory geo cache (background task) ✅ DONE

**Summary:** Added `_dirty: set[str]` to `geo_service.py`. `_store()` now adds IPs with a non-null `country_code` to `_dirty`; `clear_cache()` clears it. Added `flush_dirty(db)`, which atomically snapshots/clears `_dirty`, batch-upserts all rows via `executemany()`, commits once, and re-adds entries on failure. Created `backend/app/tasks/geo_cache_flush.py` with a 60-second APScheduler job, registered in `main.py`.
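The snapshot-and-restore pattern behind `flush_dirty()` can be sketched generically. This is an illustration only, not the project's code; the `persist` callable stands in for the real `executemany()` plus single `commit()` batch write:

```python
class DirtyTracker:
    """Tracks keys whose in-memory entries have not yet been persisted."""

    def __init__(self) -> None:
        self._dirty: set[str] = set()

    def mark(self, key: str) -> None:
        self._dirty.add(key)

    def flush(self, persist) -> int:
        # Snapshot and clear in one step, so keys marked while `persist`
        # runs land in the fresh set and are picked up by the next flush.
        snapshot, self._dirty = self._dirty, set()
        if not snapshot:
            return 0
        try:
            persist(sorted(snapshot))  # e.g. executemany() + one commit
        except Exception:
            # On failure, restore the snapshot so the next run retries it.
            self._dirty |= snapshot
            raise
        return len(snapshot)
```

Because asyncio runs handlers on a single thread, the swap cannot interleave with `mark()` calls, which is what makes the snapshot safe without a lock.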
**Files:** `backend/app/services/geo_service.py`, `backend/app/tasks/` (new task file)

**What to change:**

@@ -106,7 +100,7 @@ After Task 2, GET requests no longer write to the DB. But newly resolved IPs dur

---

### Task 4: Reduce redundant SQL queries per request (settings / auth) ✅ DONE

**Files:** `backend/app/dependencies.py`, `backend/app/main.py`, `backend/app/repositories/settings_repo.py`
@@ -128,7 +122,7 @@ Options (implement one or both):

---

### Task 5: Audit and verify — run full test suite ✅ DONE

After tasks 1–4 are implemented, run:

@@ -137,7 +131,97 @@ cd backend && python -m pytest tests/ -x -q

Verify:
- All tests pass (460 passing, up from 443 baseline).
- `ruff check backend/app/` passes.
- `mypy --strict` passes on all changed files.
- Manual smoke test: load the world map page and verify it renders quickly with correct country data.
- 83% overall coverage (above the 80% threshold).
---

## Developer Notes — Learnings & Gotchas

These notes capture non-obvious findings from the investigation. Read them before you start coding.

### Architecture Overview

BanGUI has **two separate SQLite databases**:

1. **fail2ban DB** — owned by fail2ban, opened read-only (`?mode=ro`) via `aiosqlite.connect(f"file:{path}?mode=ro", uri=True)`. Path is discovered at runtime by asking the fail2ban daemon (`get dbfile` via Unix socket). Contains the `bans` table.
2. **App DB** (`bangui.db`) — BanGUI's own database. Holds `settings`, `sessions`, `geo_cache`, `blocklist_sources`, `import_log`. This is the one being hammered by commits during GET requests.
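The same `?mode=ro` URI form works with the stdlib `sqlite3` module, which makes the read-only guarantee easy to demonstrate: an INSERT on such a connection raises `sqlite3.OperationalError`. A throwaway sketch (demo code, not part of the project):

```python
import os
import sqlite3
import tempfile

def make_demo_db() -> str:
    """Create a throwaway DB file with one table and return its path."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    rw = sqlite3.connect(path)
    rw.execute("CREATE TABLE bans (ip TEXT)")
    rw.commit()
    rw.close()
    return path

def writable(path: str) -> bool:
    """Open `path` with the read-only URI form and try an INSERT."""
    ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        ro.execute("INSERT INTO bans VALUES ('1.2.3.4')")
        return True
    except sqlite3.OperationalError:
        # SQLite rejects writes on a mode=ro connection.
        return False
    finally:
        ro.close()
```

Opening the fail2ban DB this way guarantees BanGUI can never corrupt a database it does not own, regardless of bugs elsewhere in the code.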
There is a **single shared app DB connection** living at `request.app.state.db`. All concurrent requests share it. This means long-running writes (like 5,200 sequential INSERT+COMMIT loops) block other requests that need the same connection. The log confirms this: `setup_completed` checks and session lookups from parallel requests interleave with the geo persist loop.
### The Geo Resolution Pipeline

`geo_service.py` implements a two-tier cache:

1. **In-memory dict** (`_cache: dict[str, GeoInfo]`) — module-level, lives for the process lifetime. Fast, no I/O.
2. **SQLite `geo_cache` table** — survives restarts. Loaded into `_cache` at startup via `load_cache_from_db()`.

There is also a **negative cache** (`_neg_cache: dict[str, float]`) for failed lookups with a 5-minute TTL. Failed IPs are not retried within that window.

The batch resolution flow in `lookup_batch()`:

1. Check `_cache` and `_neg_cache` for each IP → split into cached vs uncached.
2. Send uncached IPs to `ip-api.com/batch` in chunks of 100.
3. For each resolved IP: update `_cache` (fast) AND call `_persist_entry(db, ip, info)` (slow — INSERT + COMMIT).
4. For failed IPs: try MaxMind GeoLite2 local DB fallback (`_geoip_lookup()`). If that also fails, add to `_neg_cache` and call `_persist_neg_entry()`.
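Steps 1 and 2 of that flow can be sketched generically. Illustrative only: `resolve_chunk` stands in for the `ip-api.com/batch` call, and the real `lookup_batch()` additionally handles the negative cache and the GeoLite2 fallback:

```python
CHUNK_SIZE = 100  # ip-api.com batch endpoint limit

def split_cached(ips, cache):
    """Step 1: partition requested IPs into already-cached vs needing lookup."""
    cached = {ip: cache[ip] for ip in ips if ip in cache}
    uncached = [ip for ip in ips if ip not in cache]
    return cached, uncached

def resolve_all(uncached, cache, resolve_chunk):
    """Step 2: resolve uncached IPs in chunks, updating the in-memory cache."""
    for i in range(0, len(uncached), CHUNK_SIZE):
        chunk = uncached[i : i + CHUNK_SIZE]
        for ip, info in resolve_chunk(chunk).items():
            cache[ip] = info  # step 3's fast in-memory update
```

The chunking is why a cold request for 5,200 IPs turns into ~52 upstream calls, while the per-IP `_persist_entry` commit in step 3 is what turns it into 5,200 fsyncs.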
**Critical insight:** Step 3 is where the bottleneck lives. The `_persist_entry` function issues a separate `await db.commit()` after each INSERT. With 5,200 IPs, that's 5,200 `fsync` calls — each one waits for the disk.

### Specific File Locations You Need

| File | Key functions | Notes |
|------|--------------|-------|
| `backend/app/services/geo_service.py` L231–260 | `_persist_entry()` | The INSERT + COMMIT per IP — **this is the hot path** |
| `backend/app/services/geo_service.py` L262–280 | `_persist_neg_entry()` | Same pattern for failed lookups |
| `backend/app/services/geo_service.py` L374–460 | `lookup_batch()` | Main batch function — calls `_persist_entry` in a loop |
| `backend/app/services/geo_service.py` L130–145 | `_store()` | Updates the in-memory `_cache` dict — fast, no I/O |
| `backend/app/services/geo_service.py` L202–230 | `load_cache_from_db()` | Startup warm-up, reads entire `geo_cache` table into memory |
| `backend/app/services/ban_service.py` L326–430 | `bans_by_country()` | Calls `lookup_batch()` with `db=app_db` |
| `backend/app/services/ban_service.py` L130–210 | `list_bans()` | Also calls `lookup_batch()` with `app_db` |
| `backend/app/routers/dashboard.py` | `get_bans_by_country()` | Passes `app_db=db` — this is where db gets threaded through |
| `backend/app/routers/bans.py` | `get_active_bans()` | Uses single-IP `lookup()` via enricher callback with `db` |
| `backend/app/routers/history.py` | `get_history()`, `get_ip_history()` | Same enricher-with-db pattern |
| `backend/app/routers/geo.py` | `lookup_ip()` | Single IP lookup, passes `db` |
| `backend/app/main.py` L268–306 | `SetupRedirectMiddleware` | Runs `get_setting(db, "setup_completed")` on every request |
| `backend/app/dependencies.py` L54–100 | `require_auth()` | Runs session token SELECT on every authenticated request |
| `backend/app/repositories/settings_repo.py` | `get_setting()` | Individual SELECT per key; `get_all_settings()` exists but is unused in middleware |
### Endpoints That Commit During GET Requests

All of these GET endpoints currently write to the app DB via geo_service:

| Endpoint | How | Commit count per request |
|----------|-----|--------------------------|
| `GET /api/dashboard/bans/by-country` | `bans_by_country()` → `lookup_batch()` → `_persist_entry()` per IP | Up to N (N = uncached IPs, can be thousands) |
| `GET /api/dashboard/bans` | `list_bans()` → `lookup_batch()` → `_persist_entry()` per IP | Up to page_size (max 500) |
| `GET /api/bans/active` | enricher → `lookup()` → `_persist_entry()` per IP | 1 per ban in response |
| `GET /api/history` | enricher → `lookup()` → `_persist_entry()` per IP | 1 per row |
| `GET /api/history/{ip}` | enricher → `lookup()` → `_persist_entry()` | 1 |
| `GET /api/geo/lookup/{ip}` | `lookup()` → `_persist_entry()` | 1 |

The only endpoint that **should** write geo data is `POST /api/geo/re-resolve` (already a POST).
### Concurrency / Connection Sharing Issue

The app DB connection (`app.state.db`) is a single `aiosqlite.Connection`. aiosqlite serialises operations through a background thread, so concurrent `await db.execute()` calls from different request handlers are queued. This is visible in the log: while the geo persist loop runs its 5,200 INSERT+COMMITs, other requests' `setup_completed` and session-token queries get interleaved between commits. They all complete, but everything is slower because they wait in the queue.

This is not a bug to fix right now, but keep it in mind: if you batch the commits (Task 1) and stop writing on GETs (Task 2), the contention problem largely goes away because the long-running write loop no longer exists.
### Test Infrastructure

- **443 tests** currently passing, **82% coverage**.
- Tests use `pytest` + `pytest-asyncio` + `httpx.AsyncClient`.
- External dependencies (fail2ban socket, ip-api.com) are fully mocked in tests.
- Run with: `cd backend && python -m pytest tests/ -x -q`
- Lint: `ruff check backend/app/`
- Types: `mypy --strict` on changed files
- All code must follow rules in `Docs/Backend-Development.md`.
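A typical shape for those mocks is `unittest.mock.AsyncMock`, which keeps the awaitable interface intact so service code under test does not notice the substitution. The `enrich` helper below is a hypothetical stand-in, not one of the project's functions:

```python
import asyncio
from unittest.mock import AsyncMock

async def enrich(ip: str, fetch_geo) -> dict:
    # Hypothetical enricher: attach geo data fetched from an external service.
    info = await fetch_geo(ip)
    return {"ip": ip, **info}

def test_enrich_uses_mocked_lookup() -> None:
    # AsyncMock returns its return_value when awaited, so no real
    # network call (ip-api.com) is ever made.
    fake = AsyncMock(return_value={"country_code": "DE"})
    result = asyncio.run(enrich("9.9.9.9", fake))
    assert result == {"ip": "9.9.9.9", "country_code": "DE"}
    fake.assert_awaited_once_with("9.9.9.9")
```

The same pattern applies to the fail2ban socket client: replace the coroutine that talks to the Unix socket with an `AsyncMock` returning canned daemon responses.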
### What NOT to Do

1. **Do not add a second DB connection** to "fix" the concurrency issue. The single-connection model is intentional for SQLite (WAL mode notwithstanding). Batching commits is the correct fix.
2. **Do not remove the SQLite geo_cache entirely.** It serves a real purpose: surviving process restarts without re-fetching thousands of IPs from ip-api.com.
3. **Do not cache geo data in Redis or add a new dependency.** The two-tier cache (in-memory dict + SQLite) is the right architecture for this app's scale. The problem is purely commit frequency.
4. **Do not change the `_cache` dict to an LRU or TTL cache.** The current eviction strategy (flush at 50,000 entries) is fine. The issue is the persistent layer, not the in-memory layer.
5. **Do not skip writing test cases.** The project enforces >80% coverage. Every change needs tests.