Files
BanGUI/Docs/Tasks.md

428 lines
26 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# BanGUI — Task List
This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.
Reference: `Docs/Refactoring.md` for full analysis of each issue.
---
## Open Issues
---
### TASK-01 — Fix undefined names in `routers/config_misc.py` 🔴
**Where:**
`backend/app/routers/config_misc.py` — lines 86, 87, 123, 174, 196, 337, 386.
**Status:** Completed ✅
**Goal:**
The file references four symbols that are never defined or imported. At runtime, any request that hits an error path in the affected endpoints will raise a bare `NameError` instead of returning the correct HTTP response.
- `ConfigOperationError` (caught on lines 86, 385) — defined in `app.exceptions` but missing from the import block.
- `JailOperationError` (caught on lines 123, 174) — same module, same omission.
- `_bad_request` (called on lines 87, 337, 386) — the file defines `_bad_gateway()` but the symmetrical `_bad_request()` helper was never added.
- `log` (used on line 196 in the restart handler) — a `structlog` logger that every other module in the project creates with `log = structlog.get_logger()`, but this module skips it entirely.
Additionally, `setup_service` is imported at the top level (line 20) and then redundantly re-imported inside two function bodies (lines 289, 327), which triggers a flake8/Pylance redefinition warning. The top-level import is unused.
Fix steps:
1. Add `from app.exceptions import ConfigOperationError, JailOperationError` to the import block.
2. Add `log: structlog.stdlib.BoundLogger = structlog.get_logger()` at module level, following the same pattern as every other service/router.
3. Add the `_bad_request` helper directly below `_bad_gateway`:
```python
def _bad_request(message: str) -> HTTPException:
return HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=message,
)
```
4. Remove `setup_service` from the top-level `from app.services import ...` line; keep only the two local imports where it is actually used, or hoist one of them to the top and remove the other.
**Possible traps:**
- The file also uses `config_service.ConfigOperationError` in one place (line 385) — that reference goes through the module attribute and therefore works today even without a direct import. After adding the bare import, both references will resolve to the same class, which is correct. Verify nothing accidentally catches the wrong exception.
- Removing `setup_service` from the top-level import while keeping the local imports means the local imports must stay in sync. Consider hoisting to top-level for consistency with every other router.
**Docs changes needed:**
None — this is a pure bug fix.
**Why:**
Any user action that exercises an error path on the `/config/global`, `/config/reload`, `/config/restart`, or `/config/log` endpoints will receive an unhandled `NameError` 500 response instead of the intended 400/409/502. These are common failure modes (fail2ban unreachable, bad config value).
---
### TASK-02 — Wrap blocking `mkdir()` calls in `run_blocking()` 🔴
**Status:** Completed ✅
**Where:**
- `backend/app/services/setup_service.py` — line 178, inside `async def _ensure_database_initialized()`
- `backend/app/startup.py` — line 73, inside `async def startup_shared_resources()`
**Goal:**
Both sites call `Path.mkdir()` synchronously inside async functions. Because the entire backend runs on a single-threaded asyncio event loop, a synchronous filesystem call — even one that normally returns in microseconds — blocks all other coroutines from running until it completes, including health-check tasks and in-flight requests. The fix is to wrap each call with the project's existing `run_blocking()` helper (from `app.utils.async_utils`), which offloads the call to a thread-pool executor.
`startup.py` already imports `run_blocking` indirectly (it calls services that use it), but needs to import it directly. `setup_service.py` already imports and uses `run_blocking` elsewhere in the same file (line 93) — this is therefore a straightforward omission.
Fix steps:
1. In `setup_service.py` around line 178:
```python
# Before
parent_dir.mkdir(parents=True, exist_ok=True)
# After
await run_blocking(parent_dir.mkdir, parents=True, exist_ok=True)
```
2. In `startup.py` around line 73, add the import if absent and apply the same pattern.
**Possible traps:**
- `Path.mkdir` takes keyword-only args. `run_blocking` must accept `**kwargs` or use a lambda: `await run_blocking(lambda: db_path.parent.mkdir(parents=True, exist_ok=True))`. Check the signature of `run_blocking` before picking the form.
- The `PermissionError` that `setup_service` deliberately catches must still be handled in the same try/except block — the await must stay inside the try.
**Docs changes needed:**
None — this is a pure bug fix.
**Why:**
Blocking I/O on the event loop is forbidden by the architecture's "Async Everything" principle. Under load or during startup, a slow filesystem (e.g. NFS, encrypted volume) could stall the entire server for hundreds of milliseconds.
---
### TASK-03 — Remove inverted dependency: service importing a task ✅
**Status:** Completed ✅
**Where:**
`backend/app/services/jail_config_service.py` — line 37:
```python
from app.tasks.health_check import run_probe
```
Used on lines 295 and 571 to fire a health probe after activating or deactivating a jail.
**Goal:**
The dependency graph mandates `Tasks → Services`, never `Services → Tasks`. `run_probe` is already exposed by `health_service.py` (or can be). The fix is to remove the task import from the service and call the equivalent functionality through the service layer instead.
Options:
1. If `health_service.probe()` is already equivalent, replace the two `run_probe(app)` calls with `await health_service.probe(socket_path)` (passing the socket path directly, avoiding the `FastAPI` argument entirely — see also TASK-05).
2. If `run_probe` does more than just probe (e.g. it also writes to `app.state`), extract that logic into a function in `health_service` and call that from both the task and the service.
**Possible traps:**
- `run_probe` in `health_check.py` accepts `app: FastAPI` and uses `app.state` to cache the probe result. Replacing it with a direct `health_service` call means the cached status in `app.state` will not be updated. Decide whether the caller needs the cache updated (if yes, pass the relevant state as a parameter; if no, a direct service call is sufficient).
- After this change, verify that `tasks/health_check.py` still calls `run_probe` internally and that no other module relied on the cross-import.
**Docs changes needed:**
Architecture doc section 2.2 (Tasks) should note that tasks may call services, but services must not call tasks.
**Why:**
The inverted import creates a hidden circular-dependency risk: if `health_check` ever imports anything from `jail_config_service`, Python will raise an `ImportError`. It also makes `jail_config_service` harder to test in isolation because importing it pulls in the scheduler task module.
---
### TASK-04 — Remove `FastAPI` app instance from service signatures 🟠
**Status:** Completed ✅
**Where:**
`backend/app/services/jail_config_service.py` — `activate_jail()` and `deactivate_jail()` function signatures accept `app: FastAPI` (guarded by `TYPE_CHECKING` on line 48, used at runtime on lines ~295, ~571).
**Goal:**
Services must not depend on the FastAPI framework. The `app` object is passed solely so `run_probe(app)` can update `app.state`. After TASK-03 removes the `run_probe` dependency, this parameter becomes unnecessary. If any state from `app.state` is still needed (e.g. `socket_path`, `scheduler`), pass only those specific values — not the entire app.
Fix steps:
1. Complete TASK-03 first.
2. Remove the `app: FastAPI` parameter from `activate_jail` and `deactivate_jail`.
3. Remove the `if TYPE_CHECKING: from fastapi import FastAPI` block if it is no longer referenced anywhere in the file.
4. Update all callers (routers and tests) to stop passing `app`.
**Possible traps:**
- The routers that call `activate_jail` inject `request: Request` specifically to access `request.app`. After removing the parameter, those routers should no longer need `Request` — check whether it is used for anything else in those handlers.
- Any test that constructs a mock `FastAPI` instance just to pass it here can be simplified.
**Docs changes needed:**
None beyond removing the `FastAPI` reference from the service signature.
**Why:**
Framework types in the service layer violate the Dependency Inversion principle, tightly couple service logic to the HTTP layer, and make unit testing significantly harder — requiring a FastAPI application instance just to call a domain function.
---
### TASK-05 — Extract business logic out of `tasks/history_sync.py` into a service ✅
**Where:**
`backend/app/tasks/history_sync.py` — lines 1516 import repositories directly; the entire `_run_sync_with_settings()` function contains paging logic, backfill window calculation, and `archive_ban_event` calls that constitute business logic.
**Goal:**
Tasks must only schedule and invoke. All logic — the backfill window decision, the paging loop, the SQL cursor over the fail2ban DB, the archive write — belongs in a service. Create (or extend) `history_service.py` with a `sync_from_fail2ban_db()` function that accepts a DB connection and socket path, encapsulates the entire sync algorithm, and returns a count of synced records. The task then becomes a thin wrapper:
```python
async def _run_sync_with_settings(settings: Settings) -> None:
db = await open_db(settings.database_path)
try:
synced = await history_service.sync_from_fail2ban_db(db, settings.fail2ban_socket)
log.info("history_sync_complete", synced=synced)
finally:
await db.close()
```
**Possible traps:**
- `history_service` currently uses the fail2ban database read-only. The new function writes to the BanGUI archive table — make sure the connection passed is the BanGUI DB, not the fail2ban DB. The fail2ban DB path is resolved via `get_fail2ban_db_path()` inside the service function.
- The `BACKFILL_WINDOW` and `HISTORY_SYNC_INTERVAL` constants can stay in the task file (scheduling concern) but the logic that decides whether to backfill must move to the service.
- Ensure existing tests for `history_sync` are updated to test the service function, not the task internals.
**Docs changes needed:**
Architecture doc section on Services: add `history_service.sync_from_fail2ban_db` to the service table. Architecture doc section on Tasks: update `history_sync` description to reflect that it delegates to `history_service`.
**Why:**
Business logic in the task layer cannot be unit-tested without spinning up a scheduler. It also means any future consumer of the sync logic (e.g. a manual trigger endpoint) would have to duplicate the code or import from the task module.
---
### TASK-06 — Break circular dependency between `config_file_service` and the three config sub-services ✅
**Status:** Completed ✅
**Where:**
- `services/jail_config_service.py` line 33 → imports `config_file_service` at module level
- `services/filter_config_service.py` line 26 → same
- `services/action_config_service.py` line 28 → same
- `services/config_file_service.py` lines 1027, 1039, 1107, 1118, 1128, 1140 → lazy-imports `jail_config_service`; lines 13031367 → lazy-imports `filter_config_service`; lines 1752+ → lazy-imports `action_config_service`
**Goal:**
The three sub-services depend on `config_file_service` for shared parsing utilities, while `config_file_service` calls back into the three sub-services via lazy in-function imports — the only reason it works is that the circular import is deferred until first call. This is fragile: any reorganisation that removes the lazy-import guard will cause an `ImportError` at startup.
The correct fix is to break the cycle by extracting the shared logic that `config_file_service` currently delegates to the sub-services into a lower-level module. Concretely:
1. Identify every function that `config_file_service` calls via lazy import from the sub-services (e.g. `list_inactive_jails`, `_activate_jail`, `list_filters`, `get_filter`, `list_actions`, etc.).
2. Determine whether `config_file_service` should own those functions directly (moving them out of the sub-services), or whether `config_file_service` should simply be split so the shared parsing utilities live in a `config_file_utils` module that all four services can import without circularity.
3. Remove all lazy in-function imports from `config_file_service`.
The safest first step is option 2: rename the pure parsing/utility functions in `config_file_service` into a new `utils/config_file_utils.py` (or extend the existing one), then have all four service files import from the util, leaving `config_file_service` to handle only orchestration.
**Possible traps:**
- `config_file_service` is over 1700 lines. Splitting it requires careful identification of which functions are pure utilities (no side effects, no DB/socket calls) vs. orchestration. Start with utility functions only.
- Any function moved to utils must not import from services. If it currently does, the abstraction is wrong and needs further decomposition.
- Test coverage for existing behaviour must be maintained throughout the refactor. Run the full test suite after each extraction step.
**Docs changes needed:**
Architecture doc section 2.1 (Project Structure): add `utils/config_file_utils.py` if created. Section 2.2 (Utils): describe the new module. Update the Service descriptions for all four affected services to accurately describe their responsibilities post-split.
**Why:**
Lazy imports mask circular dependencies at import time but surface as confusing runtime errors if import order changes. The current design also makes it impossible to import any one of the four services for testing without transitively loading all four.
---
### TASK-07 — Move schedule management out of `routers/blocklist.py` into a service ✅
**Status:** Completed ✅
**Summary:** Router schedule endpoints now delegate schedule persistence and APScheduler job management to `blocklist_service`, removing the router's direct dependency on `app.tasks.blocklist_import`.
**Where:**
`backend/app/routers/blocklist.py` — line 51:
```python
from app.tasks import blocklist_import as blocklist_import_task
```
Used on lines 173, 206, 208 to read `blocklist_import_task.JOB_ID` and call `blocklist_import_task.reschedule(app)`.
**Goal:**
Routers must not know about APScheduler job IDs or the task module internals. Schedule management is business logic and belongs in a service. Create a `get_schedule()` and `update_schedule()` function in `blocklist_service.py` that encapsulate the scheduler interaction. The router calls the service function and returns its result.
The router's `GET /schedule` and `PUT /schedule` handlers should be reduced to:
```python
return await blocklist_service.get_schedule(db, scheduler)
await blocklist_service.update_schedule(db, scheduler, body)
```
The `JOB_ID` constant and `reschedule()` call details move into `blocklist_service`. `blocklist_service` may import from the task module (tasks can be imported by services for constants/types), but the router should have no direct task dependency.
**Possible traps:**
- `blocklist_import_task.reschedule(app)` uses `app` to read `app.state`. If the service function receives the scheduler directly (already available via `SchedulerDep`), it can avoid the `app` dependency. Pass `scheduler` explicitly instead.
- The `SchedulerDep` injection in the router should remain unchanged — the scheduler is a dependency, not a business concern, and injecting it via `Depends()` is correct.
**Docs changes needed:**
Architecture doc Service table: add `get_schedule` and `update_schedule` to the `blocklist_service` entry.
**Why:**
The router knows the internal job ID string and directly manipulates the scheduler — both are implementation details of the task/scheduling layer. This makes the endpoint hard to test without a running scheduler instance.
---
### TASK-08 — Remove inverted layer dependency: `utils/fail2ban_db_utils.py` importing a service 🟠
**Where:**
`backend/app/utils/fail2ban_db_utils.py` — line 8:
```python
from app.services.fail2ban_metadata_service import default_fail2ban_metadata_service
```
**Goal:**
`utils/` must be the lowest layer — it may not import from `services/`. The `fail2ban_db_utils` module is currently a thin wrapper around `default_fail2ban_metadata_service`. Two options exist:
1. **Move the singleton down**: Move `Fail2BanMetadataService` (and its `default_fail2ban_metadata_service` singleton) from `services/` to `utils/`. The class has no service dependencies — it only uses `Fail2BanClient` from `utils/` — so it belongs in `utils/` naturally.
2. **Eliminate the util wrapper**: Delete `fail2ban_db_utils.get_fail2ban_db_path()` and have all callers import from `fail2ban_metadata_service` directly (since it is a service and services can import other services).
Option 1 is cleaner and consistent with the architecture: `Fail2BanMetadataService` is essentially an async-safe caching wrapper around a socket command, which is a utility concern.
**Possible traps:**
- If `Fail2BanMetadataService` is moved to `utils/`, the import path changes everywhere it is referenced. Use IDE rename-symbol to update all references atomically.
- The architecture doc currently does not list `fail2ban_metadata_service` at all. After moving it to utils, it must be documented in the Utils table.
- Any service that currently imports `fail2ban_db_utils` (rather than `fail2ban_metadata_service`) will need their imports updated if `fail2ban_db_utils` is removed.
**Docs changes needed:**
Architecture doc section 2.2 (Utils): add `fail2ban_db_utils.py` (or its replacement) and `fail2ban_metadata_service` to the Utils table.
**Why:**
A util importing a service creates an implicit upward dependency that the architecture explicitly forbids. It also makes `fail2ban_db_utils` untestable in isolation — importing it drags in the service and its lazy-loaded metadata cache.
---
### TASK-09 — Move `re_resolve_geo` orchestration into `geo_service` 🟠
**Where:**
`backend/app/routers/geo.py` — `async def re_resolve_geo()` (lines ~142174).
**Goal:**
The endpoint currently performs a multi-step workflow inline:
1. Fetch all unresolved IPs from the DB.
2. Early-return if the list is empty.
3. Clear the negative cache.
4. Batch-resolve all IPs.
5. Aggregate the count of newly resolved entries.
Steps 15 are business logic. The router should call a single service function and return its result:
```python
result = await geo_service.re_resolve_all(db, http_session)
return result # {"resolved": N, "total": M}
```
Add `async def re_resolve_all(db, http_session) -> dict[str, int]` to `geo_service.py` containing the current implementation.
**Possible traps:**
- `clear_neg_cache()` is a synchronous function that mutates the module-level `_neg_cache` dict. Calling it from inside the service function is fine, but be aware that it runs on the event loop thread with no lock. This is an existing race condition (see TASK-11) — do not introduce additional concurrent callers without addressing that issue first.
- The return type `dict[str, int]` is an untyped dict. Consider adding a proper response model `GeoReResolveResponse` to `models/geo.py` to make the API contract explicit.
**Docs changes needed:**
Architecture doc Service table: add `re_resolve_all` to the `geo_service` entry.
**Why:**
Multi-step orchestration in the router violates the zero-business-logic rule. It also prevents the re-resolve logic from being triggered from any other context (e.g. a scheduled task or CLI tool) without duplicating code.
---
### TASK-10 — Move `GeoInfo → GeoDetail` translation out of the router 🟡
**Where:**
`backend/app/routers/geo.py` — `async def lookup_ip()`, lines ~8593:
```python
raw_geo = result["geo"]
geo_detail: GeoDetail | None = None
if isinstance(raw_geo, GeoInfo):
geo_detail = GeoDetail(
country_code=raw_geo.country_code,
...
)
```
**Goal:**
The router performs conditional type checking and cross-schema mapping. The cleanest fix is to have `jail_service.lookup_ip()` return a typed dict whose `"geo"` field is already `GeoDetail | None`, eliminating the isinstance check entirely. Alternatively, add a `GeoInfo.to_detail()` method or a `geo_service.to_detail()` converter that the router calls in one line.
This is a small but clean change: update `IpLookupResult` TypedDict in `jail_service` so that `"geo"` is typed as `GeoDetail | None`, and have the service construct the `GeoDetail` before returning.
**Possible traps:**
- `IpLookupResult` is currently typed with `"geo": GeoInfo | None`. Changing it to `GeoDetail` means callers that rely on `GeoInfo`-specific fields (like `asn`) must use `GeoDetail` instead — confirm `GeoDetail` exposes the same fields.
- If `GeoInfo` and `GeoDetail` are nearly identical, consider merging them into a single model to avoid this translation layer entirely.
**Docs changes needed:**
None — this is an internal type alignment.
**Why:**
Schema translation in the router adds fragility: if either model changes, the mapping code must be updated and the connection is not enforced by the type system. Having the service return the correct type is safer and is consistent with the "routers return typed responses" principle.
---
### TASK-11 — Fix `asyncio.Lock()` created at module import time in `jail_service.py` 🟡
**Where:**
`backend/app/services/jail_service.py` — lines 71 and 78:
```python
_reload_all_lock: asyncio.Lock = asyncio.Lock()
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()
```
**Goal:**
`asyncio.Lock()` created before the event loop starts is bound to whichever loop exists at import time. In Python 3.10+ this is generally safe for a single-worker deployment using uvicorn, but it breaks test suites that create a new event loop per test (e.g. `pytest-asyncio` with `loop_scope="function"`), causing a `RuntimeError: Task attached to a different loop` error when the second test tries to acquire the lock.
Fix: initialise the locks lazily on first use:
```python
_reload_all_lock: asyncio.Lock | None = None
_backend_cmd_lock: asyncio.Lock | None = None
def _get_reload_lock() -> asyncio.Lock:
global _reload_all_lock
if _reload_all_lock is None:
_reload_all_lock = asyncio.Lock()
return _reload_all_lock
```
Then replace `async with _reload_all_lock:` with `async with _get_reload_lock():`.
**Possible traps:**
- Lazy initialisation is not thread-safe if multiple coroutines call the getter simultaneously before initialisation. In an asyncio context this is safe (single-threaded event loop), but add a brief comment explaining why.
- If the test suite currently happens to pass despite this issue (because `pytest-asyncio` is configured with `loop_scope="session"`), the bug is latent rather than active — still worth fixing for configuration flexibility.
**Docs changes needed:**
None.
**Why:**
Module-level asyncio primitives are a known footgun in Python async codebases. Fixing this is low-risk and makes the test suite more robust against pytest-asyncio configuration changes.
---
### TASK-12 — Remove duplicate import in `server_service.py` ✅
**Where:**
`backend/app/services/server_service.py` — lines 1718:
```python
from app.exceptions import ServerOperationError
from app.exceptions import ServerOperationError
```
**Goal:**
Delete line 18 (the exact duplicate). One-line change.
**Possible traps:**
None.
**Docs changes needed:**
None.
**Why:**
Duplicate imports generate linter warnings, add noise to diffs, and signal that the file was edited carelessly.
---
### TASK-13 — Document `fail2ban_metadata_service` and `history_sync` task in the architecture doc 🟡
**Where:**
`Docs/Architekture.md` — Services table (section 2.2) and Tasks table (section 2.2).
**Goal:**
Two modules exist in the codebase that are absent from the architecture document:
1. `services/fail2ban_metadata_service.py` — An async-safe caching wrapper that resolves the fail2ban SQLite database path by querying the socket once and caching the result. Exports `default_fail2ban_metadata_service`, a module-level singleton. Should be listed in the Services table (or moved to Utils — see TASK-08) with a description of its caching behaviour and the singleton pattern.
2. `tasks/history_sync.py` — A scheduled background task (interval: 300 seconds) that copies new ban records from the fail2ban database into BanGUI's own archive table to preserve history that would otherwise be purged. Should be added to the Tasks table.
Additionally note in section 6 (Authentication): the `_session_cache` in `dependencies.py` noted there is now implemented in `utils/session_cache.py` — confirm the section still accurately reflects the implementation.
**Possible traps:**
- If TASK-08 is completed before this task, the location of `fail2ban_metadata_service` will change. Write the doc entry after TASK-08 is decided.
**Docs changes needed:**
`Docs/Architekture.md`:
- Add `fail2ban_metadata_service.py` row to the Services (or Utils) table.
- Add `history_sync.py` row to the Tasks table.
- Cross-reference TASK-05 once it is complete (the task should delegate to a service).
**Why:**
The architecture document is a contract between contributors. Undocumented modules lead to inconsistent patterns and duplicate implementations when new developers add features that the undocumented module already handlzttttttttt nbbbbbbbbbbes.