Move Fail2Ban exceptions into central app.exceptions module

This commit is contained in:
2026-04-15 10:22:48 +02:00
parent a79f5339bc
commit 328f3575e2
18 changed files with 305 additions and 365 deletions

View File

@@ -3,3 +3,8 @@
This document catalogues architecture violations, code smells, and structural issues found during a full project review. Issues are grouped by category and prioritised.
---
## Completed Refactors
- Moved `Fail2BanConnectionError` and `Fail2BanProtocolError` from `backend/app/utils/fail2ban_client.py` into `backend/app/exceptions.py`. Updated all router, service, and test call sites to import these domain exceptions from `app.exceptions` and retained backward compatibility through re-exporting in `app.utils.fail2ban_client`.

View File

@@ -10,430 +10,362 @@ Reference: `Docs/Refactoring.md` for full analysis of each issue.
---
### TASK-01 — Fix undefined names in `routers/config_misc.py` 🔴
### Task 1 — Move `Fail2BanConnectionError` and `Fail2BanProtocolError` to `app/exceptions.py`
**Where:**
`backend/app/routers/config_misc.py` — lines 86, 87, 123, 174, 196, 337, 386.
**Found in:** `backend/app/utils/fail2ban_client.py` lines 128 and 142. Every router that catches these must currently `from app.utils.fail2ban_client import Fail2BanConnectionError`, importing a transport-layer utility directly.
**Goal:** Move (or re-export) both exception classes into `app/exceptions.py` alongside the rest of the domain error hierarchy. Routers and services should import them from `app.exceptions`, not from a utility module. `fail2ban_client.py` can keep the class definitions and simply re-export from `exceptions`, or import them from there — whichever direction avoids a circular import.
**Possible traps and issues:**
- `fail2ban_client.py` is in the `utils` layer; `exceptions.py` is in `app`. Moving the definition to `exceptions.py` and importing it back into `fail2ban_client.py` (for raising) is the natural direction, but verify there is no circular import chain (`exceptions``utils``exceptions`).
- All callers across `routers/` and `services/` must be updated to import from the new location. A global grep is needed before starting.
- `main.py` registers global exception handlers for both; that import path must be updated too.
**Docs changes needed:** Update `Docs/Refactoring.md` to mark this issue resolved. No user-facing API change.
**Why this is needed:** All domain exceptions should live in one registry so that global exception handlers, tests, and consumers have a single import target. Having transport-layer utilities define exceptions that the entire router layer must catch defeats the purpose of `app/exceptions.py`.
**Status:** Completed ✅
**Goal:**
The file references four symbols that are never defined or imported. At runtime, any request that hits an error path in the affected endpoints will raise a bare `NameError` instead of returning the correct HTTP response.
---
- `ConfigOperationError` (caught on lines 86, 385) — defined in `app.exceptions` but missing from the import block.
- `JailOperationError` (caught on lines 123, 174) — same module, same omission.
- `_bad_request` (called on lines 87, 337, 386) — the file defines `_bad_gateway()` but the symmetrical `_bad_request()` helper was never added.
- `log` (used on line 196 in the restart handler) — a `structlog` logger that every other module in the project creates with `log = structlog.get_logger()`, but this module skips it entirely.
### Task 2 — Move the five `ConfigFile*` exceptions out of `raw_config_io_service.py`
Additionally, `setup_service` is imported at the top level (line 20) and then redundantly re-imported inside two function bodies (lines 289, 327), which triggers a flake8/Pylance redefinition warning. The top-level import is unused.
**Found in:** `backend/app/services/raw_config_io_service.py` lines 66100 (`ConfigDirError`, `ConfigFileNotFoundError`, `ConfigFileExistsError`, `ConfigFileWriteError`, `ConfigFileNameError`). `backend/app/routers/file_config.py` lines 5561 imports them directly from the service.
Fix steps:
1. Add `from app.exceptions import ConfigOperationError, JailOperationError` to the import block.
2. Add `log: structlog.stdlib.BoundLogger = structlog.get_logger()` at module level, following the same pattern as every other service/router.
3. Add the `_bad_request` helper directly below `_bad_gateway`:
```python
def _bad_request(message: str) -> HTTPException:
return HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=message,
)
```
4. Remove `setup_service` from the top-level `from app.services import ...` line; keep only the two local imports where it is actually used, or hoist one of them to the top and remove the other.
**Goal:** Move all five class definitions into `app/exceptions.py`. Update `raw_config_io_service.py` to import and raise them from there. Update `file_config.py` to import them from `app.exceptions`.
**Possible traps:**
- The file also uses `config_service.ConfigOperationError` in one place (line 385) — that reference goes through the module attribute and therefore works today even without a direct import. After adding the bare import, both references will resolve to the same class, which is correct. Verify nothing accidentally catches the wrong exception.
- Removing `setup_service` from the top-level import while keeping the local imports means the local imports must stay in sync. Consider hoisting to top-level for consistency with every other router.
**Possible traps and issues:**
- The existing names must be preserved (they are part of the public raise/catch contract); do not rename them.
- `file_config.py` currently imports from `app.services.raw_config_io_service` — that import line must change. Check whether any test files also import these classes directly from the service.
- After the move, confirm that `raw_config_io_service.py` still raises them (not the old local definitions that are now gone).
**Docs changes needed:**
None — this is a pure bug fix.
**Docs changes needed:** Update `Docs/Refactoring.md`.
**Why:**
Any user action that exercises an error path on the `/config/global`, `/config/reload`, `/config/restart`, or `/config/log` endpoints will receive an unhandled `NameError` 500 response instead of the intended 400/409/502. These are common failure modes (fail2ban unreachable, bad config value).
**Why this is needed:** Routers should not reach into service internals to import exception types. All domain exceptions belong in the central registry so `main.py` global handlers and any future middleware can catch them uniformly.
---
### TASK-02 — Wrap blocking `mkdir()` calls in `run_blocking()` 🔴
### Task 3 — Eliminate the `config_file_service.py` delegation façade
**Status:** Completed ✅
**Found in:** `backend/app/services/config_file_service.py` (1 136 lines). Every public function is a one-liner that delegates to `jail_config_service`, `filter_config_service`, or `action_config_service`. Those three sub-services import `config_file_service` back via lazy imports scattered across ~10 function bodies each, creating a hidden circular dependency.
**Where:**
- `backend/app/services/setup_service.py` — line 178, inside `async def _ensure_database_initialized()`
- `backend/app/startup.py` — line 73, inside `async def startup_shared_resources()`
**Goal:** Delete `config_file_service.py`. Update every caller (routers: `config_misc.py`, `jail_config.py`; services: anything that calls through the façade) to import the appropriate sub-service directly. Remove all lazy-import call sites (`from app.services import config_file_service as _cfs`) in `jail_config_service.py`, `filter_config_service.py`, and `action_config_service.py`.
**Goal:**
Both sites call `Path.mkdir()` synchronously inside async functions. Because the entire backend runs on a single-threaded asyncio event loop, a synchronous filesystem call — even one that normally returns in microseconds — blocks all other coroutines from running until it completes, including health-check tasks and in-flight requests. The fix is to wrap each call with the project's existing `run_blocking()` helper (from `app.utils.async_utils`), which offloads the call to a thread-pool executor.
**Possible traps and issues:**
- `config_file_service.py` also re-exports `start_daemon` and `wait_for_fail2ban` from `app.utils`. Find all callers of those two names and update their imports.
- The lazy imports exist specifically to avoid the circular dependency — simply removing them may surface an import cycle. Trace the actual call graph before deleting the file.
- `config_misc.py` router imports `config_file_service` for several operations; it will need multiple new imports to replace the façade.
- Large file change — high risk of regression. Comprehensive test run required before and after.
`startup.py` already imports `run_blocking` indirectly (it calls services that use it), but needs to import it directly. `setup_service.py` already imports and uses `run_blocking` elsewhere in the same file (line 93) — this is therefore a straightforward omission.
**Docs changes needed:** Update `Docs/Refactoring.md` and `Docs/Architekture.md` if it mentions this service.
Fix steps:
1. In `setup_service.py` around line 178:
```python
# Before
parent_dir.mkdir(parents=True, exist_ok=True)
# After
await run_blocking(parent_dir.mkdir, parents=True, exist_ok=True)
```
2. In `startup.py` around line 73, add the import if absent and apply the same pattern.
**Possible traps:**
- `Path.mkdir` takes keyword-only args. `run_blocking` must accept `**kwargs` or use a lambda: `await run_blocking(lambda: db_path.parent.mkdir(parents=True, exist_ok=True))`. Check the signature of `run_blocking` before picking the form.
- The `PermissionError` that `setup_service` deliberately catches must still be handled in the same try/except block — the await must stay inside the try.
**Docs changes needed:**
None — this is a pure bug fix.
**Why:**
Blocking I/O on the event loop is forbidden by the architecture's "Async Everything" principle. Under load or during startup, a slow filesystem (e.g. NFS, encrypted volume) could stall the entire server for hundreds of milliseconds.
**Why this is needed:** Pure delegation façades add indirection with no abstraction benefit and obscure the true dependencies of the system. The hidden circular dependency via lazy imports is a structural risk — a refactor inside any of the three sub-services could easily break the cycle in unexpected ways.
---
### TASK-03 — Remove inverted dependency: service importing a task ✅
### Task 4 — Move raw SQL query out of `history_service.py` into a repository
**Status:** Completed ✅
**Found in:** `backend/app/services/history_service.py` line 70: `db.execute("SELECT MAX(timeofban) FROM history_archive")`.
**Where:**
`backend/app/services/jail_config_service.py` — line 37:
```python
from app.tasks.health_check import run_probe
```
Used on lines 295 and 571 to fire a health probe after activating or deactivating a jail.
**Goal:** Add a `get_max_timeofban() -> datetime | None` function to `backend/app/repositories/history_repo.py` (or the appropriate history archive repository). Replace the inline `db.execute` call in the service with a call to that repository function.
**Goal:**
The dependency graph mandates `Tasks → Services`, never `Services → Tasks`. `run_probe` is already exposed by `health_service.py` (or can be). The fix is to remove the task import from the service and call the equivalent functionality through the service layer instead.
**Possible traps and issues:**
- Check whether `history_archive` is queried via the same `aiosqlite.Connection` that the repository layer already uses. Confirm the repository file exists and is the right home (there may be separate `history_repo` and `history_archive_repo`).
- The column `timeofban` must be mapped to the correct Python type (likely `datetime`) — match the convention of the surrounding repository functions.
- Unit tests for `history_service` that mock the db connection will need to be updated to mock the repository call instead.
Options:
1. If `health_service.probe()` is already equivalent, replace the two `run_probe(app)` calls with `await health_service.probe(socket_path)` (passing the socket path directly, avoiding the `FastAPI` argument entirely — see also TASK-05).
2. If `run_probe` does more than just probe (e.g. it also writes to `app.state`), extract that logic into a function in `health_service` and call that from both the task and the service.
**Docs changes needed:** None beyond `Refactoring.md`.
**Possible traps:**
- `run_probe` in `health_check.py` accepts `app: FastAPI` and uses `app.state` to cache the probe result. Replacing it with a direct `health_service` call means the cached status in `app.state` will not be updated. Decide whether the caller needs the cache updated (if yes, pass the relevant state as a parameter; if no, a direct service call is sufficient).
- After this change, verify that `tasks/health_check.py` still calls `run_probe` internally and that no other module relied on the cross-import.
**Docs changes needed:**
Architecture doc section 2.2 (Tasks) should note that tasks may call services, but services must not call tasks.
**Why:**
The inverted import creates a hidden circular-dependency risk: if `health_check` ever imports anything from `jail_config_service`, Python will raise an `ImportError`. It also makes `jail_config_service` harder to test in isolation because importing it pulls in the scheduler task module.
**Why this is needed:** Raw SQL in the service layer bypasses the repository abstraction. It makes the service harder to test (requires a real DB schema) and harder to maintain (schema changes must be tracked in the service, not just the repository).
---
### TASK-04 — Remove `FastAPI` app instance from service signatures 🟠
### Task 5 — Remove the reversed dependency from `blocklist_service.py` to `app.tasks`
**Status:** Completed ✅
**Found in:** `backend/app/services/blocklist_service.py` line 530: `from app.tasks import blocklist_import as blocklist_import_task`. Services must not import from tasks; tasks are consumers of services, not dependencies of them.
**Where:**
`backend/app/services/jail_config_service.py` — `activate_jail()` and `deactivate_jail()` function signatures accept `app: FastAPI` (guarded by `TYPE_CHECKING` on line 48, used at runtime on lines ~295, ~571).
**Goal:** Invert the dependency. The functionality needed from the task (triggering or querying a scheduled blocklist import) should be expressed as a callable or scheduler reference that is passed into the service function, or the service should expose a pure data/state function that the task calls rather than the service calling into the task.
**Goal:**
Services must not depend on the FastAPI framework. The `app` object is passed solely so `run_probe(app)` can update `app.state`. After TASK-03 removes the `run_probe` dependency, this parameter becomes unnecessary. If any state from `app.state` is still needed (e.g. `socket_path`, `scheduler`), pass only those specific values — not the entire app.
**Possible traps and issues:**
- Understand what exactly the service does with `blocklist_import_task` — if it schedules or triggers a job, the job scheduler (`app.state.scheduler`) should be passed in as a parameter rather than imported from tasks.
- This may require adding a parameter to the affected service function and updating its callers (the router).
- If the task import is conditional/lazy, it still violates the dependency rule regardless of when it resolves.
Fix steps:
1. Complete TASK-03 first.
2. Remove the `app: FastAPI` parameter from `activate_jail` and `deactivate_jail`.
3. Remove the `if TYPE_CHECKING: from fastapi import FastAPI` block if it is no longer referenced anywhere in the file.
4. Update all callers (routers and tests) to stop passing `app`.
**Docs changes needed:** Update `Docs/Refactoring.md` and `Docs/Architekture.md` (task→service dependency direction).
**Possible traps:**
- The routers that call `activate_jail` inject `request: Request` specifically to access `request.app`. After removing the parameter, those routers should no longer need `Request` — check whether it is used for anything else in those handlers.
- Any test that constructs a mock `FastAPI` instance just to pass it here can be simplified.
**Docs changes needed:**
None beyond removing the `FastAPI` reference from the service signature.
**Why:**
Framework types in the service layer violate the Dependency Inversion principle, tightly couple service logic to the HTTP layer, and make unit testing significantly harder — requiring a FastAPI application instance just to call a domain function.
**Why this is needed:** Circular layer dependencies (`service → task → service`) make it impossible to test services in isolation and create hidden initialisation-order coupling that can cause import errors or subtle bugs at startup.
---
### TASK-05 — Extract business logic out of `tasks/history_sync.py` into a service
### Task 6 — Extract log reading and service status out of `config_service.py`
**Where:**
`backend/app/tasks/history_sync.py` — lines 1516 import repositories directly; the entire `_run_sync_with_settings()` function contains paging logic, backfill window calculation, and `archive_ban_event` calls that constitute business logic.
**Found in:** `backend/app/services/config_service.py` line 687 (`read_fail2ban_log`) and line 781 (`get_service_status`). Both are unrelated to configuration management and do not belong in this service.
**Goal:**
Tasks must only schedule and invoke. All logic — the backfill window decision, the paging loop, the SQL cursor over the fail2ban DB, the archive write — belongs in a service. Create (or extend) `history_service.py` with a `sync_from_fail2ban_db()` function that accepts a DB connection and socket path, encapsulates the entire sync algorithm, and returns a count of synced records. The task then becomes a thin wrapper:
**Goal:** Move `read_fail2ban_log` into `log_service.py` (the dedicated log-reading service). Move `get_service_status` into `health_service.py`. Update the calling routers (`config_misc.py` uses both) to import from the correct service.
```python
async def _run_sync_with_settings(settings: Settings) -> None:
db = await open_db(settings.database_path)
try:
synced = await history_service.sync_from_fail2ban_db(db, settings.fail2ban_socket)
log.info("history_sync_complete", synced=synced)
finally:
await db.close()
```
**Possible traps and issues:**
- Check whether `log_service.py` and `health_service.py` already have similar functions; avoid duplication and reconcile signatures.
- `config_misc.py` currently imports `config_service` for these; after the move it will need two additional service imports.
- Existing tests targeting `config_service` that cover these functions will need to be moved to the appropriate test file.
**Possible traps:**
- `history_service` currently uses the fail2ban database read-only. The new function writes to the BanGUI archive table — make sure the connection passed is the BanGUI DB, not the fail2ban DB. The fail2ban DB path is resolved via `get_fail2ban_db_path()` inside the service function.
- The `BACKFILL_WINDOW` and `HISTORY_SYNC_INTERVAL` constants can stay in the task file (scheduling concern) but the logic that decides whether to backfill must move to the service.
- Ensure existing tests for `history_sync` are updated to test the service function, not the task internals.
**Docs changes needed:** Update `Docs/Refactoring.md`. Adjust any architecture diagram that shows `config_service` as the home for log/status operations.
**Docs changes needed:**
Architecture doc section on Services: add `history_service.sync_from_fail2ban_db` to the service table. Architecture doc section on Tasks: update `history_sync` description to reflect that it delegates to `history_service`.
**Why:**
Business logic in the task layer cannot be unit-tested without spinning up a scheduler. It also means any future consumer of the sync logic (e.g. a manual trigger endpoint) would have to duplicate the code or import from the task module.
**Why this is needed:** Single-responsibility principle. A configuration management service owning log reading and daemon status checks makes the service harder to reason about and violates the established service split.
---
### TASK-06 — Break circular dependency between `config_file_service` and the three config sub-services ✅
### Task 7 — Deduplicate `get_map_color_thresholds` / `update_map_color_thresholds`
**Status:** Completed ✅
**Found in:** Implementations exist in `backend/app/services/config_service.py`, `backend/app/services/setup_service.py`, and `backend/app/utils/setup_utils.py` (the latter used by both). The router `config_misc.py` calls `setup_service.get_map_color_thresholds()` for ongoing map color operations that have nothing to do with first-run setup.
**Where:**
- `services/jail_config_service.py` line 33 → imports `config_file_service` at module level
- `services/filter_config_service.py` line 26 → same
- `services/action_config_service.py` line 28 → same
- `services/config_file_service.py` lines 1027, 1039, 1107, 1118, 1128, 1140 → lazy-imports `jail_config_service`; lines 13031367 → lazy-imports `filter_config_service`; lines 1752+ → lazy-imports `action_config_service`
**Goal:** Consolidate the single authoritative implementation in one appropriate service — likely a dedicated `settings_service.py` or inside `config_service.py`. Remove the duplicate copies. Update `config_misc.py` to import from the canonical location. Remove `get_map_color_thresholds` / `set_map_color_thresholds` from `setup_utils.py` if they only exist to serve this use case.
**Goal:**
The three sub-services depend on `config_file_service` for shared parsing utilities, while `config_file_service` calls back into the three sub-services via lazy in-function imports — the only reason it works is that the circular import is deferred until first call. This is fragile: any reorganisation that removes the lazy-import guard will cause an `ImportError` at startup.
**Possible traps and issues:**
- `setup_utils.py` imports from `repositories` (itself a violation — see Task 10). Consolidating here creates a dependency chain; resolve Task 10 either before or as part of this task.
- `setup_service.py` is conceptually a first-run service; if callers of the consolidated function include both first-run setup and runtime settings, choose the runtime home and let `setup_service` call it, not the other way.
- Tests that mock `setup_service.get_map_color_thresholds` in the router tests will need updating.
The correct fix is to break the cycle by extracting the shared logic that `config_file_service` currently delegates to the sub-services into a lower-level module. Concretely:
**Docs changes needed:** Update `Docs/Refactoring.md`.
1. Identify every function that `config_file_service` calls via lazy import from the sub-services (e.g. `list_inactive_jails`, `_activate_jail`, `list_filters`, `get_filter`, `list_actions`, etc.).
2. Determine whether `config_file_service` should own those functions directly (moving them out of the sub-services), or whether `config_file_service` should simply be split so the shared parsing utilities live in a `config_file_utils` module that all four services can import without circularity.
3. Remove all lazy in-function imports from `config_file_service`.
The safest first step is option 2: rename the pure parsing/utility functions in `config_file_service` into a new `utils/config_file_utils.py` (or extend the existing one), then have all four service files import from the util, leaving `config_file_service` to handle only orchestration.
**Possible traps:**
- `config_file_service` is over 1700 lines. Splitting it requires careful identification of which functions are pure utilities (no side effects, no DB/socket calls) vs. orchestration. Start with utility functions only.
- Any function moved to utils must not import from services. If it currently does, the abstraction is wrong and needs further decomposition.
- Test coverage for existing behaviour must be maintained throughout the refactor. Run the full test suite after each extraction step.
**Docs changes needed:**
Architecture doc section 2.1 (Project Structure): add `utils/config_file_utils.py` if created. Section 2.2 (Utils): describe the new module. Update the Service descriptions for all four affected services to accurately describe their responsibilities post-split.
**Why:**
Lazy imports mask circular dependencies at import time but surface as confusing runtime errors if import order changes. The current design also makes it impossible to import any one of the four services for testing without transitively loading all four.
**Why this is needed:** Triplicated implementation violates DRY and means a change to the threshold schema must be made in three places. Using `setup_service` for ongoing runtime settings is conceptually wrong and misleads maintainers.
---
### TASK-07 — Move schedule management out of `routers/blocklist.py` into a service ✅
### Task 8 — Move the `activate_jail` 3-step restart workflow out of `config_misc.py` router
**Status:** Completed ✅
**Found in:** `backend/app/routers/config_misc.py` lines 193197: the `restart_fail2ban` handler directly orchestrates `config_file_service.stop_daemon()``config_file_service.start_daemon()``config_file_service.wait_for_fail2ban()` as a three-step inline sequence.
**Summary:** Router schedule endpoints now delegate schedule persistence and APScheduler job management to `blocklist_service`, removing the router's direct dependency on `app.tasks.blocklist_import`.
**Goal:** Extract this orchestration into a `restart_daemon()` function inside `config_file_service.py` (or a more appropriate service). The router handler should call that single function.
**Where:**
`backend/app/routers/blocklist.py` — line 51:
```python
from app.tasks import blocklist_import as blocklist_import_task
```
Used on lines 173, 206, 208 to read `blocklist_import_task.JOB_ID` and call `blocklist_import_task.reschedule(app)`.
**Possible traps and issues:**
- Error handling differs between the three steps; the new service function must preserve the existing error semantics (what happens if `start_daemon` fails must be the same as before).
- If `config_file_service.py` is being deleted (Task 3), coordinate which service owns this function after the façade is removed.
- `wait_for_fail2ban` is a polling operation — confirm any timeout behaviour is preserved.
**Goal:**
Routers must not know about APScheduler job IDs or the task module internals. Schedule management is business logic and belongs in a service. Create a `get_schedule()` and `update_schedule()` function in `blocklist_service.py` that encapsulate the scheduler interaction. The router calls the service function and returns its result.
**Docs changes needed:** Update `Docs/Refactoring.md`.
The router's `GET /schedule` and `PUT /schedule` handlers should be reduced to:
```python
return await blocklist_service.get_schedule(db, scheduler)
await blocklist_service.update_schedule(db, scheduler, body)
```
The `JOB_ID` constant and `reschedule()` call details move into `blocklist_service`. `blocklist_service` may import from the task module (tasks can be imported by services for constants/types), but the router should have no direct task dependency.
**Possible traps:**
- `blocklist_import_task.reschedule(app)` uses `app` to read `app.state`. If the service function receives the scheduler directly (already available via `SchedulerDep`), it can avoid the `app` dependency. Pass `scheduler` explicitly instead.
- The `SchedulerDep` injection in the router should remain unchanged — the scheduler is a dependency, not a business concern, and injecting it via `Depends()` is correct.
**Docs changes needed:**
Architecture doc Service table: add `get_schedule` and `update_schedule` to the `blocklist_service` entry.
**Why:**
The router knows the internal job ID string and directly manipulates the scheduler — both are implementation details of the task/scheduling layer. This makes the endpoint hard to test without a running scheduler instance.
**Why this is needed:** Multi-step orchestration workflows are business logic and do not belong in HTTP handler functions. Keeping them in the service layer makes the workflow testable in isolation and reusable if a second endpoint ever needs to restart the daemon.
---
### TASK-08 — Remove inverted layer dependency: `utils/fail2ban_db_utils.py` importing a service ✅
### Task 9 — Move `sign_session_token` call out of `auth.py` router into `auth_service.login()`
**Status:** Completed ✅
**Found in:** `backend/app/routers/auth.py` line 74 calls `sign_session_token(token, secret)` after calling `auth_service.login()`. The raw token is returned by the service; the signing is performed in the router.
**Where:**
`backend/app/utils/fail2ban_db_utils.py` — line 8:
```python
from app.services.fail2ban_metadata_service import default_fail2ban_metadata_service
```
**Goal:** `auth_service.login()` should return an already-signed token. The router should receive the final cookie value from the service without needing to know about the signing step. Move the `sign_session_token` call into `auth_service.login()`.
**Goal:**
`utils/` must be the lowest layer — it may not import from `services/`. The `fail2ban_db_utils` module is currently a thin wrapper around `default_fail2ban_metadata_service`. Two options exist:
**Possible traps and issues:**
- `auth_service.login()` currently has a specific return type; changing what it returns (raw token → signed token) must be reflected in the return type annotation and any tests that assert on the raw token value.
- `sign_session_token` is currently a public function exported from `auth_service`. After this change it may become a private implementation detail (`_sign_session_token`). Check for any other callers before hiding it.
- Tests that test `login()` in isolation and currently check for the raw token will need to be updated to verify the signed format instead (or mock `sign_session_token` separately).
1. **Move the singleton down**: Move `Fail2BanMetadataService` (and its `default_fail2ban_metadata_service` singleton) from `services/` to `utils/`. The class has no service dependencies — it only uses `Fail2BanClient` from `utils/` — so it belongs in `utils/` naturally.
2. **Eliminate the util wrapper**: Delete `fail2ban_db_utils.get_fail2ban_db_path()` and have all callers import from `fail2ban_metadata_service` directly (since it is a service and services can import other services).
**Docs changes needed:** Update `Docs/Refactoring.md`.
Option 1 is cleaner and consistent with the architecture: `Fail2BanMetadataService` is essentially an async-safe caching wrapper around a socket command, which is a utility concern.
**Possible traps:**
- If `Fail2BanMetadataService` is moved to `utils/`, the import path changes everywhere it is referenced. Use IDE rename-symbol to update all references atomically.
- The architecture doc currently does not list `fail2ban_metadata_service` at all. After moving it to utils, it must be documented in the Utils table.
- Any service that currently imports `fail2ban_db_utils` (rather than `fail2ban_metadata_service`) will need their imports updated if `fail2ban_db_utils` is removed.
**Docs changes needed:**
Architecture doc section 2.2 (Utils): add `fail2ban_db_utils.py` (or its replacement) and `fail2ban_metadata_service` to the Utils table.
**Why:**
A util importing a service creates an implicit upward dependency that the architecture explicitly forbids. It also makes `fail2ban_db_utils` untestable in isolation — importing it drags in the service and its lazy-loaded metadata cache.
**Why this is needed:** Security-sensitive token construction is business logic. Letting the HTTP layer build cookie values means a future maintainer adding a second login endpoint could forget the signing step, creating a silent security regression.
---
### TASK-09 — Move `re_resolve_geo` orchestration into `geo_service` ✅
### Task 10 — Remove the repository import from `utils/setup_utils.py`
**Status:** Completed ✅
**Found in:** `backend/app/utils/setup_utils.py` line 5: `from app.repositories import settings_repo`. The `utils` layer must not import from `repositories`; that is the service layer's job.
**Where:**
`backend/app/routers/geo.py` — `async def re_resolve_geo()` (lines ~142174).
**Goal:** Move any function in `setup_utils.py` that depends on `settings_repo` into `setup_service.py` (which is already allowed to import repositories). The remaining pure utility functions (password hashing, etc.) can stay in `setup_utils.py`.
**Goal:**
The endpoint currently performs a multi-step workflow inline:
1. Fetch all unresolved IPs from the DB.
2. Early-return if the list is empty.
3. Clear the negative cache.
4. Batch-resolve all IPs.
5. Aggregate the count of newly resolved entries.
**Possible traps and issues:**
- `setup_utils.py` is imported by `auth_service.py` (`from app.utils.setup_utils import get_password_hash`). This import is fine and must be preserved; only the repository-dependent functions move.
- Identify every function in `setup_utils.py` that directly or indirectly calls `settings_repo`. Move the entire call chain, not just the top-level function.
- The functions being moved may already have analogues in `setup_service.py`; check for duplication before moving.
Steps 15 are business logic. The router should call a single service function and return its result:
**Docs changes needed:** Update `Docs/Refactoring.md` and `Docs/Architekture.md` (layer diagram).
```python
result = await geo_service.re_resolve_all(db, http_session)
return result # {"resolved": N, "total": M}
```
Add `async def re_resolve_all(db, http_session) -> dict[str, int]` to `geo_service.py` containing the current implementation.
**Possible traps:**
- `clear_neg_cache()` is a synchronous function that mutates the module-level `_neg_cache` dict. Calling it from inside the service function is fine, but be aware that it runs on the event loop thread with no lock. This is an existing race condition (see TASK-11) — do not introduce additional concurrent callers without addressing that issue first.
- The return type `dict[str, int]` is an untyped dict. Consider adding a proper response model `GeoReResolveResponse` to `models/geo.py` to make the API contract explicit.
**Docs changes needed:**
Architecture doc Service table: add `re_resolve_all` to the `geo_service` entry.
**Why:**
Multi-step orchestration in the router violates the zero-business-logic rule. It also prevents the re-resolve logic from being triggered from any other context (e.g. a scheduled task or CLI tool) without duplicating code.
**Why this is needed:** Utils are stateless helpers with no external dependencies. Allowing them to import from the repository layer breaks the architectural contract, makes the utilities untestable without a database, and obscures what is actually a service-layer operation.
---
### TASK-10 — Move `GeoInfo → GeoDetail` translation out of the router ✅
### Task 11 — Centralise `DbDep` — remove local redefinitions in `blocklist.py` and `geo.py`
**Status:** Completed
**Found in:** `backend/app/routers/blocklist.py` line 54 redefines `DbDep = Annotated[aiosqlite.Connection, Depends(get_db)]` locally. `backend/app/routers/geo.py` uses the inline form `Annotated[aiosqlite.Connection, Depends(get_db)]` directly in function signatures instead of importing `DbDep`.
**Where:**
`backend/app/routers/geo.py` — `async def lookup_ip()`, lines ~8593:
```python
raw_geo = result["geo"]
geo_detail: GeoDetail | None = None
if isinstance(raw_geo, GeoInfo):
geo_detail = GeoDetail(
country_code=raw_geo.country_code,
...
)
```
**Goal:** Delete the local redefinitions. Import `DbDep` from `app.dependencies` in both routers.
**Goal:**
The router performs conditional type checking and cross-schema mapping. The cleanest fix is to have `jail_service.lookup_ip()` return a typed dict whose `"geo"` field is already `GeoDetail | None`, eliminating the isinstance check entirely. Alternatively, add a `GeoInfo.to_detail()` method or a `geo_service.to_detail()` converter that the router calls in one line.
**Possible traps and issues:**
- Confirm that `DbDep` is already exported from `app/dependencies.py` (verify the exact name used there).
- The inline form in `geo.py` may appear in multiple function signatures; all occurrences must be replaced.
- No behaviour change — purely a refactor. Run the test suite to confirm.
This is a small but clean change: update `IpLookupResult` TypedDict in `jail_service` so that `"geo"` is typed as `GeoDetail | None`, and have the service construct the `GeoDetail` before returning.
**Docs changes needed:** None beyond `Refactoring.md`.
**Possible traps:**
- `IpLookupResult` is currently typed with `"geo": GeoInfo | None`. Changing it to `GeoDetail` means callers that rely on `GeoInfo`-specific fields (like `asn`) must use `GeoDetail` instead — confirm `GeoDetail` exposes the same fields.
- If `GeoInfo` and `GeoDetail` are nearly identical, consider merging them into a single model to avoid this translation layer entirely.
**Docs changes needed:**
None — this is an internal type alignment.
**Why:**
Schema translation in the router adds fragility: if either model changes, the mapping code must be updated and the connection is not enforced by the type system. Having the service return the correct type is safer and is consistent with the "routers return typed responses" principle.
**Why this is needed:** A single definition of `DbDep` ensures that any future change to the db dependency (e.g. adding row factory configuration) is applied uniformly across all routers.
---
### TASK-11 — Fix `asyncio.Lock()` created at module import time in `jail_service.py` ✅
### Task 12 — Complete the protocol injection layer or remove it
**Status:** Completed ✅
**Found in:** `backend/app/dependencies.py` defines `AuthServiceDep`, `JailServiceDep`, `ConfigServiceDep`, `GeoServiceDep`, `HistoryServiceDep`, `BlocklistServiceDep`, `HealthServiceDep`, `ServerServiceDep` as protocol-typed dependency aliases. Only `routers/auth.py` and `routers/jails.py` actually use these; the other 16 routers import concrete service modules directly.
**Where:**
`backend/app/services/jail_service.py` — lines 71 and 78:
```python
_reload_all_lock: asyncio.Lock = asyncio.Lock()
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()
```
**Goal:** Make a deliberate decision and enforce it consistently. Option A: adopt the protocol injection pattern everywhere — update all 16 non-compliant routers to accept their service via the typed `Dep` alias. Option B: acknowledge the pattern is unused overhead and remove the protocol aliases and `cast()` wrappers from `dependencies.py`, letting all routers import concrete services directly (the current de facto standard). Option B is cheaper; Option A makes service substitution in tests easier.
**Goal:**
`asyncio.Lock()` created before the event loop starts is bound to whichever loop exists at import time. In Python 3.10+ this is generally safe for a single-worker deployment using uvicorn, but it breaks test suites that create a new event loop per test (e.g. `pytest-asyncio` with `loop_scope="function"`), causing a `RuntimeError: Task attached to a different loop` error when the second test tries to acquire the lock.
**Possible traps and issues:**
- Option A requires updating all 16 routers and their test files simultaneously; this is a broad change with high regression risk. Stage it one router at a time.
- Option B means deleting `services/protocols.py` (398 lines) and all `cast("…Service", …)` calls in `dependencies.py`. Ensure nothing outside this layer references the Protocol classes (check test files).
- `dependencies.py` also has repository protocol aliases (`BlocklistRepository`, `ImportLogRepository`, etc.) — decide whether those follow the same fate.
Fix: initialise the locks lazily on first use:
```python
_reload_all_lock: asyncio.Lock | None = None
_backend_cmd_lock: asyncio.Lock | None = None
**Docs changes needed:** Update `Docs/Architekture.md` to reflect the chosen pattern. Update `Docs/Refactoring.md`.
def _get_reload_lock() -> asyncio.Lock:
global _reload_all_lock
if _reload_all_lock is None:
_reload_all_lock = asyncio.Lock()
return _reload_all_lock
```
Then replace `async with _reload_all_lock:` with `async with _get_reload_lock():`.
**Possible traps:**
- Lazy initialisation is not thread-safe if multiple coroutines call the getter simultaneously before initialisation. In an asyncio context this is safe (single-threaded event loop), but add a brief comment explaining why.
- If the test suite currently happens to pass despite this issue (because `pytest-asyncio` is configured with `loop_scope="session"`), the bug is latent rather than active — still worth fixing for configuration flexibility.
**Docs changes needed:**
None.
**Why:**
Module-level asyncio primitives are a known footgun in Python async codebases. Fixing this is low-risk and makes the test suite more robust against pytest-asyncio configuration changes.
**Why this is needed:** Inconsistency between `auth.py`/`jails.py` (uses protocols) and every other router (bypasses them) makes the codebase confusing and the injection layer misleading. This is the most widespread single structural inconsistency in the router layer.
---
### TASK-12 — Remove duplicate import in `server_service.py` ✅
### Task 13 — Move `ban_ip`, `unban_ip`, and `get_active_bans` from `jail_service` to `ban_service`
**Where:**
`backend/app/services/server_service.py` — lines 1718:
```python
from app.exceptions import ServerOperationError
from app.exceptions import ServerOperationError
```
**Found in:** `backend/app/services/jail_service.py` contains `ban_ip`, `unban_ip`, and `get_active_bans`. These operations conceptually belong in `ban_service.py`, which is the declared home for ban management. Routers `bans.py` and `blocklist.py` already import `jail_service` specifically for these functions.
**Goal:**
Delete line 18 (the exact duplicate). One-line change.
**Goal:** Move the three functions into `ban_service.py`. Update `bans.py` and `blocklist.py` to import from `ban_service` instead of `jail_service`. Remove the now-redundant `jail_service` import from those routers if it is no longer needed.
**Possible traps:**
None.
**Possible traps and issues:**
- `ban_ip` and `unban_ip` likely use the fail2ban socket through utilities shared with the rest of `jail_service`. Confirm there is no shared private helper that needs to move with them.
- `get_active_bans` may overlap with existing functions in `ban_service`. Check for duplication and merge if needed.
- If `jail_service` tests cover these functions, move those test cases to the `ban_service` test file.
**Docs changes needed:**
None.
**Docs changes needed:** Update `Docs/Architekture.md` service responsibility table. Update `Docs/Refactoring.md`.
**Why:**
Duplicate imports generate linter warnings, add noise to diffs, and signal that the file was edited carelessly.
**Why this is needed:** Function placement that contradicts the declared service boundary makes it harder to find behaviour during maintenance and violates the principle of least surprise. The service name should predict where operations live.
---
### TASK-13 — Document `fail2ban_metadata_service` and `history_sync` task in the architecture doc 🟡
### Task 14 — Lift geo-enrichment closure construction out of `history.py` router
**Status:** Completed ✅
**Found in:** `backend/app/routers/history.py` lines 110, 147, and 189 each build a `_enricher` async closure and pass it to `history_service` calls. The closure captures `geo_service` and `http_session` from the outer scope and adapts the geo lookup interface for the service.
**Summary:** Added `fail2ban_metadata_service.py` to the service documentation and confirmed `history_sync.py` is documented as delegating to `history_service.py` in `Docs/Architekture.md`.
**Goal:** Move the closure construction into `history_service.py` or into a shared helper in `geo_service.py`. The router should pass the `http_session` and let the service build the enricher internally, or the service should accept `geo_service` as an injected callable.
**Where:**
`Docs/Architekture.md` — Services table (section 2.2) and Tasks table (section 2.2).
**Possible traps and issues:**
- `history_service` already accepts an optional `geo_enricher: GeoEnricher | None` parameter — this is the hook. The router's closures are just adapters for that parameter. Either the service gains a higher-level function that takes `http_session` directly, or a factory function in `geo_service` builds the enricher.
- The same pattern exists in `geo.py` and `ban_service.py` — a consistent solution should handle all three callsites, not just `history.py`.
- Changing the `history_service` function signature will require updating tests.
**Goal:**
Two modules exist in the codebase that are absent from the architecture document:
**Docs changes needed:** Update `Docs/Refactoring.md`.
1. `services/fail2ban_metadata_service.py` — An async-safe caching wrapper that resolves the fail2ban SQLite database path by querying the socket once and caching the result. Exports `default_fail2ban_metadata_service`, a module-level singleton. Should be listed in the Services table (or moved to Utils — see TASK-08) with a description of its caching behaviour and the singleton pattern.
**Why this is needed:** Adapter/closure construction is glue code that belongs at the service boundary or in a factory, not in the HTTP handler. Routers should not need to understand the geo service's lookup interface to serve history requests.
2. `tasks/history_sync.py` — A scheduled background task (interval: 300 seconds) that copies new ban records from the fail2ban database into BanGUI's own archive table to preserve history that would otherwise be purged. Should be added to the Tasks table.
---
Additionally note in section 6 (Authentication): the `_session_cache` in `dependencies.py` noted there is now implemented in `utils/session_cache.py` — confirm the section still accurately reflects the implementation.
### Task 15 — Fix stale activation record on failed `activate_jail`
**Possible traps:**
- If TASK-08 is completed before this task, the location of `fail2ban_metadata_service` will change. Write the doc entry after TASK-08 is decided.
**Found in:** `backend/app/routers/jail_config.py` line 385. `record_activation(app, name)` is called unconditionally before the service call. When the service raises any exception, no handler clears `last_activation`. For the following 60 seconds, the health-check task will misattribute any fail2ban offline event to this failed activation, potentially creating a spurious `PendingRecovery`.
**Docs changes needed:**
`Docs/Architekture.md`:
- Add `fail2ban_metadata_service.py` row to the Services (or Utils) table.
- Add `history_sync.py` row to the Tasks table.
- Cross-reference TASK-05 once it is complete (the task should delegate to a service).
**Goal:** Wrap the service call in a try/except that clears `last_activation` on any exception before re-raising. Alternatively, only call `record_activation` after a successful return.
**Why:**
The architecture document is a contract between contributors. Undocumented modules lead to inconsistent patterns and duplicate implementations when new developers add features that the undocumented module already handlzttttttttt nbbbbbbbbbbes.
**Possible traps and issues:**
- Moving `record_activation` to after the `await` means the timing is slightly later (by the service call duration). This is acceptable because the 60-second window is generous.
- Ensure every exception branch (not just the ones currently handled) triggers the cleanup. Use a `try/finally` to clear the record on failure, not individual `except` blocks.
- The unused `activation_time` return value of `record_activation` should be removed if moving the call eliminates the need for it.
**Docs changes needed:** Update `Docs/Refactoring.md`.
**Why this is needed:** A false `PendingRecovery` presented to the user offers a rollback for a jail that was never successfully activated. This could confuse operators into performing a rollback that modifies configuration unnecessarily.
---
### Task 16 — Add a lock to `geo_service.py` module-level mutable state
**Found in:** `backend/app/services/geo_service.py` lines 99110: `_cache: dict`, `_neg_cache: dict`, `_dirty: set`, and `_geoip_reader` are mutable module-level singletons with no concurrency control. Background tasks (`geo_cache_flush`, `geo_re_resolve`) and request handlers all read and write these concurrently.
**Goal:** Add an `asyncio.Lock` protecting mutations of `_cache`, `_neg_cache`, and `_dirty`. The lock should be acquired for any write and for the copy-then-clear pattern in `flush_dirty`. Read-only accesses that are single-await-free are safe without a lock in the single-threaded asyncio model, but mutation sites must be explicit.
**Possible traps and issues:**
- The comment in `flush_dirty` states that no `await` between copy and clear makes it safe today; a lock is still preferable to ensure the invariant is enforced rather than relied upon implicitly.
- Avoid holding the lock across network I/O (e.g. the ip-api.com fetch in `lookup`). Acquire it only around the dict/set mutation itself.
- `asyncio.Lock` is not thread-safe if `run_in_executor` is used anywhere in the geo path — verify `_cache` is never read from a thread pool.
- Introducing a lock is low risk but adds overhead on every cache write; profile if the geo cache is a hot path.
**Docs changes needed:** Update `Docs/Refactoring.md`.
**Why this is needed:** The current safety is implicit and fragile. A future change that adds an `await` inside the critical section (e.g. logging to a remote sink) would silently introduce data loss in the dirty-flush path. An explicit lock documents the intent and makes the safety guarantee unconditional.
---
### Task 17 — Move crash-detection logic out of the `health_check` task
**Found in:** `backend/app/tasks/health_check.py` lines 70113. The `_run_probe_with_resources` function contains a state machine that checks `last_activation` timing, detects `online→offline` transitions, and writes `PendingRecovery` records. This is domain business logic embedded in a scheduling function.
**Goal:** Extract the crash-detection state machine into a function in `backend/app/utils/runtime_state.py` or into a new `recovery_service.py`. The health-check task should call that function with the previous status, the new status, and the current runtime state — and receive back the updated state without containing the decision logic itself.
**Possible traps and issues:**
- The state machine references `datetime.datetime.now(UTC)` — ensure the extracted function remains deterministic/injectable for testing by accepting `now` as an optional parameter.
- `runtime_state.py` already holds `record_activation`, `create_pending_recovery`, `clear_pending_recovery`. The extracted logic is a natural neighbour there.
- When moving the logic, preserve the exact 60-second window (`_ACTIVATION_CRASH_WINDOW`) and the guard that prevents overwriting an unresolved record.
- Tests for the health check task will need to be refactored: the task becomes trivially thin; the logic can now be unit-tested without a running scheduler.
**Docs changes needed:** Update `Docs/Refactoring.md`.
**Why this is needed:** Business rules about crash attribution timing should not live inside a scheduling artifact. Embedding decision logic in a background job makes it invisible, hard to test without the scheduler, and impossible to reuse from another trigger (e.g. a manual probe endpoint).
---
### Task 18 — Replace `__import__("datetime")` antipattern in `health_check.py`
**Found in:** `backend/app/tasks/health_check.py` lines 165166: `next_run_time=__import__("datetime").datetime.now(tz=__import__("datetime").timezone.utc)`. The `datetime` module is already imported at the top of the file on line 6.
**Goal:** Replace the `__import__` calls with the already-imported `datetime` name: `next_run_time=datetime.datetime.now(tz=datetime.timezone.utc)`.
**Possible traps and issues:**
- This is a trivial one-line change. Confirm the module-level `import datetime` exists and is not inside a `TYPE_CHECKING` block.
- Run the health-check related tests to confirm the scheduler still fires on startup.
**Docs changes needed:** None.
**Why this is needed:** `__import__()` is an implementation detail of Python's import system and should never appear in application code. It obscures the dependency, bypasses linting checks, and signals that the code was written reactively to solve a circular import that no longer exists.
---
### Task 19 — Remove the `setup_service` misuse in `config_misc.py` for runtime map-color settings
**Found in:** `backend/app/routers/config_misc.py` uses `setup_service.get_map_color_thresholds()` and `setup_service.update_map_color_thresholds()` for the ongoing `/api/config/map-colors` endpoints. `setup_service` is intended for first-run setup only.
**Goal:** After Task 7 consolidates the map-color functions into the canonical runtime service, update `config_misc.py` to call that service instead of `setup_service`. Ensure `setup_service` no longer exports these functions publicly.
**Possible traps and issues:**
- This task is a follow-on to Task 7; do not attempt it before Task 7 is complete.
- If `setup_service` still needs these functions during first-run setup, it should call the canonical service rather than owning the implementation.
- Router tests that mock `setup_service.get_map_color_thresholds` will need updating.
**Docs changes needed:** Update `Docs/Refactoring.md`.
**Why this is needed:** Using a first-run setup service for ongoing runtime operations misleads developers into modifying the wrong service when behaviour needs to change. It also creates an implicit dependency between setup state and runtime configuration.
---
### Task 20 — Add global exception handlers for domain exceptions in `main.py`
**Found in:** `backend/app/main.py`. Domain exceptions raised by services (`JailNotFoundError`, `ConfigValidationError`, `ConfigWriteError`, `FilterNotFoundError`, etc.) are caught individually with repetitive try/except blocks in each router handler. There are no global `@app.exception_handler` registrations for domain exceptions beyond `Fail2BanConnectionError` and `Fail2BanProtocolError`.
**Goal:** Register `@app.exception_handler` entries in `main.py` for the most commonly caught domain exception classes (at minimum: `JailNotFoundError` → 404, `ConfigValidationError` → 400, `ConfigWriteError` → 500). This allows router handlers to let these exceptions propagate naturally, removing boilerplate try/except blocks.
**Possible traps and issues:**
- Not all routers want the default status code; some endpoints return 409 or 400 for conditions that another endpoint maps to 500. Global handlers are only appropriate for exceptions with a consistent HTTP mapping across all consumers.
- Introduce global handlers incrementally: add the handler first, then remove the try/except from one router at a time and verify the test suite still passes.
- Ensure the handler response format matches the existing `{"detail": "…"}` convention so the frontend does not need updating.
- `HTTPException` must remain a local catch and not be accidentally swallowed by a broad domain handler.
**Docs changes needed:** Update `Docs/Refactoring.md`.
**Why this is needed:** The same try/except→HTTPException conversion is duplicated across every router endpoint. Global handlers reduce boilerplate, make error mapping auditable in one place, and prevent inconsistencies where the same exception produces different status codes in different routes.
---
### Task 21 — Restrict tasks to opening their own DB connections only where justified; document the pattern
**Found in:** `backend/app/tasks/blocklist_import.py`, `geo_cache_flush.py`, `geo_re_resolve.py`, and `history_sync.py`. Each task calls `open_db(settings.database_path)` directly rather than receiving a DB connection via dependency injection. This is structurally inconsistent with the request-level `get_db` dependency used by routers.
**Goal:** Document in `Docs/Architekture.md` that background tasks are intentionally outside FastAPI's dependency injection scope and must manage their own DB connections. Add a short shared helper (e.g. `tasks/_db.py` or reuse `open_db`) with a consistent context-manager pattern so all four tasks open/close connections the same way. If a task already has correct cleanup (try/finally close), confirm it and leave it; do not change what works.
**Possible traps and issues:**
- Do not attempt to force APScheduler background jobs through FastAPI's DI system — this is not supported without a running request context and would require thread-unsafe hacks.
- Focus on consistency: all tasks should use the same open/close pattern and handle exceptions the same way.
- If a task currently leaks a connection on exception, fix the cleanup; if it is correct, only add the documentation.
**Docs changes needed:** Add a "Background Tasks and Database Access" section to `Docs/Architekture.md` explaining why tasks own their connections and how to write a new task.
**Why this is needed:** Without explanation, the inconsistency between router DI-provided connections and task-managed connections looks like an oversight. Documentation prevents future developers from incorrectly trying to inject a DB connection into a task via `Depends`, which would fail silently at runtime.

View File

@@ -27,6 +27,24 @@ class ServerOperationError(Exception):
"""Raised when a server control command (e.g. refresh) fails."""
class Fail2BanConnectionError(Exception):
"""Raised when the fail2ban socket is unreachable or returns an error."""
def __init__(self, message: str, socket_path: str) -> None:
"""Initialize with a human-readable message and the socket path.
Args:
message: Description of the connection problem.
socket_path: The fail2ban socket path that was targeted.
"""
self.socket_path: str = socket_path
super().__init__(f"{message} (socket: {socket_path})")
class Fail2BanProtocolError(Exception):
"""Raised when the response from fail2ban cannot be parsed."""
class FilterInvalidRegexError(Exception):
"""Raised when a regex pattern fails to compile."""

View File

@@ -44,7 +44,7 @@ from app.routers import (
setup,
)
from app.startup import startup_shared_resources
from app.utils.fail2ban_client import Fail2BanConnectionError, Fail2BanProtocolError
from app.exceptions import Fail2BanConnectionError, Fail2BanProtocolError
from app.utils.runtime_state import ApplicationState, RuntimeState
from app.utils.session_cache import InMemorySessionCache, NoOpSessionCache
from app.utils.setup_state import is_setup_complete_cached, set_setup_complete_cache
@@ -162,7 +162,7 @@ async def _fail2ban_connection_handler(
Args:
request: The incoming FastAPI request.
exc: The :class:`~app.utils.fail2ban_client.Fail2BanConnectionError`.
exc: The :class:`~app.exceptions.Fail2BanConnectionError`.
Returns:
A :class:`fastapi.responses.JSONResponse` with status 502.
@@ -187,7 +187,7 @@ async def _fail2ban_protocol_handler(
Args:
request: The incoming FastAPI request.
exc: The :class:`~app.utils.fail2ban_client.Fail2BanProtocolError`.
exc: The :class:`~app.exceptions.Fail2BanProtocolError`.
Returns:
A :class:`fastapi.responses.JSONResponse` with status 502.

View File

@@ -23,7 +23,7 @@ from app.exceptions import JailNotFoundError, JailOperationError
from app.models.ban import ActiveBanListResponse, BanRequest, UnbanAllResponse, UnbanRequest
from app.models.jail import JailCommandResponse
from app.services import jail_service
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
router: APIRouter = APIRouter(prefix="/api/bans", tags=["Bans"])

View File

@@ -20,7 +20,7 @@ from app.models.config import (
ServiceStatusResponse,
)
from app.services import config_file_service, config_service, jail_service, log_service, setup_service
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
log: structlog.stdlib.BoundLogger = structlog.get_logger()

View File

@@ -24,7 +24,7 @@ from app.dependencies import (
)
from app.models.geo import GeoCacheStatsResponse, GeoReResolveResponse, IpLookupResponse
from app.services import geo_service, jail_service
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
router: APIRouter = APIRouter(prefix="/api/geo", tags=["Geo"])

View File

@@ -46,7 +46,7 @@ from app.services import (
filter_config_service,
jail_config_service,
)
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
from app.utils.runtime_state import (
clear_activation_record,
clear_pending_recovery,

View File

@@ -40,7 +40,7 @@ from app.models.jail import (
JailListResponse,
)
from app.services import jail_service
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
router: APIRouter = APIRouter(prefix="/api/jails", tags=["Jails"])

View File

@@ -16,7 +16,7 @@ from app.dependencies import AuthDep, Fail2BanSocketDep
from app.models.server import ServerSettingsResponse, ServerSettingsUpdate
from app.services import server_service
from app.exceptions import ServerOperationError
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
router: APIRouter = APIRouter(prefix="/api/server", tags=["Server"])

View File

@@ -27,6 +27,8 @@ from typing import TYPE_CHECKING, Protocol
import structlog
from app.exceptions import Fail2BanConnectionError, Fail2BanProtocolError
# ---------------------------------------------------------------------------
# Types
# ---------------------------------------------------------------------------
@@ -125,23 +127,6 @@ _RETRY_INITIAL_BACKOFF: float = 0.15 # seconds; doubles on each attempt
_COMMAND_SEMAPHORE_CONCURRENCY: int = 10
class Fail2BanConnectionError(Exception):
"""Raised when the fail2ban socket is unreachable or returns an error."""
def __init__(self, message: str, socket_path: str) -> None:
"""Initialise with a human-readable message and the socket path.
Args:
message: Description of the connection problem.
socket_path: The fail2ban socket path that was targeted.
"""
self.socket_path: str = socket_path
super().__init__(f"{message} (socket: {socket_path})")
class Fail2BanProtocolError(Exception):
"""Raised when the response from fail2ban cannot be parsed."""
def _send_command_sync(
socket_path: str,

View File

@@ -13,7 +13,7 @@ from app.config import Settings
from app.db import init_db
from app.main import create_app
from app.models.ban import ActiveBan, ActiveBanListResponse
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
# ---------------------------------------------------------------------------
# Fixtures

View File

@@ -121,7 +121,7 @@ class TestGetJailConfigs:
async def test_502_on_connection_error(self, config_client: AsyncClient) -> None:
"""GET /api/config/jails returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jail_config.config_service.list_jail_configs",
@@ -373,7 +373,7 @@ class TestReloadFail2ban:
async def test_502_when_fail2ban_unreachable(self, config_client: AsyncClient) -> None:
"""POST /api/config/reload returns 502 when fail2ban socket is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.config_misc.jail_service.reload_all",
@@ -458,7 +458,7 @@ class TestRestartFail2ban:
async def test_502_when_fail2ban_unreachable(self, config_client: AsyncClient) -> None:
"""POST /api/config/restart returns 502 when fail2ban socket is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.config_misc.jail_service.restart",
@@ -1965,7 +1965,7 @@ class TestGetFail2BanLog:
async def test_502_when_fail2ban_unreachable(self, config_client: AsyncClient) -> None:
"""GET /api/config/fail2ban-log returns 502 when fail2ban is down."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.config_misc.config_service.read_fail2ban_log",

View File

@@ -423,7 +423,7 @@ class TestIgnoreIpEndpoints:
async def test_get_ignore_list_502_on_connection_error(self, jails_client: AsyncClient) -> None:
"""GET /api/jails/sshd/ignoreip returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.get_ignore_list",
@@ -465,7 +465,7 @@ class TestIgnoreIpEndpoints:
async def test_add_ignore_ip_502_on_connection_error(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/sshd/ignoreip returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.add_ignore_ip",
@@ -512,7 +512,7 @@ class TestIgnoreIpEndpoints:
async def test_delete_ignore_ip_502_on_connection_error(self, jails_client: AsyncClient) -> None:
"""DELETE /api/jails/sshd/ignoreip returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.del_ignore_ip",
@@ -599,7 +599,7 @@ class TestToggleIgnoreSelf:
async def test_502_on_connection_error(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/sshd/ignoreself returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.set_ignore_self",
@@ -624,7 +624,7 @@ class TestFail2BanConnectionErrors:
async def test_get_jails_502(self, jails_client: AsyncClient) -> None:
"""GET /api/jails returns 502 when fail2ban socket is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.list_jails",
@@ -636,7 +636,7 @@ class TestFail2BanConnectionErrors:
async def test_get_jail_502(self, jails_client: AsyncClient) -> None:
"""GET /api/jails/sshd returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.get_jail",
@@ -660,7 +660,7 @@ class TestFail2BanConnectionErrors:
async def test_reload_all_502(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/reload-all returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.reload_all",
@@ -672,7 +672,7 @@ class TestFail2BanConnectionErrors:
async def test_start_jail_502(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/sshd/start returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.start_jail",
@@ -696,7 +696,7 @@ class TestFail2BanConnectionErrors:
async def test_stop_jail_502(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/sshd/stop returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.stop_jail",
@@ -740,7 +740,7 @@ class TestFail2BanConnectionErrors:
async def test_toggle_idle_502(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/sshd/idle returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.set_idle",
@@ -780,7 +780,7 @@ class TestFail2BanConnectionErrors:
async def test_reload_jail_502(self, jails_client: AsyncClient) -> None:
"""POST /api/jails/sshd/reload returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.reload_jail",
@@ -894,7 +894,7 @@ class TestGetJailBannedIps:
async def test_502_when_fail2ban_unreachable(self, jails_client: AsyncClient) -> None:
"""GET /api/jails/sshd/banned returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.jails.jail_service.get_jail_banned_ips",

View File

@@ -106,7 +106,7 @@ class TestGetServerSettings:
async def test_502_on_connection_error(self, server_client: AsyncClient) -> None:
"""GET /api/server/settings returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.server.server_service.get_settings",
@@ -163,7 +163,7 @@ class TestUpdateServerSettings:
async def test_502_on_connection_error(self, server_client: AsyncClient) -> None:
"""PUT /api/server/settings returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.server.server_service.update_settings",
@@ -218,7 +218,7 @@ class TestFlushLogs:
async def test_502_on_connection_error(self, server_client: AsyncClient) -> None:
"""POST /api/server/flush-logs returns 502 when fail2ban is unreachable."""
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
with patch(
"app.routers.server.server_service.flush_logs",

View File

@@ -5,7 +5,7 @@ from __future__ import annotations
from unittest.mock import AsyncMock, patch
from app.services.fail2ban_metadata_service import Fail2BanMetadataService
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
async def test_get_db_path_caches_result() -> None:

View File

@@ -9,7 +9,7 @@ import pytest
from app.models.server import ServerStatus
from app.services import health_service
from app.utils.fail2ban_client import Fail2BanConnectionError, Fail2BanProtocolError
from app.exceptions import Fail2BanConnectionError, Fail2BanProtocolError
# ---------------------------------------------------------------------------
# Helpers

View File

@@ -13,7 +13,7 @@ from app.models.geo import GeoDetail, GeoInfo
from app.models.jail import JailDetailResponse, JailListResponse
from app.services import jail_service
from app.services.jail_service import JailNotFoundError, JailOperationError
from app.utils.fail2ban_client import Fail2BanConnectionError
from app.exceptions import Fail2BanConnectionError
# ---------------------------------------------------------------------------
# Helpers