This document catalogues architecture violations, code smells, and structural issues found during a full project review. Issues are grouped by category and prioritised.
This document catalogues architecture violations, code smells, and structural issues found during a full project review. Issues are grouped by category and prioritised.
---
---
## Completed Refactors
- Moved `Fail2BanConnectionError` and `Fail2BanProtocolError` from `backend/app/utils/fail2ban_client.py` into `backend/app/exceptions.py`. Updated all router, service, and test call sites to import these domain exceptions from `app.exceptions` and retained backward compatibility through re-exporting in `app.utils.fail2ban_client`.
@@ -10,430 +10,362 @@ Reference: `Docs/Refactoring.md` for full analysis of each issue.
---
---
### TASK-01 — Fix undefined names in `routers/config_misc.py` 🔴
### Task 1 — Move `Fail2BanConnectionError` and `Fail2BanProtocolError` to `app/exceptions.py`
**Where:**
**Found in:**`backend/app/utils/fail2ban_client.py` lines 128 and 142. Every router that catches these must currently `from app.utils.fail2ban_client import Fail2BanConnectionError`, importing a transport-layer utility directly.
**Goal:** Move (or re-export) both exception classes into `app/exceptions.py` alongside the rest of the domain error hierarchy. Routers and services should import them from `app.exceptions`, not from a utility module. `fail2ban_client.py` can keep the class definitions and simply re-export from `exceptions`, or import them from there — whichever direction avoids a circular import.
**Possible traps and issues:**
-`fail2ban_client.py` is in the `utils` layer; `exceptions.py` is in `app`. Moving the definition to `exceptions.py` and importing it back into `fail2ban_client.py` (for raising) is the natural direction, but verify there is no circular import chain (`exceptions` → `utils` → `exceptions`).
- All callers across `routers/` and `services/` must be updated to import from the new location. A global grep is needed before starting.
-`main.py` registers global exception handlers for both; that import path must be updated too.
**Docs changes needed:** Update `Docs/Refactoring.md` to mark this issue resolved. No user-facing API change.
**Why this is needed:** All domain exceptions should live in one registry so that global exception handlers, tests, and consumers have a single import target. Having transport-layer utilities define exceptions that the entire router layer must catch defeats the purpose of `app/exceptions.py`.
**Status:** Completed ✅
**Status:** Completed ✅
**Goal:**
---
The file references four symbols that are never defined or imported. At runtime, any request that hits an error path in the affected endpoints will raise a bare `NameError` instead of returning the correct HTTP response.
-`ConfigOperationError` (caught on lines 86, 385) — defined in `app.exceptions` but missing from the import block.
### Task 2 — Move the five `ConfigFile*` exceptions out of `raw_config_io_service.py`
-`JailOperationError` (caught on lines 123, 174) — same module, same omission.
-`_bad_request` (called on lines 87, 337, 386) — the file defines `_bad_gateway()` but the symmetrical `_bad_request()` helper was never added.
-`log` (used on line 196 in the restart handler) — a `structlog` logger that every other module in the project creates with `log = structlog.get_logger()`, but this module skips it entirely.
Additionally, `setup_service` is imported at the top level (line 20) and then redundantly re-imported inside two function bodies (lines 289, 327), which triggers a flake8/Pylance redefinition warning. The top-level import is unused.
**Found in:**`backend/app/services/raw_config_io_service.py` lines 66–100 (`ConfigDirError`, `ConfigFileNotFoundError`, `ConfigFileExistsError`, `ConfigFileWriteError`, `ConfigFileNameError`). `backend/app/routers/file_config.py` lines 55–61 imports them directly from the service.
Fix steps:
**Goal:** Move all five class definitions into `app/exceptions.py`. Update `raw_config_io_service.py` to import and raise them from there. Update `file_config.py` to import them from `app.exceptions`.
1. Add `from app.exceptions import ConfigOperationError, JailOperationError` to the import block.
2. Add `log: structlog.stdlib.BoundLogger = structlog.get_logger()` at module level, following the same pattern as every other service/router.
3. Add the `_bad_request` helper directly below `_bad_gateway`:
```python
def _bad_request(message: str) -> HTTPException:
return HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=message,
)
```
4. Remove `setup_service` from the top-level `from app.services import ...` line; keep only the two local imports where it is actually used, or hoist one of them to the top and remove the other.
**Possible traps:**
**Possible traps and issues:**
- The file also uses `config_service.ConfigOperationError` in one place (line 385) — that reference goes through the module attribute and therefore works today even without a direct import. After adding the bare import, both references will resolve to the same class, which is correct. Verify nothing accidentally catches the wrong exception.
- The existing names must be preserved (they are part of the public raise/catch contract); do not rename them.
- Removing `setup_service` from the top-level import while keeping the local imports means the local imports must stay in sync. Consider hoisting to top-level for consistency with every other router.
-`file_config.py` currently imports from `app.services.raw_config_io_service` — that import line must change. Check whether any test files also import these classes directly from the service.
- After the move, confirm that `raw_config_io_service.py` still raises them (not the old local definitions that are now gone).
**Why this is needed:** Routers should not reach into service internals to import exception types. All domain exceptions belong in the central registry so `main.py` global handlers and any future middleware can catch them uniformly.
Any user action that exercises an error path on the `/config/global`, `/config/reload`, `/config/restart`, or `/config/log` endpoints will receive an unhandled `NameError` 500 response instead of the intended 400/409/502. These are common failure modes (fail2ban unreachable, bad config value).
---
---
### TASK-02 — Wrap blocking `mkdir()` calls in `run_blocking()` 🔴
### Task 3 — Eliminate the `config_file_service.py` delegation façade
**Status:** Completed ✅
**Found in:**`backend/app/services/config_file_service.py` (1 136 lines). Every public function is a one-liner that delegates to `jail_config_service`, `filter_config_service`, or `action_config_service`. Those three sub-services import `config_file_service` back via lazy imports scattered across ~10 function bodies each, creating a hidden circular dependency.
**Where:**
**Goal:** Delete `config_file_service.py`. Update every caller (routers: `config_misc.py`, `jail_config.py`; services: anything that calls through the façade) to import the appropriate sub-service directly. Remove all lazy-import call sites (`from app.services import config_file_service as _cfs`) in `jail_config_service.py`, `filter_config_service.py`, and `action_config_service.py`.
- `backend/app/services/setup_service.py` — line 178, inside `async def _ensure_database_initialized()`
- `backend/app/startup.py` — line 73, inside `async def startup_shared_resources()`
**Goal:**
**Possible traps and issues:**
Both sites call `Path.mkdir()` synchronously inside async functions. Because the entire backend runs on a single-threaded asyncio event loop, a synchronous filesystem call — even one that normally returns in microseconds — blocks all other coroutines from running until it completes, including health-check tasks and in-flight requests. The fix is to wrap each call with the project's existing `run_blocking()` helper (from `app.utils.async_utils`), which offloads the call to a thread-pool executor.
-`config_file_service.py` also re-exports `start_daemon` and `wait_for_fail2ban` from `app.utils`. Find all callers of those two names and update their imports.
- The lazy imports exist specifically to avoid the circular dependency — simply removing them may surface an import cycle. Trace the actual call graph before deleting the file.
-`config_misc.py` router imports `config_file_service` for several operations; it will need multiple new imports to replace the façade.
- Large file change — high risk of regression. Comprehensive test run required before and after.
`startup.py` already imports `run_blocking` indirectly (it calls services that use it), but needs to import it directly. `setup_service.py` already imports and uses `run_blocking` elsewhere in the same file (line 93) — this is therefore a straightforward omission.
**Docs changes needed:** Update `Docs/Refactoring.md` and `Docs/Architekture.md` if it mentions this service.
Fix steps:
**Why this is needed:** Pure delegation façades add indirection with no abstraction benefit and obscure the true dependencies of the system. The hidden circular dependency via lazy imports is a structural risk — a refactor inside any of the three sub-services could easily break the cycle in unexpected ways.
2. In `startup.py` around line 73, add the import if absent and apply the same pattern.
**Possible traps:**
- `Path.mkdir` takes keyword-only args. `run_blocking` must accept `**kwargs` or use a lambda: `await run_blocking(lambda: db_path.parent.mkdir(parents=True, exist_ok=True))`. Check the signature of `run_blocking` before picking the form.
- The `PermissionError` that `setup_service` deliberately catches must still be handled in the same try/except block — the await must stay inside the try.
**Docs changes needed:**
None — this is a pure bug fix.
**Why:**
Blocking I/O on the event loop is forbidden by the architecture's "Async Everything" principle. Under load or during startup, a slow filesystem (e.g. NFS, encrypted volume) could stall the entire server for hundreds of milliseconds.
---
---
### TASK-03 — Remove inverted dependency: service importing a task ✅
### Task 4 — Move raw SQL query out of `history_service.py` into a repository
**Status:** Completed ✅
**Found in:**`backend/app/services/history_service.py` line 70: `db.execute("SELECT MAX(timeofban) FROM history_archive")`.
**Where:**
**Goal:** Add a `get_max_timeofban() -> datetime | None` function to `backend/app/repositories/history_repo.py` (or the appropriate history archive repository). Replace the inline `db.execute` call in the service with a call to that repository function.
`backend/app/services/jail_config_service.py` — line 37:
```python
from app.tasks.health_check import run_probe
```
Used on lines 295 and 571 to fire a health probe after activating or deactivating a jail.
**Goal:**
**Possible traps and issues:**
The dependency graph mandates `Tasks → Services`, never `Services → Tasks`. `run_probe` is already exposed by `health_service.py` (or can be). The fix is to remove the task import from the service and call the equivalent functionality through the service layer instead.
- Check whether `history_archive` is queried via the same `aiosqlite.Connection` that the repository layer already uses. Confirm the repository file exists and is the right home (there may be separate `history_repo` and `history_archive_repo`).
- The column `timeofban` must be mapped to the correct Python type (likely `datetime`) — match the convention of the surrounding repository functions.
- Unit tests for `history_service` that mock the db connection will need to be updated to mock the repository call instead.
1. If `health_service.probe()` is already equivalent, replace the two `run_probe(app)` calls with `await health_service.probe(socket_path)` (passing the socket path directly, avoiding the `FastAPI` argument entirely — see also TASK-05).
2. If `run_probe` does more than just probe (e.g. it also writes to `app.state`), extract that logic into a function in `health_service` and call that from both the task and the service.
**Possible traps:**
**Why this is needed:** Raw SQL in the service layer bypasses the repository abstraction. It makes the service harder to test (requires a real DB schema) and harder to maintain (schema changes must be tracked in the service, not just the repository).
- `run_probe` in `health_check.py` accepts `app: FastAPI` and uses `app.state` to cache the probe result. Replacing it with a direct `health_service` call means the cached status in `app.state` will not be updated. Decide whether the caller needs the cache updated (if yes, pass the relevant state as a parameter; if no, a direct service call is sufficient).
- After this change, verify that `tasks/health_check.py` still calls `run_probe` internally and that no other module relied on the cross-import.
**Docs changes needed:**
Architecture doc section 2.2 (Tasks) should note that tasks may call services, but services must not call tasks.
**Why:**
The inverted import creates a hidden circular-dependency risk: if `health_check` ever imports anything from `jail_config_service`, Python will raise an `ImportError`. It also makes `jail_config_service` harder to test in isolation because importing it pulls in the scheduler task module.
---
---
### TASK-04 — Remove `FastAPI` app instance from service signatures 🟠
### Task 5 — Remove the reversed dependency from `blocklist_service.py` to `app.tasks`
**Status:** Completed ✅
**Found in:**`backend/app/services/blocklist_service.py` line 530: `from app.tasks import blocklist_import as blocklist_import_task`. Services must not import from tasks; tasks are consumers of services, not dependencies of them.
**Where:**
**Goal:** Invert the dependency. The functionality needed from the task (triggering or querying a scheduled blocklist import) should be expressed as a callable or scheduler reference that is passed into the service function, or the service should expose a pure data/state function that the task calls rather than the service calling into the task.
`backend/app/services/jail_config_service.py` — `activate_jail()` and `deactivate_jail()` function signatures accept `app: FastAPI` (guarded by `TYPE_CHECKING` on line 48, used at runtime on lines ~295, ~571).
**Goal:**
**Possible traps and issues:**
Services must not depend on the FastAPI framework. The `app` object is passed solely so `run_probe(app)` can update `app.state`. After TASK-03 removes the `run_probe` dependency, this parameter becomes unnecessary. If any state from `app.state` is still needed (e.g. `socket_path`, `scheduler`), pass only those specific values — not the entire app.
- Understand what exactly the service does with `blocklist_import_task` — if it schedules or triggers a job, the job scheduler (`app.state.scheduler`) should be passed in as a parameter rather than imported from tasks.
- This may require adding a parameter to the affected service function and updating its callers (the router).
- If the task import is conditional/lazy, it still violates the dependency rule regardless of when it resolves.
Fix steps:
**Docs changes needed:** Update `Docs/Refactoring.md` and `Docs/Architekture.md` (task→service dependency direction).
1. Complete TASK-03 first.
2. Remove the `app: FastAPI` parameter from `activate_jail` and `deactivate_jail`.
3. Remove the `if TYPE_CHECKING: from fastapi import FastAPI` block if it is no longer referenced anywhere in the file.
4. Update all callers (routers and tests) to stop passing `app`.
**Possible traps:**
**Why this is needed:** Circular layer dependencies (`service → task → service`) make it impossible to test services in isolation and create hidden initialisation-order coupling that can cause import errors or subtle bugs at startup.
- The routers that call `activate_jail` inject `request: Request` specifically to access `request.app`. After removing the parameter, those routers should no longer need `Request` — check whether it is used for anything else in those handlers.
- Any test that constructs a mock `FastAPI` instance just to pass it here can be simplified.
**Docs changes needed:**
None beyond removing the `FastAPI` reference from the service signature.
**Why:**
Framework types in the service layer violate the Dependency Inversion principle, tightly couple service logic to the HTTP layer, and make unit testing significantly harder — requiring a FastAPI application instance just to call a domain function.
---
---
### TASK-05 — Extract business logic out of `tasks/history_sync.py` into a service ✅
### Task 6 — Extract log reading and service status out of `config_service.py`
**Where:**
**Found in:**`backend/app/services/config_service.py` line 687 (`read_fail2ban_log`) and line 781 (`get_service_status`). Both are unrelated to configuration management and do not belong in this service.
`backend/app/tasks/history_sync.py` — lines 15–16 import repositories directly; the entire `_run_sync_with_settings()` function contains paging logic, backfill window calculation, and `archive_ban_event` calls that constitute business logic.
**Goal:**
**Goal:** Move `read_fail2ban_log` into `log_service.py` (the dedicated log-reading service). Move `get_service_status` into `health_service.py`. Update the calling routers (`config_misc.py` uses both) to import from the correct service.
Tasks must only schedule and invoke. All logic — the backfill window decision, the paging loop, the SQL cursor over the fail2ban DB, the archive write — belongs in a service. Create (or extend) `history_service.py` with a `sync_from_fail2ban_db()` function that accepts a DB connection and socket path, encapsulates the entire sync algorithm, and returns a count of synced records. The task then becomes a thin wrapper:
**Docs changes needed:** Update `Docs/Refactoring.md`. Adjust any architecture diagram that shows `config_service` as the home for log/status operations.
- `history_service` currently uses the fail2ban database read-only. The new function writes to the BanGUI archive table — make sure the connection passed is the BanGUI DB, not the fail2ban DB. The fail2ban DB path is resolved via `get_fail2ban_db_path()` inside the service function.
- The `BACKFILL_WINDOW` and `HISTORY_SYNC_INTERVAL` constants can stay in the task file (scheduling concern) but the logic that decides whether to backfill must move to the service.
- Ensure existing tests for `history_sync` are updated to test the service function, not the task internals.
**Docs changes needed:**
**Why this is needed:** Single-responsibility principle. A configuration management service owning log reading and daemon status checks makes the service harder to reason about and violates the established service split.
Architecture doc section on Services: add `history_service.sync_from_fail2ban_db` to the service table. Architecture doc section on Tasks: update `history_sync` description to reflect that it delegates to `history_service`.
**Why:**
Business logic in the task layer cannot be unit-tested without spinning up a scheduler. It also means any future consumer of the sync logic (e.g. a manual trigger endpoint) would have to duplicate the code or import from the task module.
---
---
### TASK-06 — Break circular dependency between `config_file_service` and the three config sub-services ✅
**Found in:** Implementations exist in `backend/app/services/config_service.py`, `backend/app/services/setup_service.py`, and `backend/app/utils/setup_utils.py` (the latter used by both). The router `config_misc.py` calls `setup_service.get_map_color_thresholds()` for ongoing map color operations that have nothing to do with first-run setup.
**Where:**
**Goal:** Consolidate the single authoritative implementation in one appropriate service — likely a dedicated `settings_service.py` or inside `config_service.py`. Remove the duplicate copies. Update `config_misc.py` to import from the canonical location. Remove `get_map_color_thresholds` / `set_map_color_thresholds` from `setup_utils.py` if they only exist to serve this use case.
- `services/jail_config_service.py` line 33 → imports `config_file_service` at module level
- `services/filter_config_service.py` line 26 → same
- `services/action_config_service.py` line 28 → same
The three sub-services depend on `config_file_service` for shared parsing utilities, while `config_file_service` calls back into the three sub-services via lazy in-function imports — the only reason it works is that the circular import is deferred until first call. This is fragile: any reorganisation that removes the lazy-import guard will cause an `ImportError` at startup.
-`setup_utils.py` imports from `repositories` (itself a violation — see Task 10). Consolidating here creates a dependency chain; resolve Task 10 either before or as part of this task.
-`setup_service.py` is conceptually a first-run service; if callers of the consolidated function include both first-run setup and runtime settings, choose the runtime home and let `setup_service` call it, not the other way.
- Tests that mock `setup_service.get_map_color_thresholds` in the router tests will need updating.
The correct fix is to break the cycle by extracting the shared logic that `config_file_service` currently delegates to the sub-services into a lower-level module. Concretely:
1. Identify every function that `config_file_service` calls via lazy import from the sub-services (e.g. `list_inactive_jails`, `_activate_jail`, `list_filters`, `get_filter`, `list_actions`, etc.).
**Why this is needed:** Triplicated implementation violates DRY and means a change to the threshold schema must be made in three places. Using `setup_service` for ongoing runtime settings is conceptually wrong and misleads maintainers.
2. Determine whether `config_file_service` should own those functions directly (moving them out of the sub-services), or whether `config_file_service` should simply be split so the shared parsing utilities live in a `config_file_utils` module that all four services can import without circularity.
3. Remove all lazy in-function imports from `config_file_service`.
The safest first step is option 2: rename the pure parsing/utility functions in `config_file_service` into a new `utils/config_file_utils.py` (or extend the existing one), then have all four service files import from the util, leaving `config_file_service` to handle only orchestration.
**Possible traps:**
- `config_file_service` is over 1700 lines. Splitting it requires careful identification of which functions are pure utilities (no side effects, no DB/socket calls) vs. orchestration. Start with utility functions only.
- Any function moved to utils must not import from services. If it currently does, the abstraction is wrong and needs further decomposition.
- Test coverage for existing behaviour must be maintained throughout the refactor. Run the full test suite after each extraction step.
**Docs changes needed:**
Architecture doc section 2.1 (Project Structure): add `utils/config_file_utils.py` if created. Section 2.2 (Utils): describe the new module. Update the Service descriptions for all four affected services to accurately describe their responsibilities post-split.
**Why:**
Lazy imports mask circular dependencies at import time but surface as confusing runtime errors if import order changes. The current design also makes it impossible to import any one of the four services for testing without transitively loading all four.
---
---
### TASK-07 — Move schedule management out of `routers/blocklist.py` into a service ✅
### Task 8 — Move the `activate_jail` 3-step restart workflow out of `config_misc.py` router
**Status:** Completed ✅
**Found in:**`backend/app/routers/config_misc.py` lines 193–197: the `restart_fail2ban` handler directly orchestrates `config_file_service.stop_daemon()` → `config_file_service.start_daemon()` → `config_file_service.wait_for_fail2ban()` as a three-step inline sequence.
**Summary:** Router schedule endpoints now delegate schedule persistence and APScheduler job management to `blocklist_service`, removing the router's direct dependency on `app.tasks.blocklist_import`.
**Goal:** Extract this orchestration into a `restart_daemon()` function inside `config_file_service.py` (or a more appropriate service). The router handler should call that single function.
**Where:**
**Possible traps and issues:**
`backend/app/routers/blocklist.py` — line 51:
- Error handling differs between the three steps; the new service function must preserve the existing error semantics (what happens if `start_daemon` fails must be the same as before).
```python
- If `config_file_service.py` is being deleted (Task 3), coordinate which service owns this function after the façade is removed.
from app.tasks import blocklist_import as blocklist_import_task
-`wait_for_fail2ban` is a polling operation — confirm any timeout behaviour is preserved.
```
Used on lines 173, 206, 208 to read `blocklist_import_task.JOB_ID` and call `blocklist_import_task.reschedule(app)`.
Routers must not know about APScheduler job IDs or the task module internals. Schedule management is business logic and belongs in a service. Create a `get_schedule()` and `update_schedule()` function in `blocklist_service.py` that encapsulate the scheduler interaction. The router calls the service function and returns its result.
The router's `GET /schedule` and `PUT /schedule` handlers should be reduced to:
**Why this is needed:** Multi-step orchestration workflows are business logic and do not belong in HTTP handler functions. Keeping them in the service layer makes the workflow testable in isolation and reusable if a second endpoint ever needs to restart the daemon.
The `JOB_ID` constant and `reschedule()` call details move into `blocklist_service`. `blocklist_service` may import from the task module (tasks can be imported by services for constants/types), but the router should have no direct task dependency.
**Possible traps:**
- `blocklist_import_task.reschedule(app)` uses `app` to read `app.state`. If the service function receives the scheduler directly (already available via `SchedulerDep`), it can avoid the `app` dependency. Pass `scheduler` explicitly instead.
- The `SchedulerDep` injection in the router should remain unchanged — the scheduler is a dependency, not a business concern, and injecting it via `Depends()` is correct.
**Docs changes needed:**
Architecture doc Service table: add `get_schedule` and `update_schedule` to the `blocklist_service` entry.
**Why:**
The router knows the internal job ID string and directly manipulates the scheduler — both are implementation details of the task/scheduling layer. This makes the endpoint hard to test without a running scheduler instance.
---
---
### TASK-08 — Remove inverted layer dependency: `utils/fail2ban_db_utils.py` importing a service ✅
### Task 9 — Move `sign_session_token` call out of `auth.py` router into `auth_service.login()`
**Status:** Completed ✅
**Found in:**`backend/app/routers/auth.py` line 74 calls `sign_session_token(token, secret)` after calling `auth_service.login()`. The raw token is returned by the service; the signing is performed in the router.
**Where:**
**Goal:**`auth_service.login()` should return an already-signed token. The router should receive the final cookie value from the service without needing to know about the signing step. Move the `sign_session_token` call into `auth_service.login()`.
`backend/app/utils/fail2ban_db_utils.py` — line 8:
```python
from app.services.fail2ban_metadata_service import default_fail2ban_metadata_service
```
**Goal:**
**Possible traps and issues:**
`utils/` must be the lowest layer — it may not import from `services/`. The `fail2ban_db_utils` module is currently a thin wrapper around `default_fail2ban_metadata_service`. Two options exist:
-`auth_service.login()` currently has a specific return type; changing what it returns (raw token → signed token) must be reflected in the return type annotation and any tests that assert on the raw token value.
-`sign_session_token` is currently a public function exported from `auth_service`. After this change it may become a private implementation detail (`_sign_session_token`). Check for any other callers before hiding it.
- Tests that test `login()` in isolation and currently check for the raw token will need to be updated to verify the signed format instead (or mock `sign_session_token` separately).
1. **Move the singleton down**: Move `Fail2BanMetadataService` (and its `default_fail2ban_metadata_service` singleton) from `services/` to `utils/`. The class has no service dependencies — it only uses `Fail2BanClient` from `utils/` — so it belongs in `utils/` naturally.
2. **Eliminate the util wrapper**: Delete `fail2ban_db_utils.get_fail2ban_db_path()` and have all callers import from `fail2ban_metadata_service` directly (since it is a service and services can import other services).
Option 1 is cleaner and consistent with the architecture: `Fail2BanMetadataService` is essentially an async-safe caching wrapper around a socket command, which is a utility concern.
**Why this is needed:** Security-sensitive token construction is business logic. Letting the HTTP layer build cookie values means a future maintainer adding a second login endpoint could forget the signing step, creating a silent security regression.
**Possible traps:**
- If `Fail2BanMetadataService` is moved to `utils/`, the import path changes everywhere it is referenced. Use IDE rename-symbol to update all references atomically.
- The architecture doc currently does not list `fail2ban_metadata_service` at all. After moving it to utils, it must be documented in the Utils table.
- Any service that currently imports `fail2ban_db_utils` (rather than `fail2ban_metadata_service`) will need their imports updated if `fail2ban_db_utils` is removed.
**Docs changes needed:**
Architecture doc section 2.2 (Utils): add `fail2ban_db_utils.py` (or its replacement) and `fail2ban_metadata_service` to the Utils table.
**Why:**
A util importing a service creates an implicit upward dependency that the architecture explicitly forbids. It also makes `fail2ban_db_utils` untestable in isolation — importing it drags in the service and its lazy-loaded metadata cache.
---
---
### TASK-09 — Move `re_resolve_geo` orchestration into `geo_service` ✅
### Task 10 — Remove the repository import from `utils/setup_utils.py`
**Status:** Completed ✅
**Found in:**`backend/app/utils/setup_utils.py` line 5: `from app.repositories import settings_repo`. The `utils` layer must not import from `repositories`; that is the service layer's job.
**Where:**
**Goal:** Move any function in `setup_utils.py` that depends on `settings_repo` into `setup_service.py` (which is already allowed to import repositories). The remaining pure utility functions (password hashing, etc.) can stay in `setup_utils.py`.
The endpoint currently performs a multi-step workflow inline:
-`setup_utils.py` is imported by `auth_service.py` (`from app.utils.setup_utils import get_password_hash`). This import is fine and must be preserved; only the repository-dependent functions move.
1. Fetch all unresolved IPs from the DB.
- Identify every function in `setup_utils.py` that directly or indirectly calls `settings_repo`. Move the entire call chain, not just the top-level function.
2. Early-return if the list is empty.
- The functions being moved may already have analogues in `setup_service.py`; check for duplication before moving.
3. Clear the negative cache.
4. Batch-resolve all IPs.
5. Aggregate the count of newly resolved entries.
Steps 1–5 are business logic. The router should call a single service function and return its result:
**Docs changes needed:** Update `Docs/Refactoring.md` and `Docs/Architekture.md` (layer diagram).
```python
**Why this is needed:** Utils are stateless helpers with no external dependencies. Allowing them to import from the repository layer breaks the architectural contract, makes the utilities untestable without a database, and obscures what is actually a service-layer operation.
result = await geo_service.re_resolve_all(db, http_session)
return result # {"resolved": N, "total": M}
```
Add `async def re_resolve_all(db, http_session) -> dict[str, int]` to `geo_service.py` containing the current implementation.
**Possible traps:**
- `clear_neg_cache()` is a synchronous function that mutates the module-level `_neg_cache` dict. Calling it from inside the service function is fine, but be aware that it runs on the event loop thread with no lock. This is an existing race condition (see TASK-11) — do not introduce additional concurrent callers without addressing that issue first.
- The return type `dict[str, int]` is an untyped dict. Consider adding a proper response model `GeoReResolveResponse` to `models/geo.py` to make the API contract explicit.
**Docs changes needed:**
Architecture doc Service table: add `re_resolve_all` to the `geo_service` entry.
**Why:**
Multi-step orchestration in the router violates the zero-business-logic rule. It also prevents the re-resolve logic from being triggered from any other context (e.g. a scheduled task or CLI tool) without duplicating code.
---
---
### TASK-10 — Move `GeoInfo → GeoDetail` translation out of the router ✅
### Task 11 — Centralise `DbDep` — remove local redefinitions in `blocklist.py` and `geo.py`
**Status:** Completed ✅
**Found in:**`backend/app/routers/blocklist.py` line 54 redefines `DbDep = Annotated[aiosqlite.Connection, Depends(get_db)]` locally. `backend/app/routers/geo.py` uses the inline form `Annotated[aiosqlite.Connection, Depends(get_db)]` directly in function signatures instead of importing `DbDep`.
**Where:**
**Goal:** Delete the local redefinitions. Import `DbDep` from `app.dependencies` in both routers.
The router performs conditional type checking and cross-schema mapping. The cleanest fix is to have `jail_service.lookup_ip()` return a typed dict whose `"geo"` field is already `GeoDetail | None`, eliminating the isinstance check entirely. Alternatively, add a `GeoInfo.to_detail()` method or a `geo_service.to_detail()` converter that the router calls in one line.
- Confirm that `DbDep` is already exported from `app/dependencies.py` (verify the exact name used there).
- The inline form in `geo.py` may appear in multiple function signatures; all occurrences must be replaced.
- No behaviour change — purely a refactor. Run the test suite to confirm.
This is a small but clean change: update `IpLookupResult` TypedDict in `jail_service` so that `"geo"` is typed as `GeoDetail | None`, and have the service construct the `GeoDetail` before returning.
**Why this is needed:** A single definition of `DbDep` ensures that any future change to the db dependency (e.g. adding row factory configuration) is applied uniformly across all routers.
- `IpLookupResult` is currently typed with `"geo": GeoInfo | None`. Changing it to `GeoDetail` means callers that rely on `GeoInfo`-specific fields (like `asn`) must use `GeoDetail` instead — confirm `GeoDetail` exposes the same fields.
- If `GeoInfo` and `GeoDetail` are nearly identical, consider merging them into a single model to avoid this translation layer entirely.
**Docs changes needed:**
None — this is an internal type alignment.
**Why:**
Schema translation in the router adds fragility: if either model changes, the mapping code must be updated and the connection is not enforced by the type system. Having the service return the correct type is safer and is consistent with the "routers return typed responses" principle.
---
---
### TASK-11 — Fix `asyncio.Lock()` created at module import time in `jail_service.py` ✅
### Task 12 — Complete the protocol injection layer or remove it
**Status:** Completed ✅
**Found in:**`backend/app/dependencies.py` defines `AuthServiceDep`, `JailServiceDep`, `ConfigServiceDep`, `GeoServiceDep`, `HistoryServiceDep`, `BlocklistServiceDep`, `HealthServiceDep`, `ServerServiceDep` as protocol-typed dependency aliases. Only `routers/auth.py` and `routers/jails.py` actually use these; the other 16 routers import concrete service modules directly.
**Where:**
**Goal:** Make a deliberate decision and enforce it consistently. Option A: adopt the protocol injection pattern everywhere — update all 16 non-compliant routers to accept their service via the typed `Dep` alias. Option B: acknowledge the pattern is unused overhead and remove the protocol aliases and `cast()` wrappers from `dependencies.py`, letting all routers import concrete services directly (the current de facto standard). Option B is cheaper; Option A makes service substitution in tests easier.
`backend/app/services/jail_service.py` — lines 71 and 78:
```python
_reload_all_lock: asyncio.Lock = asyncio.Lock()
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()
```
**Goal:**
**Possible traps and issues:**
`asyncio.Lock()` created before the event loop starts is bound to whichever loop exists at import time. In Python 3.10+ this is generally safe for a single-worker deployment using uvicorn, but it breaks test suites that create a new event loop per test (e.g. `pytest-asyncio` with `loop_scope="function"`), causing a `RuntimeError: Task attached to a different loop` error when the second test tries to acquire the lock.
- Option A requires updating all 16 routers and their test files simultaneously; this is a broad change with high regression risk. Stage it one router at a time.
- Option B means deleting `services/protocols.py` (398 lines) and all `cast("…Service", …)` calls in `dependencies.py`. Ensure nothing outside this layer references the Protocol classes (check test files).
-`dependencies.py` also has repository protocol aliases (`BlocklistRepository`, `ImportLogRepository`, etc.) — decide whether those follow the same fate.
Fix: initialise the locks lazily on first use:
**Docs changes needed:** Update `Docs/Architekture.md` to reflect the chosen pattern. Update `Docs/Refactoring.md`.
```python
_reload_all_lock: asyncio.Lock | None = None
_backend_cmd_lock: asyncio.Lock | None = None
def _get_reload_lock() -> asyncio.Lock:
**Why this is needed:** Inconsistency between `auth.py`/`jails.py` (uses protocols) and every other router (bypasses them) makes the codebase confusing and the injection layer misleading. This is the most widespread single structural inconsistency in the router layer.
global _reload_all_lock
if _reload_all_lock is None:
_reload_all_lock = asyncio.Lock()
return _reload_all_lock
```
Then replace `async with _reload_all_lock:` with `async with _get_reload_lock():`.
**Possible traps:**
- Lazy initialisation is not thread-safe if multiple coroutines call the getter simultaneously before initialisation. In an asyncio context this is safe (single-threaded event loop), but add a brief comment explaining why.
- If the test suite currently happens to pass despite this issue (because `pytest-asyncio` is configured with `loop_scope="session"`), the bug is latent rather than active — still worth fixing for configuration flexibility.
**Docs changes needed:**
None.
**Why:**
Module-level asyncio primitives are a known footgun in Python async codebases. Fixing this is low-risk and makes the test suite more robust against pytest-asyncio configuration changes.
---
---
### TASK-12 — Remove duplicate import in `server_service.py` ✅
### Task 13 — Move `ban_ip`, `unban_ip`, and `get_active_bans` from `jail_service` to `ban_service`
**Where:**
**Found in:**`backend/app/services/jail_service.py` contains `ban_ip`, `unban_ip`, and `get_active_bans`. These operations conceptually belong in `ban_service.py`, which is the declared home for ban management. Routers `bans.py` and `blocklist.py` already import `jail_service` specifically for these functions.
**Goal:** Move the three functions into `ban_service.py`. Update `bans.py` and `blocklist.py` to import from `ban_service` instead of `jail_service`. Remove the now-redundant `jail_service` import from those routers if it is no longer needed.
Delete line 18 (the exact duplicate). One-line change.
**Possible traps:**
**Possible traps and issues:**
None.
-`ban_ip` and `unban_ip` likely use the fail2ban socket through utilities shared with the rest of `jail_service`. Confirm there is no shared private helper that needs to move with them.
-`get_active_bans` may overlap with existing functions in `ban_service`. Check for duplication and merge if needed.
- If `jail_service` tests cover these functions, move those test cases to the `ban_service` test file.
**Docs changes needed:**
**Docs changes needed:** Update `Docs/Architekture.md` service responsibility table. Update `Docs/Refactoring.md`.
None.
**Why:**
**Why this is needed:** Function placement that contradicts the declared service boundary makes it harder to find behaviour during maintenance and violates the principle of least surprise. The service name should predict where operations live.
Duplicate imports generate linter warnings, add noise to diffs, and signal that the file was edited carelessly.
---
---
### TASK-13 — Document `fail2ban_metadata_service` and `history_sync` task in the architecture doc 🟡
### Task 14 — Lift geo-enrichment closure construction out of `history.py` router
**Status:** Completed ✅
**Found in:**`backend/app/routers/history.py` lines 110, 147, and 189 each build a `_enricher` async closure and pass it to `history_service` calls. The closure captures `geo_service` and `http_session` from the outer scope and adapts the geo lookup interface for the service.
**Summary:** Added `fail2ban_metadata_service.py` to the service documentation and confirmed `history_sync.py` is documented as delegating to `history_service.py` in `Docs/Architekture.md`.
**Goal:** Move the closure construction into `history_service.py` or into a shared helper in `geo_service.py`. The router should pass the `http_session` and let the service build the enricher internally, or the service should accept `geo_service` as an injected callable.
-`history_service` already accepts an optional `geo_enricher: GeoEnricher | None` parameter — this is the hook. The router's closures are just adapters for that parameter. Either the service gains a higher-level function that takes `http_session` directly, or a factory function in `geo_service` builds the enricher.
- The same pattern exists in `geo.py` and `ban_service.py` — a consistent solution should handle all three callsites, not just `history.py`.
- Changing the `history_service` function signature will require updating tests.
Two modules exist in the codebase that are absent from the architecture document:
1. `services/fail2ban_metadata_service.py` — An async-safe caching wrapper that resolves the fail2ban SQLite database path by querying the socket once and caching the result. Exports `default_fail2ban_metadata_service`, a module-level singleton. Should be listed in the Services table (or moved to Utils — see TASK-08) with a description of its caching behaviour and the singleton pattern.
**Why this is needed:** Adapter/closure construction is glue code that belongs at the service boundary or in a factory, not in the HTTP handler. Routers should not need to understand the geo service's lookup interface to serve history requests.
2. `tasks/history_sync.py` — A scheduled background task (interval: 300 seconds) that copies new ban records from the fail2ban database into BanGUI's own archive table to preserve history that would otherwise be purged. Should be added to the Tasks table.
---
Additionally note in section 6 (Authentication): the `_session_cache` in `dependencies.py` noted there is now implemented in `utils/session_cache.py` — confirm the section still accurately reflects the implementation.
### Task 15 — Fix stale activation record on failed `activate_jail`
**Possible traps:**
**Found in:**`backend/app/routers/jail_config.py` line 385. `record_activation(app, name)` is called unconditionally before the service call. When the service raises any exception, no handler clears `last_activation`. For the following 60 seconds, the health-check task will misattribute any fail2ban offline event to this failed activation, potentially creating a spurious `PendingRecovery`.
- If TASK-08 is completed before this task, the location of `fail2ban_metadata_service` will change. Write the doc entry after TASK-08 is decided.
**Docs changes needed:**
**Goal:** Wrap the service call in a try/except that clears `last_activation` on any exception before re-raising. Alternatively, only call `record_activation` after a successful return.
`Docs/Architekture.md`:
- Add `fail2ban_metadata_service.py` row to the Services (or Utils) table.
- Add `history_sync.py` row to the Tasks table.
- Cross-reference TASK-05 once it is complete (the task should delegate to a service).
**Why:**
**Possible traps and issues:**
The architecture document is a contract between contributors. Undocumented modules lead to inconsistent patterns and duplicate implementations when new developers add features that the undocumented module already handlzttttttttt nbbbbbbbbbbes.
- Moving `record_activation` to after the `await` means the timing is slightly later (by the service call duration). This is acceptable because the 60-second window is generous.
- Ensure every exception branch (not just the ones currently handled) triggers the cleanup. Use a `try/finally` to clear the record on failure, not individual `except` blocks.
- The unused `activation_time` return value of `record_activation` should be removed if moving the call eliminates the need for it.
**Why this is needed:** A false `PendingRecovery` presented to the user offers a rollback for a jail that was never successfully activated. This could confuse operators into performing a rollback that modifies configuration unnecessarily.
---
### Task 16 — Add a lock to `geo_service.py` module-level mutable state
**Found in:**`backend/app/services/geo_service.py` lines 99–110: `_cache: dict`, `_neg_cache: dict`, `_dirty: set`, and `_geoip_reader` are mutable module-level singletons with no concurrency control. Background tasks (`geo_cache_flush`, `geo_re_resolve`) and request handlers all read and write these concurrently.
**Goal:** Add an `asyncio.Lock` protecting mutations of `_cache`, `_neg_cache`, and `_dirty`. The lock should be acquired for any write and for the copy-then-clear pattern in `flush_dirty`. Read-only accesses that are single-await-free are safe without a lock in the single-threaded asyncio model, but mutation sites must be explicit.
**Possible traps and issues:**
- The comment in `flush_dirty` states that no `await` between copy and clear makes it safe today; a lock is still preferable to ensure the invariant is enforced rather than relied upon implicitly.
- Avoid holding the lock across network I/O (e.g. the ip-api.com fetch in `lookup`). Acquire it only around the dict/set mutation itself.
-`asyncio.Lock` is not thread-safe if `run_in_executor` is used anywhere in the geo path — verify `_cache` is never read from a thread pool.
- Introducing a lock is low risk but adds overhead on every cache write; profile if the geo cache is a hot path.
**Why this is needed:** The current safety is implicit and fragile. A future change that adds an `await` inside the critical section (e.g. logging to a remote sink) would silently introduce data loss in the dirty-flush path. An explicit lock documents the intent and makes the safety guarantee unconditional.
---
### Task 17 — Move crash-detection logic out of the `health_check` task
**Found in:**`backend/app/tasks/health_check.py` lines 70–113. The `_run_probe_with_resources` function contains a state machine that checks `last_activation` timing, detects `online→offline` transitions, and writes `PendingRecovery` records. This is domain business logic embedded in a scheduling function.
**Goal:** Extract the crash-detection state machine into a function in `backend/app/utils/runtime_state.py` or into a new `recovery_service.py`. The health-check task should call that function with the previous status, the new status, and the current runtime state — and receive back the updated state without containing the decision logic itself.
**Possible traps and issues:**
- The state machine references `datetime.datetime.now(UTC)` — ensure the extracted function remains deterministic/injectable for testing by accepting `now` as an optional parameter.
-`runtime_state.py` already holds `record_activation`, `create_pending_recovery`, `clear_pending_recovery`. The extracted logic is a natural neighbour there.
- When moving the logic, preserve the exact 60-second window (`_ACTIVATION_CRASH_WINDOW`) and the guard that prevents overwriting an unresolved record.
- Tests for the health check task will need to be refactored: the task becomes trivially thin; the logic can now be unit-tested without a running scheduler.
**Why this is needed:** Business rules about crash attribution timing should not live inside a scheduling artifact. Embedding decision logic in a background job makes it invisible, hard to test without the scheduler, and impossible to reuse from another trigger (e.g. a manual probe endpoint).
---
### Task 18 — Replace `__import__("datetime")` antipattern in `health_check.py`
**Found in:**`backend/app/tasks/health_check.py` lines 165–166: `next_run_time=__import__("datetime").datetime.now(tz=__import__("datetime").timezone.utc)`. The `datetime` module is already imported at the top of the file on line 6.
**Goal:** Replace the `__import__` calls with the already-imported `datetime` name: `next_run_time=datetime.datetime.now(tz=datetime.timezone.utc)`.
**Possible traps and issues:**
- This is a trivial one-line change. Confirm the module-level `import datetime` exists and is not inside a `TYPE_CHECKING` block.
- Run the health-check related tests to confirm the scheduler still fires on startup.
**Docs changes needed:** None.
**Why this is needed:**`__import__()` is an implementation detail of Python's import system and should never appear in application code. It obscures the dependency, bypasses linting checks, and signals that the code was written reactively to solve a circular import that no longer exists.
---
### Task 19 — Remove the `setup_service` misuse in `config_misc.py` for runtime map-color settings
**Found in:**`backend/app/routers/config_misc.py` uses `setup_service.get_map_color_thresholds()` and `setup_service.update_map_color_thresholds()` for the ongoing `/api/config/map-colors` endpoints. `setup_service` is intended for first-run setup only.
**Goal:** After Task 7 consolidates the map-color functions into the canonical runtime service, update `config_misc.py` to call that service instead of `setup_service`. Ensure `setup_service` no longer exports these functions publicly.
**Possible traps and issues:**
- This task is a follow-on to Task 7; do not attempt it before Task 7 is complete.
- If `setup_service` still needs these functions during first-run setup, it should call the canonical service rather than owning the implementation.
- Router tests that mock `setup_service.get_map_color_thresholds` will need updating.
**Why this is needed:** Using a first-run setup service for ongoing runtime operations misleads developers into modifying the wrong service when behaviour needs to change. It also creates an implicit dependency between setup state and runtime configuration.
---
### Task 20 — Add global exception handlers for domain exceptions in `main.py`
**Found in:**`backend/app/main.py`. Domain exceptions raised by services (`JailNotFoundError`, `ConfigValidationError`, `ConfigWriteError`, `FilterNotFoundError`, etc.) are caught individually with repetitive try/except blocks in each router handler. There are no global `@app.exception_handler` registrations for domain exceptions beyond `Fail2BanConnectionError` and `Fail2BanProtocolError`.
**Goal:** Register `@app.exception_handler` entries in `main.py` for the most commonly caught domain exception classes (at minimum: `JailNotFoundError` → 404, `ConfigValidationError` → 400, `ConfigWriteError` → 500). This allows router handlers to let these exceptions propagate naturally, removing boilerplate try/except blocks.
**Possible traps and issues:**
- Not all routers want the default status code; some endpoints return 409 or 400 for conditions that another endpoint maps to 500. Global handlers are only appropriate for exceptions with a consistent HTTP mapping across all consumers.
- Introduce global handlers incrementally: add the handler first, then remove the try/except from one router at a time and verify the test suite still passes.
- Ensure the handler response format matches the existing `{"detail": "…"}` convention so the frontend does not need updating.
-`HTTPException` must remain a local catch and not be accidentally swallowed by a broad domain handler.
**Why this is needed:** The same try/except→HTTPException conversion is duplicated across every router endpoint. Global handlers reduce boilerplate, make error mapping auditable in one place, and prevent inconsistencies where the same exception produces different status codes in different routes.
---
### Task 21 — Restrict tasks to opening their own DB connections only where justified; document the pattern
**Found in:**`backend/app/tasks/blocklist_import.py`, `geo_cache_flush.py`, `geo_re_resolve.py`, and `history_sync.py`. Each task calls `open_db(settings.database_path)` directly rather than receiving a DB connection via dependency injection. This is structurally inconsistent with the request-level `get_db` dependency used by routers.
**Goal:** Document in `Docs/Architekture.md` that background tasks are intentionally outside FastAPI's dependency injection scope and must manage their own DB connections. Add a short shared helper (e.g. `tasks/_db.py` or reuse `open_db`) with a consistent context-manager pattern so all four tasks open/close connections the same way. If a task already has correct cleanup (try/finally close), confirm it and leave it; do not change what works.
**Possible traps and issues:**
- Do not attempt to force APScheduler background jobs through FastAPI's DI system — this is not supported without a running request context and would require thread-unsafe hacks.
- Focus on consistency: all tasks should use the same open/close pattern and handle exceptions the same way.
- If a task currently leaks a connection on exception, fix the cleanup; if it is correct, only add the documentation.
**Docs changes needed:** Add a "Background Tasks and Database Access" section to `Docs/Architekture.md` explaining why tasks own their connections and how to write a new task.
**Why this is needed:** Without explanation, the inconsistency between router DI-provided connections and task-managed connections looks like an oversight. Documentation prevents future developers from incorrectly trying to inject a DB connection into a task via `Depends`, which would fail silently at runtime.
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.