Fix setup_service to mark setup_complete only after successful runtime DB init

This commit is contained in:
2026-04-12 20:30:22 +02:00
parent e6df045e5e
commit 8e43ef9ad2
4 changed files with 758 additions and 26 deletions

View File

@@ -6,4 +6,703 @@ Reference: `Docs/Refactoring.md` for full analysis of each issue.
---
## Open Issues
---
### Task 1 — Fix setup_service: mark setup_done only after successful DB init
**Status:** Completed
**Severity:** Bug
**Where:**
`backend/app/services/setup_service.py``run_setup()` function, last few lines where `settings_repo.set_setting(db, _KEY_SETUP_DONE, "1")` is called unconditionally.
**Goal:**
`_KEY_SETUP_DONE` must only be written to the bootstrap DB after `_ensure_database_initialized()` has confirmed the runtime database was successfully created and all initial settings were persisted. The write must be the final step in a successful path, not an unconditional epilogue.
**Possible traps and issues:**
- The `_ensure_database_initialized()` helper returns `False` on failure rather than raising — callers must check the return value explicitly rather than relying on exceptions.
- The bootstrap DB and the runtime DB are two different connections; make sure the `set_setting` call targets the bootstrap DB (the one that holds `setup_completed`) and not the runtime DB.
- `run_setup()` is not idempotent by design (it raises if called twice), so a failed first run that incorrectly sets the flag will permanently block the setup wizard without a manual DB reset.
**Docs changes needed:**
`Docs/Architekture.md` section 6 "Authentication & Session Management" describes the setup-completion flag. The note about `is_setup_complete()` caching should clarify that the flag is only written after a fully successful setup.
**Why this is needed:**
If the runtime DB creation fails (e.g., path does not exist, permission denied) but `setup_completed = 1` is still written to the bootstrap DB, every subsequent request will pass the setup-completion check, `get_runtime_database_path()` will return the invalid path, and opening that DB will throw a 503 error on every request with no way to re-run setup through the UI.
---
### Task 2 — Fix log_service.preview_log: stale run_blocking call site
**Severity:** Bug
**Where:**
`backend/app/services/log_service.py` line 74 — the call `await run_blocking(None, _read_tail_lines, str(path), req.num_lines)`.
**Goal:**
Change the call to `await run_blocking(_read_tail_lines, str(path), req.num_lines)`. The `run_blocking` signature is `run_blocking(func, *args, *, executor=None)` — the first argument is the callable, not an executor. The `None` was left over from an older signature where `executor` was the first positional argument.
**Possible traps and issues:**
- `asyncio.to_thread(None, ...)` does not raise immediately at the `await` site; it may produce a confusing `TypeError: 'NoneType' object is not callable` at the point where the thread runs, which can be hard to trace if logs are sparse.
- Verify no other call sites in `log_service.py` or any future callers of `run_blocking` use the old positional-executor convention.
- After fixing, run the log-preview endpoint manually or via the test suite to confirm the fix.
**Docs changes needed:**
None — this is a straightforward call-site correction, not a design change.
**Why this is needed:**
The `GET /api/config` log preview endpoint (`POST /api/config/preview-log`) calls `preview_log()`, which will crash with a `TypeError` on every invocation. The feature has been silently broken since the `run_blocking` signature was changed.
---
### Task 3 — Fix filter_config router: import ConfigWriteError from the correct module
**Severity:** Bug
**Where:**
`backend/app/routers/filter_config.py` line 8 — `from app.services.config_file_service import ConfigWriteError`.
Compare with `backend/app/routers/action_config.py` line 8 — `from app.exceptions import ConfigWriteError`.
`filter_config_service` raises its own internal exceptions that may or may not match the imported class identity.
**Goal:**
All routers must catch `ConfigWriteError` from `app.exceptions` — the single canonical definition. The `filter_config.py` import must be changed to `from app.exceptions import ConfigWriteError`. The same class must also be the one actually raised by `filter_config_service` (verify the service raises from `app.exceptions`, not from a local definition).
**Possible traps and issues:**
- Python exception matching uses class identity (`isinstance`), not structural typing. If the service raises `config_file_service.ConfigWriteError` but the router catches `exceptions.ConfigWriteError`, the `except` block silently misses and the exception propagates to the global 500 handler.
- `filter_config_service.py` imports `ConfigWriteError` from `config_file_service` (line 8 of that file). After fixing the router, also ensure the service raises the canonical one from `app.exceptions`.
- This fix depends on Task 4 (consolidating all exception classes into `app.exceptions`) being done first or in parallel.
**Docs changes needed:**
`Docs/Architekture.md` section 2.2 services table — add a note that all domain exceptions are raised from `app.exceptions`, not from individual service modules.
**Why this is needed:**
Any config file write error in a filter operation (bad content, permission denied, disk full) will be swallowed and returned as HTTP 500 instead of the correct 400/503. Users will see an unhelpful "unexpected error" message with no actionable detail.
---
### Task 4 — Consolidate all exception classes into app.exceptions
**Severity:** High
**Where:**
`backend/app/exceptions.py` — canonical location, currently has `JailNotFoundInConfigError`, `ConfigWriteError`, `ConfigValidationError`, `ConfigOperationError`, and others.
Duplicates in:
- `backend/app/services/config_file_service.py` lines 128, 167, 171 — `JailNotFoundInConfigError`, `JailNameError`, `ConfigWriteError`
- `backend/app/services/jail_config_service.py` lines 84, 123, 127 — same three classes
- `backend/app/services/filter_config_service.py` lines 45, 95 — `FilterNotFoundError`, `JailNameError`
- `backend/app/services/action_config_service.py` lines 44, 116 — `ActionNotFoundError`, `JailNameError`
- `backend/app/services/config_file_service.py` lines 1611 and 2223 — `FilterNotFoundError`, `ActionNotFoundError`
**Goal:**
Every exception class lives exactly once in `app.exceptions`. All services and routers import exceptions from there. Local class definitions in service files are removed. The complete canonical list: `JailNotFoundError`, `JailNotFoundInConfigError`, `JailNameError`, `JailAlreadyActiveError`, `JailAlreadyInactiveError`, `JailOperationError`, `ConfigValidationError`, `ConfigOperationError`, `ConfigWriteError`, `FilterNotFoundError`, `FilterInvalidRegexError`, `FilterAlreadyExistsError`, `FilterNameError`, `FilterReadonlyError`, `ActionNotFoundError`, `ActionAlreadyExistsError`, `ActionNameError`, `ActionReadonlyError`, `ServerOperationError`.
**Possible traps and issues:**
- When removing a local class definition from a service, check every `except` clause in every router and other service that previously imported that class from the service module — those imports must be updated to `app.exceptions`.
- `filter_config.py` and `action_config.py` import `JailNameError` and `JailNotFoundInConfigError` from `jail_config_service` — these must also move to `app.exceptions`.
- Run the full test suite after each service is migrated, not at the end, to catch regressions early.
- Pay attention to `JailAlreadyActiveError` and `JailAlreadyInactiveError` — these are currently only in `jail_config_service` and not yet in `app.exceptions`.
**Docs changes needed:**
`Docs/Architekture.md` section 2.2 — add `exceptions.py` to the utils description table, noting it is the single source of truth for all domain exceptions.
**Why this is needed:**
With multiple copies of the same class name in different modules, `isinstance` checks silently fail when a service raises class A but the router catches class B with the same name. The result is unexpected 500 errors that are hard to reproduce and diagnose.
---
### Task 5 — Resolve config_file_service.py god object and dual implementations
**Severity:** High
**Where:**
`backend/app/services/config_file_service.py` — 3126 lines embedding jail, filter, and action domain logic.
`backend/app/services/jail_config_service.py` — 1066 lines.
`backend/app/services/filter_config_service.py` — 927 lines.
`backend/app/services/action_config_service.py` — 1070 lines.
`backend/app/helpers/config_file_helpers.py` — re-exports private symbols from `config_file_service`.
**Goal:**
Establish one canonical implementation per domain. The intended split (per `Docs/Architekture.md`) is:
- `config_file_service.py` — shared infrastructure only: config file parsing, atomic writes, path helpers, `_parse_jails_sync`, `_build_parser`, `_safe_jail_name`, etc.
- `jail_config_service.py` — jail activation/deactivation, `activate_jail`, `deactivate_jail`, `validate_jail_config`, `rollback_jail`, `list_inactive_jails`.
- `filter_config_service.py` — filter lifecycle: `list_filters`, `get_filter`, `create_filter`, `update_filter`, `delete_filter`, `assign_filter_to_jail`.
- `action_config_service.py` — action lifecycle: same shape as filter.
Duplicate function bodies in `jail_config_service.py` that duplicate what is already in `config_file_service.py` must be removed. The canonical versions stay in `config_file_service.py` (or are promoted to `app/helpers/` if they are shared infrastructure). `config_file_helpers.py` should be able to import from `config_file_service` without re-exporting private names.
**Possible traps and issues:**
- The routers currently import from `jail_config_service` (e.g., `jail_config.py` line 33). After removing duplicates, confirm all router imports still resolve.
- `start_daemon()` and `wait_for_fail2ban()` are duplicated verbatim in two files — decide on one owner (likely `config_file_service`) and delete the other copy; update all callers.
- This is the largest refactoring task. Do it in stages: first remove class duplicates (covered by Task 4), then consolidate functions one group at a time, running tests after each group.
- The `helpers/config_file_helpers.py` re-exports private symbols (`_parse_jails_sync` etc.) because the domain services need them. After this task those symbols should be importable directly from `config_file_service` without being private, or should be moved to a proper shared util module.
**Docs changes needed:**
`Docs/Architekture.md` section 2.2 services table — update the description of `config_file_service.py`, `jail_config_service.py`, `filter_config_service.py`, `action_config_service.py` to reflect the final split after the refactor.
**Why this is needed:**
7201 lines across 5 files with two active implementations of the same logic is the dominant source of maintenance risk in the backend. Any bug fix or feature to jail/filter/action config must be applied in multiple places, and it is currently unclear which implementation is authoritative.
---
### Task 6 — Extract fail2ban response helpers into a shared util module
**Severity:** High
**Where:**
16 copies across 8 service files:
- `_ok()` in `jail_service.py:90`, `config_service.py:77`, `config_file_service.py:570` (inner fn), `server_service.py:64`, `health_service.py:33`
- `_to_dict()` in `jail_service.py:113`, `config_service.py:98`, `health_service.py:60`
- `_ensure_list()` in `jail_service.py:134`, `config_service.py:112`
- `_is_truthy()` in `config_file_service.py:294`, `jail_config_service.py:169`, `filter_config_service.py:157`, `action_config_service.py:158`
- `_is_not_found_error()` in `jail_service.py:155`, `config_service.py:147`
**Goal:**
Create `backend/app/utils/fail2ban_utils.py` (or extend `fail2ban_client.py`) with public functions:
- `unwrap_response(response) -> object` (the current `_ok` logic)
- `response_to_dict(pairs) -> dict[str, object]` (the current `_to_dict` logic)
- `response_to_list(value) -> list[str]` (the current `_ensure_list` logic)
- `is_truthy(value: str) -> bool` (the current `_is_truthy` logic)
- `is_not_found_error(exc: Exception) -> bool` (the current `_is_not_found_error` logic)
All 8 service files then import from this one location and delete their local copies.
**Possible traps and issues:**
- `_ok()` in `config_file_service.py` is defined as an inner function inside a specific method, not at module level — it must be extracted before it can be replaced.
- Check all call sites for subtle differences: `_ok` in some files casts the return, in others it does not. Ensure the shared version covers all variants.
- `_is_truthy` is always paired with `_TRUE_VALUES = frozenset({"true", "yes", "1"})` — move both together. This overlaps with Task 9 (consolidating `_TRUE_VALUES`).
- After replacement, run the full test suite — a signature mismatch will cause runtime `TypeError`s that may not be caught by static analysis.
**Docs changes needed:**
`Docs/Architekture.md` section 2.2 utils table — add `fail2ban_utils.py` (or document the extension to `fail2ban_client.py`) and describe the response-parsing helpers.
**Why this is needed:**
A future bug fix to the pickle-response parsing logic must be applied in 16 places. A discrepancy between copies already exists: the inner `_ok` in `config_file_service.py` does not cast the return, while the top-level versions do.
---
### Task 7 — Adopt constants.py FAIL2BAN_SOCKET_TIMEOUT_SECONDS everywhere
**Severity:** High
**Where:**
`backend/app/utils/constants.py` line 16 — `FAIL2BAN_SOCKET_TIMEOUT_SECONDS: Final[float] = 5.0` (never imported by any service).
Hardcoded local constants in:
- `backend/app/services/jail_service.py:65``10.0`
- `backend/app/services/server_service.py:31``10.0`
- `backend/app/services/jail_config_service.py:58``10.0`
- `backend/app/services/config_service.py:63``10.0`
- `backend/app/services/config_file_service.py:108``10.0`
- `backend/app/services/action_config_service.py:93``10.0`
- `backend/app/services/ban_service.py:55``5.0`
- `backend/app/services/health_service.py:30``5.0`
- `backend/app/services/fail2ban_metadata_service.py:65``5.0` (inline literal, not even a named constant)
**Goal:**
`constants.py` should export two constants:
- `FAIL2BAN_SOCKET_TIMEOUT_SECONDS: Final[float] = 10.0` — standard operational timeout
- `FAIL2BAN_SOCKET_FAST_TIMEOUT_SECONDS: Final[float] = 5.0` — for health probes and metadata queries that should fail quickly
Every service deletes its local `_SOCKET_TIMEOUT` definition and imports the appropriate constant. Decide intentionally which services should use the fast vs standard timeout rather than having the difference be accidental.
**Possible traps and issues:**
- The current value in `constants.py` is `5.0` but most services use `10.0`. Changing to a single value would alter the behaviour of health probes or operational commands. Use two explicit constants to preserve the intentional difference.
- `fail2ban_metadata_service.py` hardcodes `5.0` as an inline literal inside `_resolve_db_path()` with no named constant at all — easy to miss when grepping for `_SOCKET_TIMEOUT`.
- Static analysis (mypy/ruff) will not catch this — must be found by grep or manual review.
**Docs changes needed:**
`Docs/Architekture.md` section 4.3 fail2ban communication — mention the timeout constants and their purpose.
**Why this is needed:**
The current inconsistency (five services at 10 s, three at 5 s) appears to be accidental. A planned change to socket timeout policy requires touching 8+ files instead of one. A misconfigured timeout on a slow host will cause silent bans to fail; having one value to change makes this tunable.
---
### Task 8 — Remove app/helpers/ indirection layer; fix circular imports structurally
**Severity:** Medium
**Where:**
`backend/app/helpers/config_file_helpers.py` — 18 lines, imports 5 private symbols from `config_file_service` and re-exports them unchanged.
`backend/app/helpers/log_helpers.py` — 22 lines, wraps `log_service.preview_log` and `log_service.test_regex` one-for-one.
`backend/app/helpers/jail_helpers.py` — wraps `jail_service.reload_all` as `reload_jails`.
**Goal:**
Eliminate the helpers layer entirely by fixing the underlying circular imports that made it necessary:
1. `config_file_helpers.py` exists because `jail_config_service`, `filter_config_service`, `action_config_service` need private symbols from `config_file_service` but importing directly would create a cycle (after Task 5 splits the god object, the symbols these services need should live in a proper shared location and be importable directly).
2. `log_helpers.py` exists because `config_service` imports from `log_service` but some path would create a cycle. Move `test_regex` and `preview_log` to `log_service` and import directly from there.
3. `jail_helpers.py` is a pointless one-function wrapper. Import `jail_service.reload_all` directly.
After the underlying causes are fixed, delete all three helper files and update all import sites.
**Possible traps and issues:**
- Do not delete helpers until the circular imports are resolved — doing so in the wrong order will cause `ImportError` at startup.
- Check `__all__` in `config_file_helpers.py` — it exports symbols that were previously accessed by external code. After removal, all consumers must be updated to import from the canonical location.
- This task should follow Task 5 (god object split) since that is the root cause of most of the cycles.
**Docs changes needed:**
`Docs/Architekture.md` section 2.1 — remove the `helpers/` directory from the project structure diagram, or rename it to document its true role.
**Why this is needed:**
The helpers layer hides import cycles rather than eliminating them and adds three levels of indirection to reach simple utility functions. A new developer reading the import chain cannot understand why the layer exists.
---
### Task 9 — Consolidate _TRUE_VALUES into constants.py
**Severity:** Low (part of Task 6 cleanup)
**Where:**
`_TRUE_VALUES = frozenset({"true", "yes", "1"})` defined in:
- `backend/app/services/config_file_service.py:117`
- `backend/app/services/jail_config_service.py:67`
- `backend/app/services/action_config_service.py:105`
`filter_config_service.py:35` imports it as a private symbol: `from app.services.config_file_service import _TRUE_VALUES`.
**Goal:**
Add `FAIL2BAN_TRUTHY_VALUES: Final[frozenset[str]] = frozenset({"true", "yes", "1"})` to `backend/app/utils/constants.py`. All four service files import from there; local definitions and the cross-service private import are removed.
**Possible traps and issues:**
- The cross-service import of a private symbol (`_TRUE_VALUES` with leading underscore) breaks the layering contract — if `config_file_service` ever moves or renames the constant, `filter_config_service` silently breaks at runtime. Fix the import in `filter_config_service` as part of this task.
- This constant is closely coupled with `_is_truthy()` (Task 6) — consider combining both changes.
**Docs changes needed:**
`Docs/Architekture.md` section 2.2 utils — note that `constants.py` also holds the shared set of truthy string values used by config parsers.
**Why this is needed:**
Importing a private symbol from a peer service module is a layering violation. If the implementation ever changes (e.g., adding "on" as a truthy value), all four definitions must be found and updated.
---
### Task 10 — Replace ban_service deferred local imports with top-level imports
**Severity:** Medium
**Where:**
`backend/app/services/ban_service.py` lines 172, 340, 596, 695 — four separate `from app.repositories.history_archive_repo import ...` statements inside function bodies.
**Goal:**
Move all four `history_archive_repo` imports to the top of `ban_service.py`. If a true circular import exists, resolve it structurally (move the shared dependency to a lower-level module both can import) rather than deferring.
**Possible traps and issues:**
- Deferred imports suggest a hidden circular dependency: `ban_service``history_archive_repo``ban_service` or similar. Before moving the import, trace the full import chain to confirm. If a cycle exists, it must be broken (e.g., by extracting a shared data model or helper that neither module imports the other for).
- `history_archive_repo` should be a pure data-access module — it should not import from any service. If it does, that is the actual bug to fix.
- Moving imports is safe if there is no cycle; confirm by running `python -c "from app.services import ban_service"` after the change.
**Docs changes needed:**
None.
**Why this is needed:**
Deferred function-body imports hide module-level code dependencies, break standard import analysis tools, and run the import machinery on every call to those functions (Python memoises the result, but the attribute lookup overhead and code clarity cost are real). Four identical deferred imports in one file indicate a structural problem.
---
### Task 11 — Replace history_sync.py deferred import inside loop
**Severity:** Medium
**Where:**
`backend/app/tasks/history_sync.py` line 79 — `from app.repositories.history_archive_repo import archive_ban_event` inside a `while True:` loop body.
**Goal:**
Move the import to top-level. Same structural fix as Task 10 — if a circular import prevents this, break the cycle instead of deferring.
**Possible traps and issues:**
- This import fires on every iteration of the pagination loop (potentially hundreds of times per sync run). Python caches the module after the first import so there is no repeated I/O cost, but the `sys.modules` lookup and attribute access on every loop iteration is unnecessary overhead.
- If moving the import to the top causes a circular import error, the circular dependency must be resolved before this task is closed.
**Docs changes needed:**
None.
**Why this is needed:**
An import statement inside a loop is an unconventional pattern that confuses readers and signals an unresolved dependency issue. Standard tools (`isort`, `ruff`) will flag it.
---
### Task 12 — Replace blocklist_service importlib workaround with proper injection
**Severity:** Medium
**Where:**
`backend/app/services/blocklist_service.py` lines 364366:
```python
jail_svc = importlib.import_module("app.services.jail_service")
ban_ip_fn = getattr(jail_svc, "ban_ip")
```
This was used as a workaround for a circular import between `blocklist_service` and `jail_service`.
**Goal:**
The `import_source()` function already accepts an optional `ban_ip` parameter for testing. Make this parameter required (not optional) at the service level and always pass `jail_service.ban_ip` from the router or from the scheduled task that calls `import_source`. Remove the `importlib` fallback entirely.
**Possible traps and issues:**
- All callers of `import_source()` (the scheduled blocklist task `tasks/blocklist_import.py` and the manual-trigger router endpoint) must be updated to pass `ban_ip=jail_service.ban_ip`.
- After making the parameter required, mypy will flag any caller that does not pass it.
- Verify the circular import that originally motivated this workaround — if `blocklist_service` imports `jail_service` at the top of the file and `jail_service` imports `blocklist_service`, address the cycle properly (the dependency should only flow one way: `blocklist_service` depends on `jail_service`, not the reverse).
**Docs changes needed:**
None.
**Why this is needed:**
`importlib.import_module` with a string breaks static analysis, IDE navigation, and refactoring tools. The function effectively already supports injection but bypasses it in production. The workaround made the production code path differ from the tested code path.
---
### Task 13 — Wire geo_batch_lookup through dependency injection
**Severity:** Medium
**Where:**
Five call sites where `geo_batch_lookup=geo_service.lookup_batch` is passed explicitly as an argument:
- `backend/app/routers/dashboard.py:126` and `:178`
- `backend/app/routers/bans.py:80`
- `backend/app/routers/blocklist.py:143`
- `backend/app/routers/jails.py:602`
**Goal:**
Add a `GeoBatchLookupDep` type alias and `get_geo_batch_lookup()` dependency provider to `backend/app/dependencies.py`, similar to existing deps like `Fail2BanSocketDep`. Routers declare `geo_lookup: GeoBatchLookupDep` in their handler signatures and stop passing `geo_service.lookup_batch` manually. Service functions that accept `geo_batch_lookup` as a parameter should continue doing so (good for testability) — the injection just moves from the router body to the dependency declaration.
**Possible traps and issues:**
- The callable type annotation for `geo_batch_lookup` needs to be precise enough for mypy. Either use a `Protocol` or import the exact type from `geo_service`.
- Some service functions accept `geo_batch_lookup` as a required parameter; after this change, always confirm those services are called with it.
- This is low-risk but touches five routers and `dependencies.py` — run the full test suite after.
**Docs changes needed:**
`Docs/Architekture.md` section 2.2 dependencies — mention the geo lookup dependency provider.
**Why this is needed:**
Manually threading `geo_service.lookup_batch` at call sites is not dependency injection — it is partial application scattered across files. Any future change to the geo lookup implementation (switching provider, adding caching policy) requires touching five routers instead of one dependency function.
---
### Task 14 — Fix router prefix structure for config sub-routers
**Severity:** Medium
**Where:**
`backend/app/routers/config.py` — this aggregator router has `prefix="/api/config"` and includes four sub-routers. The sub-routers are defined as `router = APIRouter()` with no prefix:
- `backend/app/routers/jail_config.py:42`
- `backend/app/routers/filter_config.py:26`
- `backend/app/routers/action_config.py:25`
- `backend/app/routers/config_misc.py:23`
Route paths like `"/jails"` in `jail_config.py` resolve to `/api/config/jails` only by implicit inheritance from the parent.
**Goal:**
Each sub-router should declare its own prefix explicitly (e.g. `APIRouter(prefix="/jails", tags=["Jail Config"])`). The aggregator `config.py` then calls `router.include_router(jail_config.router)` without any additional prefix argument. This makes each router self-describing and independently testable.
**Possible traps and issues:**
- Assert that the final URL paths are identical before and after the change. Use FastAPI's route inspection (`app.routes`) in a test to validate.
- `config_misc.py` routes may not have a natural single sub-prefix — list its routes first and decide on a grouping.
- Frontend `src/api/endpoints.ts` hard-codes the final URL paths; changing prefixes here will break the frontend unless the paths remain the same.
**Docs changes needed:**
`Docs/Architekture.md` section 8.2 endpoint groups — verify the listed paths match the actual registered routes after the fix.
**Why this is needed:**
A sub-router with no prefix is not independently testable via `TestClient(sub_router_app)` because its routes resolve to bare paths like `/jails` instead of `/api/config/jails`. Implicit inheritance from a parent makes the actual URL opaque to anyone reading the sub-router file.
---
### Task 15 — Remove 7 ghost imports from routers/config.py
**Severity:** Medium
**Where:**
`backend/app/routers/config.py` lines 613:
```python
from app.services import (
action_config_service,
config_file_service,
config_service,
filter_config_service,
jail_config_service,
jail_service,
log_service,
) # noqa: F401
```
None of these are referenced in the file body.
**Goal:**
Remove the entire import block. If these imports are present to force module initialisation for side effects, document exactly what side effect is required and trigger it explicitly in the startup lifespan instead.
**Possible traps and issues:**
- If any of these modules register something at import time (class registries, decorators, etc.), removing the import will silently skip that registration. Audit each module for module-level side effects before deleting the import.
- The `# noqa: F401` suppresses the linter warning that would normally catch unused imports — this is a code smell that was masking the issue.
- After removing, run the full application startup and hit all config endpoints to confirm no registration was skipped.
**Docs changes needed:**
None.
**Why this is needed:**
Imports for side effects are an anti-pattern that creates hidden startup dependencies. They suppress linter warnings, make the code misleading, and force unnecessary module-level execution at startup even when the imports are truly vestigial.
---
### Task 16 — Reset _backend_cmd_supported between test runs
**Severity:** Medium
**Where:**
`backend/app/services/jail_service.py` lines 7778:
```python
_backend_cmd_supported: bool | None = None
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()
```
Module-level mutable globals that persist for the lifetime of the Python process.
**Goal:**
Provide a `_reset_capability_cache()` function (or use `importlib.reload` in tests) so the test suite can reset `_backend_cmd_supported` to `None` between test cases. Alternatively, move the cache into the `Fail2BanMetadataService`-style class instance (similar to how `default_fail2ban_metadata_service` is structured) so tests can construct a fresh instance per test.
**Possible traps and issues:**
- `asyncio.Lock()` objects in Python 3.10+ are not bound to the running event loop at creation, so creating them at module level is safe. But if a test creates a new event loop (common with `pytest-asyncio`), an existing `asyncio.Lock` may behave unexpectedly.
- The cleanest fix is to encapsulate the cache and lock into a class, instantiate it as a module-level singleton identical to `Fail2BanMetadataService`, and expose a `invalidate_backend_capability_cache()` function for test teardown.
**Docs changes needed:**
None.
**Why this is needed:**
A test that forces `_backend_cmd_supported = False` (by mocking a socket error) will corrupt the capability cache for all subsequent tests in the same process. This produces flaky test failures that are very hard to diagnose.
---
### Task 17 — Remove unittest.mock.Mock check from production runtime_state
**Severity:** Low
**Where:**
`backend/app/utils/runtime_state.py` lines 1516 and line 107:
```python
from unittest.mock import Mock as _Mock
...
if runtime_settings is not None and _Mock is not None and isinstance(runtime_settings, _Mock):
return get_app_settings(app)
```
**Goal:**
Remove `_Mock` entirely from `runtime_state.py`. The test-isolation problem it was solving should be handled in test fixtures by injecting a real `Settings` instance rather than a `Mock`. Tests that currently rely on this fallback should be updated to use `create_app(settings=my_test_settings)` with a real test settings object.
**Possible traps and issues:**
- Find all tests that pass a `Mock` as `runtime_settings` and replace the mock with a `Settings(**test_overrides)` instance. This is the correct approach.
- After removing the check, run the full test suite. Any test that previously relied on the `Mock` detection will fail fast rather than silently falling back.
**Docs changes needed:**
None.
**Why this is needed:**
`unittest.mock` is a test library. Its presence in a production module is a layering violation that adds import overhead at every startup and puts test scaffolding logic in a path that runs on every request.
---
### Task 18 — Move _COOKIE_NAME to constants.py
**Severity:** Low
**Where:**
- `backend/app/dependencies.py:61``_COOKIE_NAME = "bangui_session"`
- `backend/app/routers/auth.py:29``_COOKIE_NAME = "bangui_session"`
**Goal:**
Add `SESSION_COOKIE_NAME: Final[str] = "bangui_session"` to `backend/app/utils/constants.py`. Both files import from there and remove their local definition.
**Possible traps and issues:**
- Search for any other files that hard-code the string `"bangui_session"` directly (e.g., in tests) and update those too.
- The cookie name is also referenced in the frontend (`document.cookie` or fetch clients) — verify it matches.
**Docs changes needed:**
None.
**Why this is needed:**
If the cookie name ever needs to change (e.g., to support multi-tenant deployments or versioning), it must currently be changed in two backend files and potentially in frontend code. A single constant with a clear name is self-documenting.
---
### Task 19 — Remove dead loop variable and asyncio import in auth_service
**Severity:** Low
**Where:**
`backend/app/services/auth_service.py` lines 10 and 75:
```python
import asyncio # line 10
...
loop = asyncio.get_running_loop() # line 75 — assigned but never used
return await run_blocking(lambda: bool(bcrypt.checkpw(plain_bytes, hashed_bytes)))
```
**Goal:**
Delete line 75 (`loop = asyncio.get_running_loop()`). Remove `import asyncio` from line 10 since it will then be unused.
**Possible traps and issues:**
- Verify `asyncio` is not used anywhere else in the file before removing the import.
- The `run_blocking(lambda: ...)` call relies on `asyncio.to_thread` internally — this is correct and does not need the `loop` variable.
**Docs changes needed:**
None.
**Why this is needed:**
Dead code misleads readers into thinking `loop` is used for something. The original intent was likely `loop.run_in_executor(...)` before `run_blocking` was introduced, but the old assignment was never cleaned up.
---
### Task 20 — Offload ensure_jail_configs to a thread at startup
**Severity:** Low
**Where:**
`backend/app/startup.py` line 120:
```python
ensure_jail_configs(Path(settings.fail2ban_config_dir) / "jail.d")
```
`ensure_jail_configs` in `backend/app/utils/jail_config.py` performs synchronous filesystem writes (`Path.write_text`, `Path.mkdir`) directly in the async startup path.
**Goal:**
Wrap the call in `await asyncio.to_thread(ensure_jail_configs, Path(settings.fail2ban_config_dir) / "jail.d")` or use `await run_blocking(ensure_jail_configs, ...)` to offload the blocking I/O to a thread pool.
**Possible traps and issues:**
- `ensure_jail_configs` is currently synchronous (`def`, not `async def`). It should remain synchronous — just call it via `asyncio.to_thread`.
- On typical local filesystems this completes in microseconds, but on NFS or Docker-volume mounts with high I/O latency it can block the event loop for tens of milliseconds, preventing any other startup task from running.
- Ensure the call remains before the scheduler is started (the blocklist-import task depends on the jail config files being present).
**Docs changes needed:**
None.
**Why this is needed:**
Synchronous I/O in the async startup path violates the "Async Everything" design principle (Architecture doc section 10). It blocks the event loop during startup and sets a bad precedent for future startup tasks.
---
### Task 21 — Consolidate start_daemon / wait_for_fail2ban duplicates
**Severity:** Low
**Where:**
`start_daemon()` and `wait_for_fail2ban()` are defined identically in:
- `backend/app/services/config_file_service.py` (lines approx. 560620)
- `backend/app/services/jail_config_service.py`
**Goal:**
Keep one canonical implementation in `config_file_service.py` (as the shared infrastructure layer). Delete the copy from `jail_config_service.py`. Update `jail_config_service` to call the version from `config_file_service` directly (no intermediary helper).
**Possible traps and issues:**
- After Task 5 splits the god object, confirm whether `start_daemon` and `wait_for_fail2ban` belong in `config_file_service` (shared infra) or in a dedicated `fail2ban_lifecycle.py` util. Either placement is fine as long as there is only one.
- Check if any tests import `start_daemon` or `wait_for_fail2ban` from `jail_config_service` directly — those tests will need updating.
**Docs changes needed:**
None.
**Why this is needed:**
Any change to the daemon-start retry logic (timeouts, exception handling) must currently be applied twice. This will diverge over time as the two copies evolve independently.
---
### Task 22 — Remove DEFAULT_BLOCKING_EXECUTOR dead code
**Severity:** Low
**Where:**
`backend/app/utils/async_utils.py` lines 1619:
```python
DEFAULT_BLOCKING_EXECUTOR: ThreadPoolExecutor = ThreadPoolExecutor(
max_workers=16,
thread_name_prefix="bangui-blocking",
)
```
`run_blocking()` ignores this when `executor=None` (the default at every call site) and delegates to `asyncio.to_thread` instead. `DEFAULT_BLOCKING_EXECUTOR` is never passed to any `run_blocking` call.
**Goal:**
Either wire `DEFAULT_BLOCKING_EXECUTOR` as the actual default used inside `run_blocking` (replacing `asyncio.to_thread`) to get the deliberate size-16 pool, or delete the constant entirely if `asyncio.to_thread` (default executor, size = `min(32, os.cpu_count() + 4)`) is acceptable.
**Possible traps and issues:**
- If the 16-worker pool was intentional (to limit concurrent blocking operations under high load), then `run_blocking` should default to using it: `return await loop.run_in_executor(DEFAULT_BLOCKING_EXECUTOR, functools.partial(func, *args, **kwargs))`.
- `asyncio.to_thread` uses the loop's default executor which is configured at the OS level and may be smaller than 16. If blocking I/O throughput is a concern, the named pool is better.
- `ThreadPoolExecutor` created at module level is never explicitly shut down — add a shutdown hook in the app lifespan if the pool is kept.
**Docs changes needed:**
None.
**Why this is needed:**
A named 16-worker thread pool that is never actually used consumes OS thread resources from startup for zero benefit. It also misleads readers who assume `run_blocking` uses it.
---
### Task 23 — Fix InMemorySessionCache instantiated when disabled
**Severity:** Low
**Where:**
`backend/app/main.py` line 296:
```python
app.state.session_cache = InMemorySessionCache()
```
This is unconditional — `InMemorySessionCache` is always created even when `settings.session_cache_enabled = False`.
**Goal:**
Instantiate the cache conditionally: use `InMemorySessionCache()` when `session_cache_enabled` is True and a `NoOpSessionCache` (a minimal `SessionCache`-protocol-compatible stub that always returns `None` from `get` and does nothing on `set`/`invalidate`) otherwise. The `get_session_cache` dependency in `dependencies.py` can then always return the cache without a None check because the correct implementation (real or no-op) is always present.
**Possible traps and issues:**
- The null-cache pattern avoids the `if cache_enabled:` checks scattered in `require_auth`. After this change, move all cache logic into the cache implementations — `require_auth` calls `cache.get(token)` unconditionally and the no-op always returns `None`.
- `get_session_cache` currently raises HTTP 503 if `app_context.session_cache is None` — after this change the cache is always non-None; remove that guard.
**Docs changes needed:**
`Docs/Architekture.md` section 6 — update the session validation cache note to describe the no-op pattern used for disabled mode.
**Why this is needed:**
Allocating an in-memory structure and its lock mechanisms for a feature that is disabled at configuration time wastes memory and creates a correctness gap: if code accidentally calls `session_cache.set()` when disabled, it will populate a cache that is never invalidated on logout.
---
### Task 24 — Remove unused asyncio import from log_service.py
**Severity:** Low
**Where:**
`backend/app/services/log_service.py` line 9 — `import asyncio` is present but `asyncio` is never referenced in the file body.
**Goal:**
Delete line 9.
**Possible traps and issues:**
- Confirm `asyncio` is not referenced in any conditional branch (including type-checking `TYPE_CHECKING` blocks).
**Docs changes needed:**
None.
**Why this is needed:**
Unused imports add noise, slow down `import` time marginally, and may confuse linters or readers who assume there must be async behaviour somewhere in the module.
---
### Task 25 — Extend service protocols coverage
**Severity:** Low (foundation for future testability)
**Where:**
`backend/app/services/protocols.py` — only `AuthService` and `JailService` are defined.
`backend/app/repositories/protocols.py` — only `SessionRepository` is defined.
No concrete service or repository module declares that it implements any protocol (`class AuthServiceImpl(AuthService):` does not exist anywhere).
**Goal:**
Add `Protocol` definitions to `services/protocols.py` for: `BanService`, `ConfigService`, `HistoryService`, `GeoService`, `BlocklistService`, `HealthService`, `ServerService`. Add `Protocol` definitions to `repositories/protocols.py` for: `BlocklistRepository`, `ImportLogRepository`, `GeoCacheRepository`, `Fail2BanDbRepository`.
Use `typing.runtime_checkable` on protocols that need instance-check support in tests. Expose the protocols as `Dep` type aliases in `dependencies.py` for each new service wired through DI.
**Possible traps and issues:**
- Not all services are injected through `dependencies.py` today — many are called as modules directly (`from app.services import ban_service`). The protocol coverage task should go alongside a step to inject these through `Depends()`.
- Protocols must be kept in sync with the concrete implementations — a method added to the concrete module must be added to the protocol. Consider a test that instantiates a concrete module as the protocol type to catch divergence.
**Docs changes needed:**
`Docs/Architekture.md` section 10 design principles — reinforce the Dependency Inversion entry: all services injected via `dependencies.py` must have a corresponding `Protocol`.
**Why this is needed:**
Without protocols for the majority of services, unit tests must either use the real implementation (slow, requires running fail2ban) or patch methods at the module level (fragile). Protocols enable clean mock injection via `Depends()` in tests.
---
## Completed Issues