Files
BanGUI/Docs/Tasks.md
Lukas 56c511d905 Fix module-level asyncio locks in jail_service
Initialize jail_service locks lazily to avoid import-time event loop binding and add regression tests for lock creation.
2026-04-15 09:10:38 +02:00

26 KiB
Raw Blame History

BanGUI — Task List

This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.

Reference: Docs/Refactoring.md for full analysis of each issue.


Open Issues


TASK-01 — Fix undefined names in routers/config_misc.py 🔴

Where: backend/app/routers/config_misc.py — lines 86, 87, 123, 174, 196, 337, 386.

Status: Completed

Goal: The file references four symbols that are never defined or imported. At runtime, any request that hits an error path in the affected endpoints will raise a bare NameError instead of returning the correct HTTP response.

  • ConfigOperationError (caught on lines 86, 385) — defined in app.exceptions but missing from the import block.
  • JailOperationError (caught on lines 123, 174) — same module, same omission.
  • _bad_request (called on lines 87, 337, 386) — the file defines _bad_gateway() but the symmetrical _bad_request() helper was never added.
  • log (used on line 196 in the restart handler) — a structlog logger that every other module in the project creates with log = structlog.get_logger(), but this module skips it entirely.

Additionally, setup_service is imported at the top level (line 20) and then redundantly re-imported inside two function bodies (lines 289, 327), which triggers a flake8/Pylance redefinition warning. The top-level import is unused.

Fix steps:

  1. Add from app.exceptions import ConfigOperationError, JailOperationError to the import block.
  2. Add log: structlog.stdlib.BoundLogger = structlog.get_logger() at module level, following the same pattern as every other service/router.
  3. Add the _bad_request helper directly below _bad_gateway:
    def _bad_request(message: str) -> HTTPException:
        return HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail=message,
        )
    
  4. Remove setup_service from the top-level from app.services import ... line; keep only the two local imports where it is actually used, or hoist one of them to the top and remove the other.

Possible traps:

  • The file also uses config_service.ConfigOperationError in one place (line 385) — that reference goes through the module attribute and therefore works today even without a direct import. After adding the bare import, both references will resolve to the same class, which is correct. Verify nothing accidentally catches the wrong exception.
  • Removing setup_service from the top-level import while keeping the local imports means the local imports must stay in sync. Consider hoisting to top-level for consistency with every other router.

Docs changes needed: None — this is a pure bug fix.

Why: Any user action that exercises an error path on the /config/global, /config/reload, /config/restart, or /config/log endpoints will receive an unhandled NameError 500 response instead of the intended 400/409/502. These are common failure modes (fail2ban unreachable, bad config value).


TASK-02 — Wrap blocking mkdir() calls in run_blocking() 🔴

Status: Completed

Where:

  • backend/app/services/setup_service.py — line 178, inside async def _ensure_database_initialized()
  • backend/app/startup.py — line 73, inside async def startup_shared_resources()

Goal: Both sites call Path.mkdir() synchronously inside async functions. Because the entire backend runs on a single-threaded asyncio event loop, a synchronous filesystem call — even one that normally returns in microseconds — blocks all other coroutines from running until it completes, including health-check tasks and in-flight requests. The fix is to wrap each call with the project's existing run_blocking() helper (from app.utils.async_utils), which offloads the call to a thread-pool executor.

startup.py already imports run_blocking indirectly (it calls services that use it), but needs to import it directly. setup_service.py already imports and uses run_blocking elsewhere in the same file (line 93) — this is therefore a straightforward omission.

Fix steps:

  1. In setup_service.py around line 178:
    # Before
    parent_dir.mkdir(parents=True, exist_ok=True)
    # After
    await run_blocking(parent_dir.mkdir, parents=True, exist_ok=True)
    
  2. In startup.py around line 73, add the import if absent and apply the same pattern.

Possible traps:

  • Path.mkdir takes keyword-only args. run_blocking must accept **kwargs or use a lambda: await run_blocking(lambda: db_path.parent.mkdir(parents=True, exist_ok=True)). Check the signature of run_blocking before picking the form.
  • The PermissionError that setup_service deliberately catches must still be handled in the same try/except block — the await must stay inside the try.

Docs changes needed: None — this is a pure bug fix.

Why: Blocking I/O on the event loop is forbidden by the architecture's "Async Everything" principle. Under load or during startup, a slow filesystem (e.g. NFS, encrypted volume) could stall the entire server for hundreds of milliseconds.


TASK-03 — Remove inverted dependency: service importing a task

Status: Completed

Where: backend/app/services/jail_config_service.py — line 37:

from app.tasks.health_check import run_probe

Used on lines 295 and 571 to fire a health probe after activating or deactivating a jail.

Goal: The dependency graph mandates Tasks → Services, never Services → Tasks. run_probe is already exposed by health_service.py (or can be). The fix is to remove the task import from the service and call the equivalent functionality through the service layer instead.

Options:

  1. If health_service.probe() is already equivalent, replace the two run_probe(app) calls with await health_service.probe(socket_path) (passing the socket path directly, avoiding the FastAPI argument entirely — see also TASK-05).
  2. If run_probe does more than just probe (e.g. it also writes to app.state), extract that logic into a function in health_service and call that from both the task and the service.

Possible traps:

  • run_probe in health_check.py accepts app: FastAPI and uses app.state to cache the probe result. Replacing it with a direct health_service call means the cached status in app.state will not be updated. Decide whether the caller needs the cache updated (if yes, pass the relevant state as a parameter; if no, a direct service call is sufficient).
  • After this change, verify that tasks/health_check.py still calls run_probe internally and that no other module relied on the cross-import.

Docs changes needed: Architecture doc section 2.2 (Tasks) should note that tasks may call services, but services must not call tasks.

Why: The inverted import creates a hidden circular-dependency risk: if health_check ever imports anything from jail_config_service, Python will raise an ImportError. It also makes jail_config_service harder to test in isolation because importing it pulls in the scheduler task module.


TASK-04 — Remove FastAPI app instance from service signatures 🟠

Status: Completed

Where: backend/app/services/jail_config_service.pyactivate_jail() and deactivate_jail() function signatures accept app: FastAPI (guarded by TYPE_CHECKING on line 48, used at runtime on lines ~295, ~571).

Goal: Services must not depend on the FastAPI framework. The app object is passed solely so run_probe(app) can update app.state. After TASK-03 removes the run_probe dependency, this parameter becomes unnecessary. If any state from app.state is still needed (e.g. socket_path, scheduler), pass only those specific values — not the entire app.

Fix steps:

  1. Complete TASK-03 first.
  2. Remove the app: FastAPI parameter from activate_jail and deactivate_jail.
  3. Remove the if TYPE_CHECKING: from fastapi import FastAPI block if it is no longer referenced anywhere in the file.
  4. Update all callers (routers and tests) to stop passing app.

Possible traps:

  • The routers that call activate_jail inject request: Request specifically to access request.app. After removing the parameter, those routers should no longer need Request — check whether it is used for anything else in those handlers.
  • Any test that constructs a mock FastAPI instance just to pass it here can be simplified.

Docs changes needed: None beyond removing the FastAPI reference from the service signature.

Why: Framework types in the service layer violate the Dependency Inversion principle, tightly couple service logic to the HTTP layer, and make unit testing significantly harder — requiring a FastAPI application instance just to call a domain function.


TASK-05 — Extract business logic out of tasks/history_sync.py into a service

Where: backend/app/tasks/history_sync.py — lines 1516 import repositories directly; the entire _run_sync_with_settings() function contains paging logic, backfill window calculation, and archive_ban_event calls that constitute business logic.

Goal: Tasks must only schedule and invoke. All logic — the backfill window decision, the paging loop, the SQL cursor over the fail2ban DB, the archive write — belongs in a service. Create (or extend) history_service.py with a sync_from_fail2ban_db() function that accepts a DB connection and socket path, encapsulates the entire sync algorithm, and returns a count of synced records. The task then becomes a thin wrapper:

async def _run_sync_with_settings(settings: Settings) -> None:
    db = await open_db(settings.database_path)
    try:
        synced = await history_service.sync_from_fail2ban_db(db, settings.fail2ban_socket)
        log.info("history_sync_complete", synced=synced)
    finally:
        await db.close()

Possible traps:

  • history_service currently uses the fail2ban database read-only. The new function writes to the BanGUI archive table — make sure the connection passed is the BanGUI DB, not the fail2ban DB. The fail2ban DB path is resolved via get_fail2ban_db_path() inside the service function.
  • The BACKFILL_WINDOW and HISTORY_SYNC_INTERVAL constants can stay in the task file (scheduling concern) but the logic that decides whether to backfill must move to the service.
  • Ensure existing tests for history_sync are updated to test the service function, not the task internals.

Docs changes needed: Architecture doc section on Services: add history_service.sync_from_fail2ban_db to the service table. Architecture doc section on Tasks: update history_sync description to reflect that it delegates to history_service.

Why: Business logic in the task layer cannot be unit-tested without spinning up a scheduler. It also means any future consumer of the sync logic (e.g. a manual trigger endpoint) would have to duplicate the code or import from the task module.


TASK-06 — Break circular dependency between config_file_service and the three config sub-services

Status: Completed

Where:

  • services/jail_config_service.py line 33 → imports config_file_service at module level
  • services/filter_config_service.py line 26 → same
  • services/action_config_service.py line 28 → same
  • services/config_file_service.py lines 1027, 1039, 1107, 1118, 1128, 1140 → lazy-imports jail_config_service; lines 13031367 → lazy-imports filter_config_service; lines 1752+ → lazy-imports action_config_service

Goal: The three sub-services depend on config_file_service for shared parsing utilities, while config_file_service calls back into the three sub-services via lazy in-function imports — the only reason it works is that the circular import is deferred until first call. This is fragile: any reorganisation that removes the lazy-import guard will cause an ImportError at startup.

The correct fix is to break the cycle by extracting the shared logic that config_file_service currently delegates to the sub-services into a lower-level module. Concretely:

  1. Identify every function that config_file_service calls via lazy import from the sub-services (e.g. list_inactive_jails, _activate_jail, list_filters, get_filter, list_actions, etc.).
  2. Determine whether config_file_service should own those functions directly (moving them out of the sub-services), or whether config_file_service should simply be split so the shared parsing utilities live in a config_file_utils module that all four services can import without circularity.
  3. Remove all lazy in-function imports from config_file_service.

The safest first step is option 2: rename the pure parsing/utility functions in config_file_service into a new utils/config_file_utils.py (or extend the existing one), then have all four service files import from the util, leaving config_file_service to handle only orchestration.

Possible traps:

  • config_file_service is over 1700 lines. Splitting it requires careful identification of which functions are pure utilities (no side effects, no DB/socket calls) vs. orchestration. Start with utility functions only.
  • Any function moved to utils must not import from services. If it currently does, the abstraction is wrong and needs further decomposition.
  • Test coverage for existing behaviour must be maintained throughout the refactor. Run the full test suite after each extraction step.

Docs changes needed: Architecture doc section 2.1 (Project Structure): add utils/config_file_utils.py if created. Section 2.2 (Utils): describe the new module. Update the Service descriptions for all four affected services to accurately describe their responsibilities post-split.

Why: Lazy imports mask circular dependencies at import time but surface as confusing runtime errors if import order changes. The current design also makes it impossible to import any one of the four services for testing without transitively loading all four.


TASK-07 — Move schedule management out of routers/blocklist.py into a service

Status: Completed

Summary: Router schedule endpoints now delegate schedule persistence and APScheduler job management to blocklist_service, removing the router's direct dependency on app.tasks.blocklist_import.

Where: backend/app/routers/blocklist.py — line 51:

from app.tasks import blocklist_import as blocklist_import_task

Used on lines 173, 206, 208 to read blocklist_import_task.JOB_ID and call blocklist_import_task.reschedule(app).

Goal: Routers must not know about APScheduler job IDs or the task module internals. Schedule management is business logic and belongs in a service. Create a get_schedule() and update_schedule() function in blocklist_service.py that encapsulate the scheduler interaction. The router calls the service function and returns its result.

The router's GET /schedule and PUT /schedule handlers should be reduced to:

return await blocklist_service.get_schedule(db, scheduler)
await blocklist_service.update_schedule(db, scheduler, body)

The JOB_ID constant and reschedule() call details move into blocklist_service. blocklist_service may import from the task module (tasks can be imported by services for constants/types), but the router should have no direct task dependency.

Possible traps:

  • blocklist_import_task.reschedule(app) uses app to read app.state. If the service function receives the scheduler directly (already available via SchedulerDep), it can avoid the app dependency. Pass scheduler explicitly instead.
  • The SchedulerDep injection in the router should remain unchanged — the scheduler is a dependency, not a business concern, and injecting it via Depends() is correct.

Docs changes needed: Architecture doc Service table: add get_schedule and update_schedule to the blocklist_service entry.

Why: The router knows the internal job ID string and directly manipulates the scheduler — both are implementation details of the task/scheduling layer. This makes the endpoint hard to test without a running scheduler instance.


TASK-08 — Remove inverted layer dependency: utils/fail2ban_db_utils.py importing a service

Status: Completed

Where: backend/app/utils/fail2ban_db_utils.py — line 8:

from app.services.fail2ban_metadata_service import default_fail2ban_metadata_service

Goal: utils/ must be the lowest layer — it may not import from services/. The fail2ban_db_utils module is currently a thin wrapper around default_fail2ban_metadata_service. Two options exist:

  1. Move the singleton down: Move Fail2BanMetadataService (and its default_fail2ban_metadata_service singleton) from services/ to utils/. The class has no service dependencies — it only uses Fail2BanClient from utils/ — so it belongs in utils/ naturally.
  2. Eliminate the util wrapper: Delete fail2ban_db_utils.get_fail2ban_db_path() and have all callers import from fail2ban_metadata_service directly (since it is a service and services can import other services).

Option 1 is cleaner and consistent with the architecture: Fail2BanMetadataService is essentially an async-safe caching wrapper around a socket command, which is a utility concern.

Possible traps:

  • If Fail2BanMetadataService is moved to utils/, the import path changes everywhere it is referenced. Use IDE rename-symbol to update all references atomically.
  • The architecture doc currently does not list fail2ban_metadata_service at all. After moving it to utils, it must be documented in the Utils table.
  • Any service that currently imports fail2ban_db_utils (rather than fail2ban_metadata_service) will need their imports updated if fail2ban_db_utils is removed.

Docs changes needed: Architecture doc section 2.2 (Utils): add fail2ban_db_utils.py (or its replacement) and fail2ban_metadata_service to the Utils table.

Why: A util importing a service creates an implicit upward dependency that the architecture explicitly forbids. It also makes fail2ban_db_utils untestable in isolation — importing it drags in the service and its lazy-loaded metadata cache.


TASK-09 — Move re_resolve_geo orchestration into geo_service

Status: Completed

Where: backend/app/routers/geo.pyasync def re_resolve_geo() (lines ~142174).

Goal: The endpoint currently performs a multi-step workflow inline:

  1. Fetch all unresolved IPs from the DB.
  2. Early-return if the list is empty.
  3. Clear the negative cache.
  4. Batch-resolve all IPs.
  5. Aggregate the count of newly resolved entries.

Steps 15 are business logic. The router should call a single service function and return its result:

result = await geo_service.re_resolve_all(db, http_session)
return result   # {"resolved": N, "total": M}

Add async def re_resolve_all(db, http_session) -> dict[str, int] to geo_service.py containing the current implementation.

Possible traps:

  • clear_neg_cache() is a synchronous function that mutates the module-level _neg_cache dict. Calling it from inside the service function is fine, but be aware that it runs on the event loop thread with no lock. This is an existing race condition (see TASK-11) — do not introduce additional concurrent callers without addressing that issue first.
  • The return type dict[str, int] is an untyped dict. Consider adding a proper response model GeoReResolveResponse to models/geo.py to make the API contract explicit.

Docs changes needed: Architecture doc Service table: add re_resolve_all to the geo_service entry.

Why: Multi-step orchestration in the router violates the zero-business-logic rule. It also prevents the re-resolve logic from being triggered from any other context (e.g. a scheduled task or CLI tool) without duplicating code.


TASK-10 — Move GeoInfo → GeoDetail translation out of the router

Status: Completed

Where: backend/app/routers/geo.pyasync def lookup_ip(), lines ~8593:

raw_geo = result["geo"]
geo_detail: GeoDetail | None = None
if isinstance(raw_geo, GeoInfo):
    geo_detail = GeoDetail(
        country_code=raw_geo.country_code,
        ...
    )

Goal: The router performs conditional type checking and cross-schema mapping. The cleanest fix is to have jail_service.lookup_ip() return a typed dict whose "geo" field is already GeoDetail | None, eliminating the isinstance check entirely. Alternatively, add a GeoInfo.to_detail() method or a geo_service.to_detail() converter that the router calls in one line.

This is a small but clean change: update IpLookupResult TypedDict in jail_service so that "geo" is typed as GeoDetail | None, and have the service construct the GeoDetail before returning.

Possible traps:

  • IpLookupResult is currently typed with "geo": GeoInfo | None. Changing it to GeoDetail means callers that rely on GeoInfo-specific fields (like asn) must use GeoDetail instead — confirm GeoDetail exposes the same fields.
  • If GeoInfo and GeoDetail are nearly identical, consider merging them into a single model to avoid this translation layer entirely.

Docs changes needed: None — this is an internal type alignment.

Why: Schema translation in the router adds fragility: if either model changes, the mapping code must be updated and the connection is not enforced by the type system. Having the service return the correct type is safer and is consistent with the "routers return typed responses" principle.


TASK-11 — Fix asyncio.Lock() created at module import time in jail_service.py

Status: Completed

Where: backend/app/services/jail_service.py — lines 71 and 78:

_reload_all_lock: asyncio.Lock = asyncio.Lock()
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()

Goal: asyncio.Lock() created before the event loop starts is bound to whichever loop exists at import time. In Python 3.10+ this is generally safe for a single-worker deployment using uvicorn, but it breaks test suites that create a new event loop per test (e.g. pytest-asyncio with loop_scope="function"), causing a RuntimeError: Task attached to a different loop error when the second test tries to acquire the lock.

Fix: initialise the locks lazily on first use:

_reload_all_lock: asyncio.Lock | None = None
_backend_cmd_lock: asyncio.Lock | None = None

def _get_reload_lock() -> asyncio.Lock:
    global _reload_all_lock
    if _reload_all_lock is None:
        _reload_all_lock = asyncio.Lock()
    return _reload_all_lock

Then replace async with _reload_all_lock: with async with _get_reload_lock():.

Possible traps:

  • Lazy initialisation is not thread-safe if multiple coroutines call the getter simultaneously before initialisation. In an asyncio context this is safe (single-threaded event loop), but add a brief comment explaining why.
  • If the test suite currently happens to pass despite this issue (because pytest-asyncio is configured with loop_scope="session"), the bug is latent rather than active — still worth fixing for configuration flexibility.

Docs changes needed: None.

Why: Module-level asyncio primitives are a known footgun in Python async codebases. Fixing this is low-risk and makes the test suite more robust against pytest-asyncio configuration changes.


TASK-12 — Remove duplicate import in server_service.py

Where: backend/app/services/server_service.py — lines 1718:

from app.exceptions import ServerOperationError
from app.exceptions import ServerOperationError

Goal: Delete line 18 (the exact duplicate). One-line change.

Possible traps: None.

Docs changes needed: None.

Why: Duplicate imports generate linter warnings, add noise to diffs, and signal that the file was edited carelessly.


TASK-13 — Document fail2ban_metadata_service and history_sync task in the architecture doc 🟡

Where: Docs/Architekture.md — Services table (section 2.2) and Tasks table (section 2.2).

Goal: Two modules exist in the codebase that are absent from the architecture document:

  1. services/fail2ban_metadata_service.py — An async-safe caching wrapper that resolves the fail2ban SQLite database path by querying the socket once and caching the result. Exports default_fail2ban_metadata_service, a module-level singleton. Should be listed in the Services table (or moved to Utils — see TASK-08) with a description of its caching behaviour and the singleton pattern.

  2. tasks/history_sync.py — A scheduled background task (interval: 300 seconds) that copies new ban records from the fail2ban database into BanGUI's own archive table to preserve history that would otherwise be purged. Should be added to the Tasks table.

Additionally note in section 6 (Authentication): the _session_cache in dependencies.py noted there is now implemented in utils/session_cache.py — confirm the section still accurately reflects the implementation.

Possible traps:

  • If TASK-08 is completed before this task, the location of fail2ban_metadata_service will change. Write the doc entry after TASK-08 is decided.

Docs changes needed: Docs/Architekture.md:

  • Add fail2ban_metadata_service.py row to the Services (or Utils) table.
  • Add history_sync.py row to the Tasks table.
  • Cross-reference TASK-05 once it is complete (the task should delegate to a service).

Why: The architecture document is a contract between contributors. Undocumented modules lead to inconsistent patterns and duplicate implementations when new developers add features that the undocumented module already handlzttttttttt nbbbbbbbbbbes.