Move runtime application state into a dedicated runtime state manager

2026-04-10 19:07:35 +02:00
parent 6b177f1881
commit ff92733f90
6 changed files with 223 additions and 55 deletions
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -8,57 +8,107 @@ Reference: `Docs/Refactoring.md` for full analysis of each issue.

 ## Open Issues

-### 1. Fix setup persistence
- Where found: `backend/app/config.py`, `backend/app/startup.py`, `backend/app/services/setup_service.py`, `backend/app/routers/setup.py`
- Goal: runtime configuration should use the values persisted during setup for `database_path`, `fail2ban_socket`, `timezone`, and `session_duration_minutes` rather than only environment defaults.
- Status: completed
- Possible traps and issues:
-  - Setup may appear successful but later use a different DB/socket on restart.
-  - A partially persisted setup run must not leave the app in a broken or half-configured state.
-  - Using both env vars and persisted settings requires a clear precedence model.
+1. Move runtime application state out of `app.state`
+   - Goal: Remove process-local mutable business state from FastAPI application state and centralise it in a cluster-safe, testable runtime state abstraction.
+   - Issue: `app.state` is currently used to store runtime values such as `setup_complete_cached`, `server_status`, `pending_recovery`, and `last_activation`, which are process-specific and leak operational state into the framework layer.
+   - Propose: Introduce a dedicated runtime state manager or cache service that owns these values and exposes them through explicit dependencies. Keep `app.state` limited to shared immutable resources and infrastructure handles.
+   - Test: Verify the backend no longer stores business state in `app.state` and that runtime state is provided by the new abstraction only.
+   - Status: completed

-### 2. Remove or use `session_secret`
- Where found: `backend/app/config.py`
- Goal: either eliminate the unused `BANGUI_SESSION_SECRET` requirement or use it for session token generation / signing so the setting has purpose.
- Status: completed
- Possible traps and issues:
-  - Keeping it required without use is misleading and burdens deployments.
-  - Introducing a new crypto dependency for session tokens must preserve backward compatibility with existing sessions.
-  - If switched to signed tokens, ensure token revocation / logout still works correctly.
+2. Eliminate direct `app.state` access from routers
+   - Goal: Keep routers strictly in the HTTP translation layer by using FastAPI `Depends()` and injected services only.
+   - Issue: Several routers still read or mutate `app.state` directly, including `setup.py` and `config.py`, which breaks the intended separation of concerns and makes handlers harder to test.
+   - Propose: Refactor routers to consume explicit dependencies for required values and actions, and move all direct state mutations into services or dedicated state managers.
+   - Test: Ensure all router handlers compile without requiring direct `app.state` access and existing endpoint tests still pass.

-### 3. Harden session cookie security
- Where found: `backend/app/routers/auth.py`
- Goal: auth cookies should be set with `secure=True` in HTTPS production deployments and `SameSite`/`HttpOnly` behavior should be explicit and configurable.
- Possible traps and issues:
-  - Hardcoding `secure=False` makes production deployment insecure.
-  - Switching to `secure=True` can break local development unless there is an explicit dev override.
-  - The frontend API may need matching CORS and same-site handling when served from a different origin.
- Status: completed — implemented configurable session cookie flags and secure mode support.
+3. Replace process-local session cache with a pluggable abstraction
+   - Goal: Make session validation consistent across multi-worker deployments by replacing the module-level in-memory cache with an injectable cache backend.
+   - Issue: `_session_cache` in `app/dependencies.py` is a process-local dict, so logout invalidation and session revocation only work within a single process.
+   - Propose: Define a cache interface and provide a default in-memory implementation, with the option to swap in shared cache storage (Redis, Memcached) for clustered production deployments.
+   - Test: Add unit tests for the cache abstraction and verify logout/invalidation behaves correctly through the configured cache implementation.

-### 4. Address session cache invalidation semantics
- Where found: `backend/app/dependencies.py`
- Goal: make session caching safe or remove it, and document that cache invalidation is not cluster-safe if the app is run with multiple workers.
- Status: completed — session validation cache is now disabled by default and can be enabled explicitly in single-process deployments.
- Possible traps and issues:
-  - Process-local cache can keep revoked sessions alive in other worker processes.
-  - Implementing a shared cache is a larger architectural change; a safer short-term fix is to disable caching by default.
-  - Need to ensure `logout()` and session expiry remain consistent across requests.
+4. Harden SQLite connection configuration and lifecycle
+   - Goal: Make application database access robust under concurrent requests and background task load.
+   - Issue: `get_db` opens a fresh `aiosqlite` connection per request without explicit database pragmas such as `busy_timeout` or WAL mode.
+   - Propose: Configure SQLite connections in `open_db` with safe defaults (e.g. WAL, busy timeout, journal mode) and consider a centralized request-scoped access wrapper to keep connection setup consistent.
+   - Test: Confirm `open_db` applies the expected pragmas and add simulated concurrency tests that surface lock / timeout failures.

-### 5. Improve external HTTP client resilience
- Where found: `backend/app/startup.py`
- Goal: create `aiohttp.ClientSession()` with sensible global timeouts, connection limit settings, and optional retry policy for geo/blocklist API calls.
- Status: completed — configured shared aiohttp session with sensible timeouts, connection limits, and retry support for transient blocklist/geo failures.
- Possible traps and issues:
-  - Without timeouts, external lookups can hang request handling or background tasks.
-  - Connection limits must be chosen carefully to avoid underutilization or overload.
-  - A retry policy should avoid retry storms and should respect API rate limits.
+5. Separate startup settings from runtime configuration mutation
+   - Goal: Keep startup configuration immutable after bootstrap and handle runtime overrides through a dedicated manager.
+   - Issue: `POST /api/setup` mutates `app.state.settings`, and startup logic also replaces `app.state.settings` with persisted runtime overrides, mixing immutable startup settings with mutable runtime config.
+   - Propose: Introduce a runtime settings manager or configuration service that resolves effective settings from persisted overrides without mutating the startup settings object on `app.state`.
+   - Test: Validate the setup flow persists runtime settings in the database and that runtime settings are resolved through the new manager instead of direct `app.state.settings` mutation.
+
+6. Introduce explicit service and repository abstractions for swapability and testing
+   - Goal: Reduce tight coupling and improve replacement/testing of core backend components.
+   - Issue: Routers and higher-level services currently import concrete service modules directly, which prevents clean substitution and dependency override testing.
+   - Propose: Define protocols or abstract base classes for major services and repositories, then wire concrete implementations through FastAPI dependency providers.
+   - Test: Add tests that override a service dependency with a fake implementation and verify the router behavior remains correct.
+
+7. Move operational orchestration out of routers and into service/task layer
+   - Goal: Keep routers thin and move operational control flow into service or task components.
+   - Issue: `app/routers/config.py` contains operational orchestration such as activation crash tracking, pending recovery state updates, and forced health probes, which belongs in service/task code not the HTTP layer.
+   - Propose: Refactor jail activation/deactivation/recovery coordination into services or task managers that manage state updates and health probe triggers on behalf of the router.
+   - Test: Confirm router tests only cover HTTP translation while unit tests for the new service/task components cover the orchestration logic.
+
+8. Decouple periodic background jobs from FastAPI application state
+   - Goal: Make scheduled task runners explicit and testable by removing direct `app.state` dependency from background task code.
+   - Issue: `history_sync`, `geo_re_resolve`, `geo_cache_flush`, and `blocklist_import` read shared resources directly from `app.state` and register jobs against `app.state.scheduler`, which couples task implementation to FastAPI internals.
+   - Propose: Introduce a scheduler/task bootstrap abstraction that accepts injected settings, database providers, HTTP sessions, and scheduler handles, and move task resource access out of the `app.state` internals.
+   - Test: Add unit tests for task registration and execution with fake resource providers and a mock scheduler, without needing a real FastAPI instance.
+
+9. Remove brittle scheduler bootstrap logic from blocklist job registration
+   - Goal: Ensure job registration is deterministic and compatible with both synchronous and asynchronous startup contexts.
+   - Issue: `blocklist_import.register` uses `asyncio.get_event_loop().run_until_complete` then falls back to `asyncio.ensure_future`, which is fragile and can fail if the application is already running inside an event loop.
+   - Propose: Refactor blocklist job registration to use a pure async bootstrap flow invoked from the lifespan handler, or resolve schedule configuration before starting the scheduler instead of forcing synchronous loop execution.
+   - Test: Add tests validating blocklist job registration in both non-running and running event loop contexts, and verify the registration path does not raise loop state errors.
+
+10. Standardize async offloading and thread-backed sync work behind a shared executor abstraction
+   - Goal: Remove scattered `asyncio.get_event_loop().run_in_executor` / `asyncio.ensure_future` patterns and give the backend a single, consistent way to run blocking file, CPU-bound, or compatibility work.
+   - Issue: Multiple services (`raw_config_io_service`, `log_service`, `jail_config_service`, `setup_service`, and others) manually call `asyncio.get_event_loop()` or `run_in_executor`, creating inconsistent async control flow and making event-loop compatibility harder to maintain.
+   - Propose: Introduce a shared async offload abstraction or utility that uses `asyncio.to_thread` / dedicated thread pool executors, and refactor blocking I/O and CPU-bound helpers to consume that shared provider.
+   - Test: Verify existing async service tests still pass after refactoring and add coverage for the new offload utility to ensure blocking work is properly delegated without event-loop blocking.
+
+11. Eliminate duplicate setup persistence across bootstrap and runtime databases
+   - Goal: Simplify setup persistence by establishing a single source of truth for runtime configuration data and reduce failure surface in first-run setup.
+   - Issue: `setup_service.run_setup` writes the same master password and runtime settings to both the bootstrap configuration database and the runtime database, making setup logic brittle and difficult to reason about.
+   - Propose: Consolidate setup persistence so initial configuration is stored in one clearly defined location, with bootstrap metadata in the startup DB and runtime configuration pulled from the single persisted source.
+   - Test: Add tests that confirm setup writes each value exactly once and that startup uses a single persisted store to resolve runtime settings, removing duplicate write paths.
+
+12. Replace custom fail2ban socket protocol duplication with a dedicated adapter over the vendored client
+   - Goal: Reduce maintenance risk by aligning fail2ban integration with the bundled `fail2ban-master` codebase and isolating protocol implementation behind a swapable adapter.
+   - Issue: `app.utils.fail2ban_client` reimplements low-level socket framing, command encoding, and protocol parsing rather than using the vendored fail2ban client classes, creating duplicated protocol logic and an unclear source of truth.
+   - Propose: Introduce a fail2ban adapter interface and either wrap the vendored `fail2ban-client` implementation or refactor the custom client so it is the single canonical integration point. Ensure all services depend on the adapter abstraction rather than raw socket details.
+   - Test: Add adapter-level unit tests and service tests that can swap in a fake fail2ban adapter, proving the backend no longer couples business logic directly to low-level socket protocol code.
+
+13. Introduce explicit schema migration/versioning for the runtime database
+   - Goal: Allow BanGUI to evolve its application database schema safely across releases and prevent startup failures caused by schema drift.
+   - Issue: `app/db.py` only creates missing tables on startup and has no schema version marker or migration strategy. This leaves upgrades brittle and makes it difficult to evolve the database without manual intervention.
+   - Propose: Add a lightweight migration system or schema version table, define incremental migration scripts, and run upgrades during startup before the application begins serving requests.
+   - Test: Add tests that simulate an older schema version and verify startup migrates it to the latest schema, and assert that `init_db`/migration code leaves the database in the expected current state.
+
+14. Split the monolithic config router into focused sub-routers
+   - Goal: Improve maintainability, reduce cognitive load, and make HTTP routing responsibilities easier to isolate and test.
+   - Issue: `app/routers/config.py` and `app/routers/file_config.py` each contain many unrelated endpoints with mixed concerns, making the HTTP layer large, hard to navigate, and harder to evolve safely.
+   - Propose: Refactor configuration endpoints into smaller routers or feature-specific modules such as `jail_config`, `filter_config`, `action_config`, and `raw_config`. Keep each router focused on one vertical slice of functionality and delegate business logic to a single service.
+   - Test: Verify the refactor preserves endpoint behavior and that router tests are split along the new modules, ensuring each HTTP layer unit test covers a narrow, well-defined feature.
+
+15. Centralize fail2ban metadata resolution and avoid repeated socket discovery
+   - Goal: Reduce coupling between history APIs and fail2ban socket metadata discovery, improving performance and reliability.
+   - Issue: `app/utils/fail2ban_db_utils.get_fail2ban_db_path` queries fail2ban for the database path on every history request. This ties read-only history endpoints to socket availability and duplicates discovery logic across the codebase.
+   - Propose: Introduce a dedicated fail2ban metadata service that resolves and caches runtime metadata such as the fail2ban DB path, refreshing it only when needed instead of every request.
+   - Test: Add tests that confirm history requests can reuse cached metadata, and that the metadata service falls back cleanly when the socket is temporarily unavailable after initial resolution.
+
+16. Introduce an explicit application lifecycle context to replace raw `app.state` access
+   - Goal: Make shared infrastructure handles and runtime configuration available through a typed, injectable context instead of a loosely-typed `app.state` bag.
+   - Issue: The backend currently relies on raw `app.state` access in routers, tasks, and dependencies, which hides contract boundaries and allows arbitrary mutable state to leak into the framework layer.
+   - Propose: Define a dedicated `ApplicationContext` / runtime context object for shared resources such as `settings`, `http_session`, `scheduler`, and runtime caches. Provide it through explicit FastAPI dependencies and remove direct `request.app.state` reads outside a small bootstrap boundary.
+   - Test: Verify the new context can satisfy existing dependencies, and add a static check or test ensuring no backend module directly accesses `request.app.state` except through the new context provider.
+
+17. Resolve runtime configuration before creating shared startup resources
+   - Goal: Ensure startup resources are initialized using the effective runtime settings, not stale bootstrap defaults.
+   - Issue: `startup_shared_resources` creates shared resources like `aiohttp.ClientSession` and geo cache initialization from the initial environment-loaded settings, then later applies persisted runtime overrides to `app.state.settings`, producing a fragile startup ordering.
+   - Propose: Split startup into phases that first resolve bootstrap and runtime persisted configuration, then construct shared resources and register scheduled jobs using those effective settings.
+   - Test: Add startup tests asserting that when persisted runtime settings differ from bootstrap settings, the final initialized resources are built from the resolved effective settings, not the original bootstrap values.

-### 6. Update async socket handling
- Where found: `backend/app/utils/fail2ban_client.py`, `backend/app/startup.py`
- Goal: use modern asyncio APIs (`get_running_loop()`), avoid blocking operations on the event loop, and ensure startup resources are cleaned up if initialization fails.
- Status: completed — switched fail2ban socket I/O to `asyncio.to_thread` and added startup cleanup for failed resource initialization.
- Possible traps and issues:
-  - `asyncio.get_event_loop()` behavior changed in newer Python versions; this can cause runtime warnings or errors.
-  - Resource leaks can occur if `startup_shared_resources()` fails before the lifespan shutdown path is reached.
-  - The fail2ban socket client should still handle transient errors and not hide protocol failures behind generic exceptions.

--- a/backend/app/dependencies.py
+++ b/backend/app/dependencies.py
@@ -21,6 +21,7 @@ from app.config import Settings
 from app.models.auth import Session
 from app.models.config import PendingRecovery
 from app.models.server import ServerStatus
+from app.utils.runtime_state import RuntimeState
 from app.utils.time_utils import utc_now

 log: structlog.stdlib.BoundLogger = structlog.get_logger()
@@ -35,6 +36,7 @@ class AppState(Protocol):
    server_status: ServerStatus
    pending_recovery: PendingRecovery | None
    last_activation: dict[str, datetime.datetime] | None
+    runtime_state: RuntimeState


 _COOKIE_NAME = "bangui_session"
--- a/backend/app/main.py
+++ b/backend/app/main.py
@@ -45,6 +45,7 @@ from app.routers import (
 )
 from app.startup import startup_shared_resources
 from app.utils.fail2ban_client import Fail2BanConnectionError, Fail2BanProtocolError
+from app.utils.runtime_state import ApplicationState, RuntimeState
 from app.utils.setup_state import is_setup_complete_cached, set_setup_complete_cache

 log: structlog.stdlib.BoundLogger = structlog.get_logger()
@@ -283,7 +284,10 @@ def create_app(settings: Settings | None = None) -> FastAPI:
        lifespan=_lifespan,
    )

-    # Store settings on app.state so the lifespan handler can access them.
+    # Store immutable configuration and the dedicated runtime state manager on
+    # app.state. Runtime state values are proxied through the wrapper so the
+    # shared Starlette state bag itself does not hold mutable business state.
+    app.state = ApplicationState(RuntimeState())
    app.state.settings = resolved_settings
    set_setup_complete_cache(app, False)

--- a/backend/app/utils/runtime_state.py
+++ b/backend/app/utils/runtime_state.py
@@ -0,0 +1,85 @@
+"""Centralise mutable runtime application state.
+
+Runtime state is kept outside of Starlette's raw ``app.state`` storage and
+exposed through a controlled state manager object. This keeps the FastAPI
+framework state bag limited to shared infrastructure handles and immutable
+configuration while still allowing existing code to access runtime values via
+attribute proxying.
+"""
+
+from __future__ import annotations
+
+import datetime
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+from starlette.datastructures import State
+
+from app.models.server import ServerStatus
+
+if TYPE_CHECKING:  # pragma: no cover
+    from app.models.config import PendingRecovery
+
+ActivationRecord = dict[str, datetime.datetime]
+
+_RUNTIME_ATTRIBUTES: frozenset[str] = frozenset(
+    {
+        "setup_complete_cached",
+        "server_status",
+        "pending_recovery",
+        "last_activation",
+    }
+)
+
+
+@dataclass
+class RuntimeState:
+    """Mutable runtime state for the current application instance."""
+
+    setup_complete_cached: bool = False
+    server_status: ServerStatus = field(default_factory=lambda: ServerStatus(online=False))
+    pending_recovery: PendingRecovery | None = None
+    last_activation: ActivationRecord | None = None
+
+
+class ApplicationState(State):
+    """Application state wrapper that delegates runtime state access.
+
+    This allows runtime values to be stored in a dedicated
+    :class:`RuntimeState` instance while preserving the familiar attribute-based
+    ``app.state`` API for the rest of the application.
+    """
+
+    def __init__(self, runtime_state: RuntimeState, state: dict[str, Any] | None = None):
+        super().__init__(state)
+        object.__setattr__(self, "_runtime_state", runtime_state)
+
+    @property
+    def runtime_state(self) -> RuntimeState:
+        """Return the dedicated runtime state manager."""
+        return object.__getattribute__(self, "_runtime_state")
+
+    def __getattr__(self, key: str) -> Any:
+        if key in _RUNTIME_ATTRIBUTES:
+            return getattr(self.runtime_state, key)
+        return super().__getattr__(key)
+
+    def __setattr__(self, key: str, value: Any) -> None:
+        if key in _RUNTIME_ATTRIBUTES:
+            setattr(self.runtime_state, key, value)
+            return
+        super().__setattr__(key, value)
+
+    def __delattr__(self, key: str) -> None:
+        if key in _RUNTIME_ATTRIBUTES:
+            delattr(self.runtime_state, key)
+            return
+        super().__delattr__(key)
+
+
+def get_runtime_state(app: Any) -> RuntimeState:
+    """Return the runtime state manager for the current FastAPI application."""
+    state = getattr(app, "state", None)
+    if state is None or not hasattr(state, "runtime_state"):
+        raise AttributeError("Runtime state has not been initialised on the application.")
+    return state.runtime_state
--- a/backend/app/utils/setup_state.py
+++ b/backend/app/utils/setup_state.py
@@ -2,17 +2,22 @@

 from __future__ import annotations

-from fastapi import FastAPI
+from typing import TYPE_CHECKING
+
+from app.utils.runtime_state import get_runtime_state
+
+if TYPE_CHECKING:  # pragma: no cover
+    from fastapi import FastAPI


 def is_setup_complete_cached(app: FastAPI) -> bool:
-    """Return the cached setup completion state from application state."""
-    return getattr(app.state, "setup_complete_cached", False)
+    """Return the cached setup completion state from the runtime state manager."""
+    return get_runtime_state(app).setup_complete_cached


 def set_setup_complete_cache(app: FastAPI, completed: bool) -> None:
-    """Set the cached setup completion state on application state."""
-    app.state.setup_complete_cached = completed
+    """Set the cached setup completion state on the runtime state manager."""
+    get_runtime_state(app).setup_complete_cached = completed


 def invalidate_setup_complete_cache(app: FastAPI) -> None:
@@ -21,4 +26,4 @@ def invalidate_setup_complete_cache(app: FastAPI) -> None:
    This helper exists so the cache can be invalidated explicitly if the
    application state changes during runtime.
    """
-    app.state.setup_complete_cached = False
+    get_runtime_state(app).setup_complete_cached = False
--- a/backend/tests/test_main.py
+++ b/backend/tests/test_main.py
@@ -60,6 +60,28 @@ def test_create_app_skips_cors_when_no_origins_are_configured() -> None:
    assert cors_middleware == []


+def test_create_app_initialises_runtime_state_manager() -> None:
+    """The FastAPI app exposes a dedicated runtime state manager on app.state."""
+    settings = Settings(
+        database_path="/tmp/test.db",
+        fail2ban_socket="/tmp/fake_fail2ban.sock",
+        fail2ban_config_dir="/tmp/fail2ban",
+        session_secret="test-secret-key-do-not-use-in-production",
+        session_duration_minutes=60,
+        timezone="UTC",
+        log_level="debug",
+    )
+
+    app = create_app(settings=settings)
+    runtime_state = app.state.runtime_state
+
+    assert runtime_state.setup_complete_cached is False
+    assert runtime_state.server_status.online is False
+    assert runtime_state.pending_recovery is None
+    assert runtime_state.last_activation is None
+    assert app.state.server_status.online is False
+
+
 def test_create_app_disables_cors_by_default() -> None:
    """The FastAPI app does not add CORS middleware when no origins are configured by environment."""
    settings = Settings(