refactoring-backend #3

Merged
lukas.pupkalipinski merged 403 commits from refactoring-backend into main 2026-05-20 20:23:46 +02:00
6 changed files with 223 additions and 55 deletions
Showing only changes of commit ff92733f90 - Show all commits

View File

@@ -8,57 +8,107 @@ Reference: `Docs/Refactoring.md` for full analysis of each issue.
## Open Issues
### 1. Fix setup persistence
- Where found: `backend/app/config.py`, `backend/app/startup.py`, `backend/app/services/setup_service.py`, `backend/app/routers/setup.py`
- Goal: runtime configuration should use the values persisted during setup for `database_path`, `fail2ban_socket`, `timezone`, and `session_duration_minutes` rather than only environment defaults.
- Status: completed
- Possible traps and issues:
- Setup may appear successful but later use a different DB/socket on restart.
- A partially persisted setup run must not leave the app in a broken or half-configured state.
- Using both env vars and persisted settings requires a clear precedence model.
1. Move runtime application state out of `app.state`
- Goal: Remove process-local mutable business state from FastAPI application state and centralise it in a cluster-safe, testable runtime state abstraction.
- Issue: `app.state` is currently used to store runtime values such as `setup_complete_cached`, `server_status`, `pending_recovery`, and `last_activation`, which are process-specific and leak operational state into the framework layer.
- Propose: Introduce a dedicated runtime state manager or cache service that owns these values and exposes them through explicit dependencies. Keep `app.state` limited to shared immutable resources and infrastructure handles.
- Test: Verify the backend no longer stores business state in `app.state` and that runtime state is provided by the new abstraction only.
- Status: completed
### 2. Remove or use `session_secret`
- Where found: `backend/app/config.py`
- Goal: either eliminate the unused `BANGUI_SESSION_SECRET` requirement or use it for session token generation / signing so the setting has purpose.
- Status: completed
- Possible traps and issues:
- Keeping it required without use is misleading and burdens deployments.
- Introducing a new crypto dependency for session tokens must preserve backward compatibility with existing sessions.
- If switched to signed tokens, ensure token revocation / logout still works correctly.
2. Eliminate direct `app.state` access from routers
- Goal: Keep routers strictly in the HTTP translation layer by using FastAPI `Depends()` and injected services only.
- Issue: Several routers still read or mutate `app.state` directly, including `setup.py` and `config.py`, which breaks the intended separation of concerns and makes handlers harder to test.
- Propose: Refactor routers to consume explicit dependencies for required values and actions, and move all direct state mutations into services or dedicated state managers.
- Test: Ensure all router handlers compile without requiring direct `app.state` access and existing endpoint tests still pass.
### 3. Harden session cookie security
- Where found: `backend/app/routers/auth.py`
- Goal: auth cookies should be set with `secure=True` in HTTPS production deployments and `SameSite`/`HttpOnly` behavior should be explicit and configurable.
- Possible traps and issues:
- Hardcoding `secure=False` makes production deployment insecure.
- Switching to `secure=True` can break local development unless there is an explicit dev override.
- The frontend API may need matching CORS and same-site handling when served from a different origin.
- Status: completed — implemented configurable session cookie flags and secure mode support.
3. Replace process-local session cache with a pluggable abstraction
- Goal: Make session validation consistent across multi-worker deployments by replacing the module-level in-memory cache with an injectable cache backend.
- Issue: `_session_cache` in `app/dependencies.py` is a process-local dict, so logout invalidation and session revocation only work within a single process.
- Propose: Define a cache interface and provide a default in-memory implementation, with the option to swap in shared cache storage (Redis, Memcached) for clustered production deployments.
- Test: Add unit tests for the cache abstraction and verify logout/invalidation behaves correctly through the configured cache implementation.
### 4. Address session cache invalidation semantics
- Where found: `backend/app/dependencies.py`
- Goal: make session caching safe or remove it, and document that cache invalidation is not cluster-safe if the app is run with multiple workers.
- Status: completed — session validation cache is now disabled by default and can be enabled explicitly in single-process deployments.
- Possible traps and issues:
- Process-local cache can keep revoked sessions alive in other worker processes.
- Implementing a shared cache is a larger architectural change; a safer short-term fix is to disable caching by default.
- Need to ensure `logout()` and session expiry remain consistent across requests.
4. Harden SQLite connection configuration and lifecycle
- Goal: Make application database access robust under concurrent requests and background task load.
- Issue: `get_db` opens a fresh `aiosqlite` connection per request without explicit database pragmas such as `busy_timeout` or WAL mode.
- Propose: Configure SQLite connections in `open_db` with safe defaults (e.g. WAL, busy timeout, journal mode) and consider a centralized request-scoped access wrapper to keep connection setup consistent.
- Test: Confirm `open_db` applies the expected pragmas and add simulated concurrency tests that surface lock / timeout failures.
### 5. Improve external HTTP client resilience
- Where found: `backend/app/startup.py`
- Goal: create `aiohttp.ClientSession()` with sensible global timeouts, connection limit settings, and optional retry policy for geo/blocklist API calls.
- Status: completed — configured shared aiohttp session with sensible timeouts, connection limits, and retry support for transient blocklist/geo failures.
- Possible traps and issues:
- Without timeouts, external lookups can hang request handling or background tasks.
- Connection limits must be chosen carefully to avoid underutilization or overload.
- A retry policy should avoid retry storms and should respect API rate limits.
5. Separate startup settings from runtime configuration mutation
- Goal: Keep startup configuration immutable after bootstrap and handle runtime overrides through a dedicated manager.
- Issue: `POST /api/setup` mutates `app.state.settings`, and startup logic also replaces `app.state.settings` with persisted runtime overrides, mixing immutable startup settings with mutable runtime config.
- Propose: Introduce a runtime settings manager or configuration service that resolves effective settings from persisted overrides without mutating the startup settings object on `app.state`.
- Test: Validate the setup flow persists runtime settings in the database and that runtime settings are resolved through the new manager instead of direct `app.state.settings` mutation.
6. Introduce explicit service and repository abstractions for swapability and testing
- Goal: Reduce tight coupling and improve replacement/testing of core backend components.
- Issue: Routers and higher-level services currently import concrete service modules directly, which prevents clean substitution and dependency override testing.
- Propose: Define protocols or abstract base classes for major services and repositories, then wire concrete implementations through FastAPI dependency providers.
- Test: Add tests that override a service dependency with a fake implementation and verify the router behavior remains correct.
7. Move operational orchestration out of routers and into service/task layer
- Goal: Keep routers thin and move operational control flow into service or task components.
- Issue: `app/routers/config.py` contains operational orchestration such as activation crash tracking, pending recovery state updates, and forced health probes, which belongs in service/task code not the HTTP layer.
- Propose: Refactor jail activation/deactivation/recovery coordination into services or task managers that manage state updates and health probe triggers on behalf of the router.
- Test: Confirm router tests only cover HTTP translation while unit tests for the new service/task components cover the orchestration logic.
8. Decouple periodic background jobs from FastAPI application state
- Goal: Make scheduled task runners explicit and testable by removing direct `app.state` dependency from background task code.
- Issue: `history_sync`, `geo_re_resolve`, `geo_cache_flush`, and `blocklist_import` read shared resources directly from `app.state` and register jobs against `app.state.scheduler`, which couples task implementation to FastAPI internals.
- Propose: Introduce a scheduler/task bootstrap abstraction that accepts injected settings, database providers, HTTP sessions, and scheduler handles, and move task resource access out of the `app.state` internals.
- Test: Add unit tests for task registration and execution with fake resource providers and a mock scheduler, without needing a real FastAPI instance.
9. Remove brittle scheduler bootstrap logic from blocklist job registration
- Goal: Ensure job registration is deterministic and compatible with both synchronous and asynchronous startup contexts.
- Issue: `blocklist_import.register` uses `asyncio.get_event_loop().run_until_complete` then falls back to `asyncio.ensure_future`, which is fragile and can fail if the application is already running inside an event loop.
- Propose: Refactor blocklist job registration to use a pure async bootstrap flow invoked from the lifespan handler, or resolve schedule configuration before starting the scheduler instead of forcing synchronous loop execution.
- Test: Add tests validating blocklist job registration in both non-running and running event loop contexts, and verify the registration path does not raise loop state errors.
10. Standardize async offloading and thread-backed sync work behind a shared executor abstraction
- Goal: Remove scattered `asyncio.get_event_loop().run_in_executor` / `asyncio.ensure_future` patterns and give the backend a single, consistent way to run blocking file, CPU-bound, or compatibility work.
- Issue: Multiple services (`raw_config_io_service`, `log_service`, `jail_config_service`, `setup_service`, and others) manually call `asyncio.get_event_loop()` or `run_in_executor`, creating inconsistent async control flow and making event-loop compatibility harder to maintain.
- Propose: Introduce a shared async offload abstraction or utility that uses `asyncio.to_thread` / dedicated thread pool executors, and refactor blocking I/O and CPU-bound helpers to consume that shared provider.
- Test: Verify existing async service tests still pass after refactoring and add coverage for the new offload utility to ensure blocking work is properly delegated without event-loop blocking.
11. Eliminate duplicate setup persistence across bootstrap and runtime databases
- Goal: Simplify setup persistence by establishing a single source of truth for runtime configuration data and reduce failure surface in first-run setup.
- Issue: `setup_service.run_setup` writes the same master password and runtime settings to both the bootstrap configuration database and the runtime database, making setup logic brittle and difficult to reason about.
- Propose: Consolidate setup persistence so initial configuration is stored in one clearly defined location, with bootstrap metadata in the startup DB and runtime configuration pulled from the single persisted source.
- Test: Add tests that confirm setup writes each value exactly once and that startup uses a single persisted store to resolve runtime settings, removing duplicate write paths.
12. Replace custom fail2ban socket protocol duplication with a dedicated adapter over the vendored client
- Goal: Reduce maintenance risk by aligning fail2ban integration with the bundled `fail2ban-master` codebase and isolating protocol implementation behind a swapable adapter.
- Issue: `app.utils.fail2ban_client` reimplements low-level socket framing, command encoding, and protocol parsing rather than using the vendored fail2ban client classes, creating duplicated protocol logic and an unclear source of truth.
- Propose: Introduce a fail2ban adapter interface and either wrap the vendored `fail2ban-client` implementation or refactor the custom client so it is the single canonical integration point. Ensure all services depend on the adapter abstraction rather than raw socket details.
- Test: Add adapter-level unit tests and service tests that can swap in a fake fail2ban adapter, proving the backend no longer couples business logic directly to low-level socket protocol code.
13. Introduce explicit schema migration/versioning for the runtime database
- Goal: Allow BanGUI to evolve its application database schema safely across releases and prevent startup failures caused by schema drift.
- Issue: `app/db.py` only creates missing tables on startup and has no schema version marker or migration strategy. This leaves upgrades brittle and makes it difficult to evolve the database without manual intervention.
- Propose: Add a lightweight migration system or schema version table, define incremental migration scripts, and run upgrades during startup before the application begins serving requests.
- Test: Add tests that simulate an older schema version and verify startup migrates it to the latest schema, and assert that `init_db`/migration code leaves the database in the expected current state.
14. Split the monolithic config router into focused sub-routers
- Goal: Improve maintainability, reduce cognitive load, and make HTTP routing responsibilities easier to isolate and test.
- Issue: `app/routers/config.py` and `app/routers/file_config.py` each contain many unrelated endpoints with mixed concerns, making the HTTP layer large, hard to navigate, and harder to evolve safely.
- Propose: Refactor configuration endpoints into smaller routers or feature-specific modules such as `jail_config`, `filter_config`, `action_config`, and `raw_config`. Keep each router focused on one vertical slice of functionality and delegate business logic to a single service.
- Test: Verify the refactor preserves endpoint behavior and that router tests are split along the new modules, ensuring each HTTP layer unit test covers a narrow, well-defined feature.
15. Centralize fail2ban metadata resolution and avoid repeated socket discovery
- Goal: Reduce coupling between history APIs and fail2ban socket metadata discovery, improving performance and reliability.
- Issue: `app/utils/fail2ban_db_utils.get_fail2ban_db_path` queries fail2ban for the database path on every history request. This ties read-only history endpoints to socket availability and duplicates discovery logic across the codebase.
- Propose: Introduce a dedicated fail2ban metadata service that resolves and caches runtime metadata such as the fail2ban DB path, refreshing it only when needed instead of every request.
- Test: Add tests that confirm history requests can reuse cached metadata, and that the metadata service falls back cleanly when the socket is temporarily unavailable after initial resolution.
16. Introduce an explicit application lifecycle context to replace raw `app.state` access
- Goal: Make shared infrastructure handles and runtime configuration available through a typed, injectable context instead of a loosely-typed `app.state` bag.
- Issue: The backend currently relies on raw `app.state` access in routers, tasks, and dependencies, which hides contract boundaries and allows arbitrary mutable state to leak into the framework layer.
- Propose: Define a dedicated `ApplicationContext` / runtime context object for shared resources such as `settings`, `http_session`, `scheduler`, and runtime caches. Provide it through explicit FastAPI dependencies and remove direct `request.app.state` reads outside a small bootstrap boundary.
- Test: Verify the new context can satisfy existing dependencies, and add a static check or test ensuring no backend module directly accesses `request.app.state` except through the new context provider.
17. Resolve runtime configuration before creating shared startup resources
- Goal: Ensure startup resources are initialized using the effective runtime settings, not stale bootstrap defaults.
- Issue: `startup_shared_resources` creates shared resources like `aiohttp.ClientSession` and geo cache initialization from the initial environment-loaded settings, then later applies persisted runtime overrides to `app.state.settings`, producing a fragile startup ordering.
- Propose: Split startup into phases that first resolve bootstrap and runtime persisted configuration, then construct shared resources and register scheduled jobs using those effective settings.
- Test: Add startup tests asserting that when persisted runtime settings differ from bootstrap settings, the final initialized resources are built from the resolved effective settings, not the original bootstrap values.
### 6. Update async socket handling
- Where found: `backend/app/utils/fail2ban_client.py`, `backend/app/startup.py`
- Goal: use modern asyncio APIs (`get_running_loop()`), avoid blocking operations on the event loop, and ensure startup resources are cleaned up if initialization fails.
- Status: completed — switched fail2ban socket I/O to `asyncio.to_thread` and added startup cleanup for failed resource initialization.
- Possible traps and issues:
- `asyncio.get_event_loop()` behavior changed in newer Python versions; this can cause runtime warnings or errors.
- Resource leaks can occur if `startup_shared_resources()` fails before the lifespan shutdown path is reached.
- The fail2ban socket client should still handle transient errors and not hide protocol failures behind generic exceptions.

View File

@@ -21,6 +21,7 @@ from app.config import Settings
from app.models.auth import Session
from app.models.config import PendingRecovery
from app.models.server import ServerStatus
from app.utils.runtime_state import RuntimeState
from app.utils.time_utils import utc_now
log: structlog.stdlib.BoundLogger = structlog.get_logger()
@@ -35,6 +36,7 @@ class AppState(Protocol):
server_status: ServerStatus
pending_recovery: PendingRecovery | None
last_activation: dict[str, datetime.datetime] | None
runtime_state: RuntimeState
_COOKIE_NAME = "bangui_session"

View File

@@ -45,6 +45,7 @@ from app.routers import (
)
from app.startup import startup_shared_resources
from app.utils.fail2ban_client import Fail2BanConnectionError, Fail2BanProtocolError
from app.utils.runtime_state import ApplicationState, RuntimeState
from app.utils.setup_state import is_setup_complete_cached, set_setup_complete_cache
log: structlog.stdlib.BoundLogger = structlog.get_logger()
@@ -283,7 +284,10 @@ def create_app(settings: Settings | None = None) -> FastAPI:
lifespan=_lifespan,
)
# Store settings on app.state so the lifespan handler can access them.
# Store immutable configuration and the dedicated runtime state manager on
# app.state. Runtime state values are proxied through the wrapper so the
# shared Starlette state bag itself does not hold mutable business state.
app.state = ApplicationState(RuntimeState())
app.state.settings = resolved_settings
set_setup_complete_cache(app, False)

View File

@@ -0,0 +1,85 @@
"""Centralise mutable runtime application state.
Runtime state is kept outside of Starlette's raw ``app.state`` storage and
exposed through a controlled state manager object. This keeps the FastAPI
framework state bag limited to shared infrastructure handles and immutable
configuration while still allowing existing code to access runtime values via
attribute proxying.
"""
from __future__ import annotations
import datetime
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any
from starlette.datastructures import State
from app.models.server import ServerStatus
if TYPE_CHECKING: # pragma: no cover
from app.models.config import PendingRecovery
ActivationRecord = dict[str, datetime.datetime]
_RUNTIME_ATTRIBUTES: frozenset[str] = frozenset(
{
"setup_complete_cached",
"server_status",
"pending_recovery",
"last_activation",
}
)
@dataclass
class RuntimeState:
"""Mutable runtime state for the current application instance."""
setup_complete_cached: bool = False
server_status: ServerStatus = field(default_factory=lambda: ServerStatus(online=False))
pending_recovery: PendingRecovery | None = None
last_activation: ActivationRecord | None = None
class ApplicationState(State):
"""Application state wrapper that delegates runtime state access.
This allows runtime values to be stored in a dedicated
:class:`RuntimeState` instance while preserving the familiar attribute-based
``app.state`` API for the rest of the application.
"""
def __init__(self, runtime_state: RuntimeState, state: dict[str, Any] | None = None):
super().__init__(state)
object.__setattr__(self, "_runtime_state", runtime_state)
@property
def runtime_state(self) -> RuntimeState:
"""Return the dedicated runtime state manager."""
return object.__getattribute__(self, "_runtime_state")
def __getattr__(self, key: str) -> Any:
if key in _RUNTIME_ATTRIBUTES:
return getattr(self.runtime_state, key)
return super().__getattr__(key)
def __setattr__(self, key: str, value: Any) -> None:
if key in _RUNTIME_ATTRIBUTES:
setattr(self.runtime_state, key, value)
return
super().__setattr__(key, value)
def __delattr__(self, key: str) -> None:
if key in _RUNTIME_ATTRIBUTES:
delattr(self.runtime_state, key)
return
super().__delattr__(key)
def get_runtime_state(app: Any) -> RuntimeState:
"""Return the runtime state manager for the current FastAPI application."""
state = getattr(app, "state", None)
if state is None or not hasattr(state, "runtime_state"):
raise AttributeError("Runtime state has not been initialised on the application.")
return state.runtime_state

View File

@@ -2,17 +2,22 @@
from __future__ import annotations
from fastapi import FastAPI
from typing import TYPE_CHECKING
from app.utils.runtime_state import get_runtime_state
if TYPE_CHECKING: # pragma: no cover
from fastapi import FastAPI
def is_setup_complete_cached(app: FastAPI) -> bool:
"""Return the cached setup completion state from application state."""
return getattr(app.state, "setup_complete_cached", False)
"""Return the cached setup completion state from the runtime state manager."""
return get_runtime_state(app).setup_complete_cached
def set_setup_complete_cache(app: FastAPI, completed: bool) -> None:
"""Set the cached setup completion state on application state."""
app.state.setup_complete_cached = completed
"""Set the cached setup completion state on the runtime state manager."""
get_runtime_state(app).setup_complete_cached = completed
def invalidate_setup_complete_cache(app: FastAPI) -> None:
@@ -21,4 +26,4 @@ def invalidate_setup_complete_cache(app: FastAPI) -> None:
This helper exists so the cache can be invalidated explicitly if the
application state changes during runtime.
"""
app.state.setup_complete_cached = False
get_runtime_state(app).setup_complete_cached = False

View File

@@ -60,6 +60,28 @@ def test_create_app_skips_cors_when_no_origins_are_configured() -> None:
assert cors_middleware == []
def test_create_app_initialises_runtime_state_manager() -> None:
"""The FastAPI app exposes a dedicated runtime state manager on app.state."""
settings = Settings(
database_path="/tmp/test.db",
fail2ban_socket="/tmp/fake_fail2ban.sock",
fail2ban_config_dir="/tmp/fail2ban",
session_secret="test-secret-key-do-not-use-in-production",
session_duration_minutes=60,
timezone="UTC",
log_level="debug",
)
app = create_app(settings=settings)
runtime_state = app.state.runtime_state
assert runtime_state.setup_complete_cached is False
assert runtime_state.server_status.online is False
assert runtime_state.pending_recovery is None
assert runtime_state.last_activation is None
assert app.state.server_status.online is False
def test_create_app_disables_cors_by_default() -> None:
"""The FastAPI app does not add CORS middleware when no origins are configured by environment."""
settings = Settings(