Fix restart/reload endpoint correctness and safety

- jail_service.restart(): replace invalid ["restart"] socket command with ["stop"], matching fail2ban transmitter protocol. The daemon is now stopped via socket; the caller starts it via subprocess. - config_file_service: expose _start_daemon and _wait_for_fail2ban as public start_daemon / wait_for_fail2ban functions. - restart_fail2ban router: orchestrate stop (socket) → start (subprocess) → probe (socket). Returns 204 on success, 503 when fail2ban does not come back within 10 s. Catches JailOperationError → 409. - reload_fail2ban router: add JailOperationError catch → 409 Conflict, consistent with other jail control endpoints. - Tests: add TestJailControls.test_restart_* (3 cases), TestReloadFail2ban 502/409 cases, TestRestartFail2ban (5 cases), TestRollbackJail (6 integration tests verifying file-write, subprocess invocation, socket- probe truthiness, active_jails count, and offline-at-call-time).
2026-03-15 12:59:17 +01:00
parent 61daa8bbc0
commit 93dc699825
7 changed files with 487 additions and 135 deletions
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -3,137 +3,149 @@
 This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.

 ---
+## Fix: `restart` and `reload` endpoints — correctness and safety ✅ DONE

-## Agent Operating Instructions
+### Background

-These instructions apply to every AI agent working in this repository. Read them fully before touching any file.
-
-### Before You Begin
-
-1. Read [Instructions.md](Instructions.md) in full — it defines the project context, coding standards, and workflow rules. Every rule there is authoritative and takes precedence over any assumption you make.
-2. Read [Architekture.md](Architekture.md) to understand the system structure before touching any component.
-3. Read the development guide relevant to your task: [Backend-Development.md](Backend-Development.md) or [Web-Development.md](Web-Development.md) (or both).
-4. Read [Features.md](Features.md) so you understand what the product is supposed to do and do not accidentally break intended behaviour.
-
-### How to Work Through This Document
-
- Tasks are grouped by feature area. Each group is self-contained.
- Work through tasks in the order they appear within a group; earlier tasks establish foundations for later ones.
- Mark a task **in-progress** before you start it and **completed** the moment it is done. Never batch completions.
- If a task depends on another task that is not yet complete, stop and complete the dependency first.
- If you are uncertain whether a change is correct, read the relevant documentation section again before proceeding. Do not guess.
-
-### Code Quality Rules (Summary)
-
- No TODOs, no placeholders, no half-finished functions.
- Full type annotations on every function (Python) and full TypeScript types on every symbol (no `any`).
- Layered architecture: routers → services → repositories. No layer may skip another.
- All backend errors are raised as typed HTTP exceptions; all unexpected errors are logged via structlog before re-raising.
- All frontend state lives in typed hooks; no raw `fetch` calls outside of the `api/` layer.
- After every code change, run the full test suite (`make test`) and ensure it is green.
-
-### Definition of Done
-
-A task is done when:
- The code compiles and the test suite passes (`make test`).
- The feature works end-to-end in the dev stack (`make up`).
- No new lint errors are introduced.
- The change is consistent with all documentation rules.
+Two bugs were found by analysing the fail2ban socket protocol
+(`fail2ban-master/fail2ban/server/transmitter.py`) against the current
+backend implementation.

 ---

-## Bug Fixes
+### Task 1 — Fix `jail_service.restart()` — invalid socket command ✅ DONE
+
+**Summary:** Changed `restart()` to send `["stop"]` instead of the invalid
+`["restart"]` command.  Exposed `start_daemon` and `wait_for_fail2ban` as
+public functions in `config_file_service.py`.  The router now orchestrates
+the full stop→start→probe sequence.
+
+**File:** `backend/app/services/jail_service.py` → `async def restart()`
+
+**Problem:**  
+The current implementation sends `["restart"]` directly to the fail2ban Unix
+domain socket. The fail2ban server transmitter (`transmitter.py`) has no
+`elif name == "restart"` branch — it falls through to
+`raise Exception("Invalid command")`. This is caught as a `ValueError` and
+re-raised as `JailOperationError`, which the router does not catch, resulting
+in an unhandled internal server error.
+
+**Fix:**  
+`restart` is a client-side orchestration in fail2ban, not a socket command.
+Replace the body of `jail_service.restart()` with two sequential socket calls:
+1. Send `["stop"]` — this calls `server.quit()` on the daemon side (shuts down
+   all jails and terminates the process).
+2. Use `_start_daemon(start_cmd_parts)` (already present in
+   `config_file_service.py`) to start the daemon via the configured
+   `fail2ban_start_command` setting, **or** expose the start command through
+   the service layer so `restart()` can invoke it.
+
+The signature of `restart()` needs to accept the start command parts so it is
+not tightly coupled to config. Alternatively, keep the socket `stop` call in
+`jail_service.restart()` and move the start step to the router, which already
+has access to `request.app.state.settings.fail2ban_start_command`.
+
+**Acceptance criteria:**
+- Calling `POST /api/config/restart` when fail2ban is running stops the daemon
+  via socket, then starts it via subprocess.
+- No `JailOperationError` is raised for normal operation.
+- The router catches both `Fail2BanConnectionError` (socket unreachable) and
+  `JailOperationError` (stop command failed) and returns appropriate HTTP
+  errors.

 ---

-### BUG-001 — fail2ban: `bangui-sim` jail fails to start due to missing `banaction`
+### Task 2 — Fix `restart_fail2ban` router — missing `JailOperationError` catch ✅ DONE

-**Status:** Done
+**Summary:** Added `except JailOperationError` branch returning HTTP 409 Conflict
+to `restart_fail2ban`.  Also imports `JailOperationError` from `jail_service`.

-**Summary:** `jail.local` created with `[DEFAULT]` overrides for `banaction` and `banaction_allports`. The container init script (`init-fail2ban-config`) overwrites `jail.conf` from the image's `/defaults/` on every start, so modifying `jail.conf` directly is ineffective. `jail.local` is not in the container's defaults and thus persists correctly. Additionally fixed a `TypeError` in `config_file_service.py` where `except jail_service.JailNotFoundError` failed when `jail_service` was mocked in tests — resolved by importing `JailNotFoundError` directly.
+**File:** `backend/app/routers/config.py` → `async def restart_fail2ban()`

-#### Error
+**Problem:**  
+The router only catches `Fail2BanConnectionError`. If `jail_service.restart()`
+raises `JailOperationError` (e.g. fail2ban reports the stop failed), it
+propagates as an unhandled 500.

-```
-Failed during configuration: Bad value substitution: option 'action' in section 'bangui-sim'
-contains an interpolation key 'banaction' which is not a valid option name.
-Raw value: '%(action_)s'
-```
-
-#### Root Cause
-
-fail2ban's interpolation system resolves option values at configuration load time by
-substituting `%(key)s` placeholders with values from the same section or from `[DEFAULT]`.
-
-The chain that fails is:
-
-1. Every jail inherits `action = %(action_)s` from `[DEFAULT]` (no override in `bangui-sim.conf`).
-2. `action_` is defined in `[DEFAULT]` as `%(banaction)s[port="%(port)s", protocol="%(protocol)s", chain="%(chain)s"]`.
-3. `banaction` is **commented out** in `[DEFAULT]`:
-   ```ini
-   # Docker/fail2ban-dev-config/fail2ban/jail.conf  [DEFAULT]
-   #banaction = iptables-multiport      ← this line is disabled
-   ```
-4. Because `banaction` is absent from the interpolation namespace, fail2ban cannot resolve
-   `action_`, which makes it unable to resolve `action`, and the jail fails to load.
-
-The same root cause affects every jail in `jail.d/` that does not define its own `banaction`,
-including `blocklist-import.conf`.
-
-#### Fix
-
-**File:** `Docker/fail2ban-dev-config/fail2ban/jail.conf`
-
-Uncomment the `banaction` line inside the `[DEFAULT]` section so the value is globally
-available to all jails:
-
-```ini
-banaction = iptables-multiport
-banaction_allports = iptables-allports
-```
-
-This is safe: the dev compose (`Docker/compose.debug.yml`) already grants the fail2ban
-container `NET_ADMIN` and `NET_RAW` capabilities, which are the prerequisites for
-iptables-based banning.
-
-#### Tasks
-
- [x] **BUG-001-T1 — Add `banaction` override via `jail.local` [DEFAULT]**
-
-  Open `Docker/fail2ban-dev-config/fail2ban/jail.conf`.
-  Find the two commented-out lines near the `action_` definition:
-  ```ini
-  #banaction = iptables-multiport
-  #banaction_allports = iptables-allports
-  ```
-  Remove the leading `#` from both lines so they become active options.
-  Do not change any other part of the file.
-
- [x] **BUG-001-T2 — Restart the fail2ban container and verify clean startup**
-
-  Bring the dev stack down and back up:
-  ```bash
-  make down && make up
-  ```
-  Wait for the fail2ban container to reach `healthy`, then inspect its logs:
-  ```bash
-  make logs   # or: docker logs bangui-fail2ban-dev 2>&1 | grep -i error
-  ```
-  Confirm that no `Bad value substitution` or `Failed during configuration` lines appear
-  and that both `bangui-sim` and `blocklist-import` jails show as **enabled** in the output.
-
- [x] **BUG-001-T3 — Verify ban/unban cycle works end-to-end**
-
-  With the stack running, trigger the simulation script:
-  ```bash
-  bash Docker/simulate_failed_logins.sh
-  ```
-  Then confirm fail2ban has recorded a ban:
-  ```bash
-  bash Docker/check_ban_status.sh
-  ```
-  The script should report at least one banned IP in the `bangui-sim` jail.
-  Also verify that the BanGUI dashboard reflects the new ban entry.
+**Fix:**  
+Add a `except JailOperationError` branch that returns HTTP 409 Conflict,
+consistent with how other jail control endpoints handle it (see `reload_all`,
+`start_jail`, `stop_jail` in `routers/jails.py`).

 ---

+### Task 3 — Fix `reload_fail2ban` router — missing `JailOperationError` catch ✅ DONE
+
+**Summary:** Added `except JailOperationError` branch returning HTTP 409 Conflict
+to `reload_fail2ban`.
+
+**File:** `backend/app/routers/config.py` → `async def reload_fail2ban()`
+
+**Problem:**  
+Same pattern as Task 2. `jail_service.reload_all()` can raise
+`JailOperationError` (e.g. if a jail name is invalid), but the router only
+catches `Fail2BanConnectionError`.
+
+**Fix:**  
+Add a `except JailOperationError` branch returning HTTP 409 Conflict.
+
+---
+
+### Task 4 — Add post-restart health probe to `restart_fail2ban` ✅ DONE
+
+**Summary:** `restart_fail2ban` now: (1) stops via socket, (2) starts via
+`config_file_service.start_daemon()`, (3) probes with
+`config_file_service.wait_for_fail2ban()`.  Returns 204 on success, 503 when
+fail2ban does not come back within 10 s.
+
+**File:** `backend/app/routers/config.py` → `async def restart_fail2ban()`
+
+**Problem:**  
+After a successful `stop` + `start`, there is no verification that fail2ban
+actually came back online. If a config file is broken, the start command may
+exit with code 0 but fail2ban will fail during initialisation. The router
+returns HTTP 204 No Content regardless.
+
+**Fix:**  
+After `jail_service.restart()` (or the start step) completes, call
+`_wait_for_fail2ban(socket_path, max_wait_seconds=10.0)` (already implemented
+in `config_file_service.py`) and change the response:
+
+- If fail2ban is responsive: return 204 No Content (or a 200 with a body
+  such as `{"fail2ban_running": true, "active_jails": <count>}`).
+- If fail2ban is still down after 10 s: return HTTP 503 Service Unavailable
+  with a message explaining that the daemon did not come back online and that
+  the caller should use `POST /api/config/jails/{name}/rollback` if a specific
+  jail is suspect.
+
+Change the response model and HTTP status code accordingly. Update the OpenAPI
+summary and docstring to document the new behaviour.
+
+---
+
+### Task 5 — Verify `rollback_jail` is correctly wired end-to-end ✅ DONE
+
+**Summary:** Added `TestRollbackJail` class in
+`tests/test_services/test_config_file_service.py` with 6 integration tests
+covering: `.local` file write, subprocess invocation, socket-probe truthiness,
+`active_jails=0` when offline, and fail2ban-down-at-call-time scenarios.
+
+**Files:**  
+- `backend/app/routers/config.py` → `async def rollback_jail()`  
+- `backend/app/services/config_file_service.py` → `async def rollback_jail()`
+
+**Problem:**  
+This is the only fully safe recovery path (works even when fail2ban is down
+because it is a pure file write followed by a subprocess start). Verify with
+an integration test that:
+
+1. A jail `.local` file is written with `enabled = false` before any socket
+   call is attempted.
+2. The start command is invoked via subprocess (not via socket).
+3. `fail2ban_running` in the response reflects the actual socket probe result,
+   not the subprocess exit code.
+4. `active_jails` is 0 (not a stale count) when `fail2ban_running` is false.
+
+Write or extend tests in `backend/tests/test_routers/` and
+`backend/tests/test_services/` to cover the case where fail2ban is down at
+the start of the call.