Files
BanGUI/Docs/Tasks.md
Lukas 93dc699825 Fix restart/reload endpoint correctness and safety
- jail_service.restart(): replace invalid ["restart"] socket command with
  ["stop"], matching fail2ban transmitter protocol. The daemon is now
  stopped via socket; the caller starts it via subprocess.

- config_file_service: expose _start_daemon and _wait_for_fail2ban as
  public start_daemon / wait_for_fail2ban functions.

- restart_fail2ban router: orchestrate stop (socket) → start (subprocess)
  → probe (socket). Returns 204 on success, 503 when fail2ban does not
  come back within 10 s. Catches JailOperationError → 409.

- reload_fail2ban router: add JailOperationError catch → 409 Conflict,
  consistent with other jail control endpoints.

- Tests: add TestJailControls.test_restart_* (3 cases), TestReloadFail2ban
  502/409 cases, TestRestartFail2ban (5 cases), TestRollbackJail (6
  integration tests verifying file-write, subprocess invocation, socket-
  probe truthiness, active_jails count, and offline-at-call-time).
2026-03-15 12:59:17 +01:00

151 lines
6.4 KiB
Markdown

# BanGUI — Task List
This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.
---
## Fix: `restart` and `reload` endpoints — correctness and safety ✅ DONE
### Background
Two bugs were found by analysing the fail2ban socket protocol
(`fail2ban-master/fail2ban/server/transmitter.py`) against the current
backend implementation.
---
### Task 1 — Fix `jail_service.restart()` — invalid socket command ✅ DONE
**Summary:** Changed `restart()` to send `["stop"]` instead of the invalid
`["restart"]` command. Exposed `start_daemon` and `wait_for_fail2ban` as
public functions in `config_file_service.py`. The router now orchestrates
the full stop→start→probe sequence.
**File:** `backend/app/services/jail_service.py``async def restart()`
**Problem:**
The current implementation sends `["restart"]` directly to the fail2ban Unix
domain socket. The fail2ban server transmitter (`transmitter.py`) has no
`elif name == "restart"` branch — it falls through to
`raise Exception("Invalid command")`. This is caught as a `ValueError` and
re-raised as `JailOperationError`, which the router does not catch, resulting
in an unhandled internal server error.
**Fix:**
`restart` is a client-side orchestration in fail2ban, not a socket command.
Replace the body of `jail_service.restart()` with two sequential socket calls:
1. Send `["stop"]` — this calls `server.quit()` on the daemon side (shuts down
all jails and terminates the process).
2. Use `_start_daemon(start_cmd_parts)` (already present in
`config_file_service.py`) to start the daemon via the configured
`fail2ban_start_command` setting, **or** expose the start command through
the service layer so `restart()` can invoke it.
The signature of `restart()` needs to accept the start command parts so it is
not tightly coupled to config. Alternatively, keep the socket `stop` call in
`jail_service.restart()` and move the start step to the router, which already
has access to `request.app.state.settings.fail2ban_start_command`.
**Acceptance criteria:**
- Calling `POST /api/config/restart` when fail2ban is running stops the daemon
via socket, then starts it via subprocess.
- No `JailOperationError` is raised for normal operation.
- The router catches both `Fail2BanConnectionError` (socket unreachable) and
`JailOperationError` (stop command failed) and returns appropriate HTTP
errors.
---
### Task 2 — Fix `restart_fail2ban` router — missing `JailOperationError` catch ✅ DONE
**Summary:** Added `except JailOperationError` branch returning HTTP 409 Conflict
to `restart_fail2ban`. Also imports `JailOperationError` from `jail_service`.
**File:** `backend/app/routers/config.py``async def restart_fail2ban()`
**Problem:**
The router only catches `Fail2BanConnectionError`. If `jail_service.restart()`
raises `JailOperationError` (e.g. fail2ban reports the stop failed), it
propagates as an unhandled 500.
**Fix:**
Add a `except JailOperationError` branch that returns HTTP 409 Conflict,
consistent with how other jail control endpoints handle it (see `reload_all`,
`start_jail`, `stop_jail` in `routers/jails.py`).
---
### Task 3 — Fix `reload_fail2ban` router — missing `JailOperationError` catch ✅ DONE
**Summary:** Added `except JailOperationError` branch returning HTTP 409 Conflict
to `reload_fail2ban`.
**File:** `backend/app/routers/config.py``async def reload_fail2ban()`
**Problem:**
Same pattern as Task 2. `jail_service.reload_all()` can raise
`JailOperationError` (e.g. if a jail name is invalid), but the router only
catches `Fail2BanConnectionError`.
**Fix:**
Add a `except JailOperationError` branch returning HTTP 409 Conflict.
---
### Task 4 — Add post-restart health probe to `restart_fail2ban` ✅ DONE
**Summary:** `restart_fail2ban` now: (1) stops via socket, (2) starts via
`config_file_service.start_daemon()`, (3) probes with
`config_file_service.wait_for_fail2ban()`. Returns 204 on success, 503 when
fail2ban does not come back within 10 s.
**File:** `backend/app/routers/config.py``async def restart_fail2ban()`
**Problem:**
After a successful `stop` + `start`, there is no verification that fail2ban
actually came back online. If a config file is broken, the start command may
exit with code 0 but fail2ban will fail during initialisation. The router
returns HTTP 204 No Content regardless.
**Fix:**
After `jail_service.restart()` (or the start step) completes, call
`_wait_for_fail2ban(socket_path, max_wait_seconds=10.0)` (already implemented
in `config_file_service.py`) and change the response:
- If fail2ban is responsive: return 204 No Content (or a 200 with a body
such as `{"fail2ban_running": true, "active_jails": <count>}`).
- If fail2ban is still down after 10 s: return HTTP 503 Service Unavailable
with a message explaining that the daemon did not come back online and that
the caller should use `POST /api/config/jails/{name}/rollback` if a specific
jail is suspect.
Change the response model and HTTP status code accordingly. Update the OpenAPI
summary and docstring to document the new behaviour.
---
### Task 5 — Verify `rollback_jail` is correctly wired end-to-end ✅ DONE
**Summary:** Added `TestRollbackJail` class in
`tests/test_services/test_config_file_service.py` with 6 integration tests
covering: `.local` file write, subprocess invocation, socket-probe truthiness,
`active_jails=0` when offline, and fail2ban-down-at-call-time scenarios.
**Files:**
- `backend/app/routers/config.py``async def rollback_jail()`
- `backend/app/services/config_file_service.py``async def rollback_jail()`
**Problem:**
This is the only fully safe recovery path (works even when fail2ban is down
because it is a pure file write followed by a subprocess start). Verify with
an integration test that:
1. A jail `.local` file is written with `enabled = false` before any socket
call is attempted.
2. The start command is invoked via subprocess (not via socket).
3. `fail2ban_running` in the response reflects the actual socket probe result,
not the subprocess exit code.
4. `active_jails` is 0 (not a stale count) when `fail2ban_running` is false.
Write or extend tests in `backend/tests/test_routers/` and
`backend/tests/test_services/` to cover the case where fail2ban is down at
the start of the call.