Files

Lukas 93dc699825 Fix restart/reload endpoint correctness and safety

- jail_service.restart(): replace invalid ["restart"] socket command with
  ["stop"], matching fail2ban transmitter protocol. The daemon is now
  stopped via socket; the caller starts it via subprocess.

- config_file_service: expose _start_daemon and _wait_for_fail2ban as
  public start_daemon / wait_for_fail2ban functions.

- restart_fail2ban router: orchestrate stop (socket) → start (subprocess)
  → probe (socket). Returns 204 on success, 503 when fail2ban does not
  come back within 10 s. Catches JailOperationError → 409.

- reload_fail2ban router: add JailOperationError catch → 409 Conflict,
  consistent with other jail control endpoints.

- Tests: add TestJailControls.test_restart_* (3 cases), TestReloadFail2ban
  502/409 cases, TestRestartFail2ban (5 cases), TestRollbackJail (6
  integration tests verifying file-write, subprocess invocation, socket-
  probe truthiness, active_jails count, and offline-at-call-time).

2026-03-15 12:59:17 +01:00

6.4 KiB

Raw Blame History

BanGUI — Task List

This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.

Fix: `restart` and `reload` endpoints — correctness and safety ✅ DONE

Background

Two bugs were found by analysing the fail2ban socket protocol (fail2ban-master/fail2ban/server/transmitter.py) against the current backend implementation.

Task 1 — Fix `jail_service.restart()` — invalid socket command ✅ DONE

Summary: Changed restart() to send ["stop"] instead of the invalid ["restart"] command. Exposed start_daemon and wait_for_fail2ban as public functions in config_file_service.py. The router now orchestrates the full stop→start→probe sequence.

File: backend/app/services/jail_service.py → async def restart()

Problem:
The current implementation sends ["restart"] directly to the fail2ban Unix domain socket. The fail2ban server transmitter (transmitter.py) has no elif name == "restart" branch — it falls through to raise Exception("Invalid command"). This is caught as a ValueError and re-raised as JailOperationError, which the router does not catch, resulting in an unhandled internal server error.

Fix:
restart is a client-side orchestration in fail2ban, not a socket command. Replace the body of jail_service.restart() with two sequential socket calls:

Send ["stop"] — this calls server.quit() on the daemon side (shuts down all jails and terminates the process).
Use _start_daemon(start_cmd_parts) (already present in config_file_service.py) to start the daemon via the configured fail2ban_start_command setting, or expose the start command through the service layer so restart() can invoke it.

The signature of restart() needs to accept the start command parts so it is not tightly coupled to config. Alternatively, keep the socket stop call in jail_service.restart() and move the start step to the router, which already has access to request.app.state.settings.fail2ban_start_command.

Acceptance criteria:

Calling POST /api/config/restart when fail2ban is running stops the daemon via socket, then starts it via subprocess.
No JailOperationError is raised for normal operation.
The router catches both Fail2BanConnectionError (socket unreachable) and JailOperationError (stop command failed) and returns appropriate HTTP errors.

Task 2 — Fix `restart_fail2ban` router — missing `JailOperationError` catch ✅ DONE

Summary: Added except JailOperationError branch returning HTTP 409 Conflict to restart_fail2ban. Also imports JailOperationError from jail_service.

File: backend/app/routers/config.py → async def restart_fail2ban()

Problem:
The router only catches Fail2BanConnectionError. If jail_service.restart() raises JailOperationError (e.g. fail2ban reports the stop failed), it propagates as an unhandled 500.

Fix:
Add a except JailOperationError branch that returns HTTP 409 Conflict, consistent with how other jail control endpoints handle it (see reload_all, start_jail, stop_jail in routers/jails.py).

Task 3 — Fix `reload_fail2ban` router — missing `JailOperationError` catch ✅ DONE

Summary: Added except JailOperationError branch returning HTTP 409 Conflict to reload_fail2ban.

File: backend/app/routers/config.py → async def reload_fail2ban()

Problem:
Same pattern as Task 2. jail_service.reload_all() can raise JailOperationError (e.g. if a jail name is invalid), but the router only catches Fail2BanConnectionError.

Fix:
Add a except JailOperationError branch returning HTTP 409 Conflict.

Task 4 — Add post-restart health probe to `restart_fail2ban` ✅ DONE

Summary: restart_fail2ban now: (1) stops via socket, (2) starts via config_file_service.start_daemon(), (3) probes with config_file_service.wait_for_fail2ban(). Returns 204 on success, 503 when fail2ban does not come back within 10 s.

File: backend/app/routers/config.py → async def restart_fail2ban()

Problem:
After a successful stop + start, there is no verification that fail2ban actually came back online. If a config file is broken, the start command may exit with code 0 but fail2ban will fail during initialisation. The router returns HTTP 204 No Content regardless.

Fix:
After jail_service.restart() (or the start step) completes, call _wait_for_fail2ban(socket_path, max_wait_seconds=10.0) (already implemented in config_file_service.py) and change the response:

If fail2ban is responsive: return 204 No Content (or a 200 with a body such as {"fail2ban_running": true, "active_jails": <count>}).
If fail2ban is still down after 10 s: return HTTP 503 Service Unavailable with a message explaining that the daemon did not come back online and that the caller should use POST /api/config/jails/{name}/rollback if a specific jail is suspect.

Change the response model and HTTP status code accordingly. Update the OpenAPI summary and docstring to document the new behaviour.

Task 5 — Verify `rollback_jail` is correctly wired end-to-end ✅ DONE

Summary: Added TestRollbackJail class in tests/test_services/test_config_file_service.py with 6 integration tests covering: .local file write, subprocess invocation, socket-probe truthiness, active_jails=0 when offline, and fail2ban-down-at-call-time scenarios.

Files:

backend/app/routers/config.py → async def rollback_jail()
backend/app/services/config_file_service.py → async def rollback_jail()

Problem:
This is the only fully safe recovery path (works even when fail2ban is down because it is a pure file write followed by a subprocess start). Verify with an integration test that:

A jail .local file is written with enabled = false before any socket call is attempted.
The start command is invoked via subprocess (not via socket).
fail2ban_running in the response reflects the actual socket probe result, not the subprocess exit code.
active_jails is 0 (not a stale count) when fail2ban_running is false.

Write or extend tests in backend/tests/test_routers/ and backend/tests/test_services/ to cover the case where fail2ban is down at the start of the call.

6.4 KiB Raw Blame History

BanGUI — Task List

Fix: restart and reload endpoints — correctness and safety ✅ DONE

Background

Task 1 — Fix jail_service.restart() — invalid socket command ✅ DONE

Task 2 — Fix restart_fail2ban router — missing JailOperationError catch ✅ DONE

Task 3 — Fix reload_fail2ban router — missing JailOperationError catch ✅ DONE

Task 4 — Add post-restart health probe to restart_fail2ban ✅ DONE

Task 5 — Verify rollback_jail is correctly wired end-to-end ✅ DONE