Improve activation rollback messages in ActivateJailDialog

- Replace vague 'System Recovered' message with 'Configuration Rolled Back'
  and actionable text describing the rollback outcome
- Replace 'Manual Intervention Required' with 'Rollback Unsuccessful' and
  specific instructions: check jail.d/{name}.local, fix manually, restart
- Add test_activate_jail_rollback_deletes_file_when_no_prior_local to cover
  rollback path when no .local file existed before activation
- Mark all three tasks complete in Tasks.md
2026-03-15 13:41:14 +01:00
parent 12f04bd8d6
commit 41dcd60225

Tasks.md

@@ -3,149 +3,61 @@
This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.
---
## ✅ Task: Add "Deactivate Jail" Button for Inactive Jails in the Config View
**Context:**
In `frontend/src/components/config/JailsTab.tsx`, the "Deactivate Jail" button is currently only rendered when a jail is active. When a jail is inactive but has an existing `jail.d/{name}.local` file (i.e. it was previously configured and has `enabled = false`), there is no action button to clean up that override file.
**Goal:**
Add a "Deactivate Jail" (or "Remove Config") button to the jail config view for inactive jails that have a `.local` file. Clicking it should delete the `jail.d/{name}.local` file via the existing deactivate endpoint (`POST /api/config/jails/{name}/deactivate`) or a dedicated delete-override endpoint, making the UI consistent with the active-jail view. If no `.local` file exists for the inactive jail, the button should not be shown (there is nothing to clean up).
**Acceptance criteria:**
- Inactive jails that own a `.local` file show a "Deactivate Jail" button in their config panel.
- Clicking the button removes or neutralises the `.local` file and refreshes the jail list.
- Inactive jails without a `.local` file are unaffected and show no extra button.
---
## Fix: `restart` and `reload` endpoints — correctness and safety ✅ DONE
### Background
Two bugs were found by analysing the fail2ban socket protocol (`fail2ban-master/fail2ban/server/transmitter.py`) against the current backend implementation.
---
### Task 1 — Fix `jail_service.restart()` — invalid socket command ✅ DONE
**Summary:** Changed `restart()` to send `["stop"]` instead of the invalid `["restart"]` command. Exposed `start_daemon` and `wait_for_fail2ban` as public functions in `config_file_service.py`. The router now orchestrates the full stop→start→probe sequence.
**File:** `backend/app/services/jail_service.py`, `async def restart()`
**Problem:**
The current implementation sends `["restart"]` directly to the fail2ban Unix domain socket. The fail2ban server transmitter (`transmitter.py`) has no `elif name == "restart"` branch — it falls through to `raise Exception("Invalid command")`. This is caught as a `ValueError` and re-raised as `JailOperationError`, which the router does not catch, resulting in an unhandled internal server error.
**Fix:**
`restart` is a client-side orchestration in fail2ban, not a socket command. Replace the body of `jail_service.restart()` with two sequential steps:
1. Send `["stop"]` — this calls `server.quit()` on the daemon side (shuts down all jails and terminates the process).
2. Use `_start_daemon(start_cmd_parts)` (already present in `config_file_service.py`) to start the daemon via the configured `fail2ban_start_command` setting, **or** expose the start command through the service layer so `restart()` can invoke it.
The signature of `restart()` needs to accept the start command parts so it is not tightly coupled to config. Alternatively, keep the socket `stop` call in `jail_service.restart()` and move the start step to the router, which already has access to `request.app.state.settings.fail2ban_start_command`.
**Acceptance criteria:**
- Calling `POST /api/config/restart` when fail2ban is running stops the daemon via socket, then starts it via subprocess.
- No `JailOperationError` is raised for normal operation.
- The router catches both `Fail2BanConnectionError` (socket unreachable) and `JailOperationError` (stop command failed) and returns appropriate HTTP errors.
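The stop → start → probe orchestration that Task 1 settles on can be sketched as follows. This is a standalone sketch, not the project's code: `send_command`, `start_daemon`, and `probe` are injected stand-ins for the real service-layer helpers, whose exact signatures are assumptions here.

```python
import time

class Fail2BanConnectionError(Exception):
    """Raised when the fail2ban control socket is unreachable."""

def restart_daemon(send_command, start_daemon, probe,
                   max_wait_seconds=10.0, poll_interval=0.1,
                   sleep=time.sleep, clock=time.monotonic):
    """Stop via socket, start via subprocess, then poll until responsive.

    There is no "restart" socket command in fail2ban's transmitter, so the
    orchestration is: send ["stop"] (server.quit() on the daemon side),
    launch the configured start command, then probe until the socket answers
    or the deadline passes. Returns True if fail2ban came back online.
    """
    send_command(["stop"])      # socket call: shuts down all jails, exits
    start_daemon()              # subprocess: e.g. fail2ban_start_command
    deadline = clock() + max_wait_seconds
    while clock() < deadline:
        try:
            if probe():         # e.g. a ping over the freshly created socket
                return True
        except Fail2BanConnectionError:
            pass                # socket not up yet; keep polling
        sleep(poll_interval)
    return False
```

The callables are injected so the sequence is unit-testable without a running daemon, which is also why the sketch avoids touching a real socket.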
---
## Task: Remove the "fail2ban Stopped After Jail Activation" Recovery Banner
**Context:**
`frontend/src/components/common/RecoveryBanner.tsx` renders a full-page banner with the heading *"fail2ban Stopped After Jail Activation"* and the body *"fail2ban stopped responding after activating jail `{name}`. The jail's configuration may be invalid."* together with "Disable & Restart" and "View Logs" action buttons. This banner interrupts the UI even when the backend has already handled the rollback automatically.
**Goal:**
Remove the `RecoveryBanner` component and all its mount points from the application. Any state that was used exclusively to drive this banner (e.g. a `fail2banStopped` flag or related context) should also be removed. If the underlying crash-detection logic is still needed for other features, keep that logic but detach it from the banner render path.
**Acceptance criteria:**
- The full-page banner no longer appears under any circumstances.
- No dead code or orphaned state references remain after the removal.
- All existing tests that reference `RecoveryBanner` are updated or removed accordingly.
---
### Task 2 — Fix `restart_fail2ban` router — missing `JailOperationError` catch ✅ DONE
**Summary:** Added an `except JailOperationError` branch returning HTTP 409 Conflict to `restart_fail2ban`. Also imports `JailOperationError` from `jail_service`.
**File:** `backend/app/routers/config.py`, `async def restart_fail2ban()`
**Problem:**
The router only catches `Fail2BanConnectionError`. If `jail_service.restart()` raises `JailOperationError` (e.g. fail2ban reports the stop failed), it propagates as an unhandled 500.
**Fix:**
Add an `except JailOperationError` branch that returns HTTP 409 Conflict, consistent with how other jail control endpoints handle it (see `reload_all`, `start_jail`, `stop_jail` in `routers/jails.py`).
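Reduced to its core, the Task 2 fix is an exception-to-status mapping. A minimal sketch using plain functions rather than the FastAPI router; mapping the unreachable socket to 503 is an assumption here, since the document only pins down the 409:

```python
class Fail2BanConnectionError(Exception):
    """Control socket unreachable (daemon down or wrong path)."""

class JailOperationError(Exception):
    """fail2ban accepted the connection but rejected the operation."""

def run_with_http_mapping(op):
    """Run a jail-control operation and map failures to HTTP statuses.

    Returns (status_code, detail). 409 Conflict is the branch the fix adds;
    without it, JailOperationError surfaced as an unhandled 500.
    """
    try:
        op()
    except Fail2BanConnectionError as exc:
        return 503, f"fail2ban socket unreachable: {exc}"   # assumed status
    except JailOperationError as exc:
        return 409, f"operation rejected by fail2ban: {exc}"  # the added branch
    return 204, None
```

In the real router the two branches would raise `HTTPException` instead of returning tuples; the tuple form just keeps the sketch framework-free.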
---
### Task 3 — Fix `reload_fail2ban` router — missing `JailOperationError` catch ✅ DONE
**Summary:** Added an `except JailOperationError` branch returning HTTP 409 Conflict to `reload_fail2ban`.
**File:** `backend/app/routers/config.py`, `async def reload_fail2ban()`
**Problem:**
Same pattern as Task 2. `jail_service.reload_all()` can raise `JailOperationError` (e.g. if a jail name is invalid), but the router only catches `Fail2BanConnectionError`.
**Fix:**
Add an `except JailOperationError` branch returning HTTP 409 Conflict.
---
## Task: Fix Activation Failure Rollback — Actually Delete the `.local` File
**Context:**
When jail activation fails after the `jail.d/{name}.local` file has already been written (i.e. fail2ban reloaded but the jail never came up, or fail2ban became unresponsive), `_rollback_activation_async()` in `backend/app/services/config_file_service.py` is supposed to restore the pre-activation state. The frontend then displays *"Activation Failed — System Recovered"* with the message *"Activation of jail `{name}` failed. The server has been automatically recovered."*
In practice, recovery does not happen: the `.local` file remains on disk with `enabled = true`, leaving fail2ban in a broken state on next restart. The frontend misleadingly reports success.
**Goal:**
Fix `_rollback_activation_async()` so that it reliably removes (or reverts) the `.local` file whenever activation fails:
1. If the `.local` file did not exist before activation, **delete** `jail.d/{name}.local` outright.
2. If it existed before activation (e.g. previously had `enabled = false`), **restore** its original content atomically (temp-file rename pattern already used elsewhere in the service).
3. After deleting/restoring the file, attempt a `reload_all` socket command so fail2ban picks up the reverted state.
4. Only set `recovered = true` in the `JailActivationResponse` once the steps above have actually succeeded. If any step fails, set `recovered = false` and log the error.
5. On the frontend side, the *"Activation Failed — System Recovered"* `MessageBar` in `ActivateJailDialog.tsx` should only be shown when the backend actually returns `recovered = true`. The current misleading message should be replaced with a more actionable one when `recovered = false`.
**Acceptance criteria:**
- After a failed activation, `jail.d/{name}.local` is either absent or contains its pre-activation content.
- `recovered: true` is only returned when the rollback fully succeeded.
- The UI message accurately reflects the actual recovery state.
- A test in `backend/tests/test_services/` covers the rollback path, asserting the file is absent/reverted and the response flag is correct.
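The rollback contract in steps 1–4 of the *Fix Activation Failure Rollback* task can be sketched like this. It is illustrative only: the real `_rollback_activation_async()` is async and lives in `config_file_service.py`, while here `reload_all` is an injected callable and the logging in step 4 is left as a comment.

```python
import os
from pathlib import Path
from typing import Callable, Optional

def rollback_activation(local_path: Path, original_content: Optional[str],
                        reload_all: Callable[[], None]) -> bool:
    """Revert jail.d/{name}.local after a failed activation.

    original_content is None when no .local file existed before activation
    (step 1: delete), otherwise the prior content is restored atomically
    (step 2), then fail2ban is re-synced (step 3). The return value is the
    `recovered` flag (step 4): True only when every step succeeded.
    """
    try:
        if original_content is None:
            local_path.unlink(missing_ok=True)        # step 1: delete outright
        else:
            tmp = local_path.with_name(local_path.name + ".tmp")
            tmp.write_text(original_content)          # step 2: write temp file,
            os.replace(tmp, local_path)               # then atomic rename
        reload_all()                                  # step 3: socket reload
    except Exception:
        # step 4: a real implementation would log the error here
        return False
    return True
```

The caller captures `original_content` (or `None`) before writing the activation override, so the rollback never has to guess the pre-activation state.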
### Task 4 — Add post-restart health probe to `restart_fail2ban` ✅ DONE
**Summary:** `restart_fail2ban` now: (1) stops via socket, (2) starts via
`config_file_service.start_daemon()`, (3) probes with
`config_file_service.wait_for_fail2ban()`. Returns 204 on success, 503 when
fail2ban does not come back within 10 s.
**File:** `backend/app/routers/config.py`, `async def restart_fail2ban()`
**Problem:**
After a successful `stop` + `start`, there is no verification that fail2ban
actually came back online. If a config file is broken, the start command may
exit with code 0 but fail2ban will fail during initialisation. The router
returns HTTP 204 No Content regardless.
**Fix:**
After `jail_service.restart()` (or the start step) completes, call
`_wait_for_fail2ban(socket_path, max_wait_seconds=10.0)` (already implemented
in `config_file_service.py`) and change the response:
- If fail2ban is responsive: return 204 No Content (or a 200 with a body
such as `{"fail2ban_running": true, "active_jails": <count>}`).
- If fail2ban is still down after 10 s: return HTTP 503 Service Unavailable
with a message explaining that the daemon did not come back online and that
the caller should use `POST /api/config/jails/{name}/rollback` if a specific
jail is suspect.
Change the response model and HTTP status code accordingly. Update the OpenAPI
summary and docstring to document the new behaviour.
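The probe-to-response mapping Task 4 describes might look like the sketch below. It picks the 200-with-body variant the task offers (rather than a bare 204) and hard-codes the documented hint text; the function name and body shape are assumptions, not the project's actual response model.

```python
def restart_response(fail2ban_running: bool, active_jails: int = 0):
    """Map the post-restart health probe result to (status_code, body).

    Success returns the optional body form suggested in the task; failure
    returns 503 with the guidance the task text specifies.
    """
    if fail2ban_running:
        return 200, {"fail2ban_running": True, "active_jails": active_jails}
    return 503, {"detail": ("fail2ban did not come back online within 10 s; "
                            "if a specific jail is suspect, use "
                            "POST /api/config/jails/{name}/rollback.")}
```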
---
### Task 5 — Verify `rollback_jail` is correctly wired end-to-end ✅ DONE
**Summary:** Added `TestRollbackJail` class in
`tests/test_services/test_config_file_service.py` with 6 integration tests
covering: `.local` file write, subprocess invocation, socket-probe truthiness,
`active_jails=0` when offline, and fail2ban-down-at-call-time scenarios.
**Files:**
- `backend/app/routers/config.py`, `async def rollback_jail()`
- `backend/app/services/config_file_service.py`, `async def rollback_jail()`
**Problem:**
This is the only fully safe recovery path (works even when fail2ban is down
because it is a pure file write followed by a subprocess start). Verify with
an integration test that:
1. A jail `.local` file is written with `enabled = false` before any socket
call is attempted.
2. The start command is invoked via subprocess (not via socket).
3. `fail2ban_running` in the response reflects the actual socket probe result,
not the subprocess exit code.
4. `active_jails` is 0 (not a stale count) when `fail2ban_running` is false.
Write or extend tests in `backend/tests/test_routers/` and
`backend/tests/test_services/` to cover the case where fail2ban is down at
the start of the call.
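The ordering checks in points 1–3 can be exercised with a recording harness like the one below. Everything here is a stand-in: `rollback_jail` is a toy following the documented contract, not the real service function, and the fakes simply append events so the test can assert their order.

```python
events = []

def write_override(name):
    events.append(("write", name, "enabled = false"))   # fake .local file write

def start_subprocess():
    events.append(("subprocess",))                      # fake start command

def probe_socket():
    events.append(("socket",))                          # fake probe: daemon down
    return False

def rollback_jail(name):
    """Toy rollback_jail following the documented contract."""
    write_override(name)          # 1. file write before any socket traffic
    start_subprocess()            # 2. start via subprocess, not via socket
    running = probe_socket()      # 3. response reflects the probe result,
    return {"fail2ban_running": running,   # not the subprocess exit code
            "active_jails": 0}    # 4. no stale count while the daemon is down

resp = rollback_jail("sshd")
assert events[0][0] == "write"                       # override written first
assert events.index(("subprocess",)) < events.index(("socket",))
assert resp == {"fail2ban_running": False, "active_jails": 0}
```

The same recording pattern works against the real service by monkeypatching the file-write, subprocess, and socket helpers in `backend/tests/test_services/`.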