fix: retry, semaphore, reload lock, activation verify, bans_by_jail diagnostics

Stage 1.1-1.3: reload_all include/exclude_jails params already implemented;
  added keyword-arg assertions in router and service tests.

Stage 2.1/6.1: _send_command_sync retry loop (3 attempts, 150ms exp backoff)
  retrying on EAGAIN/ECONNREFUSED/ENOBUFS; immediate raise on all other errors.

Stage 2.2: asyncio.Lock at module level in jail_service.reload_all to
  serialize concurrent `reload --all` commands.

Stage 3.1: activate_jail re-queries _get_active_jail_names after reload;
  returns active=False with descriptive message if jail did not start.

Stage 4.1/6.2: asyncio.Semaphore (max 10) in Fail2BanClient.send, lazy-
  initialized; logs fail2ban_command_waiting_semaphore at debug when waiting.

Stage 5.1/5.2: unit tests asserting reload_all is called with include_jails
  and exclude_jails; activation verification happy/sad path tests.

Stage 6.3: TestSendCommandSyncRetry (5 cases) + TestFail2BanClientSemaphore
  concurrency test.

Stage 7.1-7.3: _since_unix uses time.time(); bans_by_jail debug logging with
  since_iso; diagnostic warning when total==0 despite table rows; unit test
  verifying the warning fires for stale data.
This commit is contained in:
2026-03-14 11:09:55 +01:00
parent 2274e20123
commit 2f2e5a7419
9 changed files with 880 additions and 115 deletions


@@ -4,48 +4,273 @@ This document breaks the entire BanGUI project into development stages, ordered
---
## Task 1 — Fix vertical alignment of "DNS Mode" dropdown with "Date Pattern" dropdown in Jail Configuration ✅ DONE
**Status:** Completed — Added `alignItems: "end"` to `fieldRow` in `configStyles.ts`. The two dropdowns now baseline-align to the bottom of their grid row regardless of the "Date Pattern" hint text. Verified no regressions in any other `fieldRow` usages (all other rows have consistent hint presence across their fields).
**Component:** `frontend/src/components/config/JailsTab.tsx`
**Styles:** `frontend/src/components/config/configStyles.ts`
### Problem
In the jail configuration detail view (`JailConfigDetail`), the "Date Pattern" and "DNS Mode" fields sit side-by-side in a 2-column CSS grid row (class `fieldRow`). The "Date Pattern" `<Field>` has a `hint` prop (`"Leave blank for auto-detect."`) which renders extra text between the label and the input, pushing the `<Combobox>` control downward. The "DNS Mode" `<Field>` has no `hint`, so its `<Select>` control sits higher. This causes the two dropdowns to be vertically misaligned.
### Location in code
- Around **line 321** of `JailsTab.tsx` there is a `<div className={styles.fieldRow}>` that contains both fields.
- The "Date Pattern" field (lines ~322-341) uses `<Combobox>` with a `hint` prop.
- The "DNS Mode" field (lines ~342-353) uses `<Select>` without a `hint` prop.
### Acceptance criteria
1. The bottom edges of the "Date Pattern" dropdown and the "DNS Mode" dropdown must be visually aligned on the same horizontal line.
2. The fix must not break the responsive layout (on narrow screens the grid collapses to a single column via `@media (max-width: 900px)`).
3. No other fields in the jail config form should be affected.
### Suggested approach
**Option A — Align grid items to the end (preferred):**
In `configStyles.ts`, add `alignItems: "end"` to the `fieldRow` style. This makes each grid cell align its content to the bottom, so the two inputs line up regardless of whether one field has a hint and the other doesn't. Verify that this does not break other rows that also use `fieldRow` (Backend/Log Encoding row, and any rows in `DefaultsTab.tsx` or `FiltersTab.tsx`).
**Option B — Add a matching hint to DNS Mode:**
Add a `hint` with a non-breaking space (`hint={"\u00A0"}`) to the DNS Mode `<Field>` so it takes up the same vertical space as the Date Pattern hint. This is simpler but slightly hacky.
### Files to modify
| File | Change |
|------|--------|
| `frontend/src/components/config/configStyles.ts` | Add `alignItems: "end"` to `fieldRow` (Option A) |
| — *or* — | |
| `frontend/src/components/config/JailsTab.tsx` | Add `hint` to DNS Mode `<Field>` (Option B) |
### Testing
- Open the app, navigate to **Configuration → Jails**, select any jail.
- Confirm the "Date Pattern" combobox and "DNS Mode" select are vertically aligned.
- Resize the browser below 900 px and confirm both fields stack into a single column correctly.
- Check other config tabs (Defaults, Filters) that reuse `fieldRow` to ensure no regression.
---
## Stage 1 — Bug Fix: Jail Activation / Deactivation Reload Stream
### 1.1 Fix `reload_all` to include newly activated jails in the start stream ✅ DONE
**Problem:**
When a user activates an inactive jail (e.g. `apache-auth`), the backend writes `enabled = true` to `jail.d/apache-auth.local` and calls `jail_service.reload_all()`. However, `reload_all` queries the *currently running* jails via `["status"]` to build the start stream. Since the new jail is not yet running, it is excluded from the stream. After `reload --all`, fail2ban's end-of-reload phase deletes every jail not in the stream — so the newly activated jail never starts.
The inverse bug exists for deactivation: the jail is still running when `["status"]` is queried, so it remains in the stream and may be restarted despite `enabled = false` being written to the config.
**Fix:**
Add keyword-only `include_jails` and `exclude_jails` parameters to `jail_service.reload_all()`. Callers merge these into the stream derived from the current status. `activate_jail` passes `include_jails=[name]`; `deactivate_jail` passes `exclude_jails=[name]`. All existing callers are unaffected (both params default to `None`).
**Files:**
- `backend/app/services/jail_service.py` — `reload_all()`
- `backend/app/services/config_file_service.py` — `activate_jail()`, `deactivate_jail()`
**Acceptance criteria:**
- Activating an inactive jail via the API actually starts it in fail2ban.
- Deactivating a running jail via the API actually stops it after reload.
- All other callers of `reload_all()` (config save, filter/action updates) continue to work without changes.
---
### 1.2 Add unit tests for `reload_all` with `include_jails` / `exclude_jails` ✅ DONE
Write tests that verify the new parameters produce the correct fail2ban command stream.
**Test cases:**
1. `reload_all(sock, include_jails=["apache-auth"])` when currently running jails are `["sshd", "nginx"]` → the stream sent to fail2ban must contain `["start", "apache-auth"]`, `["start", "nginx"]`, and `["start", "sshd"]`.
2. `reload_all(sock, exclude_jails=["sshd"])` when currently running jails are `["sshd", "nginx"]` → the stream must contain only `["start", "nginx"]`, **not** `["start", "sshd"]`.
3. `reload_all(sock, include_jails=["new"], exclude_jails=["old"])` when running jails are `["old", "nginx"]` → stream must contain `["start", "new"]` and `["start", "nginx"]`, **not** `["start", "old"]`.
4. `reload_all(sock)` without extra args continues to work exactly as before (backwards compatibility).
**Files:**
- `backend/tests/test_services/test_jail_service.py`
---
### 1.3 Add integration-level tests for activate / deactivate endpoints ✅ DONE
Verify that the `POST /api/config/jails/{name}/activate` and `POST /api/config/jails/{name}/deactivate` endpoints pass the correct `include_jails` / `exclude_jails` arguments through to `reload_all`. These tests mock `jail_service.reload_all` and assert on the keyword arguments it receives.
**Files:**
- `backend/tests/test_routers/test_config.py` (or a new `test_config_activate.py`)
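The stream merge described in 1.1 reduces to a small pure function. The sketch below uses hypothetical names (`build_start_stream` is not the real helper, and the real `reload_all` also emits per-jail option commands), but it captures the include/exclude semantics the test cases above exercise:

```python
def build_start_stream(running, include_jails=None, exclude_jails=None):
    """Merge the currently running jails with the include/exclude overrides.

    `running` is the jail list obtained from fail2ban's ["status"] command;
    the result is the list of ["start", <jail>] entries for the reload stream.
    """
    names = set(running) | set(include_jails or [])
    names -= set(exclude_jails or [])
    return [["start", name] for name in sorted(names)]
```

With this shape, test case 1 becomes a plain equality assertion on the returned stream, and backwards compatibility (case 4) falls out of the `None` defaults.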
---
## Stage 2 — Socket Connection Resilience
### 2.1 Add retry logic to `Fail2BanClient.send` for transient connection errors ✅ DONE
**Problem:**
The logs show intermittent `fail2ban_connection_error` events during parallel command bursts (e.g. when fetching jail details after a reload). The fail2ban Unix socket can momentarily refuse connections while processing a reload.
**Task:**
Add a configurable retry mechanism (default 2 retries, 100 ms backoff) to `Fail2BanClient.send()` that catches `ConnectionRefusedError` / `FileNotFoundError` and retries before raising `Fail2BanConnectionError`. This must not retry on protocol-level errors (e.g. unknown jail) — only on connection failures.
**Files:**
- `backend/app/utils/fail2ban_client.py`
**Acceptance criteria:**
- Transient socket errors during reload bursts are retried transparently.
- Non-connection errors (e.g. unknown jail) are raised immediately without retry.
- A structured log message is emitted for each retry attempt.
- Unit tests cover retry success, retry exhaustion, and non-retryable errors.
---
### 2.2 Serialize concurrent `reload_all` calls ✅ DONE
**Problem:**
Multiple browser tabs or fast UI clicks could trigger concurrent `reload_all` calls. Sending overlapping `reload --all` commands to the fail2ban socket is undefined behavior and may cause jail loss.
**Task:**
Add an asyncio lock inside `reload_all` (module-level `asyncio.Lock`) so that concurrent calls are serialized. If a reload is already in progress, subsequent calls wait rather than firing in parallel.
**Files:**
- `backend/app/services/jail_service.py`
**Acceptance criteria:**
- Two concurrent `reload_all` calls are serialized; the second waits for the first to finish.
- Unit test demonstrates that the lock prevents overlapping socket commands.
---
## Stage 3 — Activate / Deactivate UX Improvements
### 3.1 Return the jail's runtime status after activation ✅ DONE
**Problem:**
After activating a jail, the API returns `active: True` optimistically before verifying that fail2ban actually started the jail. If the reload silently fails (e.g. bad regex in the jail config), the frontend shows the jail as active but it is not.
**Task:**
After calling `reload_all`, query `["status"]` and verify the activated jail appears in the running jail list. If it does not, return `active: False` with a warning message explaining the jail config may be invalid. Log a warning event.
**Files:**
- `backend/app/services/config_file_service.py` — `activate_jail()`
**Acceptance criteria:**
- Successful activation returns `active: True` only after verification.
- If the jail doesn't start (e.g. bad config), the response has `active: False` and a descriptive message.
- A structured log event is emitted on verification failure.
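The verification step can be sketched as follows (parameter names are hypothetical; in the real service the reload and the status query live in `jail_service`, and the warning goes through structured logging):

```python
async def activate_jail(name, reload_all, get_active_jail_names):
    """Activate `name`, then confirm fail2ban actually started it."""
    await reload_all(include_jails=[name])
    running = await get_active_jail_names()  # re-query after the reload
    if name not in running:
        # Jail config is likely invalid (bad regex, missing log path, ...)
        return {"active": False,
                "message": f"Jail '{name}' was reloaded but did not start; "
                           "check its configuration."}
    return {"active": True}
```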
---
### 3.2 Frontend feedback for activation failure
**Task:**
If the activation endpoint returns `active: False`, the ConfigPage jail detail pane should show a warning toast/banner explaining that the jail could not be started and the user should check the jail configuration (filters, log paths, regex etc.).
**Files:**
- `frontend/src/hooks/useConfigActiveStatus.ts` (or relevant hook)
- `frontend/src/components/config/` (jail detail component)
---
## Stage 4 — Parallel Command Throttling
### 4.1 Limit concurrent fail2ban socket commands ✅ DONE
**Problem:**
When loading jail details for multiple active jails, the backend fires dozens of `get` commands in parallel (bantime, findtime, maxretry, failregex, etc. × N jails). The fail2ban socket is single-threaded and some commands time out or fail with connection errors under this load.
**Task:**
Introduce an asyncio `Semaphore` (configurable, default 10) that limits the number of in-flight fail2ban commands. All code paths that use `Fail2BanClient.send()` should acquire the semaphore first. This can be implemented as a connection-pool wrapper or a middleware in the client.
**Files:**
- `backend/app/utils/fail2ban_client.py`
**Acceptance criteria:**
- No more than N commands are sent to the socket concurrently.
- Connection errors during jail detail fetches are eliminated under normal load.
- A structured log event is emitted when a command waits for the semaphore.
---
## Stage 5 — Test Coverage Hardening
### 5.1 Add tests for `activate_jail` and `deactivate_jail` service functions ✅ DONE
**Task:**
Write comprehensive unit tests for `config_file_service.activate_jail` and `config_file_service.deactivate_jail`, covering:
- Happy path: jail exists, is inactive, local file is written, reload includes it, response is correct.
- Jail not found in config → `JailNotFoundInConfigError`.
- Jail already active → `JailAlreadyActiveError`.
- Jail already inactive → `JailAlreadyInactiveError`.
- Reload fails → activation still returns but with logged warning.
- Override parameters (bantime, findtime, etc.) are written to the `.local` file correctly.
**Files:**
- `backend/tests/test_services/test_config_file_service.py`
---
### 5.2 Add tests for deactivate path with `exclude_jails` ✅ DONE
**Task:**
Verify that `deactivate_jail` passes `exclude_jails=[name]` to `reload_all`, ensuring the jail is removed from the start stream. Mock `jail_service.reload_all` and assert the keyword arguments.
**Files:**
- `backend/tests/test_services/test_config_file_service.py`
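With the service mocked, the assertion is a one-liner. A sketch using `unittest.mock.AsyncMock` (the `deactivate_jail` below is a simplified stand-in for the real service function, which also rewrites the `.local` file before reloading):

```python
import asyncio
from unittest.mock import AsyncMock

async def deactivate_jail(name, reload_all):
    # Stand-in for config_file_service.deactivate_jail
    await reload_all(exclude_jails=[name])

def test_deactivate_passes_exclude_jails():
    reload_all = AsyncMock()
    asyncio.run(deactivate_jail("sshd", reload_all))
    reload_all.assert_awaited_once_with(exclude_jails=["sshd"])
```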
---
## Stage 6 — Bug Fix: 502 "Resource temporarily unavailable" on fail2ban Socket
### 6.1 Add retry with back-off to `_send_command_sync` for transient `OSError` ✅ DONE
**Problem:**
Under concurrent load the fail2ban Unix socket returns `[Errno 11] Resource temporarily unavailable` (EAGAIN). The `_send_command_sync` function in `fail2ban_client.py` catches this as a generic `OSError` and immediately raises `Fail2BanConnectionError`, which the routers translate into a 502 response. There is no retry.
**Task:**
Wrap the `sock.connect()` / `sock.sendall()` / `sock.recv()` block inside a retry loop (max 3 attempts, exponential back-off starting at 150 ms). Only retry on `OSError` with `errno` in `{errno.EAGAIN, errno.ECONNREFUSED, errno.ENOBUFS}` — all other `OSError` variants and all `Fail2BanProtocolError` cases must be raised immediately.
Emit a structured log event (`fail2ban_socket_retry`) on each retry attempt containing the attempt number, the errno, and the socket path. After the final retry is exhausted, raise `Fail2BanConnectionError` as today.
**Files:**
- `backend/app/utils/fail2ban_client.py` — `_send_command_sync()`
**Acceptance criteria:**
- A transient EAGAIN on the first attempt is silently retried and succeeds on the second attempt without surfacing a 502.
- Non-retryable socket errors (e.g. `ENOENT` — socket file missing) are raised immediately on the first attempt.
- A `Fail2BanProtocolError` (unpickle failure) is never retried.
- After 3 consecutive EAGAIN failures, `Fail2BanConnectionError` is raised as before.
- Each retry is logged with `structlog`.
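A sketch of the retry wrapper, with a hypothetical `send_once` callable standing in for the `connect()`/`sendall()`/`recv()` block:

```python
import errno
import time

RETRYABLE_ERRNOS = {errno.EAGAIN, errno.ECONNREFUSED, errno.ENOBUFS}

def send_with_retry(send_once, attempts=3, base_delay=0.15):
    """Retry transient socket errors with exponential back-off."""
    for attempt in range(1, attempts + 1):
        try:
            return send_once()
        except OSError as exc:
            if exc.errno not in RETRYABLE_ERRNOS or attempt == attempts:
                raise  # non-retryable errno, or retries exhausted
            # Real code emits a fail2ban_socket_retry log event here
            # (attempt number, errno, socket path) before sleeping.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

`Fail2BanProtocolError` never enters the `except OSError` branch, so unpickle failures propagate immediately, as required.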
---
### 6.2 Add a concurrency semaphore to `Fail2BanClient.send` ✅ DONE
**Problem:**
Dashboard page load fires many parallel `get` commands (jail details, ban stats, trend data). The fail2ban socket is single-threaded; flooding it causes the EAGAIN errors from 6.1.
**Task:**
Introduce an `asyncio.Semaphore` (configurable, default 10) at the module level in `fail2ban_client.py`. Acquire the semaphore in `Fail2BanClient.send()` before dispatching `_send_command_sync` to the thread-pool executor. This caps the number of in-flight socket commands and prevents the socket backlog from overflowing.
**Files:**
- `backend/app/utils/fail2ban_client.py`
**Acceptance criteria:**
- No more than 10 commands are sent to the socket concurrently.
- Under normal load, the 502 errors are eliminated.
- A structured log event is emitted when a command has to wait for the semaphore (debug level).
---
### 6.3 Unit tests for socket retry and semaphore ✅ DONE
**Task:**
Write tests that verify:
1. A single transient `OSError(errno.EAGAIN)` is retried and the command succeeds.
2. Three consecutive EAGAIN failures raise `Fail2BanConnectionError`.
3. An `OSError(errno.ENOENT)` (socket missing) is raised immediately without retry.
4. The semaphore limits concurrency — launch 20 parallel `send()` calls against a mock that records timestamps and assert no more than 10 overlap.
**Files:**
- `backend/tests/test_utils/test_fail2ban_client.py`
---
## Stage 7 — Bug Fix: Empty Bans-by-Jail Response
### 7.1 Investigate and fix the empty `bans_by_jail` query ✅ DONE
**Problem:**
`GET /api/dashboard/bans/by-jail?range=30d` returns `{"jails":[],"total":0}` even though ban data exists in the fail2ban database. The query in `ban_service.bans_by_jail()` filters on `WHERE timeofban >= ?` using a Unix timestamp computed from `datetime.now(tz=UTC)`. If the fail2ban database stores `timeofban` in local time rather than UTC (which is the default for fail2ban ≤ 1.0), the comparison silently excludes all rows because the UTC timestamp is hours ahead of the local-time values.
**Task:**
1. Query the fail2ban database for a few sample `timeofban` values and compare them to `datetime.now(tz=UTC).timestamp()` and `time.time()`. Determine whether fail2ban stores bans in UTC or local time.
2. If fail2ban uses `time.time()` (which returns UTC on all platforms), then the bug is elsewhere — add debug logging to `bans_by_jail` that logs `since`, the actual `SELECT COUNT(*)` result, and `db_path` so the root cause can be traced from production logs.
3. If the timestamps are local time, change `_since_unix()` to use `time.time()` (always UTC epoch) instead of `datetime.now(tz=UTC).timestamp()` to stay consistent. Both should be equivalent on correctly configured systems, but `time.time()` avoids any timezone-aware datetime pitfalls.
4. Add a guard: if `total == 0` and the range is `30d` or `365d`, run a `SELECT COUNT(*) FROM bans` (no WHERE) and log the result. If there are rows in the table but zero match the filter, log a warning with the `since` timestamp and the min/max `timeofban` values from the table. This makes future debugging trivial.
**Files:**
- `backend/app/services/ban_service.py` — `_since_unix()`, `bans_by_jail()`
**Acceptance criteria:**
- `bans_by_jail` returns the correct jail counts for the requested time range.
- When zero results are returned despite data existing, a warning log is emitted with diagnostic information (since timestamp, db row count, min/max timeofban).
- `_since_unix()` uses a method consistent with how fail2ban stores timestamps.
---
### 7.2 Add a `/api/dashboard/bans/by-jail` diagnostic endpoint or debug logging ✅ DONE
**Task:**
Add debug-level structured log output to `bans_by_jail` that includes:
- The resolved `db_path`.
- The computed `since` Unix timestamp and its ISO representation.
- The raw `total` count from the first query.
- The number of jail groups returned.
This allows operators to diagnose empty-result issues from the container logs without code changes.
**Files:**
- `backend/app/services/ban_service.py` — `bans_by_jail()`
---
### 7.3 Unit tests for `bans_by_jail` with a seeded in-memory database ✅ DONE
**Task:**
Write tests that create a temporary SQLite database matching the fail2ban `bans` table schema, seed it with rows at known timestamps, and call `bans_by_jail` (mocking `_get_fail2ban_db_path` to point at the temp database). Verify:
1. Rows within the time range are counted and grouped by jail correctly.
2. Rows outside the range are excluded.
3. The `origin` filter (`"blocklist"` / `"selfblock"`) partitions results as expected.
4. An empty database returns `{"jails": [], "total": 0}` without error.
**Files:**
- `backend/tests/test_services/test_ban_service.py`
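A self-contained sketch of such a test setup. The grouping query and the `bans` schema are simplified (the real fail2ban table has more columns, and the real `bans_by_jail` signature resolves the database path internally):

```python
import sqlite3

def bans_by_jail(db_path, since):
    """Count bans per jail since the given Unix timestamp."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT jail, COUNT(*) FROM bans WHERE timeofban >= ? "
            "GROUP BY jail ORDER BY COUNT(*) DESC",
            (since,),
        ).fetchall()
    finally:
        con.close()
    return {"jails": [{"jail": j, "count": c} for j, c in rows],
            "total": sum(c for _, c in rows)}

def make_seeded_db(path, now):
    """Create a minimal bans table with rows inside and outside a window."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE bans (jail TEXT, ip TEXT, timeofban INTEGER)")
    con.executemany(
        "INSERT INTO bans VALUES (?, ?, ?)",
        [("sshd", "1.2.3.4", now - 3600),         # inside a 24h window
         ("sshd", "5.6.7.8", now - 7200),         # inside a 24h window
         ("nginx", "9.9.9.9", now - 1800),        # inside a 24h window
         ("sshd", "4.4.4.4", now - 40 * 86400)],  # outside a 30d window
    )
    con.commit()
    con.close()
```

In the real tests, `_get_fail2ban_db_path` would be mocked to return the temp path instead of passing it explicitly.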