Fix fail2ban runtime errors: jail not found, action locks, log noise

This commit fixes three independent bugs in the fail2ban configuration and integration layer:

1. Task 1: Detect UnknownJailException and prevent silent failures
   - Added JailNotFoundError detection in jail_service.reload_all()
   - Enhanced error handling in config_file_service to catch JailNotFoundError
   - Added specific error message with logpath validation hints
   - Added rollback test for this scenario

2. Task 2: Fix iptables-allports exit code 4 (xtables lock contention)
   - Added global banaction setting in jail.conf with -w 5 lockingopt
   - Removed redundant per-jail banaction overrides from bangui-sim and blocklist-import
   - Added production compose documentation note

3. Task 3: Suppress log noise from unsupported backend/idle commands
   - Implemented capability detection to cache command support status
   - Double-check locking to minimize lock contention
   - Avoids sending unsupported get <jail> backend/idle commands
   - Returns default values without socket calls when unsupported

All changes include comprehensive tests and maintain backward compatibility.
Commit f62785aaf2 (parent 1e33220f59), 2026-03-15 10:57:00 +01:00
8 changed files with 446 additions and 145 deletions


This document breaks the entire BanGUI project into development stages, ordered…
---
## Agent Operating Instructions

These instructions apply to every AI agent working in this repository. Read them fully before touching any file.

### Repository Layout

```
backend/app/           FastAPI application (Python 3.12+, async)
  models/              Pydantic v2 models
  repositories/        Database access (aiosqlite)
  routers/             FastAPI routers — thin, delegate to services
  services/            Business logic; all fail2ban interaction lives here
  tasks/               APScheduler background jobs
  utils/               Shared helpers (fail2ban_client.py, ip_utils.py, …)
backend/tests/         pytest-asyncio test suite mirroring app/ structure
Docker/                Compose files, Dockerfiles, dev config for fail2ban
  fail2ban-dev-config/ Bind-mounted into the fail2ban container in debug mode
frontend/src/          React + TypeScript SPA (Vite, Fluent UI)
fail2ban-master/       Vendored fail2ban source — DO NOT EDIT
Docs/                  Architecture, design notes, this file
```

### Coding Conventions

- **Python**: async/await throughout. Use `structlog` for logging (`log.info/warning/error` with keyword args). Pydantic v2 models. Type-annotated functions.
- **Error handling**: Raise domain-specific exceptions (`JailNotFoundError`, `JailOperationError`, `Fail2BanConnectionError`) from service functions; let routers map them to HTTP responses. Never swallow exceptions silently in routers.
- **fail2ban socket**: All communication goes through `app.utils.fail2ban_client.Fail2BanClient`. Use `_safe_get` when a missing command is non-fatal; use `_ok()` when the response is required.
- **Tests**: Mirror the `app/` tree under `tests/`. Use `pytest-asyncio` + `unittest.mock.patch` or `AsyncMock`. Keep fixtures in `conftest.py`. Run with `pytest backend/tests` from the repo root (venv at `.venv/`).
- **Docker dev config**: `Docker/fail2ban-dev-config/` is bind-mounted read-write into the fail2ban container. Changes here take effect after `fail2ban-client reload`.
- **Compose**: Use `Docker/compose.debug.yml` for local development; `Docker/compose.prod.yml` for production. Never commit secrets.

### How to Verify Changes

1. Run the backend test suite: `cd /home/lukas/Volume/repo/BanGUI && .venv/bin/pytest backend/tests -x -q`
2. Check for Python type errors: `.venv/bin/mypy backend/app` (if mypy is installed).
3. For runtime verification, start the debug stack: `docker compose -f Docker/compose.debug.yml up`.
4. Inspect fail2ban logs inside the container: `docker exec bangui-fail2ban-dev fail2ban-client status` and `docker logs bangui-fail2ban-dev`.

### Agent Workflow

1. Read the task description fully. Read every source file mentioned before editing anything.
2. Make the minimal change that solves the stated problem — do not refactor surrounding code.
3. After every code change, run the test suite to confirm no regressions.
4. If a fix requires a config file change in `Docker/fail2ban-dev-config/`, also update `Docker/compose.prod.yml` or document the required change if it affects the production config volume.
5. Mark the task complete only after tests pass and the root-cause error no longer appears.

---

## Task 3 — Fix Transparent Pie Chart Slices and Match Legend Label Colors to Slice Colors

### Root Cause

The pie chart slices appear transparent because `resolveFluentToken` in `frontend/src/utils/chartTheme.ts` fails to resolve Fluent UI CSS custom properties. It calls `getComputedStyle(document.documentElement)` — but `document.documentElement` is the `<html>` element, and Fluent UI v9's `FluentProvider` injects its CSS custom properties on its own wrapper `<div class="fui-FluentProvider ...">`, **not** on `<html>` or `:root`. Therefore:

1. `getComputedStyle(document.documentElement).getPropertyValue('--colorPaletteBlueBorderActive')` returns `""` (empty string).
2. `resolveFluentToken` falls back to returning the raw token string, e.g. `"var(--colorPaletteBlueBorderActive)"`.
3. Recharts internally parses colour values for SVG rendering and animation interpolation. It cannot parse a `var(...)` reference, so the SVG `fill` attribute ends up transparent/unset.

This affects **all four chart components** — `TopCountriesPieChart`, `TopCountriesBarChart`, `BanTrendChart`, and `JailDistributionChart` — since they all call `resolveFluentToken`. However, the pie chart is the most visually obvious case because its slices have no fallback colour.

### Fix

**File:** `frontend/src/utils/chartTheme.ts`, `resolveFluentToken` function (around line 30)

Change the element passed to `getComputedStyle` from `document.documentElement` to the FluentProvider's wrapper element. The FluentProvider wrapper always has the CSS class `fui-FluentProvider` (this is a stable class name defined in `@fluentui/react-provider`). Query for it with `document.querySelector`:

```typescript
export function resolveFluentToken(tokenValue: string): string {
  const match = /var\((--[^,)]+)/.exec(tokenValue);
  if (match == null || match[1] == null) return tokenValue;
  // FluentProvider injects CSS custom properties on its own wrapper <div>,
  // not on :root. Query that element so we resolve actual colour values.
  const el =
    document.querySelector(".fui-FluentProvider") ?? document.documentElement;
  const resolved = getComputedStyle(el)
    .getPropertyValue(match[1])
    .trim();
  return resolved !== "" ? resolved : tokenValue;
}
```

This is the **only change needed** in this file. Do not modify `CHART_PALETTE` or any other export.

### Verification

After applying the fix above:

- Open the Dashboard page in the browser.
- The pie chart slices must be filled with the palette colours (blue, red, green, gold, purple).
- The bar chart, area chart, and jail distribution chart should also display their colours correctly.
- The legend labels next to the pie chart must have the same font colour as their corresponding slice (this was already implemented by the previous agent in `TopCountriesPieChart.tsx` via the `legendFormatter` that wraps text in a `<span style={{ color: entry.color }}>`). Since `entry.color` comes from the Recharts payload, which reads the Cell `fill`, once the fill values are real hex strings the legend colours will also be correct.

### What NOT to change

- Do **not** modify `TopCountriesPieChart.tsx` — the `legendFormatter` changes already applied there are correct and will work once colours resolve properly.
- Do **not** modify `CHART_PALETTE` or switch from Fluent tokens to hard-coded hex values — the token-based approach is correct; only the resolution target element was wrong.
- Do **not** add refs or hooks to individual chart components — the single-line fix in `resolveFluentToken` is sufficient.
---
## Task 4 — Merge Global Tab into Server Tab (Remove Duplicates)

The Global tab (`frontend/src/components/config/GlobalTab.tsx`) and the Server tab (`frontend/src/components/config/ServerTab.tsx`) both expose the same four editable fields: **Log Level**, **Log Target**, **DB Purge Age**, and **DB Max Matches**. The Server tab additionally shows read-only **DB Path** and **Syslog Socket** fields, plus a **Flush Logs** button. Having both tabs is confusing and can cause conflicting writes.

### Changes

1. **Remove the Global tab entirely.**
   - In `frontend/src/pages/ConfigPage.tsx`:
     - Remove `"global"` from the `TabValue` union type.
     - Remove the `<Tab value="global">Global</Tab>` element from the `<TabList>`.
     - Remove the `{tab === "global" && <GlobalTab />}` conditional render.
     - Remove the `GlobalTab` import.
   - In the barrel export file (`frontend/src/components/config/index.ts`): remove the `GlobalTab` re-export.
   - Delete the file `frontend/src/components/config/GlobalTab.tsx`.
2. **Ensure the Server tab retains all functionality.** It already has all four editable fields plus extra read-only info and Flush Logs. No changes needed in `ServerTab.tsx` — it already covers everything Global had. Verify the field hints from Global ("Ban records older than this…" and "Maximum number of log-line matches…") are present on the Server tab's DB Purge Age and DB Max Matches fields. If they are missing, copy them over from `GlobalTab.tsx` before deleting it.
3. **Backend**: No backend changes needed. The `PUT /api/config/global` endpoint stays; the Server tab already uses its own update mechanism via `useServerSettings`.

---

## Task 5 — Merge Map Tab into Server Tab

The Map tab (`frontend/src/components/config/MapTab.tsx`) configures map color thresholds (Low / Medium / High). Move this section into the Server tab so all server-side and display configuration is in one place.

### Changes

1. **Add a "Map Color Thresholds" section to `ServerTab.tsx`.**
   - Below the existing server settings card and Flush Logs button, add a new `sectionCard` block containing the map threshold form fields (Low, Medium, High) with the same validation logic currently in `MapTab.tsx` (high > medium > low, all positive integers).
   - Use `useAutoSave` to save thresholds, calling `updateMapColorThresholds` from `frontend/src/api/config.ts`. Fetch initial values with `fetchMapColorThresholds`.
   - Add a section heading: "Map Color Thresholds" and the description text explaining how thresholds drive country fill colors on the World Map page.
2. **Remove the Map tab.**
   - In `frontend/src/pages/ConfigPage.tsx`: remove `"map"` from `TabValue`, remove the `<Tab value="map">Map</Tab>` element, remove the conditional render, remove the `MapTab` import.
   - In the barrel export: remove the `MapTab` re-export.
   - Delete `frontend/src/components/config/MapTab.tsx`.
3. **Backend**: No changes needed. The `GET/PUT /api/config/map-color-thresholds` endpoints stay as-is.

---

## Task 6 — Merge Log Tab into Server Tab

The Log tab (`frontend/src/components/config/LogTab.tsx`) shows a Service Health panel and a log viewer. Move both into the Server tab to consolidate all server-related views.

### Changes

1. **Add a "Service Health" section to `ServerTab.tsx`.**
   - Below the map thresholds section (from Task 5), add the service health grid showing: online status badge, fail2ban version, active jail count, total bans, total failures, log level (read-only), log target (read-only). Fetch this data from `fetchServiceStatus` in `frontend/src/api/config.ts`.
2. **Add the "Log Viewer" section to `ServerTab.tsx`.**
   - Below the health panel, add the log viewer with all its existing controls: line count selector, filter input, refresh button, auto-refresh toggle, and the colour-coded log display. Migrate all the log-related state, refs, and helper functions from `LogTab.tsx`.
   - Keep the line-severity colour coding (ERROR=red, WARNING=yellow, DEBUG=gray).
3. **Remove the Log tab.**
   - In `frontend/src/pages/ConfigPage.tsx`: remove `"log"` from `TabValue`, remove the `<Tab value="log">Log</Tab>` element, remove the conditional render, remove the `LogTab` import.
   - In the barrel export: remove the `LogTab` re-export.
   - Delete `frontend/src/components/config/LogTab.tsx`.
4. **Alternatively**, if the Server tab becomes too large, extract the service health + log viewer into a sub-component (e.g. `ServerHealthSection.tsx`) and render it inside `ServerTab.tsx`. This keeps the code manageable but still presents a single tab to the user.
5. **Backend**: No changes needed.

---

## Task 7 — Add Reload / Restart Button to Server Section

Add a button (or two buttons) in the Server tab that allows the user to reload or restart the fail2ban service on demand.

### Backend

A reload endpoint already exists: `POST /api/config/reload` (see `backend/app/routers/config.py`, around line 342). It calls `jail_service.reload_all()`, which stops and restarts all jails with the current config. No new backend endpoint is needed for reload.

For a **restart** (full daemon restart, not just reload), check whether the backend already supports it. If not, add a new endpoint:

- `POST /api/config/restart` — calls `jail_service.restart()` (or equivalent `fail2ban-client restart`). This should fully stop and restart the fail2ban daemon. Return 204 on success, 502 if fail2ban is unreachable.

### Frontend

**File:** `frontend/src/components/config/ServerTab.tsx`

A frontend API function `reloadConfig` already exists in `frontend/src/api/config.ts` (around line 94).

1. In the Server tab's button row (next to the existing "Flush Logs" button), add two new buttons:
   - **"Reload fail2ban"** — calls `reloadConfig()`. Show a loading spinner while in progress, and a success/error `MessageBar` when done. Use a descriptive icon such as `ArrowSync24Regular` or `ArrowClockwise24Regular`.
   - **"Restart fail2ban"** — calls the restart API (if added above, or the same reload endpoint if a full restart is not available). Use a warning-style appearance or a confirmation dialog before executing, since a restart briefly takes fail2ban offline.
2. Display feedback: success message ("fail2ban reloaded successfully") or error message on failure.
3. Optionally, after a successful reload/restart, re-fetch the service health data so the health panel updates immediately.

### Frontend API

If a restart endpoint was added to the backend, add a corresponding function in `frontend/src/api/config.ts`:

```typescript
export async function restartFail2Ban(): Promise<void> {
  await post<undefined>(ENDPOINTS.configRestart, undefined);
}
```

And add `configRestart: "/api/config/restart"` to the `ENDPOINTS` object.

---

## Bug Fixes — fail2ban Runtime Errors (2026-03-14)

The following three independent bugs were identified from fail2ban logs. They are ordered by severity. Resolve them in order; each is self-contained.

---

### Task 1 — `UnknownJailException('airsonic-auth')` on reload

**Priority**: High
**Files**: `Docker/fail2ban-dev-config/fail2ban/jail.d/airsonic-auth.conf`, `backend/app/services/config_file_service.py`

#### Observed error

```
fail2ban.transmitter ERROR Command ['reload', '--all', [], [['start', 'airsonic-auth'],
  ['start', 'bangui-sim'], ['start', 'blocklist-import']]] has failed.
Received UnknownJailException('airsonic-auth')
```
#### Root cause
When a user activates the `airsonic-auth` jail through BanGUI, `config_file_service.py` writes a local override and then calls `jail_service.reload_all(socket_path, include_jails=['airsonic-auth'])`. The reload stream therefore contains `['start', 'airsonic-auth']`. fail2ban cannot create the jail object because its `logpath` (`/remotelogs/airsonic/airsonic.log`) does not exist in the dev environment — the airsonic service is not running and no log volume is mounted. fail2ban registers a config-level parse failure for the jail and the server falls back to `UnknownJailException` when the reload asks it to start that name.
The current error handling in `config_file_service.py` catches the reload exception and rolls back the config file, but it does so with a generic `except Exception` and logs only a warning. The real error is never surfaced to the user in a way that explains why activation failed.
#### What to implement
1. **Dev-config fix** (Docker/fail2ban-dev-config): The `airsonic-auth.conf` file is a sample jail showing how a remote-log jail is configured. Add a prominent comment block at the top explaining that this jail is intentionally `enabled = false` and that it will fail to start unless the `/remotelogs/airsonic/` log directory is mounted into the fail2ban container. This makes the intent explicit and prevents confusion.
2. **Logpath pre-validation in the activation service** (`config_file_service.py`): Before calling `reload_all` during jail activation, read the `logpath` values from the jail's config (parse the `.conf` and any `.local` override using the existing `conffile_parser`). For each logpath that is not `/dev/null` and does not contain a glob wildcard, check whether the path exists on the filesystem (the backend container shares the fail2ban config volume with read-write access). If any required logpath is missing, abort the activation immediately — do not write the local override, do not call reload — and return a `JailActivationResponse` with `active=False` and a clear `message` explaining which logpath is missing. Add a `validation_warnings` entry listing the missing paths.
3. **Specific exception detection** (`jail_service.py`): In `reload_all`, when `_ok()` raises `ValueError` and the message matches `unknownjail` / `unknown jail` (use the existing `_is_not_found_error` helper), re-raise a `JailNotFoundError` instead of the generic `JailOperationError`. Update callers in `config_file_service.py` to catch `JailNotFoundError` separately and include the jail name in the activation failure message.
#### Acceptance criteria
- Attempting to activate `airsonic-auth` (without mounting the log volume) returns a 422-class response with a message mentioning the missing logpath — no `UnknownJailException` appears in the fail2ban log.
- Activating `bangui-sim` (whose logpath exists in dev) continues to work correctly.
- All existing tests in `backend/tests/` pass without modification.
---
### Task 2 — `iptables-allports` action fails with exit code 4 (`Script error`) on `blocklist-import` jail
**Priority**: High
**Files**: `Docker/fail2ban-dev-config/fail2ban/jail.d/blocklist-import.conf`, `Docker/fail2ban-dev-config/fail2ban/jail.d/bangui-sim.conf`, `Docker/fail2ban-dev-config/fail2ban/jail.conf`
#### Observed error
```
fail2ban.utils ERROR 753c588a7860 -- returned 4
fail2ban.actions ERROR Failed to execute ban jail 'blocklist-import' action
'iptables-allports' … Error starting action
Jail('blocklist-import')/iptables-allports: 'Script error'
```
#### Root cause
Exit code 4 from an iptables invocation means the xtables advisory lock could not be obtained (another iptables call was in progress simultaneously). The `iptables-allports` `actionstart` script does not pass the `-w` (wait-for-lock) flag by default. In the Docker dev environment, fail2ban starts multiple jails in parallel during `reload --all`. Each jail's `actionstart` sub-process calls iptables concurrently. The xtables lock contention causes some of them to exit with code 4. fail2ban interprets any non-zero exit from an action script as `Script error` and aborts the action.
A secondary risk: if the host kernel uses `nf_tables` (iptables-nft) but the container runs `iptables-legacy` binaries, the same exit code 4 can also appear due to backend mismatch.
#### What to implement
1. **Add the xtables lock-wait flag globally** in `Docker/fail2ban-dev-config/fail2ban/jail.conf`, under the `[DEFAULT]` section. Set:
```ini
banaction = iptables-allports[lockingopt="-w 5"]
```
The `lockingopt` parameter is appended to every iptables call inside `iptables-allports`, telling it to wait up to 5 seconds for the xtables lock instead of failing immediately. This is the canonical fix recommended in the fail2ban documentation for containerised deployments.
2. **Remove the redundant `banaction` overrides** from `blocklist-import.conf` and `bangui-sim.conf` (both currently specify `banaction = iptables-allports`, which now comes from the global default). Removing the per-jail override keeps the configuration DRY and ensures the lock-wait flag applies consistently.
3. **Production compose note**: Add a comment in `Docker/compose.prod.yml` near the fail2ban service block stating that the `fail2ban-config` volume must contain a `jail.d/` with `banaction = iptables-allports[lockingopt="-w 5"]` or equivalent, because the production volume is not pre-seeded by the repository. This is a documentation action only — do not auto-seed the production volume.
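The production note from step 3 might look like the following. The service name `fail2ban` and the volume mapping are assumptions about `Docker/compose.prod.yml`, not copied from it:

```yaml
services:
  fail2ban:
    # NOTE: the fail2ban-config volume is NOT pre-seeded by this repository.
    # Its jail.d/ (or jail.local) must set
    #   banaction = iptables-allports[lockingopt="-w 5"]
    # or equivalent, otherwise parallel jail startup can fail with
    # iptables exit code 4 (xtables lock contention).
    volumes:
      - fail2ban-config:/config
```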
#### Acceptance criteria
- After `docker compose -f Docker/compose.debug.yml restart fail2ban`, the fail2ban log shows no `returned 4` or `Script error` lines during startup.
- `docker exec bangui-fail2ban-dev fail2ban-client status blocklist-import` reports the jail as running.
- `docker exec bangui-fail2ban-dev fail2ban-client status bangui-sim` reports the jail as running.
---
### Task 3 — Suppress log noise from unsupported `get <jail> idle/backend` commands
**Priority**: Low
**Files**: `backend/app/services/jail_service.py`
#### Observed error
```
fail2ban.transmitter ERROR Command ['get', 'bangui-sim', 'idle'] has failed.
Received Exception('Invalid command (no get action or not yet implemented)')
fail2ban.transmitter ERROR Command ['get', 'bangui-sim', 'backend'] has failed.
Received Exception('Invalid command (no get action or not yet implemented)')
… (repeated ~15 times per polling cycle)
```
#### Root cause
`_fetch_jail_summary()` in `jail_service.py` sends `["get", name, "backend"]` and `["get", name, "idle"]` to the fail2ban daemon via the socket. The running fail2ban version (LinuxServer.io container image) does not implement these two sub-commands in its transmitter. fail2ban logs every unrecognised command as an `ERROR` on its side. BanGUI already handles this gracefully — `asyncio.gather(return_exceptions=True)` captures the exceptions and `_safe_bool` / `_safe_str` fall back to `False` / `"polling"` respectively. The BanGUI application does not malfunction, but the fail2ban log is flooded with spurious `ERROR` lines on every jail-list refresh (~15 errors per request across all jails, repeated on every dashboard poll).
#### What to implement
The fix is a capability-detection probe executed once at startup (or at most once per polling cycle, re-used for all jails). The probe sends `["get", "<first_jail>", "backend"]` and caches the result in a module-level boolean flag.
1. In `jail_service.py`, add a module-level `asyncio.Lock` and a nullable `bool` flag:
```python
_backend_cmd_supported: bool | None = None
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()
```
2. Add an async helper `_check_backend_cmd_supported(client, jail_name) -> bool` that:
- Returns the cached value immediately if already determined.
- Acquires `_backend_cmd_lock`, checks again (double-check idiom), then sends `["get", jail_name, "backend"]`.
- Sets `_backend_cmd_supported = True` if the command returns without exception, `False` if it raises any `Exception`.
- Returns the flag value.
3. In `_fetch_jail_summary()`, call this helper (using the first available jail name) before the main `asyncio.gather`. If the flag is `False`, replace the `["get", name, "backend"]` and `["get", name, "idle"]` gather entries with trivial coroutines that immediately return the default values (`"polling"` and `False`) without sending any socket command. (Note: the old `asyncio.coroutine` decorator was removed in Python 3.11; plain `async def` helpers do the job.) This eliminates the fail2ban-side log noise entirely without changing the public API or the fallback values.
4. If `_backend_cmd_supported` becomes `True` (future fail2ban version), the commands are sent as before and real values are returned.
#### Acceptance criteria
- After one polling cycle where `backend`/`idle` are detected as unsupported, the fail2ban log shows zero `Invalid command` lines for those commands on subsequent refreshes.
- The `GET /api/jails` response still returns `backend: "polling"` and `idle: false` for each jail (unchanged defaults).
- All existing tests in `backend/tests/` pass.
- The cached result starts as `None` on every application restart, so a fail2ban upgrade that adds support for these commands is detected automatically within one polling cycle.
---