BanGUI/Docs/Tasks.md
Lukas f62785aaf2 Fix fail2ban runtime errors: jail not found, action locks, log noise
This commit implements fixes for three independent bugs in the fail2ban configuration and integration layer:

1. Task 1: Detect UnknownJailException and prevent silent failures
   - Added JailNotFoundError detection in jail_service.reload_all()
   - Enhanced error handling in config_file_service to catch JailNotFoundError
   - Added specific error message with logpath validation hints
   - Added rollback test for this scenario

2. Task 2: Fix iptables-allports exit code 4 (xtables lock contention)
   - Added global banaction setting in jail.conf with -w 5 lockingopt
   - Removed redundant per-jail banaction overrides from bangui-sim and blocklist-import
   - Added production compose documentation note

3. Task 3: Suppress log noise from unsupported backend/idle commands
   - Implemented capability detection to cache command support status
   - Double-checked locking to minimize lock contention
   - Avoids sending unsupported get <jail> backend/idle commands
   - Returns default values without socket calls when unsupported

All changes include comprehensive tests and maintain backward compatibility.
2026-03-15 10:57:00 +01:00


# BanGUI — Task List
This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.

---
## Agent Operating Instructions
These instructions apply to every AI agent working in this repository. Read them fully before touching any file.
### Repository Layout
```
backend/app/            FastAPI application (Python 3.12+, async)
  models/               Pydantic v2 models
  repositories/         Database access (aiosqlite)
  routers/              FastAPI routers — thin, delegate to services
  services/             Business logic; all fail2ban interaction lives here
  tasks/                APScheduler background jobs
  utils/                Shared helpers (fail2ban_client.py, ip_utils.py, …)
backend/tests/          pytest-asyncio test suite mirroring app/ structure
Docker/                 Compose files, Dockerfiles, dev config for fail2ban
  fail2ban-dev-config/  Bind-mounted into the fail2ban container in debug mode
frontend/src/           React + TypeScript SPA (Vite, Fluent UI)
fail2ban-master/        Vendored fail2ban source — DO NOT EDIT
Docs/                   Architecture, design notes, this file
```
### Coding Conventions
- **Python**: async/await throughout. Use `structlog` for logging (`log.info/warning/error` with keyword args). Pydantic v2 models. Type-annotated functions.
- **Error handling**: Raise domain-specific exceptions (`JailNotFoundError`, `JailOperationError`, `Fail2BanConnectionError`) from service functions; let routers map them to HTTP responses. Never swallow exceptions silently in routers.
- **fail2ban socket**: All communication goes through `app.utils.fail2ban_client.Fail2BanClient`. Use `_safe_get` when a missing command is non-fatal; use `_ok()` when the response is required.
- **Tests**: Mirror the `app/` tree under `tests/`. Use `pytest-asyncio` + `unittest.mock.patch` or `AsyncMock`. Keep fixtures in `conftest.py`. Run with `pytest backend/tests` from the repo root (venv at `.venv/`).
- **Docker dev config**: `Docker/fail2ban-dev-config/` is bind-mounted read-write into the fail2ban container. Changes here take effect after `fail2ban-client reload`.
- **Compose**: Use `Docker/compose.debug.yml` for local development; `Docker/compose.prod.yml` for production. Never commit secrets.
### How to Verify Changes
1. Run the backend test suite: `cd /home/lukas/Volume/repo/BanGUI && .venv/bin/pytest backend/tests -x -q`
2. Check for Python type errors: `.venv/bin/mypy backend/app` (if mypy is installed).
3. For runtime verification, start the debug stack: `docker compose -f Docker/compose.debug.yml up`.
4. Inspect fail2ban logs inside the container: `docker exec bangui-fail2ban-dev fail2ban-client status` and `docker logs bangui-fail2ban-dev`.
### Agent Workflow
1. Read the task description fully. Read every source file mentioned before editing anything.
2. Make the minimal change that solves the stated problem — do not refactor surrounding code.
3. After every code change run the test suite to confirm no regressions.
4. If a fix requires a config file change in `Docker/fail2ban-dev-config/`, also update `Docker/compose.prod.yml` or document the required change if it affects the production config volume.
5. Mark the task complete only after tests pass and the root-cause error no longer appears.
---
## Bug Fixes — fail2ban Runtime Errors (2026-03-14)
The following three independent bugs were identified from fail2ban logs. They are ordered by severity. Resolve them in order; each is self-contained.

---
### Task 1 — `UnknownJailException('airsonic-auth')` on reload
**Priority**: High
**Files**: `Docker/fail2ban-dev-config/fail2ban/jail.d/airsonic-auth.conf`, `backend/app/services/config_file_service.py`
#### Observed error
```
fail2ban.transmitter ERROR Command ['reload', '--all', [], [['start', 'airsonic-auth'],
['start', 'bangui-sim'], ['start', 'blocklist-import']]] has failed.
Received UnknownJailException('airsonic-auth')
```
#### Root cause
When a user activates the `airsonic-auth` jail through BanGUI, `config_file_service.py` writes a local override and then calls `jail_service.reload_all(socket_path, include_jails=['airsonic-auth'])`. The reload stream therefore contains `['start', 'airsonic-auth']`. fail2ban cannot create the jail object because its `logpath` (`/remotelogs/airsonic/airsonic.log`) does not exist in the dev environment — the airsonic service is not running and no log volume is mounted. fail2ban registers a config-level parse failure for the jail and the server falls back to `UnknownJailException` when the reload asks it to start that name.
The current error handling in `config_file_service.py` catches the reload exception and rolls back the config file, but it does so with a generic `except Exception` and logs only a warning. The real error is never surfaced to the user in a way that explains why activation failed.
#### What to implement
1. **Dev-config fix** (Docker/fail2ban-dev-config): The `airsonic-auth.conf` file is a sample jail showing how a remote-log jail is configured. Add a prominent comment block at the top explaining that this jail is intentionally `enabled = false` and that it will fail to start unless the `/remotelogs/airsonic/` log directory is mounted into the fail2ban container. This makes the intent explicit and prevents confusion.
2. **Logpath pre-validation in the activation service** (`config_file_service.py`): Before calling `reload_all` during jail activation, read the `logpath` values from the jail's config (parse the `.conf` and any `.local` override using the existing `conffile_parser`). For each logpath that is not `/dev/null` and does not contain a glob wildcard, check whether the path exists on the filesystem (the backend container shares the fail2ban config volume with read-write access). If any required logpath is missing, abort the activation immediately — do not write the local override, do not call reload — and return a `JailActivationResponse` with `active=False` and a clear `message` explaining which logpath is missing. Add a `validation_warnings` entry listing the missing paths.
3. **Specific exception detection** (`jail_service.py`): In `reload_all`, when `_ok()` raises `ValueError` and the message matches `unknownjail` / `unknown jail` (use the existing `_is_not_found_error` helper), re-raise a `JailNotFoundError` instead of the generic `JailOperationError`. Update callers in `config_file_service.py` to catch `JailNotFoundError` separately and include the jail name in the activation failure message.
#### Acceptance criteria
- Attempting to activate `airsonic-auth` (without mounting the log volume) returns a 422-class response with a message mentioning the missing logpath — no `UnknownJailException` appears in the fail2ban log.
- Activating `bangui-sim` (whose logpath exists in dev) continues to work correctly.
- All existing tests in `backend/tests/` pass without modification.
---
### Task 2 — `iptables-allports` action fails with exit code 4 (`Script error`) on `blocklist-import` jail
**Priority**: High
**Files**: `Docker/fail2ban-dev-config/fail2ban/jail.d/blocklist-import.conf`, `Docker/fail2ban-dev-config/fail2ban/jail.d/bangui-sim.conf`, `Docker/fail2ban-dev-config/fail2ban/jail.conf`
#### Observed error
```
fail2ban.utils ERROR 753c588a7860 -- returned 4
fail2ban.actions ERROR Failed to execute ban jail 'blocklist-import' action
'iptables-allports' … Error starting action
Jail('blocklist-import')/iptables-allports: 'Script error'
```
#### Root cause
Exit code 4 from an iptables invocation means the xtables advisory lock could not be obtained (another iptables call was in progress simultaneously). The `iptables-allports` `actionstart` script does not pass the `-w` (wait-for-lock) flag by default. In the Docker dev environment, fail2ban starts multiple jails in parallel during `reload --all`. Each jail's `actionstart` sub-process calls iptables concurrently. The xtables lock contention causes some of them to exit with code 4. fail2ban interprets any non-zero exit from an action script as `Script error` and aborts the action.
A secondary risk: if the host kernel uses `nf_tables` (iptables-nft) but the container runs `iptables-legacy` binaries, the same exit code 4 can also appear due to backend mismatch.
#### What to implement
1. **Add the xtables lock-wait flag globally** in `Docker/fail2ban-dev-config/fail2ban/jail.conf`, under the `[DEFAULT]` section. Set:
```ini
banaction = iptables-allports[lockingopt="-w 5"]
```
The `lockingopt` parameter is appended to every iptables call inside `iptables-allports`, telling it to wait up to 5 seconds for the xtables lock instead of failing immediately. This is the canonical fix recommended in the fail2ban documentation for containerised deployments.
2. **Remove the redundant `banaction` overrides** from `blocklist-import.conf` and `bangui-sim.conf` (both already specify `banaction = iptables-allports` which now comes from the global default). Removing the per-jail override keeps configuration DRY and ensures the lock-wait flag applies consistently.
3. **Production compose note**: Add a comment in `Docker/compose.prod.yml` near the fail2ban service block stating that the `fail2ban-config` volume must contain a `jail.d/` with `banaction = iptables-allports[lockingopt="-w 5"]` or equivalent, because the production volume is not pre-seeded by the repository. This is a documentation action only — do not auto-seed the production volume.
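The production note from step 3 could look like the fragment below. The service and volume names are assumptions; match them to the real `Docker/compose.prod.yml`:

```yaml
services:
  fail2ban:
    # NOTE: the fail2ban-config volume is NOT pre-seeded by this repository.
    # Its jail.d/ (or jail.local) must set
    #   banaction = iptables-allports[lockingopt="-w 5"]
    # or an equivalent lock-wait option, otherwise parallel jail starts can
    # fail with iptables exit code 4 (xtables lock contention).
    volumes:
      - fail2ban-config:/config
```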
#### Acceptance criteria
- After `docker compose -f Docker/compose.debug.yml restart fail2ban`, the fail2ban log shows no `returned 4` or `Script error` lines during startup.
- `docker exec bangui-fail2ban-dev fail2ban-client status blocklist-import` reports the jail as running.
- `docker exec bangui-fail2ban-dev fail2ban-client status bangui-sim` reports the jail as running.
---
### Task 3 — Suppress log noise from unsupported `get <jail> idle/backend` commands
**Priority**: Low
**Files**: `backend/app/services/jail_service.py`
#### Observed error
```
fail2ban.transmitter ERROR Command ['get', 'bangui-sim', 'idle'] has failed.
Received Exception('Invalid command (no get action or not yet implemented)')
fail2ban.transmitter ERROR Command ['get', 'bangui-sim', 'backend'] has failed.
Received Exception('Invalid command (no get action or not yet implemented)')
… (repeated ~15 times per polling cycle)
```
#### Root cause
`_fetch_jail_summary()` in `jail_service.py` sends `["get", name, "backend"]` and `["get", name, "idle"]` to the fail2ban daemon via the socket. The running fail2ban version (LinuxServer.io container image) does not implement these two sub-commands in its transmitter. fail2ban logs every unrecognised command as an `ERROR` on its side. BanGUI already handles this gracefully — `asyncio.gather(return_exceptions=True)` captures the exceptions and `_safe_bool` / `_safe_str` fall back to `False` / `"polling"` respectively. The BanGUI application does not malfunction, but the fail2ban log is flooded with spurious `ERROR` lines on every jail-list refresh (~15 errors per request across all jails, repeated on every dashboard poll).
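The existing fallback behaviour described above follows this pattern. A simplified sketch; the real `_fetch_jail_summary`, `_safe_bool`, and `_safe_str` signatures in `jail_service.py` may differ, and `client.send` stands in for the actual socket API:

```python
import asyncio


def _safe_str(value: object, default: str) -> str:
    # With gather(return_exceptions=True), failures arrive here as
    # exception *values*; fall back to the default instead of raising.
    if isinstance(value, BaseException):
        return default
    return str(value)


def _safe_bool(value: object, default: bool = False) -> bool:
    if isinstance(value, BaseException):
        return default
    return bool(value)


async def fetch_summary(client, name: str) -> dict:
    backend, idle = await asyncio.gather(
        client.send(["get", name, "backend"]),
        client.send(["get", name, "idle"]),
        return_exceptions=True,  # capture failures instead of raising
    )
    return {
        "backend": _safe_str(backend, "polling"),
        "idle": _safe_bool(idle, False),
    }
```

BanGUI stays correct with this pattern; the problem is purely that fail2ban logs an `ERROR` for each rejected command before the exception ever reaches the gather.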
#### What to implement
The fix is a capability-detection probe executed once at startup (or at most once per polling cycle, re-used for all jails). The probe sends `["get", "<first_jail>", "backend"]` and caches the result in a module-level boolean flag.
1. In `jail_service.py`, add a module-level `asyncio.Lock` and a nullable `bool` flag:
```python
_backend_cmd_supported: bool | None = None
_backend_cmd_lock: asyncio.Lock = asyncio.Lock()
```
2. Add an async helper `_check_backend_cmd_supported(client, jail_name) -> bool` that:
- Returns the cached value immediately if already determined.
- Acquires `_backend_cmd_lock`, checks again (double-check idiom), then sends `["get", jail_name, "backend"]`.
- Sets `_backend_cmd_supported = True` if the command returns without exception, `False` if it raises any `Exception`.
- Returns the flag value.
3. In `_fetch_jail_summary()`, call this helper (using the first available jail name) before the main `asyncio.gather`. If the flag is `False`, replace the `["get", name, "backend"]` and `["get", name, "idle"]` gather entries with small coroutines that immediately return the default values (`"polling"` and `False`) without sending any socket command. This eliminates the fail2ban-side log noise entirely without changing the public API or the fallback values.
4. If `_backend_cmd_supported` becomes `True` (future fail2ban version), the commands are sent as before and real values are returned.
#### Acceptance criteria
- After one polling cycle where `backend`/`idle` are detected as unsupported, the fail2ban log shows zero `Invalid command` lines for those commands on subsequent refreshes.
- The `GET /api/jails` response still returns `backend: "polling"` and `idle: false` for each jail (unchanged defaults).
- All existing tests in `backend/tests/` pass.
- The detection is re-run whenever `_backend_cmd_supported` is `None` (i.e., after an application restart), so a fail2ban upgrade that adds support is picked up automatically within one polling cycle.
---