
BanGUI — Task List

This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.


Agent Operating Instructions

These instructions apply to every AI agent working in this repository. Read them fully before touching any file.

Repository Layout

backend/app/          FastAPI application (Python 3.12+, async)
  models/             Pydantic v2 models
  repositories/       Database access (aiosqlite)
  routers/            FastAPI routers — thin, delegate to services
  services/           Business logic; all fail2ban interaction lives here
  tasks/              APScheduler background jobs
  utils/              Shared helpers (fail2ban_client.py, ip_utils.py, …)
backend/tests/        pytest-asyncio test suite mirroring app/ structure
Docker/               Compose files, Dockerfiles, dev config for fail2ban
  fail2ban-dev-config/  Bind-mounted into the fail2ban container in debug mode
frontend/src/         React + TypeScript SPA (Vite, Fluent UI)
fail2ban-master/      Vendored fail2ban source — DO NOT EDIT
Docs/                 Architecture, design notes, this file

Coding Conventions

  • Python: async/await throughout. Use structlog for logging (log.info/warning/error with keyword args). Pydantic v2 models. Type-annotated functions.
  • Error handling: Raise domain-specific exceptions (JailNotFoundError, JailOperationError, Fail2BanConnectionError) from service functions; let routers map them to HTTP responses. Never swallow exceptions silently in routers.
  • fail2ban socket: All communication goes through app.utils.fail2ban_client.Fail2BanClient. Use _safe_get when a missing command is non-fatal; use _ok() when the response is required.
  • Tests: Mirror the app/ tree under tests/. Use pytest-asyncio + unittest.mock.patch or AsyncMock. Keep fixtures in conftest.py. Run with pytest backend/tests from the repo root (venv at .venv/).
  • Docker dev config: Docker/fail2ban-dev-config/ is bind-mounted read-write into the fail2ban container. Changes here take effect after fail2ban-client reload.
  • Compose: Use Docker/compose.debug.yml for local development; Docker/compose.prod.yml for production. Never commit secrets.

How to Verify Changes

  1. Run the backend test suite: cd /home/lukas/Volume/repo/BanGUI && .venv/bin/pytest backend/tests -x -q
  2. Check for Python type errors: .venv/bin/mypy backend/app (if mypy is installed).
  3. For runtime verification, start the debug stack: docker compose -f Docker/compose.debug.yml up.
  4. Inspect fail2ban logs inside the container: docker exec bangui-fail2ban-dev fail2ban-client status and docker logs bangui-fail2ban-dev.

Agent Workflow

  1. Read the task description fully. Read every source file mentioned before editing anything.
  2. Make the minimal change that solves the stated problem — do not refactor surrounding code.
  3. After every code change run the test suite to confirm no regressions.
  4. If a fix requires a config file change in Docker/fail2ban-dev-config/, also update Docker/compose.prod.yml or document the required change if it affects the production config volume.
  5. Mark the task complete only after tests pass and the root-cause error no longer appears.

Bug Fixes — fail2ban Runtime Errors (2026-03-14)

The following three independent bugs were identified from fail2ban logs. They are ordered by severity. Resolve them in order; each is self-contained.


Task 1 — UnknownJailException('airsonic-auth') on reload

Priority: High
Files: Docker/fail2ban-dev-config/fail2ban/jail.d/airsonic-auth.conf, backend/app/services/config_file_service.py

Observed error

fail2ban.transmitter ERROR  Command ['reload', '--all', [], [['start', 'airsonic-auth'],
  ['start', 'bangui-sim'], ['start', 'blocklist-import']]] has failed.
  Received UnknownJailException('airsonic-auth')

Root cause

When a user activates the airsonic-auth jail through BanGUI, config_file_service.py writes a local override and then calls jail_service.reload_all(socket_path, include_jails=['airsonic-auth']). The reload stream therefore contains ['start', 'airsonic-auth']. fail2ban cannot create the jail object because its logpath (/remotelogs/airsonic/airsonic.log) does not exist in the dev environment — the airsonic service is not running and no log volume is mounted. fail2ban registers a config-level parse failure for the jail and the server falls back to UnknownJailException when the reload asks it to start that name.

The current error handling in config_file_service.py catches the reload exception and rolls back the config file, but it does so with a generic except Exception and logs only a warning. The real error is never surfaced to the user in a way that explains why activation failed.

What to implement

  1. Dev-config fix (Docker/fail2ban-dev-config): The airsonic-auth.conf file is a sample jail showing how a remote-log jail is configured. Add a prominent comment block at the top explaining that this jail is intentionally enabled = false and that it will fail to start unless the /remotelogs/airsonic/ log directory is mounted into the fail2ban container. This makes the intent explicit and prevents confusion.

  2. Logpath pre-validation in the activation service (config_file_service.py): Before calling reload_all during jail activation, read the logpath values from the jail's config (parse the .conf and any .local override using the existing conffile_parser). For each logpath that is not /dev/null and does not contain a glob wildcard, check whether the path exists on the filesystem (the backend container shares the fail2ban config volume with read-write access). If any required logpath is missing, abort the activation immediately — do not write the local override, do not call reload — and return a JailActivationResponse with active=False and a clear message explaining which logpath is missing. Add a validation_warnings entry listing the missing paths.

  3. Specific exception detection (jail_service.py): In reload_all, when _ok() raises ValueError and the message matches unknownjail / unknown jail (use the existing _is_not_found_error helper), re-raise a JailNotFoundError instead of the generic JailOperationError. Update callers in config_file_service.py to catch JailNotFoundError separately and include the jail name in the activation failure message.
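A minimal sketch of the pre-validation and exception-mapping logic from items 2 and 3, assuming the shapes described above. JailNotFoundError is defined inline only to keep the sketch self-contained (in the codebase it lives with the other domain exceptions), and the logpath list is passed in directly here rather than coming from the existing conffile_parser:

```python
import os

class JailNotFoundError(Exception):
    """Stand-in for the domain exception raised when fail2ban reports an unknown jail."""

def missing_logpaths(logpaths: list[str]) -> list[str]:
    """Return required logpaths that do not exist; skip /dev/null and glob patterns."""
    missing: list[str] = []
    for path in logpaths:
        if path == "/dev/null" or any(ch in path for ch in "*?["):
            continue  # glob patterns may legitimately match nothing yet
        if not os.path.exists(path):
            missing.append(path)
    return missing

def map_reload_error(exc: ValueError, jail: str) -> Exception:
    """Turn fail2ban's 'unknown jail' reload failure into JailNotFoundError."""
    text = str(exc).lower()
    if "unknownjail" in text or "unknown jail" in text:
        return JailNotFoundError(f"jail '{jail}' is unknown to fail2ban")
    return exc
```

With helpers like these, the activation path can abort before writing the .local override whenever missing_logpaths(...) is non-empty, and callers can catch JailNotFoundError separately after a failed reload.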

Acceptance criteria

  • Attempting to activate airsonic-auth (without mounting the log volume) returns a 422-class response with a message mentioning the missing logpath — no UnknownJailException appears in the fail2ban log.
  • Activating bangui-sim (whose logpath exists in dev) continues to work correctly.
  • All existing tests in backend/tests/ pass without modification.

Task 2 — iptables-allports action fails with exit code 4 (Script error) on blocklist-import jail

Priority: High
Files: Docker/fail2ban-dev-config/fail2ban/jail.d/blocklist-import.conf, Docker/fail2ban-dev-config/fail2ban/jail.d/bangui-sim.conf, Docker/fail2ban-dev-config/fail2ban/jail.conf

Observed error

fail2ban.utils  ERROR  753c588a7860 -- returned 4
fail2ban.actions ERROR  Failed to execute ban jail 'blocklist-import' action
  'iptables-allports' … Error starting action
  Jail('blocklist-import')/iptables-allports: 'Script error'

Root cause

Exit code 4 from an iptables invocation signals a resource problem: here, the xtables advisory lock could not be obtained because another iptables call was in progress. The iptables-allports actionstart script does not pass the -w (wait-for-lock) flag by default. In the Docker dev environment, fail2ban starts multiple jails in parallel during reload --all, so each jail's actionstart sub-process calls iptables concurrently; the lock contention causes some of them to exit with code 4. fail2ban interprets any non-zero exit from an action script as a Script error and aborts the action.

A secondary risk: if the host kernel uses nf_tables (iptables-nft) but the container runs iptables-legacy binaries, the same exit code 4 can also appear due to backend mismatch.

What to implement

  1. Add the xtables lock-wait flag globally in Docker/fail2ban-dev-config/fail2ban/jail.conf, under the [DEFAULT] section. Set:

    banaction = iptables-allports[lockingopt="-w 5"]
    

    The lockingopt parameter is appended to every iptables call inside iptables-allports, telling it to wait up to 5 seconds for the xtables lock instead of failing immediately. This is the standard fix for xtables lock contention in containerised deployments.

  2. Remove the redundant banaction overrides from blocklist-import.conf and bangui-sim.conf. Both currently specify a bare banaction = iptables-allports, which would shadow the new [DEFAULT] value and drop the lock-wait flag. Removing the per-jail overrides keeps the configuration DRY and ensures the flag applies consistently.

  3. Production compose note: Add a comment in Docker/compose.prod.yml near the fail2ban service block stating that the fail2ban-config volume must contain a jail.d/ with banaction = iptables-allports[lockingopt="-w 5"] or equivalent, because the production volume is not pre-seeded by the repository. This is a documentation action only — do not auto-seed the production volume.
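For reference, the resulting dev jail.conf fragment might look like the sketch below. Only the banaction line is mandated by this task; the surrounding comments are illustrative:

```ini
# Docker/fail2ban-dev-config/fail2ban/jail.conf — relevant excerpt (sketch)
[DEFAULT]
# Wait up to 5 seconds for the xtables lock instead of failing with exit code 4.
banaction = iptables-allports[lockingopt="-w 5"]

# Per-jail files in jail.d/ no longer set banaction; they inherit this default.
```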

Acceptance criteria

  • After docker compose -f Docker/compose.debug.yml restart fail2ban, the fail2ban log shows no returned 4 or Script error lines during startup.
  • docker exec bangui-fail2ban-dev fail2ban-client status blocklist-import reports the jail as running.
  • docker exec bangui-fail2ban-dev fail2ban-client status bangui-sim reports the jail as running.

Task 3 — Suppress log noise from unsupported get <jail> idle/backend commands

Priority: Low
Files: backend/app/services/jail_service.py

Observed error

fail2ban.transmitter ERROR  Command ['get', 'bangui-sim', 'idle'] has failed.
  Received Exception('Invalid command (no get action or not yet implemented)')
fail2ban.transmitter ERROR  Command ['get', 'bangui-sim', 'backend'] has failed.
  Received Exception('Invalid command (no get action or not yet implemented)')
… (repeated ~15 times per polling cycle)

Root cause

_fetch_jail_summary() in jail_service.py sends ["get", name, "backend"] and ["get", name, "idle"] to the fail2ban daemon via the socket. The running fail2ban version (LinuxServer.io container image) does not implement these two sub-commands in its transmitter. fail2ban logs every unrecognised command as an ERROR on its side. BanGUI already handles this gracefully — asyncio.gather(return_exceptions=True) captures the exceptions and _safe_bool / _safe_str fall back to False / "polling" respectively. The BanGUI application does not malfunction, but the fail2ban log is flooded with spurious ERROR lines on every jail-list refresh (~15 errors per request across all jails, repeated on every dashboard poll).
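The existing graceful-fallback pattern described above can be sketched roughly as follows. The real _fetch_jail_summary gathers more fields, and the _safe_str/_safe_bool shapes are assumed from the text; client here is any object with an async send:

```python
import asyncio

def _safe_str(value, default: str) -> str:
    """Coerce a gather result to str, falling back when it is an exception."""
    return default if isinstance(value, BaseException) else str(value)

def _safe_bool(value, default: bool = False) -> bool:
    """Coerce a gather result to bool, falling back when it is an exception."""
    return default if isinstance(value, BaseException) else bool(value)

async def fetch_summary(client, name: str) -> dict:
    """Fetch backend/idle for one jail; unsupported commands yield defaults."""
    backend, idle = await asyncio.gather(
        client.send(["get", name, "backend"]),
        client.send(["get", name, "idle"]),
        return_exceptions=True,  # exceptions come back as values, not raised
    )
    return {"backend": _safe_str(backend, "polling"),
            "idle": _safe_bool(idle, False)}
```

Note that this keeps BanGUI healthy but still sends both commands over the socket, which is exactly what floods the fail2ban log.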

What to implement

The fix is a capability-detection probe executed lazily on the first poll and cached for the lifetime of the process, re-used for all jails. The probe sends ["get", "<first_jail>", "backend"] and caches the result in a module-level boolean flag.

  1. In jail_service.py, add a module-level asyncio.Lock and a nullable bool flag:

    _backend_cmd_supported: bool | None = None
    _backend_cmd_lock: asyncio.Lock = asyncio.Lock()
    
  2. Add an async helper _check_backend_cmd_supported(client, jail_name) -> bool that:

    • Returns the cached value immediately if already determined.
    • Acquires _backend_cmd_lock, checks again (double-check idiom), then sends ["get", jail_name, "backend"].
    • Sets _backend_cmd_supported = True if the command returns without exception, False if it raises any Exception.
    • Returns the flag value.
  3. In _fetch_jail_summary(), call this helper (using the first available jail name) before the main asyncio.gather. If the flag is False, replace the ["get", name, "backend"] and ["get", name, "idle"] gather entries with trivial coroutines that immediately return the default values ("polling" and False) without sending any socket command. (asyncio.coroutine was removed in Python 3.11, so plain async functions must be used.) This eliminates the fail2ban-side log noise entirely without changing the public API or the fallback values.

  4. If _backend_cmd_supported becomes True (future fail2ban version), the commands are sent as before and real values are returned.

Acceptance criteria

  • After one polling cycle where backend/idle are detected as unsupported, the fail2ban log shows zero Invalid command lines for those commands on subsequent refreshes.
  • The GET /api/jails response still returns backend: "polling" and idle: false for each jail (unchanged defaults).
  • All existing tests in backend/tests/ pass.
  • The cached result is process-local and starts as None on every application restart, so a fail2ban upgrade that adds support is detected automatically within one polling cycle after the backend restarts.