Implementation status:
- Stages 1.1–1.3: reload_all include_jails/exclude_jails parameters implemented; keyword-argument assertions added in router and service tests.
- Stages 2.1/6.1: _send_command_sync retry loop (3 attempts, 150 ms exponential back-off) retrying on EAGAIN/ECONNREFUSED/ENOBUFS; all other errors raise immediately.
- Stage 2.2: module-level asyncio.Lock in jail_service.reload_all serializes concurrent reload --all commands.
- Stage 3.1: activate_jail re-queries _get_active_jail_names after reload; returns active=False with a descriptive message if the jail did not start.
- Stages 4.1/6.2: lazily initialized asyncio.Semaphore (max 10) in Fail2BanClient.send; logs fail2ban_command_waiting_semaphore at debug level when waiting.
- Stages 5.1/5.2: unit tests asserting reload_all is called with include_jails and exclude_jails; activation-verification happy/sad path tests.
- Stage 6.3: TestSendCommandSyncRetry (5 cases) plus a TestFail2BanClientSemaphore concurrency test.
- Stages 7.1–7.3: _since_unix uses time.time(); bans_by_jail debug logging with since_iso; diagnostic warning when total == 0 despite table rows; unit test verifying the warning fires for stale data.
BanGUI — Task List
This document breaks the entire BanGUI project into development stages, ordered so that each stage builds on the previous one. Every task is described in prose with enough detail for a developer to begin work. References point to the relevant documentation.
Stage 1 — Bug Fix: Jail Activation / Deactivation Reload Stream
1.1 Fix reload_all to include newly activated jails in the start stream ✅ DONE
Problem:
When a user activates an inactive jail (e.g. apache-auth), the backend writes enabled = true to jail.d/apache-auth.local and calls jail_service.reload_all(). However, reload_all queries the currently running jails via ["status"] to build the start stream. Since the new jail is not yet running, it is excluded from the stream. After reload --all, fail2ban's end-of-reload phase deletes every jail not in the stream — so the newly activated jail never starts.
The inverse bug exists for deactivation: the jail is still running when ["status"] is queried, so it remains in the stream and may be restarted despite enabled = false being written to the config.
Fix:
Add keyword-only include_jails and exclude_jails parameters to jail_service.reload_all(). Callers merge these into the stream derived from the current status. activate_jail passes include_jails=[name]; deactivate_jail passes exclude_jails=[name]. All existing callers are unaffected (both params default to None).
Files:
backend/app/services/jail_service.py — reload_all()
backend/app/services/config_file_service.py — activate_jail(), deactivate_jail()
Acceptance criteria:
- Activating an inactive jail via the API actually starts it in fail2ban.
- Deactivating a running jail via the API actually stops it after reload.
- All other callers of reload_all() (config save, filter/action updates) continue to work without changes.
1.2 Add unit tests for reload_all with include_jails / exclude_jails ✅ DONE
Write tests that verify the new parameters produce the correct fail2ban command stream.
Test cases:
- reload_all(sock, include_jails=["apache-auth"]) when currently running jails are ["sshd", "nginx"] → the stream sent to fail2ban must contain ["start", "apache-auth"], ["start", "nginx"], and ["start", "sshd"].
- reload_all(sock, exclude_jails=["sshd"]) when currently running jails are ["sshd", "nginx"] → the stream must contain only ["start", "nginx"], not ["start", "sshd"].
- reload_all(sock, include_jails=["new"], exclude_jails=["old"]) when running jails are ["old", "nginx"] → the stream must contain ["start", "new"] and ["start", "nginx"], not ["start", "old"].
- reload_all(sock) without extra args continues to work exactly as before (backwards compatibility).
Files:
backend/tests/test_services/test_jail_service.py
1.3 Add integration-level tests for activate / deactivate endpoints ✅ DONE
Verify that the POST /api/config/jails/{name}/activate and POST /api/config/jails/{name}/deactivate endpoints pass the correct include_jails / exclude_jails arguments through to reload_all. These tests mock jail_service.reload_all and assert on the keyword arguments it receives.
Files:
backend/tests/test_routers/test_config.py (or a new test_config_activate.py)
Stage 2 — Socket Connection Resilience
2.1 Add retry logic to Fail2BanClient.send for transient connection errors ✅ DONE
Problem:
The logs show intermittent fail2ban_connection_error events during parallel command bursts (e.g. when fetching jail details after a reload). The fail2ban Unix socket can momentarily refuse connections while processing a reload.
Task:
Add a configurable retry mechanism (default 2 retries, 100 ms backoff) to Fail2BanClient.send() that catches ConnectionRefusedError / FileNotFoundError and retries before raising Fail2BanConnectionError. This must not retry on protocol-level errors (e.g. unknown jail) — only on connection failures.
Files:
backend/app/utils/fail2ban_client.py
Acceptance criteria:
- Transient socket errors during reload bursts are retried transparently.
- Non-connection errors (e.g. unknown jail) are raised immediately without retry.
- A structured log message is emitted for each retry attempt.
- Unit tests cover retry success, retry exhaustion, and non-retryable errors.
2.2 Serialize concurrent reload_all calls ✅ DONE
Problem:
Multiple browser tabs or fast UI clicks could trigger concurrent reload_all calls. Sending overlapping reload --all commands to the fail2ban socket is undefined behavior and may cause jail loss.
Task:
Add an asyncio lock inside reload_all (module-level asyncio.Lock) so that concurrent calls are serialized. If a reload is already in progress, subsequent calls wait rather than firing in parallel.
Files:
backend/app/services/jail_service.py
Acceptance criteria:
- Two concurrent reload_all calls are serialized; the second waits for the first to finish.
- Unit test demonstrates that the lock prevents overlapping socket commands.
Stage 3 — Activate / Deactivate UX Improvements
3.1 Return the jail's runtime status after activation ✅ DONE
Problem:
After activating a jail, the API returns active: True optimistically before verifying that fail2ban actually started the jail. If the reload silently fails (e.g. bad regex in the jail config), the frontend shows the jail as active but it is not.
Task:
After calling reload_all, query ["status"] and verify the activated jail appears in the running jail list. If it does not, return active: False with a warning message explaining the jail config may be invalid. Log a warning event.
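The verification step can be sketched as a pure helper (verify_activation and the message wording are illustrative; the real logic lives inside activate_jail):

```python
import logging

logger = logging.getLogger("bangui")

def verify_activation(name: str, running_jails: list[str]) -> dict:
    """Check that a just-activated jail actually appears in the running-jail
    list re-queried from fail2ban after reload_all."""
    if name not in running_jails:
        logger.warning("jail %s did not start after reload", name)
        return {
            "active": False,
            "message": f"Jail '{name}' did not start after reload; "
                       "check its filter, log path, and regex configuration.",
        }
    return {"active": True, "message": f"Jail '{name}' is running."}
```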
Files:
backend/app/services/config_file_service.py — activate_jail()
Acceptance criteria:
- Successful activation returns
active: Trueonly after verification. - If the jail doesn't start (e.g. bad config), the response has
active: Falseand a descriptive message. - A structured log event is emitted on verification failure.
3.2 Frontend feedback for activation failure
Task:
If the activation endpoint returns active: False, the ConfigPage jail detail pane should show a warning toast/banner explaining that the jail could not be started and the user should check the jail configuration (filters, log paths, regex etc.).
Files:
frontend/src/hooks/useConfigActiveStatus.ts (or relevant hook)
frontend/src/components/config/ (jail detail component)
Stage 4 — Parallel Command Throttling
4.1 Limit concurrent fail2ban socket commands ✅ DONE
Problem:
When loading jail details for multiple active jails, the backend fires dozens of get commands in parallel (bantime, findtime, maxretry, failregex, etc. × N jails). The fail2ban socket is single-threaded and some commands time out or fail with connection errors under this load.
Task:
Introduce an asyncio Semaphore (configurable, default 10) that limits the number of in-flight fail2ban commands. All code paths that use Fail2BanClient.send() should acquire the semaphore first. This can be implemented as a connection-pool wrapper or a middleware in the client.
Files:
backend/app/utils/fail2ban_client.py
Acceptance criteria:
- No more than N commands are sent to the socket concurrently.
- Connection errors during jail detail fetches are eliminated under normal load.
- A structured log event is emitted when a command waits for the semaphore.
Stage 5 — Test Coverage Hardening
5.1 Add tests for activate_jail and deactivate_jail service functions ✅ DONE
Task:
Write comprehensive unit tests for config_file_service.activate_jail and config_file_service.deactivate_jail, covering:
- Happy path: jail exists, is inactive, local file is written, reload includes it, response is correct.
- Jail not found in config → JailNotFoundInConfigError.
- Jail already active → JailAlreadyActiveError.
- Jail already inactive → JailAlreadyInactiveError.
- Reload fails → activation still returns but with a logged warning.
- Override parameters (bantime, findtime, etc.) are written to the .local file correctly.
Files:
backend/tests/test_services/test_config_file_service.py
5.2 Add tests for deactivate path with exclude_jails ✅ DONE
Task:
Verify that deactivate_jail passes exclude_jails=[name] to reload_all, ensuring the jail is removed from the start stream. Mock jail_service.reload_all and assert the keyword arguments.
Files:
backend/tests/test_services/test_config_file_service.py
Stage 6 — Bug Fix: 502 "Resource temporarily unavailable" on fail2ban Socket
6.1 Add retry with back-off to _send_command_sync for transient OSError ✅ DONE
Problem:
Under concurrent load the fail2ban Unix socket returns [Errno 11] Resource temporarily unavailable (EAGAIN). The _send_command_sync function in fail2ban_client.py catches this as a generic OSError and immediately raises Fail2BanConnectionError, which the routers translate into a 502 response. There is no retry.
Task:
Wrap the sock.connect() / sock.sendall() / sock.recv() block inside a retry loop (max 3 attempts, exponential back-off starting at 150 ms). Only retry on OSError with errno in {errno.EAGAIN, errno.ECONNREFUSED, errno.ENOBUFS} — all other OSError variants and all Fail2BanProtocolError cases must be raised immediately.
Emit a structured log event (fail2ban_socket_retry) on each retry attempt containing the attempt number, the errno, and the socket path. After the final retry is exhausted, raise Fail2BanConnectionError as today.
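The retry loop might look like the following sketch, where do_send stands in for the connect/sendall/recv block (the wrapper name and shape are assumptions, not the actual code):

```python
import errno
import time

RETRYABLE_ERRNOS = {errno.EAGAIN, errno.ECONNREFUSED, errno.ENOBUFS}

def send_with_retry(do_send, attempts: int = 3, base_delay: float = 0.15):
    """Run do_send(), retrying transient socket errors with exponential
    back-off (150 ms, then 300 ms). Anything else propagates immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return do_send()
        except OSError as exc:
            if exc.errno not in RETRYABLE_ERRNOS or attempt == attempts:
                raise  # non-retryable errno, or retries exhausted
            # the real code would emit a structured fail2ban_socket_retry
            # event here with the attempt number, errno, and socket path
            time.sleep(base_delay * 2 ** (attempt - 1))
```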
Files:
backend/app/utils/fail2ban_client.py — _send_command_sync()
Acceptance criteria:
- A transient EAGAIN on the first attempt is silently retried and succeeds on the second attempt without surfacing a 502.
- Non-retryable socket errors (e.g. ENOENT — socket file missing) are raised immediately on the first attempt.
- A Fail2BanProtocolError (unpickle failure) is never retried.
- After 3 consecutive EAGAIN failures, Fail2BanConnectionError is raised as before.
- Each retry is logged with structlog.
6.2 Add a concurrency semaphore to Fail2BanClient.send ✅ DONE
Problem:
Dashboard page load fires many parallel get commands (jail details, ban stats, trend data). The fail2ban socket is single-threaded; flooding it causes the EAGAIN errors from 6.1.
Task:
Introduce an asyncio.Semaphore (configurable, default 10) at the module level in fail2ban_client.py. Acquire the semaphore in Fail2BanClient.send() before dispatching _send_command_sync to the thread-pool executor. This caps the number of in-flight socket commands and prevents the socket backlog from overflowing.
Files:
backend/app/utils/fail2ban_client.py
Acceptance criteria:
- No more than 10 commands are sent to the socket concurrently.
- Under normal load, the 502 errors are eliminated.
- A structured log event is emitted when a command has to wait for the semaphore (debug level).
6.3 Unit tests for socket retry and semaphore ✅ DONE
Task:
Write tests that verify:
- A single transient OSError(errno.EAGAIN) is retried and the command succeeds.
- Three consecutive EAGAIN failures raise Fail2BanConnectionError.
- An OSError(errno.ENOENT) (socket missing) is raised immediately without retry.
- The semaphore limits concurrency — launch 20 parallel send() calls against a mock that records timestamps and assert no more than 10 overlap.
Files:
backend/tests/test_utils/test_fail2ban_client.py
Stage 7 — Bug Fix: Empty Bans-by-Jail Response
7.1 Investigate and fix the empty bans_by_jail query ✅ DONE
Problem:
GET /api/dashboard/bans/by-jail?range=30d returns {"jails":[],"total":0} even though ban data exists in the fail2ban database. The query in ban_service.bans_by_jail() filters on WHERE timeofban >= ? using a Unix timestamp computed from datetime.now(tz=UTC). If the fail2ban database stores timeofban in local time rather than UTC (which is the default for fail2ban ≤ 1.0), the comparison silently excludes all rows because the UTC timestamp is hours ahead of the local-time values.
Task:
- Query the fail2ban database for a few sample timeofban values and compare them to datetime.now(tz=UTC).timestamp() and time.time(). Determine whether fail2ban stores bans in UTC or local time.
- If fail2ban uses time.time() (which returns the UTC epoch on all platforms), then the bug is elsewhere — add debug logging to bans_by_jail that logs since, the actual SELECT COUNT(*) result, and db_path so the root cause can be traced from production logs.
- If the timestamps are local time, change _since_unix() to use time.time() (always UTC epoch) instead of datetime.now(tz=UTC).timestamp() to stay consistent. Both should be equivalent on correctly configured systems, but time.time() avoids any timezone-aware datetime pitfalls.
- Add a guard: if total == 0 and the range is 30d or 365d, run a SELECT COUNT(*) FROM bans (no WHERE) and log the result. If there are rows in the table but zero match the filter, log a warning with the since timestamp and the min/max timeofban values from the table. This makes future debugging trivial.
Files:
backend/app/services/ban_service.py — _since_unix(), bans_by_jail()
Acceptance criteria:
- bans_by_jail returns the correct jail counts for the requested time range.
- When zero results are returned despite data existing, a warning log is emitted with diagnostic information (since timestamp, db row count, min/max timeofban).
- _since_unix() uses a method consistent with how fail2ban stores timestamps.
7.2 Add a /api/dashboard/bans/by-jail diagnostic endpoint or debug logging ✅ DONE
Task:
Add debug-level structured log output to bans_by_jail that includes:
- The resolved db_path.
- The computed since Unix timestamp and its ISO representation.
- The raw total count from the first query.
- The number of jail groups returned.
This allows operators to diagnose empty-result issues from the container logs without code changes.
Files:
backend/app/services/ban_service.py — bans_by_jail()
7.3 Unit tests for bans_by_jail with a seeded in-memory database ✅ DONE
Task:
Write tests that create a temporary SQLite database matching the fail2ban bans table schema, seed it with rows at known timestamps, and call bans_by_jail (mocking _get_fail2ban_db_path to point at the temp database). Verify:
- Rows within the time range are counted and grouped by jail correctly.
- Rows outside the range are excluded.
- The origin filter ("blocklist" / "selfblock") partitions results as expected.
- An empty database returns {"jails": [], "total": 0} without error.
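A seeding helper for such tests might look like this (the column set is trimmed to what the queries use; fail2ban's real bans table has additional columns):

```python
import sqlite3

def seed_bans_db(path: str, rows: list[tuple[str, str, int]]) -> None:
    """Create a minimal bans table (jail, ip, timeofban, data) and insert
    rows. The schema is a trimmed stand-in for fail2ban's real bans table."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE bans (jail TEXT NOT NULL, ip TEXT, "
        "timeofban INTEGER NOT NULL, data JSON)"
    )
    con.executemany(
        "INSERT INTO bans (jail, ip, timeofban, data) VALUES (?, ?, ?, '{}')",
        rows,
    )
    con.commit()
    con.close()
```

Tests then point the mocked _get_fail2ban_db_path at the temp file and assert on the grouped counts.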
Files:
backend/tests/test_services/test_ban_service.py