refactoring-backend #3
@@ -1682,9 +1682,164 @@ Since tasks do not have access to `Depends(get_db)` (no request scope), they mus
|
||||
- **Startup validation:** `startup_shared_resources()` raises `RuntimeError` if `BANGUI_WORKERS > 1`.
|
||||
- See [Architekture.md § 9.2](Architekture.md) for full details.
|
||||
|
||||
### Timeout Protection for Background Tasks
|
||||
|
||||
**All background tasks must wrap their async work with timeout protection.** If a task hangs (API unreachable, network partition, database lock), it runs forever — never completes → lock never released → duplicate work starts → resource exhaustion. Timeouts prevent this.
|
||||
|
||||
**Rule:** Every task function must use `run_with_timeout()` from `app.tasks.timeout_utils` to enforce a timeout on its async work.
|
||||
|
||||
```python
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _run_import_with_resources(settings: Settings, http_session: ClientSession) -> None:
|
||||
"""Imports blocklists with timeout protection."""
|
||||
async def _do_import() -> None:
|
||||
# ... your async work ...
|
||||
result = await blocklist_service.import_all(...)
|
||||
log.info("import_finished", total=result.total_imported)
|
||||
|
||||
# Wrap with timeout: abort after 300 seconds
|
||||
await run_with_timeout("blocklist_import", _do_import(), timeout_seconds=300)
|
||||
```
|
||||
|
||||
**Why this pattern:**
|
||||
1. `run_with_timeout()` enforces strict time limits using `asyncio.wait_for()`.
|
||||
2. If timeout is exceeded, `TimeoutError` is raised and logged with elapsed time.
|
||||
3. If task approaches timeout (>80% of time budget), a warning is logged for observability.
|
||||
4. Failures are logged at `warning` level (not `error`) — timeouts are expected sometimes, but worth investigating.
|
||||
|
||||
**Timeout Values by Task:**
|
||||
|
||||
| Task | Timeout | Rationale |
|
||||
|------|---------|-----------|
|
||||
| `blocklist_import` | 300s (5 min) | Downloads, validates, applies external lists. Network delays expected. |
|
||||
| `health_check` | 10s | Socket probe to fail2ban. Should complete quickly or fail2ban is unresponsive. |
|
||||
| `geo_cache_flush` | 60s | Writes dirty cache entries to DB. Handles contention gracefully. |
|
||||
| `session_cleanup` | 30s | Deletes expired sessions. DB contention unlikely but possible. |
|
||||
| `rate_limiter_cleanup` | 5s | In-memory cleanup, no I/O. Should always be instant. |
|
||||
| `geo_cache_cleanup` | 60s | Deletes stale geo entries from DB. May scan large table. |
|
||||
| `geo_re_resolve` | 120s | Retries failed IP lookups with backoff. API rate-limit delays expected. |
|
||||
| `history_sync` | 60s | Syncs records from fail2ban DB to archive. May read/write many rows. |
|
||||
| `scheduler_lock_heartbeat` | 5s | Updates lock timestamp. Must be quick or lock is lost. |
|
||||
|
||||
**Timeout Events Are Logged:**
|
||||
|
||||
On timeout:
|
||||
```
|
||||
task_timeout task_name=blocklist_import timeout_seconds=300 elapsed_seconds=300.45
|
||||
```
|
||||
|
||||
On approaching timeout (>80% of budget used):
|
||||
```
|
||||
task_approaching_timeout task_name=blocklist_import timeout_seconds=300 elapsed_seconds=298.5 usage_percent=99.5
|
||||
```
|
||||
|
||||
The logs include `elapsed_seconds` for observability — if you see tasks consistently near timeout, the value may need adjustment.
|
||||
|
||||
**Testing Timeout Behavior:**
|
||||
|
||||
Tests for timeout scenarios are in `backend/tests/test_tasks/test_timeout_utils.py`:
|
||||
- Verify timeout is raised and logged.
|
||||
- Verify approaching-timeout warning is logged.
|
||||
- Verify task exceptions (not timeout) propagate correctly.
|
||||
|
||||
Add timeout tests to your task test file:
|
||||
```python
|
||||
@pytest.mark.asyncio
|
||||
async def test_task_timeout_is_logged(self) -> None:
|
||||
"""Task must be logged and raise TimeoutError on timeout."""
|
||||
with patch("app.tasks.my_task.log") as mock_log:
|
||||
with pytest.raises(TimeoutError):
|
||||
await my_task._run_with_resources(settings) # exceeds timeout
|
||||
|
||||
timeout_calls = [
|
||||
c for c in mock_log.warning.call_args_list
|
||||
if c[0][0] == "task_timeout"
|
||||
]
|
||||
assert len(timeout_calls) == 1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task Idempotency
|
||||
|
||||
**Background tasks must be idempotent** — retrying after a crash must produce the same result as running once.
|
||||
|
||||
If a task crashes or times out mid-execution, the scheduler may retry. Without idempotency, retries cause duplicate work:
|
||||
- **blocklist_import**: banned IPs appear twice → database corruption
|
||||
- **geo_cache_flush**: entries written twice → cache inconsistency
|
||||
- Any multi-step operation: partial state remains
|
||||
|
||||
**Pattern: Content-Hash Idempotency for Blocklist Imports**
|
||||
|
||||
Track imports by source + content hash to detect retries:
|
||||
|
||||
```python
|
||||
from app.repositories import import_run_repo
|
||||
|
||||
async def import_source(source, db, ...):
|
||||
# Download content
|
||||
status, content = await downloader.download(url)
|
||||
|
||||
# Compute hash for idempotency detection
|
||||
content_hash = hashlib.sha256(content.encode()).hexdigest()
|
||||
|
||||
# Check if this exact import already completed
|
||||
existing_run = await import_run_repo.get_by_source_and_hash(
|
||||
db, source.id, content_hash
|
||||
)
|
||||
|
||||
if existing_run and existing_run.status == "completed":
|
||||
# Already done — skip banning, optionally re-warm cache
|
||||
log.info("blocklist_import_already_completed", ...)
|
||||
return ImportSourceResult(ips_imported=existing_run.imported_count, ...)
|
||||
|
||||
# First run: create pending record
|
||||
if not existing_run:
|
||||
run_id = await import_run_repo.create_pending(
|
||||
db, source.id, content_hash
|
||||
)
|
||||
else:
|
||||
run_id = existing_run.id # Retry case
|
||||
|
||||
# Do work (ban IPs, etc.)
|
||||
imported, errors = await ban_executor.ban_ips(...)
|
||||
|
||||
# Mark as completed or failed (atomically)
|
||||
if errors:
|
||||
await import_run_repo.mark_failed(db, run_id, str(errors))
|
||||
else:
|
||||
await import_run_repo.mark_completed(db, run_id, imported, skipped)
|
||||
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
1. **Operation ID must be deterministic** — Use content hash, not timestamp
|
||||
- Same content = same operation ID → retry safe
|
||||
- Different content = different operation ID → new import run
|
||||
|
||||
2. **Check before doing work** — Query `import_runs` table before banning
|
||||
- If completed: skip banning (already done)
|
||||
- If pending: retry was interrupted, try again
|
||||
- If failed: retry to recover
|
||||
|
||||
3. **Atomic state updates** — Mark as completed AFTER all work succeeds
|
||||
- All-or-nothing: either import succeeded + logged, or failed + retryable
|
||||
|
||||
4. **Test idempotency** — Verify retrying same content doesn't duplicate bans
|
||||
```python
|
||||
# First import: ban 2 IPs
|
||||
result1 = await import_source(source, content, db)
|
||||
assert result1.ips_imported == 2
|
||||
|
||||
# Second import (same content): skip bans
|
||||
result2 = await import_source(source, content, db)
|
||||
assert result2.ips_imported == 2
|
||||
assert ban_ip.call_count == 2 # Only called once, not twice
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 10. Code Style & Tooling
|
||||
|
||||
| Tool | Purpose |
|
||||
|---|---|
|
||||
|
||||
@@ -1,97 +1,3 @@
|
||||
## [CRITICAL] Missing security headers (CSP, X-Frame-Options, etc.)
|
||||
|
||||
**Where found**
|
||||
|
||||
- Backend does not set `Content-Security-Policy`, `X-Frame-Options`, `X-Content-Type-Options` headers
|
||||
- Frontend HTML served without CSP meta tags
|
||||
|
||||
**Why this is needed**
|
||||
|
||||
Without security headers, browsers won't protect against XSS, clickjacking, MIME-sniffing, referrer leakage attacks.
|
||||
|
||||
**Goal**
|
||||
|
||||
Add security headers to all HTTP responses.
|
||||
|
||||
**What to do**
|
||||
|
||||
1. Add security headers middleware to `backend/app/main.py`:
|
||||
```python
|
||||
@app.middleware("http")
|
||||
async def add_security_headers(request, call_next):
|
||||
response = await call_next(request)
|
||||
response.headers["Content-Security-Policy"] = "default-src 'self'"
|
||||
response.headers["X-Frame-Options"] = "DENY"
|
||||
response.headers["X-Content-Type-Options"] = "nosniff"
|
||||
return response
|
||||
```
|
||||
|
||||
2. In frontend `index.html`, add CSP meta tag
|
||||
3. Test with browser DevTools Security tab
|
||||
|
||||
**Possible traps and issues**
|
||||
|
||||
- CSP `'unsafe-inline'` defeats security — avoid if possible
|
||||
- CDN resources may need explicit allowlist
|
||||
- Too restrictive CSP breaks functionality; too loose defeats security
|
||||
|
||||
**Docs changes needed**
|
||||
|
||||
- Add section in `Docs/Security.md` § HTTP Security Headers
|
||||
|
||||
**Doc references**
|
||||
|
||||
- `Docs/Security.md` (security headers)
|
||||
|
||||
---
|
||||
|
||||
## [CRITICAL] Background tasks lack timeout protection
|
||||
|
||||
**Where found**
|
||||
|
||||
- `backend/app/tasks/blocklist_import.py` — no timeout
|
||||
- `backend/app/tasks/health_check.py` — no timeout
|
||||
- All task functions lack timeout wrapper
|
||||
|
||||
**Why this is needed**
|
||||
|
||||
If task hangs (API unreachable, network partition), task runs forever. Never completes → lock never released → duplicate work, resource exhaustion.
|
||||
|
||||
**Goal**
|
||||
|
||||
Ensure all background tasks complete within bounded time or fail gracefully.
|
||||
|
||||
**What to do**
|
||||
|
||||
1. Wrap all task functions with `asyncio.wait_for(task, timeout)`:
|
||||
```python
|
||||
await asyncio.wait_for(blocklist_service.import_all(...), timeout=300)
|
||||
```
|
||||
|
||||
2. Set appropriate timeouts per task:
|
||||
- Blocklist import: 300s (5 min)
|
||||
- Health probe: 10s
|
||||
- Geo cache flush: 60s
|
||||
|
||||
3. Log timeout events and trigger alerts
|
||||
|
||||
**Possible traps and issues**
|
||||
|
||||
- Timeout too short → legitimate tasks killed prematurely
|
||||
- Timeout too long → resource leak if many tasks hang
|
||||
- Killing task mid-operation may leave inconsistent state
|
||||
|
||||
**Docs changes needed**
|
||||
|
||||
- Add section in `Docs/Backend-Development.md` § Background Tasks
|
||||
|
||||
**Doc references**
|
||||
|
||||
- `Docs/Backend-Development.md` (background tasks)
|
||||
- `backend/app/tasks/` (task modules)
|
||||
|
||||
---
|
||||
|
||||
## [CRITICAL] Background tasks not idempotent
|
||||
|
||||
**Where found**
|
||||
|
||||
@@ -107,7 +107,7 @@ _SCHEMA_STATEMENTS: list[str] = [
|
||||
_CREATE_HISTORY_ARCHIVE,
|
||||
]
|
||||
|
||||
_CURRENT_SCHEMA_VERSION: int = 5
|
||||
_CURRENT_SCHEMA_VERSION: int = 6
|
||||
|
||||
_MIGRATIONS: dict[int, str] = {
|
||||
1: "\n".join(_SCHEMA_STATEMENTS),
|
||||
@@ -166,6 +166,27 @@ CREATE INDEX IF NOT EXISTS idx_history_archive_ip
|
||||
-- Index for action-based queries: supports ban/unban filtering.
|
||||
CREATE INDEX IF NOT EXISTS idx_history_archive_action
|
||||
ON history_archive (action);
|
||||
""",
|
||||
6: """
|
||||
-- Migration 6: Add import_runs table for tracking blocklist import idempotency.
|
||||
-- Tracks unique imports by source and content hash to enable idempotent retries.
|
||||
-- On import crash, retry will detect the operation_id and skip duplicate bans.
|
||||
-- This prevents duplicate IP bans if the scheduler retries after a failure.
|
||||
CREATE TABLE IF NOT EXISTS import_runs (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
source_id INTEGER NOT NULL REFERENCES blocklist_sources(id) ON DELETE CASCADE,
|
||||
content_hash TEXT NOT NULL,
|
||||
status TEXT NOT NULL CHECK(status IN ('pending', 'completed', 'failed')),
|
||||
imported_count INTEGER NOT NULL DEFAULT 0,
|
||||
skipped_count INTEGER NOT NULL DEFAULT 0,
|
||||
error_message TEXT,
|
||||
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
|
||||
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ', 'now')),
|
||||
UNIQUE(source_id, content_hash)
|
||||
);
|
||||
-- Index for looking up completed imports by source
|
||||
CREATE INDEX IF NOT EXISTS idx_import_runs_source_status
|
||||
ON import_runs (source_id, status);
|
||||
""",
|
||||
}
|
||||
|
||||
|
||||
@@ -50,6 +50,7 @@ from app.repositories.protocols import (
|
||||
GeoCacheRepository,
|
||||
HistoryArchiveRepository,
|
||||
ImportLogRepository,
|
||||
ImportRunRepository,
|
||||
SessionRepository,
|
||||
SettingsRepository,
|
||||
)
|
||||
@@ -68,6 +69,7 @@ from app.repositories import (
|
||||
geo_cache_repo,
|
||||
history_archive_repo,
|
||||
import_log_repo,
|
||||
import_run_repo,
|
||||
session_repo,
|
||||
settings_repo,
|
||||
)
|
||||
@@ -292,6 +294,15 @@ async def get_import_log_repo() -> ImportLogRepository:
|
||||
return cast("ImportLogRepository", import_log_repo)
|
||||
|
||||
|
||||
async def get_import_run_repo() -> ImportRunRepository:
|
||||
"""Provide the concrete import run repository implementation.
|
||||
|
||||
The import_run_repo module uses structural typing to satisfy the ImportRunRepository
|
||||
Protocol interface for tracking blocklist imports for idempotency detection.
|
||||
"""
|
||||
return cast("ImportRunRepository", import_run_repo)
|
||||
|
||||
|
||||
async def get_settings_repo() -> SettingsRepository:
|
||||
"""Provide the concrete settings repository implementation.
|
||||
|
||||
@@ -649,6 +660,7 @@ SettingsRepoDep = Annotated[SettingsRepository, Depends(get_settings_repo)]
|
||||
HistoryArchiveRepositoryDep = Annotated[HistoryArchiveRepository, Depends(get_history_archive_repo)]
|
||||
BlocklistRepositoryDep = Annotated[BlocklistRepository, Depends(get_blocklist_repo)]
|
||||
ImportLogRepositoryDep = Annotated[ImportLogRepository, Depends(get_import_log_repo)]
|
||||
ImportRunRepositoryDep = Annotated[ImportRunRepository, Depends(get_import_run_repo)]
|
||||
GeoCacheRepositoryDep = Annotated[GeoCacheRepository, Depends(get_geo_cache_repo)]
|
||||
Fail2BanDbRepositoryDep = Annotated[Fail2BanDbRepository, Depends(get_fail2ban_db_repo)]
|
||||
AppStateDep = Annotated[ApplicationContext, Depends(get_app_state)]
|
||||
|
||||
@@ -78,6 +78,28 @@ class ImportLogListResponse(PaginatedListResponse[ImportLogEntry]):
|
||||
|
||||
pass
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Import run tracking (for idempotency)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class ImportRunEntry(BanGuiBaseModel):
|
||||
"""Tracks a unique blocklist import run by source and content hash.
|
||||
|
||||
Used to detect re-runs and prevent duplicate bans when the scheduler
|
||||
retries after a crash.
|
||||
"""
|
||||
|
||||
id: int
|
||||
source_id: int
|
||||
content_hash: str
|
||||
status: str # 'pending' | 'completed' | 'failed'
|
||||
imported_count: int
|
||||
skipped_count: int
|
||||
error_message: str | None
|
||||
created_at: str
|
||||
updated_at: str
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Schedule
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
140
backend/app/repositories/import_run_repo.py
Normal file
140
backend/app/repositories/import_run_repo.py
Normal file
@@ -0,0 +1,140 @@
|
||||
"""Import run repository for blocklist import idempotency tracking.
|
||||
|
||||
Persists and queries import run records in the ``import_runs`` table.
|
||||
Enables detection of duplicate import attempts and prevents re-running bans
|
||||
on scheduler retry after a crash.
|
||||
|
||||
All methods are plain async functions that accept an :class:`aiosqlite.Connection`.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
if TYPE_CHECKING:
|
||||
import aiosqlite
|
||||
|
||||
from app.models.blocklist import ImportRunEntry
|
||||
|
||||
|
||||
async def get_by_source_and_hash(
|
||||
db: aiosqlite.Connection,
|
||||
source_id: int,
|
||||
content_hash: str,
|
||||
) -> ImportRunEntry | None:
|
||||
"""Check if a specific import (by source and content hash) already exists.
|
||||
|
||||
Args:
|
||||
db: Active aiosqlite connection.
|
||||
source_id: FK to ``blocklist_sources.id``.
|
||||
content_hash: SHA256 hash of the downloaded blocklist content.
|
||||
|
||||
Returns:
|
||||
ImportRunEntry if found, None otherwise.
|
||||
"""
|
||||
async with db.execute(
|
||||
"""
|
||||
SELECT
|
||||
id, source_id, content_hash, status,
|
||||
imported_count, skipped_count, error_message,
|
||||
created_at, updated_at
|
||||
FROM import_runs
|
||||
WHERE source_id = ? AND content_hash = ?
|
||||
""",
|
||||
(source_id, content_hash),
|
||||
) as cursor:
|
||||
row = await cursor.fetchone()
|
||||
|
||||
if not row:
|
||||
return None
|
||||
|
||||
return ImportRunEntry(
|
||||
id=row[0],
|
||||
source_id=row[1],
|
||||
content_hash=row[2],
|
||||
status=row[3],
|
||||
imported_count=row[4],
|
||||
skipped_count=row[5],
|
||||
error_message=row[6],
|
||||
created_at=row[7],
|
||||
updated_at=row[8],
|
||||
)
|
||||
|
||||
|
||||
async def create_pending(
|
||||
db: aiosqlite.Connection,
|
||||
source_id: int,
|
||||
content_hash: str,
|
||||
) -> int:
|
||||
"""Create a pending import run entry.
|
||||
|
||||
Args:
|
||||
db: Active aiosqlite connection.
|
||||
source_id: FK to ``blocklist_sources.id``.
|
||||
content_hash: SHA256 hash of the downloaded blocklist content.
|
||||
|
||||
Returns:
|
||||
Primary key of the inserted row.
|
||||
"""
|
||||
cursor = await db.execute(
|
||||
"""
|
||||
INSERT INTO import_runs (source_id, content_hash, status)
|
||||
VALUES (?, ?, 'pending')
|
||||
""",
|
||||
(source_id, content_hash),
|
||||
)
|
||||
await db.commit()
|
||||
return int(cursor.lastrowid) # type: ignore[arg-type]
|
||||
|
||||
|
||||
async def mark_completed(
|
||||
db: aiosqlite.Connection,
|
||||
run_id: int,
|
||||
imported_count: int,
|
||||
skipped_count: int,
|
||||
) -> None:
|
||||
"""Mark an import run as completed with final counts.
|
||||
|
||||
Args:
|
||||
db: Active aiosqlite connection.
|
||||
run_id: Primary key of the import run.
|
||||
imported_count: Number of IPs successfully banned.
|
||||
skipped_count: Number of entries skipped (invalid or CIDR).
|
||||
"""
|
||||
await db.execute(
|
||||
"""
|
||||
UPDATE import_runs
|
||||
SET status = 'completed',
|
||||
imported_count = ?,
|
||||
skipped_count = ?,
|
||||
updated_at = strftime('%Y-%m-%dT%H:%M:%fZ', 'now')
|
||||
WHERE id = ?
|
||||
""",
|
||||
(imported_count, skipped_count, run_id),
|
||||
)
|
||||
await db.commit()
|
||||
|
||||
|
||||
async def mark_failed(
|
||||
db: aiosqlite.Connection,
|
||||
run_id: int,
|
||||
error_message: str,
|
||||
) -> None:
|
||||
"""Mark an import run as failed with error details.
|
||||
|
||||
Args:
|
||||
db: Active aiosqlite connection.
|
||||
run_id: Primary key of the import run.
|
||||
error_message: Error description.
|
||||
"""
|
||||
await db.execute(
|
||||
"""
|
||||
UPDATE import_runs
|
||||
SET status = 'failed',
|
||||
error_message = ?,
|
||||
updated_at = strftime('%Y-%m-%dT%H:%M:%fZ', 'now')
|
||||
WHERE id = ?
|
||||
""",
|
||||
(error_message, run_id),
|
||||
)
|
||||
await db.commit()
|
||||
@@ -16,6 +16,8 @@ from app.models.ban import BanOrigin
|
||||
from app.repositories.fail2ban_db_repo import BanIpCount, BanRecord, HistoryRecord, JailBanCount
|
||||
from app.repositories.geo_cache_repo import GeoCacheRow
|
||||
from app.repositories.import_log_repo import ImportLogRow
|
||||
from app.models.blocklist import ImportRunEntry
|
||||
|
||||
|
||||
|
||||
class SessionRepository(Protocol):
|
||||
@@ -140,6 +142,47 @@ class ImportLogRepository(Protocol):
|
||||
...
|
||||
|
||||
|
||||
class ImportRunRepository(Protocol):
|
||||
"""Protocol for tracking blocklist import runs for idempotency."""
|
||||
|
||||
async def get_by_source_and_hash(
|
||||
self,
|
||||
db: aiosqlite.Connection,
|
||||
source_id: int,
|
||||
content_hash: str,
|
||||
) -> ImportRunEntry | None:
|
||||
"""Check if a specific import (by source and content hash) has been completed."""
|
||||
...
|
||||
|
||||
async def create_pending(
|
||||
self,
|
||||
db: aiosqlite.Connection,
|
||||
source_id: int,
|
||||
content_hash: str,
|
||||
) -> int:
|
||||
"""Create a pending import run entry. Returns the id."""
|
||||
...
|
||||
|
||||
async def mark_completed(
|
||||
self,
|
||||
db: aiosqlite.Connection,
|
||||
run_id: int,
|
||||
imported_count: int,
|
||||
skipped_count: int,
|
||||
) -> None:
|
||||
"""Mark an import run as completed with final counts."""
|
||||
...
|
||||
|
||||
async def mark_failed(
|
||||
self,
|
||||
db: aiosqlite.Connection,
|
||||
run_id: int,
|
||||
error_message: str,
|
||||
) -> None:
|
||||
"""Mark an import run as failed with error details."""
|
||||
...
|
||||
|
||||
|
||||
class GeoCacheRepository(Protocol):
|
||||
async def load_all(self, db: aiosqlite.Connection) -> list[GeoCacheRow]:
|
||||
...
|
||||
|
||||
@@ -3,16 +3,22 @@
|
||||
Coordinates the download, parse, validate, ban, and logging steps for
|
||||
importing blocklist sources. This thin orchestration layer composes the
|
||||
individual components.
|
||||
|
||||
Implements idempotent retries: if the process crashes after downloading but
|
||||
before completing, retry will detect the cached operation and skip duplicate
|
||||
bans while re-warming the geo cache.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
import aiohttp
|
||||
import structlog
|
||||
|
||||
from app.models.blocklist import BlocklistSource, ImportSourceResult
|
||||
from app.repositories import import_run_repo
|
||||
from app.services.blocklist_ban_executor import BanExecutor
|
||||
from app.services.blocklist_downloader import BlocklistDownloader
|
||||
from app.services.blocklist_parser import BlocklistParser
|
||||
@@ -35,6 +41,19 @@ def _aiohttp_timeout(seconds: float) -> aiohttp.ClientTimeout:
|
||||
return aiohttp.ClientTimeout(total=seconds)
|
||||
|
||||
|
||||
def _compute_content_hash(content: str) -> str:
|
||||
"""Compute SHA256 hash of blocklist content for idempotency detection.
|
||||
|
||||
Args:
|
||||
content: Raw blocklist content as string.
|
||||
|
||||
Returns:
|
||||
Hex-encoded SHA256 hash.
|
||||
"""
|
||||
return hashlib.sha256(content.encode()).hexdigest()
|
||||
|
||||
|
||||
|
||||
class BlocklistImportWorkflow:
|
||||
"""Orchestrates the complete blocklist import flow for a single source."""
|
||||
|
||||
@@ -70,12 +89,15 @@ class BlocklistImportWorkflow:
|
||||
) -> ImportSourceResult:
|
||||
"""Download and apply bans from a single blocklist source.
|
||||
|
||||
Implements idempotent retries: if the process crashes mid-operation,
|
||||
retry will detect the cached import run and skip duplicate bans.
|
||||
|
||||
The workflow:
|
||||
1. Download the URL with retries for transient failures.
|
||||
2. Parse content to extract valid IP addresses.
|
||||
3. Ban each valid IP via fail2ban.
|
||||
4. Pre-warm geo cache with newly banned IPs.
|
||||
5. Log the result.
|
||||
2. Compute content hash for idempotency detection.
|
||||
3. Check if this exact content has already been imported.
|
||||
4. If yes (retry case): skip banning, but re-warm geo cache.
|
||||
5. If no: mark as pending, parse, ban, mark as completed, pre-warm cache.
|
||||
|
||||
After a successful import, the geo cache is pre-warmed by batch-resolving
|
||||
all newly banned IPs. This ensures the dashboard and map show country
|
||||
@@ -128,11 +150,69 @@ class BlocklistImportWorkflow:
|
||||
error=error_msg,
|
||||
)
|
||||
|
||||
# --- Compute content hash for idempotency ---
|
||||
content_hash = _compute_content_hash(content)
|
||||
|
||||
# --- Check if this import has already been completed ---
|
||||
existing_run = await import_run_repo.get_by_source_and_hash(
|
||||
db,
|
||||
source.id,
|
||||
content_hash,
|
||||
)
|
||||
|
||||
if existing_run is not None and existing_run.status == "completed":
|
||||
log.info(
|
||||
"blocklist_import_already_completed",
|
||||
source_id=source.id,
|
||||
content_hash=content_hash[:8],
|
||||
imported=existing_run.imported_count,
|
||||
skipped=existing_run.skipped_count,
|
||||
)
|
||||
# Skip banning (already done), but still offer to pre-warm cache
|
||||
await self._prewarm_geo_cache(
|
||||
source,
|
||||
existing_run.imported_count,
|
||||
content,
|
||||
geo_is_cached,
|
||||
geo_cache,
|
||||
)
|
||||
return ImportSourceResult(
|
||||
source_id=source.id,
|
||||
source_url=source.url,
|
||||
ips_imported=existing_run.imported_count,
|
||||
ips_skipped=existing_run.skipped_count,
|
||||
error=None,
|
||||
)
|
||||
|
||||
# --- Parse and validate ---
|
||||
parsed = self.parser.parse(content)
|
||||
valid_ips = parsed.valid_ips
|
||||
skipped = parsed.skipped_entries
|
||||
|
||||
# --- Create or update pending import run entry ---
|
||||
if existing_run is None:
|
||||
run_id = await import_run_repo.create_pending(
|
||||
db,
|
||||
source.id,
|
||||
content_hash,
|
||||
)
|
||||
log.info(
|
||||
"blocklist_import_tracking_created",
|
||||
source_id=source.id,
|
||||
run_id=run_id,
|
||||
content_hash=content_hash[:8],
|
||||
)
|
||||
else:
|
||||
# Retry case: existing run is pending or failed, try again
|
||||
run_id = existing_run.id
|
||||
log.info(
|
||||
"blocklist_import_retrying",
|
||||
source_id=source.id,
|
||||
run_id=run_id,
|
||||
content_hash=content_hash[:8],
|
||||
previous_status=existing_run.status,
|
||||
)
|
||||
|
||||
# --- Ban ---
|
||||
imported, failed, ban_error = await self.ban_executor.ban_ips(
|
||||
socket_path,
|
||||
@@ -140,46 +220,42 @@ class BlocklistImportWorkflow:
|
||||
valid_ips,
|
||||
)
|
||||
|
||||
# --- Update import run status ---
|
||||
if ban_error is not None:
|
||||
await import_run_repo.mark_failed(db, run_id, ban_error)
|
||||
log.warning(
|
||||
"blocklist_import_banning_failed",
|
||||
source_id=source.id,
|
||||
run_id=run_id,
|
||||
error=ban_error,
|
||||
)
|
||||
else:
|
||||
await import_run_repo.mark_completed(
|
||||
db,
|
||||
run_id,
|
||||
imported,
|
||||
skipped + failed,
|
||||
)
|
||||
|
||||
# --- Log result ---
|
||||
await self.log_result(db, source, imported, skipped, ban_error)
|
||||
await self.log_result(db, source, imported, skipped + failed, ban_error)
|
||||
log.info(
|
||||
"blocklist_source_imported",
|
||||
source_id=source.id,
|
||||
url=source.url,
|
||||
imported=imported,
|
||||
skipped=skipped,
|
||||
skipped=skipped + failed,
|
||||
error=ban_error,
|
||||
)
|
||||
|
||||
# --- Pre-warm geo cache for newly imported IPs ---
|
||||
imported_ips = valid_ips[: imported] if imported > 0 else []
|
||||
if imported_ips and geo_is_cached is not None:
|
||||
uncached_ips: list[str] = [
|
||||
ip for ip in imported_ips if not geo_is_cached(ip)
|
||||
]
|
||||
skipped_geo: int = len(imported_ips) - len(uncached_ips)
|
||||
|
||||
if skipped_geo > 0:
|
||||
log.info(
|
||||
"blocklist_geo_prewarm_cache_hit",
|
||||
source_id=source.id,
|
||||
skipped=skipped_geo,
|
||||
to_lookup=len(uncached_ips),
|
||||
)
|
||||
|
||||
if uncached_ips and geo_cache is not None:
|
||||
try:
|
||||
await geo_cache.lookup_batch(uncached_ips, self.downloader.http_session, db=db)
|
||||
log.info(
|
||||
"blocklist_geo_prewarm_complete",
|
||||
source_id=source.id,
|
||||
count=len(uncached_ips),
|
||||
)
|
||||
except (TimeoutError, aiohttp.ClientError, OSError):
|
||||
log.warning(
|
||||
"blocklist_geo_prewarm_failed",
|
||||
source_id=source.id,
|
||||
)
|
||||
await self._prewarm_geo_cache(
|
||||
source,
|
||||
imported,
|
||||
content,
|
||||
geo_is_cached,
|
||||
geo_cache,
|
||||
)
|
||||
|
||||
return ImportSourceResult(
|
||||
source_id=source.id,
|
||||
@@ -188,3 +264,59 @@ class BlocklistImportWorkflow:
|
||||
ips_skipped=skipped + failed,
|
||||
error=ban_error,
|
||||
)
|
||||
|
||||
async def _prewarm_geo_cache(
|
||||
self,
|
||||
source: BlocklistSource,
|
||||
imported: int,
|
||||
content: str,
|
||||
geo_is_cached: Callable[[str], bool] | None,
|
||||
geo_cache: GeoCache | None,
|
||||
) -> None:
|
||||
"""Pre-warm geo cache with newly imported IPs.
|
||||
|
||||
Extracted into helper to support both first-run and retry scenarios.
|
||||
|
||||
Args:
|
||||
source: The blocklist source.
|
||||
imported: Number of IPs that were (or have already been) banned.
|
||||
content: The downloaded content to extract IPs from.
|
||||
geo_is_cached: Optional function to check if an IP is cached.
|
||||
geo_cache: Optional GeoCache instance for pre-warming.
|
||||
"""
|
||||
if imported == 0 or geo_is_cached is None or geo_cache is None:
|
||||
return
|
||||
|
||||
# Re-parse content to get IPs (needed for retry case)
|
||||
parsed = self.parser.parse(content)
|
||||
imported_ips = parsed.valid_ips[:imported] if imported > 0 else []
|
||||
|
||||
if not imported_ips:
|
||||
return
|
||||
|
||||
uncached_ips: list[str] = [
|
||||
ip for ip in imported_ips if not geo_is_cached(ip)
|
||||
]
|
||||
skipped_geo: int = len(imported_ips) - len(uncached_ips)
|
||||
|
||||
if skipped_geo > 0:
|
||||
log.info(
|
||||
"blocklist_geo_prewarm_cache_hit",
|
||||
source_id=source.id,
|
||||
skipped=skipped_geo,
|
||||
to_lookup=len(uncached_ips),
|
||||
)
|
||||
|
||||
if uncached_ips:
|
||||
try:
|
||||
await geo_cache.lookup_batch(uncached_ips, self.downloader.http_session, db=None)
|
||||
log.info(
|
||||
"blocklist_geo_prewarm_complete",
|
||||
source_id=source.id,
|
||||
count=len(uncached_ips),
|
||||
)
|
||||
except (TimeoutError, aiohttp.ClientError, OSError):
|
||||
log.warning(
|
||||
"blocklist_geo_prewarm_failed",
|
||||
source_id=source.id,
|
||||
)
|
||||
|
||||
@@ -19,6 +19,7 @@ import structlog
|
||||
|
||||
from app.services import ban_service, blocklist_service
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
|
||||
if TYPE_CHECKING:
|
||||
@@ -32,6 +33,9 @@ log: structlog.stdlib.BoundLogger = structlog.get_logger()
|
||||
#: Stable APScheduler job id so the job can be replaced without duplicates.
|
||||
JOB_ID: str = "blocklist_import"
|
||||
|
||||
#: Maximum seconds to allow for blocklist import task to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 300
|
||||
|
||||
|
||||
async def _run_import_with_resources(settings: Settings, http_session: ClientSession) -> None:
|
||||
"""APScheduler callback that imports all enabled blocklist sources.
|
||||
@@ -40,25 +44,29 @@ async def _run_import_with_resources(settings: Settings, http_session: ClientSes
|
||||
settings: The resolved application settings used for database access.
|
||||
http_session: The shared aiohttp session used for blocklist downloads.
|
||||
"""
|
||||
socket_path: str = settings.fail2ban_socket
|
||||
|
||||
log.info("blocklist_import_starting")
|
||||
try:
|
||||
async with task_db(settings) as db:
|
||||
result = await blocklist_service.import_all(
|
||||
db,
|
||||
http_session,
|
||||
socket_path,
|
||||
ban_ip=ban_service.ban_ip,
|
||||
async def _do_import() -> None:
|
||||
socket_path: str = settings.fail2ban_socket
|
||||
|
||||
log.info("blocklist_import_starting")
|
||||
try:
|
||||
async with task_db(settings) as db:
|
||||
result = await blocklist_service.import_all(
|
||||
db,
|
||||
http_session,
|
||||
socket_path,
|
||||
ban_ip=ban_service.ban_ip,
|
||||
)
|
||||
log.info(
|
||||
"blocklist_import_finished",
|
||||
total_imported=result.total_imported,
|
||||
total_skipped=result.total_skipped,
|
||||
errors=result.errors_count,
|
||||
)
|
||||
log.info(
|
||||
"blocklist_import_finished",
|
||||
total_imported=result.total_imported,
|
||||
total_skipped=result.total_skipped,
|
||||
errors=result.errors_count,
|
||||
)
|
||||
except Exception:
|
||||
log.exception("blocklist_import_unexpected_error")
|
||||
except Exception:
|
||||
log.exception("blocklist_import_unexpected_error")
|
||||
|
||||
await run_with_timeout("blocklist_import", _do_import(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
run_import_with_resources = _run_import_with_resources
|
||||
|
||||
@@ -18,6 +18,7 @@ import structlog
|
||||
|
||||
from app.repositories import geo_cache_repo
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
|
||||
if TYPE_CHECKING:
|
||||
@@ -36,6 +37,9 @@ GEO_CLEANUP_INTERVAL: int = 24 * 60 * 60
|
||||
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
|
||||
JOB_ID: str = "geo_cache_cleanup"
|
||||
|
||||
#: Maximum seconds to allow for geo cache cleanup to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 60
|
||||
|
||||
|
||||
async def _run_cleanup_with_resources(settings: Settings) -> None:
|
||||
"""Delete stale entries from the geo cache.
|
||||
@@ -46,17 +50,21 @@ async def _run_cleanup_with_resources(settings: Settings) -> None:
|
||||
Args:
|
||||
settings: The resolved application settings used for database access.
|
||||
"""
|
||||
cutoff_dt = datetime.now(UTC) - timedelta(days=GEO_CACHE_RETENTION_DAYS)
|
||||
cutoff_iso = cutoff_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
|
||||
|
||||
async with task_db(settings) as db:
|
||||
deleted = await geo_cache_repo.delete_stale_entries(db, cutoff_iso)
|
||||
await db.commit()
|
||||
async def _do_cleanup() -> None:
|
||||
cutoff_dt = datetime.now(UTC) - timedelta(days=GEO_CACHE_RETENTION_DAYS)
|
||||
cutoff_iso = cutoff_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
|
||||
|
||||
if deleted > 0:
|
||||
log.info("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
|
||||
else:
|
||||
log.debug("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
|
||||
async with task_db(settings) as db:
|
||||
deleted = await geo_cache_repo.delete_stale_entries(db, cutoff_iso)
|
||||
await db.commit()
|
||||
|
||||
if deleted > 0:
|
||||
log.info("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
|
||||
else:
|
||||
log.debug("geo_cache_cleanup_ran", deleted=deleted, retention_days=GEO_CACHE_RETENTION_DAYS)
|
||||
|
||||
await run_with_timeout("geo_cache_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _run_cleanup(app: FastAPI) -> None:
|
||||
|
||||
@@ -15,14 +15,15 @@ from typing import TYPE_CHECKING
|
||||
|
||||
import structlog
|
||||
|
||||
from app.services.geo_cache import GeoCache
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from fastapi import FastAPI
|
||||
|
||||
from app.config import Settings
|
||||
from app.services.geo_cache import GeoCache
|
||||
|
||||
log: structlog.stdlib.BoundLogger = structlog.get_logger()
|
||||
|
||||
@@ -32,6 +33,9 @@ GEO_FLUSH_INTERVAL: int = 60
|
||||
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
|
||||
JOB_ID: str = "geo_cache_flush"
|
||||
|
||||
#: Maximum seconds to allow for geo cache flush to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 60
|
||||
|
||||
|
||||
async def _run_flush_with_resources(geo_cache: GeoCache, settings: Settings) -> None:
|
||||
"""Flush the geo cache dirty set to the application database.
|
||||
@@ -40,11 +44,15 @@ async def _run_flush_with_resources(geo_cache: GeoCache, settings: Settings) ->
|
||||
geo_cache: The application's GeoCache instance.
|
||||
settings: The resolved application settings used for database access.
|
||||
"""
|
||||
async with task_db(settings) as db:
|
||||
count = await geo_cache.flush_dirty(db)
|
||||
|
||||
if count > 0:
|
||||
log.debug("geo_cache_flush_ran", flushed=count)
|
||||
async def _do_flush() -> None:
|
||||
async with task_db(settings) as db:
|
||||
count = await geo_cache.flush_dirty(db)
|
||||
|
||||
if count > 0:
|
||||
log.debug("geo_cache_flush_ran", flushed=count)
|
||||
|
||||
await run_with_timeout("geo_cache_flush", _do_flush(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _run_flush(app: FastAPI) -> None:
|
||||
|
||||
@@ -21,8 +21,8 @@ from typing import TYPE_CHECKING
|
||||
|
||||
import structlog
|
||||
|
||||
from app.services.geo_cache import GeoCache
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
|
||||
if TYPE_CHECKING:
|
||||
@@ -30,6 +30,7 @@ if TYPE_CHECKING:
|
||||
from fastapi import FastAPI
|
||||
|
||||
from app.config import Settings
|
||||
from app.services.geo_cache import GeoCache
|
||||
|
||||
log: structlog.stdlib.BoundLogger = structlog.get_logger()
|
||||
|
||||
@@ -39,6 +40,9 @@ GEO_RE_RESOLVE_INTERVAL: int = 600
|
||||
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
|
||||
JOB_ID: str = "geo_re_resolve"
|
||||
|
||||
#: Maximum seconds to allow for geo re-resolve to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 120
|
||||
|
||||
|
||||
async def _run_re_resolve_with_resources(
|
||||
geo_cache: GeoCache, settings: Settings, http_session: ClientSession
|
||||
@@ -50,31 +54,35 @@ async def _run_re_resolve_with_resources(
|
||||
settings: The resolved application settings used for database access.
|
||||
http_session: The shared aiohttp session used for external lookups.
|
||||
"""
|
||||
async with task_db(settings) as db:
|
||||
# Fetch all IPs with NULL country_code from the persistent cache.
|
||||
unresolved_ips = await geo_cache.get_unresolved_ips(db)
|
||||
|
||||
if not unresolved_ips:
|
||||
log.debug("geo_re_resolve_skip", reason="no_unresolved_ips")
|
||||
return
|
||||
async def _do_re_resolve() -> None:
|
||||
async with task_db(settings) as db:
|
||||
# Fetch all IPs with NULL country_code from the persistent cache.
|
||||
unresolved_ips = await geo_cache.get_unresolved_ips(db)
|
||||
|
||||
log.info("geo_re_resolve_start", unresolved=len(unresolved_ips))
|
||||
if not unresolved_ips:
|
||||
log.debug("geo_re_resolve_skip", reason="no_unresolved_ips")
|
||||
return
|
||||
|
||||
# Clear the negative cache so these IPs are eligible for fresh API calls.
|
||||
await geo_cache.clear_neg_cache()
|
||||
log.info("geo_re_resolve_start", unresolved=len(unresolved_ips))
|
||||
|
||||
# lookup_batch handles throttling, retries, and persistence when db is
|
||||
# passed. This is a background task so DB writes are allowed.
|
||||
results = await geo_cache.lookup_batch(unresolved_ips, http_session, db=db)
|
||||
# Clear the negative cache so these IPs are eligible for fresh API calls.
|
||||
await geo_cache.clear_neg_cache()
|
||||
|
||||
resolved_count: int = sum(
|
||||
1 for info in results.values() if info.country_code is not None
|
||||
)
|
||||
log.info(
|
||||
"geo_re_resolve_complete",
|
||||
retried=len(unresolved_ips),
|
||||
resolved=resolved_count,
|
||||
)
|
||||
# lookup_batch handles throttling, retries, and persistence when db is
|
||||
# passed. This is a background task so DB writes are allowed.
|
||||
results = await geo_cache.lookup_batch(unresolved_ips, http_session, db=db)
|
||||
|
||||
resolved_count: int = sum(
|
||||
1 for info in results.values() if info.country_code is not None
|
||||
)
|
||||
log.info(
|
||||
"geo_re_resolve_complete",
|
||||
retried=len(unresolved_ips),
|
||||
resolved=resolved_count,
|
||||
)
|
||||
|
||||
await run_with_timeout("geo_re_resolve", _do_re_resolve(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _run_re_resolve(app: FastAPI) -> None:
|
||||
|
||||
@@ -24,6 +24,7 @@ import structlog
|
||||
|
||||
from app.models.server import ServerStatus
|
||||
from app.services import health_service
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import (
|
||||
RuntimeState,
|
||||
get_effective_settings,
|
||||
@@ -42,6 +43,9 @@ log: structlog.stdlib.BoundLogger = structlog.get_logger()
|
||||
#: How often the probe fires (seconds).
|
||||
HEALTH_CHECK_INTERVAL: int = 30
|
||||
|
||||
#: Maximum seconds to allow for health probe to complete.
|
||||
HEALTH_PROBE_TIMEOUT_SECONDS: int = 10
|
||||
|
||||
|
||||
async def _run_probe_with_resources(settings: Settings, runtime_state: RuntimeState) -> None:
|
||||
"""Probe fail2ban and cache the result on the runtime state.
|
||||
@@ -50,14 +54,13 @@ async def _run_probe_with_resources(settings: Settings, runtime_state: RuntimeSt
|
||||
settings: The resolved application settings used for the probe.
|
||||
runtime_state: The mutable runtime state manager.
|
||||
"""
|
||||
socket_path: str = settings.fail2ban_socket
|
||||
prev_status: ServerStatus = getattr(
|
||||
runtime_state,
|
||||
"server_status",
|
||||
ServerStatus(online=False),
|
||||
)
|
||||
status: ServerStatus = await health_service.probe(socket_path)
|
||||
process_health_probe_result(runtime_state, status)
|
||||
|
||||
async def _do_probe() -> None:
|
||||
socket_path: str = settings.fail2ban_socket
|
||||
status: ServerStatus = await health_service.probe(socket_path)
|
||||
process_health_probe_result(runtime_state, status)
|
||||
|
||||
await run_with_timeout("health_check", _do_probe(), HEALTH_PROBE_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _run_probe(app: FastAPI) -> None:
|
||||
|
||||
@@ -13,6 +13,7 @@ import structlog
|
||||
|
||||
from app.services import history_service
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
|
||||
if TYPE_CHECKING:
|
||||
@@ -31,16 +32,22 @@ HISTORY_SYNC_INTERVAL: int = 300
|
||||
#: Backfill window when archive is empty (seconds).
|
||||
BACKFILL_WINDOW: int = 648000
|
||||
|
||||
#: Maximum seconds to allow for history sync to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 60
|
||||
|
||||
|
||||
async def _run_sync_with_settings(settings: Settings) -> None:
|
||||
socket_path: str = settings.fail2ban_socket
|
||||
|
||||
try:
|
||||
async with task_db(settings) as db:
|
||||
synced = await history_service.sync_from_fail2ban_db(db, socket_path)
|
||||
log.info("history_sync_complete", synced=synced)
|
||||
except Exception:
|
||||
log.exception("history_sync_failed")
|
||||
async def _do_sync() -> None:
|
||||
try:
|
||||
async with task_db(settings) as db:
|
||||
synced = await history_service.sync_from_fail2ban_db(db, socket_path)
|
||||
log.info("history_sync_complete", synced=synced)
|
||||
except Exception:
|
||||
log.exception("history_sync_failed")
|
||||
|
||||
await run_with_timeout("history_sync", _do_sync(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _run_sync(app: FastAPI) -> None:
|
||||
|
||||
@@ -16,6 +16,8 @@ from typing import TYPE_CHECKING
|
||||
|
||||
import structlog
|
||||
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from fastapi import FastAPI
|
||||
|
||||
@@ -29,8 +31,11 @@ RATE_LIMITER_CLEANUP_INTERVAL: int = 30 * 60 # 30 minutes
|
||||
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
|
||||
JOB_ID: str = "rate_limiter_cleanup"
|
||||
|
||||
#: Maximum seconds to allow for rate limiter cleanup to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 5
|
||||
|
||||
def _run_cleanup(app: FastAPI) -> None:
|
||||
|
||||
async def _run_cleanup(app: FastAPI) -> None:
|
||||
"""Trigger cleanup of expired rate-limiter entries.
|
||||
|
||||
Cleans up both the login-specific rate limiter (exponential backoff)
|
||||
@@ -39,23 +44,27 @@ def _run_cleanup(app: FastAPI) -> None:
|
||||
Args:
|
||||
app: The FastAPI application instance (holds the rate limiters).
|
||||
"""
|
||||
login_limiter = getattr(app.state, "login_rate_limiter", None)
|
||||
if login_limiter is None:
|
||||
log.warning(
|
||||
"rate_limiter_cleanup_skipped",
|
||||
reason="login_rate_limiter not found on app.state",
|
||||
)
|
||||
else:
|
||||
login_limiter.cleanup_expired()
|
||||
|
||||
global_limiter = getattr(app.state, "global_rate_limiter", None)
|
||||
if global_limiter is None:
|
||||
log.warning(
|
||||
"rate_limiter_cleanup_skipped",
|
||||
reason="global_rate_limiter not found on app.state",
|
||||
)
|
||||
else:
|
||||
global_limiter.cleanup_expired()
|
||||
async def _do_cleanup() -> None:
|
||||
login_limiter = getattr(app.state, "login_rate_limiter", None)
|
||||
if login_limiter is None:
|
||||
log.warning(
|
||||
"rate_limiter_cleanup_skipped",
|
||||
reason="login_rate_limiter not found on app.state",
|
||||
)
|
||||
else:
|
||||
login_limiter.cleanup_expired()
|
||||
|
||||
global_limiter = getattr(app.state, "global_rate_limiter", None)
|
||||
if global_limiter is None:
|
||||
log.warning(
|
||||
"rate_limiter_cleanup_skipped",
|
||||
reason="global_rate_limiter not found on app.state",
|
||||
)
|
||||
else:
|
||||
global_limiter.cleanup_expired()
|
||||
|
||||
await run_with_timeout("rate_limiter_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
def register(app: FastAPI) -> None:
|
||||
|
||||
@@ -16,6 +16,7 @@ from typing import TYPE_CHECKING
|
||||
import structlog
|
||||
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
from app.utils.scheduler_lock import update_scheduler_lock_heartbeat
|
||||
|
||||
@@ -32,6 +33,9 @@ SCHEDULER_LOCK_HEARTBEAT_INTERVAL: int = 10
|
||||
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
|
||||
JOB_ID: str = "scheduler_lock_heartbeat"
|
||||
|
||||
#: Maximum seconds to allow for scheduler lock heartbeat to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 5
|
||||
|
||||
|
||||
async def _update_heartbeat_with_resources(settings: Settings) -> None:
|
||||
"""Update the scheduler lock heartbeat timestamp.
|
||||
@@ -43,16 +47,20 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:
|
||||
Args:
|
||||
settings: The resolved application settings used for database access.
|
||||
"""
|
||||
async with task_db(settings) as db:
|
||||
success = await update_scheduler_lock_heartbeat(db)
|
||||
|
||||
if success:
|
||||
log.debug("scheduler_lock_heartbeat_updated")
|
||||
else:
|
||||
log.warning(
|
||||
"scheduler_lock_heartbeat_failed",
|
||||
message="Failed to update heartbeat; we may have lost the lock.",
|
||||
)
|
||||
async def _do_update() -> None:
|
||||
async with task_db(settings) as db:
|
||||
success = await update_scheduler_lock_heartbeat(db)
|
||||
|
||||
if success:
|
||||
log.debug("scheduler_lock_heartbeat_updated")
|
||||
else:
|
||||
log.warning(
|
||||
"scheduler_lock_heartbeat_failed",
|
||||
message="Failed to update heartbeat; we may have lost the lock.",
|
||||
)
|
||||
|
||||
await run_with_timeout("scheduler_lock_heartbeat", _do_update(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _update_heartbeat(app: FastAPI) -> None:
|
||||
|
||||
@@ -16,6 +16,7 @@ import structlog
|
||||
|
||||
from app.repositories import session_repo
|
||||
from app.tasks.db import task_db
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
from app.utils.runtime_state import get_effective_settings
|
||||
from app.utils.time_utils import utc_now
|
||||
|
||||
@@ -32,6 +33,9 @@ SESSION_CLEANUP_INTERVAL: int = 6 * 60 * 60 # 6 hours
|
||||
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
|
||||
JOB_ID: str = "session_cleanup"
|
||||
|
||||
#: Maximum seconds to allow for session cleanup to complete.
|
||||
TASK_TIMEOUT_SECONDS: int = 30
|
||||
|
||||
|
||||
async def _run_cleanup_with_resources(settings: Settings) -> None:
|
||||
"""Delete all expired sessions from the database.
|
||||
@@ -39,11 +43,15 @@ async def _run_cleanup_with_resources(settings: Settings) -> None:
|
||||
Args:
|
||||
settings: The resolved application settings used for database access.
|
||||
"""
|
||||
now_iso = utc_now().isoformat()
|
||||
async with task_db(settings) as db:
|
||||
deleted_count = await session_repo.delete_expired_sessions(db, now_iso)
|
||||
|
||||
log.info("session_cleanup_ran", deleted_count=deleted_count, cutoff_time=now_iso)
|
||||
async def _do_cleanup() -> None:
|
||||
now_iso = utc_now().isoformat()
|
||||
async with task_db(settings) as db:
|
||||
deleted_count = await session_repo.delete_expired_sessions(db, now_iso)
|
||||
|
||||
log.info("session_cleanup_ran", deleted_count=deleted_count, cutoff_time=now_iso)
|
||||
|
||||
await run_with_timeout("session_cleanup", _do_cleanup(), TASK_TIMEOUT_SECONDS)
|
||||
|
||||
|
||||
async def _run_cleanup(app: FastAPI) -> None:
|
||||
|
||||
62
backend/app/tasks/timeout_utils.py
Normal file
62
backend/app/tasks/timeout_utils.py
Normal file
@@ -0,0 +1,62 @@
|
||||
"""Timeout protection utilities for background tasks.
|
||||
|
||||
Provides helpers to wrap async task functions with asyncio.wait_for() timeout
|
||||
protection. Ensures tasks complete within bounded time or fail gracefully with
|
||||
proper logging and error handling.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
from collections.abc import Awaitable
|
||||
from typing import TypeVar
|
||||
|
||||
import structlog
|
||||
|
||||
log: structlog.stdlib.BoundLogger = structlog.get_logger()
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
|
||||
async def run_with_timeout(
|
||||
task_name: str,
|
||||
coro: Awaitable[T],
|
||||
timeout_seconds: int,
|
||||
) -> T:
|
||||
"""Run an async coroutine with timeout protection.
|
||||
|
||||
Args:
|
||||
task_name: Human-readable name of the task for logging.
|
||||
coro: The coroutine to execute.
|
||||
timeout_seconds: Maximum seconds to wait before timeout.
|
||||
|
||||
Raises:
|
||||
asyncio.TimeoutError: If the task exceeds the timeout.
|
||||
|
||||
Returns:
|
||||
The return value of the coroutine.
|
||||
"""
|
||||
start_time = time.monotonic()
|
||||
try:
|
||||
result: T = await asyncio.wait_for(coro, timeout=timeout_seconds)
|
||||
elapsed = time.monotonic() - start_time
|
||||
if elapsed > timeout_seconds * 0.8:
|
||||
log.warning(
|
||||
"task_approaching_timeout",
|
||||
task_name=task_name,
|
||||
timeout_seconds=timeout_seconds,
|
||||
elapsed_seconds=round(elapsed, 2),
|
||||
usage_percent=round((elapsed / timeout_seconds) * 100, 1),
|
||||
)
|
||||
return result
|
||||
except TimeoutError:
|
||||
elapsed = time.monotonic() - start_time
|
||||
log.warning(
|
||||
"task_timeout",
|
||||
task_name=task_name,
|
||||
timeout_seconds=timeout_seconds,
|
||||
elapsed_seconds=round(elapsed, 2),
|
||||
)
|
||||
raise
|
||||
|
||||
@@ -305,6 +305,105 @@ class TestImport:
|
||||
assert len(result.results) == 1
|
||||
assert result.results[0].source_url == "https://s1.test/"
|
||||
|
||||
@patch("app.utils.ip_utils.validate_blocklist_url")
|
||||
async def test_import_source_idempotent_on_retry(
|
||||
self, mock_validate: AsyncMock, db: aiosqlite.Connection
|
||||
) -> None:
|
||||
"""Retry of same content skips banning and reuses existing import record."""
|
||||
mock_validate.return_value = None
|
||||
content = "1.2.3.4\n5.6.7.8\n"
|
||||
session = _make_session(content)
|
||||
source = await blocklist_service.create_source(db, "Idempotency Test", "https://t.test/")
|
||||
|
||||
from app.services import ban_service
|
||||
|
||||
ban_count = 0
|
||||
|
||||
async def mock_ban_ip(ip: str, jail: str, socket_path: str) -> None:
|
||||
nonlocal ban_count
|
||||
ban_count += 1
|
||||
|
||||
# First import: should ban 2 IPs
|
||||
with patch("app.services.ban_service.ban_ip", side_effect=mock_ban_ip):
|
||||
result1 = await blocklist_service.import_source(
|
||||
source,
|
||||
session,
|
||||
"/tmp/fake.sock",
|
||||
db,
|
||||
ban_ip=ban_service.ban_ip,
|
||||
)
|
||||
|
||||
assert result1.ips_imported == 2
|
||||
assert ban_count == 2
|
||||
first_ban_count = ban_count
|
||||
|
||||
# Second import with same content: should skip banning
|
||||
session2 = _make_session(content)
|
||||
ban_count = 0
|
||||
with patch("app.services.ban_service.ban_ip", side_effect=mock_ban_ip):
|
||||
result2 = await blocklist_service.import_source(
|
||||
source,
|
||||
session2,
|
||||
"/tmp/fake.sock",
|
||||
db,
|
||||
ban_ip=ban_service.ban_ip,
|
||||
)
|
||||
|
||||
# Should skip banning entirely
|
||||
assert result2.ips_imported == 2
|
||||
assert result2.error is None
|
||||
assert ban_count == 0 # No bans called on retry
|
||||
assert first_ban_count == 2
|
||||
|
||||
@patch("app.utils.ip_utils.validate_blocklist_url")
|
||||
async def test_import_source_different_content_not_reused(
|
||||
self, mock_validate: AsyncMock, db: aiosqlite.Connection
|
||||
) -> None:
|
||||
"""Different content creates new import record, even from same source."""
|
||||
mock_validate.return_value = None
|
||||
source = await blocklist_service.create_source(db, "Different Content Test", "https://t.test/")
|
||||
|
||||
from app.services import ban_service
|
||||
|
||||
ban_count = 0
|
||||
|
||||
async def mock_ban_ip(ip: str, jail: str, socket_path: str) -> None:
|
||||
nonlocal ban_count
|
||||
ban_count += 1
|
||||
|
||||
# First import with 2 IPs
|
||||
content1 = "1.2.3.4\n5.6.7.8\n"
|
||||
session1 = _make_session(content1)
|
||||
with patch("app.services.ban_service.ban_ip", side_effect=mock_ban_ip):
|
||||
result1 = await blocklist_service.import_source(
|
||||
source,
|
||||
session1,
|
||||
"/tmp/fake.sock",
|
||||
db,
|
||||
ban_ip=ban_service.ban_ip,
|
||||
)
|
||||
|
||||
assert result1.ips_imported == 2
|
||||
assert ban_count == 2
|
||||
first_ban_count = ban_count
|
||||
|
||||
# Second import with different content (3 IPs): should ban again
|
||||
content2 = "1.2.3.4\n5.6.7.8\n9.10.11.12\n"
|
||||
session2 = _make_session(content2)
|
||||
ban_count = 0
|
||||
with patch("app.services.ban_service.ban_ip", side_effect=mock_ban_ip):
|
||||
result2 = await blocklist_service.import_source(
|
||||
source,
|
||||
session2,
|
||||
"/tmp/fake.sock",
|
||||
db,
|
||||
ban_ip=ban_service.ban_ip,
|
||||
)
|
||||
|
||||
# Different content means new import
|
||||
assert result2.ips_imported == 3
|
||||
assert result2.error is None
|
||||
assert ban_count == 3 # All 3 IPs banned because content is different
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Schedule
|
||||
|
||||
144
backend/tests/test_tasks/test_timeout_utils.py
Normal file
144
backend/tests/test_tasks/test_timeout_utils.py
Normal file
@@ -0,0 +1,144 @@
|
||||
"""Tests for timeout protection utilities for background tasks.
|
||||
|
||||
Validates that :func:`~app.tasks.timeout_utils.run_with_timeout` correctly
|
||||
enforces timeouts on async tasks and logs appropriate warnings.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
from unittest.mock import AsyncMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
class TestRunWithTimeout:
|
||||
"""Tests for :func:`~app.tasks.timeout_utils.run_with_timeout`."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_completes_quickly(self) -> None:
|
||||
"""``run_with_timeout`` must complete and return the result when task finishes quickly."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _quick_task() -> str:
|
||||
return "success"
|
||||
|
||||
result = await run_with_timeout("test_task", _quick_task(), timeout_seconds=10)
|
||||
assert result == "success"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_raises_timeout_error(self) -> None:
|
||||
"""``run_with_timeout`` must raise TimeoutError when task exceeds timeout."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _slow_task() -> None:
|
||||
await asyncio.sleep(5)
|
||||
|
||||
with pytest.raises(TimeoutError):
|
||||
await run_with_timeout("slow_task", _slow_task(), timeout_seconds=0.1)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_logs_timeout_event(self) -> None:
|
||||
"""``run_with_timeout`` must log a warning when timeout occurs."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _slow_task() -> None:
|
||||
await asyncio.sleep(5)
|
||||
|
||||
with patch("app.tasks.timeout_utils.log") as mock_log:
|
||||
with pytest.raises(TimeoutError):
|
||||
await run_with_timeout("slow_task", _slow_task(), timeout_seconds=0.1)
|
||||
|
||||
# Verify timeout was logged
|
||||
timeout_calls = [
|
||||
c for c in mock_log.warning.call_args_list if c[0][0] == "task_timeout"
|
||||
]
|
||||
assert len(timeout_calls) == 1
|
||||
call_kwargs = timeout_calls[0][1]
|
||||
assert call_kwargs["task_name"] == "slow_task"
|
||||
assert call_kwargs["timeout_seconds"] == 0.1
|
||||
assert call_kwargs["elapsed_seconds"] >= 0.1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_logs_approaching_timeout(self) -> None:
|
||||
"""``run_with_timeout`` must log warning when task uses >80% of timeout."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _medium_task() -> None:
|
||||
await asyncio.sleep(0.25)
|
||||
|
||||
with patch("app.tasks.timeout_utils.log") as mock_log:
|
||||
await run_with_timeout("medium_task", _medium_task(), timeout_seconds=0.3)
|
||||
|
||||
# Verify approaching timeout warning was logged (task used >80% of timeout)
|
||||
approaching_calls = [
|
||||
c
|
||||
for c in mock_log.warning.call_args_list
|
||||
if c[0][0] == "task_approaching_timeout"
|
||||
]
|
||||
assert len(approaching_calls) == 1
|
||||
call_kwargs = approaching_calls[0][1]
|
||||
assert call_kwargs["task_name"] == "medium_task"
|
||||
assert call_kwargs["timeout_seconds"] == 0.3
|
||||
assert call_kwargs["usage_percent"] > 80
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_no_warning_when_well_under_timeout(self) -> None:
|
||||
"""``run_with_timeout`` must not log warning when task completes well before timeout."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _quick_task() -> None:
|
||||
await asyncio.sleep(0.01)
|
||||
|
||||
with patch("app.tasks.timeout_utils.log") as mock_log:
|
||||
await run_with_timeout("quick_task", _quick_task(), timeout_seconds=1.0)
|
||||
|
||||
# Verify no approaching timeout warning was logged
|
||||
approaching_calls = [
|
||||
c
|
||||
for c in mock_log.warning.call_args_list
|
||||
if c[0][0] == "task_approaching_timeout"
|
||||
]
|
||||
assert len(approaching_calls) == 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_logs_elapsed_time(self) -> None:
|
||||
"""``run_with_timeout`` must include elapsed time in timeout log."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _slow_task() -> None:
|
||||
await asyncio.sleep(0.2)
|
||||
|
||||
with patch("app.tasks.timeout_utils.log") as mock_log:
|
||||
with pytest.raises(TimeoutError):
|
||||
await run_with_timeout("slow_task", _slow_task(), timeout_seconds=0.1)
|
||||
|
||||
timeout_calls = [
|
||||
c for c in mock_log.warning.call_args_list if c[0][0] == "task_timeout"
|
||||
]
|
||||
call_kwargs = timeout_calls[0][1]
|
||||
assert "elapsed_seconds" in call_kwargs
|
||||
assert call_kwargs["elapsed_seconds"] >= 0.1
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_returns_correct_type(self) -> None:
|
||||
"""``run_with_timeout`` must preserve the return type of the coroutine."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _task_returns_int() -> int:
|
||||
return 42
|
||||
|
||||
result = await run_with_timeout("int_task", _task_returns_int(), timeout_seconds=10)
|
||||
assert isinstance(result, int)
|
||||
assert result == 42
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_with_timeout_task_exception_propagates(self) -> None:
|
||||
"""``run_with_timeout`` must propagate exceptions from the task (not timeout)."""
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _failing_task() -> None:
|
||||
raise ValueError("Task failed")
|
||||
|
||||
with pytest.raises(ValueError, match="Task failed"):
|
||||
await run_with_timeout("failing_task", _failing_task(), timeout_seconds=10)
|
||||
Reference in New Issue
Block a user