Make background tasks idempotent - prevent duplicate bans on retry
CRITICAL FIX: Background tasks (especially blocklist_import) crashed mid-execution, leaving partial state. On retry, the same bans were applied again, causing duplicates. Solution: Content-hash based operation tracking for blocklist imports: - Added import_runs table (migration 6) to track operations by source + content hash - Before banning, check if this exact content has already been imported - If completed: skip banning (already done), optionally re-warm cache - If new or failed: proceed with ban and mark as completed or failed Changes: - Database: Migration 6 adds import_runs table with operation state tracking - Model: Added ImportRunEntry for import run records - Repository: New import_run_repo module with CRUD operations - Workflow: Updated blocklist_import_workflow to check operation history before banning - Dependencies: Registered import_run_repo for dependency injection - Tests: Added test_import_source_idempotent_on_retry and test_import_source_different_content_not_reused - Documentation: Added Task Idempotency section to Backend-Development.md Verification: - All 7 import tests pass (5 existing + 2 new idempotency tests) - Type checking: mypy --strict ✅ - Linting: ruff ✅ - No API changes, backwards compatible via automatic migration Fixes: Background tasks not idempotent #CRITICAL Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
@@ -1682,9 +1682,164 @@ Since tasks do not have access to `Depends(get_db)` (no request scope), they mus
|
||||
- **Startup validation:** `startup_shared_resources()` raises `RuntimeError` if `BANGUI_WORKERS > 1`.
|
||||
- See [Architekture.md § 9.2](Architekture.md) for full details.
|
||||
|
||||
### Timeout Protection for Background Tasks
|
||||
|
||||
**All background tasks must wrap their async work with timeout protection.** If a task hangs (API unreachable, network partition, database lock), it runs forever — never completes → lock never released → duplicate work starts → resource exhaustion. Timeouts prevent this.
|
||||
|
||||
**Rule:** Every task function must use `run_with_timeout()` from `app.tasks.timeout_utils` to enforce a timeout on its async work.
|
||||
|
||||
```python
|
||||
from app.tasks.timeout_utils import run_with_timeout
|
||||
|
||||
async def _run_import_with_resources(settings: Settings, http_session: ClientSession) -> None:
|
||||
"""Imports blocklists with timeout protection."""
|
||||
async def _do_import() -> None:
|
||||
# ... your async work ...
|
||||
result = await blocklist_service.import_all(...)
|
||||
log.info("import_finished", total=result.total_imported)
|
||||
|
||||
# Wrap with timeout: abort after 300 seconds
|
||||
await run_with_timeout("blocklist_import", _do_import(), timeout_seconds=300)
|
||||
```
|
||||
|
||||
**Why this pattern:**
|
||||
1. `run_with_timeout()` enforces strict time limits using `asyncio.wait_for()`.
|
||||
2. If timeout is exceeded, `TimeoutError` is raised and logged with elapsed time.
|
||||
3. If task approaches timeout (>80% of time budget), a warning is logged for observability.
|
||||
4. Failures are logged at `warning` level (not `error`) — timeouts are expected sometimes, but worth investigating.
|
||||
|
||||
**Timeout Values by Task:**
|
||||
|
||||
| Task | Timeout | Rationale |
|
||||
|------|---------|-----------|
|
||||
| `blocklist_import` | 300s (5 min) | Downloads, validates, applies external lists. Network delays expected. |
|
||||
| `health_check` | 10s | Socket probe to fail2ban. Should complete quickly or fail2ban is unresponsive. |
|
||||
| `geo_cache_flush` | 60s | Writes dirty cache entries to DB. Handles contention gracefully. |
|
||||
| `session_cleanup` | 30s | Deletes expired sessions. DB contention unlikely but possible. |
|
||||
| `rate_limiter_cleanup` | 5s | In-memory cleanup, no I/O. Should always be instant. |
|
||||
| `geo_cache_cleanup` | 60s | Deletes stale geo entries from DB. May scan large table. |
|
||||
| `geo_re_resolve` | 120s | Retries failed IP lookups with backoff. API rate-limit delays expected. |
|
||||
| `history_sync` | 60s | Syncs records from fail2ban DB to archive. May read/write many rows. |
|
||||
| `scheduler_lock_heartbeat` | 5s | Updates lock timestamp. Must be quick or lock is lost. |
|
||||
|
||||
**Timeout Events Are Logged:**
|
||||
|
||||
On timeout:
|
||||
```
|
||||
task_timeout task_name=blocklist_import timeout_seconds=300 elapsed_seconds=300.45
|
||||
```
|
||||
|
||||
On approaching timeout (>80% of budget used):
|
||||
```
|
||||
task_approaching_timeout task_name=blocklist_import timeout_seconds=300 elapsed_seconds=298.5 usage_percent=99.5
|
||||
```
|
||||
|
||||
The logs include `elapsed_seconds` for observability — if you see tasks consistently near timeout, the value may need adjustment.
|
||||
|
||||
**Testing Timeout Behavior:**
|
||||
|
||||
Tests for timeout scenarios are in `backend/tests/test_tasks/test_timeout_utils.py`:
|
||||
- Verify timeout is raised and logged.
|
||||
- Verify approaching-timeout warning is logged.
|
||||
- Verify task exceptions (not timeout) propagate correctly.
|
||||
|
||||
Add timeout tests to your task test file:
|
||||
```python
|
||||
@pytest.mark.asyncio
|
||||
async def test_task_timeout_is_logged(self) -> None:
|
||||
"""Task must be logged and raise TimeoutError on timeout."""
|
||||
with patch("app.tasks.my_task.log") as mock_log:
|
||||
with pytest.raises(TimeoutError):
|
||||
await my_task._run_with_resources(settings) # exceeds timeout
|
||||
|
||||
timeout_calls = [
|
||||
c for c in mock_log.warning.call_args_list
|
||||
if c[0][0] == "task_timeout"
|
||||
]
|
||||
assert len(timeout_calls) == 1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task Idempotency
|
||||
|
||||
**Background tasks must be idempotent** — retrying after a crash must produce the same result as running once.
|
||||
|
||||
If a task crashes or times out mid-execution, the scheduler may retry. Without idempotency, retries cause duplicate work:
|
||||
- **blocklist_import**: banned IPs appear twice → database corruption
|
||||
- **geo_cache_flush**: entries written twice → cache inconsistency
|
||||
- Any multi-step operation: partial state remains
|
||||
|
||||
**Pattern: Content-Hash Idempotency for Blocklist Imports**
|
||||
|
||||
Track imports by source + content hash to detect retries:
|
||||
|
||||
```python
|
||||
from app.repositories import import_run_repo
|
||||
|
||||
async def import_source(source, db, ...):
|
||||
# Download content
|
||||
status, content = await downloader.download(url)
|
||||
|
||||
# Compute hash for idempotency detection
|
||||
content_hash = hashlib.sha256(content.encode()).hexdigest()
|
||||
|
||||
# Check if this exact import already completed
|
||||
existing_run = await import_run_repo.get_by_source_and_hash(
|
||||
db, source.id, content_hash
|
||||
)
|
||||
|
||||
if existing_run and existing_run.status == "completed":
|
||||
# Already done — skip banning, optionally re-warm cache
|
||||
log.info("blocklist_import_already_completed", ...)
|
||||
return ImportSourceResult(ips_imported=existing_run.imported_count, ...)
|
||||
|
||||
# First run: create pending record
|
||||
if not existing_run:
|
||||
run_id = await import_run_repo.create_pending(
|
||||
db, source.id, content_hash
|
||||
)
|
||||
else:
|
||||
run_id = existing_run.id # Retry case
|
||||
|
||||
# Do work (ban IPs, etc.)
|
||||
imported, errors = await ban_executor.ban_ips(...)
|
||||
|
||||
# Mark as completed or failed (atomically)
|
||||
if errors:
|
||||
await import_run_repo.mark_failed(db, run_id, str(errors))
|
||||
else:
|
||||
await import_run_repo.mark_completed(db, run_id, imported, skipped)
|
||||
```
|
||||
|
||||
**Key points:**
|
||||
|
||||
1. **Operation ID must be deterministic** — Use content hash, not timestamp
|
||||
- Same content = same operation ID → retry safe
|
||||
- Different content = different operation ID → new import run
|
||||
|
||||
2. **Check before doing work** — Query `import_runs` table before banning
|
||||
- If completed: skip banning (already done)
|
||||
- If pending: retry was interrupted, try again
|
||||
- If failed: retry to recover
|
||||
|
||||
3. **Atomic state updates** — Mark as completed AFTER all work succeeds
|
||||
- All-or-nothing: either import succeeded + logged, or failed + retryable
|
||||
|
||||
4. **Test idempotency** — Verify retrying same content doesn't duplicate bans
|
||||
```python
|
||||
# First import: ban 2 IPs
|
||||
result1 = await import_source(source, content, db)
|
||||
assert result1.ips_imported == 2
|
||||
|
||||
# Second import (same content): skip bans
|
||||
result2 = await import_source(source, content, db)
|
||||
assert result2.ips_imported == 2
|
||||
assert ban_ip.call_count == 2 # Only called once, not twice
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 10. Code Style & Tooling
|
||||
|
||||
| Tool | Purpose |
|
||||
|---|---|
|
||||
|
||||
Reference in New Issue
Block a user