diff --git a/Docs/Deployment.md b/Docs/Deployment.md index 86ad686..d9d733c 100644 --- a/Docs/Deployment.md +++ b/Docs/Deployment.md @@ -18,7 +18,65 @@ If fail2ban goes offline but the backend always returns 200, Docker treats the c --- -## Resource Allocation +## Scheduler Lock + +In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring only one instance runs the scheduler at a time. + +### How It Works + +The lock is stored in the SQLite database and enforced via: + +1. **Lock Acquisition** — At startup, each instance tries to insert a lock record. Only one succeeds; others reject startup with a clear error message. +2. **Heartbeat** — The lock-holding instance sends a heartbeat every 5 seconds to prove it's still alive. +3. **Stale Lock Cleanup** — On startup, any lock older than 60 seconds (without a heartbeat) is automatically deleted, allowing recovery from instance crashes. + +### Configuration + +| Parameter | Value | Rationale | +|-----------|-------|-----------| +| **Heartbeat Interval** | 5 seconds | Allows ~12 missed heartbeats before lock expires | +| **Lock TTL** | 60 seconds | Time before a lock without heartbeat is considered abandoned | +| **Min Safe Ratio** | 12x (TTL / interval) | Robust protection against temporary delays or high load | + +With a 60-second TTL and 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This provides strong protection against false positives while still detecting genuine crashes. + +### Monitoring + +Check logs for these key events: + +- `scheduler_lock_acquired` — Lock successfully acquired at startup (INFO) +- `scheduler_lock_heartbeat_updated` — Heartbeat successfully updated (DEBUG) +- `scheduler_lock_heartbeat_failed` — Heartbeat update failed; lock may be lost (WARNING) +- `scheduler_lock_heartbeat_timeout` — Heartbeat exceeded 5-second timeout (ERROR) +- `scheduler_lock_held_by_other_instance` — Another instance holds the lock (WARNING at startup) + +### Troubleshooting: "Blocklist import runs twice" + +**Symptom:** Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption. + +**Cause:** The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running. + +**Solution:** + +1. **Check heartbeat timing** — Ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk). +2. **Verify database health** — Run `SELECT * FROM scheduler_lock;` to see if a stale lock exists. If present, delete it: `DELETE FROM scheduler_lock;` +3. **Review logs** — Look for `scheduler_lock_heartbeat_failed` or `scheduler_lock_heartbeat_timeout` errors in the time window when duplication occurred. +4. **Increase resource limits** — If the backend is memory/CPU constrained, increase limits in `docker-compose.yml` to prevent slowdowns that trigger false lock timeouts. +5. **Check database performance** — Slow database queries can delay heartbeat updates. Run `PRAGMA integrity_check;` to check for corruption. + +If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability. + +### Advanced: Migrating to Redis + +For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis: + +- **Why:** Redis is single-threaded and atomic by design; clock skew and timeout issues are eliminated. +- **How:** Install `redlock-py` or `aioredis`, replace `scheduler_lock.py` with a Redis implementation, update heartbeat interval to 2-3 seconds. +- **Trade-off:** Adds a Redis dependency but eliminates database lock contention and provides microsecond-precision atomicity. + +This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs. + +--- All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures: - **Isolation**: A misbehaving container cannot crash others or the host diff --git a/Docs/Tasks.md b/Docs/Tasks.md index 81766c3..b628541 100644 --- a/Docs/Tasks.md +++ b/Docs/Tasks.md @@ -1,50 +1,3 @@ -## [IMPORTANT] Database transactions lack explicit isolation - -**Where found** - -- `backend/app/repositories/session_repo.py:40-60` — multiple queries without `BEGIN TRANSACTION` -- Similar pattern in multi-step operations across repositories - -**Why this is needed** - -Without explicit boundaries, concurrent requests can race: Thread A checks if exists → not found, Thread B checks same → not found, Thread A inserts → succeeds, Thread B inserts → duplicate error or silent overwrite. - -**Goal** - -Wrap all multi-step operations in explicit transactions with appropriate isolation level. - -**What to do** - -1. Use explicit `BEGIN IMMEDIATE` transaction: - ```python - await db.execute("BEGIN IMMEDIATE") - try: - await db.execute("INSERT INTO sessions ...") - await db.commit() - except Exception: - await db.rollback() - raise - ``` - -2. Use `IMMEDIATE` mode to lock immediately for writes -3. Document transaction boundaries clearly - -**Possible traps and issues** - -- Nested transactions (SAVEPOINTs) may be needed -- Locks held too long cause contention -- Deadlocks possible with concurrent writers - -**Docs changes needed** - -- Add section in `Docs/Backend-Development.md` § Database Transactions - -**Doc references** - -- `Docs/Backend-Development.md` (database design) - ---- - ## [IMPORTANT] Scheduler lock race condition **Where found** diff --git a/backend/app/tasks/scheduler_lock_heartbeat.py b/backend/app/tasks/scheduler_lock_heartbeat.py index e23fac3..9ea1c24 100644 --- a/backend/app/tasks/scheduler_lock_heartbeat.py +++ b/backend/app/tasks/scheduler_lock_heartbeat.py @@ -27,8 +27,9 @@ if TYPE_CHECKING: log: structlog.stdlib.BoundLogger = structlog.get_logger() -#: How often the heartbeat job fires (seconds). Must be less than the lock TTL. -SCHEDULER_LOCK_HEARTBEAT_INTERVAL: int = 10 +#: How often the heartbeat job fires (seconds). Must be significantly less than +#: the lock TTL to allow multiple missed heartbeats before lock expiry. +SCHEDULER_LOCK_HEARTBEAT_INTERVAL: int = 5 #: Stable APScheduler job ID — ensures re-registration replaces, not duplicates. JOB_ID: str = "scheduler_lock_heartbeat" @@ -44,6 +45,10 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None: a warning but don't crash the scheduler. This allows the running application to continue even if something went wrong. + The heartbeat must complete within TASK_TIMEOUT_SECONDS to prevent + scheduler starvation. If it exceeds this timeout, a warning is logged + and the task is cancelled. + Args: settings: The resolved application settings used for database access. """ @@ -57,10 +62,24 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None: else: log.warning( "scheduler_lock_heartbeat_failed", - message="Failed to update heartbeat; we may have lost the lock.", + message="Failed to update heartbeat; we no longer hold the lock. " + "Another instance may have taken over or the database connection failed.", ) - await run_with_timeout("scheduler_lock_heartbeat", _do_update(), TASK_TIMEOUT_SECONDS) + try: + await run_with_timeout("scheduler_lock_heartbeat", _do_update(), TASK_TIMEOUT_SECONDS) + except TimeoutError: + log.error( + "scheduler_lock_heartbeat_timeout", + timeout_seconds=TASK_TIMEOUT_SECONDS, + message="Heartbeat update exceeded timeout. The database may be slow or unresponsive.", + ) + except Exception as e: + log.error( + "scheduler_lock_heartbeat_error", + error=str(e), + message="Unexpected error during heartbeat update.", + ) async def _update_heartbeat(app: FastAPI) -> None: diff --git a/backend/app/utils/scheduler_lock.py b/backend/app/utils/scheduler_lock.py index a610858..cb6f104 100644 --- a/backend/app/utils/scheduler_lock.py +++ b/backend/app/utils/scheduler_lock.py @@ -51,11 +51,16 @@ log: structlog.stdlib.BoundLogger = structlog.get_logger() # Lock record expires if heartbeat hasn't been updated for this many seconds. # This prevents stale locks from a crashed instance from blocking new startups. +# Set conservatively to allow temporary delays (e.g., high load) before considering +# the lock abandoned. SCHEDULER_LOCK_TTL_SECONDS: int = 60 # Heartbeat interval: how often to update the lock's heartbeat_at timestamp. -# Must be less than TTL to prevent premature expiration. -SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS: int = 10 +# Must be significantly less than TTL (at least 3-4x smaller) to allow multiple +# consecutive missed heartbeats before the lock is considered stale. +# With TTL=60s and interval=5s, the lock survives ~12 missed heartbeats before +# expiring, providing robust protection against temporary delays. +SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS: int = 5 async def init_scheduler_lock_table(db: aiosqlite.Connection) -> None: diff --git a/backend/tests/test_scheduler_lock.py b/backend/tests/test_scheduler_lock.py index e36769f..da0133a 100644 --- a/backend/tests/test_scheduler_lock.py +++ b/backend/tests/test_scheduler_lock.py @@ -2,7 +2,7 @@ These tests verify that the database-backed scheduler lock correctly enforces single-executor safety across multiple startup attempts, including stale lock -cleanup and heartbeat updates. +cleanup, heartbeat updates, and multi-process race condition prevention. """ from __future__ import annotations @@ -15,6 +15,7 @@ import aiosqlite import pytest from app.utils.scheduler_lock import ( + SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS, SCHEDULER_LOCK_TTL_SECONDS, acquire_scheduler_lock, get_scheduler_lock_info, @@ -220,3 +221,75 @@ async def test_scheduler_lock_full_lifecycle( await release_scheduler_lock(lock_db) info = await get_scheduler_lock_info(lock_db) assert info is None + + +@pytest.mark.asyncio +async def test_scheduler_lock_heartbeat_interval_sanity( + lock_db: aiosqlite.Connection, +) -> None: + """Verify heartbeat interval is less than TTL to prevent premature expiry. + + With a 5-second heartbeat interval and 60-second TTL, the lock can survive + ~12 missed heartbeats before expiring. This provides robust protection against + temporary delays or high load that could cause a single missed heartbeat. + """ + # Verify the configuration ratio is safe (interval < TTL) + assert SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS < SCHEDULER_LOCK_TTL_SECONDS + + # With this ratio, the lock can survive at least 12 missed heartbeats + # (60s TTL / 5s interval = 12 intervals between heartbeats before expiry) + safe_ratio = SCHEDULER_LOCK_TTL_SECONDS / SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS + assert safe_ratio >= 12, ( + f"Heartbeat interval too long: lock can only survive {safe_ratio:.1f} missed heartbeats. " + f"Should be at least 12 for safety." + ) + + +@pytest.mark.asyncio +async def test_scheduler_lock_race_condition_prevention( + lock_db: aiosqlite.Connection, +) -> None: + """Test that the lock prevents concurrent execution (race condition). + + Scenario: Process A acquires the lock and starts working. Process B starts + up and tries to acquire the lock. Even if Process A's heartbeat fails + momentarily, Process B should not acquire the lock immediately. + + This test verifies: + 1. Only one process can hold the lock at a time + 2. The lock cannot be stolen while being actively maintained (via heartbeat) + 3. Stale locks are only cleaned after TTL expires + """ + # Process A acquires the lock + result_a = await acquire_scheduler_lock(lock_db) + assert result_a is True + + # Get the lock info + info_a = await get_scheduler_lock_info(lock_db) + assert info_a is not None + lock_heartbeat_a = info_a["heartbeat_at"] + + # Process B tries to acquire — should fail + result_b = await acquire_scheduler_lock(lock_db) + assert result_b is False + + # Process A updates its heartbeat (simulating ongoing work) + time.sleep(0.01) + result_heartbeat = await update_scheduler_lock_heartbeat(lock_db) + assert result_heartbeat is True + + # Verify heartbeat was updated + info_a_updated = await get_scheduler_lock_info(lock_db) + assert info_a_updated is not None + assert info_a_updated["heartbeat_at"] > lock_heartbeat_a + + # Process B still cannot acquire the lock (it's active and well-maintained) + result_b_retry = await acquire_scheduler_lock(lock_db) + assert result_b_retry is False + + # Process A releases the lock + await release_scheduler_lock(lock_db) + + # Now Process B can acquire it + result_b_final = await acquire_scheduler_lock(lock_db) + assert result_b_final is True