Refactor scheduler lock implementation with heartbeat mechanism

- Add heartbeat-based lock renewal in scheduler_lock_heartbeat.py
- Update scheduler_lock.py with improved lock management
- Add comprehensive tests for scheduler lock functionality
- Update deployment and task documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-04-30 22:10:38 +02:00
parent f9e283541b
commit 05c3b564ae
5 changed files with 163 additions and 55 deletions

View File

@@ -18,7 +18,65 @@ If fail2ban goes offline but the backend always returns 200, Docker treats the c
--- ---
## Resource Allocation ## Scheduler Lock
In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring only one instance runs the scheduler at a time.
### How It Works
The lock is stored in the SQLite database and enforced via:
1. **Lock Acquisition** — At startup, each instance tries to insert a lock record. Only one succeeds; others reject startup with a clear error message.
2. **Heartbeat** — The lock-holding instance sends a heartbeat every 5 seconds to prove it's still alive.
3. **Stale Lock Cleanup** — On startup, any lock older than 60 seconds (without a heartbeat) is automatically deleted, allowing recovery from instance crashes.
### Configuration
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Heartbeat Interval** | 5 seconds | Allows ~12 missed heartbeats before lock expires |
| **Lock TTL** | 60 seconds | Time before a lock without heartbeat is considered abandoned |
| **Min Safe Ratio** | 12x (TTL / interval) | Robust protection against temporary delays or high load |
With a 60-second TTL and 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This provides strong protection against false positives while still detecting genuine crashes.
### Monitoring
Check logs for these key events:
- `scheduler_lock_acquired` — Lock successfully acquired at startup (INFO)
- `scheduler_lock_heartbeat_updated` — Heartbeat successfully updated (DEBUG)
- `scheduler_lock_heartbeat_failed` — Heartbeat update failed; lock may be lost (WARNING)
- `scheduler_lock_heartbeat_timeout` — Heartbeat exceeded 5-second timeout (ERROR)
- `scheduler_lock_held_by_other_instance` — Another instance holds the lock (WARNING at startup)
### Troubleshooting: "Blocklist import runs twice"
**Symptom:** Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption.
**Cause:** The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running.
**Solution:**
1. **Check heartbeat timing** — Ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk).
2. **Verify database health** — Run `SELECT * FROM scheduler_lock;` to see if a stale lock exists. If present, delete it: `DELETE FROM scheduler_lock;`
3. **Review logs** — Look for `scheduler_lock_heartbeat_failed` or `scheduler_lock_heartbeat_timeout` errors in the time window when duplication occurred.
4. **Increase resource limits** — If the backend is memory/CPU constrained, increase limits in `docker-compose.yml` to prevent slowdowns that trigger false lock timeouts.
5. **Check database performance** — Slow database queries can delay heartbeat updates. Run `PRAGMA integrity_check;` to check for corruption.
If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability.
### Advanced: Migrating to Redis
For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis:
- **Why:** Redis is single-threaded and atomic by design; clock skew and timeout issues are eliminated.
- **How:** Install `redlock-py` or `aioredis`, replace `scheduler_lock.py` with a Redis implementation, update heartbeat interval to 2-3 seconds.
- **Trade-off:** Adds a Redis dependency but eliminates database lock contention and provides microsecond-precision atomicity.
This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs.
---
All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures: All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:
- **Isolation**: A misbehaving container cannot crash others or the host - **Isolation**: A misbehaving container cannot crash others or the host

View File

@@ -1,50 +1,3 @@
## [IMPORTANT] Database transactions lack explicit isolation
**Where found**
- `backend/app/repositories/session_repo.py:40-60` — multiple queries without `BEGIN TRANSACTION`
- Similar pattern in multi-step operations across repositories
**Why this is needed**
Without explicit boundaries, concurrent requests can race: Thread A checks if exists → not found, Thread B checks same → not found, Thread A inserts → succeeds, Thread B inserts → duplicate error or silent overwrite.
**Goal**
Wrap all multi-step operations in explicit transactions with appropriate isolation level.
**What to do**
1. Use explicit `BEGIN IMMEDIATE` transaction:
```python
await db.execute("BEGIN IMMEDIATE")
try:
await db.execute("INSERT INTO sessions ...")
await db.commit()
except Exception:
await db.rollback()
raise
```
2. Use `IMMEDIATE` mode to lock immediately for writes
3. Document transaction boundaries clearly
**Possible traps and issues**
- Nested transactions (SAVEPOINTs) may be needed
- Locks held too long cause contention
- Deadlocks possible with concurrent writers
**Docs changes needed**
- Add section in `Docs/Backend-Development.md` § Database Transactions
**Doc references**
- `Docs/Backend-Development.md` (database design)
---
## [IMPORTANT] Scheduler lock race condition ## [IMPORTANT] Scheduler lock race condition
**Where found** **Where found**

View File

@@ -27,8 +27,9 @@ if TYPE_CHECKING:
log: structlog.stdlib.BoundLogger = structlog.get_logger() log: structlog.stdlib.BoundLogger = structlog.get_logger()
#: How often the heartbeat job fires (seconds). Must be less than the lock TTL. #: How often the heartbeat job fires (seconds). Must be significantly less than
SCHEDULER_LOCK_HEARTBEAT_INTERVAL: int = 10 #: the lock TTL to allow multiple missed heartbeats before lock expiry.
SCHEDULER_LOCK_HEARTBEAT_INTERVAL: int = 5
#: Stable APScheduler job ID — ensures re-registration replaces, not duplicates. #: Stable APScheduler job ID — ensures re-registration replaces, not duplicates.
JOB_ID: str = "scheduler_lock_heartbeat" JOB_ID: str = "scheduler_lock_heartbeat"
@@ -44,6 +45,10 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:
a warning but don't crash the scheduler. This allows the running a warning but don't crash the scheduler. This allows the running
application to continue even if something went wrong. application to continue even if something went wrong.
The heartbeat must complete within TASK_TIMEOUT_SECONDS to prevent
scheduler starvation. If it exceeds this timeout, a warning is logged
and the task is cancelled.
Args: Args:
settings: The resolved application settings used for database access. settings: The resolved application settings used for database access.
""" """
@@ -57,10 +62,24 @@ async def _update_heartbeat_with_resources(settings: Settings) -> None:
else: else:
log.warning( log.warning(
"scheduler_lock_heartbeat_failed", "scheduler_lock_heartbeat_failed",
message="Failed to update heartbeat; we may have lost the lock.", message="Failed to update heartbeat; we no longer hold the lock. "
"Another instance may have taken over or the database connection failed.",
) )
await run_with_timeout("scheduler_lock_heartbeat", _do_update(), TASK_TIMEOUT_SECONDS) try:
await run_with_timeout("scheduler_lock_heartbeat", _do_update(), TASK_TIMEOUT_SECONDS)
except TimeoutError:
log.error(
"scheduler_lock_heartbeat_timeout",
timeout_seconds=TASK_TIMEOUT_SECONDS,
message="Heartbeat update exceeded timeout. The database may be slow or unresponsive.",
)
except Exception as e:
log.error(
"scheduler_lock_heartbeat_error",
error=str(e),
message="Unexpected error during heartbeat update.",
)
async def _update_heartbeat(app: FastAPI) -> None: async def _update_heartbeat(app: FastAPI) -> None:

View File

@@ -51,11 +51,16 @@ log: structlog.stdlib.BoundLogger = structlog.get_logger()
# Lock record expires if heartbeat hasn't been updated for this many seconds. # Lock record expires if heartbeat hasn't been updated for this many seconds.
# This prevents stale locks from a crashed instance from blocking new startups. # This prevents stale locks from a crashed instance from blocking new startups.
# Set conservatively to allow temporary delays (e.g., high load) before considering
# the lock abandoned.
SCHEDULER_LOCK_TTL_SECONDS: int = 60 SCHEDULER_LOCK_TTL_SECONDS: int = 60
# Heartbeat interval: how often to update the lock's heartbeat_at timestamp. # Heartbeat interval: how often to update the lock's heartbeat_at timestamp.
# Must be less than TTL to prevent premature expiration. # Must be significantly less than TTL (at least 3-4x smaller) to allow multiple
SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS: int = 10 # consecutive missed heartbeats before the lock is considered stale.
# With TTL=60s and interval=5s, the lock survives ~12 missed heartbeats before
# expiring, providing robust protection against temporary delays.
SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS: int = 5
async def init_scheduler_lock_table(db: aiosqlite.Connection) -> None: async def init_scheduler_lock_table(db: aiosqlite.Connection) -> None:

View File

@@ -2,7 +2,7 @@
These tests verify that the database-backed scheduler lock correctly enforces These tests verify that the database-backed scheduler lock correctly enforces
single-executor safety across multiple startup attempts, including stale lock single-executor safety across multiple startup attempts, including stale lock
cleanup and heartbeat updates. cleanup, heartbeat updates, and multi-process race condition prevention.
""" """
from __future__ import annotations from __future__ import annotations
@@ -15,6 +15,7 @@ import aiosqlite
import pytest import pytest
from app.utils.scheduler_lock import ( from app.utils.scheduler_lock import (
SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS,
SCHEDULER_LOCK_TTL_SECONDS, SCHEDULER_LOCK_TTL_SECONDS,
acquire_scheduler_lock, acquire_scheduler_lock,
get_scheduler_lock_info, get_scheduler_lock_info,
@@ -220,3 +221,75 @@ async def test_scheduler_lock_full_lifecycle(
await release_scheduler_lock(lock_db) await release_scheduler_lock(lock_db)
info = await get_scheduler_lock_info(lock_db) info = await get_scheduler_lock_info(lock_db)
assert info is None assert info is None
@pytest.mark.asyncio
async def test_scheduler_lock_heartbeat_interval_sanity(
lock_db: aiosqlite.Connection,
) -> None:
"""Verify heartbeat interval is less than TTL to prevent premature expiry.
With a 5-second heartbeat interval and 60-second TTL, the lock can survive
~12 missed heartbeats before expiring. This provides robust protection against
temporary delays or high load that could cause a single missed heartbeat.
"""
# Verify the configuration ratio is safe (interval < TTL)
assert SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS < SCHEDULER_LOCK_TTL_SECONDS
# With this ratio, the lock can survive at least 12 missed heartbeats
# (60s TTL / 5s interval = 12 intervals between heartbeats before expiry)
safe_ratio = SCHEDULER_LOCK_TTL_SECONDS / SCHEDULER_LOCK_HEARTBEAT_INTERVAL_SECONDS
assert safe_ratio >= 12, (
f"Heartbeat interval too long: lock can only survive {safe_ratio:.1f} missed heartbeats. "
f"Should be at least 12 for safety."
)
@pytest.mark.asyncio
async def test_scheduler_lock_race_condition_prevention(
lock_db: aiosqlite.Connection,
) -> None:
"""Test that the lock prevents concurrent execution (race condition).
Scenario: Process A acquires the lock and starts working. Process B starts
up and tries to acquire the lock. Even if Process A's heartbeat fails
momentarily, Process B should not acquire the lock immediately.
This test verifies:
1. Only one process can hold the lock at a time
2. The lock cannot be stolen while being actively maintained (via heartbeat)
3. Stale locks are only cleaned after TTL expires
"""
# Process A acquires the lock
result_a = await acquire_scheduler_lock(lock_db)
assert result_a is True
# Get the lock info
info_a = await get_scheduler_lock_info(lock_db)
assert info_a is not None
lock_heartbeat_a = info_a["heartbeat_at"]
# Process B tries to acquire — should fail
result_b = await acquire_scheduler_lock(lock_db)
assert result_b is False
# Process A updates its heartbeat (simulating ongoing work)
time.sleep(0.01)
result_heartbeat = await update_scheduler_lock_heartbeat(lock_db)
assert result_heartbeat is True
# Verify heartbeat was updated
info_a_updated = await get_scheduler_lock_info(lock_db)
assert info_a_updated is not None
assert info_a_updated["heartbeat_at"] > lock_heartbeat_a
# Process B still cannot acquire the lock (it's active and well-maintained)
result_b_retry = await acquire_scheduler_lock(lock_db)
assert result_b_retry is False
# Process A releases the lock
await release_scheduler_lock(lock_db)
# Now Process B can acquire it
result_b_final = await acquire_scheduler_lock(lock_db)
assert result_b_final is True