Refactor scheduler lock implementation with heartbeat mechanism
- Add heartbeat-based lock renewal in scheduler_lock_heartbeat.py
- Update scheduler_lock.py with improved lock management
- Add comprehensive tests for scheduler lock functionality
- Update deployment and task documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@@ -18,7 +18,65 @@ If fail2ban goes offline but the backend always returns 200, Docker treats the c
---

## Resource Allocation

## Scheduler Lock

In multi-instance deployments (e.g., Kubernetes, Docker Swarm), the scheduler lock prevents duplicate execution of background tasks by ensuring that only one instance runs the scheduler at a time.

### How It Works

The lock is stored in the SQLite database and enforced through three mechanisms:

1. **Lock Acquisition**: At startup, each instance tries to insert a lock record. Only one succeeds; the others abort startup with a clear error message.
2. **Heartbeat**: The lock-holding instance sends a heartbeat every 5 seconds to prove it is still alive.
3. **Stale Lock Cleanup**: On startup, any lock whose last heartbeat is more than 60 seconds old is deleted automatically, which allows recovery after an instance crash.

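The three steps above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the code in `scheduler_lock.py`: the table layout (`id`, `owner`, `heartbeat_at`) and the function names are assumptions made for the example.

```python
import sqlite3
import time

LOCK_TTL = 60  # seconds; matches the stale-lock threshold described above


def init_schema(conn):
    # Hypothetical schema: a single-row table, enforced by CHECK (id = 1).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " owner TEXT NOT NULL,"
        " heartbeat_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn, owner, now=None):
    """Return True if `owner` holds the lock after this call."""
    now = time.time() if now is None else now
    with conn:  # commits on normal exit, rolls back on an exception
        # Stale lock cleanup: drop a lock whose heartbeat is older than the TTL.
        conn.execute(
            "DELETE FROM scheduler_lock WHERE heartbeat_at < ?",
            (now - LOCK_TTL,),
        )
        try:
            # Lock acquisition: the primary key lets only one insert succeed.
            conn.execute(
                "INSERT INTO scheduler_lock (id, owner, heartbeat_at)"
                " VALUES (1, ?, ?)",
                (owner, now),
            )
        except sqlite3.IntegrityError:
            return False
    return True


def heartbeat(conn, owner, now=None):
    """Refresh the heartbeat; return False if the lock is no longer ours."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE scheduler_lock SET heartbeat_at = ?"
            " WHERE id = 1 AND owner = ?",
            (now, owner),
        )
    return cur.rowcount == 1
```

Passing `now` explicitly keeps the sketch deterministic in tests; a real caller would rely on the default and invoke `heartbeat` every 5 seconds.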
### Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| **Heartbeat Interval** | 5 seconds | The TTL spans 12 heartbeat intervals, so about 11 consecutive heartbeats can be missed before the lock expires |
| **Lock TTL** | 60 seconds | Time after which a lock without a heartbeat is considered abandoned |
| **Min Safe Ratio** | 12x (TTL / interval) | Robust protection against temporary delays or high load |

With a 60-second TTL and a 5-second heartbeat interval, the lock survives even if the instance becomes unresponsive for up to ~55 seconds. This guards against false positives while still detecting genuine crashes.

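The arithmetic behind these values can be checked with a few lines of Python. The function names are illustrative only and do not come from the codebase.

```python
import math

HEARTBEAT_INTERVAL_S = 5
LOCK_TTL_S = 60


def safety_ratio(ttl_s, interval_s):
    # How many heartbeat intervals fit into one TTL.
    return ttl_s / interval_s


def missed_heartbeats_tolerated(ttl_s, interval_s):
    # Heartbeats scheduled strictly before the TTL elapses, counted from
    # the last successful one.
    return math.ceil(ttl_s / interval_s) - 1


def longest_tolerated_stall(ttl_s, interval_s):
    # Time of the last heartbeat that can still be skipped.
    return missed_heartbeats_tolerated(ttl_s, interval_s) * interval_s


print(safety_ratio(LOCK_TTL_S, HEARTBEAT_INTERVAL_S))                 # 12.0
print(missed_heartbeats_tolerated(LOCK_TTL_S, HEARTBEAT_INTERVAL_S))  # 11
print(longest_tolerated_stall(LOCK_TTL_S, HEARTBEAT_INTERVAL_S))      # 55
```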
### Monitoring

Check logs for these key events:

- `scheduler_lock_acquired`: Lock successfully acquired at startup (INFO)
- `scheduler_lock_heartbeat_updated`: Heartbeat successfully updated (DEBUG)
- `scheduler_lock_heartbeat_failed`: Heartbeat update failed; lock may be lost (WARNING)
- `scheduler_lock_heartbeat_timeout`: Heartbeat exceeded 5-second timeout (ERROR)
- `scheduler_lock_held_by_other_instance`: Another instance holds the lock (WARNING at startup)

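A quick way to see how often each event occurs is to tally the event names in a log file. The sketch below assumes only that each log line contains the event name as plain text; the actual log format is not specified here, so the function and the sample lines are hypothetical.

```python
from collections import Counter

# Event names taken from the list above.
LOCK_EVENTS = (
    "scheduler_lock_acquired",
    "scheduler_lock_heartbeat_updated",
    "scheduler_lock_heartbeat_failed",
    "scheduler_lock_heartbeat_timeout",
    "scheduler_lock_held_by_other_instance",
)


def count_lock_events(lines):
    """Count how many lines mention each scheduler lock event."""
    counts = Counter()
    for line in lines:
        for event in LOCK_EVENTS:
            if event in line:
                counts[event] += 1
                break  # at most one event per line in this sketch
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "INFO scheduler_lock_acquired",
    "DEBUG scheduler_lock_heartbeat_updated",
    "WARNING scheduler_lock_heartbeat_failed",
    "DEBUG scheduler_lock_heartbeat_updated",
]
counts = count_lock_events(sample)
```

A rising count of `scheduler_lock_heartbeat_failed` or `scheduler_lock_heartbeat_timeout` relative to `scheduler_lock_heartbeat_updated` is the signal worth alerting on.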
### Troubleshooting: "Blocklist import runs twice"

**Symptom:** Blocklist import task executes simultaneously in two instances, causing duplicate entries or data corruption.

**Cause:** The scheduler lock was released prematurely (e.g., instance crash, database timeout) while a task was still running.

**Solution:**

1. **Check heartbeat timing**: Ensure the instance isn't hanging for >60 seconds (monitor CPU/memory/disk).
2. **Verify database health**: Run `SELECT * FROM scheduler_lock;` to see whether a stale lock exists. If one is present and no running instance still owns it, delete it: `DELETE FROM scheduler_lock;`. Deleting a lock that is still held would itself allow a second scheduler to start.
3. **Review logs**: Look for `scheduler_lock_heartbeat_failed` or `scheduler_lock_heartbeat_timeout` errors in the time window when duplication occurred.
4. **Increase resource limits**: If the backend is memory/CPU constrained, increase limits in `docker-compose.yml` to prevent slowdowns that trigger false lock timeouts.
5. **Check database performance**: Slow database queries can delay heartbeat updates. Run `PRAGMA integrity_check;` to check for corruption.

If duplication occurs frequently, consider migrating to Redis-backed locking (see Advanced section below) for higher reliability.

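Before deleting a lock by hand, it helps to check how old its last heartbeat is. The sketch below is one way to do that with the standard `sqlite3` module. The column names `owner` and `heartbeat_at` (a Unix timestamp) are assumptions for the example; adapt them to the real `scheduler_lock` schema.

```python
import sqlite3
import time

LOCK_TTL = 60  # seconds, as in the configuration table


def lock_status(conn, now=None):
    """Return ("free" | "held" | "stale", owner, heartbeat age in seconds)."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT owner, heartbeat_at FROM scheduler_lock LIMIT 1"
    ).fetchone()
    if row is None:
        return ("free", None, None)
    owner, heartbeat_at = row
    age = now - heartbeat_at
    # Only a lock older than the TTL is a candidate for manual deletion.
    state = "stale" if age > LOCK_TTL else "held"
    return (state, owner, age)
```

A result of `"held"` means an instance refreshed the lock recently, so removing the row would risk exactly the duplicate execution this section describes.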
### Advanced: Migrating to Redis

For very high-traffic deployments with strict data consistency requirements, you can replace the SQLite-backed lock with Redis:

- **Why:** Redis executes each command atomically on a single thread and expires keys on the server, so lock acquisition and expiry no longer depend on SQLite write locking or on timestamps compared across instances.
- **How:** Install `redlock-py` or `aioredis` (now maintained as part of `redis-py` under `redis.asyncio`), replace `scheduler_lock.py` with a Redis implementation, and set the heartbeat interval to 2-3 seconds.
- **Trade-off:** Adds a Redis dependency but removes database lock contention and makes lock acquisition a single atomic command.

This is not required for typical deployments but is recommended if you see frequent scheduler conflicts in logs.

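As a rough idea of what the Redis variant could look like, the sketch below acquires the lock with a single `SET ... NX EX` command. It assumes a `redis-py`-style client whose `set` method accepts `nx` and `ex` keyword arguments; the key name and function name are hypothetical, and heartbeat renewal (which needs an atomic compare-and-extend, for example via a small Lua script) is deliberately left out.

```python
import uuid

LOCK_KEY = "scheduler_lock"  # hypothetical key name
LOCK_TTL = 60                # seconds


def try_acquire(client, ttl=LOCK_TTL):
    """Return a unique token if the lock was acquired, otherwise None.

    `client` is any object with a redis-py compatible
    set(name, value, nx=..., ex=...) method.
    """
    token = uuid.uuid4().hex
    # NX: only set if the key does not exist. EX: let Redis expire the key
    # after `ttl` seconds, so a crashed holder releases the lock implicitly.
    if client.set(LOCK_KEY, token, nx=True, ex=ttl):
        return token
    return None
```

The token identifies the holder, so a later renewal or release can verify that it still owns the key before touching it.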
---

All containers have hard limits (max usage) and soft reservations (guaranteed allocation). This ensures:
- **Isolation**: A misbehaving container cannot crash others or the host

@@ -1,50 +1,3 @@
## [IMPORTANT] Database transactions lack explicit isolation

**Where found**

- `backend/app/repositories/session_repo.py:40-60`: multiple queries without `BEGIN TRANSACTION`
- Similar pattern in multi-step operations across repositories

**Why this is needed**

Without explicit transaction boundaries, concurrent requests can race: Thread A checks whether a row exists and finds nothing, Thread B runs the same check and also finds nothing, Thread A inserts successfully, and Thread B's insert then fails with a duplicate error or silently overwrites the row.

**Goal**

Wrap all multi-step operations in explicit transactions with an appropriate isolation level.

**What to do**

1. Use an explicit `BEGIN IMMEDIATE` transaction:
   ```python
   # Take the write lock up front so the statements below run as one unit.
   await db.execute("BEGIN IMMEDIATE")
   try:
       await db.execute("INSERT INTO sessions ...")
       await db.commit()
   except Exception:
       # Undo any partial work, then let the caller see the original error.
       await db.rollback()
       raise
   ```

2. Use `IMMEDIATE` mode so the write lock is taken when the transaction begins rather than at the first write
3. Document transaction boundaries clearly

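To show how this closes the check-then-insert race, here is a self-contained sketch using the standard `sqlite3` module instead of the async driver. The `sessions` table layout and both function names are assumptions for the example, not the repository code.

```python
import sqlite3


def open_db(path):
    # isolation_level=None disables the module's implicit BEGIN, so the
    # explicit BEGIN IMMEDIATE / COMMIT / ROLLBACK below are the only ones.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY)")
    return conn


def create_session_once(conn, session_id):
    """Insert `session_id` unless it already exists; return True if inserted."""
    # The write lock is held from here on, so no other writer can slip in
    # between the existence check and the insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        exists = conn.execute(
            "SELECT 1 FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        if exists is None:
            conn.execute("INSERT INTO sessions (id) VALUES (?)", (session_id,))
        conn.execute("COMMIT")
        return exists is None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Calling `create_session_once` twice with the same id inserts one row and returns `False` the second time, instead of raising a duplicate error.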
**Possible traps and issues**

- Nested transactions (SAVEPOINTs) may be needed
- Locks held too long cause contention
- Deadlocks possible with concurrent writers

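For the first point, SQLite's `SAVEPOINT` statement lets one step inside a larger transaction be rolled back without discarding the rest. The helper below is a hypothetical sketch built on the standard `sqlite3` module; it is not part of the codebase.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def savepoint(conn, name):
    """Run a block inside a named SAVEPOINT, undoing only that block on error.

    `name` is interpolated into SQL, so pass trusted identifiers only.
    """
    conn.execute(f"SAVEPOINT {name}")
    try:
        yield
    except Exception:
        # Undo the work done since the savepoint, then drop the savepoint.
        conn.execute(f"ROLLBACK TO {name}")
        conn.execute(f"RELEASE {name}")
        raise
    else:
        conn.execute(f"RELEASE {name}")
```

Used inside a `BEGIN IMMEDIATE` transaction, a failure in the `with savepoint(conn, "step"):` block leaves earlier statements of the outer transaction intact, and the caller decides whether to commit or roll back the whole unit.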
**Docs changes needed**

- Add section in `Docs/Backend-Development.md` § Database Transactions

**Doc references**

- `Docs/Backend-Development.md` (database design)

---

## [IMPORTANT] Scheduler lock race condition

**Where found**