Fix HIGH priority issues: unbounded queries, rate limiting, health checks
Issue #3 - Unbounded Query Results (OOM): - get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default) - Added 'id' field to records from get_archived_history() and get_archived_history_keyset() - Protocol signature updated with page_size, max_rows, last_ban_id params Issue #7 - Docker Health Check Fails: - Added curl to Dockerfile.backend runtime image - HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health' - compose.prod.yml: increased start_period to 40s, timeout to 10s - Frontend healthcheck proxies to backend /api/health Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
@@ -1,90 +1,3 @@
|
||||
### Issue #1: CRITICAL - Services Return Response Models Instead of Domain Models
|
||||
|
||||
**Where found**:
|
||||
- `backend/app/services/jail_service.py` (list_jails, get_jail)
|
||||
- `backend/app/services/server_service.py` (get_settings)
|
||||
- `backend/app/services/config_service.py` (get_jail_config, get_global_config)
|
||||
- `backend/app/services/history_service.py` (list_history, get_ip_detail)
|
||||
- `backend/app/services/filter_config_service.py` (list_filters)
|
||||
- `backend/app/services/blocklist_service.py` (preview_blocklist)
|
||||
- `backend/app/services/health_service.py` (get_service_status)
|
||||
- `backend/app/repositories/protocols.py` (service interface definitions)
|
||||
|
||||
**Why this is needed**:
|
||||
The documented architecture explicitly states: "Services return **domain models** (e.g., `DomainActiveBanList`) that represent pure business logic. Response models are defined in `app/models/` and used only by routers. Conversion happens at the **router boundary**." This violates the principle that "Domain logic can evolve without affecting API shape."
|
||||
|
||||
**Goal**:
|
||||
Refactor all services to return domain models instead of response models. This enables services to be reusable across different frontends (GraphQL, gRPC, CLI), makes testing simpler, and decouples HTTP concerns from business logic.
|
||||
|
||||
**What to do**:
|
||||
1. Create `app/models/{domain}_domain.py` for each domain (config, jail, history, server, health)
|
||||
2. Create `app/mappers/{domain}_mappers.py` with conversion functions
|
||||
3. Refactor each service to return domain models
|
||||
4. Update routers to use mappers at the boundary: `domain_result = await service(...); return map_domain_to_response(domain_result)`
|
||||
5. Update service protocols to specify domain model returns
|
||||
6. Update service tests to work with simple domain objects
|
||||
|
||||
**Possible traps and issues**:
|
||||
- Services like `ban_service` already implement this correctly (with `ban_domain.py` and `ban_mappers.py`). Use this as template to avoid inconsistency.
|
||||
- Frontend tests that mock services will need updates if mocking response models
|
||||
- Existing code that imports response models from services will break and need refactoring
|
||||
- Performance might change if mappers do additional transformations (profile and optimize)
|
||||
|
||||
**Docs changes needed**:
|
||||
- Update `Docs/Architekture.md` section 2.2 to clarify domain vs response model distinction with clear examples
|
||||
- Update service development guidelines in `Docs/Backend-Development.md` with step-by-step guide for adding new services
|
||||
- Create `Docs/DOMAIN_MODELS.md` explaining the pattern and where to find examples
|
||||
|
||||
**Doc references**:
|
||||
- `Docs/Architekture.md` - Section 2.2 "Module Purposes - Services"
|
||||
- ARCHITECTURE_REVIEW.md - Section "CRITICAL: Services Return Response Models"
|
||||
- DETAILED_FINDINGS.md - (implicit in architecture violation)
|
||||
|
||||
---
|
||||
|
||||
### Issue #2: CRITICAL - Scheduler Lock Race Condition (Background Tasks Could Permanently Stop)
|
||||
|
||||
**Where found**:
|
||||
- `backend/app/utils/scheduler_lock.py` (lines 114+)
|
||||
- `backend/app/tasks/` (all background job registration)
|
||||
- `backend/app/startup.py` (scheduler initialization)
|
||||
|
||||
**Why this is needed**:
|
||||
The current scheduler lock implementation can leave an orphaned lock if the heartbeat task crashes. When this happens, the second instance cannot acquire the scheduler role, and ALL background tasks (blocklist imports, geo cache cleanup, history sync, session cleanup) stop permanently. This is a critical production issue.
|
||||
|
||||
**Goal**:
|
||||
Implement atomic lock acquisition with timeout-based expiration, ensuring that dead processes don't block the entire scheduler system.
|
||||
|
||||
**What to do**:
|
||||
1. Add `heartbeat_timeout` to `scheduler_lock` table (default 5 minutes)
|
||||
2. Update `acquire_lock()` to use `INSERT ... ON CONFLICT` with `BEGIN IMMEDIATE` transaction
|
||||
3. Add logic to steal lock if previous holder's heartbeat is stale
|
||||
4. Add heartbeat update with error handling (don't crash if update fails)
|
||||
5. Add monitoring to detect stuck locks
|
||||
6. Write tests for:
|
||||
- Concurrent lock acquisition attempts
|
||||
- Orphaned lock recovery
|
||||
- Heartbeat update failures
|
||||
|
||||
**Possible traps and issues**:
|
||||
- Must use explicit transactions (`BEGIN IMMEDIATE`) to avoid race conditions in SQLite
|
||||
- Heartbeat update timing is critical: too short = false positives, too long = stale locks
|
||||
- Multiple processes might try to steal lock simultaneously - need atomic operation
|
||||
- Database lock contention on scheduler_lock table during high load
|
||||
- Clock skew between servers could cause lock expiration issues (use DB time, not system time)
|
||||
|
||||
**Docs changes needed**:
|
||||
- Add section to `Docs/Observability.md` on monitoring scheduler lock health
|
||||
- Document in `Docs/Deployment.md` what to do if scheduler stops
|
||||
- Add error recovery procedures to `Docs/TROUBLESHOOTING.md`
|
||||
|
||||
**Doc references**:
|
||||
- DETAILED_FINDINGS.md - Issue #11 "Scheduler Lock Race Condition"
|
||||
- `backend/app/utils/scheduler_lock.py` - Current implementation
|
||||
- `backend/app/startup.py` (lines 89-103) - Startup validation
|
||||
|
||||
---
|
||||
|
||||
## HIGH PRIORITY ISSUES
|
||||
|
||||
---
|
||||
@@ -1630,4 +1543,4 @@ Enforce minimum test coverage.
|
||||
- Migration history
|
||||
|
||||
**References**:
|
||||
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "1 Database Design"
|
||||
- DATABASE_API_DEPLOYMENT_ISSUES.md - Issue "1 Database Design"
|
||||
Reference in New Issue
Block a user