Commit Graph

2 Commits

Author SHA1 Message Date
c4ede71fa6 Fix: Enforce single-worker deployment for session cache cluster safety
Addresses: Backend session cache not cluster-safe (multi-worker issue)

Problem:
- Session cache is process-local (InMemorySessionCache)
- Multi-worker deployments (uvicorn --workers N) create separate processes
- Each process has its own independent session cache
- Sessions cached in Worker A are invisible to Workers B, C, D
- Users randomly logged out when requests land on different workers
- Also affects RuntimeState, rate limiter, and background jobs

Solution (Option A - Strict single-worker enforcement):
- Enhance startup validation with clearer error messages
- Update error messages to explain the problem and how to fix it
- Document single-worker requirement prominently in Docker configs
- Update module docstrings to clarify constraints

Changes:
1. app/startup.py:
   - Enhanced _check_single_worker_mode() error message with troubleshooting
   - Enhanced _stage_check_worker_mode_and_acquire_lock() error message
   - Removed unused import

2. app/utils/session_cache.py:
   - Updated module docstring to explain constraints more clearly
   - Added references to deployment documentation
   - Clarified multi-worker solution for future implementation

3. app/utils/runtime_state.py:
   - Updated module docstring with deployment constraint references
   - Aligned messaging with session_cache.py

4. Docker/Dockerfile.backend:
   - Added comprehensive comments about single-worker requirement
   - Explained impact in multi-worker deployments
   - Referenced deployment constraints documentation

5. Docker/docker-compose.yml, compose.prod.yml, compose.debug.yml:
   - Added documentation comments about BANGUI_WORKERS constraint
   - Explained why single-worker is required

6. backend/tests/test_startup_integration.py:
   - Fixed test unpacking to match function return signature (3 values, not 2)

This ensures multi-worker deployments fail loudly at startup with clear
guidance on what went wrong and how to fix it. The database-backed scheduler
lock provides defense-in-depth for container orchestration scenarios.

For future multi-worker support, implement:
- Redis or database-backed session cache
- Shared RuntimeState coordination
- Distributed APScheduler backend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:54:24 +02:00
e86ab6dad1 10) Implement explicit startup DAG for resource initialization
- Created StartupDAG class to orchestrate startup stages with explicit dependencies
- Defined 6 startup stages: WORKER_MODE → DATABASE → GEO_CACHE → HTTP_SESSION → SCHEDULER → TASKS
- Each stage has prerequisites, error handling, and rollback support
- Refactored startup_shared_resources() to use the DAG
- Added StartupContext for resource tracking and failure management
- Partial failures automatically roll back all completed resources in reverse order
- Added health checks to verify all resources initialized successfully
- Comprehensive test coverage: 15 DAG unit tests + 3 integration tests + 6 existing tests
- Documented startup DAG in Architekture.md with detailed stage descriptions and failure modes

This replaces implicit ordering with explicit dependency tracking, making lifecycle
changes safe and failure modes predictable. Hidden order dependencies no longer exist.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 08:08:05 +02:00