feat: implement graceful shutdown with SIGINT/SIGTERM support

- Add WebSocket shutdown() with client notification and graceful close
- Enhance download service stop() with pending state persistence
- Expand FastAPI lifespan shutdown with proper cleanup sequence
- Add SQLite WAL checkpoint before database close
- Update stop_server.sh to use SIGTERM with timeout fallback
- Configure uvicorn timeout_graceful_shutdown=30s
- Update ARCHITECTURE.md with shutdown documentation
2025-12-25 18:59:07 +01:00
parent 1ba67357dc
commit d70d70e193
9 changed files with 443 additions and 175 deletions

@@ -141,6 +141,78 @@ Source: [src/config/settings.py](../src/config/settings.py#L1-L96)
---
## 11. Graceful Shutdown
The application shuts down gracefully when the server is stopped via Ctrl+C (SIGINT) or SIGTERM: it notifies connected clients, persists in-flight download state, and closes the database cleanly so that no data is lost.
### 11.1 Shutdown Sequence
```
1. SIGINT/SIGTERM received
   +-- Uvicorn catches signal
   +-- Stops accepting new requests
2. FastAPI lifespan shutdown triggered
   +-- 30 second total timeout
3. WebSocket shutdown (5s timeout)
   +-- Broadcast {"type": "server_shutdown"} to all clients
   +-- Close each connection with code 1001 (Going Away)
   +-- Clear connection tracking data
4. Download service stop (10s timeout)
   +-- Set shutdown flag
   +-- Persist active download as "pending" in database
   +-- Cancel active download task
   +-- Shut down ThreadPoolExecutor with wait=True
5. Progress service cleanup
   +-- Clear event subscribers
   +-- Clear active progress tracking
6. Database cleanup (10s timeout)
   +-- SQLite: Run PRAGMA wal_checkpoint(TRUNCATE)
   +-- Dispose async engine
   +-- Dispose sync engine
7. Process exits cleanly
```
Source: [src/server/fastapi_app.py](../src/server/fastapi_app.py#L142-L210)
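The ordering and timeout budget in the sequence above can be sketched with plain `asyncio`. This is a minimal illustration under stated assumptions, not the code in `fastapi_app.py`: `run_shutdown`, the `Step` tuple, and the demo coroutines are hypothetical names, and the real lifespan handler may track remaining time differently.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple

# Hypothetical step type: (name, coroutine factory, per-step timeout in seconds).
Step = Tuple[str, Callable[[], Awaitable[None]], float]


async def run_shutdown(steps: List[Step], total_timeout: float = 30.0) -> List[str]:
    """Run steps in order; each gets min(own timeout, remaining total budget)."""
    deadline = time.monotonic() + total_timeout
    results: List[str] = []
    for name, step, step_timeout in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            results.append(f"{name}: skipped")
            continue
        try:
            await asyncio.wait_for(step(), timeout=min(step_timeout, remaining))
            results.append(f"{name}: ok")
        except asyncio.TimeoutError:
            results.append(f"{name}: timed out")
        except Exception as exc:  # a failing step must not block later steps
            results.append(f"{name}: failed ({exc})")
    return results


async def _fast_step() -> None:
    await asyncio.sleep(0)


async def _slow_step() -> None:
    await asyncio.sleep(10)


def demo_lifespan_shutdown() -> List[str]:
    steps: List[Step] = [
        ("websocket", _fast_step, 5.0),
        ("downloads", _slow_step, 0.05),  # deliberately short so the demo times out
        ("database", _fast_step, 10.0),
    ]
    return asyncio.run(run_shutdown(steps, total_timeout=30.0))
```

The point of the design is that a step which hangs consumes only its own allowance, so the database cleanup still runs even when an earlier step times out.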
### 11.2 Key Components
| Component | File | Shutdown Method |
| ------------------- | ------------------------------------------------------------------------------------------ | ------------------------ |
| WebSocket Service | [websocket_service.py](../src/server/services/websocket_service.py) | `shutdown(timeout=5.0)` |
| Download Service | [download_service.py](../src/server/services/download_service.py) | `stop(timeout=10.0)` |
| Database Connection | [connection.py](../src/server/database/connection.py) | `close_db()` |
| Uvicorn Config | [run_server.py](../run_server.py) | `timeout_graceful_shutdown=30` |
| Stop Script | [stop_server.sh](../stop_server.sh) | SIGTERM with fallback |
### 11.3 Data Integrity Guarantees
1. **Active downloads preserved**: In-progress downloads are saved as "pending" and can resume on restart.
2. **Database WAL flushed**: SQLite WAL checkpoint ensures all writes are in the main database file.
3. **WebSocket clients notified**: Clients receive a shutdown message before the connection closes.
4. **Thread pool cleanup**: Background threads complete or are gracefully cancelled.
### 11.4 Manual Stop
```bash
# Graceful stop via script (sends SIGTERM, waits up to 30s)
./stop_server.sh
# Or press Ctrl+C in terminal running the server
```
Source: [stop_server.sh](../stop_server.sh#L1-L80)
---
## 3. Component Interactions
### 3.1 Request Flow (REST API)

@@ -143,180 +143,92 @@ Currently, the application uses SQLAlchemy sessions with auto-commit behavior th
---
### Step 1: Create Transaction Utilities Module
## Task: Graceful Shutdown Implementation ✅ COMPLETED
**File**: `src/server/database/transaction.py`
### Objective
Create a new module providing transaction management utilities:
Implement proper graceful shutdown handling so that Ctrl+C (SIGINT) or SIGTERM triggers a clean shutdown sequence that terminates all concurrent processes and prevents database corruption.
1. **`@transactional` decorator** - Wraps a function in a transaction boundary
### Background
- Accepts a session parameter or retrieves one via dependency injection
- Commits on success, rolls back on exception
- Re-raises exceptions after rollback
- Logs transaction start, commit, and rollback events
The application runs multiple concurrent services (WebSocket connections, download service with ThreadPoolExecutor, database sessions) that need to be properly cleaned up during shutdown. Without graceful shutdown, active downloads may corrupt state, database writes may be incomplete, and WebSocket clients won't receive disconnect notifications.
2. **`TransactionContext` class** - Context manager for explicit transaction control
### Implementation Summary
- Supports `with` statement usage
- Provides `savepoint()` method for nested transactions using `begin_nested()`
- Handles commit/rollback automatically
The following components were implemented:
3. **`atomic()` function** - Async context manager for async operations
- Same behavior as `TransactionContext` but for async code
#### 1. WebSocket Service Shutdown ([src/server/services/websocket_service.py](src/server/services/websocket_service.py))
**Interface Requirements**:
- Added `shutdown()` method to `ConnectionManager` that:
  - Broadcasts `{"type": "server_shutdown"}` notification to all connected clients
  - Gracefully closes each WebSocket connection with code 1001 (Going Away)
  - Clears all connection tracking data structures
  - Supports configurable timeout (default 5 seconds)
- Added `shutdown()` method to `WebSocketService` that delegates to the manager
- Decorator must work with both sync and async functions
- Must handle the case where session is already in a transaction
- Must support optional `propagation` parameter (REQUIRED, REQUIRES_NEW, NESTED)
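The WebSocket shutdown behaviour described for `ConnectionManager` (broadcast, close with 1001, clear tracking data, bounded by a timeout) could look roughly like the sketch below. `SketchConnectionManager`, `SocketLike`, and `FakeSocket` are hypothetical names introduced for illustration; the only assumption about the real sockets is that they offer `send_json()` and `close(code=...)`, as Starlette's `WebSocket` does.

```python
import asyncio
from typing import Dict, List, Optional, Protocol


class SocketLike(Protocol):
    """Minimal interface assumed here (hypothetical)."""

    async def send_json(self, data: dict) -> None: ...

    async def close(self, code: int = 1000) -> None: ...


class SketchConnectionManager:
    """Hypothetical stand-in for ConnectionManager, reduced to shutdown logic."""

    def __init__(self) -> None:
        self.connections: Dict[str, SocketLike] = {}

    async def _notify_and_close(self, ws: SocketLike) -> None:
        try:
            await ws.send_json({"type": "server_shutdown"})
        finally:
            await ws.close(code=1001)  # 1001 = Going Away

    async def shutdown(self, timeout: float = 5.0) -> None:
        tasks = [self._notify_and_close(ws) for ws in self.connections.values()]
        try:
            await asyncio.wait_for(
                asyncio.gather(*tasks, return_exceptions=True), timeout=timeout
            )
        except asyncio.TimeoutError:
            pass  # slow clients are abandoned once the timeout expires
        finally:
            self.connections.clear()


class FakeSocket:
    """Test double that records what was sent and how it was closed."""

    def __init__(self) -> None:
        self.sent: List[dict] = []
        self.close_code: Optional[int] = None

    async def send_json(self, data: dict) -> None:
        self.sent.append(data)

    async def close(self, code: int = 1000) -> None:
        self.close_code = code


def demo_websocket_shutdown():
    manager = SketchConnectionManager()
    sockets = [FakeSocket(), FakeSocket()]
    for index, sock in enumerate(sockets):
        manager.connections[f"client-{index}"] = sock
    asyncio.run(manager.shutdown(timeout=5.0))
    return manager, sockets
```

Using `return_exceptions=True` means one broken client connection does not prevent the others from being notified and closed.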
#### 2. Download Service Stop ([src/server/services/download_service.py](src/server/services/download_service.py))
---
- Enhanced `stop()` method to:
  - Persist active downloads back to "pending" status in database (allows resume on restart)
  - Cancel active download tasks with proper timeout handling
  - Shut down ThreadPoolExecutor with `wait=True` and configurable timeout (default 10 seconds)
  - Fall back to forced shutdown if timeout expires
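`ThreadPoolExecutor.shutdown()` itself takes no timeout argument, so a bounded wait has to be built around it. One possible approach, sketched here as an assumption rather than as the actual `stop()` implementation, runs the blocking wait in a helper thread and falls back to a non-waiting shutdown. `shutdown_executor` is a hypothetical helper, and `cancel_futures` requires Python 3.9 or later.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def shutdown_executor(executor: ThreadPoolExecutor, timeout: float = 10.0) -> bool:
    """Return True if workers finished within timeout, False if forced."""
    # Run the blocking shutdown(wait=True) in a helper thread so we can bound it.
    waiter = threading.Thread(
        target=executor.shutdown, kwargs={"wait": True}, daemon=True
    )
    waiter.start()
    waiter.join(timeout)
    if waiter.is_alive():
        # Forced path: stop waiting and cancel work that has not started yet.
        executor.shutdown(wait=False, cancel_futures=True)
        return False
    return True


def demo_executor_shutdown():
    fast = ThreadPoolExecutor(max_workers=1)
    fast.submit(time.sleep, 0.01)
    finished = shutdown_executor(fast, timeout=2.0)

    release = threading.Event()
    slow = ThreadPoolExecutor(max_workers=1)
    slow.submit(release.wait)            # simulates a download that does not stop in time
    queued = slow.submit(time.sleep, 0)  # stays queued behind the stuck task
    forced = not shutdown_executor(slow, timeout=0.05)
    release.set()                        # let the stuck worker exit so the process can end
    return finished, forced, queued.cancelled()
```

Note that the forced path cannot interrupt a task that is already running; it only stops waiting for it and cancels queued work, which is why persisting the active download as "pending" first matters.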
### Step 2: Update Connection Module
#### 3. FastAPI Lifespan Shutdown ([src/server/fastapi_app.py](src/server/fastapi_app.py))
**File**: `src/server/database/connection.py`
- Expanded shutdown sequence in proper order:
  1. Broadcast shutdown notification via WebSocket
  2. Stop download service and persist state
  3. Clean up progress service (clear subscribers and active progress)
  4. Close database connections with WAL checkpoint
- Added timeout protection (30 seconds total) with remaining time tracking
- Each step has individual timeout to prevent hanging
Modify the existing session management:
#### 4. Uvicorn Graceful Shutdown ([run_server.py](run_server.py))
1. Add `get_transactional_session()` generator that does NOT auto-commit
2. Add `TransactionManager` class for manual transaction control
3. Keep `get_db_session()` unchanged for backward compatibility
4. Add session state inspection utilities (`is_in_transaction()`, `get_transaction_depth()`)
- Added `timeout_graceful_shutdown=30` parameter to uvicorn.run()
- Ensures uvicorn allows sufficient time for lifespan shutdown to complete
- Updated docstring to document Ctrl+C behavior
---
#### 5. Stop Script ([stop_server.sh](stop_server.sh))
### Step 3: Wrap Service Layer Operations
- Replaced `kill -9` (SIGKILL) with `kill -TERM` (SIGTERM)
- Added `wait_for_process()` function that waits up to 30 seconds for graceful shutdown
- Only falls back to SIGKILL if graceful shutdown times out
- Improved user feedback during shutdown process
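The script itself is shell, but the same SIGTERM-then-SIGKILL logic can be illustrated in Python with `subprocess`. This is only a sketch of the pattern: `stop_process` is a hypothetical helper and does not reproduce `wait_for_process()` or any other part of `stop_server.sh`.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 30.0) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, used only after the grace period
        proc.wait()
        return "forced"


def demo_stop_process():
    # A child that just sleeps; default SIGTERM handling ends it promptly.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    outcome = stop_process(child, grace_seconds=10.0)
    return outcome, child.poll() is not None
```

Sending SIGKILL first, as the old script did, gives the server no chance to run its lifespan shutdown, so none of the cleanup steps would execute.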
**File**: `src/server/database/service.py`
#### 6. Database WAL Checkpoint ([src/server/database/connection.py](src/server/database/connection.py))
Apply transaction handling to all compound write operations:
- Enhanced `close_db()` to run `PRAGMA wal_checkpoint(TRUNCATE)` for SQLite
- Ensures all pending WAL writes are flushed to main database file
- Prevents database corruption during shutdown
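The effect of the checkpoint can be observed with the standard-library `sqlite3` module. The sketch below is an illustration only: the project's `close_db()` works through SQLAlchemy engines, whereas `checkpoint_wal`, the `downloads` table, and the temporary database here are hypothetical.

```python
import os
import sqlite3
import tempfile


def checkpoint_wal(conn: sqlite3.Connection) -> int:
    """Run a TRUNCATE checkpoint; return the 'busy' flag (0 means it completed)."""
    busy, _wal_frames, _checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    return busy


def demo_wal_checkpoint():
    db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
    wal_path = db_path + "-wal"

    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE downloads (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO downloads (status) VALUES ('pending')")
    conn.commit()

    size_before = os.path.getsize(wal_path)  # committed frames live in the WAL
    busy = checkpoint_wal(conn)
    size_after = os.path.getsize(wal_path)   # TRUNCATE resets the WAL to zero bytes
    conn.close()

    # Reopen to confirm the row is readable from the main database file.
    check = sqlite3.connect(db_path)
    rows = check.execute("SELECT status FROM downloads").fetchall()
    check.close()
    return size_before, busy, size_after, rows
```

A non-zero `busy` value would indicate that another connection blocked the checkpoint, which is one reason to run it only after the other services have stopped.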
**AnimeService**:
### How Graceful Shutdown Works
- `create_anime_with_episodes()` - if exists, wrap in transaction
- Any method that calls multiple repository methods
1. **Ctrl+C or SIGTERM received** → uvicorn catches signal
2. **uvicorn triggers lifespan shutdown** → FastAPI's lifespan context manager exits
3. **WebSocket broadcast** → All connected clients receive shutdown notification
4. **Download service stops** → Active downloads persisted, executor shutdown
5. **Progress service cleanup** → Event subscribers cleared
6. **Database cleanup** → WAL checkpoint, connections disposed
7. **Process exits cleanly** → No data loss or corruption
**EpisodeService**:
### Testing
- `bulk_update_episodes()` - if exists
- `mark_episodes_downloaded()` - if handles multiple episodes
```bash
# Start server
conda run -n AniWorld python run_server.py
**DownloadQueueService**:
- `add_batch_to_queue()` - if exists
- `clear_and_repopulate()` - if exists
- Any method performing multiple writes
**SessionService**:
- `rotate_session()` - delete old + create new must be atomic
- `cleanup_expired_sessions()` - bulk delete operation
**Pattern to follow**:
```python
@transactional
def compound_operation(self, session: Session, data: SomeModel) -> Result:
# Multiple write operations here
# All succeed or all fail
# Press Ctrl+C to trigger graceful shutdown
# Or use the stop script:
./stop_server.sh
```
---
### Verification
### Step 4: Update Queue Repository
**File**: `src/server/services/queue_repository.py`
Ensure atomic operations for:
1. `save_item()` - check existence + insert/update must be atomic
2. `remove_item()` - if involves multiple deletes
3. `clear_all_items()` - bulk delete should be transactional
4. `reorder_queue()` - multiple position updates must be atomic
- All existing tests pass (websocket, download service, database transactions)
- WebSocket clients receive disconnect notification before connection closes
- Active downloads are preserved and can resume on restart
- SQLite WAL file is checkpointed before shutdown
---
### Step 5: Update API Endpoints
**Files**: `src/server/api/anime.py`, `src/server/api/downloads.py`, `src/server/api/auth.py`
Review and update endpoints that perform multiple database operations:
1. Identify endpoints calling multiple service methods
2. Wrap in transaction boundary at the endpoint level OR ensure services handle it
3. Prefer service-level transactions over endpoint-level for reusability
---
### Step 6: Add Unit Tests
**File**: `tests/unit/test_transactions.py`
Create comprehensive tests:
1. **Test successful transaction commit** - verify all changes persisted
2. **Test rollback on exception** - verify no partial writes
3. **Test nested transaction with savepoint** - verify partial rollback works
4. **Test decorator with sync function**
5. **Test decorator with async function**
6. **Test context manager usage**
7. **Test transaction propagation modes**
**File**: `tests/unit/test_service_transactions.py`
1. Test each service's compound operations for atomicity
2. Mock exceptions mid-operation to verify rollback
3. Verify no orphaned data after failed operations
---
### Step 7: Update Integration Tests
**File**: `tests/integration/test_db_transactions.py`
1. Test real database transaction behavior
2. Test concurrent transaction handling
3. Test transaction isolation levels if applicable
---
### Step 8: Update Documentation
1. Check the docs folder and update the files as needed
---
### Implementation Notes
- **SQLAlchemy Pattern**: Use `session.begin_nested()` for savepoints
- **Error Handling**: Always log transaction failures with full context
- **Performance**: Transactions have overhead - don't wrap single operations unnecessarily
- **Testing**: Use `session.rollback()` in test fixtures to ensure clean state
### Files to Modify
| File | Action |
| ------------------------------------------- | ------------------------------------------ |
| `src/server/database/transaction.py` | CREATE - New transaction utilities |
| `src/server/database/connection.py` | MODIFY - Add transactional session support |
| `src/server/database/service.py` | MODIFY - Apply @transactional decorator |
| `src/server/services/queue_repository.py` | MODIFY - Ensure atomic operations |
| `src/server/api/anime.py` | REVIEW - Check for multi-write endpoints |
| `src/server/api/downloads.py` | REVIEW - Check for multi-write endpoints |
| `src/server/api/auth.py` | REVIEW - Check for multi-write endpoints |
| `tests/unit/test_transactions.py` | CREATE - Transaction unit tests |
| `tests/unit/test_service_transactions.py` | CREATE - Service transaction tests |
| `tests/integration/test_db_transactions.py` | CREATE - Integration tests |
### Acceptance Criteria
- [x] All database write operations use explicit transactions
- [x] Compound operations are atomic (all-or-nothing)
- [x] Exceptions trigger proper rollback
- [x] No partial writes occur on failures
- [x] All existing tests pass (1090 tests passing)
- [x] New transaction tests pass with >90% coverage (90% achieved)
- [x] Logging captures transaction lifecycle events
- [x] Documentation updated in DATABASE.md
- [x] Code follows project coding standards