Fix non-atomic setup persistence across DB contexts (Issue #30)

Implement transactional setup with explicit state machine and crash-safety
to prevent partial commits from leaving inconsistent state.

## Changes

### Core Implementation
1. **settings_repo.py**: Add atomic batch settings write
   - New set_settings_batch() method: writes multiple settings in single
     transaction (BEGIN IMMEDIATE ... COMMIT). Either all settings persist
     or none do, preventing partial state if crash occurs mid-batch.

2. **setup_service.py**: Refactor run_setup() with transactional phases
   - Phase 0: Compute password hash early (before any DB writes) to ensure
     idempotency. Same hash is used throughout retries, preventing divergent
     hashes from bcrypt's random salt.
   - Phase 1 (Bootstrap DB transaction): Set setup_state=in_progress and
     database_path, then commit. First checkpoint for crash detection.
   - Phase 2 (Filesystem): Initialize runtime database (idempotent)
   - Phase 3 (Runtime DB transaction): Batch-write all settings atomically
   - Phase 4 (Bootstrap DB transaction): Set setup_state=complete and
     setup_completed=1. Final commit point.

3. **protocols.py**: Add set_settings_batch to SettingsRepository protocol

### Testing
- Added 6 new transactionality tests covering:
  - State machine transitions (None → in_progress → complete)
  - Password hash idempotency across retries
  - Atomic batch writes (all-or-nothing persistence)
  - Bootstrap DB state tracking
  - Database path propagation to both DBs
  - Recovery on partial failure
- All 18 tests pass (12 existing + 6 new)

### Documentation
- Updated Docs/Architekture.md with new section 6:
  - Setup state machine with state transitions
  - Transaction boundary documentation
  - Password hash idempotency rationale
  - Backward compatibility notes

## Design Decisions

### Why This Approach
- Current code already idempotent via INSERT OR REPLACE, but password
  hash non-idempotency created silent inconsistency risk
- Simpler than multi-state machine: 2 states sufficient for detection
- Maintains backward compatibility (setup_completed key still written)
- Explicit transactions make crash-safety obvious to future maintainers

### Crash Scenarios Now Handled
1. Crash after Phase 1 → detected by setup_state=in_progress on retry
2. Crash after Phase 2 → runtime DB may be partial, safe to retry
3. Crash after Phase 3 → runtime DB rolls back on next connection
4. Crash after Phase 4 → setup_completed detected, skipped

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-04-29 19:19:53 +02:00
parent cc4370c50d
commit 1302ac821f
5 changed files with 376 additions and 30 deletions

View File

@@ -920,7 +920,51 @@ BanGUI maintains its **own SQLite database** (separate from the fail2ban databas
---
## 6. Authentication & Session Management
## 6. Setup & Configuration Persistence
### 6.1 Initial Setup Wizard & One-Time Configuration
The setup wizard (`POST /api/setup`) runs once during first-time startup to configure:
- Master password (bcrypt-hashed)
- Runtime database path (where BanGUI stores operational state)
- fail2ban Unix socket path
- IANA timezone
- Session duration (in minutes)
- Map color thresholds for geolocation visualization
**Atomicity & Crash-Safety:**
Setup is implemented with explicit transaction boundaries across two SQLite databases (bootstrap config DB and runtime app DB) to ensure atomicity:
1. **Phase 1 (Bootstrap DB transaction)**: Set `setup_state = "in_progress"` and persist `database_path`. On commit, this is the first checkpoint — if process crashes here, the next setup attempt will detect and clean up.
2. **Phase 2 (Filesystem + Runtime DB)**: Initialize runtime database schema outside a transaction (idempotent via `CREATE TABLE IF NOT EXISTS`).
3. **Phase 3 (Runtime DB transaction)**: Batch-write all runtime settings (password hash, paths, config) atomically in a single `BEGIN IMMEDIATE ... COMMIT` transaction. Either all settings are persisted or none are.
4. **Phase 4 (Bootstrap DB transaction)**: Set `setup_state = "complete"` and `setup_completed = "1"`. This is the final commit point — only when this succeeds is setup considered complete.
**Password Hash Idempotency:**
The bcrypt password hash is computed early (before any DB writes) to ensure that if setup is retried after a crash, the same hash is used throughout all retry attempts. This prevents divergent hashes due to bcrypt's random salt generation.
**State Machine:**
| State | Meaning | Recovery |
|-------|---------|----------|
| `null` | Setup not started | Normal flow: begin setup |
| `"in_progress"` | Bootstrap DB marked, runtime DB being initialized | Retry from beginning (runtime DB may be partial) |
| `"complete"` | All settings persisted, setup finished | Skip setup (already done) |
If a crash is detected in `"in_progress"` state on the next startup, cleanup logic can detect this and either retry or remove the partial runtime database before retrying.
**Backward Compatibility:**
The `setup_completed = "1"` key is still written for backward compatibility with cache detection. Modern code checks `setup_state = "complete"` for clearer semantics.
---
## 8. Authentication & Session Management
- **Single-user model** — one master password, no usernames.
- Password is hashed with a strong algorithm (e.g., bcrypt or argon2) and stored in the application database during setup.
@@ -934,7 +978,7 @@ BanGUI maintains its **own SQLite database** (separate from the fail2ban databas
- **Runtime state** (`RuntimeState` in `app.utils.runtime_state`) — stores mutable application state: `server_status` (fail2ban online/offline), `last_activation` (jail activation tracking), `pending_recovery` (crash detection), `runtime_settings` (effective configuration), and service-specific state holders like `jail_service_state` (`JailServiceState` for jail capability detection cache). RuntimeState fields are managed through dedicated functions (e.g., `record_activation()`, `clear_pending_recovery()`) and via dependency injection to services. Service-specific state (like `JailServiceState`) is nested within `RuntimeState` to keep all mutable state in one controlled location. **⚠️ RuntimeState is process-local and only safe when BanGUI runs as a single asyncio worker.** Mutations must not span `await` points (cooperative scheduling within a single event loop is safe). In multi-worker deployments, each process has its own copy — logouts from worker A don't affect worker B's cache, health status updates are per-worker, and activation tracking is unreliable. BanGUI enforces single-worker mode (TASK-002) to prevent this issue. For future multi-worker support, replace RuntimeState with a shared coordination backend (Redis, shared memory, database). See `app/utils/runtime_state.py` module docstring for details.
- **Setup-completion flag** — once `is_setup_complete()` returns `True`, the result is stored in `app.state._setup_complete_cached`. The `SetupRedirectMiddleware` skips the DB query on all subsequent requests, removing 1 SQL query per request for the common post-setup case. The completion flag is only written after the runtime database is successfully initialized and all initial setup settings are persisted, preventing a failed setup from permanently bypassing the setup wizard.
### 6.1 CSRF Protection
### 8.1 CSRF Protection
State-mutating endpoints (POST, PUT, DELETE, PATCH) that use cookie-based authentication are protected against Cross-Site Request Forgery (CSRF) attacks via a **custom header check middleware**.
@@ -949,7 +993,7 @@ State-mutating endpoints (POST, PUT, DELETE, PATCH) that use cookie-based authen
This mechanism complements the existing `SameSite=Lax` cookie policy, which blocks traditional `<form>` POST requests but does not protect against JavaScript-initiated requests on a subdomain or same-origin XSS injection.
---
## 7. Scheduling
## 9. Scheduling
APScheduler 4.x (async mode) manages recurring background tasks.
@@ -972,7 +1016,7 @@ APScheduler 4.x (async mode) manages recurring background tasks.
---
## 7.1 Background Tasks and Database Access
## 10.1 Background Tasks and Database Access
- APScheduler jobs run outside FastAPI request/response scope and therefore cannot rely on ``Depends(get_db)``.
- Background tasks must open their own application database connection via ``app.db.open_db`` and close it when the work completes.
@@ -981,9 +1025,9 @@ APScheduler 4.x (async mode) manages recurring background tasks.
---
## 8. API Design
## 9. API Design
### 8.1 Conventions
### 9.1 Conventions
- All endpoints are grouped under `/api/` prefix.
- JSON request and response bodies, validated by Pydantic models.
@@ -992,7 +1036,7 @@ APScheduler 4.x (async mode) manages recurring background tasks.
- Standard HTTP status codes: `200` success, `201` created, `204` no content, `400` bad request, `401` unauthorized, `404` not found, `422` validation error, `423` locked, `500` server error.
- Error responses follow a consistent shape: `{ "detail": "Human-readable message" }`.
### 8.2 Endpoint Groups
### 9.2 Endpoint Groups
| Group | Endpoints | Description |
|---|---|---|
@@ -1043,7 +1087,7 @@ APScheduler 4.x (async mode) manages recurring background tasks.
---
## 9.2 nginx Routing Rules
## 10.2 nginx Routing Rules
The reverse proxy (nginx) must route requests correctly to prevent frontend SPA fallback rules from hiding backend 404 errors. The following location blocks ensure proper behavior: