refactoring-backend #3
Reference in New Issue
Block a user
Delete Branch "refactoring-backend"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Previously, the tab content wrapper used 'key={tab}' which caused React to unmount and remount the entire subtree when switching tabs. This destroyed all component state, including unsaved form data and pending auto-saves. Changes: - Removed 'key={tab}' from the wrapper div - All tab panels now render at page initialization - Inactive tabs use CSS 'display: none' to hide without unmounting - Tabs remain mounted throughout the page lifetime - Users can now switch tabs without losing form input Updated ConfigPage.test.tsx to reflect that inactive tabs remain in the DOM (just hidden with CSS) rather than being removed entirely. Documentation: Added 'Tab Panels' section to Web-Development.md explaining the rule and rationale. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>Replace the flawed join-based comparison (entryKeys.join(',')) with JSON.stringify() to properly handle keys containing commas. The previous implementation could produce false equality when different key sets shared the same comma-separated representation (e.g., 'a,b' key vs separate 'a' and 'b' keys). This ensures the effect fires correctly when keys change, fixing silent failures to update derived state. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>The AppState Protocol (lines 42-54) and ApplicationContext dataclass (lines 57-69) described identical fields, creating maintenance burden. The Protocol was only used for a single cast() in _build_app_context. Changes: - Remove AppState Protocol class - Import ApplicationState from runtime_state.py - Replace cast('AppState', request.app.state) with cast(ApplicationState, request.app.state) - Remove unused Protocol import This eliminates the redundancy while maintaining the same typing guarantees. request.app.state is set to ApplicationState instances during app initialization in main.py. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>- Extract path validation logic into shared helper function in backend/app/utils/path_utils.py (validate_log_path) - Refactor AddLogPathRequest to use the helper function - Apply the same validation to DELETE /api/config/jails/{name}/logpath endpoint by validating the log_path query parameter - Return HTTP 422 with descriptive error if validation fails - Add comprehensive unit tests for path validation - Update Backend-Development.md with usage examples This prevents path-traversal attacks on the delete_log_path endpoint by ensuring all log paths are within allowlisted directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>Implement transactional setup with explicit state machine and crash-safety to prevent partial commits from leaving inconsistent state. ## Changes ### Core Implementation 1. **settings_repo.py**: Add atomic batch settings write - New set_settings_batch() method: writes multiple settings in single transaction (BEGIN IMMEDIATE ... COMMIT). Either all settings persist or none do, preventing partial state if crash occurs mid-batch. 2. **setup_service.py**: Refactor run_setup() with transactional phases - Phase 0: Compute password hash early (before any DB writes) to ensure idempotency. Same hash is used throughout retries, preventing divergent hashes from bcrypt's random salt. - Phase 1 (Bootstrap DB transaction): Set setup_state=in_progress and database_path, then commit. First checkpoint for crash detection. - Phase 2 (Filesystem): Initialize runtime database (idempotent) - Phase 3 (Runtime DB transaction): Batch-write all settings atomically - Phase 4 (Bootstrap DB transaction): Set setup_state=complete and setup_completed=1. Final commit point. 3. **protocols.py**: Add set_settings_batch to SettingsRepository protocol ### Testing - Added 6 new transactionality tests covering: - State machine transitions (None → in_progress → complete) - Password hash idempotency across retries - Atomic batch writes (all-or-nothing persistence) - Bootstrap DB state tracking - Database path propagation to both DBs - Recovery on partial failure - All 18 tests pass (12 existing + 6 new) ### Documentation - Updated Docs/Architekture.md with new section 6: - Setup state machine with state transitions - Transaction boundary documentation - Password hash idempotency rationale - Backward compatibility notes ## Design Decisions ### Why This Approach - Current code already idempotent via INSERT OR REPLACE, but password hash non-idempotency created silent inconsistency risk - Simpler than multi-state machine: 2 states sufficient for detection - Maintains backward compatibility (setup_completed key still written) - Explicit transactions make crash-safety obvious to future maintainers ### Crash Scenarios Now Handled 1. Crash after Phase 1 → detected by setup_state=in_progress on retry 2. Crash after Phase 2 → runtime DB may be partial, safe to retry 3. Crash after Phase 3 → runtime DB rolls back on next connection 4. Crash after Phase 4 → setup_completed detected, skipped Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>Enforce single-executor safety regardless of process launcher through a robust database-backed lock mechanism that works reliably in container orchestration environments. Key changes: 1. Add scheduler_lock table to database schema (migration 4) - Singleton row (id=1) prevents concurrent execution - Stores PID, hostname, creation timestamp, heartbeat timestamp - Atomic transaction prevents race conditions 2. Create scheduler lock utility (app/utils/scheduler_lock.py) - acquire_scheduler_lock(): Atomically acquire or fail - release_scheduler_lock(): Clean up on shutdown - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds) - get_scheduler_lock_info(): Debug/inspect lock status - Stale lock detection: TTL-based (60 second expiry) 3. Reorder startup DAG stages - DATABASE now comes first (required for lock acquisition) - WORKER_MODE depends on DATABASE (performs lock check after initialization) - Maintains all other stage dependencies intact 4. Update startup process (app/startup.py) - Replace _check_single_worker_mode() with two-tier check: * Fast check: BANGUI_WORKERS env var (if explicitly set to >1) * Authoritative check: Database lock (catches misconfiguration) - Return startup_db from startup_shared_resources() for lock management 5. Register scheduler lock heartbeat task - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py) - Updates lock heartbeat every 10 seconds (keeps lock alive) - Prevents false positives from temporary load spikes 6. Add lock release to lifespan shutdown (app/main.py) - Release lock before closing database - Allows other instances to acquire during rolling deployments - Graceful handoff between instances 7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py) - Lock acquisition success and failure cases - Stale lock cleanup on startup - Lock release and heartbeat updates - Full lifecycle: acquire → heartbeat → release 8. Update documentation (Docs/Architekture.md § 9.3) - Explain single-executor requirement - Document database-backed locking mechanism - Compare with alternative approaches (filesystem, env var) - Include troubleshooting guide - Container orchestration examples (Docker, Kubernetes, systemd) Why database-backed instead of filesystem? - Atomicity: SQLite transactions prevent TOCTOU race windows - Container-safe: Works across containers with shared DB volumes - No NFS/SMB edge cases - Timestamp-based stale detection (PID reuse is unreliable) - More reliable in rolling deployments Benefits: - Works with any process manager (uvicorn, gunicorn, etc.) - Handles simultaneous startup attempts correctly - Automatic failover on instance crash (stale lock cleanup) - Clear error messages with troubleshooting steps - No environment variable required (lock is authoritative) - Scales to multi-worker deployments if combined with external job store Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>- Create PaginationMetadata model with computed derived fields (total_pages, has_next_page, has_prev_page) - Update PaginatedListResponse to embed pagination metadata in a separate 'pagination' object - Add create_pagination_metadata() factory function in utils/pagination.py for consistent computation - Update all paginated service functions to use new structure: - history_service.list_history() - blocklist_service.get_import_logs() - jail_service.get_jail_banned_ips() - ban_mappers.map_domain_dashboard_ban_list_to_response() - Update response model docstrings with new structure examples - Update Backend-Development.md documentation with new pagination patterns - Update test fixtures to work with new response structure Response shape changes from: {"items": [...], "total": 100, "page": 1, "page_size": 50} To: {"items": [...], "pagination": {"page": 1, "page_size": 50, "total": 100, "total_pages": 2, "has_next_page": true, "has_prev_page": false}} Benefits: - Frontend receives all pagination state needed for UI controls - No need for frontend to calculate total_pages or page navigation logic - Consolidated pagination metadata reduces field sprawl - OpenAPI schema automatically reflects changes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>The API pagination infrastructure was already correctly implemented with: - PaginatedListResponse base model containing 'items' and 'pagination' fields - PaginationMetadata object with all required fields (page, page_size, total, total_pages, has_next_page, has_prev_page) - All services correctly calling create_pagination_metadata() However, there were two bugs preventing tests from passing: 1. IMPORT BUG: time_utils.py was importing TIME_RANGE_SECONDS from app.models.ban when it's actually defined in app.models._common. This caused import errors in tests that exercise time-range filtering. 2. TEST BUG: Test assertions were using outdated API structure, accessing .total, .page, .page_size directly on paginated responses instead of through the .pagination object. Fixed locations: - test_mappers/test_ban_mappers.py: 3 assertions updated to use .pagination.* - test_services/test_blocklist_service.py: 6 assertions updated - test_services/test_history_service.py: 14 assertions updated All paginated API endpoints now correctly return pagination metadata: - GET /api/history - GET /api/history/archive - GET /api/dashboard/bans - GET /api/jails/{name}/banned - GET /api/blocklists/log Verified with 24 passing pagination tests demonstrating correct behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>- Backend: Implement Prometheus metrics collection - Add prometheus-client dependency - Create metrics utility module with HTTP request tracking counters, histograms, gauges - Implement MetricsMiddleware to track request latency, count, and active requests - Add /metrics endpoint to expose metrics in Prometheus text format - Normalize paths to prevent cardinality explosion (e.g., /api/{id} for UUIDs) - Exclude /metrics and /health from detailed tracking - Frontend: Add web vitals and API metrics collection - Install web-vitals library (v4.0.0) for Core Web Vitals tracking - Create metrics utility module for FCP, LCP, CLS, INP, TTFB collection - Implement useTrackedFetch hook for automatic API call metrics (method, endpoint, status, duration) - Initialize web vitals tracking in App component on mount - Provide exportMetrics() for sending metrics to backend - Testing: - Add comprehensive backend metrics tests (9 tests, 100% coverage) - Add comprehensive frontend metrics tests (10 tests) - All tests passing - Documentation: - Expand Docs/Observability.md with complete APM section - Include metrics reference, integration examples (Prometheus, Datadog, NewRelic) - Add troubleshooting guide and best practices for cardinality management - Update Tasks.md to mark APM task as complete Metrics exposed: - bangui_http_requests_total: HTTP request count by method, endpoint, status - bangui_http_request_duration_seconds: Request latency histogram - bangui_http_active_requests: Active request gauge - Web Vitals: CLS, FCP, INP, LCP, TTFB with ratings - API metrics: endpoint, method, status, duration, timestamp Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>- All backend routers moved to /api/v1/ prefix - Frontend BASE_URL updated to /api/v1 - Setup redirect middleware updated to redirect to /api/v1/setup - Health router path fixed: prefix=/api/v1/health, @router.get('') - conftest.py: set server_status=online for test fixture - Created Docs/API_VERSIONING.md with deprecation policy - Updated Docs/Backend-Development.md with versioning section - Updated Instructions.md curl examples Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>- Rework Login As Admin: use sessionStorage flag + relative fetch login + polling loop - Add data-testid to JailDetailPage error render path - Add Collections library import for Get From List keyword - Fix /jails API response extraction (returns {items, total} not plain list) - Change Close Context to Close Browser for proper browser cleanup - Add domcontentloaded + Sleep + polling to Config test to avoid premature timeout Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>auth.resource: - add Login Via HTTP keyword for RequestsLibrary auth (CSRF-aware) - fix session_duration_minutes type: bare int → ${60} - add Process library import to common.resource 03_blocklist_import.robot: - fix selector to button[data-testid] (was matching all buttons) - use GET/POST On Session with auth session for blocklist API calls - fix log response key: entries → items - fix enabled=true → ${TRUE} for boolean type - fix ${len(sources)} → Get Length keyword - make Ensure Blocklist Source Exists accept session argument - replace strict error assertion with specific error banner check - add graceful Terminate Process teardown 02_ban_records.robot: - add Process library import Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>