Commit Graph

215 Commits

Author SHA1 Message Date
c96b87ee8b feat: reject common passwords in SetupRequest
- Add ~75 common plaintext passwords to setup.py validator
- Check case-insensitively; passes complexity but blocked
- Add tests: reject common, accept unique, short common fail on length
- Update Security.md docs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 18:25:17 +02:00
5f0ab40816 refactor(backend): clean up models setup, improve ip utils, add adr docs
- Extract ADR documents for architectural decisions (SQLite, FastAPI, React, APScheduler, Scheduler)
- Refactor setup.py: improve code structure and readability
- Add IP validation utilities with test coverage
- Update frontend components (BanTable, HistoryPage)
- Add pre-commit hooks and CONTRIBUTING.md
- Add .editorconfig for consistent coding standards
2026-05-03 18:04:45 +02:00
5058a50143 Refactor backend: fix geo cache cleanup, scheduler heartbeat, correlation middleware; update docs 2026-05-03 16:02:40 +02:00
896751ada9 fix: handle socket close errors properly in PapertrailLogHandler
- Replace contextlib.suppress with try/except + warning log
- Add test for fail2ban client
- Remove stale Issue #21 from Tasks.md (indexes)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 12:25:14 +02:00
Copilot
22db607875 Add fail2ban DB index management and socket-based path resolution
- New get_fail2ban_db_path() in setup_service resolves DB path from configured socket path
- New ensure_fail2ban_indexes() creates missing performance indexes on bans table
- Call ensure_fail2ban_indexes on every startup before first ban query
- Remove completed tasks from Docs/Tasks.md
- Update Docs/PERFORMANCE.md with index findings
2026-05-03 12:17:31 +02:00
0133489920 Update observability docs and task utilities
- Add Observability.md documentation
- Standardize task logging with correlation_id support
- Add log_sanitizer utility for PII masking
- Update Tasks.md tracking
- Update geo_cache tasks and other task modules with correlation_id

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 11:52:09 +02:00
7b93499551 Refactor config loading and add status code docs
- Move config loading to dedicated ConfigLoader class with validation
- Add DATABASE_MIGRATIONS.md content to TROUBLESHOOTING.md
- Add API_STATUS_CODES.md documenting all API response codes
- Update runner.csx to use new config structure
- Add check_responses.py validation script
- Update config tests for new structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 11:52:01 +02:00
bd6170722a feat(geo): add cache hit/miss metrics and prewarm support
- Add _hits/_misses counters to GeoCache for cache hit/miss ratio tracking
- Reset counters on clear()
- Count hits before misses in lookup_batch() to avoid interleaving
- Add synchronous prewarm() using asyncio.create_task for fire-and-forget
- Add hits/misses fields to GeoCacheStatsResponse model
- Add TestCacheMetrics (5 tests), TestPrewarm (3 tests), TestLargeBanList (2 tests)
- Fix _make_async_db() mock: db.execute is not async, returns ctx manager
- Move collections.abc to TYPE_CHECKING block (TC003)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 00:35:47 +02:00
0817a4cb47 fix(regex_validator): add ReDoS detection via regexploit
Detect catastrophic backtracking patterns before regex compilation
using regexploit library. Add ReDoSDetectedError exception and
_MINIMUM_STARRINESS threshold (>=3) to catch dangerous patterns
like (a+)+b. Update pyproject.toml deps, add tests for detection.
2026-05-03 00:05:33 +02:00
e436727942 fix: atomic upsert for import runs (Issue #12)
Replace check-then-insert race condition with INSERT ON CONFLICT.
- upsert_pending uses RETURNING id for atomic upsert
- UNIQUE(source_id, content_hash) constraint from migration 6
- blocklist_import_workflow updated to use upsert_pending
- test_import_source_success fixed for async mock patterns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 23:39:43 +02:00
1285bc8571 feat: comprehensive health check with DB, scheduler, cache
- Add /api/v1/health endpoint with component-level checks
- Verify DB connectivity, fail2ban socket, scheduler, session cache
- Add SQLite WAL cleanup on startup (orphan crash files)
- Migration 8: import_log.timestamp → INTEGER UNIX epoch
- Align import_log timestamps with history_archive (already UNIX int)
- Add unit tests for DB cleanup and health router

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 23:03:57 +02:00
cc6dbcf3f0 feat: implement API versioning /api/v1/
- All backend routers moved to /api/v1/ prefix
- Frontend BASE_URL updated to /api/v1
- Setup redirect middleware updated to redirect to /api/v1/setup
- Health router path fixed: prefix=/api/v1/health, @router.get('')
- conftest.py: set server_status=online for test fixture
- Created Docs/API_VERSIONING.md with deprecation policy
- Updated Docs/Backend-Development.md with versioning section
- Updated Instructions.md curl examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 21:29:30 +02:00
0d5882b32f Fix HIGH priority issues: unbounded queries, rate limiting, health checks
Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 21:47:36 +02:00
1af67eb0ce Add Application Performance Monitoring (APM) with Prometheus metrics
- Backend: Implement Prometheus metrics collection
  - Add prometheus-client dependency
  - Create metrics utility module with HTTP request tracking counters, histograms, gauges
  - Implement MetricsMiddleware to track request latency, count, and active requests
  - Add /metrics endpoint to expose metrics in Prometheus text format
  - Normalize paths to prevent cardinality explosion (e.g., /api/{id} for UUIDs)
  - Exclude /metrics and /health from detailed tracking

- Frontend: Add web vitals and API metrics collection
  - Install web-vitals library (v4.0.0) for Core Web Vitals tracking
  - Create metrics utility module for FCP, LCP, CLS, INP, TTFB collection
  - Implement useTrackedFetch hook for automatic API call metrics (method, endpoint, status, duration)
  - Initialize web vitals tracking in App component on mount
  - Provide exportMetrics() for sending metrics to backend

- Testing:
  - Add comprehensive backend metrics tests (9 tests, 100% coverage)
  - Add comprehensive frontend metrics tests (10 tests)
  - All tests passing

- Documentation:
  - Expand Docs/Observability.md with complete APM section
  - Include metrics reference, integration examples (Prometheus, Datadog, NewRelic)
  - Add troubleshooting guide and best practices for cardinality management
  - Update Tasks.md to mark APM task as complete

Metrics exposed:
- bangui_http_requests_total: HTTP request count by method, endpoint, status
- bangui_http_request_duration_seconds: Request latency histogram
- bangui_http_active_requests: Active request gauge
- Web Vitals: CLS, FCP, INP, LCP, TTFB with ratings
- API metrics: endpoint, method, status, duration, timestamp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:33:14 +02:00
37078b742b Implement structured logging to centralized platforms (Datadog, Papertrail, ELK)
This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md
- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py
- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)

### 3. Configuration: app/config.py
- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization

### 4. Integration: app/main.py
- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shutdown on termination

### 5. Tests: backend/tests/test_external_logging.py
- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling.
   Failures are logged locally but don't impact application.

2. **Buffering Strategy**: In-memory buffer with configurable size prevents
   unbounded memory growth. When buffer fills, oldest logs are dropped with
   a warning.

3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried
   with exponential backoff. Permanent failures (bad credentials) are logged
   and skipped.

4. **Disabled by Default**: External logging is opt-in via environment
   variables, maintaining backward compatibility with existing deployments.

5. **Provider Flexibility**: Support for multiple platforms allows users to
   choose based on their infrastructure (cloud-native, on-premise, etc).

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:25:26 +02:00
60d9c5b340 Refactor filter configuration with regex validation
- Add regex validation utility for query strings
- Update filter_config_service to use regex validation
- Add comprehensive test coverage for regex validator
- Update exception handling for validation errors
- Update documentation for tasks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:17:12 +02:00
8138857ee1 feat: Implement session secret rotation support
Adds support for gradual session secret rotation without forcing logout:

- Add BANGUI_SESSION_SECRET_PREVIOUS config field for rotation window
- Implement unwrap_session_token_with_rotation() to accept tokens signed with
  either current or previous secret
- Update validate_session() to transparently accept old tokens during rotation
- Update logout() to accept tokens from both secrets
- Add comprehensive logging for rotation events and metrics
- Add 8 new tests covering all rotation scenarios
- Update documentation with step-by-step rotation strategy
- Update .env.example with previous secret field

Key features:
- No forced logout: old tokens continue working during rotation window
- Transparent validation: old tokens are automatically logged for monitoring
- Production-safe: can rotate secrets without service interruption
- Metrics-ready: logs track token rotation for observability

Rotation workflow:
1. Generate new secret and set BANGUI_SESSION_SECRET
2. Set BANGUI_SESSION_SECRET_PREVIOUS to old secret
3. Wait for old tokens to expire (≥ session_duration_minutes)
4. Unset BANGUI_SESSION_SECRET_PREVIOUS to complete rotation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:01:11 +02:00
67b26a3ef7 Refactor pagination with cursor-based support and standardized response format
- Implement cursor-based pagination in pagination.py
- Update response models to standardize pagination structure
- Add cursor pagination utilities for repositories
- Update HistoryArchiveRepository and ImportLogRepository with new pagination
- Add comprehensive tests for cursor pagination
- Update documentation for backend development and task tracking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 17:54:05 +02:00
4f7316c484 Add unified RequestValidationError handler to unify error response schema
- Add RequestValidationError handler that converts Pydantic validation errors to unified ErrorResponse format
- Ensures all error responses return consistent schema: code, detail, metadata, correlation_id
- Add field_errors count and first_field location to metadata for validation errors
- Register handler in exception handler hierarchy before HTTPException handler
- Add comprehensive tests for validation error responses
- Update Backend-Development.md documentation to include correlation_id field and validation error details
- All 44 error-related tests pass (38 existing + 6 new validation tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 15:49:39 +02:00
0221e423f2 Fix pagination metadata return structure and test assertions
The API pagination infrastructure was already correctly implemented with:
- PaginatedListResponse base model containing 'items' and 'pagination' fields
- PaginationMetadata object with all required fields (page, page_size, total, total_pages, has_next_page, has_prev_page)
- All services correctly calling create_pagination_metadata()

However, there were two bugs preventing tests from passing:

1. IMPORT BUG: time_utils.py was importing TIME_RANGE_SECONDS from app.models.ban
   when it's actually defined in app.models._common. This caused import errors
   in tests that exercise time-range filtering.

2. TEST BUG: Test assertions were using outdated API structure, accessing
   .total, .page, .page_size directly on paginated responses instead of
   through the .pagination object.

   Fixed locations:
   - test_mappers/test_ban_mappers.py: 3 assertions updated to use .pagination.*
   - test_services/test_blocklist_service.py: 6 assertions updated
   - test_services/test_history_service.py: 14 assertions updated

All paginated API endpoints now correctly return pagination metadata:
- GET /api/history
- GET /api/history/archive
- GET /api/dashboard/bans
- GET /api/jails/{name}/banned
- GET /api/blocklists/log

Verified with 24 passing pagination tests demonstrating correct behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 15:42:05 +02:00
73021429f7 refactor: restructure API pagination metadata for better frontend usability
- Create PaginationMetadata model with computed derived fields (total_pages, has_next_page, has_prev_page)
- Update PaginatedListResponse to embed pagination metadata in a separate 'pagination' object
- Add create_pagination_metadata() factory function in utils/pagination.py for consistent computation
- Update all paginated service functions to use new structure:
  - history_service.list_history()
  - blocklist_service.get_import_logs()
  - jail_service.get_jail_banned_ips()
  - ban_mappers.map_domain_dashboard_ban_list_to_response()
- Update response model docstrings with new structure examples
- Update Backend-Development.md documentation with new pagination patterns
- Update test fixtures to work with new response structure

Response shape changes from:
  {"items": [...], "total": 100, "page": 1, "page_size": 50}
To:
  {"items": [...], "pagination": {"page": 1, "page_size": 50, "total": 100, "total_pages": 2, "has_next_page": true, "has_prev_page": false}}

Benefits:
- Frontend receives all pagination state needed for UI controls
- No need for frontend to calculate total_pages or page navigation logic
- Consolidated pagination metadata reduces field sprawl
- OpenAPI schema automatically reflects changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:24:42 +02:00
05c3b564ae Refactor scheduler lock implementation with heartbeat mechanism
- Add heartbeat-based lock renewal in scheduler_lock_heartbeat.py
- Update scheduler_lock.py with improved lock management
- Add comprehensive tests for scheduler lock functionality
- Update deployment and task documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:10:38 +02:00
94d6352d1d Fix health check endpoint to return 503 when fail2ban is offline
The health check endpoint now properly indicates service unavailability:
- Returns HTTP 200 when fail2ban is online
- Returns HTTP 503 when fail2ban is offline

This allows Docker and other orchestration tools to correctly detect when
fail2ban is unreachable and automatically restart the backend container,
preventing the situation where Docker treats the container as healthy
despite fail2ban being down.

Changes:
- Update GET /api/health to return 503 on fail2ban offline
- Return appropriate JSON response bodies for each state
- Update tests to verify both online (200) and offline (503) scenarios
- Update Dockerfile HEALTHCHECK documentation
- Add Health Checks section to Deployment.md documentation

All tests pass with 100% coverage on health.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:56:42 +02:00
52f237d5d4 Make background tasks idempotent - prevent duplicate bans on retry
CRITICAL FIX: Background tasks (especially blocklist_import) crashed mid-execution,
leaving partial state. On retry, the same bans were applied again, causing duplicates.

Solution: Content-hash based operation tracking for blocklist imports:
- Added import_runs table (migration 6) to track operations by source + content hash
- Before banning, check if this exact content has already been imported
- If completed: skip banning (already done), optionally re-warm cache
- If new or failed: proceed with ban and mark as completed or failed

Changes:
- Database: Migration 6 adds import_runs table with operation state tracking
- Model: Added ImportRunEntry for import run records
- Repository: New import_run_repo module with CRUD operations
- Workflow: Updated blocklist_import_workflow to check operation history before banning
- Dependencies: Registered import_run_repo for dependency injection
- Tests: Added test_import_source_idempotent_on_retry and test_import_source_different_content_not_reused
- Documentation: Added Task Idempotency section to Backend-Development.md

Verification:
- All 7 import tests pass (5 existing + 2 new idempotency tests)
- Type checking: mypy --strict 
- Linting: ruff 
- No API changes, backwards compatible via automatic migration

Fixes: Background tasks not idempotent #CRITICAL

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:54:14 +02:00
400ab1a3f1 Add security headers middleware and documentation
- Add SecurityHeadersMiddleware to backend/app/main.py
  - Implements Content-Security-Policy: default-src 'self'
  - Implements X-Frame-Options: DENY (clickjacking protection)
  - Implements X-Content-Type-Options: nosniff (MIME-sniffing protection)
  - Implements X-XSS-Protection: 1; mode=block (browser XSS filters)
- Add CSP meta tag to frontend/index.html for defense-in-depth
- Create Docs/Security.md with comprehensive security headers documentation
- Add test suite (backend/tests/test_security_headers_middleware.py) with 5 tests
  - Tests verify headers are present on success and error responses
  - Tests ensure all four security headers are correctly set
- All existing tests continue to pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:33:08 +02:00
3bd9848a08 Implement global rate limiter and refactor auth middleware
- Add global rate limiter utility with configurable limits and cleanup
- Move rate limiting logic to middleware for consistent application
- Update auth routes to use new rate limiter
- Add comprehensive tests for rate limiter functionality
- Update documentation with backend development guidelines and tasks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:26:31 +02:00
c4ede71fa6 Fix: Enforce single-worker deployment for session cache cluster safety
Addresses: Backend session cache not cluster-safe (multi-worker issue)

Problem:
- Session cache is process-local (InMemorySessionCache)
- Multi-worker deployments (uvicorn --workers N) create separate processes
- Each process has its own independent session cache
- Sessions cached in Worker A are invisible to Workers B, C, D
- Users randomly logged out when requests land on different workers
- Also affects RuntimeState, rate limiter, and background jobs

Solution (Option A - Strict single-worker enforcement):
- Enhance startup validation with clearer error messages
- Update error messages to explain the problem and how to fix it
- Document single-worker requirement prominently in Docker configs
- Update module docstrings to clarify constraints

Changes:
1. app/startup.py:
   - Enhanced _check_single_worker_mode() error message with troubleshooting
   - Enhanced _stage_check_worker_mode_and_acquire_lock() error message
   - Removed unused import

2. app/utils/session_cache.py:
   - Updated module docstring to explain constraints more clearly
   - Added references to deployment documentation
   - Clarified multi-worker solution for future implementation

3. app/utils/runtime_state.py:
   - Updated module docstring with deployment constraint references
   - Aligned messaging with session_cache.py

4. Docker/Dockerfile.backend:
   - Added comprehensive comments about single-worker requirement
   - Explained impact in multi-worker deployments
   - Referenced deployment constraints documentation

5. Docker/docker-compose.yml, compose.prod.yml, compose.debug.yml:
   - Added documentation comments about BANGUI_WORKERS constraint
   - Explained why single-worker is required

6. backend/tests/test_startup_integration.py:
   - Fixed test unpacking to match function return signature (3 values, not 2)

This ensures multi-worker deployments fail loudly at startup with clear
guidance on what went wrong and how to fix it. The database-backed scheduler
lock provides defense-in-depth for container orchestration scenarios.

For future multi-worker support, implement:
- Redis or database-backed session cache
- Shared RuntimeState coordination
- Distributed APScheduler backend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:54:24 +02:00
9afdbe2852 Refactor auth and setup services
- Updated auth_service.py to improve authentication logic
- Modified setup_service.py for better configuration handling
- Added comprehensive tests for setup_service
- Updated documentation in Tasks.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:10:00 +02:00
277f2a467c Refactor rate limiting with exponential backoff strategy
- Update rate limiter to use exponential backoff instead of fixed limit
- Implement progressive delays for failed login attempts (0.5s, 1s, 2s, 4s, 5s max)
- Update auth router documentation and endpoint docs
- Refactor test suite to match new rate limiting behavior
- Update backend development documentation
- Clean up unused tasks documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:58:09 +02:00
2db635ae19 Fix exception handler overlap issue - add DomainError catch-all handler
**Problem:** Broad exception handlers created fragility where adding a new
DomainError subclass without explicit registration would silently fall through
to the generic exception handler, losing the specific error_code and metadata.

**Solution:**
1. Import DomainError in main.py for explicit handler registration
2. Fix type hints in exception handlers from 'Exception' to specific types
   - NotFoundError handler now typed as 'NotFoundError'
   - BadRequestError handler now typed as 'BadRequestError'
   - ConflictError handler now typed as 'ConflictError'
   - DomainError handler now typed as 'DomainError'
   - ServiceUnavailableError handler now typed as 'ServiceUnavailableError'
3. Add DomainError as an explicit catch-all handler in the registration chain
   - Positioned after specific handlers, before HTTPException
   - Any unregistered DomainError subclass now gets correct error_code + metadata
4. Document the exception handler hierarchy with detailed comments
5. Update Backend-Development.md with handler hierarchy documentation
6. Update Architekture.md section 2.2 with exception handler details
7. Fix test expectations in test_main.py to verify ErrorResponse format

**Impact:** Any new DomainError subclass now automatically gets correct HTTP 500
status, error_code, and metadata - even if developer forgets explicit handler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:44:43 +02:00
100fd47c4b Refactor: Make model packages true leaf nodes - remove app-layer dependencies
Models in app/models/ are now pure data classes with no cross-layer dependencies.
This ensures the models layer remains a true leaf node in the dependency graph.

Changes:
- Create app/models/_common.py with shared types (TimeRange, bucket_count, constants)
- Move TimeRange and time-range constants from ban.py to _common.py
- Update history.py, routers, and services to import from _common.py
- Remove imports from app.config and app.utils from config.py models
- Move field validators from models to router layer:
  - Add log_target validation in config_misc router
  - Add log_path validation in jail_config router
- Update test_models.py to reflect validators moved to router layer
- Update documentation (Architekture.md, Backend-Development.md) with model layering rules
- Fix import ordering and type annotations in affected files

Model layering rule: Models may only import from:
✓ Standard library and third-party packages (Pydantic, typing)
✓ Other models in app/models/ (sibling models)
✓ app.models.response (response envelopes)
✗ app.services, app.config, app.utils, or any application layer

Validation requiring app-level state (settings, allowed directories) now happens
at the router or service layer, not in model validators.

Fixes: Models were not true leaf nodes due to circular imports and app-layer dependencies

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:31:11 +02:00
3d1a6f5538 Implement frontend and backend observability alignment
Align frontend and backend error observability with correlation IDs and
structured telemetry for distributed tracing across systems.

Backend changes:
- Add CorrelationIdMiddleware to generate/extract correlation IDs
- Include correlation_id in all ErrorResponse objects
- Store correlation ID in structlog contextvars for automatic inclusion in logs
- Add correlation ID to response headers (X-Correlation-ID)

Frontend changes:
- API client automatically generates session-scoped UUID4 and includes
  X-Correlation-ID header in all requests
- Extract correlation ID from API error responses
- Update error handlers to use telemetry with correlation IDs
- Add telemetry logging to ErrorBoundary, PageErrorBoundary, SectionErrorBoundary
- Implement redaction utilities for privacy-safe logging of sensitive data

Documentation:
- Add observability guidelines to Web-Development.md
  * Correlation ID usage patterns
  * Privacy & security best practices
  * Telemetry event structure
  * Redaction utilities for sensitive data
- Add distributed tracing architecture section to Architecture.md
  * Correlation ID flow across frontend/backend
  * Example troubleshooting scenario
  * Implementation details for future enhancements

Testing:
- Add comprehensive tests for correlation middleware
- Update error boundary tests to verify telemetry integration
- Verify TypeScript and ESLint pass with no warnings

Fixes: Issue #40 - Frontend and backend observability are not aligned

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 18:32:19 +02:00
b6631b86e4 Add database migration 5: Indexes for history_archive query performance
- Add composite index on (jail, timeofban DESC) for dashboard filtering
- Add composite index on (timeofban DESC, jail, action) for time-range queries
- Add single-column indexes on ip and action for targeted filtering
- Update schema version to 5 and document in Backend-Development.md

Indexes optimize common dashboard and API query patterns with pagination.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:17:58 +02:00
187cd8250d Implement database-backed scheduler lock for multi-worker safety
Enforce single-executor safety regardless of process launcher through a
robust database-backed lock mechanism that works reliably in container
orchestration environments.

Key changes:
1. Add scheduler_lock table to database schema (migration 4)
   - Singleton row (id=1) prevents concurrent execution
   - Stores PID, hostname, creation timestamp, heartbeat timestamp
   - Atomic transaction prevents race conditions

2. Create scheduler lock utility (app/utils/scheduler_lock.py)
   - acquire_scheduler_lock(): Atomically acquire or fail
   - release_scheduler_lock(): Clean up on shutdown
   - update_scheduler_lock_heartbeat(): Keep lock alive (every 10 seconds)
   - get_scheduler_lock_info(): Debug/inspect lock status
   - Stale lock detection: TTL-based (60 second expiry)

3. Reorder startup DAG stages
   - DATABASE now comes first (required for lock acquisition)
   - WORKER_MODE depends on DATABASE (performs lock check after initialization)
   - Maintains all other stage dependencies intact

4. Update startup process (app/startup.py)
   - Replace _check_single_worker_mode() with two-tier check:
     * Fast check: BANGUI_WORKERS env var (if explicitly set to >1)
     * Authoritative check: Database lock (catches misconfiguration)
   - Return startup_db from startup_shared_resources() for lock management

5. Register scheduler lock heartbeat task
   - New task: scheduler_lock_heartbeat (app/tasks/scheduler_lock_heartbeat.py)
   - Updates lock heartbeat every 10 seconds (keeps lock alive)
   - Prevents false positives from temporary load spikes

6. Add lock release to lifespan shutdown (app/main.py)
   - Release lock before closing database
   - Allows other instances to acquire during rolling deployments
   - Graceful handoff between instances

7. Comprehensive test coverage (backend/tests/test_scheduler_lock.py)
   - Lock acquisition success and failure cases
   - Stale lock cleanup on startup
   - Lock release and heartbeat updates
   - Full lifecycle: acquire → heartbeat → release

8. Update documentation (Docs/Architekture.md § 9.3)
   - Explain single-executor requirement
   - Document database-backed locking mechanism
   - Compare with alternative approaches (filesystem, env var)
   - Include troubleshooting guide
   - Container orchestration examples (Docker, Kubernetes, systemd)

Why database-backed instead of filesystem?
   - Atomicity: SQLite transactions prevent TOCTOU race windows
   - Container-safe: Works across containers with shared DB volumes
   - No NFS/SMB edge cases
   - Timestamp-based stale detection (PID reuse is unreliable)
   - More reliable in rolling deployments

Benefits:
   - Works with any process manager (uvicorn, gunicorn, etc.)
   - Handles simultaneous startup attempts correctly
   - Automatic failover on instance crash (stale lock cleanup)
   - Clear error messages with troubleshooting steps
   - No environment variable required (lock is authoritative)
   - Scales to multi-worker deployments if combined with external job store

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 20:10:53 +02:00
6bc440dce4 Refactor backend configuration and authentication
- Add comprehensive documentation for backend development
- Improve client IP detection with utility functions and tests
- Update auth router with better error handling
- Refactor config module with environment-based settings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 19:39:55 +02:00
18036d53bf Fix issue #31: Make schedule reschedule deterministic and observable
Replace fire-and-forget reschedule pattern with proper async/await:
- Changed reschedule() from fire-and-forget to awaitable async function
- Errors are now properly propagated instead of silently failing
- Added structured logging for reschedule start and completion
- Schedule updates are now deterministic and observable to callers

Changes:
- app/tasks/blocklist_import.py: Convert reschedule to async, remove asyncio.ensure_future
- tests/test_tasks/test_blocklist_import.py: Add tests for error propagation and logging
- Docs/Features.md: Document scheduling reliability guarantees

All 15 blocklist_import tests pass with 100% coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 19:24:55 +02:00
1302ac821f Fix non-atomic setup persistence across DB contexts (Issue #30)
Implement transactional setup with explicit state machine and crash-safety
to prevent partial commits from leaving inconsistent state.

## Changes

### Core Implementation
1. **settings_repo.py**: Add atomic batch settings write
   - New set_settings_batch() method: writes multiple settings in single
     transaction (BEGIN IMMEDIATE ... COMMIT). Either all settings persist
     or none do, preventing partial state if crash occurs mid-batch.

2. **setup_service.py**: Refactor run_setup() with transactional phases
   - Phase 0: Compute password hash early (before any DB writes) to ensure
     idempotency. Same hash is used throughout retries, preventing divergent
     hashes from bcrypt's random salt.
   - Phase 1 (Bootstrap DB transaction): Set setup_state=in_progress and
     database_path, then commit. First checkpoint for crash detection.
   - Phase 2 (Filesystem): Initialize runtime database (idempotent)
   - Phase 3 (Runtime DB transaction): Batch-write all settings atomically
   - Phase 4 (Bootstrap DB transaction): Set setup_state=complete and
     setup_completed=1. Final commit point.

3. **protocols.py**: Add set_settings_batch to SettingsRepository protocol

### Testing
- Added 6 new transactionality tests covering:
  - State machine transitions (None → in_progress → complete)
  - Password hash idempotency across retries
  - Atomic batch writes (all-or-nothing persistence)
  - Bootstrap DB state tracking
  - Database path propagation to both DBs
  - Recovery on partial failure
- All 18 tests pass (12 existing + 6 new)

### Documentation
- Updated Docs/Architekture.md with new section 6:
  - Setup state machine with state transitions
  - Transaction boundary documentation
  - Password hash idempotency rationale
  - Backward compatibility notes

## Design Decisions

### Why This Approach
- Current code already idempotent via INSERT OR REPLACE, but password
  hash non-idempotency created silent inconsistency risk
- Simpler than multi-state machine: 2 states sufficient for detection
- Maintains backward compatibility (setup_completed key still written)
- Explicit transactions make crash-safety obvious to future maintainers

### Crash Scenarios Now Handled
1. Crash after Phase 1 → detected by setup_state=in_progress on retry
2. Crash after Phase 2 → runtime DB may be partial, safe to retry
3. Crash after Phase 3 → runtime DB rolls back on next connection
4. Crash after Phase 4 → setup_completed detected, skipped

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 19:19:53 +02:00
cc4370c50d feat: Add runtime DNS-rebinding protection for blocklist HTTP connections
## Problem
The blocklist URL validation at create/update time has a TOCTOU (time-of-check-to-time-of-use) window.
An attacker can perform a DNS-rebinding attack where:
1. User adds blocklist URL pointing to attacker.com
2. At create time, attacker.com resolves to a public IP → validation passes
3. Later, when fetching, attacker.com resolves to 192.168.1.1 (internal network)
4. HTTP client connects to the private IP, potentially accessing internal services

## Solution
Add runtime destination IP validation at connection time via a custom socket factory:

- Created 'dns_validated_connector.py' with create_dns_validated_socket_factory() that validates
  all resolved IPs before socket creation
- HTTP session now uses the validated socket factory, protecting all blocklist imports globally
- Rejects connections to RFC 1918 private ranges, loopback, link-local, ULA, multicast, and
  reserved addresses (IPv4 and IPv6)
- Added comprehensive test coverage with 13 test cases

## Changes
- backend/app/services/dns_validated_connector.py: Custom socket factory with IP validation
- backend/app/startup.py: Use DNS-validated socket factory in HTTP session creation
- backend/app/utils/ip_utils.py: Updated docstring explaining runtime validation
- backend/app/services/blocklist_downloader.py: Updated module docstring
- backend/app/services/blocklist_service.py: Updated docstrings explaining two-layer protection
- backend/tests/test_services/test_dns_validated_connector.py: Test suite for socket factory
- Docs/Architekture.md: Added detailed section on DNS-rebinding protection

## Testing
- All 13 DNS validation tests pass
- All blocklist downloader tests pass (unaffected by changes)
- Linting: ruff, mypy pass with --strict
- Test coverage: 90% line coverage on dns_validated_connector.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-29 19:10:51 +02:00
a2129bb9bd Pagination contract is not standardized across endpoints 2026-04-28 21:40:22 +02:00
b27765928a Standardize API response envelopes: use items for collection responses and update tests 2026-04-28 20:48:00 +02:00
1c673d600c Standardize API response envelope shapes across all endpoints
This commit standardizes how API responses are wrapped, solving issue #24.

Problem:
- Inconsistent response envelopes (jails vs items vs bans vs no wrapper)
- Frontend required multiple field name variants
- Integration bugs from branching logic
- No clear pattern for different response types

Solution:
- Created response.py with base classes: PaginatedListResponse,
  CollectionResponse, CommandResponse
- Standardized all list/collection responses to use 'items' field
- Domain-specific field names for detail and aggregation responses
- Updated all backends routers and mappers
- Updated frontend types and hooks to match

Changes:
Backend:
- backend/app/models/response.py (new): Base response models
- backend/app/models/ban.py: Updated responses to inherit from bases
- backend/app/models/jail.py: Updated JailListResponse, JailCommandResponse
- backend/app/models/config.py: Updated collection responses
- backend/app/services/jail_service.py: Updated return statements
- backend/app/mappers/ban_mappers.py: Updated 'bans' to 'items'
- backend/tests/test_mappers/test_ban_mappers.py: Updated tests

Frontend:
- frontend/src/types/jail.ts: Updated response interfaces
- frontend/src/types/config.ts: Updated response interfaces
- frontend/src/hooks/useActiveBans.ts: Updated selector
- frontend/src/hooks/useJailList.ts: Updated selector
- frontend/src/hooks/useJailConfigs.ts: Updated selector
- frontend/src/hooks/useConfigActiveStatus.ts: Updated field access
- frontend/src/hooks/useJailAdmin.ts: Updated field access

Documentation:
- Docs/Backend-Development.md: Added § 4.1 API Response Envelope Policy

The policy defines:
1. Paginated lists use PaginatedListResponse (items, total, page, page_size)
2. Non-paginated collections use CollectionResponse (items, total)
3. Detail responses use entity-specific field names (jail, status, settings)
4. Command responses use CommandResponse (message, success, optional target)
5. Aggregations use domain-specific fields (jails, countries, buckets, bans)

All responses now follow one of these patterns, reducing frontend complexity.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 10:12:55 +02:00
e86ab6dad1 10) Implement explicit startup DAG for resource initialization
- Created StartupDAG class to orchestrate startup stages with explicit dependencies
- Defined 6 startup stages: WORKER_MODE → DATABASE → GEO_CACHE → HTTP_SESSION → SCHEDULER → TASKS
- Each stage has prerequisites, error handling, and rollback support
- Refactored startup_shared_resources() to use the DAG
- Added StartupContext for resource tracking and failure management
- Partial failures automatically roll back all completed resources in reverse order
- Added health checks to verify all resources initialized successfully
- Comprehensive test coverage: 15 DAG unit tests + 3 integration tests + 6 existing tests
- Documented startup DAG in Architekture.md with detailed stage descriptions and failure modes

This replaces implicit ordering with explicit dependency tracking, making lifecycle
changes safe and failure modes predictable. Hidden order dependencies no longer exist.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 08:08:05 +02:00
a273b96563 feat: Complete repository protocol coverage
- Add missing protocol methods to Fail2BanDbRepository:
  - get_ban_event_counts: Aggregate ban events per IP (used in ban_service)

- Add missing protocol methods to GeoCacheRepository:
  - delete_stale_entries: Remove old geo cache entries (used in geo_cache_cleanup)

- Add missing protocol methods to HistoryArchiveRepository:
  - purge_archived_history: Remove archived entries older than age threshold

- Add comprehensive protocol compliance tests:
  - Created test_protocol_compliance.py with 8 test classes
  - Validates all 7 repository modules fully implement their protocols
  - Prevents silent protocol drift when methods change signatures
  - Tests verify no unexpected public methods in repository modules

- Update documentation:
  - Add Repository Protocol Coverage Checklist to Backend-Development.md
  - Document procedure for adding new repositories with protocol definitions
  - List current protocol coverage (all 7 repositories, 40 total methods)

- All repositories now have 100% protocol coverage:
  - SessionRepository: 4 methods
  - SettingsRepository: 4 methods
  - BlocklistRepository: 6 methods
  - ImportLogRepository: 4 methods
  - GeoCacheRepository: 13 methods
  - HistoryArchiveRepository: 5 methods
  - Fail2BanDbRepository: 8 methods

This ensures:
- Enhanced mockability for testing
- Static contract verification
- Prevention of protocol drift
- Better IDE support and type checking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 07:58:57 +02:00
52a4d04d92 Task 8: Standardize modeling style (TypedDict vs Pydantic)
Convert inconsistent modeling style to standardized Pydantic models for all
external-facing data structures while maintaining TypedDict compatibility where
appropriate for internal layer-private structures.

Changes:
- Converted IpLookupResult TypedDict to use IpLookupResponse Pydantic model
  in jail_service.lookup_ip() for consistency with routers
- Added GeoCacheEntry Pydantic model for geo cache repository rows
- Converted GeoCacheRow TypedDict to use GeoCacheEntry alias
- Converted ImportLogRow TypedDict to use ImportLogEntry alias
- Updated routers and services to work with Pydantic models
- Updated all tests to use Pydantic model field access (attributes)
  instead of dict subscripting

Documentation:
- Added 'Model Type Usage by Layer' section to Backend-Development.md
- Defines when TypedDict is allowed (internal structures) vs Pydantic
  (external-facing, cross-boundary data)
- Provides clear guidance on modeling conventions per layer

Benefits:
- Consistent validation and serialization behavior
- Better IDE support and type checking
- Clearer separation of concerns by layer
- Reduced maintenance cost from mixed validation approaches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 07:53:30 +02:00
3888c5eb3f Refactor ban management with domain models and mappers
- Add ban domain model for core business logic separation
- Implement mapper pattern for DTO/domain conversions
- Update ban service with new domain-driven approach
- Refactor router endpoints to use new architecture
- Add comprehensive mapper tests
- Update documentation with architecture changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 07:46:02 +02:00
afc1e44e99 Implement centralized exception handling and validation
- Add custom exception classes for structured error handling
- Implement global exception handlers in FastAPI application
- Add comprehensive request/response validation
- Create exception contract tests for validation
- Update backend development documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-27 18:52:12 +02:00
2e221f6852 Refactor: Move module-level mutable flags to JailServiceState
TASK-004: Replace module-level mutable runtime flags in service layer with
injected state holder, eliminating hidden global state and improving testability
and synchronization boundaries.

Changes:
- Create JailServiceState dataclass in app/utils/runtime_state.py to hold
  backend capability cache and synchronization lock
- Add JailServiceState as a field in RuntimeState (with default_factory)
- Remove module-level _backend_cmd_supported and _backend_cmd_lock from
  jail_service.py
- Refactor _check_backend_cmd_supported() to accept state parameter
- Inject JailServiceState into list_jails() and _fetch_jail_summary() via
  parameters
- Add get_jail_service_state() dependency provider in app/dependencies.py
- Add JailServiceStateDep type alias for router injection
- Update jails router to receive and pass state to service functions
- Update all tests to use jail_service_state fixture and pass state to functions
- Remove duplicate _MAX_PAGE_SIZE constant definition
- Document mutable state management in Backend-Development.md
- Update Architecture.md to describe JailServiceState and state nesting pattern

Benefits:
- Eliminates global mutable state and associated race conditions
- Makes state visible to callers (not hidden in module scope)
- Enables test isolation (each test gets fresh state)
- Prepares codebase for multi-worker deployments (state can be extracted to
  shared backend)
- Synchronization boundaries are now explicit (state.get_backend_cmd_lock())

Compliance:
- All tests pass (17 passed in TestListJails, TestGetJail, TestLockInitialization)
- No ruff linting errors
- Type-safe: JailServiceState properly typed with asyncio.Lock, bool | None

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-27 18:42:52 +02:00
e08a16c7dd Refactor: Split blocklist import flow into focused components
Extracted the monolithic import_source() function (776 lines) into focused,
testable components with clear single responsibilities:

- BlocklistDownloader: HTTP download with exponential backoff retry logic
  * Handles transient failures (429, 5xx errors, timeouts)
  * Configurable retry attempts and backoff strategy
  * 93% test coverage

- BlocklistParser: Parse and validate IP addresses
  * Extract valid IPv4/IPv6 addresses from text
  * Skip CIDRs and malformed entries gracefully
  * Separate parsing from validation concerns
  * 100% test coverage

- BanExecutor: Ban execution with error handling
  * Ban IPs via fail2ban socket
  * Stop on JailNotFoundError (jail doesn't exist)
  * Continue on JailOperationError (individual ban failures)
  * 100% test coverage

- BlocklistImportWorkflow: Thin orchestrator
  * Coordinates the download → parse → ban → log flow
  * Pre-warms geo cache with newly banned IPs
  * 96% test coverage

- blocklist_service.py: Maintains public API
  * Source CRUD (create, read, update, delete)
  * URL validation and preview functionality
  * Scheduling configuration and import triggers
  * 92% test coverage

Benefits:
* Each component is independently testable with mock dependencies
* Error handling is explicit and localized
* Components can evolve independently
* Logging is contextual and clear
* Retry and transient error handling are isolated

Testing:
* All 36 existing blocklist_service tests pass
* All 13 blocklist import task tests pass
* Added 17 comprehensive component unit tests
* Combined 96%+ coverage on new modules
* Zero type errors in new code

Documentation:
* Updated Refactoring.md with detailed architecture notes
* Added component architecture diagram to Architekture.md
* Documented ownership and responsibilities of each component

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-27 18:34:11 +02:00
3bbf413c55 refactor: Make service dependencies explicit and injectable
Remove hidden cross-service coupling by making dependencies explicit through
dependency injection while maintaining backward compatibility via lazy imports.

Key changes:
- history_service and ban_service: Removed direct module-level imports of
  fail2ban_metadata_service, added optional service parameters to functions
- Added get_fail2ban_metadata_service() provider to dependencies.py
- Updated history router to inject Fail2BanMetadataService dependency
- history_service functions now use lazy imports in fallback paths for
  backward compatibility when service is not explicitly injected
- All test patches updated to use internal _get_fail2ban_db_path() helper
- jail_config_service and jail_service already follow best practices

This pattern prevents circular imports, makes services testable via explicit
mocking, and documents service dependencies clearly.

Fixes: Instructions.md #2 - Hidden cross-service coupling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-27 18:26:08 +02:00
93021500c3 TASK-033: Remove session token from JSON response body
Fixes a critical security vulnerability where the session token was
being returned in the JSON response body of POST /api/auth/login.
This exposed the token to JavaScript, allowing malicious scripts to
steal it and bypass the HttpOnly cookie protection.

Changes:
- Backend: Remove 'token' field from LoginResponse model (auth.py)
- Backend: Update login() endpoint to return only 'expires_at'
- Frontend: Update LoginResponse type to exclude 'token' field
- Backend: Update test helper _login() to extract token from cookie
- Backend: Update test cases to verify token is NOT in response body
- Documentation: Add section 'Authentication Endpoints' in Backend-Development.md
- Documentation: Update Web-Development.md to explain HttpOnly cookie benefits

Security benefit: Session tokens are now only accessible via HttpOnly
cookies, protected from JavaScript access, XSS attacks, and malicious
third-party scripts. The frontend continues to use only the cookie for
authentication.

All auth tests pass (23 tests). Type checking and linting pass with
zero errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-26 19:38:33 +02:00