Commit Graph

310 Commits

Author SHA1 Message Date
eb339efcfd Add Kubernetes liveness/readiness probes and middleware order validation
- Split /health into /health/live (liveness) and /health/ready (readiness)
  following Kubernetes conventions. Combined /health retained for backward
  compatibility with existing Docker HEALTHCHECK definitions.
- Add ReadyCheck and ReadyResponse models for structured readiness output.
- Add _assert_middleware_order() startup check enforcing:
  RateLimit → Csrf → CorrelationId middleware chain.
- Register CorrelationIdMiddleware, CsrfMiddleware, RateLimitMiddleware
  in create_app() with documented required order (reverse of processing).
- Add correlation.py, csrf.py, rate_limit.py middleware modules.
- Add health probe tests in test_health_probes.py.
- Update test_main.py with middleware order assertion tests.
- Update frontend useFetchData hook tests.
- Docs: update Deployment.md with Kubernetes probe config examples.
2026-05-04 02:42:09 +02:00
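
The startup order check in eb339efcfd could look roughly like the sketch below. It assumes the check receives the registered middleware class names as a plain list; the function name and list-based input are illustrative, not the project's actual `_assert_middleware_order()` signature.

```python
from __future__ import annotations

# Required relative order, taken from the commit message above.
REQUIRED_ORDER = ("RateLimitMiddleware", "CsrfMiddleware", "CorrelationIdMiddleware")


def assert_middleware_order(
    registered: list[str],
    required: tuple[str, ...] = REQUIRED_ORDER,
) -> None:
    """Raise RuntimeError unless every required name is registered and the
    required names appear in `registered` in the same relative order."""
    positions = []
    for name in required:
        if name not in registered:
            raise RuntimeError(f"middleware {name!r} is not registered")
        positions.append(registered.index(name))
    if positions != sorted(positions):
        raise RuntimeError(
            f"middleware order {registered} violates required order {list(required)}"
        )
```

Unrelated middleware may sit between the required entries; only the relative order of the three is checked.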
65fe747cba feat(backend): add deprecation middleware and API versioning support
- Add deprecation middleware for warning headers on sunset endpoints
- Add jails_v2 router for API v2 migration path
- Update CI workflow with new test coverage
- Update API versioning documentation
- Remove completed tasks from Tasks.md
2026-05-04 00:03:52 +02:00
fc57c83f79 refactor: split pagination logic from response models
- Extract pagination logic to separate util module
- Update response models to use new pagination util
- Fix pagination calculation edge cases

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 22:57:21 +02:00
edebf1a339 feat(services): add ErrorContract enum and PartialResult type
Add typed wrappers for error handling patterns in error_handling.py:

- ErrorContract(enum): machine-checkable pattern selector with
  from_value() helper and string constants matching the existing
  ABORT_ON_ERROR/RETURN_DEFAULT/PARTIAL_RESULT module-level values
- ErrorEntry: typed error container for PARTIAL_RESULT (context + cause)
- PartialResult[T]: typed result wrapper for PARTIAL_RESULT operations

Existing string constants preserved for backward compat.
Updated module docstring with type annotation table and examples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 22:46:47 +02:00
dafe8d61e2 feat(security): add CSRF header constants and security-headers endpoint
Move X-BanGUI-Request header name/value to backend/app/utils/constants.py as single source of truth. Add GET /api/v1/config/security-headers endpoint. Update csrf middleware, frontend api client, and docs to use shared constants.
2026-05-03 22:06:43 +02:00
cee3daffc1 fix: enforce PRAGMA query_only on fail2ban DB and refactor CSRF cookie name
- Add _acquire_readonly_connection() that applies PRAGMA query_only=ON after connect
- Verify PRAGMA value back to catch URI flag bypasses
- Wrap in async context manager _readonly_connection() used by all repo methods
- Replace hardcoded '_SESSION_COOKIE_NAME' in CSRF middleware with import from
  app.utils.constants
- Remove completed Issues #45 and #46 from Docs/Tasks.md (Issue #46 now fixed,
  #45 cache invalidation deferred to auth refactor branch)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 21:47:42 +02:00
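
A minimal sketch of the read-only guard described in cee3daffc1, using the synchronous stdlib `sqlite3` module for brevity (the commit describes an async context manager). The function name is hypothetical; only the `PRAGMA query_only` set-then-verify step is taken from the commit.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def readonly_connection(path: str):
    """Yield a connection that rejects writes via PRAGMA query_only."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("PRAGMA query_only = ON")
        # Read the value back so a silently ignored PRAGMA is caught early.
        (value,) = conn.execute("PRAGMA query_only").fetchone()
        if value != 1:
            raise RuntimeError("PRAGMA query_only did not take effect")
        yield conn
    finally:
        conn.close()
```

With the PRAGMA active, any INSERT, UPDATE, or DELETE on the connection fails with `sqlite3.OperationalError`, while SELECT statements keep working.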
1c3dff31e8 feat(rate-limiting): add per-bucket limits and startup validation
- Add per-bucket rate limit config (ban, unban, import, config, jail, filter, action)
- Add process-local warning at startup for multi-worker deployments
- Document Redis migration path for shared state across workers
- Remove Issue #42 from Tasks.md (resolved)
2026-05-03 20:53:21 +02:00
c3cd1574dc fix(auth): invalidate session cache on login
Stale sessions from a stolen device could be reused up to the cache
TTL after a legitimate user re-logs in, because login never cleared
the existing cache entry.

Changes:
- Add invalidate_by_user(user_id) to SessionCache protocol
- InMemorySessionCache maintains a user_id -> set[token] index to
  support O(1) invalidation of all sessions for a given user
- NoOpSessionCache stub updated for API compatibility
- auth_service.login() now returns the Session object alongside
  signed_token and expires_at
- login router calls session_cache.invalidate_by_user(session.id)
  immediately after successful authentication

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 20:51:51 +02:00
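
The `user_id -> set[token]` index from c3cd1574dc can be sketched as follows. The class name, the `put`/`get` methods, and the choice to store only a user id per token are assumptions made for illustration; the real `SessionCache` protocol is not shown in this log.

```python
from __future__ import annotations

from collections import defaultdict


class InMemorySessionCacheSketch:
    """Process-local cache with a secondary index for per-user invalidation."""

    def __init__(self) -> None:
        self._sessions: dict[str, int] = {}  # token -> user_id
        self._tokens_by_user: defaultdict[int, set[str]] = defaultdict(set)

    def put(self, token: str, user_id: int) -> None:
        self._sessions[token] = user_id
        self._tokens_by_user[user_id].add(token)

    def get(self, token: str) -> int | None:
        return self._sessions.get(token)

    def invalidate_by_user(self, user_id: int) -> int:
        """Drop every cached session of `user_id`; return how many were removed."""
        tokens = self._tokens_by_user.pop(user_id, set())
        for token in tokens:
            self._sessions.pop(token, None)
        return len(tokens)
```

The index lookup is a single dictionary access; the removal itself is proportional to the number of sessions that user holds, not to the size of the whole cache.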
ae9313568e feat: enforce single-worker at startup
Fail with RuntimeError when WEB_CONCURRENCY or BANGUI_WORKERS > 1.

The in-memory session cache, rate-limit windows, and runtime state are
process-local. Running multiple workers silently causes stale limits,
ghost sessions, and inconsistent status.

Skipped when TESTING=1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 20:33:23 +02:00
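
A sketch of the startup guard from ae9313568e, assuming the worker counts arrive as environment variables holding integers. The function name is hypothetical, and rejecting non-integer values is an assumption the commit does not state.

```python
import os


def check_single_worker(env=None) -> None:
    """Raise RuntimeError when WEB_CONCURRENCY or BANGUI_WORKERS exceeds 1."""
    env = os.environ if env is None else env
    if env.get("TESTING") == "1":
        return  # the check is skipped under test, as the commit describes
    for var in ("WEB_CONCURRENCY", "BANGUI_WORKERS"):
        raw = env.get(var)
        if raw is None or raw.strip() == "":
            continue
        try:
            workers = int(raw)
        except ValueError:
            # Assumption: an unparsable value is treated as a configuration error.
            raise RuntimeError(f"{var}={raw!r} is not an integer") from None
        if workers > 1:
            raise RuntimeError(
                f"{var}={workers}: session cache and rate-limit state are "
                "process-local, so only a single worker is supported"
            )
```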
96525573fa Normalise IP addresses across backend
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 18:19:41 +02:00
5f0ab40816 refactor(backend): clean up models setup, improve ip utils, add adr docs
- Extract ADR documents for architectural decisions (SQLite, FastAPI, React, APScheduler, Scheduler)
- Refactor setup.py: improve code structure and readability
- Add IP validation utilities with test coverage
- Update frontend components (BanTable, HistoryPage)
- Add pre-commit hooks and CONTRIBUTING.md
- Add .editorconfig for consistent coding standards
2026-05-03 18:04:45 +02:00
2f9fc8076d refactor(backend): clean up jail service, add error handling service
- Extract jail status/processing to helper functions
- Add error_handling.py service for centralized error handling
- Update config.py with validation and defaults
- Update .env.example with all config options
- Remove obsolete Tasks.md, add Service-Development.md
- Minor fixes across routers and services

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 17:40:37 +02:00
2df029f7e8 refactor(ban_service): extract _bans_by_country_load_data helper
Break up a long function into a focused helper so that data loading is separate from aggregation.
2026-05-03 17:00:34 +02:00
5058a50143 Refactor backend: fix geo cache cleanup, scheduler heartbeat, correlation middleware; update docs
2026-05-03 16:02:40 +02:00
896751ada9 fix: handle socket close errors properly in PapertrailLogHandler
- Replace contextlib.suppress with try/except + warning log
- Add test for fail2ban client
- Remove stale Issue #21 from Tasks.md (indexes)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 12:25:14 +02:00
Copilot
22db607875 Add fail2ban DB index management and socket-based path resolution
- New get_fail2ban_db_path() in setup_service resolves DB path from configured socket path
- New ensure_fail2ban_indexes() creates missing performance indexes on bans table
- Call ensure_fail2ban_indexes on every startup before first ban query
- Remove completed tasks from Docs/Tasks.md
- Update Docs/PERFORMANCE.md with index findings
2026-05-03 12:17:31 +02:00
0133489920 Update observability docs and task utilities
- Add Observability.md documentation
- Standardize task logging with correlation_id support
- Add log_sanitizer utility for PII masking
- Update Tasks.md tracking
- Update geo_cache tasks and other task modules with correlation_id

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 11:52:09 +02:00
7b93499551 Refactor config loading and add status code docs
- Move config loading to dedicated ConfigLoader class with validation
- Add DATABASE_MIGRATIONS.md content to TROUBLESHOOTING.md
- Add API_STATUS_CODES.md documenting all API response codes
- Update runner.csx to use new config structure
- Add check_responses.py validation script
- Update config tests for new structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 11:52:01 +02:00
8f26776bb3 docs: add OpenAPI responses={} to all router endpoints
Add explicit HTTP status code documentation to every endpoint
across 15 router files. Each endpoint now declares all possible
response codes (200/201/204/400/401/404/409/429/502/503) with
descriptions so frontend can distinguish error types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 01:12:08 +02:00
7ad885d276 refactor: separate config service from jail config service
- Split config_service.py into config_service.py and jail_config_service.py
- Update Docs/Tasks.md, Security.md, TROUBLESHOOTING.md
2026-05-03 01:05:18 +02:00
881cfbdd71 fix: replace broad except Exception with specific exception types
- jail_service: catch ValueError (fail2ban protocol error) instead of Exception
- health.py: catch AttributeError (not OSError/TypeError) for defensive checks
- ban_service: re-raise programming errors in geo lookup handlers
- server_service: catch Fail2BanConnectionError, Fail2BanProtocolError, ValueError
- config_writer: catch OSError instead of Exception

Programming errors now bubble to global handler instead of being silently caught.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 00:54:44 +02:00
bd6170722a feat(geo): add cache hit/miss metrics and prewarm support
- Add _hits/_misses counters to GeoCache for cache hit/miss ratio tracking
- Reset counters on clear()
- Count hits before misses in lookup_batch() to avoid interleaving
- Add synchronous prewarm() using asyncio.create_task for fire-and-forget
- Add hits/misses fields to GeoCacheStatsResponse model
- Add TestCacheMetrics (5 tests), TestPrewarm (3 tests), TestLargeBanList (2 tests)
- Fix _make_async_db() mock: db.execute is not async, returns ctx manager
- Move collections.abc to TYPE_CHECKING block (TC003)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 00:35:47 +02:00
0817a4cb47 fix(regex_validator): add ReDoS detection via regexploit
Detect catastrophic backtracking patterns before regex compilation
using regexploit library. Add ReDoSDetectedError exception and
_MINIMUM_STARRINESS threshold (>=3) to catch dangerous patterns
like (a+)+b. Update pyproject.toml deps, add tests for detection.
2026-05-03 00:05:33 +02:00
e436727942 fix: atomic upsert for import runs (Issue #12)
Replace check-then-insert race condition with INSERT ON CONFLICT.
- upsert_pending uses RETURNING id for atomic upsert
- UNIQUE(source_id, content_hash) constraint from migration 6
- blocklist_import_workflow updated to use upsert_pending
- test_import_source_success fixed for async mock patterns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 23:39:43 +02:00
1285bc8571 feat: comprehensive health check with DB, scheduler, cache
- Add /api/v1/health endpoint with component-level checks
- Verify DB connectivity, fail2ban socket, scheduler, session cache
- Add SQLite WAL cleanup on startup (orphan crash files)
- Migration 8: import_log.timestamp → INTEGER UNIX epoch
- Align import_log timestamps with history_archive (already UNIX int)
- Add unit tests for DB cleanup and health router

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 23:03:57 +02:00
b631c1c546 feat(backend): implement graceful shutdown for container stop
Graceful shutdown ensures in-flight operations complete before process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, DB conn closed cleanly
- 25s Python timeout leaves 5s margin before Docker's 30s SIGKILL

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00
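
The task drain with a timeout described in b631c1c546 might be sketched like this. The helper name and return value are illustrative; the commit only states that pending tasks are drained with a 25s timeout.

```python
import asyncio


async def drain_tasks(tasks, timeout: float):
    """Wait up to `timeout` seconds for `tasks`, then cancel the stragglers.

    Returns (finished_count, cancelled_count).
    """
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancellations propagate before the process exits.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)
```

Calling this with `timeout=25` inside the shutdown handler would leave the 5s margin before Docker's 30s SIGKILL that the commit mentions.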
f6c3c02183 Refactor response handling and health check endpoints
- Enhance response model with additional fields and validation
- Update health and server router implementations
- Improve frontend type definitions and API integration
- Clean up documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 21:57:00 +02:00
cc6dbcf3f0 feat: implement API versioning /api/v1/
- All backend routers moved to /api/v1/ prefix
- Frontend BASE_URL updated to /api/v1
- Setup redirect middleware updated to redirect to /api/v1/setup
- Health router path fixed: prefix=/api/v1/health, @router.get('')
- conftest.py: set server_status=online for test fixture
- Created Docs/API_VERSIONING.md with deprecation policy
- Updated Docs/Backend-Development.md with versioning section
- Updated Instructions.md curl examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 21:29:30 +02:00
0d5882b32f Fix HIGH priority issues: unbounded queries, rate limiting, health checks
Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 21:47:36 +02:00
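
Keyset pagination with a hard row cap, as described for Issue #3 in 0d5882b32f, can be sketched as below. The table name `history_archive`, the `ip` column, and the default page size are assumptions; only the 50k `max_rows` default comes from the commit.

```python
import sqlite3


def get_all_archived_history(conn, page_size=1000, max_rows=50_000):
    """Fetch rows in id order, page by page, never returning more than max_rows."""
    rows = []
    last_id = 0
    while len(rows) < max_rows:
        limit = min(page_size, max_rows - len(rows))
        # Keyset step: continue after the last id seen instead of using OFFSET.
        page = conn.execute(
            "SELECT id, ip FROM history_archive WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, limit),
        ).fetchall()
        if not page:
            break
        rows.extend(page)
        last_id = page[-1][0]
    return rows
```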
1af67eb0ce Add Application Performance Monitoring (APM) with Prometheus metrics
- Backend: Implement Prometheus metrics collection
  - Add prometheus-client dependency
  - Create metrics utility module with HTTP request tracking counters, histograms, gauges
  - Implement MetricsMiddleware to track request latency, count, and active requests
  - Add /metrics endpoint to expose metrics in Prometheus text format
  - Normalize paths to prevent cardinality explosion (e.g., /api/{id} for UUIDs)
  - Exclude /metrics and /health from detailed tracking

- Frontend: Add web vitals and API metrics collection
  - Install web-vitals library (v4.0.0) for Core Web Vitals tracking
  - Create metrics utility module for FCP, LCP, CLS, INP, TTFB collection
  - Implement useTrackedFetch hook for automatic API call metrics (method, endpoint, status, duration)
  - Initialize web vitals tracking in App component on mount
  - Provide exportMetrics() for sending metrics to backend

- Testing:
  - Add comprehensive backend metrics tests (9 tests, 100% coverage)
  - Add comprehensive frontend metrics tests (10 tests)
  - All tests passing

- Documentation:
  - Expand Docs/Observability.md with complete APM section
  - Include metrics reference, integration examples (Prometheus, Datadog, NewRelic)
  - Add troubleshooting guide and best practices for cardinality management
  - Update Tasks.md to mark APM task as complete

Metrics exposed:
- bangui_http_requests_total: HTTP request count by method, endpoint, status
- bangui_http_request_duration_seconds: Request latency histogram
- bangui_http_active_requests: Active request gauge
- Web Vitals: CLS, FCP, INP, LCP, TTFB with ratings
- API metrics: endpoint, method, status, duration, timestamp

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:33:14 +02:00
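
The path normalization mentioned in 1af67eb0ce (replacing UUIDs so each request does not create its own metric label) could be sketched as follows. The function name is hypothetical, and only UUID segments are handled here because that is the one case the commit names.

```python
import re

_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def normalize_path(path: str) -> str:
    """Replace every path segment that is a UUID with the placeholder {id}."""
    segments = path.split("/")
    return "/".join("{id}" if _UUID.fullmatch(s) else s for s in segments)
```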
37078b742b Implement structured logging to centralized platforms (Datadog, Papertrail, ELK)
This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md
- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py
- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)

### 3. Configuration: app/config.py
- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization

### 4. Integration: app/main.py
- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shutdown on termination

### 5. Tests: backend/tests/test_external_logging.py
- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling.
   Failures are logged locally but don't impact application.

2. **Buffering Strategy**: In-memory buffer with configurable size prevents
   unbounded memory growth. When buffer fills, oldest logs are dropped with
   a warning.

3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried
   with exponential backoff. Permanent failures (bad credentials) are logged
   and skipped.

4. **Disabled by Default**: External logging is opt-in via environment
   variables, maintaining backward compatibility with existing deployments.

5. **Provider Flexibility**: Support for multiple platforms allows users to
   choose based on their infrastructure (cloud-native, on-premise, etc).

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:25:26 +02:00
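
The buffering and retry decisions in 37078b742b can be illustrated with the sketch below. The class and function names, the drop counter, and the backoff parameters are assumptions; the commit only specifies a bounded in-memory buffer that drops the oldest records and retries with exponential backoff.

```python
from collections import deque


class BoundedLogBuffer:
    """In-memory buffer that discards the oldest record once it is full."""

    def __init__(self, max_size: int = 1000) -> None:
        self._items = deque(maxlen=max_size)
        self.dropped = 0

    def append(self, record: dict) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the deque evicts the oldest entry on append
        self._items.append(record)

    def drain(self, batch_size: int) -> list:
        """Remove and return up to batch_size records, oldest first."""
        batch = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch


def backoff_delays(base: float = 0.5, factor: float = 2.0, retries: int = 4,
                   cap: float = 30.0) -> list:
    """Delays in seconds between retry attempts, growing exponentially up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]
```

A real handler would log a warning whenever `dropped` increases and would retry only transient failures, as the design notes above describe.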
60d9c5b340 Refactor filter configuration with regex validation
- Add regex validation utility for query strings
- Update filter_config_service to use regex validation
- Add comprehensive test coverage for regex validator
- Update exception handling for validation errors
- Update documentation for tasks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:17:12 +02:00
445c2c5418 Update configuration and documentation
- Update .env.example with latest environment variables
- Update deployment and task documentation
- Update backend configuration settings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:10:03 +02:00
8138857ee1 feat: Implement session secret rotation support
Adds support for gradual session secret rotation without forcing logout:

- Add BANGUI_SESSION_SECRET_PREVIOUS config field for rotation window
- Implement unwrap_session_token_with_rotation() to accept tokens signed with
  either current or previous secret
- Update validate_session() to transparently accept old tokens during rotation
- Update logout() to accept tokens from both secrets
- Add comprehensive logging for rotation events and metrics
- Add 8 new tests covering all rotation scenarios
- Update documentation with step-by-step rotation strategy
- Update .env.example with previous secret field

Key features:
- No forced logout: old tokens continue working during rotation window
- Transparent validation: old tokens are accepted automatically and their use is logged for monitoring
- Production-safe: can rotate secrets without service interruption
- Metrics-ready: logs track token rotation for observability

Rotation workflow:
1. Generate new secret and set BANGUI_SESSION_SECRET
2. Set BANGUI_SESSION_SECRET_PREVIOUS to old secret
3. Wait for old tokens to expire (≥ session_duration_minutes)
4. Unset BANGUI_SESSION_SECRET_PREVIOUS to complete rotation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:01:11 +02:00
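
Accepting tokens signed with either secret, as in 8138857ee1, can be sketched with a plain HMAC. The `payload.signature` token layout, the SHA-256 choice, and both function names are assumptions for illustration; the project's actual token format is not shown in this log.

```python
from __future__ import annotations

import hashlib
import hmac


def sign_token(payload: str, secret: str) -> str:
    mac = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def unwrap_with_rotation(
    token: str, current: str, previous: str | None = None
) -> tuple[str, bool] | None:
    """Return (payload, used_previous_secret), or None if no secret matches."""
    payload, sep, mac = token.rpartition(".")
    if not sep:
        return None
    for secret, is_previous in ((current, False), (previous, True)):
        if not secret:
            continue  # rotation window closed: no previous secret configured
        expected = hmac.new(
            secret.encode(), payload.encode(), hashlib.sha256
        ).hexdigest()
        if hmac.compare_digest(mac.encode(), expected.encode()):
            return payload, is_previous
    return None
```

The boolean in the result gives the caller a hook for logging that an old-secret token was used, which is what makes the rotation observable.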
67b26a3ef7 Refactor pagination with cursor-based support and standardized response format
- Implement cursor-based pagination in pagination.py
- Update response models to standardize pagination structure
- Add cursor pagination utilities for repositories
- Update HistoryArchiveRepository and ImportLogRepository with new pagination
- Add comprehensive tests for cursor pagination
- Update documentation for backend development and task tracking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 17:54:05 +02:00
4f7316c484 Add unified RequestValidationError handler to unify error response schema
- Add RequestValidationError handler that converts Pydantic validation errors to unified ErrorResponse format
- Ensures all error responses return consistent schema: code, detail, metadata, correlation_id
- Add field_errors count and first_field location to metadata for validation errors
- Register handler in exception handler hierarchy before HTTPException handler
- Add comprehensive tests for validation error responses
- Update Backend-Development.md documentation to include correlation_id field and validation error details
- All 44 error-related tests pass (38 existing + 6 new validation tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 15:49:39 +02:00
0221e423f2 Fix pagination metadata return structure and test assertions
The API pagination infrastructure was already correctly implemented with:
- PaginatedListResponse base model containing 'items' and 'pagination' fields
- PaginationMetadata object with all required fields (page, page_size, total, total_pages, has_next_page, has_prev_page)
- All services correctly calling create_pagination_metadata()

However, there were two bugs preventing tests from passing:

1. IMPORT BUG: time_utils.py was importing TIME_RANGE_SECONDS from app.models.ban
   when it's actually defined in app.models._common. This caused import errors
   in tests that exercise time-range filtering.

2. TEST BUG: Test assertions were using outdated API structure, accessing
   .total, .page, .page_size directly on paginated responses instead of
   through the .pagination object.

   Fixed locations:
   - test_mappers/test_ban_mappers.py: 3 assertions updated to use .pagination.*
   - test_services/test_blocklist_service.py: 6 assertions updated
   - test_services/test_history_service.py: 14 assertions updated

All paginated API endpoints now correctly return pagination metadata:
- GET /api/history
- GET /api/history/archive
- GET /api/dashboard/bans
- GET /api/jails/{name}/banned
- GET /api/blocklists/log

Verified with 24 passing pagination tests demonstrating correct behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 15:42:05 +02:00
73021429f7 refactor: restructure API pagination metadata for better frontend usability
- Create PaginationMetadata model with computed derived fields (total_pages, has_next_page, has_prev_page)
- Update PaginatedListResponse to embed pagination metadata in a separate 'pagination' object
- Add create_pagination_metadata() factory function in utils/pagination.py for consistent computation
- Update all paginated service functions to use new structure:
  - history_service.list_history()
  - blocklist_service.get_import_logs()
  - jail_service.get_jail_banned_ips()
  - ban_mappers.map_domain_dashboard_ban_list_to_response()
- Update response model docstrings with new structure examples
- Update Backend-Development.md documentation with new pagination patterns
- Update test fixtures to work with new response structure

Response shape changes from:
  {"items": [...], "total": 100, "page": 1, "page_size": 50}
To:
  {"items": [...], "pagination": {"page": 1, "page_size": 50, "total": 100, "total_pages": 2, "has_next_page": true, "has_prev_page": false}}

Benefits:
- Frontend receives all pagination state needed for UI controls
- No need for frontend to calculate total_pages or page navigation logic
- Consolidated pagination metadata reduces field sprawl
- OpenAPI schema automatically reflects changes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:24:42 +02:00
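
The derived fields in 73021429f7 follow directly from `page`, `page_size`, and `total`. The sketch below uses a stdlib dataclass rather than the project's response model, and the input validation is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaginationMetadata:
    page: int
    page_size: int
    total: int
    total_pages: int
    has_next_page: bool
    has_prev_page: bool


def create_pagination_metadata(page: int, page_size: int, total: int) -> PaginationMetadata:
    """Compute the derived pagination fields from the three raw inputs."""
    if page < 1 or page_size < 1 or total < 0:
        raise ValueError("page and page_size must be >= 1, total must be >= 0")
    total_pages = (total + page_size - 1) // page_size  # integer ceiling division
    return PaginationMetadata(
        page=page,
        page_size=page_size,
        total=total,
        total_pages=total_pages,
        has_next_page=page < total_pages,
        has_prev_page=page > 1,
    )
```

For `page=1, page_size=50, total=100` this yields `total_pages=2`, `has_next_page=True`, and `has_prev_page=False`, matching the response shape quoted in the commit.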
05c3b564ae Refactor scheduler lock implementation with heartbeat mechanism
- Add heartbeat-based lock renewal in scheduler_lock_heartbeat.py
- Update scheduler_lock.py with improved lock management
- Add comprehensive tests for scheduler lock functionality
- Update deployment and task documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:10:38 +02:00
f9e283541b Add explicit database transaction isolation to multi-step operations
This commit addresses race conditions in multi-step database operations by:

1. Wrap write operations in BEGIN IMMEDIATE ... COMMIT transactions:
   - import_run_repo: create_pending, mark_completed, mark_failed
   - geo_cache_repo: all upsert_*_and_commit functions
   - geo_cache_repo: bulk_upsert_entries_and_neg_entries_and_commit

2. Handle concurrent write collisions gracefully:
   - import_run_repo.create_pending can now raise IntegrityError
   - blocklist_import_workflow catches IntegrityError and retries lookup
   - Logs 'blocklist_import_lost_race' event when another request wins the race

3. Add comprehensive documentation:
   - Backend-Development.md § 6.3 Database Transactions
   - Explains when to use BEGIN IMMEDIATE
   - Shows transaction pattern with try-except-rollback
   - Documents race condition error handling pattern

The solution leverages SQLite's UNIQUE constraint for data integrity while
handling the concurrent case gracefully in application logic. This is lighter
than BEGIN EXCLUSIVE, which outside WAL mode would also block readers for the
duration of the transaction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:04:15 +02:00
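
The transaction pattern from f9e283541b, sketched with the stdlib `sqlite3` module. The table layout and the combined "insert or look up" helper are assumptions; in the commit the repository raises IntegrityError and the workflow performs the retry lookup. The connection must be opened with `isolation_level=None` so that the explicit BEGIN is not preceded by an implicit one.

```python
import sqlite3


def create_pending(conn, source_id: int, content_hash: str):
    """Insert a pending run, or return the existing row if another writer won.

    Returns (row_id, created).
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before touching data
    try:
        cur = conn.execute(
            "INSERT INTO import_runs (source_id, content_hash, status) "
            "VALUES (?, ?, 'pending')",
            (source_id, content_hash),
        )
        conn.execute("COMMIT")
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: the same content is already tracked.
        conn.execute("ROLLBACK")
        row = conn.execute(
            "SELECT id FROM import_runs WHERE source_id = ? AND content_hash = ?",
            (source_id, content_hash),
        ).fetchone()
        return row[0], False
    except Exception:
        conn.execute("ROLLBACK")
        raise
```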
94d6352d1d Fix health check endpoint to return 503 when fail2ban is offline
The health check endpoint now properly indicates service unavailability:
- Returns HTTP 200 when fail2ban is online
- Returns HTTP 503 when fail2ban is offline

This allows Docker and other orchestration tools to correctly detect when
fail2ban is unreachable and automatically restart the backend container,
preventing the situation where Docker treats the container as healthy
despite fail2ban being down.

Changes:
- Update GET /api/health to return 503 on fail2ban offline
- Return appropriate JSON response bodies for each state
- Update tests to verify both online (200) and offline (503) scenarios
- Update Dockerfile HEALTHCHECK documentation
- Add Health Checks section to Deployment.md documentation

All tests pass with 100% coverage on health.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:56:42 +02:00
52f237d5d4 Make background tasks idempotent - prevent duplicate bans on retry
CRITICAL FIX: Background tasks (especially blocklist_import) crashed mid-execution,
leaving partial state. On retry, the same bans were applied again, causing duplicates.

Solution: Content-hash based operation tracking for blocklist imports:
- Added import_runs table (migration 6) to track operations by source + content hash
- Before banning, check if this exact content has already been imported
- If completed: skip banning (already done), optionally re-warm cache
- If new or failed: proceed with ban and mark as completed or failed

Changes:
- Database: Migration 6 adds import_runs table with operation state tracking
- Model: Added ImportRunEntry for import run records
- Repository: New import_run_repo module with CRUD operations
- Workflow: Updated blocklist_import_workflow to check operation history before banning
- Dependencies: Registered import_run_repo for dependency injection
- Tests: Added test_import_source_idempotent_on_retry and test_import_source_different_content_not_reused
- Documentation: Added Task Idempotency section to Backend-Development.md

Verification:
- All 7 import tests pass (5 existing + 2 new idempotency tests)
- Type checking: mypy --strict passes
- Linting: ruff passes
- No API changes, backwards compatible via automatic migration

Fixes: Background tasks not idempotent #CRITICAL

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:54:14 +02:00
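
The content-hash check from 52f237d5d4 reduces to "skip if this exact source and content were already completed". The sketch below keeps completed runs in a set instead of the `import_runs` table, and the SHA-256 choice, the function names, and the `ban` callback are assumptions.

```python
import hashlib


def content_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def import_blocklist(source_id: int, content: bytes, completed_runs: set, ban) -> int:
    """Ban each listed IP once per (source, content); return how many were banned."""
    key = (source_id, content_hash(content))
    if key in completed_runs:
        return 0  # identical content was already imported: nothing to redo
    ips = [line.strip() for line in content.decode().splitlines() if line.strip()]
    for ip in ips:
        ban(ip)
    # Mark as completed only after every ban succeeded, so a crash mid-run
    # leaves the key absent and the retry processes the list again.
    completed_runs.add(key)
    return len(ips)
```

A retry after a crash mid-run may still re-issue bans that succeeded before the crash, so the sketch relies on the ban operation itself tolerating repeats.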
400ab1a3f1 Add security headers middleware and documentation
- Add SecurityHeadersMiddleware to backend/app/main.py
  - Implements Content-Security-Policy: default-src 'self'
  - Implements X-Frame-Options: DENY (clickjacking protection)
  - Implements X-Content-Type-Options: nosniff (MIME-sniffing protection)
  - Implements X-XSS-Protection: 1; mode=block (browser XSS filters)
- Add CSP meta tag to frontend/index.html for defense-in-depth
- Create Docs/Security.md with comprehensive security headers documentation
- Add test suite (backend/tests/test_security_headers_middleware.py) with 5 tests
  - Tests verify headers are present on success and error responses
  - Tests ensure all four security headers are correctly set
- All existing tests continue to pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:33:08 +02:00
3bd9848a08 Implement global rate limiter and refactor auth middleware
- Add global rate limiter utility with configurable limits and cleanup
- Move rate limiting logic to middleware for consistent application
- Update auth routes to use new rate limiter
- Add comprehensive tests for rate limiter functionality
- Update documentation with backend development guidelines and tasks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:26:31 +02:00
c4ede71fa6 Fix: Enforce single-worker deployment for session cache cluster safety
Addresses: Backend session cache not cluster-safe (multi-worker issue)

Problem:
- Session cache is process-local (InMemorySessionCache)
- Multi-worker deployments (uvicorn --workers N) create separate processes
- Each process has its own independent session cache
- Sessions cached in Worker A are invisible to Workers B, C, D
- Users randomly logged out when requests land on different workers
- Also affects RuntimeState, rate limiter, and background jobs

Solution (Option A - Strict single-worker enforcement):
- Enhance startup validation with clearer error messages
- Update error messages to explain the problem and how to fix it
- Document single-worker requirement prominently in Docker configs
- Update module docstrings to clarify constraints

Changes:
1. app/startup.py:
   - Enhanced _check_single_worker_mode() error message with troubleshooting
   - Enhanced _stage_check_worker_mode_and_acquire_lock() error message
   - Removed unused import

2. app/utils/session_cache.py:
   - Updated module docstring to explain constraints more clearly
   - Added references to deployment documentation
   - Clarified multi-worker solution for future implementation

3. app/utils/runtime_state.py:
   - Updated module docstring with deployment constraint references
   - Aligned messaging with session_cache.py

4. Docker/Dockerfile.backend:
   - Added comprehensive comments about single-worker requirement
   - Explained impact in multi-worker deployments
   - Referenced deployment constraints documentation

5. Docker/docker-compose.yml, compose.prod.yml, compose.debug.yml:
   - Added documentation comments about BANGUI_WORKERS constraint
   - Explained why single-worker is required

6. backend/tests/test_startup_integration.py:
   - Fixed test unpacking to match function return signature (3 values, not 2)

This ensures multi-worker deployments fail loudly at startup with clear
guidance on what went wrong and how to fix it. The database-backed scheduler
lock provides defense-in-depth for container orchestration scenarios.
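
The fail-loud startup check described above could look roughly like the
sketch below. It is not the code in app/startup.py: the function name
check_single_worker, the StartupError class, and the assumption that the
worker count is read from the BANGUI_WORKERS environment variable (default 1)
are all illustrative.

```python
import os


class StartupError(RuntimeError):
    """Hypothetical error type raised when a startup precondition fails."""


def check_single_worker(env=None) -> int:
    """Return the worker count, or raise if it is anything other than 1.

    Assumption: the worker count comes from BANGUI_WORKERS and defaults to 1.
    """
    env = os.environ if env is None else env
    raw = env.get("BANGUI_WORKERS", "1")
    try:
        workers = int(raw)
    except ValueError:
        raise StartupError(
            f"BANGUI_WORKERS must be an integer, got {raw!r}"
        ) from None
    if workers != 1:
        # Explain both the cause and the fix, so the failure is actionable.
        raise StartupError(
            f"BANGUI_WORKERS={workers} is not supported: the session cache is "
            "process-local, so each worker would hold its own sessions. "
            "Set BANGUI_WORKERS=1."
        )
    return workers


if __name__ == "__main__":
    print(check_single_worker({"BANGUI_WORKERS": "1"}))
    try:
        check_single_worker({"BANGUI_WORKERS": "4"})
    except StartupError as exc:
        print(f"startup aborted: {exc}")
```

Passing the environment as a parameter keeps the check testable without
touching the real process environment.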

For future multi-worker support, implement:
- Redis or database-backed session cache
- Shared RuntimeState coordination
- Distributed APScheduler backend

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:54:24 +02:00
ac53a56ae7 Update backend configuration and documentation
- Modified main.py with backend updates
- Updated Tasks.md documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:10:57 +02:00
9afdbe2852 Refactor auth and setup services
- Updated auth_service.py to improve authentication logic
- Modified setup_service.py for better configuration handling
- Added comprehensive tests for setup_service
- Updated documentation in Tasks.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:10:00 +02:00
3d5acb756f refactor: move repository and service imports to module level in dependencies.py
Move all repository imports (session_repo, blocklist_repo, import_log_repo,
settings_repo, history_archive_repo, geo_cache_repo, fail2ban_db_repo) and
service imports (auth_service, health_service, default_fail2ban_metadata_service)
to module level in app/dependencies.py.

This eliminates local imports inside provider functions, which makes the
module consistent and reduces import overhead. The 'from app.db import
open_db' import stays local because it is only used within get_db().

- Verified no circular dependencies exist
- All repository and service provider functions simplified to return modules
- Updated Architekture.md § 2.3 to document the module-level import pattern
- All tests pass (28 dependency + auth tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 20:06:10 +02:00
277f2a467c Refactor rate limiting with exponential backoff strategy
- Update rate limiter to use exponential backoff instead of a fixed limit
- Implement progressive delays for failed login attempts (0.5s, 1s, 2s, 4s, 5s max)
- Update auth router documentation and endpoint docs
- Refactor test suite to match new rate limiting behavior
- Update backend development documentation
- Clean up unused tasks documentation
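
The delay schedule listed above (0.5s, 1s, 2s, 4s, 5s max) matches a
doubling delay with a cap. A minimal sketch, assuming a base of 0.5 seconds
and a cap of 5 seconds; the function name backoff_delay and its parameters
are illustrative and not taken from the rate limiter module:

```python
def backoff_delay(failed_attempts: int, base: float = 0.5, cap: float = 5.0) -> float:
    """Return the delay in seconds after `failed_attempts` consecutive failures.

    The delay doubles with each failure and is clamped to `cap`.
    """
    if failed_attempts <= 0:
        return 0.0
    return min(base * 2 ** (failed_attempts - 1), cap)


if __name__ == "__main__":
    # Delays for the first six failed attempts.
    print([backoff_delay(n) for n in range(1, 7)])
    # -> [0.5, 1.0, 2.0, 4.0, 5.0, 5.0]
```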

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:58:09 +02:00
2db635ae19 Fix exception handler overlap issue - add DomainError catch-all handler
**Problem:** Broad exception handlers created fragility where adding a new
DomainError subclass without explicit registration would silently fall through
to the generic exception handler, losing the specific error_code and metadata.

**Solution:**
1. Import DomainError in main.py for explicit handler registration
2. Fix type hints in exception handlers from 'Exception' to specific types
   - NotFoundError handler now typed as 'NotFoundError'
   - BadRequestError handler now typed as 'BadRequestError'
   - ConflictError handler now typed as 'ConflictError'
   - DomainError handler now typed as 'DomainError'
   - ServiceUnavailableError handler now typed as 'ServiceUnavailableError'
3. Add DomainError as an explicit catch-all handler in the registration chain
   - Positioned after specific handlers, before HTTPException
   - Any unregistered DomainError subclass now gets correct error_code + metadata
4. Document the exception handler hierarchy with detailed comments
5. Update Backend-Development.md with handler hierarchy documentation
6. Update Architekture.md section 2.2 with exception handler details
7. Fix test expectations in test_main.py to verify ErrorResponse format

**Impact:** Any new DomainError subclass now automatically gets the correct
HTTP 500 status, error_code, and metadata, even if the developer forgets to
register an explicit handler.
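
The catch-all behaviour can be illustrated without any web framework. The
sketch below assumes handlers are looked up most-specific-first along the
exception's class hierarchy; the class attributes, the handler functions, the
HANDLERS table, and QuotaExceededError are hypothetical and do not reflect
the actual code in main.py.

```python
class DomainError(Exception):
    error_code = "domain_error"
    status_code = 500


class NotFoundError(DomainError):
    error_code = "not_found"
    status_code = 404


class QuotaExceededError(DomainError):
    """Hypothetical new subclass with no handler of its own."""
    error_code = "quota_exceeded"


def handle_not_found(exc):
    return {"status": 404, "error_code": exc.error_code}


def handle_domain(exc):
    # Catch-all for DomainError subclasses: keeps the specific error_code.
    return {"status": exc.status_code, "error_code": exc.error_code}


def handle_generic(exc):
    # Last resort: the specific error_code is lost here.
    return {"status": 500, "error_code": "internal_error"}


HANDLERS = {
    NotFoundError: handle_not_found,
    DomainError: handle_domain,
    Exception: handle_generic,
}


def dispatch(exc):
    """Pick the handler for the most specific registered class in the MRO."""
    for cls in type(exc).__mro__:
        handler = HANDLERS.get(cls)
        if handler is not None:
            return handler(exc)
    raise exc


if __name__ == "__main__":
    print(dispatch(NotFoundError()))
    print(dispatch(QuotaExceededError()))
    print(dispatch(ValueError("boom")))
```

Without the DomainError entry in the table, QuotaExceededError would fall
through to handle_generic and lose its error_code, which is the fragility
this commit addresses.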

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 19:44:43 +02:00