8 Commits

4d09d2538d docs: Add security best practices to Deployment.md
- Secrets management via environment variables
- Container security hardening (non-root user, filesystem permissions, capabilities)
- Network security and TLS termination guidance
- Prune obsolete task tracking from Tasks.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-03 19:48:52 +02:00
5058a50143 Refactor backend: fix geo cache cleanup, scheduler heartbeat, correlation middleware; update docs 2026-05-03 16:02:40 +02:00
b631c1c546 feat(backend): implement graceful shutdown for container stop
Graceful shutdown ensures in-flight operations complete before the process exits:
- Lifespan shutdown handler drains pending tasks with 25s timeout
- Scheduler stops accepting new jobs immediately
- HTTP session, external logging, scheduler lock, and DB connection are closed cleanly
- 25s Python timeout leaves a 5s margin before Docker sends SIGKILL at the end of the 30s stop_grace_period

Files changed:
- backend/app/main.py: enhanced _lifespan shutdown with task drain
- Docker/Dockerfile.backend: documented signal handling in header
- Docker/docker-compose.yml: added stop_grace_period: 30s
- Docker/compose.prod.yml: added stop_grace_period: 30s
- Docs/Deployment.md: new Graceful Shutdown section with sequence table
- Docs/TROUBLESHOOTING.md: new Graceful Shutdown Issues section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 22:47:10 +02:00
0d5882b32f Fix HIGH priority issues: unbounded queries, rate limiting, health checks
Issue #3 - Unbounded Query Results (OOM):
- get_all_archived_history() now uses keyset pagination with bounded max_rows (50k default)
- Added 'id' field to records from get_archived_history() and get_archived_history_keyset()
- Protocol signature updated with page_size, max_rows, last_ban_id params

Issue #7 - Docker Health Check Fails:
- Added curl to Dockerfile.backend runtime image
- HEALTHCHECK now uses 'curl -f http://localhost:8000/api/health'
- compose.prod.yml: increased start_period to 40s, timeout to 10s
- Frontend healthcheck proxies to backend /api/health

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 21:47:36 +02:00
445c2c5418 Update configuration and documentation
- Update .env.example with latest environment variables
- Update deployment and task documentation
- Update backend configuration settings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:10:03 +02:00
05c3b564ae Refactor scheduler lock implementation with heartbeat mechanism
- Add heartbeat-based lock renewal in scheduler_lock_heartbeat.py
- Update scheduler_lock.py with improved lock management
- Add comprehensive tests for scheduler lock functionality
- Update deployment and task documentation
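The heartbeat idea can be sketched as a lease with a TTL that its holder keeps extending; if heartbeats stop (for example, the process crashed), the lease expires and another instance may take over. This is a hypothetical in-memory model (`LeaseLock`, `FakeClock`, the 30 s TTL), not the contents of scheduler_lock.py or scheduler_lock_heartbeat.py; a real shared lock would need an atomic compare-and-set in its backing store and a clock shared by all instances.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseLock:
    """In-memory stand-in for a shared lock store (e.g. a DB row)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._lease: Optional[Lease] = None

    def acquire(self, owner: str) -> bool:
        # Succeeds if the lock is free, expired, or already held by `owner`.
        now = self.clock()
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.owner == owner:
            self._lease = Lease(owner, now + self.ttl_s)
            return True
        return False

    def heartbeat(self, owner: str) -> bool:
        """Extend the lease only if `owner` still holds it and it is unexpired."""
        now = self.clock()
        lease = self._lease
        if lease is not None and lease.owner == owner and lease.expires_at > now:
            lease.expires_at = now + self.ttl_s
            return True
        return False

    def release(self, owner: str) -> None:
        if self._lease is not None and self._lease.owner == owner:
            self._lease = None


class FakeClock:
    """Manually advanced clock so expiry can be demonstrated deterministically."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        return self.t


if __name__ == "__main__":
    clock = FakeClock()
    lock = LeaseLock(ttl_s=30, clock=clock)
    print(lock.acquire("a"))    # True
    clock.t = 20
    print(lock.heartbeat("a"))  # True, lease now expires at 50
    clock.t = 51                # no heartbeat since t=20
    print(lock.acquire("b"))    # True, "a" lost the lease
```

A heartbeat that fails (returns False) tells the former holder it must stop running scheduled jobs, which is what prevents two schedulers from running at once.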

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 22:10:38 +02:00
94d6352d1d Fix health check endpoint to return 503 when fail2ban is offline
The health check endpoint now properly indicates service unavailability:
- Returns HTTP 200 when fail2ban is online
- Returns HTTP 503 when fail2ban is offline
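A framework-agnostic sketch of the status mapping, plus the exit code that `curl -f` (used by the HEALTHCHECK added later in 0d5882b32f) yields for each status. The function names and JSON body fields are hypothetical; curl documents exit code 22 for HTTP responses of 400 or above when --fail is given, and Docker counts a non-zero HEALTHCHECK exit as a failed check.

```python
from http import HTTPStatus


def health_response(fail2ban_online: bool) -> tuple[int, dict]:
    """Map fail2ban reachability to (HTTP status, JSON body) for GET /api/health."""
    if fail2ban_online:
        return int(HTTPStatus.OK), {"status": "ok", "fail2ban": "online"}
    return int(HTTPStatus.SERVICE_UNAVAILABLE), {"status": "degraded", "fail2ban": "offline"}


def curl_fail_exit_code(status: int) -> int:
    """Exit code of `curl -f` for a response: 0 on success, 22 on HTTP >= 400."""
    return 22 if status >= 400 else 0


if __name__ == "__main__":
    for online in (True, False):
        status, _body = health_response(online)
        print(online, status, curl_fail_exit_code(status))
    # True 200 0
    # False 503 22
```

Returning 200 regardless of fail2ban state would always give curl exit code 0, which is why the container previously looked healthy.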

This allows Docker and other orchestration tools to correctly detect when
fail2ban is unreachable and mark the backend container unhealthy, so that a
tool acting on health status can restart it, preventing the situation where
Docker treats the container as healthy despite fail2ban being down.

Changes:
- Update GET /api/health to return 503 on fail2ban offline
- Return appropriate JSON response bodies for each state
- Update tests to verify both online (200) and offline (503) scenarios
- Update Dockerfile HEALTHCHECK documentation
- Add Health Checks section to Deployment.md documentation

All tests pass with 100% coverage on health.py.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:56:42 +02:00
90f4c6239c Add resource limits to all Docker containers
- fail2ban: 0.5 CPU / 128M memory limit, 0.1 CPU / 64M reserved
- backend: 2.0 CPU / 512M memory limit, 1.0 CPU / 256M reserved
- frontend: 0.5 CPU / 128M memory limit, 0.25 CPU / 64M reserved
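The arithmetic behind these numbers, as a small Python check: totals across services, plus a sanity check that each reservation stays within its limit and that total reservations fit a given host. The dictionary layout, field names and host sizes are assumptions; the per-service values are the ones listed above. Decimal is used so CPU sums such as 0.1 + 1.0 + 0.25 come out exact.

```python
from decimal import Decimal

# deploy.resources values from the commit message (CPU in cores, memory in M).
RESOURCES = {
    "fail2ban": {"limit_cpus": Decimal("0.5"), "limit_mem_m": 128,
                 "res_cpus": Decimal("0.1"), "res_mem_m": 64},
    "backend": {"limit_cpus": Decimal("2.0"), "limit_mem_m": 512,
                "res_cpus": Decimal("1.0"), "res_mem_m": 256},
    "frontend": {"limit_cpus": Decimal("0.5"), "limit_mem_m": 128,
                 "res_cpus": Decimal("0.25"), "res_mem_m": 64},
}


def totals(resources: dict) -> dict:
    """Sum each field across all services."""
    fields = ("limit_cpus", "limit_mem_m", "res_cpus", "res_mem_m")
    return {f: sum(svc[f] for svc in resources.values()) for f in fields}


def problems(resources: dict, host_cpus: Decimal, host_mem_m: int) -> list[str]:
    """List configuration issues for a host of the given (hypothetical) size."""
    found = []
    for name, svc in resources.items():
        if svc["res_cpus"] > svc["limit_cpus"] or svc["res_mem_m"] > svc["limit_mem_m"]:
            found.append(f"{name}: reservation exceeds limit")
    t = totals(resources)
    if t["res_cpus"] > host_cpus or t["res_mem_m"] > host_mem_m:
        found.append("total reservations exceed host capacity")
    return found


if __name__ == "__main__":
    print(totals(RESOURCES))
    print(problems(RESOURCES, Decimal("2"), 1024))  # []
```

With these values the limits total 3.0 CPU / 768M and the reservations 1.35 CPU / 384M, so a 2-CPU, 1G host could honour every reservation while the limits overcommit CPU.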

Prevents 'noisy neighbor' scenarios where one container exhausts
host resources (CPU, memory). Limits are hard caps; reservations
guarantee a minimum allocation, reducing the risk of OOM kills and keeping
the service responsive even under load.

Fixes resource contention issue in production and staging environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 21:03:56 +02:00