Files
BanGUI/Docs/Tasks.md
Lukas 37078b742b Implement structured logging to centralized platforms (Datadog, Papertrail, ELK)
This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md
- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py
- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)

### 3. Configuration: app/config.py
- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization

### 4. Integration: app/main.py
- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shutdown on termination

### 5. Tests: backend/tests/test_external_logging.py
- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling.
   Failures are logged locally but don't impact application.

2. **Buffering Strategy**: In-memory buffer with configurable size prevents
   unbounded memory growth. When buffer fills, oldest logs are dropped with
   a warning.

3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried
   with exponential backoff. Permanent failures (bad credentials) are logged
   and skipped.

4. **Disabled by Default**: External logging is opt-in via environment
   variables, maintaining backward compatibility with existing deployments.

5. **Provider Flexibility**: Support for multiple platforms allows users to
   choose based on their infrastructure (cloud-native, on-premise, etc).

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:25:26 +02:00

148 lines
3.0 KiB
Markdown

## [MEDIUM] No structured logging to external system
**Where found**
- Logs only go to stdout/file, no external aggregation
**Why this is needed**
Can't search across instances, historical logs lost on instance recycle.
**Goal**
Ship logs to centralized logging platform.
**What to do**
1. **Short-term:** Ensure `structlog` JSON output is valid (already done)
2. **Long-term:** Ship to logging platform (ELK, Datadog, Papertrail)
**Possible traps and issues**
- External logging adds latency
- Sensitive data must not be logged
- Log volume can be massive
**Docs changes needed**
- Add `Docs/Observability.md` section on logging
**Doc references**
- `Docs/Observability.md` (new)
---
## [MEDIUM] No Application Performance Monitoring (APM)
**Where found**
- Backend: no metrics collection, latency tracking
- Frontend: no error tracking, performance metrics
- No observability into request performance
**Why this is needed**
Without metrics, blind in production: API slow? Unknown. Which endpoints fail most? Unknown.
**Goal**
Add comprehensive metrics collection and monitoring.
**What to do**
1. **Backend metrics:**
- Add Prometheus metrics: request count, latency, active requests
- Expose `/metrics` endpoint
2. **Frontend metrics:**
- Page load time, FCP, LCP using `web-vitals`
- API error rates and latencies
3. **Aggregation:**
- Prometheus + Grafana, or Datadog/NewRelic
**Possible traps and issues**
- Metrics collection has performance cost
- Cardinality explosion with tags
- PII in metrics
**Docs changes needed**
- Add `Docs/Observability.md`
**Doc references**
- `Docs/Observability.md` (new)
---
## [LOW] Frontend charts not memoized
**Where found**
- `frontend/src/components/TopCountriesPieChart.tsx`
- `frontend/src/components/TopCountriesBarChart.tsx`
**Why this is needed**
Charts re-render on every parent update, Recharts reprocesses 5000+ points.
**Goal**
Memoize chart components.
**What to do**
1. Wrap with `React.memo` with custom comparison
2. Ensure data objects are stable
**Possible traps and issues**
- Shallow comparison might not be enough
- Memoization has memory cost
**Docs changes needed**
- No documentation changes
**Doc references**
- `frontend/src/components/TopCountriesChart.tsx`
---
## [LOW] No request deduplication on frontend
**Where found**
- `frontend/src/hooks/useFetchData.ts` — each call launches new request
- User clicks "Refresh" twice → two identical requests
**Why this is needed**
Duplicates waste bandwidth, cause race conditions (response 2 arrives first, then response 1 overwrites with stale data).
**Goal**
Deduplicate identical in-flight requests.
**What to do**
1. Implement request cache
2. Clear cache entry when response received
3. Use in `useFetchData`
**Possible traps and issues**
- Cache must be cleared on data mutation
- Stale data in cache possible if not careful
**Docs changes needed**
- No documentation changes
**Doc references**
- `frontend/src/hooks/useFetchData.ts`