BanGUI/Docs/Tasks.md
Lukas 37078b742b Implement structured logging to centralized platforms (Datadog, Papertrail, ELK)
This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md
- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py
- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)
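The buffering and non-blocking features listed above can be pictured with a small sketch. This is not the code in app/utils/external_logging.py: the class name `BoundedLogBuffer`, its methods, and the `batch_size` default are hypothetical, and only the standard library is used. It assumes a background task calls `drain_batch()` at each flush interval.

```python
import threading
from collections import deque
from typing import Any


class BoundedLogBuffer:
    """Hypothetical in-memory buffer: append does no I/O, oldest entries drop when full."""

    def __init__(self, max_size: int = 1000, batch_size: int = 100) -> None:
        # deque(maxlen=...) silently evicts from the left once it is full.
        self._entries: deque[dict[str, Any]] = deque(maxlen=max_size)
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self.dropped = 0  # number of entries discarded because the buffer was full

    def append(self, entry: dict[str, Any]) -> None:
        """Queue an entry for later delivery; never waits on the network."""
        with self._lock:
            if len(self._entries) == self._entries.maxlen:
                self.dropped += 1  # the oldest entry is about to be evicted
            self._entries.append(entry)

    def drain_batch(self) -> list[dict[str, Any]]:
        """Remove and return up to batch_size entries, oldest first."""
        with self._lock:
            count = min(self._batch_size, len(self._entries))
            return [self._entries.popleft() for _ in range(count)]
```

A real handler would likely also emit a local warning when `dropped` increases, so that log loss is visible.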

### 3. Configuration: app/config.py
- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization
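As a rough illustration of what defaults, validation, and normalization could look like for these fields, here is a standard-library sketch. The field names and the three stated defaults come from the list above; the class name `ExternalLoggingSettings`, the empty-string default for the provider, and the specific validation rules are assumptions, and the real Settings class in app/config.py may be built differently.

```python
from dataclasses import dataclass

SUPPORTED_PROVIDERS = ("datadog", "papertrail", "elasticsearch")


@dataclass
class ExternalLoggingSettings:
    """Hypothetical stand-in for the external logging fields on Settings."""

    external_logging_enabled: bool = False
    external_logging_provider: str = ""  # assumed default; not stated in the commit
    external_logging_buffer_size: int = 1000
    external_logging_flush_interval_seconds: float = 5.0

    def __post_init__(self) -> None:
        # Normalization: accept values such as "Datadog" or " datadog ".
        self.external_logging_provider = self.external_logging_provider.strip().lower()
        # Validation: a provider is only required once logging is switched on.
        if (
            self.external_logging_enabled
            and self.external_logging_provider not in SUPPORTED_PROVIDERS
        ):
            raise ValueError(
                f"external_logging_provider must be one of {SUPPORTED_PROVIDERS}"
            )
        if self.external_logging_buffer_size <= 0:
            raise ValueError("external_logging_buffer_size must be positive")
        if self.external_logging_flush_interval_seconds <= 0:
            raise ValueError("external_logging_flush_interval_seconds must be positive")
```

With this shape, constructing the settings with no arguments keeps external logging off, which matches the opt-in behaviour the commit describes.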

### 4. Integration: app/main.py
- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shutdown on termination
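The processor integration might look roughly like the sketch below. A structlog processor is a callable taking `(logger, method_name, event_dict)` and returning the event dict, so the sketch needs no imports from structlog itself. The commit uses a module-level `_external_log_handler`; the factory `make_external_logging_processor` and the handler's `emit` method used here are hypothetical simplifications.

```python
from typing import Any, Callable, MutableMapping, Protocol

EventDict = MutableMapping[str, Any]


class SupportsEmit(Protocol):
    """Hypothetical handler interface: emit() queues an entry without blocking."""

    def emit(self, entry: dict[str, Any]) -> None: ...


def make_external_logging_processor(
    handler: SupportsEmit,
) -> Callable[[Any, str, EventDict], EventDict]:
    """Build a structlog-style processor that forwards each event to the handler."""

    def _external_logging_processor(
        logger: Any, method_name: str, event_dict: EventDict
    ) -> EventDict:
        try:
            # Pass a copy so later processors cannot mutate the queued entry.
            handler.emit(dict(event_dict))
        except Exception:
            # Delivery problems must never break local logging.
            pass
        # Return the event unchanged so the rest of the chain still runs.
        return event_dict

    return _external_logging_processor
```

Such a processor would be placed before the final renderer in the chain, so that it sees the structured dict and not the rendered string.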

### 5. Tests: backend/tests/test_external_logging.py
- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling.
   Failures are logged locally and do not affect the application.

2. **Buffering Strategy**: An in-memory buffer with a configurable size prevents
   unbounded memory growth. When the buffer fills, the oldest logs are dropped with
   a warning.

3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried
   with exponential backoff. Permanent failures (bad credentials) are logged
   and skipped.

4. **Disabled by Default**: External logging is opt-in via environment
   variables, maintaining backward compatibility with existing deployments.

5. **Provider Flexibility**: Support for multiple platforms allows users to
   choose based on their infrastructure (cloud-native, on-premises, etc.).
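The retry policy in decision 3 can be sketched as follows. This is an illustration under assumptions, not the committed implementation: the exception classes, function names, and default delays are all hypothetical, and a real handler would map timeouts and 5xx responses to the transient case and authentication errors to the permanent case.

```python
import time
from typing import Callable, Iterable


class TransientDeliveryError(Exception):
    """Hypothetical marker for timeouts and 5xx responses."""


class PermanentDeliveryError(Exception):
    """Hypothetical marker for failures such as bad credentials."""


def backoff_delays(
    base: float = 0.5, factor: float = 2.0, max_delay: float = 30.0, attempts: int = 5
) -> list[float]:
    """Exponentially growing delays, capped at max_delay."""
    return [min(base * factor**i, max_delay) for i in range(attempts)]


def deliver_with_retry(
    send: Callable[[], None],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() once, then once more after each delay while failures are transient."""
    remaining = list(delays)
    while True:
        try:
            send()
            return True
        except PermanentDeliveryError:
            return False  # retrying cannot help; the caller logs and skips the batch
        except TransientDeliveryError:
            if not remaining:
                return False  # retries exhausted
            sleep(remaining.pop(0))
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in an asyncio-based handler the equivalent would await `asyncio.sleep` instead.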

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:25:26 +02:00


[MEDIUM] No structured logging to external system

Where found

  • Logs only go to stdout/file, no external aggregation

Why this is needed

Logs cannot be searched across instances, and historical logs are lost when an instance is recycled.

Goal

Ship logs to a centralized logging platform.

What to do

  1. Short-term: Ensure structlog JSON output is valid (already done)
  2. Long-term: Ship to logging platform (ELK, Datadog, Papertrail)

Possible traps and issues

  • External logging adds latency
  • Sensitive data must not be logged
  • Log volume can be massive
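One way to guard against the sensitive-data trap is to redact known keys before an entry leaves the process. The sketch below is an assumption about how that could be done, not a description of the project's code: the key list, the `[REDACTED]` placeholder, and the function name `redact` are all hypothetical.

```python
from typing import Any

# Hypothetical deny-list; a real deployment would tailor this to its own fields.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "authorization", "secret"})
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive dict keys masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Key-based redaction does not catch secrets embedded in free-text messages, so it complements rather than replaces care at the call site.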

Docs changes needed

  • Add a logging section to Docs/Observability.md

Doc references

  • Docs/Observability.md (new)

[MEDIUM] No Application Performance Monitoring (APM)

Where found

  • Backend: no metrics collection or latency tracking
  • Frontend: no error tracking or performance metrics
  • No observability into request performance

Why this is needed

Without metrics, we are blind in production. Is the API slow? Unknown. Which endpoints fail most? Unknown.

Goal

Add comprehensive metrics collection and monitoring.

What to do

  1. Backend metrics:

    • Add Prometheus metrics: request count, latency, active requests
    • Expose /metrics endpoint
  2. Frontend metrics:

    • Page load time, FCP, LCP using web-vitals
    • API error rates and latencies
  3. Aggregation:

    • Prometheus + Grafana, or Datadog/NewRelic

Possible traps and issues

  • Metrics collection has performance cost
  • Cardinality explosion with tags
  • PII in metrics
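The cardinality trap usually comes from using raw request paths as a metric label, since every distinct ID creates a new time series. A common mitigation is to collapse variable path segments before labelling. The following is a minimal sketch of that idea; the function name `normalize_path_label` and the placeholder strings are hypothetical, and a framework's own route templates are preferable where they are available.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def normalize_path_label(path: str) -> str:
    """Collapse numeric and UUID path segments so the label set stays small."""
    path_only = path.split("?", 1)[0]  # query strings never belong in a label
    segments = []
    for segment in path_only.split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)
```

Dropping the query string and identifiers also reduces the chance of user-specific values ending up in metrics.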

Docs changes needed

  • Add Docs/Observability.md

Doc references

  • Docs/Observability.md (new)

[LOW] Frontend charts not memoized

Where found

  • frontend/src/components/TopCountriesPieChart.tsx
  • frontend/src/components/TopCountriesBarChart.tsx

Why this is needed

Charts re-render on every parent update, so Recharts reprocesses 5000+ points each time.

Goal

Memoize chart components.

What to do

  1. Wrap each chart component in React.memo with a custom comparison function
  2. Ensure data objects passed as props keep a stable identity between renders

Possible traps and issues

  • Shallow comparison might not be enough
  • Memoization has memory cost

Docs changes needed

  • No documentation changes

Doc references

  • frontend/src/components/TopCountriesChart.tsx

[LOW] No request deduplication on frontend

Where found

  • frontend/src/hooks/useFetchData.ts — each call launches new request
  • User clicks "Refresh" twice → two identical requests

Why this is needed

Duplicate requests waste bandwidth and can cause race conditions (response 2 arrives first, then response 1 overwrites it with stale data).

Goal

Deduplicate identical in-flight requests.

What to do

  1. Implement a cache of in-flight requests
  2. Clear the cache entry when the response is received
  3. Use the cache in useFetchData

Possible traps and issues

  • Cache must be cleared on data mutation
  • Stale data can be served if cache entries are not cleared promptly

Docs changes needed

  • No documentation changes

Doc references

  • frontend/src/hooks/useFetchData.ts