This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure. ## Key Changes: ### 1. New Documentation: Docs/Observability.md - Comprehensive guide to logging architecture and configuration - Covers all three supported platforms (Datadog, Papertrail, Elasticsearch) - Includes best practices, security considerations, and troubleshooting - Documents sensitive data handling and compliance requirements ### 2. Core Implementation: app/utils/external_logging.py - ExternalLogHandler: Abstract base class for non-blocking log delivery - DatadogLogHandler: HTTP API integration with JSON payloads - PapertrailLogHandler: Syslog protocol over TCP - ElasticsearchLogHandler: Bulk API integration with NDJSON format - Features: - Async buffering with configurable batch size and flush interval - Exponential backoff retry logic - Non-blocking delivery (never blocks application logic) - Proper error handling and internal logging - Lifecycle management (start/shutdown) ### 3. Configuration: app/config.py - New Settings fields for external logging: - external_logging_enabled (default: False) - external_logging_provider (datadog/papertrail/elasticsearch) - external_logging_buffer_size (default: 1000) - external_logging_flush_interval_seconds (default: 5.0) - Provider-specific configuration (API keys, hosts, batch sizes) - All fields have sensible defaults - Full field validation and normalization ### 4. Integration: app/main.py - Global _external_log_handler for application lifecycle - _external_logging_processor: structlog processor for handler integration - Updated _configure_logging(): Add handler to processor chain when enabled - Updated _lifespan(): Initialize handler before startup, shutdown on termination ### 5. Tests: backend/tests/test_external_logging.py - 20 comprehensive tests covering all handlers and factory - Configuration validation tests - All tests passing ## Design Decisions: 1. **Non-blocking Delivery**: External logging never blocks request handling. Failures are logged locally but don't impact application. 2. **Buffering Strategy**: In-memory buffer with configurable size prevents unbounded memory growth. When buffer fills, oldest logs are dropped with a warning. 3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried with exponential backoff. Permanent failures (bad credentials) are logged and skipped. 4. **Disabled by Default**: External logging is opt-in via environment variables, maintaining backward compatibility with existing deployments. 5. **Provider Flexibility**: Support for multiple platforms allows users to choose based on their infrastructure (cloud-native, on-premise, etc). ## Backward Compatibility: - All new configuration fields have defaults - External logging disabled by default - No changes to existing logging behavior unless explicitly configured - No new required dependencies ## Testing: - All 20 new tests passing - Existing tests unaffected (same count of passing tests) - Configuration validation tested - Handler creation and lifecycle management tested Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
148 lines
3.0 KiB
Markdown
148 lines
3.0 KiB
Markdown
## [MEDIUM] No structured logging to external system
|
|
|
|
**Where found**
|
|
|
|
- Logs only go to stdout/file, no external aggregation
|
|
|
|
**Why this is needed**
|
|
|
|
Can't search across instances, historical logs lost on instance recycle.
|
|
|
|
**Goal**
|
|
|
|
Ship logs to centralized logging platform.
|
|
|
|
**What to do**
|
|
|
|
1. **Short-term:** Ensure `structlog` JSON output is valid (already done)
|
|
2. **Long-term:** Ship to logging platform (ELK, Datadog, Papertrail)
|
|
|
|
**Possible traps and issues**
|
|
|
|
- External logging adds latency
|
|
- Sensitive data must not be logged
|
|
- Log volume can be massive
|
|
|
|
**Docs changes needed**
|
|
|
|
- Add `Docs/Observability.md` section on logging
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Observability.md` (new)
|
|
|
|
---
|
|
|
|
## [MEDIUM] No Application Performance Monitoring (APM)
|
|
|
|
**Where found**
|
|
|
|
- Backend: no metrics collection, latency tracking
|
|
- Frontend: no error tracking, performance metrics
|
|
- No observability into request performance
|
|
|
|
**Why this is needed**
|
|
|
|
Without metrics, blind in production: API slow? Unknown. Which endpoints fail most? Unknown.
|
|
|
|
**Goal**
|
|
|
|
Add comprehensive metrics collection and monitoring.
|
|
|
|
**What to do**
|
|
|
|
1. **Backend metrics:**
|
|
- Add Prometheus metrics: request count, latency, active requests
|
|
- Expose `/metrics` endpoint
|
|
|
|
2. **Frontend metrics:**
|
|
- Page load time, FCP, LCP using `web-vitals`
|
|
- API error rates and latencies
|
|
|
|
3. **Aggregation:**
|
|
- Prometheus + Grafana, or Datadog/NewRelic
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Metrics collection has performance cost
|
|
- Cardinality explosion with tags
|
|
- PII in metrics
|
|
|
|
**Docs changes needed**
|
|
|
|
- Add `Docs/Observability.md`
|
|
|
|
**Doc references**
|
|
|
|
- `Docs/Observability.md` (new)
|
|
|
|
---
|
|
|
|
## [LOW] Frontend charts not memoized
|
|
|
|
**Where found**
|
|
|
|
- `frontend/src/components/TopCountriesPieChart.tsx`
|
|
- `frontend/src/components/TopCountriesBarChart.tsx`
|
|
|
|
**Why this is needed**
|
|
|
|
Charts re-render on every parent update, Recharts reprocesses 5000+ points.
|
|
|
|
**Goal**
|
|
|
|
Memoize chart components.
|
|
|
|
**What to do**
|
|
|
|
1. Wrap with `React.memo` with custom comparison
|
|
2. Ensure data objects are stable
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Shallow comparison might not be enough
|
|
- Memoization has memory cost
|
|
|
|
**Docs changes needed**
|
|
|
|
- No documentation changes
|
|
|
|
**Doc references**
|
|
|
|
- `frontend/src/components/TopCountriesChart.tsx`
|
|
|
|
---
|
|
|
|
## [LOW] No request deduplication on frontend
|
|
|
|
**Where found**
|
|
|
|
- `frontend/src/hooks/useFetchData.ts` — each call launches new request
|
|
- User clicks "Refresh" twice → two identical requests
|
|
|
|
**Why this is needed**
|
|
|
|
Duplicates waste bandwidth, cause race conditions (response 2 arrives first, then response 1 overwrites with stale data).
|
|
|
|
**Goal**
|
|
|
|
Deduplicate identical in-flight requests.
|
|
|
|
**What to do**
|
|
|
|
1. Implement request cache
|
|
2. Clear cache entry when response received
|
|
3. Use in `useFetchData`
|
|
|
|
**Possible traps and issues**
|
|
|
|
- Cache must be cleared on data mutation
|
|
- Stale data in cache possible if not careful
|
|
|
|
**Docs changes needed**
|
|
|
|
- No documentation changes
|
|
|
|
**Doc references**
|
|
|
|
- `frontend/src/hooks/useFetchData.ts` |