Implement frontend and backend observability alignment

Align frontend and backend error observability with correlation IDs and structured telemetry for distributed tracing across systems.

Backend changes:
- Add CorrelationIdMiddleware to generate/extract correlation IDs
- Include correlation_id in all ErrorResponse objects
- Store correlation ID in structlog contextvars for automatic inclusion in logs
- Add correlation ID to response headers (X-Correlation-ID)

Frontend changes:
- API client automatically generates session-scoped UUID4 and includes X-Correlation-ID header in all requests
- Extract correlation ID from API error responses
- Update error handlers to use telemetry with correlation IDs
- Add telemetry logging to ErrorBoundary, PageErrorBoundary, SectionErrorBoundary
- Implement redaction utilities for privacy-safe logging of sensitive data

Documentation:
- Add observability guidelines to Web-Development.md
  * Correlation ID usage patterns
  * Privacy & security best practices
  * Telemetry event structure
  * Redaction utilities for sensitive data
- Add distributed tracing architecture section to Architecture.md
  * Correlation ID flow across frontend/backend
  * Example troubleshooting scenario
  * Implementation details for future enhancements

Testing:
- Add comprehensive tests for correlation middleware
- Update error boundary tests to verify telemetry integration
- Verify TypeScript and ESLint pass with no warnings

Fixes: Issue #40 - Frontend and backend observability are not aligned

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@@ -1451,7 +1451,149 @@ Currently, the single-executor approach is simple, maintainable, and sufficient

---

## 10. Design Principles

## 10. Observability & Distributed Tracing

BanGUI implements **distributed tracing** with **correlation IDs** that link errors and requests across the frontend and backend.

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Frontend (React + TypeScript)                               │
├─────────────────────────────────────────────────────────────┤
│ • API Client generates session-scoped UUID4 (correlation ID)│
│ • Telemetry service records structured events               │
│ • Error boundaries catch render errors                      │
│ • All telemetry events include correlation ID for tracing   │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ├─ Every request includes
                     │  X-Correlation-ID header
                     │
┌────────────────────┴────────────────────────────────────────┐
│ Backend (Python + FastAPI + structlog)                      │
├─────────────────────────────────────────────────────────────┤
│ • CorrelationIdMiddleware extracts/generates correlation ID │
│ • All logs automatically include correlation ID             │
│ • Error responses include correlation_id field              │
│ • structlog outputs JSON with correlation ID in all events  │
└─────────────────────────────────────────────────────────────┘
```

### Correlation ID Flow

1. **Frontend → Backend:**
   - API client generates or retrieves the session-scoped UUID4
   - The UUID4 is sent in the `X-Correlation-ID` request header
   - All requests in a session reuse the same UUID (set once, then reused)

2. **Backend Processing:**
   - `CorrelationIdMiddleware` extracts the correlation ID from the request, or generates one if it is missing
   - The ID is stored in structlog contextvars
   - All structured log entries include the correlation ID automatically
   - Error responses include a `correlation_id` field in the JSON body

3. **Backend → Frontend:**
   - The response includes the `X-Correlation-ID` header
   - Error responses include `correlation_id` in the response body
   - Frontend error handlers extract the correlation ID

4. **Frontend Error Logging:**
   - Error handlers extract the correlation ID from the API response
   - The telemetry service logs the error with the correlation ID
   - The browser console and telemetry backends receive linked events
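
Steps 1 to 3 can be sketched as a small pure-ASGI middleware. This is a minimal illustration under stated assumptions, not the actual `CorrelationIdMiddleware`: it uses a plain `contextvars.ContextVar` (named `correlation_id_var` here, a hypothetical name) where the real middleware stores the ID in structlog contextvars, and `demo_app` and `call` are stand-ins that exist only to exercise it without a server.

```python
import asyncio
import uuid
from contextvars import ContextVar

# Hypothetical stand-in for the structlog contextvars binding.
correlation_id_var: ContextVar[str] = ContextVar("correlation_id", default="")

HEADER = b"x-correlation-id"  # ASGI delivers header names as lower-cased bytes


class CorrelationIdMiddleware:
    """Sketch: reuse the incoming correlation ID, or generate a UUID4."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        incoming = dict(scope.get("headers", [])).get(HEADER)
        correlation_id = incoming.decode("latin-1") if incoming else str(uuid.uuid4())
        token = correlation_id_var.set(correlation_id)

        async def send_with_header(message):
            # Attach the ID to the response head so the client can read it back.
            if message["type"] == "http.response.start":
                headers = [(k, v) for k, v in message.get("headers", []) if k.lower() != HEADER]
                headers.append((HEADER, correlation_id.encode("latin-1")))
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_header)
        finally:
            correlation_id_var.reset(token)


async def demo_app(scope, receive, send):
    # Downstream code can read the ID without it being passed explicitly.
    body = correlation_id_var.get().encode()
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": body})


def call(headers):
    """Drive the middleware once with a fake ASGI scope; return the sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/", "headers": headers}
    asyncio.run(CorrelationIdMiddleware(demo_app)(scope, receive, send))
    return sent
```

Calling `call([(b"x-correlation-id", b"abc-123")])` echoes the supplied ID in the response header, while `call([])` yields a freshly generated UUID4. A production middleware would likely also validate the incoming value before trusting it; that check is omitted here.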

### Example: Correlating an Error Across Systems

**Scenario:** User clicks "Ban IP" button → API returns 500 error → error logged and displayed

**Frontend telemetry event:**
```json
{
  "event": "api_error",
  "severity": "error",
  "message": "Server error banning IP",
  "correlation_id": "550e8400-e29b-41d4-a716-446655440000",
  "context": {
    "status": 500,
    "endpoint": "/api/bans"
  },
  "timestamp": "2025-04-30T18:30:00.000Z"
}
```

**Backend structured log:**
```json
{
  "event": "ban_service_error",
  "severity": "error",
  "message": "Failed to ban IP",
  "correlation_id": "550e8400-e29b-41d4-a716-446655440000",
  "context": {
    "ip": "192.168.1.1",
    "jail": "sshd",
    "error": "fail2ban socket error"
  },
  "timestamp": "2025-04-30T18:30:00.000Z"
}
```
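
**Troubleshooting:** An engineer searches the logs for correlation ID `550e8400-e29b-41d4-a716-446655440000` and finds the related events (request received, jail lookup, fail2ban call, error response) in chronological order. Because the ID is session-scoped, the search also returns other requests from the same browser session, so the timestamp is used to narrow the results to the failing request.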
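
The search described above can be sketched as a filter over JSON log lines. This assumes one JSON object per line with the `correlation_id` and `timestamp` fields shown in the examples, and timestamps in a single UTC ISO-8601 format so that string order matches time order. The function name `events_for` and the sample event names other than `ban_service_error` are hypothetical.

```python
import json


def events_for(log_lines, correlation_id):
    """Return the parsed events carrying the given correlation ID, oldest first."""
    matches = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. plain-text startup output)
        if isinstance(event, dict) and event.get("correlation_id") == correlation_id:
            matches.append(event)
    # Same-format UTC ISO-8601 strings sort chronologically as plain strings.
    return sorted(matches, key=lambda e: e.get("timestamp", ""))


SAMPLE_LOGS = [
    '{"event": "ban_service_error", "correlation_id": "550e8400-e29b-41d4-a716-446655440000", "timestamp": "2025-04-30T18:30:00.200Z"}',
    "server started",
    '{"event": "request_received", "correlation_id": "550e8400-e29b-41d4-a716-446655440000", "timestamp": "2025-04-30T18:30:00.000Z"}',
    '{"event": "request_received", "correlation_id": "another-session", "timestamp": "2025-04-30T18:30:00.100Z"}',
]

trace = events_for(SAMPLE_LOGS, "550e8400-e29b-41d4-a716-446655440000")
```

Here `trace` holds only the two events from the matching session, with `request_received` ahead of `ban_service_error`; the entry from the other session and the non-JSON line are dropped.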

### Implementation Details

**Backend:**
- Middleware: `app/middleware/correlation.py`
  - Generates UUID4 if `X-Correlation-ID` header missing
  - Stores in structlog contextvars for automatic inclusion in all logs
  - Adds correlation ID to response header and error responses
- All error handlers include `correlation_id` in `ErrorResponse`
- See `backend/app/models/response.py` for `ErrorResponse.correlation_id` field
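
The "automatic inclusion" idea can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the project's code: it uses `logging` with a filter instead of structlog, a plain `ContextVar` instead of structlog contextvars, and a hypothetical `ErrorResponse` dataclass in which only the `correlation_id` field is taken from the text (the `detail` field and the helper names `make_error` and `build_logger` are invented for illustration).

```python
import io
import json
import logging
from contextvars import ContextVar
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical stand-in for the value bound into structlog contextvars.
correlation_id_var: ContextVar[Optional[str]] = ContextVar("correlation_id", default=None)


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "event": record.getMessage(),
            "severity": record.levelname.lower(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


@dataclass
class ErrorResponse:
    # Hypothetical shape: only correlation_id is confirmed by the document.
    detail: str
    correlation_id: Optional[str] = None


def make_error(detail):
    """Build an error payload that carries the current correlation ID."""
    return ErrorResponse(detail=detail, correlation_id=correlation_id_var.get())


def build_logger(stream):
    """Create a logger whose output always includes the correlation ID."""
    logger = logging.getLogger("bangui.sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    return logger
```

Once the middleware has set `correlation_id_var`, any `logger.error(...)` call and any `make_error(...)` payload pick up the same ID without it being passed as an argument, which is the property the backend relies on.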

**Frontend:**
- API client: `frontend/src/api/client.ts`
  - Generates session-scoped UUID4 on first use
  - Includes in `X-Correlation-ID` header for all requests
  - Extracts from response headers and stores in `ApiError`
- Telemetry service: `frontend/src/utils/telemetry.ts`
  - Structured event logging with correlation ID support
  - Redaction utilities for privacy/security
  - Handlers for custom backends (console logger by default)
- Error handlers: `frontend/src/utils/fetchError.ts`
  - Extract correlation ID from API errors
  - Log with telemetry for distributed tracing
- Error boundaries: `ErrorBoundary.tsx`, `PageErrorBoundary.tsx`, and `SectionErrorBoundary.tsx` in `frontend/src/components/`
  - Catch render-time exceptions
  - Log with telemetry for observability

### Privacy & Security

- **No sensitive data logged:**
  - Passwords, tokens, and authentication session IDs are never logged
  - PII (names, emails, IPs) is logged only with explicit intent and redaction
  - Redaction utilities: `telemetry.redact()`, `telemetry.redactObject()`

- **Backend:** Correlation IDs use opaque UUID4 (no user data embedded)
- **Frontend:** The same session-scoped correlation UUID is sent with all requests; it is a random identifier generated by the API client, not a credential, and is safe to expose in logs
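
The redaction rule can be sketched as a recursive scrub of an event before it is logged. This is written in Python for illustration only and is not the frontend's `telemetry.redact()` or `telemetry.redactObject()`: the function names `redact` and `redact_object`, the key list, the regular expressions, and the `[REDACTED]` mask are all assumptions.

```python
import re

# Hypothetical configuration; the real utilities may use different rules.
SENSITIVE_KEYS = {"password", "token", "authorization", "session_id", "cookie"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MASK = "[REDACTED]"


def redact(text):
    """Mask e-mail addresses and IPv4 addresses inside a string."""
    text = EMAIL_RE.sub(MASK, text)
    return IPV4_RE.sub(MASK, text)


def redact_object(value):
    """Recursively mask sensitive keys and scrub string values."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else redact_object(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_object(item) for item in value]
    if isinstance(value, str):
        return redact(value)
    return value  # numbers, booleans, and None pass through unchanged


event = redact_object({
    "password": "hunter2",
    "context": {"ip": "192.168.1.1", "note": "mail admin@example.com", "status": 500},
})
```

Values under sensitive keys are replaced wholesale, while free-text values are scrubbed by pattern. Pattern-based scrubbing is best effort, so an allow-list of fields that may be logged is usually the safer default.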

### Future Enhancements

1. **Backend error telemetry aggregation:**
   - Send structured logs to an observability platform (Datadog, Grafana Loki, etc.)
   - Query by correlation ID to trace the entire request flow

2. **Frontend error reporting:**
   - Send frontend telemetry to a backend `/api/telemetry` endpoint
   - Store alongside backend logs for a unified view

3. **Metrics & dashboards:**
   - Error rates by endpoint, severity, error type
   - Latency percentiles and distribution
   - Request success/failure trends

---

## 11. Design Principles

These principles govern all architectural decisions in BanGUI.