Implement frontend and backend observability alignment

Align frontend and backend error observability with correlation IDs and
structured telemetry for distributed tracing across systems.

Backend changes:
- Add CorrelationIdMiddleware to generate/extract correlation IDs
- Include correlation_id in all ErrorResponse objects
- Store correlation ID in structlog contextvars for automatic inclusion in logs
- Add correlation ID to response headers (X-Correlation-ID)

Frontend changes:
- API client automatically generates session-scoped UUID4 and includes
  X-Correlation-ID header in all requests
- Extract correlation ID from API error responses
- Update error handlers to use telemetry with correlation IDs
- Add telemetry logging to ErrorBoundary, PageErrorBoundary, SectionErrorBoundary
- Implement redaction utilities for privacy-safe logging of sensitive data

Documentation:
- Add observability guidelines to Web-Development.md
  * Correlation ID usage patterns
  * Privacy & security best practices
  * Telemetry event structure
  * Redaction utilities for sensitive data
- Add distributed tracing architecture section to Architecture.md
  * Correlation ID flow across frontend/backend
  * Example troubleshooting scenario
  * Implementation details for future enhancements

Testing:
- Add comprehensive tests for correlation middleware
- Update error boundary tests to verify telemetry integration
- Verify TypeScript and ESLint pass with no warnings

Fixes: Issue #40 - Frontend and backend observability are not aligned

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 18:32:19 +02:00
parent 9a43123b3a
commit 3d1a6f5538
16 changed files with 916 additions and 54 deletions


@@ -1451,7 +1451,149 @@ Currently, the single-executor approach is simple, maintainable, and sufficient
---
## 10. Design Principles
## 10. Observability & Distributed Tracing
BanGUI implements **distributed tracing** via **correlation IDs**, which link errors and requests across the frontend and backend.
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Frontend (React + TypeScript) │
├─────────────────────────────────────────────────────────────┤
│ • API Client generates session-scoped UUID4 (correlation ID)│
│ • Telemetry service records structured events │
│ • Error boundaries catch render errors │
│ • All telemetry events include correlation ID for tracing │
└────────────────────┬────────────────────────────────────────┘
├─ Every request includes
│ X-Correlation-ID header
┌────────────────────┴────────────────────────────────────────┐
│ Backend (Python + FastAPI + structlog) │
├─────────────────────────────────────────────────────────────┤
│ • CorrelationIdMiddleware extracts/generates correlation ID │
│ • All logs automatically include correlation ID │
│ • Error responses include correlation_id field │
│ • structlog outputs JSON with correlation ID in all events │
└─────────────────────────────────────────────────────────────┘
```
### Correlation ID Flow
1. **Frontend → Backend:**
   - API client generates or retrieves the session-scoped UUID4
   - UUID4 is sent in the `X-Correlation-ID` request header
   - All requests use the same session UUID (set once, reused)
2. **Backend Processing:**
   - CorrelationIdMiddleware extracts the correlation ID from the header, or generates one if it is missing
   - ID is stored in structlog contextvars
   - All structured log entries include the correlation ID automatically
   - Error responses include a `correlation_id` field in the JSON body
3. **Backend → Frontend:**
   - Response includes the `X-Correlation-ID` header
   - Error responses include `correlation_id` in the response body
   - Frontend error handlers extract the correlation ID
4. **Frontend Error Logging:**
   - Error handlers extract the correlation ID from the API response
   - Telemetry service logs the error with the correlation ID
   - Browser console and telemetry backends receive linked events
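
The backend half of this flow (steps 2 and 3) can be sketched as a framework-free ASGI middleware. This is a minimal illustration under stated assumptions, not the code in `app/middleware/correlation.py`: it uses the standard-library `contextvars` module as a stand-in for structlog's contextvars so that it runs without dependencies, and `demo_app` and `call` are hypothetical helpers that exist only to exercise it.

```python
import asyncio
import uuid
from contextvars import ContextVar

HEADER = b"x-correlation-id"  # ASGI delivers header names as lower-cased bytes

# Stand-in for structlog's contextvars: holds the ID of the current request.
correlation_id_var = ContextVar("correlation_id", default="")


class CorrelationIdMiddleware:
    """Extract or generate a correlation ID, expose it, and echo it back."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        incoming = dict(scope.get("headers", [])).get(HEADER)
        cid = incoming.decode("latin-1") if incoming else str(uuid.uuid4())
        token = correlation_id_var.set(cid)

        async def send_with_header(message):
            # Add the header to the response start message only.
            if message["type"] == "http.response.start":
                headers = [*message.get("headers", []), (HEADER, cid.encode("latin-1"))]
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_header)
        finally:
            correlation_id_var.reset(token)


async def demo_app(scope, receive, send):
    # Hypothetical handler: reads the ID without receiving it as an argument.
    body = correlation_id_var.get().encode()
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": body})


def call(app, request_headers):
    """Drive one HTTP request through an ASGI app and collect what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/", "headers": request_headers}
    asyncio.run(app(scope, receive, send))
    return sent


if __name__ == "__main__":
    app = CorrelationIdMiddleware(demo_app)
    print(call(app, [(HEADER, b"550e8400-e29b-41d4-a716-446655440000")]))
    print(call(app, []))  # no header sent, so the middleware generates a UUID4
```

Resetting the context variable in `finally` keeps an ID from leaking into unrelated work if the handler raises.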
### Example: Correlating an Error Across Systems
**Scenario:** User clicks "Ban IP" button → API returns 500 error → error logged and displayed
**Frontend telemetry event:**
```json
{
"event": "api_error",
"severity": "error",
"message": "Server error banning IP",
"correlation_id": "550e8400-e29b-41d4-a716-446655440000",
"context": {
"status": 500,
"endpoint": "/api/bans"
},
"timestamp": "2025-04-30T18:30:00.000Z"
}
```
**Backend structured log:**
```json
{
"event": "ban_service_error",
"severity": "error",
"message": "Failed to ban IP",
"correlation_id": "550e8400-e29b-41d4-a716-446655440000",
"context": {
"ip": "192.168.1.1",
"jail": "sshd",
"error": "fail2ban socket error"
},
"timestamp": "2025-04-30T18:30:00.000Z"
}
```
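**Troubleshooting:** An engineer searches the logs for correlation ID `550e8400-e29b-41d4-a716-446655440000` and finds the related events (request received, jail lookup, fail2ban call, error response) in order. Because the ID is session-scoped, the search also returns other requests from the same browser session, so the timestamp and endpoint are needed to isolate the failing request.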
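
The search itself can be approximated offline. The sketch below assumes the logs are available as one JSON object per line with `correlation_id` and `timestamp` fields, as in the examples above; `find_events` is a hypothetical helper, and the `request_received` event name in the demo is made up for illustration.

```python
import json
from datetime import datetime


def _parse_ts(value):
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_events(log_lines, correlation_id):
    """Return the JSON events carrying the given correlation ID, oldest first.

    Assumes every matching event has an ISO 8601 "timestamp" field.
    """
    events = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack-trace fragments
        if isinstance(event, dict) and event.get("correlation_id") == correlation_id:
            events.append(event)
    return sorted(events, key=lambda e: _parse_ts(e["timestamp"]))


if __name__ == "__main__":
    cid = "550e8400-e29b-41d4-a716-446655440000"
    lines = [
        json.dumps({"event": "ban_service_error", "correlation_id": cid,
                    "timestamp": "2025-04-30T18:30:00.200Z"}),
        json.dumps({"event": "request_received", "correlation_id": cid,
                    "timestamp": "2025-04-30T18:30:00.000Z"}),
    ]
    for event in find_events(lines, cid):
        print(event["timestamp"], event["event"])
```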
### Implementation Details
**Backend:**
- Middleware: `app/middleware/correlation.py`
  - Generates a UUID4 if the `X-Correlation-ID` header is missing
  - Stores it in structlog contextvars for automatic inclusion in all logs
  - Adds the correlation ID to the response header and error responses
- All error handlers include `correlation_id` in `ErrorResponse`
  - See `backend/app/models/response.py` for the `ErrorResponse.correlation_id` field
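
As a rough illustration of "automatic inclusion", the sketch below shows how a context variable lets both the log formatter and the error body pick up the ID without it being passed around. It is an assumption-laden stand-in: standard-library `logging` and `contextvars` replace structlog, a dataclass replaces the real `ErrorResponse` model, and the `detail` field, `JsonFormatter`, `make_logger`, and `build_error_response` are hypothetical names (only `correlation_id` is confirmed by this document).

```python
import io
import json
import logging
from contextvars import ContextVar
from dataclasses import asdict, dataclass

# Stand-in for structlog's contextvars; the middleware would set this per request.
correlation_id_var = ContextVar("correlation_id", default="")


class JsonFormatter(logging.Formatter):
    """Emit each record as JSON and attach the current correlation ID."""

    def format(self, record):
        return json.dumps({
            "event": record.getMessage(),
            "severity": record.levelname.lower(),
            "correlation_id": correlation_id_var.get(),
        })


@dataclass
class ErrorResponse:
    detail: str          # hypothetical field, for illustration only
    correlation_id: str  # the field this document describes


def build_error_response(detail):
    """Build an error body that carries the ID of the current request."""
    return asdict(ErrorResponse(detail=detail, correlation_id=correlation_id_var.get()))


def make_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("bangui.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    correlation_id_var.set("550e8400-e29b-41d4-a716-446655440000")
    out = io.StringIO()
    make_logger(out).error("ban_service_error")
    print(out.getvalue().strip())
    print(build_error_response("Failed to ban IP"))
```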
**Frontend:**
- API client: `frontend/src/api/client.ts`
  - Generates a session-scoped UUID4 on first use
  - Includes it in the `X-Correlation-ID` header for all requests
  - Extracts it from response headers and stores it in `ApiError`
- Telemetry service: `frontend/src/utils/telemetry.ts`
  - Structured event logging with correlation ID support
  - Redaction utilities for privacy/security
  - Handlers for custom backends (console logger by default)
- Error handlers: `frontend/src/utils/fetchError.ts`
  - Extract the correlation ID from API errors
  - Log with telemetry for distributed tracing
- Error boundaries: `frontend/src/components/{Error,Page,Section}ErrorBoundary.tsx`
  - Catch render-time exceptions
  - Log with telemetry for observability
### Privacy & Security
- **No sensitive data logged:**
  - Passwords, tokens, and authentication session IDs are never logged
  - PII (names, emails, IPs) is logged only with explicit intent and redaction
  - Redaction utilities: `telemetry.redact()`, `telemetry.redactObject()`
- **Backend:** Correlation IDs are opaque UUID4 values (no user data embedded)
- **Frontend:** The same session-scoped UUID is sent with all requests; it is an opaque identifier rather than a credential, so it is safe to expose in logs
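
The idea behind object redaction can be sketched in a few lines. This is not the implementation of `telemetry.redactObject()` (which lives in the TypeScript frontend); it is a Python illustration of the general technique, and the key list, the `[REDACTED]` placeholder, and the `redact_object` name are all assumptions.

```python
REDACTED = "[REDACTED]"

# Hypothetical list of keys whose values must never reach a log line.
SENSITIVE_KEYS = frozenset({"password", "token", "authorization", "session_id"})


def redact_object(obj, sensitive_keys=SENSITIVE_KEYS):
    """Return a copy of obj with values under sensitive keys replaced.

    Walks dicts, lists, and tuples recursively; key matching is
    case-insensitive. The input is not modified.
    """
    if isinstance(obj, dict):
        return {
            key: REDACTED if str(key).lower() in sensitive_keys
            else redact_object(value, sensitive_keys)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [redact_object(item, sensitive_keys) for item in obj]
    return obj


if __name__ == "__main__":
    context = {"endpoint": "/api/bans", "headers": {"Authorization": "Bearer abc"}}
    print(redact_object(context))
```

Returning a copy rather than mutating in place means the caller can still use the original object after logging it.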
### Future Enhancements
1. **Backend error telemetry aggregation:**
   - Send structured logs to an observability platform (Datadog, Grafana Loki, etc.)
   - Query by correlation ID to trace the entire request flow
2. **Frontend error reporting:**
   - Send frontend telemetry to a backend `/api/telemetry` endpoint
   - Store it alongside backend logs for a unified view
3. **Metrics & dashboards:**
   - Error rates by endpoint, severity, error type
   - Latency percentiles and distribution
   - Request success/failure trends
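
As one example of what such a metric could look like, the sketch below computes a per-endpoint error rate from request records. Nothing here exists in the codebase: the record shape (`endpoint`, `status`), the `error_rate_by_endpoint` name, and the choice to count only 5xx statuses as errors are assumptions made for illustration.

```python
from collections import Counter


def error_rate_by_endpoint(requests):
    """Return {endpoint: fraction of requests that ended in a 5xx status}."""
    totals = Counter()
    errors = Counter()
    for request in requests:
        totals[request["endpoint"]] += 1
        if request["status"] >= 500:
            errors[request["endpoint"]] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


if __name__ == "__main__":
    sample = [
        {"endpoint": "/api/bans", "status": 500},
        {"endpoint": "/api/bans", "status": 200},
        {"endpoint": "/api/jails", "status": 200},
    ]
    print(error_rate_by_endpoint(sample))
```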
---
## 11. Design Principles
These principles govern all architectural decisions in BanGUI.