Implement frontend and backend observability alignment

Align frontend and backend error observability with correlation IDs and
structured telemetry for distributed tracing across systems.

Backend changes:
- Add CorrelationIdMiddleware to generate/extract correlation IDs
- Include correlation_id in all ErrorResponse objects
- Store correlation ID in structlog contextvars for automatic inclusion in logs
- Add correlation ID to response headers (X-Correlation-ID)

Frontend changes:
- API client automatically generates session-scoped UUID4 and includes
  X-Correlation-ID header in all requests
- Extract correlation ID from API error responses
- Update error handlers to use telemetry with correlation IDs
- Add telemetry logging to ErrorBoundary, PageErrorBoundary, SectionErrorBoundary
- Implement redaction utilities for privacy-safe logging of sensitive data

Documentation:
- Add observability guidelines to Web-Development.md
  * Correlation ID usage patterns
  * Privacy & security best practices
  * Telemetry event structure
  * Redaction utilities for sensitive data
- Add distributed tracing architecture section to Architecture.md
  * Correlation ID flow across frontend/backend
  * Example troubleshooting scenario
  * Implementation details for future enhancements

Testing:
- Add comprehensive tests for correlation middleware
- Update error boundary tests to verify telemetry integration
- Verify TypeScript and ESLint pass with no warnings

Fixes: Issue #40 - Frontend and backend observability are not aligned

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-30 18:32:19 +02:00
parent 9a43123b3a
commit 3d1a6f5538
16 changed files with 916 additions and 54 deletions


@@ -1451,7 +1451,149 @@ Currently, the single-executor approach is simple, maintainable, and sufficient
---
## 10. Design Principles
## 10. Observability & Distributed Tracing
BanGUI implements **distributed tracing** via **correlation IDs** to correlate errors and requests across frontend and backend systems.
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Frontend (React + TypeScript)                               │
├─────────────────────────────────────────────────────────────┤
│ • API Client generates session-scoped UUID4 (correlation ID)│
│ • Telemetry service records structured events               │
│ • Error boundaries catch render errors                      │
│ • All telemetry events include correlation ID for tracing   │
└────────────────────┬────────────────────────────────────────┘
                     ├─ Every request includes
                     │  X-Correlation-ID header
┌────────────────────┴────────────────────────────────────────┐
│ Backend (Python + FastAPI + structlog)                      │
├─────────────────────────────────────────────────────────────┤
│ • CorrelationIdMiddleware extracts/generates correlation ID │
│ • All logs automatically include correlation ID             │
│ • Error responses include correlation_id field              │
│ • structlog outputs JSON with correlation ID in all events  │
└─────────────────────────────────────────────────────────────┘
```
### Correlation ID Flow
1. **Frontend → Backend:**
   - API client generates/retrieves session-scoped UUID4
   - UUID4 sent in `X-Correlation-ID` request header
   - All requests use same session UUID (set once, reused)
2. **Backend Processing:**
   - CorrelationIdMiddleware extracts/generates correlation ID
   - ID stored in structlog contextvars
   - All structured log entries include correlation ID automatically
   - Error responses include `correlation_id` field in JSON
3. **Backend → Frontend:**
   - Response includes `X-Correlation-ID` header
   - Error responses include `correlation_id` in response body
   - Frontend error handlers extract correlation ID
4. **Frontend Error Logging:**
   - Error handlers extract correlation ID from API response
   - Telemetry service logs error with correlation ID
   - Browser console and telemetry backends receive linked events
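
Step 1 of the flow above can be sketched roughly as follows. This is a minimal illustration and not the contents of `frontend/src/api/client.ts`: the names `generateUuid4`, `getCorrelationId`, and `withCorrelationHeader` are hypothetical, and `Math.random` is used only to keep the sketch dependency-free (a real client would more likely call `crypto.randomUUID()`).

```typescript
// Hypothetical sketch: these names are illustrative, not taken from client.ts.

// Cached for the lifetime of the page/session; unset until first use.
let sessionCorrelationId: string | undefined;

/** Build a version 4 UUID string. Math.random keeps the sketch
 *  self-contained; it is not a cryptographically strong source. */
function generateUuid4(): string {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = Math.floor(Math.random() * 16);
    // "y" carries the variant bits and must be one of 8, 9, a, b.
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}

/** Return the session's correlation ID, creating it on first use. */
function getCorrelationId(): string {
  if (sessionCorrelationId === undefined) {
    sessionCorrelationId = generateUuid4();
  }
  return sessionCorrelationId;
}

/** Merge the correlation header into caller-supplied request headers. */
function withCorrelationHeader(
  headers: Record<string, string> = {},
): Record<string, string> {
  return { ...headers, "X-Correlation-ID": getCorrelationId() };
}
```

Because the ID is cached in a module-level variable, every call to `withCorrelationHeader()` in the same session yields the same header value, which is what lets the backend logs for all of a session's requests be grouped together.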
### Example: Correlating an Error Across Systems
**Scenario:** User clicks "Ban IP" button → API returns 500 error → error logged and displayed
**Frontend telemetry event:**
```json
{
  "event": "api_error",
  "severity": "error",
  "message": "Server error banning IP",
  "correlation_id": "550e8400-e29b-41d4-a716-446655440000",
  "context": {
    "status": 500,
    "endpoint": "/api/bans"
  },
  "timestamp": "2025-04-30T18:30:00.000Z"
}
```
**Backend structured log:**
```json
{
  "event": "ban_service_error",
  "severity": "error",
  "message": "Failed to ban IP",
  "correlation_id": "550e8400-e29b-41d4-a716-446655440000",
  "context": {
    "ip": "192.168.1.1",
    "jail": "sshd",
    "error": "fail2ban socket error"
  },
  "timestamp": "2025-04-30T18:30:00.000Z"
}
```
**Troubleshooting:** An engineer searches the logs for correlation ID `550e8400-e29b-41d4-a716-446655440000` and finds all related events (request received, jail lookup, fail2ban call, error response) in chronological order.
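
The search described above amounts to a filter on `correlation_id` followed by a sort on `timestamp`. The sketch below shows that idea over an in-memory array; the `LogEvent` type and the `traceByCorrelationId` helper are hypothetical and stand in for whatever log query tool is actually used.

```typescript
// Hypothetical helper: illustrates the lookup, not an existing BanGUI API.

interface LogEvent {
  event: string;
  correlation_id?: string;
  timestamp: string; // ISO 8601, as in the examples above
}

/** Return the events for one correlation ID, oldest first. */
function traceByCorrelationId(
  events: LogEvent[],
  correlationId: string,
): LogEvent[] {
  return events
    // filter() returns a new array, so the sort below leaves the input untouched.
    .filter((e) => e.correlation_id === correlationId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

Feeding it the merged frontend and backend events gives a single timeline for one failing request.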
### Implementation Details
**Backend:**
- Middleware: `app/middleware/correlation.py`
  - Generates UUID4 if `X-Correlation-ID` header missing
  - Stores in structlog contextvars for automatic inclusion in all logs
  - Adds correlation ID to response header and error responses
- All error handlers include `correlation_id` in `ErrorResponse`
  - See `backend/app/models/response.py` for `ErrorResponse.correlation_id` field
**Frontend:**
- API client: `frontend/src/api/client.ts`
  - Generates session-scoped UUID4 on first use
  - Includes in `X-Correlation-ID` header for all requests
  - Extracts from response headers and stores in `ApiError`
- Telemetry service: `frontend/src/utils/telemetry.ts`
  - Structured event logging with correlation ID support
  - Redaction utilities for privacy/security
  - Handlers for custom backends (console logger by default)
- Error handlers: `frontend/src/utils/fetchError.ts`
  - Extract correlation ID from API errors
  - Log with telemetry for distributed tracing
- Error boundaries: `frontend/src/components/{Error,Page,Section}ErrorBoundary.tsx`
  - Catch render-time exceptions
  - Log with telemetry for observability
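
As a rough illustration of the "extracts from response headers and stores in `ApiError`" step, the sketch below builds an error object from a simplified response. It is not the code in `client.ts`: the `ApiError` shape shown here, the `ErrorResponseLike` type, the `message` body field, and the `toApiError` helper are all assumptions made for the example. Only the `correlationId` property, the `X-Correlation-ID` header, and the `correlation_id` body field come from the text above.

```typescript
// Hypothetical shapes: the real ApiError and response types may differ.

class ApiError extends Error {
  readonly status: number;
  readonly correlationId: string | undefined;

  constructor(message: string, status: number, correlationId?: string) {
    super(message);
    this.name = "ApiError";
    this.status = status;
    this.correlationId = correlationId;
  }
}

/** Simplified stand-in for a failed HTTP response. */
interface ErrorResponseLike {
  status: number;
  headers: Record<string, string>; // header names assumed to be lower-cased
  body?: { message?: string; correlation_id?: string };
}

/** Prefer the response header; fall back to the JSON body field. */
function toApiError(res: ErrorResponseLike): ApiError {
  const correlationId =
    res.headers["x-correlation-id"] ?? res.body?.correlation_id;
  const message = res.body?.message ?? `HTTP ${res.status}`;
  return new ApiError(message, res.status, correlationId);
}
```

Falling back to the body field means the ID survives even if an intermediary strips the custom header.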
### Privacy & Security
- **No sensitive data logged:**
  - Passwords, tokens, session IDs never logged
  - PII (names, emails, IPs) logged only with explicit intent and redaction
  - Redaction utilities: `telemetry.redact()`, `telemetry.redactObject()`
- **Backend:** Correlation IDs use opaque UUID4 (no user data embedded)
- **Frontend:** Same session UUID for all requests (safe to expose in logs)
### Future Enhancements
1. **Backend error telemetry aggregation:**
   - Send structured logs to observability platform (DataDog, Grafana Loki, etc.)
   - Query by correlation ID to trace entire request flow
2. **Frontend error reporting:**
   - Send frontend telemetry to backend `/api/telemetry` endpoint
   - Store alongside backend logs for unified view
3. **Metrics & dashboards:**
   - Error rates by endpoint, severity, error type
   - Latency percentiles and distribution
   - Request success/failure trends
---
## 11. Design Principles
These principles govern all architectural decisions in BanGUI.


@@ -1,43 +1,3 @@
## 38) History archive query paths may need explicit indexing plan
- Where found:
  - [backend/app/db.py](backend/app/db.py)
  - [backend/app/repositories/history_archive_repo.py](backend/app/repositories/history_archive_repo.py)
- Why this is needed:
  - Large archive datasets can degrade filter/sort performance.
- Goal:
  - Add indexes aligned with real query patterns.
- What to do:
  - Benchmark common history queries.
  - Add migration with targeted indexes.
- Possible traps and issues:
  - Extra indexes increase write cost and DB size.
- Docs changes needed:
  - Add DB performance/indexing section for history.
- Doc references:
  - [Docs/Backend-Development.md](Docs/Backend-Development.md)
  - https://www.sqlite.org/queryplanner.html
---
## 39) No explicit DI container strategy for backend service graph
- Where found:
  - [backend/app/dependencies.py](backend/app/dependencies.py)
  - [backend/app/services](backend/app/services)
- Why this is needed:
  - Dependency construction and lifecycle are partly implicit.
- Goal:
  - Define a clear dependency wiring pattern for services and repositories.
- What to do:
  - Create service composition root pattern and document usage.
- Possible traps and issues:
  - Over-engineering if container abstraction is too heavy for current size.
- Docs changes needed:
  - Add dependency wiring chapter.
- Doc references:
  - [Docs/Architekture.md](Docs/Architekture.md)
---
## 40) Frontend and backend observability are not aligned
- Where found:
  - [backend/app/main.py](backend/app/main.py)


@@ -1608,7 +1608,88 @@ it("should render a row for each ban", () => {
---
## 15. Git & Workflow
## 15. Error Observability & Telemetry
Frontend errors must be reported with correlation IDs to enable distributed tracing across frontend and backend systems. This allows engineers to correlate errors in the UI with their corresponding backend logs.
### Correlation IDs
- **Automatic:** The API client automatically generates a **session-scoped UUID4** on first use and includes it in the `X-Correlation-ID` header for every request.
- **Backend responds:** The backend includes the correlation ID in the response header and in error responses (`correlation_id` field).
- **Frontend extraction:** Error handlers automatically extract the correlation ID and log it with telemetry events for debugging.
### Error Telemetry
Use the `telemetry.ts` utilities to log errors with correlation IDs:
```ts
import { recordError, recordWarning, redact } from "../utils/telemetry";

// Log API errors with correlation ID
try {
  const data = await api.get("/jails");
} catch (error) {
  const correlationId = (error as ApiError).correlationId;
  recordError(
    "fetch_jails_failed",
    error instanceof Error ? error : new Error(String(error)),
    { endpoint: "/jails" },
    correlationId
  );
}

// Log validation errors
if (!validateEmail(email)) {
  recordWarning(
    "invalid_email_format",
    `Email format invalid: ${redact(email)}`,
    { field: "email" }
  );
}
```
### Privacy & Security
**NEVER log sensitive data:**
- Passwords, tokens, session IDs
- Personal information (names, email addresses, IP addresses)
- Configuration secrets or API keys
- Request/response bodies containing passwords
**Redact sensitive fields before logging:**
```ts
import { redact, redactObject } from "../utils/telemetry";

// Redact URLs with query parameters
const safeUrl = redact("https://api.example.com/login?password=secret");
// Result: "https://api.example.com/login?password=[REDACTED]"

// Redact object fields
const safeConfig = redactObject({
  apiKey: "sk-1234567890",
  username: "john@example.com",
  serverUrl: "https://internal.api.example.com",
});
// Result: { apiKey: "[REDACTED]", username: "[REDACTED]", serverUrl: "..." }
```
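
For readers curious how such helpers might behave, here is one possible sketch that reproduces the two results shown above. It is not the implementation in `telemetry.ts`: the `SENSITIVE_KEY` pattern and both function bodies are assumptions, and the sketch covers only URL query parameters and top-level object keys.

```typescript
// Hypothetical re-implementation for illustration; the real utilities may
// use a different key list and handle more cases (nested objects, emails).

// Assumed list of key names treated as sensitive.
const SENSITIVE_KEY =
  /pass(word)?|token|secret|api[-_]?key|session|user(name)?|email/i;

/** Replace the value of any sensitive-looking query parameter. */
function redact(text: string): string {
  return text.replace(
    /([?&])([^=&#]+)=([^&#]*)/g,
    (match: string, sep: string, key: string) =>
      SENSITIVE_KEY.test(key) ? `${sep}${key}=[REDACTED]` : match,
  );
}

/** Return a shallow copy with sensitive top-level fields masked. */
function redactObject(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

Matching on key names rather than values keeps the sketch simple, but it also means a secret stored under an innocuous key would pass through unmasked.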
### Telemetry Event Structure
All telemetry events are structured with:
- `event`: Machine-readable event name in snake_case (e.g., `"auth_error"`, `"component_render_error"`)
- `severity`: One of `"debug"`, `"info"`, `"warning"`, `"error"`, `"critical"`
- `correlation_id`: UUID for distributed tracing (optional, but recommended for errors)
- `message`: Human-readable description (optional)
- `context`: Structured data bag for additional context (no PII)
- `timestamp`: ISO 8601 timestamp
- `error`: Error instance for stack traces (if applicable)
This mirrors the backend structlog format, enabling consistent log analysis across frontend and backend.
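
The field list above could be expressed as a TypeScript type along these lines. This is a sketch of the event shape and of dispatching events to registered handlers, not the actual contents of `telemetry.ts`: the `TelemetryEvent`, `TelemetryHandler`, and `addTelemetryHandler` names and the return value of `recordError` are assumptions, and the default console handler is omitted.

```typescript
// Hypothetical types and helpers mirroring the documented event fields.

type Severity = "debug" | "info" | "warning" | "error" | "critical";

interface TelemetryEvent {
  event: string; // snake_case event name
  severity: Severity;
  correlation_id?: string;
  message?: string;
  context: Record<string, unknown>; // no PII
  timestamp: string; // ISO 8601
  error?: Error;
}

type TelemetryHandler = (event: TelemetryEvent) => void;

const telemetryHandlers: TelemetryHandler[] = [];

/** Register a backend (console, HTTP, ...) that receives every event. */
function addTelemetryHandler(handler: TelemetryHandler): void {
  telemetryHandlers.push(handler);
}

/** Build an "error" event and pass it to all registered handlers. */
function recordError(
  event: string,
  error: Error,
  context: Record<string, unknown> = {},
  correlationId?: string,
): TelemetryEvent {
  const telemetryEvent: TelemetryEvent = {
    event,
    severity: "error",
    message: error.message,
    context,
    timestamp: new Date().toISOString(),
    error,
    // Only set the field when an ID is known, so it is absent otherwise.
    ...(correlationId !== undefined ? { correlation_id: correlationId } : {}),
  };
  for (const handler of telemetryHandlers) {
    handler(telemetryEvent);
  }
  return telemetryEvent;
}
```

Keeping `correlation_id` in snake_case on the frontend matches the backend field name, so the same query works against both log streams.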
---
## 16. Git & Workflow
- **Branch naming:** `feature/<short-description>`, `fix/<short-description>`, `chore/<short-description>`.
- **Commit messages:** imperative tense, max 72 chars first line (`Add ban table component`, `Fix date formatting in dashboard`).