# Observability

BanGUI provides observability through structured logging with correlation IDs and Prometheus metrics; distributed tracing is planned but not yet available. This document outlines the observability architecture and how to configure it for production deployments.

---

## Logging Architecture

### Overview

BanGUI uses **structlog** to emit structured, machine-readable logs in JSON format. All logs are automatically enriched with:

- **Timestamps** in ISO 8601 format (`timestamp`)
- **Log levels** (`level`: debug, info, warning, error, critical)
- **Logger names** (`logger_name`)
- **Correlation IDs** for request tracking (`correlation_id`)
- **Custom context** from business logic (via context variables)

### Log Output

By default, logs are written to **stdout** in JSON format, making them suitable for:

- Container environments (Docker, Kubernetes)
- Log aggregation systems (ELK, Datadog, Papertrail)
- CI/CD pipelines and monitoring platforms

Example log output (formatted for readability):

```json
{
  "timestamp": "2024-05-01T18:17:19.080+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db",
  "pid": 1234
}
```

### Sensitive Data Handling

**CRITICAL: Never log sensitive data.** The following must NEVER appear in logs:

- Session tokens or cookies
- API keys or secrets
- Passwords or password hashes
- Private cryptographic keys
- Personal information (PII)
- Full IP addresses (when not required for security auditing)

When logging authentication or sensitive operations:

```python
import hashlib

# ✓ Correct: Log event type and result, not credentials
log.info("user_login_attempt", username=username, ip=client_ip, success=True)

# ✓ Correct: Log sanitized identifiers
# (sha256 needs bytes, so a str token is encoded first)
log.error("auth_token_validation_failed", token_hash=hashlib.sha256(token.encode()).hexdigest()[:16])

# ✗ WRONG: Don't do this
log.debug("raw_token", token=token)  # Never!
log.info("password_check", password=password_hash)  # Never!
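
# --- Safety-net sketch (an assumption for illustration, not BanGUI's actual code) ---
# A structlog processor is a callable taking (logger, method_name, event_dict).
# This hypothetical processor masks known-sensitive keys before rendering.
# SENSITIVE_KEYS is an illustrative list; it does not replace code review.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "secret", "cookie"})


def redact_sensitive(logger, method_name, event_dict):
    """Replace the values of known-sensitive keys with a fixed marker."""
    for key in event_dict:
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict

# It could be wired in by listing it in structlog.configure(processors=[...])
# ahead of the final renderer.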
```

structlog processors can be used to filter known-sensitive keys before rendering, which helps prevent accidental logging of sensitive data, but this is only a safety net. Code reviews must verify compliance with this rule.

---

## Structured Logging Best Practices

### Log Levels

Use log levels consistently:

| Level | Use Case | Example |
|-------|----------|---------|
| **debug** | Verbose diagnostic information | `log.debug("parsing_config_file", lines=1024)` |
| **info** | Operational events | `log.info("jail_created", jail_name="sshd", action_count=3)` |
| **warning** | Recoverable issues | `log.warning("config_reload_skipped", reason="no_changes")` |
| **error** | Failures that impact functionality | `log.error("fail2ban_connection_lost", error=str(e))` |
| **critical** | System failures | `log.critical("database_corrupted", error=str(e))` |

### Context Variables

Use structlog's context variables to automatically include request-scoped information in all logs within a request:

```python
import structlog

log = structlog.get_logger()

# In middleware or early in request processing
# (requires structlog.contextvars.merge_contextvars in the processor chain)
structlog.contextvars.clear_contextvars()
structlog.contextvars.bind_contextvars(
    correlation_id=request_id,
    user_id=user_id,
    client_ip=client_ip,
)

# All subsequent logs in this request will include these context variables
log.info("user_action", action="create_jail")  # Automatically includes correlation_id, user_id, etc.

# Clear context at end of request
structlog.contextvars.clear_contextvars()
```

### Event Naming Convention

Use snake_case for event names, prefixed with the component or module name:

```python
# ✓ Good naming
log.info("service_initialized", service="BanService", version="1.0")
log.warning("blocklist_import_slow", duration_ms=5000)
log.error("fail2ban_command_failed", command="list", exit_code=1)

# ✗ Bad naming
log.info("init")  # Too generic
log.warning("slow operation")  # Not machine-readable
log.error("ERROR: FAIL2BAN FAILED!")  # Inconsistent formatting
```

### Attaching Structured Data

Always provide context as key-value pairs, not as unstructured strings:

```python
# ✓ Correct: Structured, queryable
log.info(
    "ban_executed",
    jail="sshd",
    ip="192.0.2.1",
    duration_seconds=3600,
    reason="brute_force",
)

# ✗ Wrong: Unstructured, hard to query
log.info(f"Banned {ip} in jail {jail} for 3600 seconds because brute_force")
```

---

## Centralized Logging Configuration

### Environment Variables

External logging is configured via environment variables (all prefixed with `BANGUI_`):

#### Datadog

Enable logging to Datadog via HTTP API:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key-here
BANGUI_DATADOG_SITE=datadoghq.com  # or datadoghq.eu for EU
BANGUI_DATADOG_BATCH_SIZE=10  # Optional: logs per batch
BANGUI_DATADOG_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
```

#### Papertrail

Enable logging to Papertrail via Syslog protocol:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=papertrail
BANGUI_PAPERTRAIL_HOST=logs1.papertrailapp.com
BANGUI_PAPERTRAIL_PORT=12345
BANGUI_PAPERTRAIL_PROGRAM_NAME=bangui  # Optional: program name in syslog
```

#### ELK Stack

Enable logging to Elasticsearch/Logstash:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=elasticsearch
BANGUI_ELASTICSEARCH_HOSTS=http://elasticsearch:9200
BANGUI_ELASTICSEARCH_INDEX_PREFIX=bangui  # Optional: index prefix
BANGUI_ELASTICSEARCH_BATCH_SIZE=10  # Optional: docs per batch
BANGUI_ELASTICSEARCH_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
```

### Local Development (Disabled by Default)

External logging is **disabled by default**. In development, logs continue to write to stdout only:

```bash
# No configuration needed: logs go to stdout
docker compose up
```

To enable external logging in development for testing:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true \
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog \
BANGUI_DATADOG_API_KEY=test-key \
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000
```

---

## Performance and Reliability

### Non-Blocking Delivery

External log delivery uses **asynchronous buffering** to prevent blocking the application:

1. Logs are written to an in-memory buffer
2. Once the configured flush interval elapses or the batch size is reached, the buffer is sent asynchronously
3. Send failures do not block application logic
4. Retries use exponential backoff (up to 5 attempts)

This keeps log delivery off the request path, so a slow or unavailable provider does not stall request handling.

### Failure Modes

If external logging becomes unavailable:

- **Transient failures** (network timeouts, temporary 5xx errors): Logs are retried with exponential backoff
- **Permanent failures** (invalid API key, host unreachable): A warning is logged; application continues
- **Buffer limits**: Logs are buffered up to a maximum queue size (default: 1000 logs); older logs are dropped if the buffer fills

The application **never crashes** due to external logging failures.

### Log Volume and Rate Limiting

Large log volumes can increase data transfer and storage costs. To manage log volume:

1. **Reduce log level in production**: Set `BANGUI_LOG_LEVEL=warning` or `error` to suppress debug/info logs
2. **Sample logs**: Some providers (Datadog, Papertrail) support sampling rules
3.
   **Filter noisy paths**: Middleware can suppress verbose logging for noisy endpoints

Monitor actual log volume and adjust settings based on usage patterns.

---

## Integration Examples

### Docker Compose (Development with Datadog)

```yaml
version: "3.9"
services:
  bangui:
    build:
      context: .
      dockerfile: Docker/Dockerfile.app
    environment:
      BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
      BANGUI_EXTERNAL_LOGGING_PROVIDER: "datadog"
      BANGUI_DATADOG_API_KEY: "${DATADOG_API_KEY}"
      BANGUI_DATADOG_SITE: "datadoghq.com"
      BANGUI_LOG_LEVEL: "info"
    ports:
      - "8000:8000"
```

### Kubernetes Deployment (Papertrail)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bangui-logging
data:
  BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
  BANGUI_EXTERNAL_LOGGING_PROVIDER: "papertrail"
  BANGUI_PAPERTRAIL_HOST: "logs1.papertrailapp.com"
  BANGUI_PAPERTRAIL_PORT: "12345"
  BANGUI_PAPERTRAIL_PROGRAM_NAME: "bangui"
  BANGUI_LOG_LEVEL: "info"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bangui
spec:
  # apps/v1 requires a selector that matches the pod template labels;
  # the `app: bangui` label is illustrative.
  selector:
    matchLabels:
      app: bangui
  template:
    metadata:
      labels:
        app: bangui
    spec:
      containers:
        - name: bangui
          image: bangui:latest
          envFrom:
            - configMapRef:
                name: bangui-logging
          env:
            # Not read by the papertrail provider (it has no API key setting);
            # shown as the pattern for loading credentials from a Secret.
            - name: BANGUI_DATADOG_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bangui-secrets
                  key: datadog-api-key
```

---

## Monitoring Logging Infrastructure

### Datadog Dashboard Query

Search for all BanGUI logs:

```
service:bangui
```

Search for errors in authentication:

```
service:bangui status:error component:auth
```

### Papertrail Search

Search for all startup events:

```
program:bangui bangui_starting_up
```

Search for authentication failures:

```
program:bangui auth_token_validation_failed
```

### Elasticsearch Query (ELK)

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "logger_name": "app.auth" } },
        { "match": { "level": "error" } }
      ]
    }
  }
}
```

---

## Testing and Debugging

### Verify JSON Output

Inspect the actual JSON emitted by the logging system:

```bash
# Start the app and capture logs
# --json-lines (Python 3.8+) parses each line as its own JSON document
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000 2>&1 | head -10 | python -m json.tool --json-lines
```
Expected output:

```json
{
  "timestamp": "2024-05-01T18:20:45.123456+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db"
}
```

### Enable Debug Logging for External Log Delivery

Set the log level to `debug` to see internal logs from the external logging system:

```bash
BANGUI_LOG_LEVEL=debug BANGUI_EXTERNAL_LOGGING_ENABLED=true python -m uvicorn app.main:create_app
```

This will emit logs like:

```json
{
  "level": "debug",
  "event": "external_log_batch_sent",
  "provider": "datadog",
  "batch_size": 10,
  "duration_ms": 42
}
```

### Validate Configuration

Load and print the effective settings to validate the external logging configuration. The output includes every setting, API keys among them, so do not run this where stdout is captured or shared:

```bash
python -c "from app.config import get_settings; s = get_settings(); print(s.model_dump())"
```

---

## Security Considerations

### API Key Rotation

Rotate API keys regularly:

1. Update `BANGUI_DATADOG_API_KEY` with the new key
2. Restart the application
3. Revoke the old key once the restarted application is delivering logs with the new one

### Network Security

When sending logs over the network:

- **Datadog HTTP API**: Uses HTTPS, encrypted in transit
- **Papertrail Syslog**: Use TLS-enabled Syslog (if supported) or send over VPN/private network
- **Elasticsearch**: Use HTTPS and HTTP Basic Auth or API Key authentication

Never send logs over unencrypted channels in production.

### Compliance

Ensure that your external logging platform complies with your organization's data protection requirements:

- **GDPR**: Verify the platform's data processing agreements
- **HIPAA**: Ensure the provider is HIPAA-eligible
- **SOC 2**: Request audit reports from your logging provider
- **Data retention**: Configure appropriate log retention policies

---

## Troubleshooting

### Logs Not Appearing in External System

1. **Verify configuration**: Check that environment variables are set correctly
2. **Check API credentials**: Ensure the API key or credentials are valid
3. **Check network connectivity**: Verify the external system is reachable
4.
   **Review logs locally**: Run with `BANGUI_LOG_LEVEL=debug` and check stdout for errors
5. **Check buffer capacity**: The delivery buffer is held in memory (default maximum: 1000 logs); if it has filled, logs are dropped before they can be sent

### Performance Degradation

1. **Check buffer size**: If the buffer is full, logs are dropped; increase `BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`
2. **Adjust flush interval**: Decrease the flush interval if batches are growing large
3. **Reduce log level**: Set `BANGUI_LOG_LEVEL=warning` to reduce log volume
4. **Monitor network**: Check bandwidth usage between application and external system

### Lost Logs

In the rare event that logs are lost:

1. **Buffer overflow**: The in-memory buffer has a maximum size; excess logs are dropped with a warning
2. **Network failure during batch send**: Logs are retried; after max retries, a warning is logged
3. **External system outage**: Logs may be dropped if the buffer fills before service is restored

To minimize data loss:

- Increase buffer size (`BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`)
- Use persistent external logging platforms
- Monitor for warnings in application logs about dropped batches

---

## Application Performance Monitoring (Metrics)

BanGUI collects comprehensive metrics for request performance, application health, and resource utilization through **Prometheus**. Metrics are exposed in standard Prometheus text format and can be scraped by monitoring systems.
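
For a quick look at what the text format contains without running a Prometheus server, a scrape can be parsed by hand. The sketch below is a minimal, stdlib-only illustration and not part of BanGUI: `parse_samples` is a hypothetical helper, and it only understands simple `name{labels} value` lines (no timestamps or exemplars). The sample text reuses metric names from this document.

```python
import re

# Matches simple sample lines such as: name{label="value",...} 123.0
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# Matches one label pair inside the braces, allowing escaped characters.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse simple text-format lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this sketch does not understand
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative scrape body using metric names from this document.
SCRAPE = '''\
# TYPE bangui_http_requests_total counter
bangui_http_requests_total{method="GET",endpoint="/api/jails",status_code="200"} 125
bangui_http_active_requests{method="GET",endpoint="/api/jails"} 5
bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/jails"} 45.23
'''

samples = parse_samples(SCRAPE)
for name, labels, value in samples:
    print(name, labels, value)
```

For anything beyond a smoke test, a complete parser such as the one shipped with `prometheus-client` is likely a better fit than this regex-based sketch.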

### Backend Metrics

#### HTTP Request Metrics

The backend automatically tracks HTTP request performance:

- **`bangui_http_requests_total`** (Counter): Total HTTP requests by method, endpoint, and status code

  ```
  bangui_http_requests_total{method="GET",endpoint="/api/jails",status_code="200"} 125
  ```

- **`bangui_http_request_duration_seconds`** (Histogram): Request latency distribution by method and endpoint

  ```
  bangui_http_request_duration_seconds_bucket{method="GET",endpoint="/api/jails",le="0.1"} 120
  bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/jails"} 45.23
  ```

- **`bangui_http_active_requests`** (Gauge): Current number of in-flight requests by method and endpoint

  ```
  bangui_http_active_requests{method="GET",endpoint="/api/jails"} 5
  ```

#### Application Metrics

Domain-specific metrics track application state:

- **`bangui_bans_total`** (Gauge): Total number of currently banned IPs across all jails
- **`bangui_jails_total`** (Gauge): Total number of fail2ban jails
- **`bangui_fail2ban_connection_errors_total`** (Counter): Total fail2ban connection errors

#### Accessing Metrics

Prometheus metrics are exposed at the `/metrics` endpoint:

```bash
curl http://localhost:8000/metrics
```

Response format:

```
# HELP bangui_http_requests_total Total HTTP requests by method, endpoint, and status code
# TYPE bangui_http_requests_total counter
bangui_http_requests_total{method="GET",endpoint="/api/dashboard/status",status_code="200"} 1523.0
# HELP bangui_http_request_duration_seconds HTTP request latency in seconds by method and endpoint
# TYPE bangui_http_request_duration_seconds histogram
bangui_http_request_duration_seconds_bucket{method="GET",endpoint="/api/dashboard/status",le="0.01"} 1200.0
bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/dashboard/status"} 156.78
```

### Frontend Metrics

#### Web Vitals

The frontend automatically measures Core Web Vitals using the `web-vitals` library:

- **Cumulative Layout Shift
  (CLS)**: Visual stability score (good: ≤0.1)
- **First Contentful Paint (FCP)**: Time until first content appears (good: ≤1.8s)
- **First Input Delay (FID)**: Responsiveness to user input (good: ≤100ms; superseded by Interaction to Next Paint in the current Core Web Vitals set)
- **Largest Contentful Paint (LCP)**: Time until largest content is visible (good: ≤2.5s)
- **Time to First Byte (TTFB)**: Server response time (good: ≤800ms)

#### API Call Metrics

API calls are automatically tracked with:

- HTTP method and endpoint
- Response status code
- Duration in milliseconds
- Timestamp

### Integrating with Monitoring Systems

#### Prometheus + Grafana

Configure Prometheus to scrape BanGUI metrics:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: "bangui"
    static_configs:
      - targets: ["localhost:8000"]
    metrics_path: "/metrics"
```

Then import a Grafana dashboard to visualize:

- Request rates by endpoint
- Latency percentiles (p50, p95, p99)
- Error rate trends
- Active request counts

#### Datadog

Configure BanGUI to send metrics via StatsD or HTTP API:

```bash
BANGUI_METRICS_ENABLED=true
BANGUI_METRICS_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key
BANGUI_DATADOG_SITE=datadoghq.com
```

#### New Relic

Send metrics to New Relic (custom event collection):

```bash
BANGUI_METRICS_ENABLED=true
BANGUI_METRICS_PROVIDER=newrelic
BANGUI_NEWRELIC_API_KEY=your-api-key
BANGUI_NEWRELIC_ACCOUNT_ID=your-account-id
```

### Metrics Best Practices

#### Cardinality Management

Metric labels (tags) can cause cardinality explosion if not carefully managed.
BanGUI uses:

- Path normalization: `/api/jails/123` becomes `/api/jails/{id}` to prevent unique labels per resource
- Status code grouping: errors are grouped by category, not individual codes
- Endpoint aggregation: only significant endpoints are tracked

#### Performance Considerations

- Metrics collection has negligible performance impact (<1ms per request)
- In-memory buffering prevents database writes on every request
- High-cardinality labels are avoided
- Metric export (scraping) does not block request processing

#### PII Protection

**NEVER include sensitive data in metric labels:**

- User IDs or session tokens
- Passwords or API keys
- Private IP addresses
- Full request/response bodies

Allowed: HTTP method, endpoint path (normalized), status code, duration, timestamp.

### Query Examples

#### Prometheus Queries

Find p95 request latency for `/api/jails` (histogram buckets are counters, so they are passed through `rate()` and summed by `le` first):

```promql
histogram_quantile(0.95, sum by (le) (rate(bangui_http_request_duration_seconds_bucket{endpoint="/api/jails"}[5m])))
```

Find the rate of 5xx responses (errors per second):

```promql
sum(rate(bangui_http_requests_total{status_code=~"5.."}[5m]))
```

Find active requests per endpoint:

```promql
bangui_http_active_requests
```

#### Grafana Dashboard

Recommended panels:

1. **Request Rate**: `rate(bangui_http_requests_total[1m])` by endpoint
2. **Latency Percentiles**: `histogram_quantile(q, sum by (le) (rate(bangui_http_request_duration_seconds_bucket[5m])))`, one query each for q = 0.5, 0.95, and 0.99
3. **Error Rate**: `rate(bangui_http_requests_total{status_code=~"5.."}[5m])`
4. **Active Requests**: `bangui_http_active_requests` (gauge)
5. **fail2ban Connection Health**: `rate(bangui_fail2ban_connection_errors_total[5m])`

### Troubleshooting Metrics

#### Metrics endpoint not responding

1. Verify the `/metrics` endpoint is accessible: `curl http://localhost:8000/metrics`
2. Check application logs for errors during middleware initialization
3. Ensure prometheus-client is installed: `pip show prometheus-client`

#### High cardinality warnings

If Prometheus warns about high cardinality:

1.
   Check if custom labels are being added to metrics
2. Ensure path normalization is working (IDs should be replaced with `{id}`)
3. Consider sampling metrics for high-volume endpoints

#### Missing metrics

1. Check that endpoints are being called (look for 200 responses in logs)
2. Verify the metrics middleware is registered (check `app.add_middleware(MetricsMiddleware)`)
3. Ensure metrics are being recorded (call `recordApiCall()` on frontend)

---

## Future Enhancements

Planned observability improvements:

- [x] Application metrics collection (Prometheus)
- [x] Web Vitals tracking (frontend)
- [ ] Distributed tracing (OpenTelemetry integration)
- [ ] Custom metric hooks for business events
- [ ] Alerting rules and thresholds
- [ ] Log sampling strategies
- [ ] Additional provider support (Splunk, New Relic, CloudWatch)

---

## Scheduler Lock Health Monitoring

The scheduler lock ensures only one instance runs background tasks. Monitoring its health is critical for production reliability.

### Key Metrics

Monitor these log events for scheduler lock health:

| Event | Level | Meaning |
|-------|-------|---------|
| `scheduler_lock_acquired` | info | Successfully acquired the scheduler lock |
| `scheduler_lock_held_by_other_instance` | warning | Another instance holds the lock (expected during normal multi-instance operation) |
| `scheduler_lock_stale_overwrite` | info | Took over a stale lock from a crashed instance |
| `scheduler_lock_heartbeat_lost` | warning | Heartbeat update failed; we lost the lock |
| `scheduler_lock_release_mismatch` | warning | Release attempted but we don't hold the lock |

### Lock Health Check

Query current lock status via `get_lock_health()`:

```python
from app.utils.scheduler_lock import get_lock_health

health = await get_lock_health(db)
# Returns: {"locked": bool, "pid": int|None, "hostname": str|None,
#           "age_seconds": float|None, "is_stale": bool, "ttl_remaining": float|None}
```

### Alerting Rules

**Critical alerts:**

-
  `scheduler_lock_acquired` not seen for >5 minutes during startup → Instance may not have acquired lock
- `scheduler_lock_heartbeat_lost` repeated >3 times → Lock keeps being stolen, possible contention issue

**Warning alerts:**

- `scheduler_lock_held_by_other_instance` every few minutes → Normal if multiple instances, abnormal if single instance

### Database Query

Check lock state directly in SQLite:

```sql
-- Assumes heartbeat_at stores Unix epoch seconds (as the 'unixepoch' usage implied).
-- Subtracting two datetime() strings would only compare their leading year digits,
-- so the age is computed from epoch seconds instead.
SELECT pid, hostname, heartbeat_at, heartbeat_timeout,
       (CAST(strftime('%s', 'now') AS INTEGER) - heartbeat_at) AS age_seconds
FROM scheduler_lock
WHERE id = 1;
```

### Common Issues

1. **Lock not acquired on startup**: Check logs for `scheduler_lock_held_by_other_instance`. If another instance holds it, verify that the instance is healthy.
2. **Background tasks not running**: Use `get_lock_health()` to verify the lock is held. If not held, the instance cannot run scheduled tasks.
3. **Frequent lock steals**: If `scheduler_lock_stale_overwrite` occurs frequently, the heartbeat interval may be too long or network latency is causing false staleness detection.

---

## References

- [structlog Documentation](https://www.structlog.org/)
- [Datadog Logging Documentation](https://docs.datadoghq.com/logs/)
- [Papertrail Documentation](https://help.papertrailapp.com/)
- [Elasticsearch JSON Logging](https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html)
- [Observability Best Practices (OpenTelemetry)](https://opentelemetry.io/docs/concepts/observability-primer/)