Implement structured logging to centralized platforms (Datadog, Papertrail, ELK)

This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md

- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py

- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)

### 3. Configuration: app/config.py

- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization

### 4. Integration: app/main.py

- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shutdown on termination

### 5. Tests: backend/tests/test_external_logging.py

- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling. Failures are logged locally but don't impact application.
2. **Buffering Strategy**: In-memory buffer with configurable size prevents unbounded memory growth. When buffer fills, oldest logs are dropped with a warning.
3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried with exponential backoff. Permanent failures (bad credentials) are logged and skipped.
4. **Disabled by Default**: External logging is opt-in via environment variables, maintaining backward compatibility with existing deployments.
5. **Provider Flexibility**: Support for multiple platforms allows users to choose based on their infrastructure (cloud-native, on-premise, etc).

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:25:26 +02:00
parent 60d9c5b340
commit 37078b742b
6 changed files with 1383 additions and 53 deletions
--- a/Docs/Observability.md
+++ b/Docs/Observability.md
@@ -0,0 +1,482 @@
+# Observability
+
+BanGUI's observability is currently built on structured logging; custom metrics and distributed tracing are planned (see Future Enhancements). This document outlines the logging architecture and how to configure it for production deployments.
+
+---
+
+## Logging Architecture
+
+### Overview
+
+BanGUI uses **structlog** to emit structured, machine-readable logs in JSON format. All logs are automatically enriched with:
+
+- **Timestamps** in ISO 8601 format (`timestamp`)
+- **Log levels** (`level` - debug, info, warning, error, critical)
+- **Logger names** (`logger_name`)
+- **Correlation IDs** for request tracking (`correlation_id`)
+- **Custom context** from business logic (via context variables)
+
+### Log Output
+
+By default, logs are written to **stdout** in JSON format, making them suitable for:
+- Container environments (Docker, Kubernetes)
+- Log aggregation systems (ELK, Datadog, Papertrail)
+- CI/CD pipelines and monitoring platforms
+
+Example log output (formatted for readability):
+
+```json
+{
+  "timestamp": "2024-05-01T18:17:19.080+02:00",
+  "level": "info",
+  "logger_name": "app.main",
+  "event": "bangui_starting_up",
+  "database_path": "/var/lib/bangui/bangui.db",
+  "pid": 1234
+}
+```
+
+### Sensitive Data Handling
+
+**CRITICAL: Never log sensitive data.** The following must NEVER appear in logs:
+
+- Session tokens or cookies
+- API keys or secrets
+- Passwords or password hashes
+- Private cryptographic keys
+- Personal information (PII)
+- Full IP addresses (when not required for security auditing)
+
+When logging authentication or sensitive operations:
+
+```python
+import hashlib
+
+# ✓ Correct: Log event type and result, not credentials
+log.info("user_login_attempt", username=username, ip=client_ip, success=True)
+
+# ✓ Correct: Log a truncated hash instead of the token itself
+# (sha256 needs bytes; this assumes `token` is a str)
+log.error("auth_token_validation_failed", token_hash=hashlib.sha256(token.encode()).hexdigest()[:16])
+
+# ✗ WRONG: Don't do this
+log.debug("raw_token", token=token)  # Never!
+log.info("password_check", password=password_hash)  # Never!
+```
+
+structlog does not redact sensitive fields on its own. A custom processor in the processor chain can mask or drop known-sensitive keys as a safety net, but code reviews must still verify compliance with this rule.
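
As a safety net for the rule above, a processor can mask values by key name before rendering. The following is a minimal sketch, not BanGUI's actual implementation: the key list and the `redact_sensitive` name are hypothetical, and it relies only on the standard structlog processor signature `(logger, method_name, event_dict)`.

```python
from typing import Any, MutableMapping

# Hypothetical deny-list; extend it to match the field names your code uses.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "secret", "cookie", "session"})
REDACTED = "[REDACTED]"


def redact_sensitive(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """structlog-style processor: mask values whose key is on the deny-list."""
    for key in list(event_dict):
        # Exact (case-insensitive) match, so derived fields such as
        # "token_hash" are left untouched.
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = REDACTED
    return event_dict
```

To take effect, a processor like this would have to sit in the structlog processor list before the JSON renderer. Exact key matching is a deliberate trade-off: it keeps sanitized identifiers such as `token_hash` visible, but it will miss sensitive values logged under unexpected names.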
+
+---
+
+## Structured Logging Best Practices
+
+### Log Levels
+
+Use log levels consistently:
+
+| Level | Use Case | Example |
+|-------|----------|---------|
+| **debug** | Verbose diagnostic information | `log.debug("parsing_config_file", lines=1024)` |
+| **info** | Operational events | `log.info("jail_created", jail_name="sshd", action_count=3)` |
+| **warning** | Recoverable issues | `log.warning("config_reload_skipped", reason="no_changes")` |
+| **error** | Failures that impact functionality | `log.error("fail2ban_connection_lost", error=str(e))` |
+| **critical** | System failures | `log.critical("database_corrupted", error=str(e))` |
+
+### Context Variables
+
+Use structlog's context variables to automatically include request-scoped information in all logs within a request:
+
+```python
+import structlog
+
+log = structlog.get_logger()
+
+# In middleware or early in request processing
+structlog.contextvars.clear_contextvars()
+structlog.contextvars.bind_contextvars(
+    correlation_id=request_id,
+    user_id=user_id,
+    client_ip=client_ip,
+)
+
+# All subsequent logs in this request will include these context variables
+log.info("user_action", action="create_jail")  # Automatically includes correlation_id, user_id, etc.
+
+# Clear context at end of request
+structlog.contextvars.clear_contextvars()
+```
+
+### Event Naming Convention
+
+Use snake_case for event names, prefixed with the component or module name:
+
+```python
+# ✓ Good naming
+log.info("service_initialized", service="BanService", version="1.0")
+log.warning("blocklist_import_slow", duration_ms=5000)
+log.error("fail2ban_command_failed", command="list", exit_code=1)
+
+# ✗ Bad naming
+log.info("init")  # Too generic
+log.warning("slow operation")  # Not machine-readable
+log.error("ERROR: FAIL2BAN FAILED!")  # Inconsistent formatting
+```
+
+### Attaching Structured Data
+
+Always provide context as key-value pairs, not as unstructured strings:
+
+```python
+# ✓ Correct: Structured, queryable
+log.info(
+    "ban_executed",
+    jail="sshd",
+    ip="192.0.2.1",
+    duration_seconds=3600,
+    reason="brute_force",
+)
+
+# ✗ Wrong: Unstructured, hard to query
+log.info(f"Banned {ip} in jail {jail} for 3600 seconds because brute_force")
+```
+
+---
+
+## Centralized Logging Configuration
+
+### Environment Variables
+
+External logging is configured via environment variables (all prefixed with `BANGUI_`):
+
+#### Datadog
+
+Enable logging to Datadog via HTTP API:
+
+```bash
+BANGUI_EXTERNAL_LOGGING_ENABLED=true
+BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog
+BANGUI_DATADOG_API_KEY=your-api-key-here
+BANGUI_DATADOG_SITE=datadoghq.com              # or datadoghq.eu for EU
+BANGUI_DATADOG_BATCH_SIZE=10                   # Optional: logs per batch
+BANGUI_DATADOG_FLUSH_INTERVAL_SECONDS=5        # Optional: flush interval
+```
+
+#### Papertrail
+
+Enable logging to Papertrail via Syslog protocol:
+
+```bash
+BANGUI_EXTERNAL_LOGGING_ENABLED=true
+BANGUI_EXTERNAL_LOGGING_PROVIDER=papertrail
+BANGUI_PAPERTRAIL_HOST=logs1.papertrailapp.com
+BANGUI_PAPERTRAIL_PORT=12345
+BANGUI_PAPERTRAIL_PROGRAM_NAME=bangui          # Optional: program name in syslog
+```
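
For orientation, this is roughly what a syslog line sent over TCP looks like. It is a sketch that assumes RFC 5424 formatting with newline-terminated framing; the handler's real wire format is not shown in this document, and `format_syslog_line` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional

# Syslog severities (RFC 5424) for the log levels used in this document.
SEVERITY = {"critical": 2, "error": 3, "warning": 4, "info": 6, "debug": 7}
FACILITY_USER = 1  # "user-level messages"


def format_syslog_line(
    level: str,
    message: str,
    hostname: str,
    program_name: str = "bangui",
    when: Optional[datetime] = None,
) -> bytes:
    """Build one newline-terminated RFC 5424 line ready to write to a TCP socket."""
    when = when or datetime.now(timezone.utc)
    # PRI is facility * 8 + severity; unknown levels fall back to "info".
    priority = FACILITY_USER * 8 + SEVERITY.get(level, 6)
    # Fields: PRI VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG
    line = f"<{priority}>1 {when.isoformat()} {hostname} {program_name} - - - {message}\n"
    return line.encode("utf-8")
```

In this sketch the `program_name` argument plays the role of `BANGUI_PAPERTRAIL_PROGRAM_NAME`, and the JSON log event would be passed as `message`.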
+
+#### ELK Stack
+
+Enable logging to Elasticsearch (the storage and search layer of the ELK stack) via its Bulk API:
+
+```bash
+BANGUI_EXTERNAL_LOGGING_ENABLED=true
+BANGUI_EXTERNAL_LOGGING_PROVIDER=elasticsearch
+BANGUI_ELASTICSEARCH_HOSTS=http://elasticsearch:9200
+BANGUI_ELASTICSEARCH_INDEX_PREFIX=bangui       # Optional: index prefix
+BANGUI_ELASTICSEARCH_BATCH_SIZE=10             # Optional: docs per batch
+BANGUI_ELASTICSEARCH_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
+```
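
The Elasticsearch Bulk API expects newline-delimited JSON: an action line followed by the document, with a trailing newline at the end of the body. The sketch below shows how a batch could be serialized. The daily index naming (`<prefix>-YYYY.MM.DD`) and the `build_bulk_body` helper are assumptions for illustration, not taken from the handler.

```python
import json
from datetime import datetime, timezone
from typing import Any, Iterable


def build_bulk_body(events: Iterable[dict[str, Any]], index_prefix: str = "bangui") -> str:
    """Serialize events as NDJSON for the Elasticsearch `_bulk` endpoint."""
    # Assumed naming scheme: one index per UTC day, e.g. "bangui-2024.05.01".
    index_name = f"{index_prefix}-{datetime.now(timezone.utc):%Y.%m.%d}"
    lines: list[str] = []
    for event in events:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # default=str keeps non-JSON types (datetime, Path, ...) from raising.
        lines.append(json.dumps(event, default=str))
    # The Bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""
```

A body built this way would be sent with `POST /_bulk` and the `Content-Type: application/x-ndjson` header.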
+
+### Local Development (Disabled by Default)
+
+External logging is **disabled by default**. In development, logs continue to write to stdout only:
+
+```bash
+# No configuration needed — logs go to stdout
+docker compose up
+```
+
+To enable external logging in development for testing:
+
+```bash
+BANGUI_EXTERNAL_LOGGING_ENABLED=true \
+BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog \
+BANGUI_DATADOG_API_KEY=test-key \
+python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000
+```
+
+---
+
+## Performance and Reliability
+
+### Non-Blocking Delivery
+
+External log delivery uses **asynchronous buffering** to prevent blocking the application:
+
+1. Logs are written to an in-memory buffer
+2. After the configured flush interval or batch size, the buffer is sent asynchronously
+3. Send failures do not block application logic
+4. Retries use exponential backoff (up to 5 attempts)
+
+This keeps log delivery off the request path. The remaining cost is the memory held by the buffer and the background network I/O.
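
The buffering behavior described above can be sketched with a bounded `deque`. This is an illustration under assumptions, not the code in `app/utils/external_logging.py`: the `BoundedLogBuffer` class and its method names are hypothetical.

```python
from collections import deque
from typing import Any


class BoundedLogBuffer:
    """In-memory buffer that drops the oldest entry once it is full."""

    def __init__(self, max_size: int = 1000, batch_size: int = 10) -> None:
        self._entries: deque[dict[str, Any]] = deque(maxlen=max_size)
        self._batch_size = batch_size
        self.dropped = 0  # how many entries were evicted because the buffer was full

    def append(self, entry: dict[str, Any]) -> None:
        """Constant-time and never waits; intended for the logging path."""
        if len(self._entries) == self._entries.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
        self._entries.append(entry)

    def next_batch(self) -> list[dict[str, Any]]:
        """Remove and return up to batch_size entries, oldest first."""
        count = min(self._batch_size, len(self._entries))
        return [self._entries.popleft() for _ in range(count)]
```

A background task would call `next_batch()` on each flush interval and send the result, so the code path that emits a log line only ever performs the cheap `append()`.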
+
+### Failure Modes
+
+If external logging becomes unavailable:
+
+- **Transient failures** (network timeouts, temporary 5xx errors): Logs are retried with exponential backoff
+- **Permanent failures** (invalid API key, host unreachable): A warning is logged and the application continues
+- **Sustained outage**: Logs are buffered up to the maximum queue size (default: 1000 logs); once the buffer is full, the oldest logs are dropped
+
+The application **never crashes** due to external logging failures.
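
The retry policy for transient failures can be illustrated as follows. This is a sketch with assumed values: the base delay, the delay cap, the jitter, and the `TransientSendError` type are all hypothetical, and only the 5-attempt limit comes from this document.

```python
import random
import time
from typing import Callable


class TransientSendError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


def send_with_backoff(
    send: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except TransientSendError:
            if attempt == max_attempts - 1:
                break  # out of attempts; the caller logs a warning and drops the batch
            # Delay doubles on every attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so instances do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    return False
```

Any other exception propagates immediately, which matches the idea that permanent failures (for example, rejected credentials) are not retried. An asyncio-based sender would use `asyncio.sleep` rather than a blocking sleep.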
+
+### Log Volume and Rate Limiting
+
+Large log volumes can increase data transfer and storage costs. To manage log volume:
+
+1. **Reduce log level in production**: Set `BANGUI_LOG_LEVEL=warning` or `error` to suppress debug/info logs
+2. **Sample logs**: Some providers (Datadog, Papertrail) support sampling rules
+3. **Filter noisy paths**: Middleware can suppress verbose logging for high-traffic endpoints such as health checks
+
+Monitor actual log volume and adjust settings based on usage patterns.
+
+---
+
+## Integration Examples
+
+### Docker Compose (Development with Datadog)
+
+```yaml
+version: "3.9"
+services:
+  bangui:
+    build:
+      context: .
+      dockerfile: Docker/Dockerfile.app
+    environment:
+      BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
+      BANGUI_EXTERNAL_LOGGING_PROVIDER: "datadog"
+      BANGUI_DATADOG_API_KEY: "${DATADOG_API_KEY}"
+      BANGUI_DATADOG_SITE: "datadoghq.com"
+      BANGUI_LOG_LEVEL: "info"
+    ports:
+      - "8000:8000"
+```
+
+### Kubernetes Deployment (Papertrail)
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: bangui-logging
+data:
+  BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
+  BANGUI_EXTERNAL_LOGGING_PROVIDER: "papertrail"
+  BANGUI_PAPERTRAIL_HOST: "logs1.papertrailapp.com"
+  BANGUI_PAPERTRAIL_PORT: "12345"
+  BANGUI_PAPERTRAIL_PROGRAM_NAME: "bangui"
+  BANGUI_LOG_LEVEL: "info"
+
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: bangui
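+spec:
+  selector:
+    matchLabels:
+      app: bangui          # label name is an example; apps/v1 requires a selector
+  template:
+    metadata:
+      labels:
+        app: bangui
+    spec:
+      containers:
+      - name: bangui
+        image: bangui:latest
+        envFrom:
+        - configMapRef:
+            name: bangui-logging
+        # Papertrail needs no API key. When using the Datadog provider,
+        # inject the key from a Secret instead of the ConfigMap:
+        # env:
+        # - name: BANGUI_DATADOG_API_KEY
+        #   valueFrom:
+        #     secretKeyRef:
+        #       name: bangui-secrets
+        #       key: datadog-api-key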
+```
+
+---
+
+## Monitoring Logging Infrastructure
+
+### Datadog Dashboard Query
+
+Search for all BanGUI logs:
+
+```
+service:bangui
+```
+
+Search for errors in authentication:
+
+```
+service:bangui status:error component:auth
+```
+
+### Papertrail Search
+
+Search for all startup events:
+
+```
+program:bangui bangui_starting_up
+```
+
+Search for authentication failures:
+
+```
+program:bangui auth_token_validation_failed
+```
+
+### Elasticsearch Query (ELK)
+
+```json
+{
+  "query": {
+    "bool": {
+      "must": [
+        { "match": { "logger_name": "app.auth" } },
+        { "match": { "level": "error" } }
+      ]
+    }
+  }
+}
+```
+
+---
+
+## Testing and Debugging
+
+### Verify JSON Output
+
+Inspect the actual JSON emitted by the logging system:
+
+```bash
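+# Start the app and pretty-print the first 10 log lines.
+# --json-lines (Python 3.8+) parses one JSON object per line; without it json.tool expects a single document.
+python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000 2>&1 | head -10 | python -m json.tool --json-lines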
+```
+
+Expected output:
+
+```json
+{
+  "timestamp": "2024-05-01T18:20:45.123456+02:00",
+  "level": "info",
+  "logger_name": "app.main",
+  "event": "bangui_starting_up",
+  "database_path": "/var/lib/bangui/bangui.db"
+}
+```
+
+### Enable Debug Logging for External Log Delivery
+
+Set the log level to `debug` to see internal logs from the external logging system:
+
+```bash
+BANGUI_LOG_LEVEL=debug BANGUI_EXTERNAL_LOGGING_ENABLED=true python -m uvicorn app.main:create_app
+```
+
+This will emit logs like:
+
+```json
+{
+  "level": "debug",
+  "event": "external_log_batch_sent",
+  "provider": "datadog",
+  "batch_size": 10,
+  "duration_ms": 42
+}
+```
+
+### Validate Configuration
+
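+Load the settings outside the application to confirm that they parse and validate. The output may include secrets such as `BANGUI_DATADOG_API_KEY`, so do not paste it into tickets or shared logs: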
+
+```bash
+python -c "from app.config import get_settings; s = get_settings(); print(s.model_dump())"
+```
+
+---
+
+## Security Considerations
+
+### API Key Rotation
+
+Rotate API keys regularly:
+
+1. Update `BANGUI_DATADOG_API_KEY` with the new key
+2. Restart the application
+3. Old keys can be revoked after restart
+
+### Network Security
+
+When sending logs over the network:
+
+- **Datadog HTTP API**: Uses HTTPS, encrypted in transit
+- **Papertrail Syslog**: Use TLS-enabled Syslog (if supported) or send over VPN/private network
+- **Elasticsearch**: Use HTTPS and HTTP Basic Auth or API Key authentication
+
+Never send logs over unencrypted channels in production.
+
+### Compliance
+
+Ensure that your external logging platform complies with your organization's data protection requirements:
+
+- **GDPR**: Verify the platform's data processing agreements
+- **HIPAA**: Ensure the provider is HIPAA-eligible
+- **SOC 2**: Request audit reports from your logging provider
+- **Data retention**: Configure appropriate log retention policies
+
+---
+
+## Troubleshooting
+
+### Logs Not Appearing in External System
+
+1. **Verify configuration**: Check that environment variables are set correctly
+2. **Check API credentials**: Ensure the API key or credentials are valid
+3. **Check network connectivity**: Verify the external system is reachable
+4. **Review logs locally**: Run with `BANGUI_LOG_LEVEL=debug` and check stdout for errors
+5. **Check for dropped-log warnings**: The buffer is held in memory, not on disk; a full buffer shows up as warnings about dropped logs in the local output
+
+### Performance Degradation
+
+1. **Check buffer size**: If the buffer is full, logs are dropped; increase `BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`
+2. **Adjust flush interval**: Decrease flush interval if experiencing large batches
+3. **Reduce log level**: Set `BANGUI_LOG_LEVEL=warning` to reduce log volume
+4. **Monitor network**: Check bandwidth usage between application and external system
+
+### Lost Logs
+
+Logs can be lost in the following situations:
+
+1. **Buffer overflow**: The in-memory buffer has a maximum size; excess logs are dropped with a warning
+2. **Network failure during batch send**: Logs are retried; after max retries, a warning is logged
+3. **External system outage**: Logs may be dropped if buffer fills before service is restored
+
+To minimize data loss:
+
+- Increase buffer size (`BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`)
+- Use persistent external logging platforms
+- Monitor for warnings in application logs about dropped batches
+
+---
+
+## Future Enhancements
+
+Planned observability improvements:
+
+- [ ] Distributed tracing (OpenTelemetry integration)
+- [ ] Custom metrics collection
+- [ ] Alerting rules and thresholds
+- [ ] Log sampling strategies
+- [ ] Additional provider support (Splunk, New Relic, CloudWatch)
+
+---
+
+## References
+
+- [structlog Documentation](https://www.structlog.org/)
+- [Datadog Logging Documentation](https://docs.datadoghq.com/logs/)
+- [Papertrail Documentation](https://help.papertrailapp.com/)
+- [Elasticsearch JSON Logging](https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html)
+- [Observability Best Practices (OpenTelemetry)](https://opentelemetry.io/docs/concepts/observability-primer/)
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,39 +1,3 @@
-## [MEDIUM] Input validation missing for regex patterns (ReDoS)
-
-**Where found**
-
- `backend/app/routers/config.py` — regex validation accepts arbitrary patterns without timeout
-
-**Why this is needed**
-
-Malicious regex causes catastrophic backtracking (ReDoS). Attacker sends pattern → compilation hangs → DoS.
-
-**Goal**
-
-Add timeout and complexity limits to regex validation.
-
-**What to do**
-
-1. Add timeout to regex compilation (2 seconds recommended)
-2. Add length limit (reject patterns > 1000 characters)
-3. Use `signal.alarm()` (Unix) or timeout library
-
-**Possible traps and issues**
-
- `signal.alarm()` Unix-only
- Some valid complex regexes may timeout
- Frontend should also validate (defense in depth)
-
-**Docs changes needed**
-
- Update API docs to document regex validation limits
-
-**Doc references**
-
- `backend/app/routers/config.py`
-
---
-
 ## [MEDIUM] No structured logging to external system

 **Where found**