Implement structured logging to centralized platforms (Datadog, Papertrail, ELK)

This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md
- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py
- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)

### 3. Configuration: app/config.py
- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization

### 4. Integration: app/main.py
- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shut it down on termination

### 5. Tests: backend/tests/test_external_logging.py
- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling.
   Failures are logged locally but don't impact application.

2. **Buffering Strategy**: In-memory buffer with configurable size prevents
   unbounded memory growth. When buffer fills, oldest logs are dropped with
   a warning.

3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried
   with exponential backoff. Permanent failures (bad credentials) are logged
   and skipped.

4. **Disabled by Default**: External logging is opt-in via environment
   variables, maintaining backward compatibility with existing deployments.

5. **Provider Flexibility**: Support for multiple platforms allows users to
   choose based on their infrastructure (cloud-native, on-premise, etc).

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:25:26 +02:00
parent 60d9c5b340
commit 37078b742b
6 changed files with 1383 additions and 53 deletions

482
Docs/Observability.md Normal file

@@ -0,0 +1,482 @@
# Observability
BanGUI's observability currently centers on structured logging; metrics collection and distributed tracing are listed under Future Enhancements. This document outlines the logging architecture and how to configure it for production deployments.
---
## Logging Architecture
### Overview
BanGUI uses **structlog** to emit structured, machine-readable logs in JSON format. All logs are automatically enriched with:
- **Timestamps** in ISO 8601 format (`timestamp`)
- **Log levels** (`level` - debug, info, warning, error, critical)
- **Logger names** (`logger_name`)
- **Correlation IDs** for request tracking (`correlation_id`)
- **Custom context** from business logic (via context variables)
### Log Output
By default, logs are written to **stdout** in JSON format, making them suitable for:
- Container environments (Docker, Kubernetes)
- Log aggregation systems (ELK, Datadog, Papertrail)
- CI/CD pipelines and monitoring platforms
Example log output (formatted for readability):
```json
{
  "timestamp": "2024-05-01T18:17:19.080+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db",
  "pid": 1234
}
```
### Sensitive Data Handling
**CRITICAL: Never log sensitive data.** The following must NEVER appear in logs:
- Session tokens or cookies
- API keys or secrets
- Passwords or password hashes
- Private cryptographic keys
- Personal information (PII)
- Full IP addresses (when not required for security auditing)
When logging authentication or sensitive operations:
```python
# ✓ Correct: Log event type and result, not credentials
log.info("user_login_attempt", username=username, ip=client_ip, success=True)
# ✓ Correct: Log sanitized identifiers
log.error("auth_token_validation_failed", token_hash=hashlib.sha256(token.encode()).hexdigest()[:16])  # sha256 needs bytes, so encode a str token first
# ✗ WRONG: Don't do this
log.debug("raw_token", token=token) # Never!
log.info("password_check", password=password_hash) # Never!
```
A structlog processor can mask or drop sensitive keys before an event is emitted, which guards against accidental leaks. Treat that as a safety net only: code reviews must still verify compliance with this rule.
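As an illustration of processor-based redaction, here is a minimal sketch. It follows structlog's processor calling convention (`logger`, `method_name`, `event_dict`), but the function name `redact_sensitive` and the key list are assumptions for this example, not part of the BanGUI code base.

```python
from typing import Any, MutableMapping

# Hypothetical deny-list; adjust it to the field names your code actually uses.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "secret", "cookie"})


def redact_sensitive(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """structlog-style processor: mask values whose key is on the deny-list."""
    for key in event_dict:
        if key.lower() in SENSITIVE_KEYS:
            # Replacing a value does not change the dict size, so this is safe
            # while iterating over the keys.
            event_dict[key] = "[REDACTED]"
    return event_dict


# Called directly here for demonstration; structlog would call it per event.
event = redact_sensitive(None, "info", {"event": "user_login_attempt", "token": "abc123"})
print(event)
```

A processor like this only catches keys it knows about, so it complements the rule above rather than replacing it.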
---
## Structured Logging Best Practices
### Log Levels
Use log levels consistently:
| Level | Use Case | Example |
|-------|----------|---------|
| **debug** | Verbose diagnostic information | `log.debug("parsing_config_file", lines=1024)` |
| **info** | Operational events | `log.info("jail_created", jail_name="sshd", action_count=3)` |
| **warning** | Recoverable issues | `log.warning("config_reload_skipped", reason="no_changes")` |
| **error** | Failures that impact functionality | `log.error("fail2ban_connection_lost", error=str(e))` |
| **critical** | System failures | `log.critical("database_corrupted", error=str(e))` |
### Context Variables
Use structlog's context variables to automatically include request-scoped information in all logs within a request:
```python
import structlog
log = structlog.get_logger()
# In middleware or early in request processing
structlog.contextvars.clear_contextvars()
structlog.contextvars.bind_contextvars(
    correlation_id=request_id,
    user_id=user_id,
    client_ip=client_ip,
)
# All subsequent logs in this request will include these context variables
log.info("user_action", action="create_jail") # Automatically includes correlation_id, user_id, etc.
# Clear context at end of request
structlog.contextvars.clear_contextvars()
```
### Event Naming Convention
Use snake_case for event names, prefixed with the component or module name:
```python
# ✓ Good naming
log.info("service_initialized", service="BanService", version="1.0")
log.warning("blocklist_import_slow", duration_ms=5000)
log.error("fail2ban_command_failed", command="list", exit_code=1)
# ✗ Bad naming
log.info("init") # Too generic
log.warning("slow operation") # Not machine-readable
log.error("ERROR: FAIL2BAN FAILED!") # Inconsistent formatting
```
### Attaching Structured Data
Always provide context as key-value pairs, not as unstructured strings:
```python
# ✓ Correct: Structured, queryable
log.info(
    "ban_executed",
    jail="sshd",
    ip="192.0.2.1",
    duration_seconds=3600,
    reason="brute_force",
)
# ✗ Wrong: Unstructured, hard to query
log.info(f"Banned {ip} in jail {jail} for 3600 seconds because brute_force")
```
---
## Centralized Logging Configuration
### Environment Variables
External logging is configured via environment variables (all prefixed with `BANGUI_`):
#### Datadog
Enable logging to Datadog via HTTP API:
```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key-here
BANGUI_DATADOG_SITE=datadoghq.com # or datadoghq.eu for EU
BANGUI_DATADOG_BATCH_SIZE=10 # Optional: logs per batch
BANGUI_DATADOG_FLUSH_INTERVAL_SECONDS=5 # Optional: flush interval
```
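To show how these settings could translate into an HTTP request, here is a sketch that builds (but does not send) a request for Datadog's HTTP log intake. The function name `build_datadog_request` and the `service`/`ddsource` values are assumptions for illustration; the actual `DatadogLogHandler` may construct its payload differently.

```python
import json
import urllib.request


def build_datadog_request(api_key: str, site: str, events: list[dict]) -> urllib.request.Request:
    """Build a POST request carrying a batch of log events as a JSON array."""
    # The intake host is derived from the configured site (datadoghq.com, datadoghq.eu, ...).
    url = f"https://http-intake.logs.{site}/api/v2/logs"
    body = json.dumps(
        [
            # Each entry wraps one structured event; the field values are illustrative.
            {"ddsource": "python", "service": "bangui", "message": json.dumps(event)}
            for event in events
        ]
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
    )


req = build_datadog_request(
    "test-key", "datadoghq.com", [{"event": "bangui_starting_up", "level": "info"}]
)
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose so the sketch has no network side effects.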
#### Papertrail
Enable logging to Papertrail via Syslog protocol:
```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=papertrail
BANGUI_PAPERTRAIL_HOST=logs1.papertrailapp.com
BANGUI_PAPERTRAIL_PORT=12345
BANGUI_PAPERTRAIL_PROGRAM_NAME=bangui # Optional: program name in syslog
```
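For orientation, a syslog line in RFC 5424 layout could be assembled as in the sketch below. The helper `format_syslog`, the severity mapping, and the choice of the `user` facility are assumptions for this example; the `PapertrailLogHandler` in this commit may format messages differently.

```python
from datetime import datetime, timezone

# Syslog severities (RFC 5424) for the structlog level names used in this document.
SEVERITY = {"critical": 2, "error": 3, "warning": 4, "info": 6, "debug": 7}
FACILITY_USER = 1  # "user-level messages"


def format_syslog(level, message, program="bangui", hostname="localhost", now=None):
    """Return one newline-terminated RFC 5424 line."""
    # PRI combines facility and severity: facility * 8 + severity.
    pri = FACILITY_USER * 8 + SEVERITY.get(level, SEVERITY["info"])
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    # PROCID, MSGID and STRUCTURED-DATA are left as "-" (nil) in this sketch.
    return f"<{pri}>1 {timestamp} {hostname} {program} - - - {message}\n"


fixed = datetime(2024, 5, 1, 16, 17, 19, tzinfo=timezone.utc)
line = format_syslog("info", '{"event": "bangui_starting_up"}', now=fixed)
print(line, end="")
```

The `program` argument corresponds to `BANGUI_PAPERTRAIL_PROGRAM_NAME`, which is what the `program:bangui` searches match on.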
#### ELK Stack
Enable logging to Elasticsearch (the storage layer of an ELK stack) via the Bulk API:
```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=elasticsearch
BANGUI_ELASTICSEARCH_HOSTS=http://elasticsearch:9200
BANGUI_ELASTICSEARCH_INDEX_PREFIX=bangui # Optional: index prefix
BANGUI_ELASTICSEARCH_BATCH_SIZE=10 # Optional: docs per batch
BANGUI_ELASTICSEARCH_FLUSH_INTERVAL_SECONDS=5 # Optional: flush interval
```
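The Elasticsearch Bulk API expects newline-delimited JSON: one action line followed by one document line per log entry, with a trailing newline. A sketch of building such a body is shown below; the function name `build_bulk_body` and the date-suffixed index name are assumptions for illustration.

```python
import json


def build_bulk_body(index: str, docs: list[dict]) -> str:
    """Build an NDJSON body for POST /_bulk (Content-Type: application/x-ndjson)."""
    lines = []
    for doc in docs:
        # Action line: index this document into the given index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the log event itself.
        lines.append(json.dumps(doc))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "bangui-2024.05.01",
    [{"event": "jail_created", "level": "info"}, {"event": "ban_executed", "level": "info"}],
)
print(body, end="")
```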
### Local Development (Disabled by Default)
External logging is **disabled by default**. In development, logs continue to write to stdout only:
```bash
# No configuration needed — logs go to stdout
docker compose up
```
To enable external logging in development for testing:
```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true \
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog \
BANGUI_DATADOG_API_KEY=test-key \
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000
```
---
## Performance and Reliability
### Non-Blocking Delivery
External log delivery uses **asynchronous buffering** to prevent blocking the application:
1. Logs are written to an in-memory buffer
2. After the configured flush interval or batch size, the buffer is sent asynchronously
3. Send failures do not block application logic
4. Retries use exponential backoff (up to 5 attempts)
This keeps log delivery off the request path, so a slow or unavailable provider does not add latency to request handling.
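The buffering behaviour described above can be sketched with a bounded `collections.deque`. The class name `LogBuffer` and its methods are hypothetical and only illustrate the drop-oldest policy and batch draining; the real handler also runs a background flush task, which is omitted here.

```python
from collections import deque


class LogBuffer:
    """Bounded in-memory buffer that drops the oldest entry when full."""

    def __init__(self, max_size=1000, batch_size=10):
        self._queue = deque(maxlen=max_size)
        self.batch_size = batch_size
        self.dropped = 0  # count of entries lost to overflow

    def append(self, event):
        # A full deque with maxlen silently discards its oldest item on append,
        # so count the drop before it happens.
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
        self._queue.append(event)

    def drain_batch(self):
        """Remove and return up to batch_size entries, oldest first."""
        batch = []
        while self._queue and len(batch) < self.batch_size:
            batch.append(self._queue.popleft())
        return batch


buf = LogBuffer(max_size=3, batch_size=2)
for n in range(5):
    buf.append({"n": n})
first = buf.drain_batch()
print(buf.dropped, first)
```

Appending is a constant-time in-memory operation, which is what keeps the logging call itself cheap; the network send happens later on the drained batch.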
### Failure Modes
If external logging becomes unavailable:
- **Transient failures** (network timeouts, temporary 5xx errors): Logs are retried with exponential backoff
- **Permanent failures** (invalid API key, host unreachable): A warning is logged; application continues
- **Sustained outage**: Logs are buffered up to the maximum buffer size (default: 1000 entries); once the buffer is full, the oldest entries are dropped
The application **never crashes** due to external logging failures.
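The split between transient and permanent failures can be sketched as follows. `TransientError`, `send_with_retry`, and the 0.5 second base delay are assumptions for this example; only the limit of 5 attempts comes from the description above.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx responses)."""


def send_with_retry(send, batch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Try to send a batch; return True on success, False if it was given up."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return True
        except TransientError:
            if attempt == max_attempts:
                return False  # retries exhausted; caller logs a warning
            # Exponential backoff: 0.5s, 1s, 2s, 4s with the defaults.
            sleep(base_delay * 2 ** (attempt - 1))
        except Exception:
            return False  # permanent failure; do not retry
    return False


# Demonstration with a fake sender that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_send(batch):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("timeout")


ok = send_with_retry(flaky_send, [{"event": "x"}], sleep=delays.append)
print(ok, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; an async handler would await a non-blocking sleep instead.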
### Log Volume and Rate Limiting
Large log volumes can increase data transfer and storage costs. To manage log volume:
1. **Reduce log level in production**: Set `BANGUI_LOG_LEVEL=warning` or `error` to suppress debug/info logs
2. **Sample logs**: Some providers (Datadog, Papertrail) support sampling rules
3. **Filter noisy paths**: Middleware can suppress verbose logging for noisy endpoints
Monitor actual log volume and adjust settings based on usage patterns.
---
## Integration Examples
### Docker Compose (Development with Datadog)
```yaml
version: "3.9"
services:
  bangui:
    build:
      context: .
      dockerfile: Docker/Dockerfile.app
    environment:
      BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
      BANGUI_EXTERNAL_LOGGING_PROVIDER: "datadog"
      BANGUI_DATADOG_API_KEY: "${DATADOG_API_KEY}"
      BANGUI_DATADOG_SITE: "datadoghq.com"
      BANGUI_LOG_LEVEL: "info"
    ports:
      - "8000:8000"
```
### Kubernetes Deployment (Papertrail)
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bangui-logging
data:
  BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
  BANGUI_EXTERNAL_LOGGING_PROVIDER: "papertrail"
  BANGUI_PAPERTRAIL_HOST: "logs1.papertrailapp.com"
  BANGUI_PAPERTRAIL_PORT: "12345"
  BANGUI_PAPERTRAIL_PROGRAM_NAME: "bangui"
  BANGUI_LOG_LEVEL: "info"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bangui
spec:
  template:
    spec:
      containers:
        - name: bangui
          image: bangui:latest
          envFrom:
            - configMapRef:
                name: bangui-logging
          env:
            # Only needed when the provider is datadog; the Papertrail
            # settings above do not use an API key.
            - name: BANGUI_DATADOG_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bangui-secrets
                  key: datadog-api-key
```
---
## Monitoring Logging Infrastructure
### Datadog Dashboard Query
Search for all BanGUI logs:
```
service:bangui
```
Search for errors in authentication:
```
service:bangui status:error component:auth
```
### Papertrail Search
Search for all startup events:
```
program:bangui bangui_starting_up
```
Search for authentication failures:
```
program:bangui auth_token_validation_failed
```
### Elasticsearch Query (ELK)
```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "logger_name": "app.auth" } },
        { "match": { "level": "error" } }
      ]
    }
  }
}
```
---
## Testing and Debugging
### Verify JSON Output
Inspect the actual JSON emitted by the logging system:
```bash
# Start the app and capture logs
# json.tool needs --json-lines (Python 3.8+) to parse one JSON object per line
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000 2>&1 | head -10 | python -m json.tool --json-lines
```
Expected output:
```json
{
"timestamp": "2024-05-01T18:20:45.123456+02:00",
"level": "info",
"logger_name": "app.main",
"event": "bangui_starting_up",
"database_path": "/var/lib/bangui/bangui.db"
}
```
### Enable Debug Logging for External Log Delivery
Set the log level to `debug` to see internal logs from the external logging system:
```bash
BANGUI_LOG_LEVEL=debug BANGUI_EXTERNAL_LOGGING_ENABLED=true python -m uvicorn app.main:create_app
```
This will emit logs like:
```json
{
"level": "debug",
"event": "external_log_batch_sent",
"provider": "datadog",
"batch_size": 10,
"duration_ms": 42
}
```
### Validate Configuration
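Load the settings and print them to confirm that the external logging configuration is parsed as expected. The output includes secrets such as API keys, so do not run this where output is captured or shared: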
```bash
python -c "from app.config import get_settings; s = get_settings(); print(s.model_dump())"
```
---
## Security Considerations
### API Key Rotation
Rotate API keys regularly:
1. Update `BANGUI_DATADOG_API_KEY` with the new key
2. Restart the application
3. Old keys can be revoked after restart
### Network Security
When sending logs over the network:
- **Datadog HTTP API**: Uses HTTPS, encrypted in transit
- **Papertrail Syslog**: Use TLS-enabled Syslog (if supported) or send over VPN/private network
- **Elasticsearch**: Use HTTPS and HTTP Basic Auth or API Key authentication
Never send logs over unencrypted channels in production.
### Compliance
Ensure that your external logging platform complies with your organization's data protection requirements:
- **GDPR**: Verify the platform's data processing agreements
- **HIPAA**: Ensure the provider is HIPAA-eligible
- **SOC 2**: Request audit reports from your logging provider
- **Data retention**: Configure appropriate log retention policies
---
## Troubleshooting
### Logs Not Appearing in External System
1. **Verify configuration**: Check that environment variables are set correctly
2. **Check API credentials**: Ensure the API key or credentials are valid
3. **Check network connectivity**: Verify the external system is reachable
4. **Review logs locally**: Run with `BANGUI_LOG_LEVEL=debug` and check stdout for errors
5. **Check for dropped logs**: The buffer is held in memory, not on disk; look for buffer-overflow warnings in the local logs
### Performance Degradation
1. **Check buffer size**: If the buffer is full, logs are dropped; increase `BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`
2. **Adjust flush interval**: Decrease flush interval if experiencing large batches
3. **Reduce log level**: Set `BANGUI_LOG_LEVEL=warning` to reduce log volume
4. **Monitor network**: Check bandwidth usage between application and external system
### Lost Logs
Logs can be lost in the following cases:
1. **Buffer overflow**: The in-memory buffer has a maximum size; excess logs are dropped with a warning
2. **Network failure during batch send**: Logs are retried; after max retries, a warning is logged
3. **External system outage**: Logs may be dropped if buffer fills before service is restored
To minimize data loss:
- Increase buffer size (`BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`)
- Use persistent external logging platforms
- Monitor for warnings in application logs about dropped batches
---
## Future Enhancements
Planned observability improvements:
- [ ] Distributed tracing (OpenTelemetry integration)
- [ ] Custom metrics collection
- [ ] Alerting rules and thresholds
- [ ] Log sampling strategies
- [ ] Additional provider support (Splunk, New Relic, CloudWatch)
---
## References
- [structlog Documentation](https://www.structlog.org/)
- [Datadog Logging Documentation](https://docs.datadoghq.com/logs/)
- [Papertrail Documentation](https://help.papertrailapp.com/)
- [Elasticsearch JSON Logging](https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html)
- [Observability Best Practices (OpenTelemetry)](https://opentelemetry.io/docs/concepts/observability-primer/)


@@ -1,39 +1,3 @@
## [MEDIUM] Input validation missing for regex patterns (ReDoS)
**Where found**
- `backend/app/routers/config.py` — regex validation accepts arbitrary patterns without timeout
**Why this is needed**
Malicious regex causes catastrophic backtracking (ReDoS). Attacker sends pattern → compilation hangs → DoS.
**Goal**
Add timeout and complexity limits to regex validation.
**What to do**
1. Add timeout to regex compilation (2 seconds recommended)
2. Add length limit (reject patterns > 1000 characters)
3. Use `signal.alarm()` (Unix) or timeout library
**Possible traps and issues**
- `signal.alarm()` Unix-only
- Some valid complex regexes may timeout
- Frontend should also validate (defense in depth)
**Docs changes needed**
- Update API docs to document regex validation limits
**Doc references**
- `backend/app/routers/config.py`
---
## [MEDIUM] No structured logging to external system
**Where found**