This commit adds support for shipping logs to external centralized logging platforms, addressing the MEDIUM priority task for structured logging infrastructure.

## Key Changes:

### 1. New Documentation: Docs/Observability.md

- Comprehensive guide to logging architecture and configuration
- Covers all three supported platforms (Datadog, Papertrail, Elasticsearch)
- Includes best practices, security considerations, and troubleshooting
- Documents sensitive data handling and compliance requirements

### 2. Core Implementation: app/utils/external_logging.py

- ExternalLogHandler: Abstract base class for non-blocking log delivery
- DatadogLogHandler: HTTP API integration with JSON payloads
- PapertrailLogHandler: Syslog protocol over TCP
- ElasticsearchLogHandler: Bulk API integration with NDJSON format
- Features:
  - Async buffering with configurable batch size and flush interval
  - Exponential backoff retry logic
  - Non-blocking delivery (never blocks application logic)
  - Proper error handling and internal logging
  - Lifecycle management (start/shutdown)

### 3. Configuration: app/config.py

- New Settings fields for external logging:
  - external_logging_enabled (default: False)
  - external_logging_provider (datadog/papertrail/elasticsearch)
  - external_logging_buffer_size (default: 1000)
  - external_logging_flush_interval_seconds (default: 5.0)
  - Provider-specific configuration (API keys, hosts, batch sizes)
- All fields have sensible defaults
- Full field validation and normalization

### 4. Integration: app/main.py

- Global _external_log_handler for application lifecycle
- _external_logging_processor: structlog processor for handler integration
- Updated _configure_logging(): Add handler to processor chain when enabled
- Updated _lifespan(): Initialize handler before startup, shutdown on termination

### 5. Tests: backend/tests/test_external_logging.py

- 20 comprehensive tests covering all handlers and factory
- Configuration validation tests
- All tests passing

## Design Decisions:

1. **Non-blocking Delivery**: External logging never blocks request handling. Failures are logged locally but don't impact application.
2. **Buffering Strategy**: In-memory buffer with configurable size prevents unbounded memory growth. When buffer fills, oldest logs are dropped with a warning.
3. **Retry Logic**: Transient failures (timeouts, 5xx errors) are retried with exponential backoff. Permanent failures (bad credentials) are logged and skipped.
4. **Disabled by Default**: External logging is opt-in via environment variables, maintaining backward compatibility with existing deployments.
5. **Provider Flexibility**: Support for multiple platforms allows users to choose based on their infrastructure (cloud-native, on-premise, etc).

## Backward Compatibility:

- All new configuration fields have defaults
- External logging disabled by default
- No changes to existing logging behavior unless explicitly configured
- No new required dependencies

## Testing:

- All 20 new tests passing
- Existing tests unaffected (same count of passing tests)
- Configuration validation tested
- Handler creation and lifecycle management tested

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

483 lines
13 KiB
Markdown
# Observability

BanGUI's observability currently centers on structured logging; distributed tracing and custom metrics are planned (see Future Enhancements). This document outlines the logging architecture and how to configure it for production deployments.

---

## Logging Architecture

### Overview

BanGUI uses **structlog** to emit structured, machine-readable logs in JSON format. All logs are automatically enriched with:

- **Timestamps** in ISO 8601 format (`timestamp`)
- **Log levels** (`level` - debug, info, warning, error, critical)
- **Logger names** (`logger_name`)
- **Correlation IDs** for request tracking (`correlation_id`)
- **Custom context** from business logic (via context variables)

### Log Output

By default, logs are written to **stdout** in JSON format, making them suitable for:

- Container environments (Docker, Kubernetes)
- Log aggregation systems (ELK, Datadog, Papertrail)
- CI/CD pipelines and monitoring platforms

```bash
# Example log output (formatted for readability)
{
  "timestamp": "2024-05-01T18:17:19.080+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db",
  "pid": 1234
}
```

### Sensitive Data Handling

**CRITICAL: Never log sensitive data.** The following must NEVER appear in logs:

- Session tokens or cookies
- API keys or secrets
- Passwords or password hashes
- Private cryptographic keys
- Personal information (PII)
- Full IP addresses (when not required for security auditing)

When logging authentication or sensitive operations:

```python
import hashlib

# ✓ Correct: Log event type and result, not credentials
log.info("user_login_attempt", username=username, ip=client_ip, success=True)

# ✓ Correct: Log sanitized identifiers
# (sha256 needs bytes, so the token is encoded first; this assumes `token` is a str)
log.error("auth_token_validation_failed", token_hash=hashlib.sha256(token.encode()).hexdigest()[:16])

# ✗ WRONG: Don't do this
log.debug("raw_token", token=token)  # Never!
log.info("password_check", password=password_hash)  # Never!
```

structlog does not redact sensitive values on its own; a custom processor in the processor chain can mask or drop known-sensitive keys as a safety net. Code reviews must still verify compliance with this rule.

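As an illustration of the safety net described above, the following is a minimal sketch of a masking processor. It follows structlog's processor signature `(logger, method_name, event_dict)`, but the function name and the deny-list of keys are hypothetical and not taken from the BanGUI code base.

```python
from typing import Any, MutableMapping

# Hypothetical deny-list; extend it to match the keys your code base uses.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "secret", "cookie"})
REDACTED = "[REDACTED]"


def redact_sensitive_keys(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """Mask the value of every key on the deny-list (case-insensitive match)."""
    for key in event_dict:
        # Only values are replaced, so mutating while iterating keys is safe.
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = REDACTED
    return event_dict


if __name__ == "__main__":
    event = {"event": "raw_token", "token": "abc123", "jail": "sshd"}
    print(redact_sensitive_keys(None, "debug", event))
```

A processor like this would sit before the JSON renderer in the processor chain. It only catches exact key names, so it complements code review and does not replace it.
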
---

## Structured Logging Best Practices

### Log Levels

Use log levels consistently:

| Level | Use Case | Example |
|-------|----------|---------|
| **debug** | Verbose diagnostic information | `log.debug("parsing_config_file", lines=1024)` |
| **info** | Operational events | `log.info("jail_created", jail_name="sshd", action_count=3)` |
| **warning** | Recoverable issues | `log.warning("config_reload_skipped", reason="no_changes")` |
| **error** | Failures that impact functionality | `log.error("fail2ban_connection_lost", error=str(e))` |
| **critical** | System failures | `log.critical("database_corrupted", error=str(e))` |

### Context Variables

Use structlog's context variables to automatically include request-scoped information in all logs within a request:

```python
import structlog

log = structlog.get_logger()

# In middleware or early in request processing
structlog.contextvars.clear_contextvars()
structlog.contextvars.bind_contextvars(
    correlation_id=request_id,
    user_id=user_id,
    client_ip=client_ip,
)

# All subsequent logs in this request will include these context variables
log.info("user_action", action="create_jail")  # Automatically includes correlation_id, user_id, etc.

# Clear context at end of request
structlog.contextvars.clear_contextvars()
```

### Event Naming Convention

Use snake_case for event names, prefixed with the component or module name:

```python
# ✓ Good naming
log.info("service_initialized", service="BanService", version="1.0")
log.warning("blocklist_import_slow", duration_ms=5000)
log.error("fail2ban_command_failed", command="list", exit_code=1)

# ✗ Bad naming
log.info("init")  # Too generic
log.warning("slow operation")  # Not machine-readable
log.error("ERROR: FAIL2BAN FAILED!")  # Inconsistent formatting
```

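The convention above can be checked mechanically, for example in a unit test or a lint step. The sketch below is one possible check, not part of BanGUI; the regular expression encodes an assumed reading of the rule (lowercase snake_case with at least two underscore-separated words).

```python
import re

# Assumed reading of the convention: lowercase snake_case, at least two
# words, e.g. "<component>_<what_happened>".
EVENT_NAME_RE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")


def is_valid_event_name(name: str) -> bool:
    """Return True if the event name matches the assumed naming convention."""
    return EVENT_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ("fail2ban_command_failed", "init", "slow operation"):
        print(name, is_valid_event_name(name))
```
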
### Attaching Structured Data

Always provide context as key-value pairs, not as unstructured strings:

```python
# ✓ Correct: Structured, queryable
log.info(
    "ban_executed",
    jail="sshd",
    ip="192.0.2.1",
    duration_seconds=3600,
    reason="brute_force",
)

# ✗ Wrong: Unstructured, hard to query
log.info(f"Banned {ip} in jail {jail} for 3600 seconds because brute_force")
```

---

## Centralized Logging Configuration

### Environment Variables

External logging is configured via environment variables (all prefixed with `BANGUI_`):

#### Datadog

Enable logging to Datadog via HTTP API:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key-here
BANGUI_DATADOG_SITE=datadoghq.com  # or datadoghq.eu for EU
BANGUI_DATADOG_BATCH_SIZE=10  # Optional: logs per batch
BANGUI_DATADOG_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
```

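To make the settings above concrete, here is a rough sketch of how a batch could be turned into a Datadog log intake request. It is not the `DatadogLogHandler` implementation: the function name is hypothetical, and the endpoint (`/api/v2/logs` on `http-intake.logs.<site>`), the `DD-API-KEY` header, and the `ddsource`/`service`/`message` fields are assumed from Datadog's public Logs API and should be checked against the Datadog documentation.

```python
import json
from typing import Any


def build_datadog_request(
    events: list[dict[str, Any]],
    api_key: str,
    site: str = "datadoghq.com",
    service: str = "bangui",
) -> tuple[str, dict[str, str], bytes]:
    """Return (url, headers, body) for one batch; nothing is sent here."""
    # Assumed intake endpoint; the site comes from BANGUI_DATADOG_SITE.
    url = f"https://http-intake.logs.{site}/api/v2/logs"
    headers = {"Content-Type": "application/json", "DD-API-KEY": api_key}
    # Each structured event is serialized into the "message" field.
    payload = [
        {"ddsource": "python", "service": service, "message": json.dumps(event)}
        for event in events
    ]
    return url, headers, json.dumps(payload).encode("utf-8")
```

Keeping request construction separate from sending makes the batching logic easy to test without network access.
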
#### Papertrail

Enable logging to Papertrail via Syslog protocol:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=papertrail
BANGUI_PAPERTRAIL_HOST=logs1.papertrailapp.com
BANGUI_PAPERTRAIL_PORT=12345
BANGUI_PAPERTRAIL_PROGRAM_NAME=bangui  # Optional: program name in syslog
```

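For orientation, the sketch below shows how a structured event might be framed as a syslog line before being written to the TCP socket. It assumes RFC 5424 formatting, the `user` facility, and newline-terminated framing; the actual `PapertrailLogHandler` may format messages differently, and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Any

# Syslog severities (RFC 5424) for the log levels used in this document.
SEVERITY = {"debug": 7, "info": 6, "warning": 4, "error": 3, "critical": 2}
FACILITY_USER = 1  # assumed facility


def format_syslog_line(event: dict[str, Any], hostname: str, program: str = "bangui") -> bytes:
    """Render one event as '<PRI>1 TIMESTAMP HOST APP - - - MSG' plus a newline."""
    severity = SEVERITY.get(str(event.get("level", "info")), 6)
    pri = FACILITY_USER * 8 + severity
    timestamp = event.get("timestamp") or datetime.now(timezone.utc).isoformat()
    message = json.dumps(event, separators=(",", ":"))
    # The three "-" fields are the unused PROCID, MSGID and STRUCTURED-DATA.
    return f"<{pri}>1 {timestamp} {hostname} {program} - - - {message}\n".encode("utf-8")
```

The `program` argument corresponds to `BANGUI_PAPERTRAIL_PROGRAM_NAME`, which is what the `program:bangui` searches later in this document match on.
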
#### ELK Stack

Enable logging to Elasticsearch via the Bulk API:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=elasticsearch
BANGUI_ELASTICSEARCH_HOSTS=http://elasticsearch:9200
BANGUI_ELASTICSEARCH_INDEX_PREFIX=bangui  # Optional: index prefix
BANGUI_ELASTICSEARCH_BATCH_SIZE=10  # Optional: docs per batch
BANGUI_ELASTICSEARCH_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
```

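The Elasticsearch Bulk API expects newline-delimited JSON: one action line followed by one document line per log, with a trailing newline. The sketch below builds such a body; the function name and the dated index name in the usage example are assumptions, not the `ElasticsearchLogHandler` implementation.

```python
import json
from typing import Any, Iterable


def build_bulk_body(events: Iterable[dict[str, Any]], index: str) -> bytes:
    """Build an NDJSON body for POST /_bulk (Content-Type: application/x-ndjson)."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                          # document line
    if not lines:
        return b""
    # The Bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


if __name__ == "__main__":
    # Hypothetical index name derived from the configured prefix plus a date.
    print(build_bulk_body([{"event": "ban_executed"}], "bangui-2024.05.01").decode())
```
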
### Local Development (Disabled by Default)

External logging is **disabled by default**. In development, logs continue to write to stdout only:

```bash
# No configuration needed: logs go to stdout
docker compose up
```

To enable external logging in development for testing:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true \
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog \
BANGUI_DATADOG_API_KEY=test-key \
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000
```

---

## Performance and Reliability

### Non-Blocking Delivery

External log delivery uses **asynchronous buffering** to prevent blocking the application:

1. Logs are written to an in-memory buffer
2. When the flush interval elapses or the batch size is reached, the buffered logs are sent asynchronously
3. Send failures do not block application logic
4. Retries use exponential backoff (up to 5 attempts)

This keeps external log delivery off the request path, so a slow or unavailable provider does not stall request handling.

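The buffering steps above can be sketched with a bounded queue that evicts the oldest entry when full. This is an illustrative model only: the class name and methods are hypothetical, and it is not thread-safe, so a real handler would need a lock or an asyncio queue around it.

```python
from collections import deque
from typing import Any


class BoundedLogBuffer:
    """In-memory buffer that drops the oldest event once max_size is reached."""

    def __init__(self, max_size: int = 1000, batch_size: int = 10) -> None:
        self._events: deque[dict[str, Any]] = deque(maxlen=max_size)
        self._max_size = max_size
        self._batch_size = batch_size
        self.dropped = 0  # count of evicted events, useful for a warning log

    def append(self, event: dict[str, Any]) -> None:
        if len(self._events) == self._max_size:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry
        self._events.append(event)

    def batch_ready(self) -> bool:
        return len(self._events) >= self._batch_size

    def pop_batch(self) -> list[dict[str, Any]]:
        """Remove and return up to batch_size events, oldest first."""
        count = min(self._batch_size, len(self._events))
        return [self._events.popleft() for _ in range(count)]
```

A background task would call `pop_batch()` whenever `batch_ready()` is true or the flush interval elapses, and send the result outside the request path.
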
### Failure Modes

If external logging becomes unavailable:

- **Transient failures** (network timeouts, temporary 5xx errors): Logs are retried with exponential backoff
- **Permanent failures** (invalid API key, host unreachable): A warning is logged; application continues
- **Steady-state**: Logs are buffered up to a maximum queue size (default: 1000 logs); the oldest logs are dropped when the buffer fills

The application **never crashes** due to external logging failures.

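The distinction between transient and permanent failures can be sketched as a small retry loop. The 5-attempt limit comes from this document; the base delay, the cap, and the function names are assumptions, and `send` stands in for whatever performs one delivery attempt and returns an HTTP status (or `None` on timeout).

```python
import time
from typing import Callable, Optional


def is_transient(status: Optional[int]) -> bool:
    """None models a timeout or connection error; 5xx is a server-side failure."""
    return status is None or 500 <= status <= 599


def send_with_retry(
    send: Callable[[], Optional[int]],
    max_attempts: int = 5,
    base: float = 0.5,   # assumed first delay in seconds
    cap: float = 30.0,   # assumed upper bound on a single delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a 2xx status is seen, False when giving up."""
    for attempt in range(max_attempts):
        status = send()
        if status is not None and 200 <= status < 300:
            return True
        if not is_transient(status):
            return False  # permanent failure (e.g. 401): do not retry
        if attempt < max_attempts - 1:
            sleep(min(cap, base * 2**attempt))  # 0.5s, 1s, 2s, 4s
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real delays. In an asyncio handler the same logic would use `await asyncio.sleep(...)` instead.
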
### Log Volume and Rate Limiting

Large log volumes can increase data transfer and storage costs. To manage log volume:

1. **Reduce log level in production**: Set `BANGUI_LOG_LEVEL=warning` or `error` to suppress debug/info logs
2. **Sample logs**: Some providers (Datadog, Papertrail) support sampling rules
3. **Filter noisy paths**: Middleware can suppress verbose logging for noisy endpoints

Monitor actual log volume and adjust settings based on usage patterns.

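As a rough illustration of level-based and path-based filtering, the sketch below decides whether an event should be shipped externally. The function name, the `path` key, and the list of noisy endpoints are all hypothetical; BanGUI's own filtering is controlled by `BANGUI_LOG_LEVEL`.

```python
from typing import Any

LEVEL_ORDER = {"debug": 10, "info": 20, "warning": 30, "error": 40, "critical": 50}
NOISY_PATHS = frozenset({"/health", "/metrics"})  # hypothetical endpoints


def should_ship(event: dict[str, Any], min_level: str = "warning") -> bool:
    """Keep errors unconditionally, drop noisy paths, then apply the level floor."""
    level = LEVEL_ORDER.get(str(event.get("level", "info")).lower(), 20)
    if level >= LEVEL_ORDER["error"]:
        return True  # errors and criticals are always shipped
    if event.get("path") in NOISY_PATHS:
        return False
    return level >= LEVEL_ORDER[min_level]
```
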
---

## Integration Examples

### Docker Compose (Development with Datadog)

```yaml
version: "3.9"
services:
  bangui:
    build:
      context: .
      dockerfile: Docker/Dockerfile.app
    environment:
      BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
      BANGUI_EXTERNAL_LOGGING_PROVIDER: "datadog"
      BANGUI_DATADOG_API_KEY: "${DATADOG_API_KEY}"
      BANGUI_DATADOG_SITE: "datadoghq.com"
      BANGUI_LOG_LEVEL: "info"
    ports:
      - "8000:8000"
```

### Kubernetes Deployment (Papertrail)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bangui-logging
data:
  BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
  BANGUI_EXTERNAL_LOGGING_PROVIDER: "papertrail"
  BANGUI_PAPERTRAIL_HOST: "logs1.papertrailapp.com"
  BANGUI_PAPERTRAIL_PORT: "12345"
  BANGUI_PAPERTRAIL_PROGRAM_NAME: "bangui"
  BANGUI_LOG_LEVEL: "info"

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bangui
spec:
  # apps/v1 Deployments require a selector matching the pod template labels;
  # the "app: bangui" label is an illustrative choice
  selector:
    matchLabels:
      app: bangui
  template:
    metadata:
      labels:
        app: bangui
    spec:
      containers:
        - name: bangui
          image: bangui:latest
          envFrom:
            - configMapRef:
                name: bangui-logging
          env:
            # Not read by the Papertrail provider; shown to illustrate how
            # provider secrets (such as a Datadog API key) come from a Secret
            - name: BANGUI_DATADOG_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bangui-secrets
                  key: datadog-api-key
```

---

## Monitoring Logging Infrastructure

### Datadog Dashboard Query

Search for all BanGUI logs:

```
service:bangui
```

Search for errors in authentication:

```
service:bangui status:error component:auth
```

### Papertrail Search

Search for all startup events:

```
program:bangui bangui_starting_up
```

Search for authentication failures:

```
program:bangui auth_token_validation_failed
```

### Elasticsearch Query (ELK)

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "logger_name": "app.auth" } },
        { "match": { "level": "error" } }
      ]
    }
  }
}
```

---

## Testing and Debugging

### Verify JSON Output

Inspect the actual JSON emitted by the logging system:

```bash
# Start the app and pretty-print the first 10 log lines.
# --json-lines (Python 3.8+) parses each line as a separate JSON object;
# without it, json.tool rejects input that contains more than one object.
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000 2>&1 | head -n 10 | python -m json.tool --json-lines
```

Expected output (one object per log line):

```json
{
  "timestamp": "2024-05-01T18:20:45.123456+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db"
}
```

### Enable Debug Logging for External Log Delivery

Set the log level to `debug` to see internal logs from the external logging system:

```bash
BANGUI_LOG_LEVEL=debug BANGUI_EXTERNAL_LOGGING_ENABLED=true python -m uvicorn app.main:create_app
```

This will emit logs like:

```json
{
  "level": "debug",
  "event": "external_log_batch_sent",
  "provider": "datadog",
  "batch_size": 10,
  "duration_ms": 42
}
```

### Validate Configuration

Print the resolved settings to check the external logging configuration (the dump may include API keys, so treat the output as sensitive):

```bash
python -c "from app.config import get_settings; s = get_settings(); print(s.model_dump())"
```

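A lighter-weight check that never prints secret values is to verify that the variables each provider needs are present. The sketch below is a standalone illustration, not part of `app.config`; which variables are strictly required per provider is an assumption based on the configuration examples in this document.

```python
import os
from typing import Mapping

# Assumed minimum set of variables per provider.
REQUIRED_BY_PROVIDER = {
    "datadog": ("BANGUI_DATADOG_API_KEY",),
    "papertrail": ("BANGUI_PAPERTRAIL_HOST", "BANGUI_PAPERTRAIL_PORT"),
    "elasticsearch": ("BANGUI_ELASTICSEARCH_HOSTS",),
}


def check_external_logging_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems; an empty list means nothing is obviously missing."""
    if env.get("BANGUI_EXTERNAL_LOGGING_ENABLED", "false").strip().lower() != "true":
        return []  # external logging is disabled, nothing to validate
    provider = env.get("BANGUI_EXTERNAL_LOGGING_PROVIDER", "").strip().lower()
    if provider not in REQUIRED_BY_PROVIDER:
        return [f"unknown or missing provider: {provider!r}"]
    # Report only variable names, never their values.
    return [f"{name} is not set" for name in REQUIRED_BY_PROVIDER[provider] if not env.get(name)]


if __name__ == "__main__":
    for problem in check_external_logging_env():
        print(problem)
```
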
---

## Security Considerations

### API Key Rotation

Rotate API keys regularly:

1. Update `BANGUI_DATADOG_API_KEY` with the new key
2. Restart the application
3. Old keys can be revoked after restart

### Network Security

When sending logs over the network:

- **Datadog HTTP API**: Uses HTTPS, encrypted in transit
- **Papertrail Syslog**: Use TLS-enabled Syslog (if supported) or send over VPN/private network
- **Elasticsearch**: Use HTTPS and HTTP Basic Auth or API Key authentication

Never send logs over unencrypted channels in production.

### Compliance

Ensure that your external logging platform complies with your organization's data protection requirements:

- **GDPR**: Verify the platform's data processing agreements
- **HIPAA**: Ensure the provider is HIPAA-eligible
- **SOC 2**: Request audit reports from your logging provider
- **Data retention**: Configure appropriate log retention policies

---

## Troubleshooting

### Logs Not Appearing in External System

1. **Verify configuration**: Check that environment variables are set correctly
2. **Check API credentials**: Ensure the API key or credentials are valid
3. **Check network connectivity**: Verify the external system is reachable
4. **Review logs locally**: Run with `BANGUI_LOG_LEVEL=debug` and check stdout for errors
5. **Check for dropped-log warnings**: The buffer is held in memory, not on disk; if it fills, the oldest logs are dropped with a warning

### Performance Degradation

1. **Check buffer size**: If the buffer is full, logs are dropped; increase `BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`
2. **Adjust flush interval**: Decrease flush interval if experiencing large batches
3. **Reduce log level**: Set `BANGUI_LOG_LEVEL=warning` to reduce log volume
4. **Monitor network**: Check bandwidth usage between application and external system

### Lost Logs

In the rare event that logs are lost:

1. **Buffer overflow**: The in-memory buffer has a maximum size; excess logs are dropped with a warning
2. **Network failure during batch send**: Logs are retried; after max retries, a warning is logged
3. **External system outage**: Logs may be dropped if buffer fills before service is restored

To minimize data loss:

- Increase buffer size (`BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`)
- Use persistent external logging platforms
- Monitor for warnings in application logs about dropped batches

---

## Future Enhancements

Planned observability improvements:

- [ ] Distributed tracing (OpenTelemetry integration)
- [ ] Custom metrics collection
- [ ] Alerting rules and thresholds
- [ ] Log sampling strategies
- [ ] Additional provider support (Splunk, New Relic, CloudWatch)

---

## References

- [structlog Documentation](https://www.structlog.org/)
- [Datadog Logging Documentation](https://docs.datadoghq.com/logs/)
- [Papertrail Documentation](https://help.papertrailapp.com/)
- [Elasticsearch JSON Logging](https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html)
- [Observability Best Practices (OpenTelemetry)](https://opentelemetry.io/docs/concepts/observability-primer/)