# Observability

BanGUI provides observability through structured logging, request correlation IDs, and Prometheus metrics; distributed tracing is planned but not yet implemented. This document outlines the observability architecture and how to configure it for production deployments.

---

## Logging Architecture

### Overview

BanGUI uses **structlog** to emit structured, machine-readable logs in JSON format. All logs are automatically enriched with:

- **Timestamps** in ISO 8601 format (`timestamp`)
- **Log levels** (`level`: debug, info, warning, error, critical)
- **Logger names** (`logger_name`)
- **Correlation IDs** for request tracking (`correlation_id`)
- **Custom context** from business logic (via context variables)

### Log Output

By default, logs are written to **stdout** in JSON format, making them suitable for:

- Container environments (Docker, Kubernetes)
- Log aggregation systems (ELK, Datadog, Papertrail)
- CI/CD pipelines and monitoring platforms

Example log output (formatted for readability):

```json
{
  "timestamp": "2024-05-01T18:17:19.080+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db",
  "pid": 1234
}
```

### Sensitive Data Handling

**CRITICAL: Never log sensitive data.** The following must NEVER appear in logs:

- Session tokens or cookies
- API keys or secrets
- Passwords or password hashes
- Private cryptographic keys
- Personal information (PII)
- Full IP addresses (when not required for security auditing)

When logging authentication or sensitive operations:

```python
import hashlib

# ✓ Correct: Log event type and result, not credentials
log.info("user_login_attempt", username=username, ip=client_ip, success=True)

# ✓ Correct: Log sanitized identifiers (hashlib needs bytes; `token` is assumed to be a str)
log.error("auth_token_validation_failed", token_hash=hashlib.sha256(token.encode()).hexdigest()[:16])

# ✗ WRONG: Don't do this
log.debug("raw_token", token=token)  # Never!
log.info("password_check", password=password_hash)  # Never!
```

structlog processors can drop or redact known-sensitive keys before a log line is rendered, but they are only a safety net against accidental leaks. Code reviews must verify compliance with this rule.

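The `token_hash` pattern above can be wrapped in a small helper so that call sites do not repeat the hashing details. This is a minimal standard-library sketch; the name `token_fingerprint` is hypothetical and is not part of BanGUI.

```python
import hashlib
from typing import Union


def token_fingerprint(token: Union[str, bytes], length: int = 16) -> str:
    """Return a short, non-reversible identifier for a secret value."""
    # hashlib operates on bytes, so encode str input first.
    data = token.encode("utf-8") if isinstance(token, str) else token
    # Truncate the hex digest; enough to correlate log lines, not to recover the token.
    return hashlib.sha256(data).hexdigest()[:length]
```

A fingerprint like this is only as unguessable as the token itself, so it suits high-entropy session tokens rather than passwords.
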
### Log Sanitization

All external output (subprocess results, API responses, config file contents) passed to structlog **must** be sanitized first using `sanitize_for_logging()` from `app.utils.log_sanitizer`.

This prevents sensitive data (passwords, API keys, tokens, private keys) from leaking into logs.

```python
from app.utils.log_sanitizer import sanitize_for_logging

# ✓ Correct: Sanitize before logging
log.error(
    "fail2ban_start_failed",
    command=" ".join(start_cmd_parts),
    returncode=process.returncode,
    stdout=sanitize_for_logging(stdout.decode("utf-8", errors="replace")),
    stderr=sanitize_for_logging(stderr.decode("utf-8", errors="replace")),
)

# ✗ Wrong: Raw output may contain secrets
log.error("fail2ban_start_failed", stdout=stdout_raw, stderr=stderr_raw)  # Never!
```

`sanitize_for_logging()` redacts the following patterns:

| Pattern | Example match | Replacement |
|---------|---------------|-------------|
| `password=X` | `password=Secret123` | `password=***` |
| `api_key=X` / `api-key=X` | `api_key=key123` | `api_key=***` |
| `token=X` | `token=eyJhbG...` | `token=***` |
| `Authorization: Bearer X` | `Authorization: Bearer tok...` | `Authorization: ***` |
| `secret=X` | `secret=myvalue` | `secret=***` |
| `-----BEGIN RSA PRIVATE KEY-----` | (key header) | `*** PRIVATE KEY ***` |
| `AKIA...` | `AKIAIOSFODNN7EXAMPLE` | `AKIA***` |

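To make the table concrete, here is a rough sketch of how pattern-based redaction of this kind can work. It is not the code in `app.utils.log_sanitizer`; the function name `sanitize_sketch` and the exact regular expressions are assumptions chosen to mirror the rows above.

```python
import re

# (pattern, replacement) pairs mirroring the table above; illustrative only.
_REDACTIONS = [
    (re.compile(r"(?i)\b(password|api[_-]key|token|secret)=\S+"), r"\1=***"),
    (re.compile(r"(?i)\bAuthorization:\s*Bearer\s+\S+"), "Authorization: ***"),
    (re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"), "*** PRIVATE KEY ***"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "AKIA***"),
]


def sanitize_sketch(text: str) -> str:
    """Apply each redaction pattern in turn and return the scrubbed text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern lists like this only catch the shapes they know about, which is why the rule against logging raw secrets still applies.
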
---

## Third-Party Library Logs

BanGUI uses **structlog** for all application logs, but third-party libraries often emit plain text through Python's standard `logging` module. To maintain uniform JSON output and reduce noise, the following libraries have their log levels overridden to `WARNING`:

| Library | Logger Name | Level | Rationale |
|---------|-------------|-------|-----------|
| APScheduler | `apscheduler` | `WARNING` | Suppresses routine scheduler polling ("Looking for jobs to run", "Next wakeup is due at...") while preserving job failure warnings. |
| aiosqlite | `aiosqlite` | `WARNING` | Suppresses database operation traces and connection details while preserving connection errors. |

These overrides are applied in `backend/app/main.py::_configure_logging()` immediately after `logging.basicConfig()`.

### Disabling Suppression

Set the environment variable `BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false` to allow APScheduler and aiosqlite to emit their normal DEBUG/INFO logs. This is useful when troubleshooting scheduler or database issues in development.

```bash
BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false python -m uvicorn app.main:create_app
```

When suppression is disabled, the loggers inherit the application's `BANGUI_LOG_LEVEL` (e.g., `debug`).

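The override and its opt-out can be pictured with the standard `logging` module alone. The sketch below is an assumption about the shape of such logic, not the body of `_configure_logging()`; the function name and the set of accepted "false" spellings are hypothetical.

```python
import logging
import os

NOISY_LOGGERS = ("apscheduler", "aiosqlite")


def apply_third_party_overrides(environ=os.environ) -> bool:
    """Raise noisy library loggers to WARNING unless suppression is disabled."""
    raw = environ.get("BANGUI_SUPPRESS_THIRD_PARTY_LOGS", "true")
    # Assumption: only a few common spellings count as "disabled".
    suppress = raw.strip().lower() not in {"false", "0", "no"}
    for name in NOISY_LOGGERS:
        # NOTSET makes the logger inherit its effective level from its parent.
        level = logging.WARNING if suppress else logging.NOTSET
        logging.getLogger(name).setLevel(level)
    return suppress
```

Resetting to `NOTSET` rather than `DEBUG` is what lets the library loggers follow the application-wide level when suppression is off.
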
### Uniform JSON Formatting

All stdlib logs, including those from third-party libraries, are intercepted by `structlog.stdlib.ProcessorFormatter` and rendered as JSON. This ensures every log line in `bangui.log` is machine-readable, regardless of its source.

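The idea of rendering every stdlib record as one JSON object can be illustrated without structlog. The following is a standard-library sketch under that assumption; `JsonLineFormatter` is a hypothetical name, and the field names simply follow the log examples in this document.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else was supplied via `extra=`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


class JsonLineFormatter(logging.Formatter):
    """Render a stdlib LogRecord as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger_name": record.name,
            "event": record.getMessage(),
        }
        # Copy caller-supplied extra fields next to the standard keys.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)
```

Attaching such a formatter to the root handler is what makes third-party records come out in the same shape as application logs.
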
### Adding New Overrides

When integrating a new library that emits verbose DEBUG logs:

```python
# In backend/app/main.py, inside _configure_logging()
logging.getLogger("new_library").setLevel(logging.WARNING)
```

Use `WARNING` as the default to still capture errors and warnings. Only use `ERROR` if the library is exceptionally noisy and its warnings are not actionable.

---

## Structured Logging Best Practices

### Log Levels

Use log levels consistently:

| Level | Use Case | Example |
|-------|----------|---------|
| **debug** | Verbose diagnostic information | `log.debug("parsing_config_file", lines=1024)` |
| **info** | Operational events | `log.info("jail_created", jail_name="sshd", action_count=3)` |
| **warning** | Recoverable issues | `log.warning("config_reload_skipped", reason="no_changes")` |
| **error** | Failures that impact functionality | `log.error("fail2ban_connection_lost", error=str(e))` |
| **critical** | System failures | `log.critical("database_corrupted", error=str(e))` |

### Context Variables

Use structlog's context variables to automatically include request-scoped information in all logs within a request:

```python
import structlog

log = structlog.get_logger()

# In middleware or early in request processing
structlog.contextvars.clear_contextvars()
structlog.contextvars.bind_contextvars(
    correlation_id=request_id,
    user_id=user_id,
    client_ip=client_ip,
)

# All subsequent logs in this request will include these context variables
log.info("user_action", action="create_jail")  # Automatically includes correlation_id, user_id, etc.

# Clear context at end of request
structlog.contextvars.clear_contextvars()
```

### Background Task Correlation

Background tasks (APScheduler jobs) run outside the HTTP request context.
Use `app.utils.correlation` to propagate correlation IDs through tasks:

```python
import uuid

from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id


async def my_background_task(correlation_id: str | None = None) -> None:
    # Generate a new ID if not provided (scheduled tasks have no parent request)
    if correlation_id is None:
        correlation_id = str(uuid.uuid4())

    # Set the correlation ID for all logs in this task
    token = set_correlation_id(correlation_id)
    try:
        log.info("task_started")  # Now includes correlation_id
        # ... task logic ...
    finally:
        # Always restore the previous value, even if the task raises
        reset_correlation_id(token)


# When scheduling, optionally pass the current correlation ID:
# scheduler.add_job(my_background_task, kwargs={"correlation_id": get_correlation_id()})
```

Scheduled tasks (no parent request) generate a fresh UUID for each run.
Tasks triggered by a request inherit the request's correlation ID.

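For readers unfamiliar with the set/reset token pattern used above, here is one way such helpers could be built on `contextvars`. This is a sketch of the general technique and an assumption about the design, not the contents of `app.utils.correlation`.

```python
import contextvars
from typing import Optional

# One context-local slot; each asyncio task sees its own value.
_correlation_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "correlation_id", default=None
)


def get_correlation_id() -> Optional[str]:
    """Return the correlation ID bound to the current context, if any."""
    return _correlation_id.get()


def set_correlation_id(value: str) -> contextvars.Token:
    """Bind a correlation ID and return a token that can undo the binding."""
    return _correlation_id.set(value)


def reset_correlation_id(token: contextvars.Token) -> None:
    """Restore whatever value was bound before the matching set call."""
    _correlation_id.reset(token)
```

Because `reset` restores the previous value rather than clearing it, nested tasks can override the ID temporarily without losing the caller's ID.
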
### Event Naming Convention

Use snake_case for event names, prefixed with the component or module name:

```python
# ✓ Good naming
log.info("service_initialized", service="BanService", version="1.0")
log.warning("blocklist_import_slow", duration_ms=5000)
log.error("fail2ban_command_failed", command="list", exit_code=1)

# ✗ Bad naming
log.info("init")  # Too generic
log.warning("slow operation")  # Not machine-readable
log.error("ERROR: FAIL2BAN FAILED!")  # Inconsistent formatting
```

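A convention like this is easy to check mechanically, for example in a unit test or lint step. The helper below is a hypothetical sketch; requiring at least two underscore-separated segments is an assumed reading of "prefixed with the component or module name".

```python
import re

# Lowercase snake_case with at least two segments, e.g. "component_action".
_EVENT_NAME = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+")


def is_valid_event_name(name: str) -> bool:
    """Return True when the whole name matches the snake_case convention."""
    return _EVENT_NAME.fullmatch(name) is not None
```
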
### Attaching Structured Data

Always provide context as key-value pairs, not as unstructured strings:

```python
# ✓ Correct: Structured, queryable
log.info(
    "ban_executed",
    jail="sshd",
    ip="192.0.2.1",
    duration_seconds=3600,
    reason="brute_force",
)

# ✗ Wrong: Unstructured, hard to query
log.info(f"Banned {ip} in jail {jail} for 3600 seconds because brute_force")
```

---

## Centralized Logging Configuration

### Environment Variables

External logging is configured via environment variables (all prefixed with `BANGUI_`):

#### Datadog

Enable logging to Datadog via HTTP API:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key-here
BANGUI_DATADOG_SITE=datadoghq.com  # or datadoghq.eu for EU
BANGUI_DATADOG_BATCH_SIZE=10  # Optional: logs per batch
BANGUI_DATADOG_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
```

#### Papertrail

Enable logging to Papertrail via Syslog protocol:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=papertrail
BANGUI_PAPERTRAIL_HOST=logs1.papertrailapp.com
BANGUI_PAPERTRAIL_PORT=12345
BANGUI_PAPERTRAIL_PROGRAM_NAME=bangui  # Optional: program name in syslog
```

#### ELK Stack

Enable logging to Elasticsearch/Logstash:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=elasticsearch
BANGUI_ELASTICSEARCH_HOSTS=http://elasticsearch:9200
BANGUI_ELASTICSEARCH_INDEX_PREFIX=bangui  # Optional: index prefix
BANGUI_ELASTICSEARCH_BATCH_SIZE=10  # Optional: docs per batch
BANGUI_ELASTICSEARCH_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
```

### Local Development (Disabled by Default)

External logging is **disabled by default**. In development, logs continue to write to stdout only:

```bash
# No configuration needed: logs go to stdout
docker compose up
```

To enable external logging in development for testing:

```bash
BANGUI_EXTERNAL_LOGGING_ENABLED=true \
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog \
BANGUI_DATADOG_API_KEY=test-key \
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000
```

---

## Performance and Reliability

### Non-Blocking Delivery

External log delivery uses **asynchronous buffering** to prevent blocking the application:

1. Logs are written to an in-memory buffer
2. Once the configured flush interval elapses or the batch size is reached, the buffer is sent asynchronously
3. Send failures do not block application logic
4. Retries use exponential backoff (up to 5 attempts)

This keeps external delivery off the request path, so a slow or unavailable provider does not stall request handling.

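The retry step in the list above can be sketched as follows. This is a simplified, synchronous illustration of exponential backoff, not BanGUI's delivery code; the function names, the 0.5 s base delay, and the 30 s cap are assumptions, and only the limit of 5 attempts comes from this document.

```python
import time


def backoff_delays(max_attempts: int = 5, base: float = 0.5, cap: float = 30.0) -> list:
    """Delay before each retry: base * 2**n, capped. One fewer than the attempts."""
    return [min(cap, base * 2**n) for n in range(max_attempts - 1)]


def send_with_retry(send, batch, max_attempts: int = 5, sleep=time.sleep) -> bool:
    """Try to send a batch, waiting longer after each failure. Never raises."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except Exception:
            # Wait before the next attempt; no wait after the final failure.
            if attempt < len(delays):
                sleep(delays[attempt])
    return False
```

In an asynchronous implementation the same schedule would be awaited inside a background task, so the waiting never happens on a request handler.
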
### Failure Modes

If external logging becomes unavailable:

- **Transient failures** (network timeouts, temporary 5xx errors): Logs are retried with exponential backoff
- **Permanent failures** (invalid API key, host unreachable): A warning is logged; application continues
- **Steady-state**: Logs are buffered up to a maximum queue size (default: 1000 logs); the oldest logs are dropped when the buffer fills

The application **never crashes** due to external logging failures.

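The drop-oldest behaviour described above maps naturally onto a bounded deque. The class below is a minimal sketch of that idea with hypothetical names; the defaults of 1000 entries and batches of 10 are taken from the values quoted in this document.

```python
from collections import deque


class BoundedLogBuffer:
    """In-memory buffer that discards the oldest entry once it is full."""

    def __init__(self, max_size: int = 1000, batch_size: int = 10) -> None:
        # A deque with maxlen silently drops from the left when appended to while full.
        self._queue = deque(maxlen=max_size)
        self._batch_size = batch_size
        self.dropped = 0

    def append(self, entry: dict) -> None:
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1  # count the entry that is about to be evicted
        self._queue.append(entry)

    def next_batch(self) -> list:
        """Remove and return up to batch_size of the oldest buffered entries."""
        count = min(self._batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]
```

Tracking a `dropped` counter makes it possible to emit the warning about lost logs without inspecting the queue itself.
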
### Log Volume and Rate Limiting

Large log volumes can increase data transfer and storage costs. To manage log volume:

1. **Reduce log level in production**: Set `BANGUI_LOG_LEVEL=warning` or `error` to suppress debug/info logs
2. **Sample logs**: Some providers (Datadog, Papertrail) support sampling rules
3. **Filter noisy paths**: Middleware can suppress verbose logging for noisy endpoints

Monitor actual log volume and adjust settings based on usage patterns.

---

## Integration Examples

### Docker Compose (Development with Datadog)

```yaml
version: "3.9"
services:
  bangui:
    build:
      context: .
      dockerfile: Docker/Dockerfile.app
    environment:
      BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
      BANGUI_EXTERNAL_LOGGING_PROVIDER: "datadog"
      BANGUI_DATADOG_API_KEY: "${DATADOG_API_KEY}"
      BANGUI_DATADOG_SITE: "datadoghq.com"
      BANGUI_LOG_LEVEL: "info"
    ports:
      - "8000:8000"
```

### Kubernetes Deployment (Papertrail)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bangui-logging
data:
  BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
  BANGUI_EXTERNAL_LOGGING_PROVIDER: "papertrail"
  BANGUI_PAPERTRAIL_HOST: "logs1.papertrailapp.com"
  BANGUI_PAPERTRAIL_PORT: "12345"
  BANGUI_PAPERTRAIL_PROGRAM_NAME: "bangui"
  BANGUI_LOG_LEVEL: "info"

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bangui
spec:
  template:
    spec:
      containers:
        - name: bangui
          image: bangui:latest
          envFrom:
            - configMapRef:
                name: bangui-logging
          env:
            - name: BANGUI_DATADOG_API_KEY
              valueFrom:
                secretKeyRef:
                  name: bangui-secrets
                  key: datadog-api-key
```

---

## Monitoring Logging Infrastructure

### Datadog Dashboard Query

Search for all BanGUI logs:

```
service:bangui
```

Search for errors in authentication:

```
service:bangui status:error component:auth
```

### Papertrail Search

Search for all startup events:

```
program:bangui bangui_starting_up
```

Search for authentication failures:

```
program:bangui auth_token_validation_failed
```

### Elasticsearch Query (ELK)

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "logger_name": "app.auth" } },
        { "match": { "level": "error" } }
      ]
    }
  }
}
```

---

## Testing and Debugging

### Verify JSON Output

Inspect the actual JSON emitted by the logging system:

```bash
# Start the app and pretty-print the first log lines (one JSON object per line)
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000 2>&1 | head -10 | python -m json.tool --json-lines
```

Expected output:

```json
{
  "timestamp": "2024-05-01T18:20:45.123456+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db"
}
```

### Enable Debug Logging for External Log Delivery

Set the log level to `debug` to see internal logs from the external logging system:

```bash
BANGUI_LOG_LEVEL=debug BANGUI_EXTERNAL_LOGGING_ENABLED=true python -m uvicorn app.main:create_app
```

This will emit logs like:

```json
{
  "level": "debug",
  "event": "external_log_batch_sent",
  "provider": "datadog",
  "batch_size": 10,
  "duration_ms": 42
}
```

### Validate Configuration

Print the resolved settings to check the external logging configuration before starting the app:

```bash
python -c "from app.config import get_settings; s = get_settings(); print(s.model_dump())"
```

The output may include secrets such as API keys, so do not paste it into tickets or logs.

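A startup check can go one step further and report which provider variables are missing. The sketch below is hypothetical: the function name is invented, and the choice of which variables count as required is an assumption based on the ones this document does not mark as optional.

```python
# Assumed minimum variables per provider; adjust to the real settings model.
REQUIRED_BY_PROVIDER = {
    "datadog": ["BANGUI_DATADOG_API_KEY"],
    "papertrail": ["BANGUI_PAPERTRAIL_HOST", "BANGUI_PAPERTRAIL_PORT"],
    "elasticsearch": ["BANGUI_ELASTICSEARCH_HOSTS"],
}


def external_logging_problems(env: dict) -> list:
    """Return human-readable problems; an empty list means the config looks usable."""
    if env.get("BANGUI_EXTERNAL_LOGGING_ENABLED", "false").lower() != "true":
        return []  # external logging is off, nothing to validate
    provider = env.get("BANGUI_EXTERNAL_LOGGING_PROVIDER", "")
    if provider not in REQUIRED_BY_PROVIDER:
        return [f"unknown provider: {provider!r}"]
    return [f"missing {name}" for name in REQUIRED_BY_PROVIDER[provider] if not env.get(name)]
```

Calling it with `dict(os.environ)` at startup and logging the result as a warning would fit the failure-mode policy of never crashing on logging problems.
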
---

## Security Considerations

### API Key Rotation

Rotate API keys regularly:

1. Update `BANGUI_DATADOG_API_KEY` with the new key
2. Restart the application
3. Old keys can be revoked after restart

### Network Security

When sending logs over the network:

- **Datadog HTTP API**: Uses HTTPS, encrypted in transit
- **Papertrail Syslog**: Use TLS-enabled Syslog (if supported) or send over VPN/private network
- **Elasticsearch**: Use HTTPS and HTTP Basic Auth or API Key authentication

Never send logs over unencrypted channels in production.

### Compliance

Ensure that your external logging platform complies with your organization's data protection requirements:

- **GDPR**: Verify the platform's data processing agreements
- **HIPAA**: Ensure the provider is HIPAA-eligible
- **SOC 2**: Request audit reports from your logging provider
- **Data retention**: Configure appropriate log retention policies

---

## Troubleshooting

### Logs Not Appearing in External System

1. **Verify configuration**: Check that environment variables are set correctly
2. **Check API credentials**: Ensure the API key or credentials are valid
3. **Check network connectivity**: Verify the external system is reachable
4. **Review logs locally**: Run with `BANGUI_LOG_LEVEL=debug` and check stdout for errors
5. **Check disk space**: If a local buffer directory is in use, ensure it has sufficient disk space

### Performance Degradation

1. **Check buffer size**: If the buffer is full, logs are dropped; increase `BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`
2. **Adjust flush interval**: Decrease the flush interval if batches grow too large
3. **Reduce log level**: Set `BANGUI_LOG_LEVEL=warning` to reduce log volume
4. **Monitor network**: Check bandwidth usage between application and external system

### Lost Logs

In the rare event that logs are lost, the usual causes are:

1. **Buffer overflow**: The in-memory buffer has a maximum size; excess logs are dropped with a warning
2. **Network failure during batch send**: Logs are retried; after max retries, a warning is logged
3. **External system outage**: Logs may be dropped if the buffer fills before service is restored

To minimize data loss:

- Increase buffer size (`BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE`)
- Use persistent external logging platforms
- Monitor for warnings in application logs about dropped batches

---

## Application Performance Monitoring (Metrics)

BanGUI collects comprehensive metrics for request performance, application health, and resource utilization through **Prometheus**. Metrics are exposed in standard Prometheus text format and can be scraped by monitoring systems.

### Backend Metrics

#### HTTP Request Metrics

The backend automatically tracks HTTP request performance:

- **`bangui_http_requests_total`** (Counter): Total HTTP requests by method, endpoint, and status code
  ```
  bangui_http_requests_total{method="GET",endpoint="/api/jails",status_code="200"} 125
  ```

- **`bangui_http_request_duration_seconds`** (Histogram): Request latency distribution by method and endpoint
  ```
  bangui_http_request_duration_seconds_bucket{method="GET",endpoint="/api/jails",le="0.1"} 120
  bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/jails"} 45.23
  ```

- **`bangui_http_active_requests`** (Gauge): Current number of in-flight requests by method and endpoint
  ```
  bangui_http_active_requests{method="GET",endpoint="/api/jails"} 5
  ```

#### Application Metrics

Domain-specific metrics track application state:

- **`bangui_bans_total`** (Gauge): Total number of currently banned IPs across all jails
- **`bangui_jails_total`** (Gauge): Total number of fail2ban jails
- **`bangui_fail2ban_connection_errors_total`** (Counter): Total fail2ban connection errors

#### Accessing Metrics

Prometheus metrics are exposed at the `/metrics` endpoint:

```bash
curl http://localhost:8000/metrics
```

Response format:
```
# HELP bangui_http_requests_total Total HTTP requests by method, endpoint, and status code
# TYPE bangui_http_requests_total counter
bangui_http_requests_total{method="GET",endpoint="/api/dashboard/status",status_code="200"} 1523.0

# HELP bangui_http_request_duration_seconds HTTP request latency in seconds by method and endpoint
# TYPE bangui_http_request_duration_seconds histogram
bangui_http_request_duration_seconds_bucket{method="GET",endpoint="/api/dashboard/status",le="0.01"} 1200.0
bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/dashboard/status"} 156.78
```

### Frontend Metrics

#### Web Vitals

The frontend automatically measures Web Vitals using the `web-vitals` library:

- **Cumulative Layout Shift (CLS)**: Visual stability score (good: ≤0.1)
- **First Contentful Paint (FCP)**: Time until first content appears (good: ≤1.8s)
- **First Input Delay (FID)**: Responsiveness to user input (good: ≤100ms)
- **Largest Contentful Paint (LCP)**: Time until largest content is visible (good: ≤2.5s)
- **Time to First Byte (TTFB)**: Server response time (good: ≤600ms)

#### API Call Metrics

API calls are automatically tracked with:

- HTTP method and endpoint
- Response status code
- Duration in milliseconds
- Timestamp

### Integrating with Monitoring Systems

#### Prometheus + Grafana

Configure Prometheus to scrape BanGUI metrics:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: "bangui"
    static_configs:
      - targets: ["localhost:8000"]
    metrics_path: "/metrics"
```

Then import a Grafana dashboard to visualize:

- Request rates by endpoint
- Latency percentiles (p50, p95, p99)
- Error rate trends
- Active request counts

#### Datadog

Configure BanGUI to send metrics via StatsD or HTTP API:

```bash
BANGUI_METRICS_ENABLED=true
BANGUI_METRICS_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key
BANGUI_DATADOG_SITE=datadoghq.com
```

#### New Relic

Send metrics to New Relic (custom event collection):

```bash
BANGUI_METRICS_ENABLED=true
BANGUI_METRICS_PROVIDER=newrelic
BANGUI_NEWRELIC_API_KEY=your-api-key
BANGUI_NEWRELIC_ACCOUNT_ID=your-account-id
```

### Metrics Best Practices

#### Cardinality Management

Metric labels (tags) can cause cardinality explosion if not carefully managed. BanGUI uses:

- Path normalization: `/api/jails/123` becomes `/api/jails/{id}` to prevent unique labels per resource
- Status code grouping: errors are grouped by category, not individual codes
- Endpoint aggregation: only significant endpoints are tracked

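Path normalization of the kind listed above can be approximated with a per-segment check. This sketch is an assumption about the approach, not BanGUI's middleware; `normalize_path` is a hypothetical name, and it only recognises numeric IDs and UUIDs.

```python
import re

# A path segment counts as an ID if it is all digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(?:\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Drop the query string and replace ID-like segments with `{id}`."""
    segments = path.split("?", 1)[0].split("/")
    return "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)
```

Named resources (for example a jail called `sshd`) would pass through unchanged, so a framework's route template is usually a more reliable label source when it is available.
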
#### Performance Considerations

- Metrics collection has negligible performance impact (<1ms per request)
- In-memory buffering prevents database writes on every request
- High-cardinality labels are avoided
- Metric export (scraping) does not block request processing

#### PII Protection

**NEVER include sensitive data in metric labels:**

- User IDs or session tokens
- Passwords or API keys
- Private IP addresses
- Full request/response bodies

Allowed: HTTP method, endpoint path (normalized), status code, duration, timestamp.

### Query Examples

#### Prometheus Queries

Find p95 request latency for `/api/jails` over the last 5 minutes (bucket counters must be converted to rates and aggregated by `le` before taking the quantile):

```promql
histogram_quantile(0.95, sum by (le) (rate(bangui_http_request_duration_seconds_bucket{endpoint="/api/jails"}[5m])))
```

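To see what a quantile over cumulative buckets actually computes, the sketch below re-implements the basic estimate in Python: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration of the technique with a hypothetical function name, and it ignores edge cases that a real Prometheus server handles.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls past the last finite bucket; report its bound.
                return lower_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

The estimate can never be more precise than the bucket boundaries, which is why bucket layout matters for latency percentiles.
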
Find error rate (5xx responses):

```promql
rate(bangui_http_requests_total{status_code=~"5.."}[5m])
```

Find active requests per endpoint:

```promql
bangui_http_active_requests
```

#### Grafana Dashboard

Recommended panels:

1. **Request Rate**: `rate(bangui_http_requests_total[1m])` by endpoint
2. **Latency Percentiles**: one `histogram_quantile(q, sum by (le) (rate(bangui_http_request_duration_seconds_bucket[5m])))` query per quantile `q` (0.5, 0.95, 0.99)
3. **Error Rate**: `rate(bangui_http_requests_total{status_code=~"5.."}[5m])`
4. **Active Requests**: `bangui_http_active_requests` (gauge)
5. **fail2ban Connection Health**: `rate(bangui_fail2ban_connection_errors_total[5m])`

### Troubleshooting Metrics

#### Metrics endpoint not responding

1. Verify the `/metrics` endpoint is accessible: `curl http://localhost:8000/metrics`
2. Check application logs for errors during middleware initialization
3. Ensure prometheus-client is installed: `pip show prometheus-client`

#### High cardinality warnings

If Prometheus warns about high cardinality:

1. Check if custom labels are being added to metrics
2. Ensure path normalization is working (IDs should be replaced with `{id}`)
3. Consider sampling metrics for high-volume endpoints

#### Missing metrics

1. Check that endpoints are being called (look for 200 responses in logs)
2. Verify the metrics middleware is registered (check `app.add_middleware(MetricsMiddleware)`)
3. Ensure metrics are being recorded (call `recordApiCall()` on frontend)

---

## Future Enhancements

Planned observability improvements:

- [x] Application metrics collection (Prometheus)
- [x] Web Vitals tracking (frontend)
- [ ] Distributed tracing (OpenTelemetry integration)
- [ ] Custom metric hooks for business events
- [ ] Alerting rules and thresholds
- [ ] Log sampling strategies
- [ ] Additional provider support (Splunk, New Relic, CloudWatch)

---

## Scheduler Lock Health Monitoring

The scheduler lock ensures only one instance runs background tasks. Monitoring its health is critical for production reliability.

### Key Metrics

Monitor these log events for scheduler lock health:

| Event | Level | Meaning |
|-------|-------|---------|
| `scheduler_lock_acquired` | info | Successfully acquired the scheduler lock |
| `scheduler_lock_held_by_other_instance` | warning | Another instance holds the lock (expected during normal multi-instance operation) |
| `scheduler_lock_stale_overwrite` | info | Took over a stale lock from a crashed instance |
| `scheduler_lock_heartbeat_lost` | warning | Heartbeat update failed; we lost the lock |
| `scheduler_lock_release_mismatch` | warning | Release attempted but we don't hold the lock |

### Lock Health Check

Query current lock status via `get_lock_health()`:

```python
from app.utils.scheduler_lock import get_lock_health

health = await get_lock_health(db)
# Returns: {"locked": bool, "pid": int|None, "hostname": str|None,
#           "age_seconds": float|None, "is_stale": bool, "ttl_remaining": float|None}
```

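The staleness fields in that result can be derived from the last heartbeat and the timeout. The function below is a hypothetical sketch of that arithmetic, assuming `heartbeat_at` is stored as Unix epoch seconds; it is not the implementation of `get_lock_health()` and omits the `pid` and `hostname` fields.

```python
import time
from typing import Optional


def lock_health(
    heartbeat_at: Optional[float],
    heartbeat_timeout: float,
    now: Optional[float] = None,
) -> dict:
    """Derive age, staleness, and remaining TTL from the last heartbeat."""
    if heartbeat_at is None:
        # No lock row: nothing is held, so nothing can be stale.
        return {"locked": False, "age_seconds": None, "is_stale": False, "ttl_remaining": None}
    current = time.time() if now is None else now
    age = current - heartbeat_at
    return {
        "locked": True,
        "age_seconds": age,
        "is_stale": age > heartbeat_timeout,
        "ttl_remaining": max(0.0, heartbeat_timeout - age),
    }
```

Passing `now` explicitly keeps the calculation deterministic in tests.
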
### Alerting Rules

**Critical alerts:**
- `scheduler_lock_acquired` not seen for >5 minutes during startup → Instance may not have acquired lock
- `scheduler_lock_heartbeat_lost` repeated >3 times → Lock keeps being stolen, possible contention issue

**Warning alerts:**
- `scheduler_lock_held_by_other_instance` every few minutes → Normal if multiple instances, abnormal if single instance

### Database Query

Check lock state directly in SQLite:

```sql
-- heartbeat_at is treated as Unix epoch seconds; age is computed in seconds
SELECT pid, hostname, heartbeat_at, heartbeat_timeout,
       (CAST(strftime('%s', 'now') AS INTEGER) - heartbeat_at) AS age_seconds
FROM scheduler_lock WHERE id = 1;
```

### Common Issues

1. **Lock not acquired on startup**: Check logs for `scheduler_lock_held_by_other_instance`. If another instance holds it, verify that the instance is healthy.

2. **Background tasks not running**: Use `get_lock_health()` to verify the lock is held. If not held, the instance cannot run scheduled tasks.

3. **Frequent lock steals**: If `scheduler_lock_stale_overwrite` occurs frequently, the heartbeat interval may be too long or network latency may be causing false staleness detection.

---

## References

- [structlog Documentation](https://www.structlog.org/)
- [Datadog Logging Documentation](https://docs.datadoghq.com/logs/)
- [Papertrail Documentation](https://help.papertrailapp.com/)
- [Elasticsearch JSON Logging](https://www.elastic.co/guide/en/elasticsearch/reference/current/logging.html)
- [Observability Best Practices (OpenTelemetry)](https://opentelemetry.io/docs/concepts/observability-primer/)