Lukas 7ec80fdeec refactor(logging): replace structlog with stdlib logging compat layer

- Remove structlog dependency from backend/pyproject.toml
- Add app.utils.logging_compat shim for keyword-arg logging API
- Add app.utils.json_formatter for JSON log output with extra fields
- Update all backend modules to use logging_compat.get_logger()
- Update docstrings in log_sanitizer.py and json_formatter.py
- Update test comment in test_async_utils.py
- Record 406 failing tests in Docs/Tasks.md for tracking

2026-05-10 13:37:54 +02:00


Observability

BanGUI provides observability through structured logging, metrics, and request correlation. This document outlines the observability architecture and how to configure it for production deployments.

Logging Architecture

Overview

BanGUI uses structlog to emit structured, machine-readable logs in JSON format. All logs are automatically enriched with:

Timestamps in ISO 8601 format (timestamp)
Log levels (level - debug, info, warning, error, critical)
Logger names (logger_name)
Correlation IDs for request tracking (correlation_id)
Custom context from business logic (via context variables)

Log Output

By default, logs are written to stdout in JSON format, making them suitable for:

Container environments (Docker, Kubernetes)
Log aggregation systems (ELK, Datadog, Papertrail)
CI/CD pipelines and monitoring platforms

# Example log output (formatted for readability)
{
  "timestamp": "2024-05-01T18:17:19.080+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db",
  "pid": 1234
}
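The JSON shape above can be produced with the standard library alone. The following is a minimal sketch, not BanGUI's actual formatter: the class name `SketchJsonFormatter`, the logger name `app.main.sketch`, and the rule for deciding which record attributes count as extra fields are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}


class SketchJsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger_name": record.name,
            "event": record.getMessage(),
        }
        # Copy caller-supplied extra fields (for example database_path, pid).
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)


# Usage: attach the formatter to a handler and log with extra fields.
handler = logging.StreamHandler()
handler.setFormatter(SketchJsonFormatter())
demo_log = logging.getLogger("app.main.sketch")
demo_log.addHandler(handler)
demo_log.setLevel(logging.INFO)
demo_log.info("bangui_starting_up", extra={"database_path": "/var/lib/bangui/bangui.db", "pid": 1234})
```

The sketch emits UTC timestamps; a local offset such as the `+02:00` in the example would require a timezone-aware conversion instead.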

Sensitive Data Handling

CRITICAL: Never log sensitive data. The following must NEVER appear in logs:

Session tokens or cookies
API keys or secrets
Passwords or password hashes
Private cryptographic keys
Personal information (PII)
Full IP addresses (when not required for security auditing)

When logging authentication or sensitive operations:

# ✓ Correct: Log event type and result, not credentials
log.info("user_login_attempt", username=username, ip=client_ip, success=True)

# ✓ Correct: Log sanitized identifiers
log.error("auth_token_validation_failed", token_hash=hashlib.sha256(token.encode()).hexdigest()[:16])  # sha256 needs bytes, so a str token is encoded first

# ✗ WRONG: Don't do this
log.debug("raw_token", token=token)  # Never!
log.info("password_check", password=password_hash)  # Never!

Structlog processors can filter or redact known sensitive keys, which helps prevent accidental logging of sensitive data, but this is only a safety net. Code reviews must verify compliance with this rule.
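A redaction step in the logging pipeline can back up the review rule. The following is a minimal sketch that follows the structlog processor convention (a callable that receives the logger, the method name, and the event dict, and returns the event dict). The function name `redact_sensitive_keys` and the key list are hypothetical and are not taken from BanGUI's configuration.

```python
from typing import Any, MutableMapping

# Hypothetical deny-list of exact key names; extend it to match the rules above.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "cookie", "session"}


def redact_sensitive_keys(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """Replace the values of sensitive keys before the event is rendered."""
    for key in event_dict:
        # Exact match only, so derived fields such as token_hash pass through.
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "***"
    return event_dict
```

With structlog, a function like this would be placed in the `processors` list passed to `structlog.configure()`, ahead of the renderer.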

Log Sanitization

All external output (subprocess results, API responses, config file contents) passed to structlog must be sanitized first using sanitize_for_logging() from app.utils.log_sanitizer.

This prevents sensitive data — passwords, API keys, tokens, private keys — from leaking into logs.

from app.utils.log_sanitizer import sanitize_for_logging

# ✓ Correct: Sanitize before logging
log.error(
    "fail2ban_start_failed",
    command=" ".join(start_cmd_parts),
    returncode=process.returncode,
    stdout=sanitize_for_logging(stdout.decode("utf-8", errors="replace")),
    stderr=sanitize_for_logging(stderr.decode("utf-8", errors="replace")),
)

# ✗ Wrong: Raw output may contain secrets
log.error("fail2ban_start_failed", stdout=stdout_raw, stderr=stderr_raw)  # Never!

sanitize_for_logging() redacts the following patterns:

Pattern	Example match	Replacement
`password=X`	`password=Secret123`	`password=***`
`api_key=X` / `api-key=X`	`api_key=key123`	`api_key=***`
`token=X`	`token=eyJhbG...`	`token=***`
`Authorization: Bearer X`	`Authorization: Bearer tok...`	`Authorization: ***`
`secret=X`	`secret=myvalue`	`secret=***`
`-----BEGIN RSA PRIVATE KEY-----`	(key header)	`* PRIVATE KEY *`
`AKIA...`	`AKIAIOSFODNN7EXAMPLE`	`AKIA***`
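To make the table concrete, here is a minimal sketch of regex-based redaction covering the key=value, bearer-token, and AWS access key rows. It is not the implementation of sanitize_for_logging(): the function name `redact_secrets` and the exact patterns are assumptions, and the private-key row is left out.

```python
import re

# (pattern, replacement) pairs mirroring rows of the table above; illustrative only.
_REDACTIONS = [
    (re.compile(r"(?i)\b(password|api[_-]key|token|secret)=\S+"), r"\1=***"),
    (re.compile(r"(?i)Authorization:\s*Bearer\s+\S+"), "Authorization: ***"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "AKIA***"),
]


def redact_secrets(text: str) -> str:
    """Apply each redaction pattern in turn and return the cleaned text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Unlike the table, this sketch keeps the original spelling of the key, so `api-key=X` becomes `api-key=***` rather than `api_key=***`.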

Third-Party Library Logs

BanGUI uses structlog for all application logs, but third-party libraries often emit plain text through Python's standard logging module. To maintain uniform JSON output and reduce noise, the following libraries have their log levels overridden to WARNING:

Library	Logger Name	Level	Rationale
APScheduler	`apscheduler`	`WARNING`	Suppresses routine scheduler polling ("Looking for jobs to run", "Next wakeup is due at...") while preserving job failure warnings.
aiosqlite	`aiosqlite`	`WARNING`	Suppresses database operation traces and connection details while preserving connection errors.

These overrides are applied in backend/app/main.py::_configure_logging() immediately after logging.basicConfig().

Disabling Suppression

Set the environment variable BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false to allow APScheduler and aiosqlite to emit their normal DEBUG/INFO logs. This is useful when troubleshooting scheduler or database issues in development.

BANGUI_SUPPRESS_THIRD_PARTY_LOGS=false python -m uvicorn app.main:create_app

When suppression is disabled, the loggers inherit the application's BANGUI_LOG_LEVEL (e.g., debug).
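The toggle described above could be wired up roughly as follows. This is a sketch under assumptions: the function name `apply_third_party_overrides` and the accepted spellings of "false" are hypothetical, and the real _configure_logging() may differ.

```python
import logging
import os
from typing import Mapping

NOISY_LOGGERS = ("apscheduler", "aiosqlite")


def apply_third_party_overrides(environ: Mapping[str, str] = os.environ) -> bool:
    """Raise noisy loggers to WARNING unless suppression is disabled.

    Returns True when the overrides were applied.
    """
    flag = environ.get("BANGUI_SUPPRESS_THIRD_PARTY_LOGS", "true").strip().lower()
    suppress = flag not in {"false", "0", "no"}
    for name in NOISY_LOGGERS:
        # NOTSET makes the logger inherit its level from the parent (root) logger.
        logging.getLogger(name).setLevel(logging.WARNING if suppress else logging.NOTSET)
    return suppress
```

Resetting to `logging.NOTSET` is what lets the library loggers fall back to the application-wide level.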

Uniform JSON Formatting

All stdlib logs — including those from third-party libraries — are intercepted by structlog.stdlib.ProcessorFormatter and rendered as JSON. This ensures every log line in bangui.log is machine-readable, regardless of its source.

Adding New Overrides

When integrating a new library that emits verbose DEBUG logs:

# In backend/app/main.py, inside _configure_logging()
logging.getLogger("new_library").setLevel(logging.WARNING)

Use WARNING as the default to still capture errors and warnings. Only use ERROR if the library is exceptionally noisy and its warnings are not actionable.

Structured Logging Best Practices

Log Levels

Use log levels consistently:

Level	Use Case	Example
debug	Verbose diagnostic information	`log.debug("parsing_config_file", lines=1024)`
info	Operational events	`log.info("jail_created", jail_name="sshd", action_count=3)`
warning	Recoverable issues	`log.warning("config_reload_skipped", reason="no_changes")`
error	Failures that impact functionality	`log.error("fail2ban_connection_lost", error=str(e))`
critical	System failures	`log.critical("database_corrupted", error=str(e))`

Context Variables

Use structlog's context variables to automatically include request-scoped information in all logs within a request:

import structlog

log = structlog.get_logger()

# In middleware or early in request processing
structlog.contextvars.clear_contextvars()
structlog.contextvars.bind_contextvars(
    correlation_id=request_id,
    user_id=user_id,
    client_ip=client_ip,
)

# All subsequent logs in this request will include these context variables
log.info("user_action", action="create_jail")  # Automatically includes correlation_id, user_id, etc.

# Clear context at end of request
structlog.contextvars.clear_contextvars()

Background Task Correlation

Background tasks (APScheduler jobs) run outside the HTTP request context. Use app.utils.correlation to propagate correlation IDs through tasks:

import uuid

from app.utils.correlation import get_correlation_id, reset_correlation_id, set_correlation_id

async def my_background_task(correlation_id: str | None = None) -> None:
    # Generate a new ID if not provided (scheduled tasks have no parent request)
    if correlation_id is None:
        correlation_id = str(uuid.uuid4())

    # Set the correlation ID for all logs in this task
    token = set_correlation_id(correlation_id)
    try:
        log.info("task_started")  # Now includes correlation_id
        # ... task logic ...
    finally:
        reset_correlation_id(token)

# When scheduling, optionally pass the current correlation ID:
# scheduler.add_job(my_background_task, kwargs={"correlation_id": get_correlation_id()})

Scheduled tasks (no parent request) generate a fresh UUID for each run. Tasks triggered by a request inherit the request's correlation ID.
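The set/reset token pattern used above maps naturally onto the standard library's contextvars module. The following stand-in shows how such helpers could work. It is a hypothetical sketch, not the contents of app.utils.correlation, and the `task` and `demo` coroutines exist only to show that concurrent tasks keep separate IDs.

```python
import asyncio
import contextvars
import uuid
from typing import Optional

# Hypothetical stand-in for the helpers in app.utils.correlation.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


def set_correlation_id(value: str) -> contextvars.Token:
    """Store the ID for the current context and return a token for reset."""
    return _correlation_id.set(value)


def get_correlation_id() -> Optional[str]:
    return _correlation_id.get()


def reset_correlation_id(token: contextvars.Token) -> None:
    """Restore whatever value was active before the matching set call."""
    _correlation_id.reset(token)


async def task(correlation_id: Optional[str] = None) -> Optional[str]:
    token = set_correlation_id(correlation_id or str(uuid.uuid4()))
    try:
        await asyncio.sleep(0)  # yield so the tasks interleave
        return get_correlation_id()
    finally:
        reset_correlation_id(token)


async def demo() -> list:
    # Each gathered coroutine runs in its own task with its own context copy.
    return await asyncio.gather(task("req-1"), task("req-2"), task())
```

Because asyncio copies the context when it creates a task, an ID set inside one task does not leak into another.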

Event Naming Convention

Use snake_case for event names, prefixed with the component or module name:

# ✓ Good naming
log.info("service_initialized", service="BanService", version="1.0")
log.warning("blocklist_import_slow", duration_ms=5000)
log.error("fail2ban_command_failed", command="list", exit_code=1)

# ✗ Bad naming
log.info("init")  # Too generic
log.warning("slow operation")  # Not machine-readable
log.error("ERROR: FAIL2BAN FAILED!")  # Inconsistent formatting
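The naming convention can be checked mechanically, for instance in a unit test or lint step. The following is a sketch under one assumption: that a valid name is lowercase snake_case with at least two segments (a component prefix plus an action). The helper name `is_valid_event_name` is hypothetical.

```python
import re

# Lowercase snake_case with at least one underscore-separated suffix.
_EVENT_NAME = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")


def is_valid_event_name(name: str) -> bool:
    """Return True when the name matches the assumed convention."""
    return _EVENT_NAME.fullmatch(name) is not None
```

A check like this rejects the three bad examples above while accepting names such as `fail2ban_command_failed`.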

Attaching Structured Data

Always provide context as key-value pairs, not as unstructured strings:

# ✓ Correct: Structured, queryable
log.info(
    "ban_executed",
    jail="sshd",
    ip="192.0.2.1",
    duration_seconds=3600,
    reason="brute_force",
)

# ✗ Wrong: Unstructured, hard to query
log.info(f"Banned {ip} in jail {jail} for 3600 seconds because brute_force")

Centralized Logging Configuration

Environment Variables

External logging is configured via environment variables (all prefixed with BANGUI_):

Datadog

Enable logging to Datadog via HTTP API:

BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key-here
BANGUI_DATADOG_SITE=datadoghq.com              # or datadoghq.eu for EU
BANGUI_DATADOG_BATCH_SIZE=10                   # Optional: logs per batch
BANGUI_DATADOG_FLUSH_INTERVAL_SECONDS=5        # Optional: flush interval

Papertrail

Enable logging to Papertrail via Syslog protocol:

BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=papertrail
BANGUI_PAPERTRAIL_HOST=logs1.papertrailapp.com
BANGUI_PAPERTRAIL_PORT=12345
BANGUI_PAPERTRAIL_PROGRAM_NAME=bangui          # Optional: program name in syslog

ELK Stack

Enable logging to Elasticsearch/Logstash:

BANGUI_EXTERNAL_LOGGING_ENABLED=true
BANGUI_EXTERNAL_LOGGING_PROVIDER=elasticsearch
BANGUI_ELASTICSEARCH_HOSTS=http://elasticsearch:9200
BANGUI_ELASTICSEARCH_INDEX_PREFIX=bangui       # Optional: index prefix
BANGUI_ELASTICSEARCH_BATCH_SIZE=10             # Optional: docs per batch
BANGUI_ELASTICSEARCH_FLUSH_INTERVAL_SECONDS=5  # Optional: flush interval
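A startup check can catch incomplete provider configuration early. The sketch below is illustrative: the function name `missing_settings` is hypothetical, and the choice of which variables count as required is an assumption based on which ones are not marked optional above.

```python
from typing import List, Mapping

# Assumed required variables per provider (not confirmed by BanGUI's code).
REQUIRED_BY_PROVIDER = {
    "datadog": ("BANGUI_DATADOG_API_KEY",),
    "papertrail": ("BANGUI_PAPERTRAIL_HOST", "BANGUI_PAPERTRAIL_PORT"),
    "elasticsearch": ("BANGUI_ELASTICSEARCH_HOSTS",),
}


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    if env.get("BANGUI_EXTERNAL_LOGGING_ENABLED", "false").lower() != "true":
        return []  # external logging is off, nothing to validate
    provider = env.get("BANGUI_EXTERNAL_LOGGING_PROVIDER", "")
    if provider not in REQUIRED_BY_PROVIDER:
        return ["BANGUI_EXTERNAL_LOGGING_PROVIDER"]
    return [name for name in REQUIRED_BY_PROVIDER[provider] if not env.get(name)]
```

Calling `missing_settings(os.environ)` at startup would return an empty list when the selected provider is fully configured.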

Local Development (Disabled by Default)

External logging is disabled by default. In development, logs continue to write to stdout only:

# No configuration needed — logs go to stdout
docker compose up

To enable external logging in development for testing:

BANGUI_EXTERNAL_LOGGING_ENABLED=true \
BANGUI_EXTERNAL_LOGGING_PROVIDER=datadog \
BANGUI_DATADOG_API_KEY=test-key \
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000

Performance and Reliability

Non-Blocking Delivery

External log delivery uses asynchronous buffering to prevent blocking the application:

Logs are written to an in-memory buffer
After the configured flush interval or batch size, the buffer is sent asynchronously
Send failures do not block application logic
Retries use exponential backoff (up to 5 attempts)

This ensures that external logging never degrades application performance.
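The buffering behaviour described above can be sketched with a bounded deque. This is an illustration, not BanGUI's delivery code: the class name `BatchBuffer`, its method names, and the injectable clock are assumptions, and the actual asynchronous send is left to the caller.

```python
import time
from collections import deque
from typing import Callable, List


class BatchBuffer:
    """Hypothetical in-memory buffer that releases log entries in batches."""

    def __init__(
        self,
        batch_size: int = 10,
        flush_interval: float = 5.0,
        max_size: int = 1000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._queue = deque(maxlen=max_size)  # oldest entries fall off when full
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._clock = clock
        self._last_flush = clock()

    def add(self, entry: dict) -> None:
        self._queue.append(entry)

    def pop_batch_if_due(self) -> List[dict]:
        """Return a batch when the size or interval threshold is reached, else []."""
        size_due = len(self._queue) >= self._batch_size
        time_due = (
            len(self._queue) > 0
            and self._clock() - self._last_flush >= self._flush_interval
        )
        if not (size_due or time_due):
            return []
        count = min(self._batch_size, len(self._queue))
        batch = [self._queue.popleft() for _ in range(count)]
        self._last_flush = self._clock()
        return batch
```

A background task would poll `pop_batch_if_due()` and hand any non-empty batch to the sender, so request handlers only ever pay for an in-memory append.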

Failure Modes

If external logging becomes unavailable:

Transient failures (network timeouts, temporary 5xx errors): Logs are retried with exponential backoff
Permanent failures (invalid API key, host unreachable): A warning is logged; application continues
Steady-state: Logs are buffered up to a maximum queue size (default: 1000 logs); older logs are dropped if buffer fills

The application never crashes due to external logging failures.
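The split between transient and permanent failures can be sketched as a retry loop that never raises. The names `TransientError` and `send_with_retry` and the 0.5 second base delay are hypothetical. The sketch is synchronous for brevity, whereas real delivery would run asynchronously.

```python
import time
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, temporary 5xx)."""


def send_with_retry(
    send: Callable[[Sequence[dict]], None],
    batch: Sequence[dict],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver a batch; return False instead of raising on failure."""
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except TransientError:
            if attempt < max_attempts - 1:
                # Exponential backoff: 0.5s, 1s, 2s, 4s with the defaults.
                sleep(base_delay * 2 ** attempt)
        except Exception:
            # Permanent failure: give up at once; the caller logs a warning.
            return False
    return False
```

Returning a boolean rather than raising keeps delivery problems out of the application's control flow, which matches the guarantee stated above.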

Log Volume and Rate Limiting

Large log volumes can increase data transfer and storage costs. To manage log volume:

Reduce log level in production: Set BANGUI_LOG_LEVEL=warning or error to suppress debug/info logs
Sample logs: Some providers (Datadog, Papertrail) support sampling rules
Filter noisy paths: Middleware can suppress verbose logging for noisy endpoints

Monitor actual log volume and adjust settings based on usage patterns.

Integration Examples

Docker Compose (Development with Datadog)

version: "3.9"
services:
  bangui:
    build:
      context: .
      dockerfile: Docker/Dockerfile.app
    environment:
      BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
      BANGUI_EXTERNAL_LOGGING_PROVIDER: "datadog"
      BANGUI_DATADOG_API_KEY: "${DATADOG_API_KEY}"
      BANGUI_DATADOG_SITE: "datadoghq.com"
      BANGUI_LOG_LEVEL: "info"
    ports:
      - "8000:8000"

Kubernetes Deployment (Papertrail)

apiVersion: v1
kind: ConfigMap
metadata:
  name: bangui-logging
data:
  BANGUI_EXTERNAL_LOGGING_ENABLED: "true"
  BANGUI_EXTERNAL_LOGGING_PROVIDER: "papertrail"
  BANGUI_PAPERTRAIL_HOST: "logs1.papertrailapp.com"
  BANGUI_PAPERTRAIL_PORT: "12345"
  BANGUI_PAPERTRAIL_PROGRAM_NAME: "bangui"
  BANGUI_LOG_LEVEL: "info"

---
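apiVersion: apps/v1
kind: Deployment
metadata:
  name: bangui
spec:
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: bangui
  template:
    metadata:
      labels:
        app: bangui
    spec:
      containers:
      - name: bangui
        image: bangui:latest
        envFrom:
        - configMapRef:
            name: bangui-logging
        env:
        # Papertrail needs no API key; this entry shows how a provider key
        # (here Datadog's) is sourced from a Secret rather than the ConfigMap
        - name: BANGUI_DATADOG_API_KEY
          valueFrom:
            secretKeyRef:
              name: bangui-secrets
              key: datadog-api-key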

Monitoring Logging Infrastructure

Datadog Dashboard Query

Search for all BanGUI logs:

service:bangui

Search for errors in authentication:

service:bangui status:error component:auth

Papertrail Search

Search for all startup events:

program:bangui bangui_starting_up

Search for authentication failures:

program:bangui auth_token_validation_failed

Elasticsearch Query (ELK)

{
  "query": {
    "bool": {
      "must": [
        { "match": { "logger_name": "app.auth" } },
        { "match": { "level": "error" } }
      ]
    }
  }
}

Testing and Debugging

Verify JSON Output

Inspect the actual JSON emitted by the logging system:

# Start the app and capture logs
python -m uvicorn app.main:create_app --host 0.0.0.0 --port 8000 2>&1 | head -10 | python -m json.tool --json-lines

Expected output:

{
  "timestamp": "2024-05-01T18:20:45.123456+02:00",
  "level": "info",
  "logger_name": "app.main",
  "event": "bangui_starting_up",
  "database_path": "/var/lib/bangui/bangui.db"
}
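The same check can be scripted. The following sketch validates a single log line against the fields listed in the Overview. The function name `check_log_line` and the choice of required keys are assumptions for illustration.

```python
import json
from typing import List

# Fields the Overview says every log line carries.
REQUIRED_KEYS = {"timestamp", "level", "logger_name", "event"}


def check_log_line(line: str) -> List[str]:
    """Return a list of problems found in one log line (empty if it looks fine)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    missing = sorted(REQUIRED_KEYS - record.keys())
    return [f"missing key: {key}" for key in missing]
```

Feeding each captured stdout line through this function gives a quick pass or fail signal in a CI job.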

Enable Debug Logging for External Log Delivery

Set the log level to debug to see internal logs from the external logging system:

BANGUI_LOG_LEVEL=debug BANGUI_EXTERNAL_LOGGING_ENABLED=true python -m uvicorn app.main:create_app

This will emit logs like:

{
  "level": "debug",
  "event": "external_log_batch_sent",
  "provider": "datadog",
  "batch_size": 10,
  "duration_ms": 42
}

Validate Configuration

Validate external logging configuration on startup:

python -c "from app.config import get_settings; s = get_settings(); print(s.model_dump())"

Security Considerations

API Key Rotation

Rotate API keys regularly:

Update BANGUI_DATADOG_API_KEY with the new key
Restart the application
Old keys can be revoked after restart

Network Security

When sending logs over the network:

Datadog HTTP API: Uses HTTPS, encrypted in transit
Papertrail Syslog: Use TLS-enabled Syslog (if supported) or send over VPN/private network
Elasticsearch: Use HTTPS and HTTP Basic Auth or API Key authentication

Never send logs over unencrypted channels in production.

Compliance

Ensure that your external logging platform complies with your organization's data protection requirements:

GDPR: Verify the platform's data processing agreements
HIPAA: Ensure the provider is HIPAA-eligible
SOC 2: Request audit reports from your logging provider
Data retention: Configure appropriate log retention policies

Troubleshooting

Logs Not Appearing in External System

Verify configuration: Check that environment variables are set correctly
Check API credentials: Ensure the API key or credentials are valid
Check network connectivity: Verify the external system is reachable
Review logs locally: Run with BANGUI_LOG_LEVEL=debug and check stdout for errors
Check buffer capacity: The delivery buffer is held in memory, so confirm it is not overflowing (look for dropped-log warnings)

Performance Degradation

Check buffer size: If the buffer is full, logs are dropped; increase BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE
Adjust flush interval: Decrease flush interval if experiencing large batches
Reduce log level: Set BANGUI_LOG_LEVEL=warning to reduce log volume
Monitor network: Check bandwidth usage between application and external system

Lost Logs

In the rare event that logs are lost:

Buffer overflow: The in-memory buffer has a maximum size; excess logs are dropped with a warning
Network failure during batch send: Logs are retried; after max retries, a warning is logged
External system outage: Logs may be dropped if buffer fills before service is restored

To minimize data loss:

Increase buffer size (BANGUI_EXTERNAL_LOGGING_BUFFER_SIZE)
Use persistent external logging platforms
Monitor for warnings in application logs about dropped batches

Application Performance Monitoring (Metrics)

BanGUI collects comprehensive metrics for request performance, application health, and resource utilization through Prometheus. Metrics are exposed in standard Prometheus text format and can be scraped by monitoring systems.

Backend Metrics

HTTP Request Metrics

The backend automatically tracks HTTP request performance:

bangui_http_requests_total (Counter) — Total HTTP requests by method, endpoint, and status code

bangui_http_requests_total{method="GET",endpoint="/api/jails",status_code="200"} 125

bangui_http_request_duration_seconds (Histogram) — Request latency distribution by method and endpoint

bangui_http_request_duration_seconds_bucket{method="GET",endpoint="/api/jails",le="0.1"} 120
bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/jails"} 45.23

bangui_http_active_requests (Gauge) — Current number of in-flight requests by method and endpoint

bangui_http_active_requests{method="GET",endpoint="/api/jails"} 5

Application Metrics

Domain-specific metrics track application state:

bangui_bans_total (Gauge) — Total number of currently banned IPs across all jails
bangui_jails_total (Gauge) — Total number of fail2ban jails
bangui_fail2ban_connection_errors_total (Counter) — Total fail2ban connection errors

Accessing Metrics

Prometheus metrics are exposed at the /metrics endpoint:

curl http://localhost:8000/metrics

Response format:

# HELP bangui_http_requests_total Total HTTP requests by method, endpoint, and status code
# TYPE bangui_http_requests_total counter
bangui_http_requests_total{method="GET",endpoint="/api/dashboard/status",status_code="200"} 1523.0

# HELP bangui_http_request_duration_seconds HTTP request latency in seconds by method and endpoint
# TYPE bangui_http_request_duration_seconds histogram
bangui_http_request_duration_seconds_bucket{method="GET",endpoint="/api/dashboard/status",le="0.01"} 1200.0
bangui_http_request_duration_seconds_sum{method="GET",endpoint="/api/dashboard/status"} 156.78

Frontend Metrics

Web Vitals

The frontend automatically measures Web Vitals (the Core Web Vitals plus supporting metrics such as FCP and TTFB) using the web-vitals library:

Cumulative Layout Shift (CLS) — Visual stability score (good: ≤0.1)
First Contentful Paint (FCP) — Time until first content appears (good: ≤1.8s)
First Input Delay (FID) — Responsiveness to user input (good: ≤100ms)
Largest Contentful Paint (LCP) — Time until largest content is visible (good: ≤2.5s)
Time to First Byte (TTFB) — Server response time (good: ≤800ms)

API Call Metrics

API calls are automatically tracked with:

HTTP method and endpoint
Response status code
Duration in milliseconds
Timestamp

Integrating with Monitoring Systems

Prometheus + Grafana

Configure Prometheus to scrape BanGUI metrics:

# prometheus.yml
scrape_configs:
  - job_name: "bangui"
    static_configs:
      - targets: ["localhost:8000"]
    metrics_path: "/metrics"

Then import a Grafana dashboard to visualize:

Request rates by endpoint
Latency percentiles (p50, p95, p99)
Error rate trends
Active request counts

Datadog

Configure BanGUI to send metrics via StatsD or HTTP API:

BANGUI_METRICS_ENABLED=true
BANGUI_METRICS_PROVIDER=datadog
BANGUI_DATADOG_API_KEY=your-api-key
BANGUI_DATADOG_SITE=datadoghq.com

New Relic

Send metrics to New Relic (custom event collection):

BANGUI_METRICS_ENABLED=true
BANGUI_METRICS_PROVIDER=newrelic
BANGUI_NEWRELIC_API_KEY=your-api-key
BANGUI_NEWRELIC_ACCOUNT_ID=your-account-id

Metrics Best Practices

Cardinality Management

Metric labels (tags) can cause cardinality explosion if not carefully managed. BanGUI uses:

Path normalization — /api/jails/123 becomes /api/jails/{id} to prevent unique labels per resource
Status code grouping — errors are grouped by category, not individual codes
Endpoint aggregation — only significant endpoints are tracked
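Path normalization and status grouping can be sketched with a small regex helper. The names `normalize_path` and `status_class` are hypothetical, and the sketch assumes that identifiers are either numeric or UUID-shaped and that only the identifier segment is replaced. BanGUI's middleware may use different rules.

```python
import re

# Matches a path segment that is all digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_path(path: str) -> str:
    """Replace identifier segments with a fixed placeholder."""
    return _ID_SEGMENT.sub("/{id}", path)


def status_class(status_code: int) -> str:
    """Group a status code into its class, for example 503 into 5xx."""
    return f"{status_code // 100}xx"
```

Both helpers bound the number of distinct label values: paths collapse to one series per route, and status classes to at most five values.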

Performance Considerations

Metrics collection has negligible performance impact (<1ms per request)
In-memory buffering prevents database writes on every request
High-cardinality labels are avoided
Metric export (scraping) does not block request processing

PII Protection

NEVER include sensitive data in metric labels:

User IDs or session tokens
Passwords or API keys
Private IP addresses
Full request/response bodies

Allowed: HTTP method, endpoint path (normalized), status code, duration, timestamp.

Query Examples

Prometheus Queries

Find p95 request latency for /api/jails:

histogram_quantile(0.95, sum by (le) (rate(bangui_http_request_duration_seconds_bucket{endpoint="/api/jails"}[5m])))

Find error rate (5xx responses):

rate(bangui_http_requests_total{status_code=~"5.."}[5m])

Find active requests per endpoint:

bangui_http_active_requests
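It helps to know how a quantile is estimated from histogram buckets: the value is linearly interpolated inside the bucket that contains the target rank. The following is a simplified sketch of that idea for classic cumulative buckets. It is not Prometheus's implementation and ignores edge cases such as native histograms.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound
```

The estimate can never be more precise than the bucket boundaries allow, which is why bucket layout matters for latency percentiles.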

Grafana Dashboard

Recommended panels:

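Request Rate — sum by (endpoint) (rate(bangui_http_requests_total[1m]))
Latency Percentiles — histogram_quantile(0.95, sum by (le) (rate(bangui_http_request_duration_seconds_bucket[5m]))), with one query per quantile (0.5, 0.95, 0.99)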
Error Rate — rate(bangui_http_requests_total{status_code=~"5.."}[5m])
Active Requests — bangui_http_active_requests (gauge)
fail2ban Connection Health — rate(bangui_fail2ban_connection_errors_total[5m])

Troubleshooting Metrics

Metrics endpoint not responding

Verify the /metrics endpoint is accessible: curl http://localhost:8000/metrics
Check application logs for errors during middleware initialization
Ensure prometheus-client is installed: pip show prometheus-client

High cardinality warnings

If Prometheus warns about high cardinality:

Check if custom labels are being added to metrics
Ensure path normalization is working (IDs should be replaced with {id})
Consider sampling metrics for high-volume endpoints

Missing metrics

Check that endpoints are being called (look for 200 responses in logs)
Verify the metrics middleware is registered (check app.add_middleware(MetricsMiddleware))
Ensure metrics are being recorded (call recordApiCall() on frontend)

Future Enhancements

Planned observability improvements:

Application metrics collection (Prometheus)
Web Vitals tracking (frontend)
Distributed tracing (OpenTelemetry integration)
Custom metric hooks for business events
Alerting rules and thresholds
Log sampling strategies
Additional provider support (Splunk, New Relic, CloudWatch)

Scheduler Lock Health Monitoring

The scheduler lock ensures only one instance runs background tasks. Monitoring its health is critical for production reliability.

Key Metrics

Monitor these log events for scheduler lock health:

Event	Level	Meaning
`scheduler_lock_acquired`	info	Successfully acquired the scheduler lock
`scheduler_lock_held_by_other_instance`	warning	Another instance holds the lock (expected during normal multi-instance operation)
`scheduler_lock_stale_overwrite`	info	Took over a stale lock from a crashed instance
`scheduler_lock_heartbeat_lost`	warning	Heartbeat update failed; we lost the lock
`scheduler_lock_release_mismatch`	warning	Release attempted but we don't hold the lock

Lock Health Check

Query current lock status via get_lock_health():

from app.utils.scheduler_lock import get_lock_health

health = await get_lock_health(db)
# Returns: {"locked": bool, "pid": int|None, "hostname": str|None,
#           "age_seconds": float|None, "is_stale": bool, "ttl_remaining": float|None}

Alerting Rules

Critical alerts:

scheduler_lock_acquired not seen for >5 minutes during startup → Instance may not have acquired lock
scheduler_lock_heartbeat_lost repeated >3 times → Lock keeps being stolen, possible contention issue

Warning alerts:

scheduler_lock_held_by_other_instance every few minutes → Normal if multiple instances, abnormal if single instance

Database Query

Check lock state directly in SQLite:

-- heartbeat_at is treated as Unix epoch seconds
SELECT pid, hostname, heartbeat_at, heartbeat_timeout,
       (strftime('%s', 'now') - heartbeat_at) AS age_seconds
FROM scheduler_lock WHERE id = 1;
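The same inspection can be done from Python with the standard sqlite3 module. This is a sketch under stated assumptions: the schema is inferred from the column names in the query, heartbeat_at is assumed to hold Unix epoch seconds, and the staleness rule (age greater than heartbeat_timeout) and the function name `lock_status` are hypothetical.

```python
import sqlite3
import time
from typing import Optional

# Hypothetical schema inferred from the columns used in the query.
SCHEMA = """
CREATE TABLE scheduler_lock (
    id INTEGER PRIMARY KEY,
    pid INTEGER,
    hostname TEXT,
    heartbeat_at REAL,
    heartbeat_timeout REAL
)
"""


def lock_status(conn: sqlite3.Connection, now: Optional[float] = None) -> Optional[dict]:
    """Return lock holder details and age, or None when no lock row exists."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT pid, hostname, heartbeat_at, heartbeat_timeout "
        "FROM scheduler_lock WHERE id = 1"
    ).fetchone()
    if row is None:
        return None
    pid, hostname, heartbeat_at, timeout = row
    age = now - heartbeat_at
    # Assumed rule: the lock is stale once its age exceeds the timeout.
    return {"pid": pid, "hostname": hostname, "age_seconds": age, "is_stale": age > timeout}
```

Computing the age in Python avoids relying on SQLite's date functions and makes the check easy to test with a fixed `now`.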

Common Issues

Lock not acquired on startup: Check logs for scheduler_lock_held_by_other_instance. If another instance holds it, verify if that instance is healthy.
Background tasks not running: Use get_lock_health() to verify the lock is held. If not held, the instance cannot run scheduled tasks.
Frequent lock steals: If scheduler_lock_stale_overwrite occurs frequently, the heartbeat interval may be too long, or network latency may be causing false staleness detection.
