feat: implement API versioning /api/v1/

- All backend routers moved to /api/v1/ prefix
- Frontend BASE_URL updated to /api/v1
- Setup redirect middleware updated to redirect to /api/v1/setup
- Health router path fixed: prefix=/api/v1/health, @router.get('')
- conftest.py: set server_status=online for test fixture
- Created Docs/API_VERSIONING.md with deprecation policy
- Updated Docs/Backend-Development.md with versioning section
- Updated Instructions.md curl examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 21:29:30 +02:00
parent 0d5882b32f
commit cc6dbcf3f0
51 changed files with 1886 additions and 671 deletions
--- a/Docs/API_VERSIONING.md
+++ b/Docs/API_VERSIONING.md
@@ -0,0 +1,125 @@
+# API Versioning Strategy
+
+**Status:** Active — Current version: **v1**
+
+All BanGUI API endpoints are versioned using URI path versioning (e.g., `/api/v1/`).
+This document explains when and how to version endpoints, how deprecation works, and what guarantees consumers can rely on.
+
+---
+
+## 1. Version Lifecycle
+
+| Stage | Meaning |
+|-------|---------|
+| **Current** | Active, receiving new features and bug fixes. |
+| **Deprecated** | Still functional but marked for removal. Clients receive `Deprecation: true` and `Sunset: <date>` response headers. |
+| **Removed** | Endpoint no longer exists. Clients must migrate to a newer version. |
+
+---
+
+## 2. URL Structure
+
+```
+/api/v{major}/<resource>/<path>
+```
+
+- **v1** — current version (2026-05-02)
+- **v2** — reserved for future breaking changes
+- **Minor and patch** versions (v1.1, v1.2) are **not** used; only **major** version bumps indicate breaking changes
+- The OpenAPI schema is always available at `/api/openapi.json` regardless of version
+
+---
+
+## 3. What Triggers a Version Bump
+
+A new major version is required when a **breaking change** must be introduced, including:
+
+- Removing or renaming a field in a response model
+- Changing the type of a request or response field
+- Removing an endpoint entirely
+- Changing authentication/authorization semantics
+- Modifying the semantics of an existing operation
+
+**Non-breaking changes** (backward-compatible):
+
+- Adding new optional request fields
+- Adding new response fields
+- Adding new endpoints
+- Fixing bugs that caused incorrect behavior
+
+These do **not** require a version bump.
+
+---
+
+## 4. Deprecation Policy
+
+When an endpoint is deprecated:
+
+1. The endpoint **remains functional** for a minimum of **6 months** after it is marked deprecated; the `Sunset` date is set no earlier than the end of that period
+2. Response headers are added:
+   ```
+   Deprecation: true
+   Sunset: <RFC-5322 date>
+   Link: <https://bangui.example.com/api/v2/...>; rel="successor-version"
+   ```
+3. The OpenAPI schema marks the endpoint with `deprecated: true`
+4. Documentation is updated to show the endpoint as deprecated
+
+---
+
+## 5. Backend Development: Adding Versioned Endpoints
+
+### New endpoints
+
+All new endpoints are added to the **current** version (`/api/v1/`). Prefix your router:
+
+```python
+router = APIRouter(prefix="/api/v1/my-resource", tags=["My Resource"])
+```
+
+### Breaking changes requiring v2
+
+1. Create a new router file (e.g., `routers/my_resource_v2.py`) with the v2 prefix:
+   ```python
+   router = APIRouter(prefix="/api/v2/my-resource", tags=["My Resource"])
+   ```
+2. Copy or adapt the v1 handler logic as needed
+3. Register the new router in `app/main.py`:
+   ```python
+   app.include_router(my_resource_v2.router)
+   ```
+4. Mark the **old** v1 router as deprecated in the OpenAPI spec and add the deprecation headers from section 4 to its responses
+5. Update this document to reflect the new version lifecycle
+
+### Keeping routers DRY
+
+If v1 and v2 share logic, extract business logic into a **service layer function** and call it from both router handlers. Routers should only contain HTTP concerns (parameters, responses, status codes).
+
+---
+
+## 6. Frontend Development
+
+The frontend always uses the current version's base URL:
+
+```typescript
+const BASE_URL: string = import.meta.env.VITE_API_URL ?? "/api/v1";
+```
+
+All endpoint paths in `frontend/src/api/endpoints.ts` are defined as relative paths (e.g., `/bans`, `/jails`) and are appended to `BASE_URL` at runtime.
+
+---
+
+## 7. OpenAPI / Documentation
+
+- Swagger UI: `/api/docs`
+- ReDoc: `/api/redoc`
+- OpenAPI schema: `/api/openapi.json`
+- Docs are **not** versioned; they always reflect the **current** (latest) API version
+
+---
+
+## 8. Version History
+
+| Version | Status | Released | Sunset Date | Notes |
+|---------|--------|---------|-------------|-------|
+| v1 | **Current** | 2026-05-02 | — | Initial versioning; all endpoints moved from `/api/` to `/api/v1/` |
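The response headers listed under the Deprecation Policy in `Docs/API_VERSIONING.md` could be assembled with a small helper. This is a minimal sketch, not code from the commit: the function name `deprecation_headers` and the example successor URL are assumptions, and it only builds a header dict that a handler or middleware would then attach to a response.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def deprecation_headers(sunset: datetime, successor_url: str) -> dict[str, str]:
    """Build the headers the policy asks for on a deprecated endpoint.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if sunset.tzinfo is None:
        raise ValueError("sunset must be timezone-aware")
    # HTTP dates are expressed in GMT, so normalise to UTC before formatting.
    sunset_utc = sunset.astimezone(timezone.utc)
    return {
        "Deprecation": "true",
        "Sunset": format_datetime(sunset_utc, usegmt=True),
        "Link": f'<{successor_url}>; rel="successor-version"',
    }


headers = deprecation_headers(
    datetime(2027, 1, 1, tzinfo=timezone.utc),
    "https://bangui.example.com/api/v2/jails",  # assumed successor path
)
```

Keeping the header construction in one function means every deprecated v1 router would emit the same three headers, whichever mechanism ends up attaching them.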
--- a/Docs/Backend-Development.md
+++ b/Docs/Backend-Development.md
@@ -260,6 +260,50 @@ For `history_archive`, the read-heavy workload justifies these indexes because:

 ---

+## 7.6 Never Load Unbounded Result Sets
+
+**Problem:** Loading large result sets entirely into Python memory causes:
+- Memory spikes that crash containers
+- Slow dashboard performance
+- Unbounded database file growth
+
+**Rule:** Never load unbounded result sets. Always use SQL aggregation or pagination.
+
+**Anti-pattern and preferred alternative:**
+
+```python
+# BAD — loads all rows into memory
+all_rows = await history_archive_repo.get_all_archived_history(db=db, ...)
+
+# GOOD — SQL aggregation returns lightweight counts
+ip_counts = await history_archive_repo.get_ip_ban_counts(db=db, ...)
+```
+
+**SQL aggregation patterns for common operations:**
+
+| Operation | SQL Pattern | Repository Function |
+|-----------|-------------|---------------------|
+| Count by IP | `SELECT ip, COUNT(*) FROM bans GROUP BY ip` | `get_ip_ban_counts()` |
+| Count by jail | `SELECT jail, COUNT(*) FROM bans GROUP BY jail` | `get_jail_ban_counts()` |
+| Count by time bucket | `SELECT CAST((timeofban - ?) / ? AS INTEGER) AS bucket_idx, COUNT(*) ... GROUP BY bucket_idx` | `get_ban_counts_by_bucket()` |
+| Paginated rows | `WHERE id < ? ORDER BY id DESC LIMIT ?` | `get_archived_history_keyset()` |
+
+**When to use SQL aggregation:**
+- Computing totals, counts, or aggregations for display
+- Building country/jail/geo maps from large datasets
+- Any endpoint that needs only a summary, not full row data
+
+**When to use pagination:**
+- Endpoints that return individual records for display (ban lists, history)
+- Any endpoint where clients need access to specific rows
+
+**Memory budgets for reference:**
+- 1M ban records ≈ 200-400 MB if fully materialized as Python dicts
+- SQL aggregation returns lightweight `{ip, count}` pairs: the result grows with the number of distinct IPs, not the number of rows (often a few KB for the same 1M records)
+- Keyset pagination returns only the page size (typically 50-200 rows)
+
+---
+
 ## 3. Project Structure

 ```
@@ -1840,12 +1884,14 @@ async def client() -> AsyncClient:

@pytest.mark.asyncio
 async def test_list_jails_returns_200(client: AsyncClient) -> None:
-    response = await client.get("/api/jails/")
+    response = await client.get("/api/v1/jails/")
    assert response.status_code == 200
    data: dict = response.json()
    assert "jails" in data
 ```

+See [API_VERSIONING.md](API_VERSIONING.md) for the full versioning strategy, deprecation policy, and instructions for adding versioned endpoints.
+
 ---

 ## 9.1 Background Tasks and Scheduler Architecture
--- a/Docs/Instructions.md
+++ b/Docs/Instructions.md
@@ -230,11 +230,11 @@ The session cookie is named `bangui_session`.
 ```bash
 # Dev master password: Hallo123!
 HASHED=$(echo -n "Hallo123!" | sha256sum | awk '{print $1}')
-TOKEN=$(curl -s -X POST http://127.0.0.1:8000/api/auth/login \
+TOKEN=$(curl -s -X POST http://127.0.0.1:8000/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d "{\"password\":\"$HASHED\"}" \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')

 # Use token in subsequent requests:
-curl -H "Cookie: bangui_session=$TOKEN" http://127.0.0.1:8000/api/dashboard/status
+curl -H "Cookie: bangui_session=$TOKEN" http://127.0.0.1:8000/api/v1/dashboard/status
 ```
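The same login flow can be scripted from Python's standard library. This is a minimal sketch mirroring the curl example above: the helper names `hash_password` and `build_login_request` are assumptions, and the request is only constructed here, not sent, so it runs without a server.

```python
import hashlib
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/api/v1"  # versioned prefix from the curl example


def hash_password(password: str) -> str:
    """SHA-256 hex digest, equivalent to `echo -n ... | sha256sum`."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def build_login_request(password: str) -> urllib.request.Request:
    """Prepare the POST to /auth/login (hypothetical helper, not part of BanGUI)."""
    body = json.dumps({"password": hash_password(password)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_login_request("Hallo123!")
# To send it against a running dev server:
#   with urllib.request.urlopen(request) as resp:
#       token = json.load(resp)["token"]
```

The returned token would then go into the `bangui_session` cookie on later requests, as in the second curl command.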
--- a/Docs/PERFORMANCE.md
+++ b/Docs/PERFORMANCE.md
@@ -0,0 +1,98 @@
+# Performance Guidelines
+
+Query optimization patterns for BanGUI backend services.
+
+---
+
+## Never Load Unbounded Result Sets
+
+Loading large result sets into Python memory causes OOM crashes, slow responses, and unbounded growth. Every query that processes large datasets must use one of the following strategies.
+
+### The Problem
+
+With millions of ban records:
+- Loading all rows as Python dicts → 200-400 MB+ memory spike
+- Python loop aggregation (O(n) over all rows) → seconds of CPU time
+- Offset pagination on large tables → O(n) scan before returning results
+
+### The Solution: SQL Aggregation
+
+SQL `GROUP BY` runs inside SQLite's query engine, using indexes where available, and returns only the aggregated result (typically a few KB).
+
+```python
+# BAD: loads 1M rows into Python
+all_rows = await get_all_archived_history(db, since=since)
+agg = {}
+for row in all_rows:  # O(n) Python loop
+    agg[row["ip"]] = agg.get(row["ip"], 0) + 1
+
+# GOOD: SQL aggregation, returns lightweight {ip, count} pairs
+ip_counts = await get_ip_ban_counts(db, since=since)
+# [{"ip": "1.2.3.4", "event_count": 42}, ...]: one entry per distinct IP, not per row
+```
+
+### Aggregation Reference
+
+| Use Case | SQL Pattern | Repository Function |
+|----------|-------------|-------------------|
+| Ban count per IP | `SELECT ip, COUNT(*) FROM history_archive ... GROUP BY ip` | `get_ip_ban_counts()` |
+| Ban count per jail | `SELECT jail, COUNT(*) FROM history_archive ... GROUP BY jail ORDER BY COUNT(*) DESC` | `get_jail_ban_counts()` |
+| Ban count per time bucket | `SELECT CAST((timeofban - ?) / ? AS INTEGER) AS bucket_idx, COUNT(*) ... GROUP BY bucket_idx` | `get_ban_counts_by_bucket()` |
+| Paginated rows (no offset) | `WHERE id < ? ORDER BY id DESC LIMIT ?` | `get_archived_history_keyset()` |
+| Total count | `SELECT COUNT(*) FROM ...` (fast when the `WHERE` clause can use an index) | included in `get_jail_ban_counts()` return |
+
+### Pagination vs Aggregation
+
+Use **aggregation** when:
+- Displaying summary data (counts, totals, group-by results)
+- Building country/jail/timeline dashboards
+- Only need counts, not individual row data
+
+Use **pagination** when:
+- Displaying individual records (ban list, history)
+- Clients need access to specific rows
+- Exporting or bulk operations
+
+### Batch Geo Lookups
+
+When you need geo data for many IPs, batch in a single call rather than per-IP:
+
+```python
+# BAD: N sequential API calls
+for ip in unique_ips:
+    geo = await geo_service.lookup(ip)  # 45 req/min rate limit × N calls
+
+# GOOD: one batch call, geo_service handles rate limiting
+geo_map, uncached = geo_cache_lookup(unique_ips)  # uses in-memory cache
+if uncached:
+    asyncio.create_task(geo_cache.lookup_batch(uncached, http_session))  # fire-and-forget
+```
+
+### Index Requirements
+
+SQLite needs indexes on:
+- Columns used in WHERE clauses (timeofban, jail, action)
+- Columns used in GROUP BY (ip, jail, bucket index)
+- Sort columns for pagination (id)
+
+Current indexes on `history_archive`:
+- `idx_history_archive_timeofban` — for time-range filtering
+- `idx_history_archive_jail_timeofban` — for jail + time filtering
+- `idx_history_archive_action_timeofban` — for action + time filtering
+- `idx_history_archive_id` — for keyset pagination
+
+Before adding a new query pattern, verify it uses an existing index or add one with a benchmark test.
+
+### Memory Monitoring
+
+Watch for these warning signs:
+- Python RSS > 500 MB in container metrics
+- Response time > 5s for dashboard endpoints
+- Query time > 1s (for example, measured with `.timer on` in the `sqlite3` shell)
+
+Use `EXPLAIN QUERY PLAN` to verify index usage:
+```sql
+EXPLAIN QUERY PLAN SELECT ip, COUNT(*) FROM history_archive WHERE timeofban >= ? GROUP BY ip;
+```
+
+Expected: `USING INDEX idx_history_archive_timeofban` in the output.
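The two strategies from `Docs/PERFORMANCE.md`, SQL aggregation and keyset pagination, can be tried against an in-memory SQLite database. This is a minimal sketch under stated assumptions: the reduced `history_archive` schema, the synthetic rows, and the function names `ip_ban_counts` and `history_page` are illustrative, and it uses the synchronous `sqlite3` module rather than the project's own repository functions.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history_archive ("
    "id INTEGER PRIMARY KEY, ip TEXT, jail TEXT, timeofban INTEGER)"
)
# Ten synthetic rows: ids 1..10, two alternating IPs, one ban per minute.
conn.executemany(
    "INSERT INTO history_archive (ip, jail, timeofban) VALUES (?, ?, ?)",
    [(f"10.0.0.{i % 2}", "sshd", 1_000 + 60 * i) for i in range(10)],
)


def ip_ban_counts(db: sqlite3.Connection, since: int) -> list:
    """Aggregate in SQL so only one row per distinct IP reaches Python."""
    return db.execute(
        "SELECT ip, COUNT(*) FROM history_archive "
        "WHERE timeofban >= ? GROUP BY ip ORDER BY ip",
        (since,),
    ).fetchall()


def history_page(db: sqlite3.Connection, before_id: Optional[int], limit: int) -> list:
    """Keyset pagination: seek past the last seen id instead of using OFFSET."""
    if before_id is None:
        # First page: newest rows first.
        return db.execute(
            "SELECT id, ip FROM history_archive ORDER BY id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    return db.execute(
        "SELECT id, ip FROM history_archive WHERE id < ? ORDER BY id DESC LIMIT ?",
        (before_id, limit),
    ).fetchall()


counts = ip_ban_counts(conn, since=0)
first = history_page(conn, before_id=None, limit=4)
# The smallest id on the current page becomes the cursor for the next page.
second = history_page(conn, before_id=first[-1][0], limit=4)
```

Each call returns at most `limit` rows or one row per distinct IP, so memory use stays bounded however large the table grows.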
--- a/Docs/TROUBLESHOOTING.md
+++ b/Docs/TROUBLESHOOTING.md
@@ -86,6 +86,54 @@ ps aux | grep <pid>

 ---

+## Rate Limiting
+
+### Getting 429 Too Many Requests
+
+**Symptom:** API returns HTTP 429 with `rate_limit_exceeded` error code.
+
+**Cause:** You have exceeded the per-IP rate limit for a specific operation.
+
+**Diagnosis:**
+1. Check the `Retry-After` header in the response — this tells you how many seconds to wait
+2. Look for the log event `*_rate_limit_exceeded` which shows the bucket and client IP
+
+**Rate limit buckets:**
+| Bucket | Limit | Window | Operations |
+|--------|-------|--------|------------|
+| `bans:ban` | 100 | 1 minute | Ban IP addresses |
+| `bans:unban` | 100 | 1 minute | Unban IP addresses |
+| `blocklist:import` | 10 | 1 hour | Import blocklists |
+| `config:update` | 50 | 1 minute | Update configuration |
+| `jail:update` | 100 | 1 minute | Update jail config |
+| `jail:create` | 100 | 1 minute | Add log paths, assign filters/actions |
+| `jail:delete` | 100 | 1 minute | Remove log paths, actions |
+| `jail:activate` | 100 | 1 minute | Activate jails |
+| `jail:deactivate` | 100 | 1 minute | Deactivate jails |
+| `filter:update` | 50 | 1 minute | Update filters |
+| `filter:create` | 50 | 1 minute | Create filters |
+| `filter:delete` | 50 | 1 minute | Delete filters |
+| `action:update` | 50 | 1 minute | Update actions |
+| `action:create` | 50 | 1 minute | Create actions |
+| `action:delete` | 50 | 1 minute | Delete actions |
+
+**Solution:**
+1. Wait for the `Retry-After` period before retrying
+2. If you hit the limit during legitimate bulk operations, consider batching requests
+3. For blocklist imports (10/hour), ensure automated imports run no more often than that
+
+**Prevention:**
+- Monitor `*_rate_limit_exceeded` log events
+- Adjust limits via environment variables if needed (see `Docs/CONFIGURATION.md`)
+- For bulk operations, implement client-side throttling
+
+**Note:** If rate limiting triggers unexpectedly for legitimate use, check for:
+- Internal monitoring scripts hitting endpoints too frequently
+- Multiple users behind the same proxy IP
+- Stale rate limit state after process restart (uses in-memory tracking)
+
+---
+
 ## General Recovery Commands

 Clear all locks:
--- a/Docs/Tasks.md
+++ b/Docs/Tasks.md
@@ -1,97 +1,3 @@
-## HIGH PRIORITY ISSUES
-
---
-
-### Issue #3: HIGH - Unbounded Query Results Causing OOM (Out of Memory)
-
-**Where found**: 
- `backend/app/repositories/history_archive_repo.py` - `get_all_archived_history()` 
- `backend/app/services/ban_service.py` (lines 589-600) - `bans_by_country()` loads all unique IPs into memory
- `backend/app/services/ban_service.py` (lines 650-680) - N+1 geo lookup pattern
-
-**Why this is needed**: 
-With large deployments having millions of ban records, queries that load entire tables into memory cause:
- Memory spikes that crash the container
- Slow dashboard performance
- Database file growth without bounds
-
-**Goal**: 
-Implement pagination, streaming, and batch processing for all large queries to ensure bounded memory usage and consistent performance.
-
-**What to do**:
-1. Refactor `get_all_archived_history()` to only be called with pagination parameters
-2. Refactor `bans_by_country()` to:
-   - Process countries in batches
-   - Stream results instead of collecting all in memory
-   - Implement server-side aggregation in SQL instead of Python loops
-3. Add `LIMIT` + `OFFSET` or cursor-based pagination to all list endpoints
-4. Implement batch geo lookups instead of per-IP loops
-5. Add tests with large datasets (1M+ records) to catch performance regressions
-
-**Possible traps and issues**:
- Changing query patterns might break sorting/filtering logic
- Pagination cursor format must be consistent across endpoints
- Memory usage must be monitored in production
- Aggregation queries might need new database indexes
- Frontend pagination UI assumes cursor format - changes will break old clients
-
-**Docs changes needed**:
- Add performance guidelines to `Docs/Backend-Development.md` - "Never load unbounded result sets"
- Create `Docs/PERFORMANCE.md` with query optimization patterns
- Document pagination standards in API docs
-
-**Doc references**:
- DETAILED_FINDINGS.md - Issues #2, #3, #4 (Unbounded queries, N+1, Large structures)
- DATABASE_API_DEPLOYMENT_ISSUES.md - Section "Database Design Issues"
-
---
-
-### Issue #4: HIGH - Missing Rate Limiting on Write Operations
-
-**Where found**: 
- `backend/app/middleware/rate_limit.py` - Only applied to login endpoint
- `backend/app/routers/bans.py` - POST /api/bans/ban, POST /api/bans/unban (NO rate limit)
- `backend/app/routers/blocklist.py` - POST /api/blocklists/:id/import (NO rate limit)
- `backend/app/routers/config.py` - PUT endpoints (NO rate limit)
-
-**Why this is needed**: 
-Without rate limiting on state-mutating endpoints, an attacker can:
- Spam ban requests to exhaust fail2ban resources
- Trigger repeated blocklist imports consuming bandwidth/CPU
- Cause DoS by hammering config updates
-
-**Goal**: 
-Extend rate limiting to all write operations (POST, PUT, DELETE) with appropriate rate limits per operation type.
-
-**What to do**:
-1. Create rate limit buckets for different operations:
-   - `bans:ban` - 100/minute per IP
-   - `bans:unban` - 100/minute per IP
-   - `blocklist:import` - 10/hour per IP
-   - `config:update` - 50/minute per IP
-2. Apply rate limiting middleware to all write endpoints
-3. Return 429 with `Retry-After` header when limit exceeded
-4. Add metrics/monitoring for rate limit hits
-5. Make rate limits configurable via environment variables
-
-**Possible traps and issues**:
- Rate limiting at IP level doesn't work behind proxies (need proper X-Forwarded-For handling)
- Different operations need different rate limits (can't use global limit)
- Legitimate bulk operations might hit limits unexpectedly
- Rate limit state must be persistent across process restarts (use database or Redis)
- False positives from internal monitoring scripts hammering endpoints
-
-**Docs changes needed**:
- Add rate limit table to API documentation
- Document in `Docs/CONFIGURATION.md` how to adjust rate limits
- Add to `Docs/TROUBLESHOOTING.md` - "Getting 429 Too Many Requests"
-
-**Doc references**:
- DETAILED_FINDINGS.md - Issue #5 "Missing Rate Limiting"
- `backend/app/middleware/rate_limit.py` - Current implementation
-
---
-
 ### Issue #5: HIGH - API Has No Versioning Strategy

 **Where found**: