feat: implement API versioning /api/v1/

- All backend routers moved to /api/v1/ prefix
- Frontend BASE_URL updated to /api/v1
- Setup redirect middleware updated to redirect to /api/v1/setup
- Health router path fixed: prefix=/api/v1/health, @router.get('')
- conftest.py: set server_status=online for test fixture
- Created Docs/API_VERSIONING.md with deprecation policy
- Updated Docs/Backend-Development.md with versioning section
- Updated Instructions.md curl examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-02 21:29:30 +02:00
parent 0d5882b32f
commit cc6dbcf3f0
51 changed files with 1886 additions and 671 deletions

Docs/API_VERSIONING.md

@@ -0,0 +1,125 @@
# API Versioning Strategy
**Status:** Active — Current version: **v1**
All BanGUI API endpoints are versioned using URI path versioning (e.g., `/api/v1/`).
This document explains when and how to version endpoints, how deprecation works, and what guarantees consumers can rely on.
---
## 1. Version Lifecycle
| Stage | Meaning |
|-------|---------|
| **Current** | Active, receiving new features and bug fixes. |
| **Deprecated** | Still functional but marked for removal. Clients receive `Deprecation: true` and `Sunset: <date>` response headers. |
| **Removed** | Endpoint no longer exists. Clients must migrate to a newer version. |
---
## 2. URL Structure
```
/api/v{major}/<resource>/<path>
```
- **v1** — current version (2026-05-02)
- **v2** — reserved for future breaking changes
- **Minor** versions (v1.1, v1.2) are **not** used; only **major** version bumps indicate breaking changes
- The OpenAPI schema is always available at `/api/openapi.json` regardless of version
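As an illustration of the URL structure above, the following sketch extracts the major version from a request path. The helper name `parse_api_version` is hypothetical and not part of BanGUI; it only assumes the `/api/v{major}/` convention described here.

```python
import re
from typing import Optional

# Matches "/api/v<major>" followed by "/" or end of string.
# Minor-style segments such as "v1.1" do not match, in line with the rule above.
_VERSION_RE = re.compile(r"^/api/v(\d+)(?:/|$)")


def parse_api_version(path: str) -> Optional[int]:
    """Return the major version encoded in the path, or None if the path is unversioned."""
    match = _VERSION_RE.match(path)
    return int(match.group(1)) if match else None
```

Unversioned paths such as `/api/openapi.json` yield `None`, which matches the statement that the schema is served outside any version prefix.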
---
## 3. What Triggers a Version Bump
A new major version is required when a **breaking change** must be introduced, including:
- Removing or renaming a field in a response model
- Changing the type of a request or response field
- Removing an endpoint entirely
- Changing authentication/authorization semantics
- Modifying the semantics of an existing operation
**Non-breaking changes** (backward-compatible):
- Adding new optional request fields
- Adding new response fields
- Adding new endpoints
- Fixing bugs that caused incorrect behavior
These do **not** require a version bump.
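The response-field rules above can be approximated mechanically. This is a minimal sketch, assuming each response model is summarized as a mapping from field name to a type name; `find_breaking_changes` is a hypothetical helper, and it cannot detect semantic or authorization changes.

```python
from typing import Dict, List


def find_breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Compare two {field: type_name} maps and list changes that would require a new major version."""
    problems: List[str] = []
    for field, old_type in old.items():
        if field not in new:
            # A rename shows up here too, because the old name disappears.
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} ({old_type} -> {new[field]})")
    # Fields that exist only in `new` are additions and are treated as non-breaking.
    return problems
```

An empty result means the field-level comparison found nothing breaking; it does not prove the change is safe.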
---
## 4. Deprecation Policy
When an endpoint is deprecated:
1. The endpoint **remains functional** for a minimum of **6 months** after deprecation is announced; the `Sunset` date marks the earliest removal date
2. Response headers are added:
```
Deprecation: true
Sunset: <HTTP-date>
Link: <https://bangui.example.com/api/v2/...>; rel="successor-version"
```
3. The OpenAPI schema marks the endpoint with `deprecated: true`
4. Documentation is updated to show the endpoint as deprecated
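The header set in step 2 could be built as in the sketch below. The function name `deprecation_headers` is hypothetical, and the example assumes the `Sunset` value is rendered in the HTTP date format (for example `Sun, 01 Nov 2026 00:00:00 GMT`); how BanGUI attaches the headers to responses is not shown in this document.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def deprecation_headers(sunset: datetime, successor_url: str) -> Dict[str, str]:
    """Build the Deprecation, Sunset and Link headers for a deprecated endpoint."""
    # format_datetime(..., usegmt=True) requires an aware datetime in UTC,
    # so convert first. A naive datetime is interpreted as local time here.
    sunset_utc = sunset.astimezone(timezone.utc)
    return {
        "Deprecation": "true",
        "Sunset": format_datetime(sunset_utc, usegmt=True),
        "Link": f'<{successor_url}>; rel="successor-version"',
    }
```

The returned mapping can be merged into a response's headers by whatever mechanism the router or middleware uses.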
---
## 5. Backend Development: Adding Versioned Endpoints
### New endpoints
All new endpoints are added to the **current** version (`/api/v1/`). Prefix your router:
```python
router = APIRouter(prefix="/api/v1/my-resource", tags=["My Resource"])
```
### Breaking changes requiring v2
1. Create a new router file (e.g., `routers/my_resource_v2.py`) with the v2 prefix:
```python
router = APIRouter(prefix="/api/v2/my-resource", tags=["My Resource"])
```
2. Copy or adapt the v1 handler logic as needed
3. Register the new router in `app/main.py`:
```python
app.include_router(my_resource_v2.router)
```
4. Mark the **old** v1 router as deprecated in the OpenAPI spec and add the deprecation headers to its responses
5. Update this document to reflect the new version lifecycle
### Keeping routers DRY
If v1 and v2 share logic, extract business logic into a **service layer function** and call it from both router handlers. Routers should only contain HTTP concerns (parameters, responses, status codes).
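A minimal sketch of that split, using plain functions in place of FastAPI route handlers so it runs without dependencies. All names (`list_bans`, `list_bans_v1`, `list_bans_v2`) and the v2 response shape are hypothetical and only illustrate the pattern.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory data standing in for a repository call.
_BANS: List[Dict[str, str]] = [
    {"ip": "192.0.2.1", "jail": "sshd"},
    {"ip": "192.0.2.2", "jail": "nginx"},
]


def list_bans(jail: Optional[str] = None) -> List[Dict[str, str]]:
    """Service layer: business logic shared by every API version."""
    return [ban for ban in _BANS if jail is None or ban["jail"] == jail]


def list_bans_v1(jail: Optional[str] = None) -> dict:
    """v1 handler: only shapes the response."""
    return {"bans": list_bans(jail)}


def list_bans_v2(jail: Optional[str] = None) -> dict:
    """v2 handler: an assumed breaking change (renamed key plus a total) over the same service call."""
    items = list_bans(jail)
    return {"items": items, "total": len(items)}
```

Because both handlers delegate to `list_bans`, a bug fix in the filtering logic reaches v1 and v2 at once, while each version keeps its own response contract.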
---
## 6. Frontend Development
The frontend always uses the current version's base URL:
```typescript
const BASE_URL: string = import.meta.env.VITE_API_URL ?? "/api/v1";
```
All endpoint paths in `frontend/src/api/endpoints.ts` are defined as relative paths (e.g., `/bans`, `/jails`) and are appended to `BASE_URL` at runtime.
---
## 7. OpenAPI / Documentation
- Swagger UI: `/api/docs`
- ReDoc: `/api/redoc`
- OpenAPI schema: `/api/openapi.json`
- Docs are **not** versioned; they always reflect the **current** (latest) API version
---
## 8. Version History
| Version | Status | Released | Sunset Date | Notes |
|---------|--------|---------|-------------|-------|
| v1 | **Current** | 2026-05-02 | — | Initial versioning; all endpoints moved from `/api/` to `/api/v1/` |


@@ -260,6 +260,50 @@ For `history_archive`, the read-heavy workload justifies these indexes because:
---
## 7.6 Never Load Unbounded Result Sets
**Problem:** Loading large result sets entirely into Python memory causes:
- Memory spikes that crash containers
- Slow dashboard performance
- Unbounded database file growth
**Rule:** Never load unbounded result sets. Always use SQL aggregation or pagination.
**Anti-pattern and preferred alternative:**
```python
# BAD — loads all rows into memory
all_rows = await history_archive_repo.get_all_archived_history(db=db, ...)
# GOOD — SQL aggregation returns lightweight counts
ip_counts = await history_archive_repo.get_ip_ban_counts(db=db, ...)
```
**SQL aggregation patterns for common operations:**
| Operation | SQL Pattern | Repository Function |
|-----------|-------------|---------------------|
| Count by IP | `SELECT ip, COUNT(*) FROM bans GROUP BY ip` | `get_ip_ban_counts()` |
| Count by jail | `SELECT jail, COUNT(*) FROM bans GROUP BY jail` | `get_jail_ban_counts()` |
| Count by time bucket | `SELECT CAST((timeofban - ?) / ? AS INTEGER) AS bucket_idx, COUNT(*) ... GROUP BY bucket_idx` | `get_ban_counts_by_bucket()` |
| Paginated rows | `WHERE id < ? ORDER BY id DESC LIMIT ?` | `get_archived_history_keyset()` |
**When to use SQL aggregation:**
- Computing totals, counts, or aggregations for display
- Building country/jail/geo maps from large datasets
- Any endpoint that needs only a summary, not full row data
**When to use pagination:**
- Endpoints that return individual records for display (ban lists, history)
- Any endpoint where clients need access to specific rows
**Memory budgets for reference:**
- 1M ban records ≈ 200-400 MB if fully materialized as Python dicts
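- SQL aggregation returns one lightweight row per group: `{ip, count}` pairs for the same 1M records total a few KB when the number of distinct IPs is small, and grow with the number of distinct IPs rather than with the row count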
- Keyset pagination returns only the page size (typically 50-200 rows)
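The time-bucket pattern from the table can be tried against an in-memory SQLite database. This is a sketch under assumptions: the table layout and the function names `ban_counts_by_bucket` and `demo` are illustrative and do not reproduce the actual `get_ban_counts_by_bucket()` implementation.

```python
import sqlite3
from typing import List, Tuple


def ban_counts_by_bucket(conn: sqlite3.Connection, since: int, bucket_size: int) -> List[Tuple[int, int]]:
    """Return (bucket_idx, count) pairs computed entirely inside SQLite."""
    rows = conn.execute(
        """
        SELECT CAST((timeofban - ?) / ? AS INTEGER) AS bucket_idx, COUNT(*)
        FROM history_archive
        WHERE timeofban >= ?
        GROUP BY bucket_idx
        ORDER BY bucket_idx
        """,
        (since, bucket_size, since),
    ).fetchall()
    return [(int(bucket), int(count)) for bucket, count in rows]


def demo() -> List[Tuple[int, int]]:
    """Build a tiny assumed schema and aggregate it into 60-second buckets."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE history_archive (id INTEGER PRIMARY KEY, ip TEXT, jail TEXT, timeofban INTEGER)"
    )
    conn.executemany(
        "INSERT INTO history_archive (ip, jail, timeofban) VALUES (?, ?, ?)",
        [
            ("192.0.2.1", "sshd", 1000),
            ("192.0.2.2", "sshd", 1030),
            ("192.0.2.1", "nginx", 1075),
            ("192.0.2.3", "sshd", 1190),
        ],
    )
    return ban_counts_by_bucket(conn, since=1000, bucket_size=60)
```

Buckets with no bans are simply absent from the result, so a caller that renders a timeline has to fill the gaps with zeros itself.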
---
## 3. Project Structure
```
@@ -1840,12 +1884,14 @@ async def client() -> AsyncClient:
 @pytest.mark.asyncio
 async def test_list_jails_returns_200(client: AsyncClient) -> None:
-    response = await client.get("/api/jails/")
+    response = await client.get("/api/v1/jails/")
     assert response.status_code == 200
     data: dict = response.json()
     assert "jails" in data
```
See [API_VERSIONING.md](API_VERSIONING.md) for the full versioning strategy, deprecation policy, and instructions for adding versioned endpoints.
---
## 9.1 Background Tasks and Scheduler Architecture


@@ -230,11 +230,11 @@ The session cookie is named `bangui_session`.
```bash
# Dev master password: Hallo123!
HASHED=$(echo -n "Hallo123!" | sha256sum | awk '{print $1}')
-TOKEN=$(curl -s -X POST http://127.0.0.1:8000/api/auth/login \
+TOKEN=$(curl -s -X POST http://127.0.0.1:8000/api/v1/auth/login \
-H 'Content-Type: application/json' \
-d "{\"password\":\"$HASHED\"}" \
| python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')
# Use token in subsequent requests:
-curl -H "Cookie: bangui_session=$TOKEN" http://127.0.0.1:8000/api/dashboard/status
+curl -H "Cookie: bangui_session=$TOKEN" http://127.0.0.1:8000/api/v1/dashboard/status
```

Docs/PERFORMANCE.md

@@ -0,0 +1,98 @@
# Performance Guidelines
Query optimization patterns for BanGUI backend services.
---
## Never Load Unbounded Result Sets
Loading large result sets into Python memory causes OOM crashes, slow responses, and unbounded growth. Every query that processes large datasets must use one of the following strategies.
### The Problem
With millions of ban records:
- Loading all rows as Python dicts → 200-400 MB+ memory spike
- Python loop aggregation (O(n) over all rows) → seconds of CPU time
- Offset pagination on large tables → O(n) scan before returning results
### The Solution: SQL Aggregation
A SQL `GROUP BY` runs inside SQLite's query engine, uses indexes where available, and returns only the aggregated result (typically a few KB).
```python
# BAD: loads 1M rows into Python
all_rows = await get_all_archived_history(db, since=since)
agg: dict[str, int] = {}
for row in all_rows:  # O(n) Python loop over every row
    agg[row["ip"]] = agg.get(row["ip"], 0) + 1

# GOOD: SQL aggregation, returns lightweight {ip, count} pairs
ip_counts = await get_ip_ban_counts(db, since=since)
# [{"ip": "1.2.3.4", "event_count": 42}, ...]: one small row per distinct IP
```
### Aggregation Reference
| Use Case | SQL Pattern | Repository Function |
|----------|-------------|-------------------|
| Ban count per IP | `SELECT ip, COUNT(*) FROM history_archive ... GROUP BY ip` | `get_ip_ban_counts()` |
| Ban count per jail | `SELECT jail, COUNT(*) FROM history_archive ... GROUP BY jail ORDER BY COUNT(*) DESC` | `get_jail_ban_counts()` |
| Ban count per time bucket | `SELECT CAST((timeofban - ?) / ? AS INTEGER) AS bucket_idx, COUNT(*) ... GROUP BY bucket_idx` | `get_ban_counts_by_bucket()` |
| Paginated rows (no offset) | `WHERE id < ? ORDER BY id DESC LIMIT ?` | `get_archived_history_keyset()` |
| Total count | `SELECT COUNT(*) FROM ...` (fast with where clause) | included in `get_jail_ban_counts()` return |
### Pagination vs Aggregation
Use **aggregation** when:
- Displaying summary data (counts, totals, group-by results)
- Building country/jail/timeline dashboards
- Only counts are needed, not individual row data
Use **pagination** when:
- Displaying individual records (ban list, history)
- Clients need access to specific rows
- Exporting or bulk operations
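Keyset pagination, as listed in the reference table (`WHERE id < ? ORDER BY id DESC LIMIT ?`), can be sketched as follows. The schema and the names `fetch_page` and `make_demo_db` are assumptions for illustration and are not the signature of `get_archived_history_keyset()`.

```python
import sqlite3
from typing import List, Optional, Tuple


def make_demo_db(row_count: int) -> sqlite3.Connection:
    """Create an in-memory table with ids 1..row_count (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history_archive (id INTEGER PRIMARY KEY, ip TEXT)")
    conn.executemany(
        "INSERT INTO history_archive (id, ip) VALUES (?, ?)",
        [(i, f"192.0.2.{i}") for i in range(1, row_count + 1)],
    )
    return conn


def fetch_page(
    conn: sqlite3.Connection, before_id: Optional[int], page_size: int
) -> Tuple[List[tuple], Optional[int]]:
    """Return one page of rows, newest first, plus the cursor for the next page."""
    if before_id is None:
        # First page: no cursor yet, start from the highest id.
        rows = conn.execute(
            "SELECT id, ip FROM history_archive ORDER BY id DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    else:
        # Later pages: seek directly past the last id already seen.
        rows = conn.execute(
            "SELECT id, ip FROM history_archive WHERE id < ? ORDER BY id DESC LIMIT ?",
            (before_id, page_size),
        ).fetchall()
    # A short page means the end was reached; otherwise the last id is the next cursor.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor
```

Unlike `OFFSET`, each page is located through the primary key, so the cost per page does not grow with how deep the client has paged. When the row count is an exact multiple of the page size, the final cursor leads to one empty page.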
### Batch Geo Lookups
When you need geo data for many IPs, batch in a single call rather than per-IP:
```python
# BAD: N sequential API calls
for ip in unique_ips:
    geo = await geo_service.lookup(ip)  # 45 req/min rate limit × N calls

# GOOD: one batch call, geo_service handles rate limiting
geo_map, uncached = geo_cache_lookup(unique_ips)  # uses in-memory cache
if uncached:
    # Fire-and-forget; keep a reference to the task if it must not be garbage-collected early
    asyncio.create_task(geo_cache.lookup_batch(uncached, http_session))
```
### Index Requirements
SQLite needs indexes on:
- Columns used in WHERE clauses (timeofban, jail, action)
- Columns used in GROUP BY (ip, jail, bucket index)
- Sort columns for pagination (id)
Current indexes on `history_archive`:
- `idx_history_archive_timeofban` — for time-range filtering
- `idx_history_archive_jail_timeofban` — for jail + time filtering
- `idx_history_archive_action_timeofban` — for action + time filtering
- `idx_history_archive_id` — for keyset pagination
Before adding a new query pattern, verify it uses an existing index or add one with a benchmark test.
### Memory Monitoring
Watch for these warning signs:
- Python RSS > 500 MB in container metrics
- Response time > 5s for dashboard endpoints
- Query time > 1s, measured for example with the `sqlite3` shell's `.timer on` (SQLite has no `EXPLAIN ANALYZE`)
Use `EXPLAIN QUERY PLAN` to verify index usage:
```sql
EXPLAIN QUERY PLAN SELECT ip, COUNT(*) FROM history_archive WHERE timeofban >= ? GROUP BY ip;
```
Expected: `USING INDEX idx_history_archive_timeofban` in the output.
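The same check can be automated from Python, for instance in a benchmark or regression test. This is a sketch, assuming a simplified `history_archive` schema with a single index; `query_uses_index` and `make_indexed_db` are hypothetical helpers.

```python
import sqlite3
from typing import Sequence


def make_indexed_db() -> sqlite3.Connection:
    """Create an in-memory table with one index on timeofban (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE history_archive (id INTEGER PRIMARY KEY, ip TEXT, jail TEXT, timeofban INTEGER)"
    )
    conn.execute("CREATE INDEX idx_history_archive_timeofban ON history_archive (timeofban)")
    return conn


def query_uses_index(conn: sqlite3.Connection, sql: str, params: Sequence, index_name: str) -> bool:
    """Return True if the query plan for `sql` mentions `index_name`."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each plan row ends with a human-readable detail string in column 3.
    return any(index_name in row[3] for row in plan)
```

The detail text is meant for humans and its wording can change between SQLite releases, so a substring match on the index name is more robust than comparing whole plan lines.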


@@ -86,6 +86,54 @@ ps aux | grep <pid>
---
## Rate Limiting
### Getting 429 Too Many Requests
**Symptom:** API returns HTTP 429 with `rate_limit_exceeded` error code.
**Cause:** You have exceeded the per-IP rate limit for a specific operation.
**Diagnosis:**
1. Check the `Retry-After` header in the response — this tells you how many seconds to wait
2. Look for `*_rate_limit_exceeded` log events, which show the bucket and client IP
**Rate limit buckets:**
| Bucket | Limit | Window | Operations |
|--------|-------|--------|------------|
| `bans:ban` | 100 | 1 minute | Ban IP addresses |
| `bans:unban` | 100 | 1 minute | Unban IP addresses |
| `blocklist:import` | 10 | 1 hour | Import blocklists |
| `config:update` | 50 | 1 minute | Update configuration |
| `jail:update` | 100 | 1 minute | Update jail config |
| `jail:create` | 100 | 1 minute | Add log paths, assign filters/actions |
| `jail:delete` | 100 | 1 minute | Remove log paths, actions |
| `jail:activate` | 100 | 1 minute | Activate jails |
| `jail:deactivate` | 100 | 1 minute | Deactivate jails |
| `filter:update` | 50 | 1 minute | Update filters |
| `filter:create` | 50 | 1 minute | Create filters |
| `filter:delete` | 50 | 1 minute | Delete filters |
| `action:update` | 50 | 1 minute | Update actions |
| `action:create` | 50 | 1 minute | Create actions |
| `action:delete` | 50 | 1 minute | Delete actions |
**Solution:**
1. Wait for the `Retry-After` period before retrying
2. If you hit the limit during legitimate bulk operations, consider batching requests
3. For blocklist imports (10/hour), ensure automated imports run no more than 10 times per hour from the same IP
**Prevention:**
- Monitor `*_rate_limit_exceeded` log events
- Adjust limits via environment variables if needed (see `Docs/CONFIGURATION.md`)
- For bulk operations, implement client-side throttling
**Note:** If rate limiting triggers unexpectedly for legitimate use, check for:
- Internal monitoring scripts hitting endpoints too frequently
- Multiple users behind the same proxy IP
- Stale rate limit state after process restart (uses in-memory tracking)
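To make the bucket table and the `Retry-After` behavior concrete, here is a minimal in-memory limiter. It is a sketch under assumptions: a fixed-window algorithm is used for simplicity, the class name `FixedWindowLimiter` is hypothetical, and the actual middleware may use a different algorithm. Only three buckets from the table are included.

```python
import math
import time
from typing import Callable, Dict, Optional, Tuple

# (limit, window_seconds) per bucket; values copied from the table above.
BUCKETS: Dict[str, Tuple[int, int]] = {
    "bans:ban": (100, 60),
    "blocklist:import": (10, 3600),
    "config:update": (50, 60),
}


class FixedWindowLimiter:
    """Per-IP, per-bucket counter that is lost on process restart (in-memory only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # (bucket, client_ip) -> (window_start, request_count)
        self._state: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def check(self, bucket: str, client_ip: str) -> Optional[int]:
        """Return None if the request is allowed, else the Retry-After value in whole seconds."""
        limit, window = BUCKETS[bucket]
        now = self._clock()
        start, count = self._state.get((bucket, client_ip), (now, 0))
        if now - start >= window:
            # The previous window has expired; start a fresh one.
            start, count = now, 0
        if count >= limit:
            return max(1, math.ceil(start + window - now))
        self._state[(bucket, client_ip)] = (start, count + 1)
        return None
```

Because the key includes the client IP, several users behind one proxy share a counter, which is one of the causes of unexpected 429 responses listed above.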
---
## General Recovery Commands
Clear all locks:


@@ -1,97 +1,3 @@
## HIGH PRIORITY ISSUES
---
### Issue #3: HIGH - Unbounded Query Results Causing OOM (Out of Memory)
**Where found**:
- `backend/app/repositories/history_archive_repo.py` - `get_all_archived_history()`
- `backend/app/services/ban_service.py` (lines 589-600) - `bans_by_country()` loads all unique IPs into memory
- `backend/app/services/ban_service.py` (lines 650-680) - N+1 geo lookup pattern
**Why this is needed**:
With large deployments having millions of ban records, queries that load entire tables into memory cause:
- Memory spikes that crash the container
- Slow dashboard performance
- Database file growth without bounds
**Goal**:
Implement pagination, streaming, and batch processing for all large queries to ensure bounded memory usage and consistent performance.
**What to do**:
1. Refactor `get_all_archived_history()` to only be called with pagination parameters
2. Refactor `bans_by_country()` to:
- Process countries in batches
- Stream results instead of collecting all in memory
- Implement server-side aggregation in SQL instead of Python loops
3. Add `LIMIT` + `OFFSET` or cursor-based pagination to all list endpoints
4. Implement batch geo lookups instead of per-IP loops
5. Add tests with large datasets (1M+ records) to catch performance regressions
**Possible traps and issues**:
- Changing query patterns might break sorting/filtering logic
- Pagination cursor format must be consistent across endpoints
- Memory usage must be monitored in production
- Aggregation queries might need new database indexes
- Frontend pagination UI assumes cursor format - changes will break old clients
**Docs changes needed**:
- Add performance guidelines to `Docs/Backend-Development.md` - "Never load unbounded result sets"
- Create `Docs/PERFORMANCE.md` with query optimization patterns
- Document pagination standards in API docs
**Doc references**:
- DETAILED_FINDINGS.md - Issues #2, #3, #4 (Unbounded queries, N+1, Large structures)
- DATABASE_API_DEPLOYMENT_ISSUES.md - Section "Database Design Issues"
---
### Issue #4: HIGH - Missing Rate Limiting on Write Operations
**Where found**:
- `backend/app/middleware/rate_limit.py` - Only applied to login endpoint
- `backend/app/routers/bans.py` - POST /api/bans/ban, POST /api/bans/unban (NO rate limit)
- `backend/app/routers/blocklist.py` - POST /api/blocklists/:id/import (NO rate limit)
- `backend/app/routers/config.py` - PUT endpoints (NO rate limit)
**Why this is needed**:
Without rate limiting on state-mutating endpoints, an attacker can:
- Spam ban requests to exhaust fail2ban resources
- Trigger repeated blocklist imports consuming bandwidth/CPU
- Cause DoS by hammering config updates
**Goal**:
Extend rate limiting to all write operations (POST, PUT, DELETE) with appropriate rate limits per operation type.
**What to do**:
1. Create rate limit buckets for different operations:
- `bans:ban` - 100/minute per IP
- `bans:unban` - 100/minute per IP
- `blocklist:import` - 10/hour per IP
- `config:update` - 50/minute per IP
2. Apply rate limiting middleware to all write endpoints
3. Return 429 with `Retry-After` header when limit exceeded
4. Add metrics/monitoring for rate limit hits
5. Make rate limits configurable via environment variables
**Possible traps and issues**:
- Rate limiting at IP level doesn't work behind proxies (need proper X-Forwarded-For handling)
- Different operations need different rate limits (can't use global limit)
- Legitimate bulk operations might hit limits unexpectedly
- Rate limit state must be persistent across process restarts (use database or Redis)
- False positives from internal monitoring scripts hammering endpoints
**Docs changes needed**:
- Add rate limit table to API documentation
- Document in `Docs/CONFIGURATION.md` how to adjust rate limits
- Add to `Docs/TROUBLESHOOTING.md` - "Getting 429 Too Many Requests"
**Doc references**:
- DETAILED_FINDINGS.md - Issue #5 "Missing Rate Limiting"
- `backend/app/middleware/rate_limit.py` - Current implementation
---
### Issue #5: HIGH - API Has No Versioning Strategy
**Where found**: