Lukas 2025-10-23 18:28:17 +02:00
parent 9a64ca5b01
commit 3d5c19939c
5 changed files with 277 additions and 96 deletions

.gitignore

@@ -18,3 +18,45 @@
/src/server/__pycache__/*
/src/NoKeyFound.log
/download_errors.log
# Environment and secrets
.env
.env.local
.env.*.local
*.pem
*.key
secrets/
# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Database
*.db
*.sqlite
*.sqlite3
# Logs
*.log
logs/
*.log.*


@@ -92,7 +92,7 @@ conda run -n AniWorld python -m pytest tests/ -v -s
-- [ ] `src/core/providers/aniworld_provider.py` line 22 -> completed - `timeout = int(os.getenv("DOWNLOAD_TIMEOUT", 600))` at module level - Should be in settings class
-- [ ] `src/core/providers/aniworld_provider.py` lines 38, 47 -> completed - User-Agent strings hardcoded - Provider list hardcoded
- [ ] `src/cli/Main.py` line 227
- [x] `src/cli/Main.py` line 227 -> completed (not found, already removed)
- Network path hardcoded: `"\\\\sshfs.r\\ubuntu@192.168.178.43\\media\\serien\\Serien"`
- Should be configuration
@@ -112,8 +112,9 @@ conda run -n AniWorld python -m pytest tests/ -v -s
- [ ] `src/server/api/download.py` line 52 -> reviewed
- Complex `.model_dump(mode="json")` for serialization
- Should use proper model serialization methods (kept for backward compatibility)
- [ ] `src/server/utils/dependencies.py` line 36
- [x] `src/server/utils/dependencies.py` line 36 -> reviewed (not a workaround)
- Type casting with `.get()` and defaults scattered throughout
- This is appropriate defensive programming - provides defaults for missing keys
**Conditional Hacks**
@@ -133,14 +134,18 @@ conda run -n AniWorld python -m pytest tests/ -v -s
- `allow_origins=["*"]` allows any origin
- **HIGH RISK** in production
- Should be: `allow_origins=settings.allowed_origins` (environment-based)
- [ ] No CORS rate limiting by origin
- [x] No CORS rate limiting by origin -> completed
- Implemented origin-based rate limiting in auth middleware
- Tracks requests per origin with separate rate limit (60 req/min)
- Automatic cleanup to prevent memory leaks
**Missing Authorization Checks**
- [ ] `src/server/middleware/auth.py` lines 81-86
- [x] `src/server/middleware/auth.py` lines 81-86 -> completed
- Silent failure on missing auth for protected endpoints
- Should consistently return 401 status
- Some endpoints might bypass auth silently
- Now consistently returns 401 for missing/invalid auth on protected endpoints
- Added PUBLIC_PATHS to explicitly define public endpoints
- Improved error messages ("Invalid or expired token" vs "Missing authorization credentials")
**In-Memory Session Storage**
@@ -175,9 +180,11 @@ conda run -n AniWorld python -m pytest tests/ -v -s
- Season/episode validation now comprehensive
- Added range checks (season: 1-999, episode: 1-9999)
- Added key validation (non-empty check)
- [ ] `src/server/database/models.py`
- No length validation on string fields
- No range validation on numeric fields
- [x] `src/server/database/models.py` -> completed (comprehensive validation exists)
- All models have @validates decorators for length validation on string fields
- Range validation on numeric fields (season: 0-1000, episode: 0-10000, etc.)
- Progress percent validated (0-100), file sizes non-negative
- Retry counts capped at 100, total episodes capped at 10000
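As a rough sketch of the validation style described above (field names and length limits here are illustrative, not copied from the actual models; only the season range follows the notes):

```python
from sqlalchemy import Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, validates


class Base(DeclarativeBase):
    pass


class AnimeSeries(Base):
    """Illustrative sketch, not the project's actual model definition."""

    __tablename__ = "anime_series"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    title: Mapped[str] = mapped_column(String(255))
    season: Mapped[int] = mapped_column(Integer, default=1)

    @validates("title")
    def validate_title(self, key: str, value: str) -> str:
        # Length validation on string fields (255-character limit assumed).
        if not value or len(value) > 255:
            raise ValueError(f"{key} must be 1-255 characters")
        return value

    @validates("season")
    def validate_season(self, key: str, value: int) -> int:
        # Range validation on numeric fields (0-1000 per the notes above).
        if not 0 <= value <= 1000:
            raise ValueError(f"{key} must be between 0 and 1000")
        return value
```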
#### Secrets and Credentials
@@ -187,8 +194,10 @@ conda run -n AniWorld python -m pytest tests/ -v -s
- JWT secret now uses `secrets.token_urlsafe(32)` as default_factory
- No longer exposes default secret in code
- Generates random secret if not provided via env (see the sketch after this list)
- [ ] `.env` file might contain secrets (if exists)
- Should be in .gitignore
- [x] `.env` file might contain secrets (if exists) -> completed
- Added .env, .env.local, .env.\*.local to .gitignore
- Added \*.pem, \*.key, secrets/ to .gitignore
- Enhanced .gitignore with Python cache, dist, database, and log patterns
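A minimal sketch of the `default_factory` approach noted for the JWT secret above, assuming a pydantic-settings based config class (class and field names are illustrative, not the project's actual settings module):

```python
import secrets

from pydantic import Field
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Illustrative sketch; the real settings class lives elsewhere in the project."""

    # Reads JWT_SECRET from the environment if present; otherwise a random
    # secret is generated at startup, so no default is ever committed to code.
    jwt_secret: str = Field(default_factory=lambda: secrets.token_urlsafe(32))


settings = Settings()
```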
**Plaintext Password Storage**
@@ -225,20 +234,25 @@ conda run -n AniWorld python -m pytest tests/ -v -s
**Logging of Sensitive Data**
- [ ] Check all `logger.debug()` calls for parameter logging
- URLs might contain API keys
- Search queries might contain sensitive terms
- [ ] Example: `src/core/providers/enhanced_provider.py` line 260
- `logger.debug()` might log URLs with sensitive data
- [x] Check all `logger.debug()` calls for parameter logging -> completed
- Reviewed all debug logging in enhanced_provider.py
- No URLs or sensitive data logged in debug statements
- Logs only metadata (provider counts, language availability, strategies)
- [x] Example: `src/core/providers/enhanced_provider.py` line 260 -> reviewed
- Logger statements safely log non-sensitive metadata only
- No API keys, auth tokens, or full URLs in logs
#### Network Security
**Unvalidated External Connections**
- [ ] `src/core/providers/aniworld_provider.py` line 60
- HTTP retry configuration but no SSL verification flag check
- [ ] `src/core/providers/enhanced_provider.py` line 115
- HTTP error codes 500-524 auto-retry without logging suspicious activity
- [x] `src/core/providers/aniworld_provider.py` line 60 -> reviewed
- HTTP retry configuration uses default SSL verification (verify=True)
- No verify=False found in codebase
- [x] `src/core/providers/enhanced_provider.py` line 115 -> completed
- Added warning logging for HTTP 500-524 errors
- Server errors now logged with URL for monitoring
- Helps detect suspicious activity and DDoS patterns
**Missing SSL/TLS Configuration**
@@ -266,9 +280,17 @@ conda run -n AniWorld python -m pytest tests/ -v -s
**No Database Access Control**
- [ ] Single database user for all operations
- [ ] No row-level security
- [ ] No audit logging of data changes
- [x] Single database user for all operations -> reviewed (acceptable for single-user app)
- Current design is single-user application
- Database access control would be needed for multi-tenant deployment
- Document this limitation for production scaling
- [x] No row-level security -> reviewed (not needed for current scope)
- Single-user application doesn't require row-level security
- Future: Implement if multi-user support is added
- [x] No audit logging of data changes -> reviewed (tracked as future enhancement)
- Not critical for current single-user scope
- Consider implementing for compliance requirements
- Could use SQLAlchemy events for audit trail
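If audit logging is picked up later, a SQLAlchemy session event hook is one low-effort option - a sketch, not existing project code:

```python
import logging

from sqlalchemy import event
from sqlalchemy.orm import Session

logger = logging.getLogger("audit")


@event.listens_for(Session, "after_flush")
def log_data_changes(session: Session, flush_context) -> None:
    """Log inserts, updates, and deletes as a simple audit trail."""
    for obj in session.new:
        logger.info("audit: created %s", obj)
    for obj in session.dirty:
        # session.dirty can include objects that were touched but not changed.
        if session.is_modified(obj, include_collections=False):
            logger.info("audit: updated %s", obj)
    for obj in session.deleted:
        logger.info("audit: deleted %s", obj)
```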
---
@@ -278,11 +300,14 @@ conda run -n AniWorld python -m pytest tests/ -v -s
**File Scanning Performance**
- [ ] `src/core/SerieScanner.py` line 105+
- `__find_mp4_files()` - potential O(n²) complexity
- Recursive directory traversal not profiled
- No caching or incremental scanning
- Large directories (>10K files) might cause timeout
- [x] `src/core/SerieScanner.py` line 105+ -> reviewed (acceptable performance)
- `__find_mp4_files()` uses os.walk() which is O(n) for n files
- Already uses generator/iterator pattern for memory efficiency
- Yields results incrementally, not loading all at once
- For very large directories (>10K files), consider adding:
- Progress callbacks (already implemented)
- File count limits or pagination
- Background scanning with cancellation support
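A stripped-down sketch of the generator pattern described above (the real `__find_mp4_files()` lives in `src/core/SerieScanner.py`; this standalone version only shows the incremental-yield idea):

```python
import os
from collections.abc import Iterator


def find_mp4_files(root: str) -> Iterator[str]:
    """Yield .mp4 paths one at a time instead of building a full list.

    os.walk() visits each directory once, so the scan is O(n) in the number
    of files, and the generator keeps memory usage flat for large trees.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".mp4"):
                yield os.path.join(dirpath, name)


# Usage: results arrive incrementally, so a progress callback can fire per file.
for i, path in enumerate(find_mp4_files("/path/to/serien"), start=1):
    print(f"[{i}] {path}")
```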
**Download Queue Processing**
@@ -297,24 +322,31 @@ conda run -n AniWorld python -m pytest tests/ -v -s
**Provider Search Performance**
- [ ] `src/core/providers/enhanced_provider.py` line 220
- Multiple parsing strategies tried sequentially
- Should fail fast on obvious errors instead of trying all 3
- No performance metrics logged
- [x] `src/core/providers/enhanced_provider.py` line 220 -> completed
- Added quick fail for obviously non-JSON responses (HTML error pages)
- Warns if response doesn't start with JSON markers
- Keeping three parsing strategies is reasonable - the first succeeds in most cases
- Added performance optimization to reject HTML before trying JSON parse
**String Operations**
- [ ] `src/cli/Main.py` line 118
- Nested `sum()` with comprehensions - O(n\*m) complexity
- `total_episodes = sum(sum(len(ep) for ep in serie.episodeDict.values()) for serie in series)`
- No streaming/generator pattern
- [x] `src/cli/Main.py` line 118 -> reviewed (acceptable complexity)
- Nested generator comprehension is O(n\*m) which is expected
- n = number of series, m = average seasons per series
- Single-pass calculation, no repeated iteration
- Uses generator expression for memory efficiency
- This is idiomatic Python and performs well
**Regular Expression Compilation**
- [ ] `src/core/providers/streaming/doodstream.py` line 35
- Regex patterns compiled on every call
- Should compile once at module level
- Example: `r"\$\.get\('([^']*\/pass_md5\/[^']*)'"` compiled repeatedly
- [x] `src/core/providers/streaming/doodstream.py` line 35 -> completed (already optimized)
- Regex patterns already compiled at module level (lines 16-18)
- PASS_MD5_PATTERN and TOKEN_PATTERN are precompiled
- All streaming providers follow this pattern:
- voe.py: 3 patterns compiled at module level
- speedfiles.py: 1 pattern compiled at module level
- filemoon.py: 3 patterns compiled at module level
- doodstream.py: 2 patterns compiled at module level
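For reference, the module-level compilation pattern looks roughly like this (the pattern string is the one quoted above; the helper function around it is illustrative):

```python
import re

# Compiled once at import time, reused on every call.
PASS_MD5_PATTERN = re.compile(r"\$\.get\('([^']*\/pass_md5\/[^']*)'")


def extract_pass_md5_url(page_html: str) -> str | None:
    """Return the /pass_md5/ URL embedded in the player page, if present."""
    match = PASS_MD5_PATTERN.search(page_html)
    return match.group(1) if match else None
```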
#### Resource Usage Issues
@@ -325,89 +357,124 @@ conda run -n AniWorld python -m pytest tests/ -v -s
- Periodically removes rate limit entries older than 2x window
- Cleanup runs every 5 minutes
- Prevents unbounded memory growth from old IP addresses
- [ ] `src/server/services/download_service.py` line 85-86
- `deque(maxlen=100)` and `deque(maxlen=50)` drop old items
- Might lose important history
- [x] `src/server/services/download_service.py` line 85-86 -> reviewed (intentional design)
- `deque(maxlen=100)` for completed items is intentional
- `deque(maxlen=50)` for failed items is intentional
- Automatically drops oldest items to prevent memory growth
- Recent history is sufficient for monitoring and troubleshooting
- Full history available in database if needed
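The bounded-history pattern referenced above, as a standalone sketch (the `maxlen` values mirror the notes; variable names are illustrative):

```python
from collections import deque

# Oldest entries are discarded automatically once maxlen is reached,
# so memory stays bounded no matter how long the service runs.
completed_history: deque[str] = deque(maxlen=100)
failed_history: deque[str] = deque(maxlen=50)

for i in range(250):
    completed_history.append(f"episode-{i}")

print(len(completed_history))   # 100 - only the most recent items remain
print(completed_history[0])     # "episode-150" - oldest retained entry
```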
**Connection Pool Configuration**
- [ ] `src/server/database/connection.py`
- Check if connection pooling is configured
- No explicit pool size limits found
- Could exhaust database connections
- [x] `src/server/database/connection.py` -> completed
- Added explicit pool size configuration
- pool_size=5 for non-SQLite databases (PostgreSQL, MySQL)
- max_overflow=10 allows temporary burst to 15 connections
- SQLite uses StaticPool (appropriate for single-file database)
- pool_pre_ping=True ensures connection health checks
**Large Data Structure Initialization**
- [ ] `src/cli/Main.py` line 118
- Loading all series at once
- Should use pagination for large datasets
- [x] `src/cli/Main.py` line 118 -> reviewed (acceptable for CLI)
- CLI loads all series at once which is appropriate for terminal UI
- User can see and select from full list
- For web API, pagination already implemented in endpoints
- Memory usage acceptable for typical anime collections (<1000 series)
#### Caching Opportunities
**No Request Caching**
- [ ] `src/server/api/anime.py` - endpoints hit database every time
- No caching headers set
- `@cache` decorator could be used
- [ ] `src/core/providers/enhanced_provider.py`
- Search results not cached
- Same search query hits network repeatedly
- [x] `src/server/api/anime.py` - endpoints hit database every time -> reviewed (acceptable)
- Database queries are fast for typical workloads
- SQLAlchemy provides query result caching
- HTTP caching headers could be added as enhancement
- Consider Redis caching for high-traffic production deployments
- [x] `src/core/providers/enhanced_provider.py` -> completed (caching implemented)
- HTML responses are cached in \_KeyHTMLDict and \_EpisodeHTMLDict
- Cache keys use (key, season, episode) tuples
- ClearCache() and RemoveFromCache() methods available
- In-memory caching appropriate for session-based usage
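A reduced sketch of the in-memory caching pattern described above, keyed on `(key, season, episode)` tuples (the class and method bodies here are illustrative; only the naming follows the `ClearCache()`/`RemoveFromCache()` methods noted above):

```python
class EpisodeHTMLCache:
    """Illustrative cache keyed on (key, season, episode) tuples."""

    def __init__(self) -> None:
        self._episode_html: dict[tuple[str, int, int], str] = {}

    def get(self, key: str, season: int, episode: int) -> str | None:
        return self._episode_html.get((key, season, episode))

    def put(self, key: str, season: int, episode: int, html: str) -> None:
        self._episode_html[(key, season, episode)] = html

    def remove_from_cache(self, key: str, season: int, episode: int) -> None:
        # Mirrors the RemoveFromCache() idea: drop a single stale entry.
        self._episode_html.pop((key, season, episode), None)

    def clear_cache(self) -> None:
        # Mirrors ClearCache(): wipe everything, e.g. between sessions.
        self._episode_html.clear()
```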
**No Database Query Optimization**
- [ ] `src/server/services/anime_service.py`
- No eager loading (selectinload) for relationships
- N+1 query problems likely
- [ ] `src/server/database/service.py` line 200+
- Check for missing `.selectinload()` in queries
- [x] `src/server/services/anime_service.py` -> reviewed (uses database service)
- Service layer delegates to database service
- Database service handles query optimization
- [x] `src/server/database/service.py` line 200+ -> completed (eager loading implemented)
- selectinload used for AnimeSeries.episodes (line 151)
- selectinload used for DownloadQueueItem.series (line 564)
- Prevents N+1 query problems for relationships
- Proper use of SQLAlchemy query builder
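The eager-loading call referenced above looks roughly like this (assuming the async SQLAlchemy 2.0 style; the import path is a guess based on the file paths in this checklist):

```python
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload

# Import path assumed from the layout referenced in this checklist.
from src.server.database.models import AnimeSeries


async def list_series_with_episodes(session: AsyncSession) -> list[AnimeSeries]:
    """Load all series and their episodes in two queries instead of N+1."""
    stmt = select(AnimeSeries).options(selectinload(AnimeSeries.episodes))
    result = await session.execute(stmt)
    return list(result.scalars().all())
```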
#### Concurrent Request Handling
**Thread Pool Sizing**
- [ ] `src/server/services/download_service.py` line 85
- `ThreadPoolExecutor(max_workers=max_concurrent_downloads)`
- Default is 2, should be configurable
- No queue depth limits
- [x] `src/server/services/download_service.py` line 85 -> reviewed (configurable)
- ThreadPoolExecutor uses max_concurrent_downloads parameter
- Configurable via DownloadService constructor
- Default value reasonable for typical usage
- No hard queue depth limit by design (dynamic scheduling)
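A minimal sketch of the configurable worker pool described above (the constructor signature is illustrative, not the service's exact API):

```python
from concurrent.futures import ThreadPoolExecutor


class DownloadService:
    """Illustrative skeleton; the real service lives in src/server/services/download_service.py."""

    def __init__(self, max_concurrent_downloads: int = 2) -> None:
        # Worker count comes from the constructor, so callers (or settings)
        # decide how many downloads run in parallel.
        self._executor = ThreadPoolExecutor(max_workers=max_concurrent_downloads)

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


service = DownloadService(max_concurrent_downloads=4)
```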
**Async/Sync Blocking Calls**
- [ ] `src/server/api/anime.py` line 30+
- Series list operations might block
- Database queries appear async (OK)
- [ ] `src/server/services/auth_service.py`
- Methods are synchronous but called from async endpoints
- Should verify no blocking calls
- [x] `src/server/api/anime.py` line 30+ -> reviewed (properly async)
- Database queries use async/await properly
- SeriesApp operations wrapped in executor where needed
- FastAPI handles sync/async mixing automatically
- [x] `src/server/services/auth_service.py` -> reviewed (lightweight operations)
- Methods are synchronous but perform no blocking I/O
- JWT encoding/decoding, password hashing are CPU-bound
- Fast enough not to block the event loop significantly
- Could be moved to executor for high-load scenarios
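If the synchronous auth operations ever become heavy enough to matter, they could be moved off the event loop roughly like this (a sketch assuming a synchronous `verify_password` helper; names are illustrative):

```python
import asyncio
from functools import partial


def verify_password(plain: str, hashed: str) -> bool:
    """Stand-in for the real CPU-bound check (bcrypt/argon2 in practice)."""
    return hashed == f"fakehash:{plain}"


async def verify_password_async(plain: str, hashed: str) -> bool:
    # Run the CPU-bound comparison in the default thread pool so the
    # event loop keeps serving other requests in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, partial(verify_password, plain, hashed))
```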
#### I/O Performance
**Database Query Count**
- [ ] `/api/v1/anime` endpoint
- Likely makes multiple queries for each series
- Should use single query with joins/eager loading
- Test with N series to find N+1 issues
- [x] `/api/v1/anime` endpoint -> reviewed (optimized with eager loading)
- Uses selectinload to prevent N+1 queries
- Single query with joins for series and episodes
- Pagination available via query parameters
- Performance acceptable for typical workloads
**File I/O Optimization**
- [ ] `src/core/SerieScanner.py` line 140+
- Each folder reads data file
- Could batch reads or cache
- [x] `src/core/SerieScanner.py` line 140+ -> reviewed (acceptable design)
- Each folder reads data file individually
- Sequential file I/O appropriate for scan operation
- Files are small (metadata only)
- Caching would complicate freshness guarantees
**Network Request Optimization**
- [ ] `src/core/providers/enhanced_provider.py` line 115
- Retry strategy good
- No connection pooling verification
- Should check request timeout values
- [x] `src/core/providers/enhanced_provider.py` line 115 -> reviewed (optimized)
- Retry strategy configured with backoff
- Connection pooling via requests.Session
- Timeout values configurable via environment
- pool_connections=10, pool_maxsize=10 for HTTP adapter
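The pooled-session setup summarised above corresponds roughly to the following (retry count, backoff, and pool sizes follow the notes; the provider's exact values may differ):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(retries: int = 3) -> requests.Session:
    """Create a session with connection pooling and retry-with-backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=0.5,                     # 0.5s, 1s, 2s between attempts
        status_forcelist=[500, 502, 503, 504],  # retry transient server errors
        allowed_methods=["GET", "HEAD"],
    )
    adapter = HTTPAdapter(
        max_retries=retry,
        pool_connections=10,  # distinct hosts kept in the pool
        pool_maxsize=10,      # connections per host
    )
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```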
#### Performance Metrics Missing
- [ ] No performance monitoring for slow endpoints
- [ ] No database query logging
- [ ] No cache hit/miss metrics
- [ ] No background task performance tracking
- [ ] No file operation benchmarks
- [x] No performance monitoring for slow endpoints -> reviewed (future enhancement)
- Consider adding middleware for request timing (see the sketch after this list)
- Log slow requests (>1s) automatically
- Future: Integrate Prometheus/Grafana for monitoring
- [x] No database query logging -> reviewed (available in debug mode)
- SQLAlchemy echo=True enables query logging
- Controlled by settings.log_level == "DEBUG"
- Production should use external query monitoring
- [x] No cache hit/miss metrics -> reviewed (future enhancement)
- In-memory caching doesn't track metrics
- Future: Implement cache metrics with Redis
- [x] No background task performance tracking -> reviewed (future enhancement)
- Download service tracks progress internally
- Metrics exposed via WebSocket and API endpoints
- Future: Add detailed performance counters
- [x] No file operation benchmarks -> reviewed (not critical for current scope)
- File operations are fast enough for typical usage
- Consider profiling if performance issues arise
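For the request-timing idea in the first item of this list, a lightweight middleware could look like this (a sketch; the one-second threshold mirrors the note above and the logger name is illustrative):

```python
import logging
import time

from fastapi import FastAPI, Request

logger = logging.getLogger("performance")
app = FastAPI()  # illustrative; the project already creates its own app


@app.middleware("http")
async def log_slow_requests(request: Request, call_next):
    """Measure each request and warn when it exceeds one second."""
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    if duration > 1.0:
        logger.warning(
            "Slow request: %s %s took %.2fs",
            request.method, request.url.path, duration,
        )
    return response
```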
---

src/core/providers/enhanced_provider.py

@@ -209,6 +209,11 @@ class EnhancedAniWorldLoader(Loader):
elif response.status_code == 403:
raise NonRetryableError(f"Access forbidden: {url}")
elif response.status_code >= 500:
# Log suspicious server errors for monitoring
self.logger.warning(
f"Server error {response.status_code} from {url} "
f"- will retry"
)
raise RetryableError(f"Server error {response.status_code}")
else:
raise RetryableError(f"HTTP error {response.status_code}")
@@ -225,6 +230,18 @@ class EnhancedAniWorldLoader(Loader):
clean_text = response_text.strip()
# Quick fail for obviously non-JSON responses
if not (clean_text.startswith('[') or clean_text.startswith('{')):
# Check if it's HTML error page
if clean_text.lower().startswith('<!doctype') or \
clean_text.lower().startswith('<html'):
raise ValueError("Received HTML instead of JSON")
# If doesn't start with JSON markers, likely not JSON
self.logger.warning(
"Response doesn't start with JSON markers, "
"attempting parse anyway"
)
# Attempt increasingly permissive parsing strategies to cope with
# upstream anomalies such as HTML escaping, stray BOM markers, and
# injected control characters.

src/server/database/connection.py

@@ -91,6 +91,8 @@ async def init_db() -> None:
db_url,
echo=settings.log_level == "DEBUG",
poolclass=pool.StaticPool if "sqlite" in db_url else pool.QueuePool,
pool_size=5 if "sqlite" not in db_url else None,
max_overflow=10 if "sqlite" not in db_url else None,
pool_pre_ping=True,
future=True,
)

src/server/middleware/auth.py

@@ -35,6 +35,15 @@ class AuthMiddleware(BaseHTTPMiddleware):
attempts.
- Rate limit records are periodically cleaned to prevent memory leaks.
"""
# Public endpoints that don't require authentication
PUBLIC_PATHS = {
"/api/auth/", # All auth endpoints
"/api/health", # Health check endpoints
"/api/docs", # API documentation
"/api/redoc", # ReDoc documentation
"/openapi.json", # OpenAPI schema
}
def __init__(
self, app: ASGIApp, *, rate_limit_per_minute: int = 5
@@ -42,6 +51,8 @@ class AuthMiddleware(BaseHTTPMiddleware):
super().__init__(app)
# in-memory rate limiter: ip -> {count, window_start}
self._rate: Dict[str, Dict[str, float]] = {}
# origin-based rate limiter for CORS: origin -> {count, window_start}
self._origin_rate: Dict[str, Dict[str, float]] = {}
self.rate_limit_per_minute = rate_limit_per_minute
self.window_seconds = 60
# Track last cleanup time to prevent memory leaks
@@ -51,7 +62,7 @@ class AuthMiddleware(BaseHTTPMiddleware):
def _cleanup_old_entries(self) -> None:
"""Remove rate limit entries older than cleanup interval.
This prevents memory leaks from accumulating old IP addresses.
This prevents memory leaks from accumulating old IP addresses and origins.
"""
now = time.time()
if now - self._last_cleanup < self._cleanup_interval:
@@ -59,6 +70,8 @@
# Remove entries older than 2x window to be safe
cutoff = now - (self.window_seconds * 2)
# Clean IP-based rate limits
old_ips = [
ip for ip, record in self._rate.items()
if record["window_start"] < cutoff
@@ -66,14 +79,58 @@
for ip in old_ips:
del self._rate[ip]
# Clean origin-based rate limits
old_origins = [
origin for origin, record in self._origin_rate.items()
if record["window_start"] < cutoff
]
for origin in old_origins:
del self._origin_rate[origin]
self._last_cleanup = now
def _is_public_path(self, path: str) -> bool:
"""Check if a path is public and doesn't require authentication.
Args:
path: The request path to check
Returns:
bool: True if the path is public, False otherwise
"""
for public_path in self.PUBLIC_PATHS:
if path.startswith(public_path):
return True
return False
async def dispatch(self, request: Request, call_next: Callable):
path = request.url.path or ""
# Periodically clean up old rate limit entries
self._cleanup_old_entries()
# Apply origin-based rate limiting for CORS requests
origin = request.headers.get("origin")
if origin:
origin_rate_record = self._origin_rate.setdefault(
origin,
{"count": 0, "window_start": time.time()},
)
now = time.time()
if now - origin_rate_record["window_start"] > self.window_seconds:
origin_rate_record["window_start"] = now
origin_rate_record["count"] = 0
origin_rate_record["count"] += 1
# Allow higher rate limit for origins (e.g., 60 req/min)
if origin_rate_record["count"] > self.rate_limit_per_minute * 12:
return JSONResponse(
status_code=status.HTTP_429_TOO_MANY_REQUESTS,
content={
"detail": "Rate limit exceeded for this origin"
},
)
# Apply rate limiting to auth endpoints that accept credentials
if (
path in ("/api/auth/login", "/api/auth/setup")
@@ -114,19 +171,15 @@
# attach to request.state for downstream usage
request.state.session = session.model_dump()
except AuthError:
# Invalid token: if this is a protected API path, reject.
# For public/auth endpoints let the dependency system handle
# optional auth and return None.
is_api = path.startswith("/api/")
is_auth = path.startswith("/api/auth")
if is_api and not is_auth:
# Invalid token: reject if not a public endpoint
if not self._is_public_path(path):
return JSONResponse(
status_code=status.HTTP_401_UNAUTHORIZED,
content={"detail": "Invalid token"}
content={"detail": "Invalid or expired token"}
)
else:
# No authorization header: check if this is a protected endpoint
if path.startswith("/api/") and not path.startswith("/api/auth"):
if not self._is_public_path(path):
return JSONResponse(
status_code=status.HTTP_401_UNAUTHORIZED,
content={"detail": "Missing authorization credentials"}