feat(providers): detect HTML encoding before parsing
Add chardet-based _decode_html_content() to aniworld_provider. Apply to all BeautifulSoup parsing calls to prevent decoding warnings on pages with mismatched encoding declarations. Falls back to utf-8 with errors='replace' when confidence < 0.7. Also fix test_enhanced_provider HLS test signature and add HLS pattern unit tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
@@ -41,6 +41,15 @@ This changelog follows [Keep a Changelog](https://keepachangelog.com/) principle
|
||||
|
||||
### Added
|
||||
|
||||
- **Encoding detection for HTML parsing** (`src/core/providers/aniworld_provider.py`):
|
||||
Added `_decode_html_content()` function that uses `chardet` to detect the actual
|
||||
encoding of HTML content before parsing. Falls back to UTF-8 with `errors='replace'`
|
||||
to handle pages with mismatched encoding declarations. Applied to all BeautifulSoup
|
||||
parsing calls to prevent "Some characters could not be decoded" warnings.
|
||||
- **chardet dependency**: Added `chardet>=5.2.0` to `requirements.txt` for encoding detection.
|
||||
|
||||
### Added
|
||||
|
||||
- **Temp file cleanup after every download** (`src/core/providers/aniworld_provider.py`,
|
||||
`src/core/providers/enhanced_provider.py`): Module-level helper
|
||||
`_cleanup_temp_file()` removes the working temp file and any yt-dlp `.part`
|
||||
|
||||
@@ -342,6 +342,35 @@ Doodstream alternates between `dood.li`, `dood.so`, `dood.la`). Only the
|
||||
referer hints in `PROVIDER_HEADERS` are persisted — discovery still happens
|
||||
at runtime through AniWorld's redirect endpoint.
|
||||
|
||||
### HLS Stream Handling
|
||||
|
||||
HLS (HTTP Live Streaming) manifests (`.m3u8`) require yt-dlp to use the
|
||||
`ffmpeg` downloader with `--hls-use-mpegts`. Both providers configure this
|
||||
automatically:
|
||||
|
||||
```python
|
||||
ydl_opts = {
|
||||
"downloader": "ffmpeg", # Use ffmpeg instead of native
|
||||
"hls_use_mpegts": True, # Write transport stream (.ts) segments
|
||||
}
|
||||
```
|
||||
|
||||
**Why this matters**: Without ffmpeg, yt-dlp logs:
|
||||
`"Live HLS streams are not supported by the native downloader"`
|
||||
|
||||
**Requirements**:
|
||||
- ffmpeg must be installed and in PATH (`which ffmpeg`)
|
||||
- Install: `apt install ffmpeg` (Debian/Ubuntu) or `brew install ffmpeg` (macOS)
|
||||
- Startup health check (see Health Check Endpoints) verifies ffmpeg presence
|
||||
|
||||
**Trade-offs**:
|
||||
- HLS downloads are slower than direct MP4 (reassembly of .ts segments)
|
||||
- Requires more disk space during download
|
||||
- May need post-processing if .ts format is not desired
|
||||
|
||||
**Detection**: VOE provider extracts HLS URLs via `HLS_PATTERN` regex. Other
|
||||
providers let yt-dlp auto-detect from URL/content-type.
|
||||
|
||||
### Updating yt-dlp
|
||||
|
||||
When extractors break (typical symptoms: every provider HEAD probe succeeds
|
||||
|
||||
Reference in New Issue
Block a user