feat(providers): detect HTML encoding before parsing

Add chardet-based _decode_html_content() to aniworld_provider. Apply
to all BeautifulSoup parsing calls to prevent decoding warnings on
pages with mismatched encoding declarations. Falls back to utf-8
with errors='replace' when confidence < 0.7.

Also fix test_enhanced_provider HLS test signature and add HLS
pattern unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
2026-05-25 16:30:36 +02:00
parent d5e955a731
commit ca93bb740a
6 changed files with 162 additions and 7 deletions

View File

@@ -41,6 +41,15 @@ This changelog follows [Keep a Changelog](https://keepachangelog.com/) principle
### Added
- **Encoding detection for HTML parsing** (`src/core/providers/aniworld_provider.py`):
Added `_decode_html_content()` function that uses `chardet` to detect the actual
encoding of HTML content before parsing. Falls back to UTF-8 with `errors='replace'`
to handle pages with mismatched encoding declarations. Applied to all BeautifulSoup
parsing calls to prevent "Some characters could not be decoded" warnings.
- **chardet dependency**: Added `chardet>=5.2.0` to `requirements.txt` for encoding detection.
### Added
- **Temp file cleanup after every download** (`src/core/providers/aniworld_provider.py`,
`src/core/providers/enhanced_provider.py`): Module-level helper
`_cleanup_temp_file()` removes the working temp file and any yt-dlp `.part`