Release v2.0.0 · Goldziher/kreuzberg

Enhancements:

add sync methods (feature)
add detection of corrupt searchable text in PDFs, with automatic OCR fallback (feature)
add metadata extraction using Pandoc (feature)
add multi-sheet workbook (Excel etc.) extraction (feature)
add language, psm and max_processes keyword arguments (enhancement; api)
gate the typing-extensions dependency to Python 3.10 and below (enhancement; dependencies)
add multi-loop compatibility by switching from asyncio to anyio (enhancement; compatibility)
add managed worker processes for Pandoc and Tesseract using anyio.to_process (enhancement; performance)
replace xlsx2csv with python-calamine and improve the implementation to extract all sheets in a workbook (enhancement; performance)
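
The OCR fallback for corrupt searchable text listed above can be illustrated with a minimal sketch. This is not kreuzberg's implementation: the helper names (looks_corrupt, extract_with_fallback), the set of suspicious characters, and the 5% threshold are all assumptions chosen for illustration.

```python
import re
from typing import Callable

# Assumption: control characters (other than tab, newline, carriage return)
# and U+FFFD replacement characters signal a broken PDF text layer.
_SUSPICIOUS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]")


def looks_corrupt(text: str, threshold: float = 0.05) -> bool:
    """Return True when the text is empty or has too many suspicious characters."""
    stripped = text.strip()
    if not stripped:
        return True
    return len(_SUSPICIOUS.findall(stripped)) / len(stripped) > threshold


def extract_with_fallback(
    read_text_layer: Callable[[], str],
    run_ocr: Callable[[], str],
) -> str:
    """Prefer the embedded text layer; call OCR only if that text looks corrupt."""
    text = read_text_layer()
    return run_ocr() if looks_corrupt(text) else text
```

Both callables are hypothetical stand-ins: in practice read_text_layer would wrap a PDF text extractor and run_ocr would wrap a Tesseract call, so the expensive OCR step only runs when the cheap check fails.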

Breaking Changes:

Internal:

split the _extractors namespace into smaller packages and reorganized source code
add matrix tests against all supported Python versions (internal; testing)
refined ruff rules to enhance linting strictness
increase test coverage to >=99%
