v2.0.0
Enhancements:
- add sync methods (feature)
- add detection of corrupt searchable text in PDFs, with automatic OCR fallback (feature)
- add metadata extraction using Pandoc (feature)
- add multi-sheet spreadsheet (Excel etc.) extraction (feature)
- add `language`, `psm` and `max_processes` keyword arguments (enhancement; api)
- gated `typing-extensions` to Python 3.10 and below (enhancement; dependencies)
- added multi-loop compatibility by switching from `asyncio` to `anyio` (enhancement; compatibility)
- added managed worker processes for Pandoc and Tesseract using `anyio.to_process` (enhancement; performance)
- replaced `xlsx2csv` with `python-calamine` and improved implementation to extract all sheets in a workbook (enhancement; performance)
Breaking Changes:
- updated `ExtractionResult` to include `metadata` (breaking change; api)
- changed `force_ocr` to a kwarg (breaking change; api)
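
A minimal sketch of how these two breaking changes can affect calling code. Everything here is a hypothetical stand-in, not the library's actual implementation: the `ExtractionResult` tuple shape, the `extract` function, and its return values are assumptions made for illustration. It assumes a tuple-like result type and a keyword-only `force_ocr` parameter.

```python
from typing import Any, NamedTuple


class ExtractionResult(NamedTuple):
    """Hypothetical stand-in for a result type that gained a `metadata` field."""

    content: str
    mime_type: str
    metadata: dict[str, Any]


def extract(path: str, *, force_ocr: bool = False) -> ExtractionResult:
    """Hypothetical extractor; parameters after `*` must be passed by keyword."""
    content = f"{'ocr' if force_ocr else 'text'}:{path}"
    return ExtractionResult(content, "text/plain", {"source": path})


# Passing the flag by keyword works.
result = extract("report.pdf", force_ocr=True)

# Unpacking must now account for the third field.
content, mime_type, metadata = result

# Passing the flag positionally raises TypeError.
try:
    extract("report.pdf", True)
except TypeError as exc:
    print(f"positional call rejected: {exc}")
```

Under these assumptions, code that unpacked the result into two names or passed `force_ocr` positionally would need updating, while attribute access such as `result.content` would keep working.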
Internal:
- split the _extractors namespace into smaller packages and reorganized source code
- add matrix tests against all supported Python versions (internal; testing)
- refined ruff rules to enhance linting strictness
- increase test coverage to >=99%