v2.0.0

@Goldziher Goldziher released this 15 Feb 08:20
· 40 commits to main since this release
f9f319e

Enhancements:

  • add sync methods (feature)
  • add detection of corrupt searchable text in PDFs, with automatic OCR fallback (feature)
  • add metadata extraction using Pandoc (feature)
  • add multi-sheet spreadsheet (Excel etc.) extraction (feature)
  • add language, psm and max_processes keyword arguments (enhancement; api)
  • restrict the typing-extensions dependency to Python 3.10 and below (enhancement; dependencies)
  • add multi-loop compatibility by switching from asyncio to anyio (enhancement; compatibility)
  • add managed worker processes for Pandoc and Tesseract using anyio.to_process (enhancement; performance)
  • replace xlsx2csv with python-calamine and extract all sheets in a workbook (enhancement; performance)
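
To illustrate the OCR fallback item above, here is a minimal sketch of the control flow only. The release notes do not describe the actual detection logic, so the `looks_corrupt` heuristic, its threshold, and the `read_text_layer` / `run_ocr` callables are all hypothetical stand-ins rather than the library's implementation.

```python
from typing import Callable


def looks_corrupt(text: str, max_bad_ratio: float = 0.05) -> bool:
    """Hypothetical heuristic: treat text as corrupt when it is empty or
    contains a high share of replacement or control characters."""
    if not text.strip():
        return True
    bad = sum(
        1
        for ch in text
        # U+FFFD is the Unicode replacement character; whitespace such as
        # newlines is not printable but is still legitimate content.
        if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace())
    )
    return bad / len(text) > max_bad_ratio


def extract_pdf_text(
    read_text_layer: Callable[[], str],  # assumed: returns the embedded text
    run_ocr: Callable[[], str],          # assumed: returns OCR output
    *,
    force_ocr: bool = False,
) -> str:
    """Prefer the embedded text layer; fall back to OCR if it looks corrupt."""
    if not force_ocr:
        text = read_text_layer()
        if not looks_corrupt(text):
            return text
    return run_ocr()


if __name__ == "__main__":
    print(extract_pdf_text(lambda: "A readable page.", lambda: "OCR text"))
    print(extract_pdf_text(lambda: "\x00\x00\ufffd", lambda: "OCR text"))
```

Passing the two extraction steps in as callables keeps the sketch independent of any particular PDF or OCR backend.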

Breaking Changes:

  • updated ExtractionResult to include metadata (breaking change; api)
  • changed force_ocr to a kwarg (breaking change; api)
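
The following sketch shows how these two breaking changes could affect calling code. It assumes that the result type is tuple-like and gained a third `metadata` field, and that "kwarg" means `force_ocr` is now keyword-only. The `ExtractionResult` definition and the `extract_stub` function here are hypothetical stand-ins written for illustration, not the library's own code.

```python
from typing import Any, NamedTuple


class ExtractionResult(NamedTuple):
    """Hypothetical stand-in for the library's result type."""

    content: str
    mime_type: str
    metadata: dict[str, Any]  # assumed to be the newly added field


def extract_stub(path: str, *, force_ocr: bool = False) -> ExtractionResult:
    """Hypothetical stub: force_ocr can only be passed by keyword."""
    source = "ocr" if force_ocr else "text layer"
    return ExtractionResult(
        content=f"{path} via {source}",
        mime_type="text/plain",
        metadata={"title": "example"},
    )


if __name__ == "__main__":
    result = extract_stub("report.pdf", force_ocr=True)
    print(result.content, result.metadata)

    # Two-element unpacking written for the older shape no longer works.
    try:
        content, mime_type = result
    except ValueError as exc:
        print("unpacking failed:", exc)

    # Passing force_ocr positionally is rejected.
    try:
        extract_stub("report.pdf", True)  # type: ignore[misc]
    except TypeError as exc:
        print("positional call failed:", exc)
```

Under these assumptions, accessing fields by name (`result.content`, `result.metadata`) and passing `force_ocr=` explicitly are the forms that keep working.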

Internal:

  • split the _extractors namespace into smaller packages and reorganized source code
  • add matrix tests against all supported Python versions (internal; testing)
  • refined ruff rules to enhance linting strictness
  • increase test coverage to >=99%