pdf2parallel

Getting Started

  1. Extract sentences from PDFs with Apache Tika (Thai sentences are segmented with pythainlp, English sentences with nltk)
python extract_sentences.py --en_dir en_data/ --th_dir th_data/
  2. Align sentences using the Universal Sentence Encoder
python align_sentences_use.py --en_dir en_data/ --th_dir th_data/ --output_path assorted_government.csv
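
For intuition about the alignment step, here is a minimal sketch of embedding-based sentence alignment. It is an illustration only, not the code in align_sentences_use.py: the function names `cosine` and `align`, the greedy one-to-one matching, the `threshold` value, and the toy three-dimensional vectors (standing in for real sentence embeddings) are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(en_vecs, th_vecs, threshold=0.5):
    """Greedy one-to-one alignment (hypothetical helper, not the repo's API).

    Scores every English/Thai pair, then accepts pairs from the highest
    score down, skipping any sentence that has already been matched and
    stopping once scores fall below `threshold`.
    """
    scored = sorted(
        (
            (cosine(e, t), i, j)
            for i, e in enumerate(en_vecs)
            for j, t in enumerate(th_vecs)
        ),
        reverse=True,
    )
    used_en, used_th, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_en or j in used_th:
            continue
        used_en.add(i)
        used_th.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


# Toy vectors standing in for sentence embeddings (assumed data).
en_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
th_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

pairs = align(en_vecs, th_vecs)
for en_idx, th_idx, score in pairs:
    print(f"en[{en_idx}] <-> th[{th_idx}]  similarity={score:.3f}")
```

In this toy run the third English vector has no Thai counterpart above the threshold, so it is left unaligned. In a real pipeline the vectors would come from a multilingual sentence encoder, and the threshold would need tuning on the actual data.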

Authors

  • @attapol - Extraction and normalization of Thai texts from PDF
  • @pinedbean - Universal Sentence Encoder inference code
  • @cstorm125 - Sentence alignment with the Universal Sentence Encoder

Acknowledgement

  • @pnphannisa - Sourcing government documents as PDF files