pdf2parallel

Getting Started

Extract sentences from PDFs with Apache Tika (Thai sentences are split with pythainlp, English sentences with nltk)

python extract_sentences.py --en_dir en_data/ --th_dir th_data/
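
As a rough illustration of this step, here is a minimal sketch of how text could be pulled from one PDF and split into sentences. It is not the code in extract_sentences.py: the function names (normalize_text, extract_text, split_sentences) are hypothetical, and it assumes the tika, pythainlp and nltk Python packages are installed, that Java is available for the Tika server, and that the nltk Punkt data has been downloaded.

```python
import re


def normalize_text(raw):
    """Collapse the runs of whitespace and line breaks that PDF extraction leaves behind."""
    return re.sub(r"\s+", " ", raw or "").strip()


def extract_text(pdf_path):
    """Return the plain text of one PDF using the tika-python client."""
    # Imported lazily: tika-python starts a local Tika server, which needs Java.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # "content" can be None, e.g. for scanned PDFs without a text layer.
    return normalize_text(parsed.get("content"))


def split_sentences(text, lang):
    """Split text into sentences with pythainlp (lang="th") or nltk (any other lang)."""
    if lang == "th":
        from pythainlp.tokenize import sent_tokenize as split
    else:
        # nltk needs its Punkt data first, e.g. nltk.download("punkt").
        from nltk.tokenize import sent_tokenize as split
    sentences = (s.strip() for s in split(text))
    return [s for s in sentences if s]
```

Collapsing line breaks into single spaces is a simplification; Thai text uses spaces as clause boundaries, so a real pipeline may want to treat PDF line breaks more carefully.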

Align the extracted English and Thai sentences and write the aligned pairs to a CSV file

python align_sentences_use.py --en_dir en_data/ --th_dir th_data/ --output_path assorted_government.csv
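
The idea behind embedding-based alignment can be sketched as follows. This assumes the "use" in the script name refers to a multilingual sentence encoder such as the Universal Sentence Encoder, which this README does not confirm. The sketch takes precomputed sentence vectors as input and does not call any encoder; the names cosine, align_pairs and min_score are hypothetical and not taken from align_sentences_use.py.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_pairs(en_sents, th_sents, en_vecs, th_vecs, min_score=0.5):
    """Pair each English sentence with its most similar Thai sentence.

    en_vecs and th_vecs are assumed to come from the same multilingual
    encoder, so that translations land close together in vector space.
    Pairs scoring below min_score (an illustrative threshold) are dropped.
    """
    pairs = []
    if not th_sents:
        return pairs
    for en, en_vec in zip(en_sents, en_vecs):
        scores = [cosine(en_vec, th_vec) for th_vec in th_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= min_score:
            pairs.append((en, th_sents[best], scores[best]))
    return pairs
```

This greedy matching lets several English sentences map to the same Thai sentence; a stricter one-to-one alignment would need an extra pass. The resulting tuples could be written out with the standard csv module.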