sindhi-tokenization

Sindhi tokenization data from ISRA

A collection of text files. Token boundaries are marked in the tkns_ files and sentence boundaries in the stns_ files.

A tool in Stanza, convert_text_files.py, processes this data into a CoNLL-style format suitable for training a tokenizer. (The other annotations are left blank.)
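
To illustrate what "other annotations left blank" could look like, here is a minimal sketch. It is not the code from convert_text_files.py. It assumes "CoNLL-style" means the ten-column CoNLL-U layout, and it takes sentences that are already split into tokens, since the layout of the tkns_ and stns_ files is not described here. The function name `to_conllu` is hypothetical.

```python
def to_conllu(sentences):
    """Render tokenized sentences as CoNLL-U-style text with blank annotations.

    `sentences` is a list of sentences, each a list of token strings.
    Assumption: ten tab-separated columns (ID, FORM, plus eight others),
    with "_" standing in for every annotation that is not available.
    """
    blocks = []
    for tokens in sentences:
        rows = []
        for idx, form in enumerate(tokens, start=1):
            # ID and FORM are filled in; the remaining eight columns stay blank.
            rows.append("\t".join([str(idx), form] + ["_"] * 8))
        blocks.append("\n".join(rows))
    # A blank line follows every sentence, including the last one.
    return "\n\n".join(blocks) + "\n\n"


if __name__ == "__main__":
    # Placeholder tokens, not taken from the dataset.
    print(to_conllu([["tok1", "tok2", "."], ["tok3", "."]]), end="")
```

Only the token forms and the sentence breaks carry information in this output, which is all a tokenizer needs to learn from.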