sindhi-tokenization

Sindhi tokenization data from ISRA

A collection of text files, with token and sentence boundaries marked in the tkns_ and stns_ files respectively.

A tool in Stanza, convert_text_files.py, converts this data into a CoNLL-style format suitable for training a tokenizer. (The other annotations are left blank.)
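
To illustrate what "other annotations left blank" can look like, here is a minimal sketch that assumes the target is the 10-column CoNLL-U layout, where an unfilled field is written as `_`. It is not the code in convert_text_files.py; the function name `sentences_to_conllu` and the placeholder tokens are hypothetical, and the input is assumed to be already split into sentences and tokens.

```python
def sentences_to_conllu(sentences):
    """Render tokenized sentences as CoNLL-U-style text.

    Only the ID and FORM columns are filled; the remaining eight
    columns are written as "_" (assumption: CoNLL-U layout).
    """
    blocks = []
    for sentence in sentences:
        lines = []
        for index, token in enumerate(sentence, start=1):
            # ID, FORM, then 8 unfilled annotation fields, tab-separated.
            lines.append("\t".join([str(index), token] + ["_"] * 8))
        blocks.append("\n".join(lines))
    if not blocks:
        return ""
    # Each sentence block is followed by a blank line.
    return "\n\n".join(blocks) + "\n\n"


if __name__ == "__main__":
    # Placeholder tokens, not taken from the dataset.
    print(sentences_to_conllu([["tok1", "tok2"], ["tok3"]]), end="")
```

Each output row carries a token index and the token itself, so a tokenizer trainer can recover both token and sentence boundaries without any further annotation.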