This repository has been archived by the owner on Aug 12, 2020. It is now read-only.

Existing Work: Readers/Writers/Datatypes #11

Open
sebpuetz opened this issue May 8, 2019 · 2 comments

Comments

@sebpuetz

sebpuetz commented May 8, 2019

E.g., for the CoNLL format(s).
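To illustrate what a reader for one of these formats has to handle, here is a minimal sketch for CoNLL-X, assuming the standard ten tab-separated columns per token, `_` for unset values, and blank lines between sentences. The `Token` type and the `parse_token` / `parse_sentences` functions are hypothetical names made up for this example and are not the API of any existing crate.

```rust
// Hypothetical token type; field names follow the CoNLL-X column names.
#[derive(Debug, PartialEq)]
struct Token {
    id: usize,
    form: String,
    lemma: Option<String>,
    cpostag: Option<String>,
    postag: Option<String>,
    feats: Option<String>,
    head: Option<usize>,
    deprel: Option<String>,
    phead: Option<usize>,
    pdeprel: Option<String>,
}

// Map the "_" placeholder to None, anything else to Some(value).
fn opt(field: &str) -> Option<String> {
    if field == "_" {
        None
    } else {
        Some(field.to_string())
    }
}

// Like `opt`, but for numeric columns (HEAD, PHEAD).
fn opt_num(field: &str) -> Result<Option<usize>, String> {
    if field == "_" {
        Ok(None)
    } else {
        field
            .parse::<usize>()
            .map(Some)
            .map_err(|e| format!("bad number {:?}: {}", field, e))
    }
}

// Parse a single token line with exactly ten tab-separated columns.
fn parse_token(line: &str) -> Result<Token, String> {
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 10 {
        return Err(format!("expected 10 columns, found {}", cols.len()));
    }
    Ok(Token {
        id: cols[0]
            .parse::<usize>()
            .map_err(|e| format!("bad ID {:?}: {}", cols[0], e))?,
        form: cols[1].to_string(),
        lemma: opt(cols[2]),
        cpostag: opt(cols[3]),
        postag: opt(cols[4]),
        feats: opt(cols[5]),
        head: opt_num(cols[6])?,
        deprel: opt(cols[7]),
        phead: opt_num(cols[8])?,
        pdeprel: opt(cols[9]),
    })
}

// Split the input into sentences at blank lines and parse every token line.
fn parse_sentences(input: &str) -> Result<Vec<Vec<Token>>, String> {
    let mut sentences = Vec::new();
    let mut current = Vec::new();
    for line in input.lines() {
        if line.trim().is_empty() {
            if !current.is_empty() {
                sentences.push(std::mem::take(&mut current));
            }
        } else {
            current.push(parse_token(line)?);
        }
    }
    if !current.is_empty() {
        sentences.push(current);
    }
    Ok(sentences)
}
```

The sketch keeps FEATS as a raw string and treats a literal `_` word form as ordinary text; a real reader would likely need to decide on both, and CoNLL-U adds comment lines and multiword tokens that are not covered here.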

@rth

rth commented May 8, 2019

For parsers of NLP-related formats, there are, e.g.:

  • CoNLL-X reader / writer by @danieldk: https://github.com/danieldk/conllx-rs
  • CoNLL-U, CoNLL-X, CoreNLP CoNLL parsers: https://docs.rs/nlp-io/ . Unfortunately, the underlying repo no longer exists and the releases were yanked from crates.io.
    We could try to contact the authors once this repo is in a bit better shape.
  • Another format I would be interested in is some way to represent sparse document-term matrices. It's maybe more related to serializing sparse data in general. For instance,

@sebpuetz

I just released the first proper version of a crate to read and process constituency trees at https://github.com/sebpuetz/lumberjack.

The crate is still rather unpolished and I'm unsure about what the public API should be, but it supports reading the NEGRA export format and various flavours of bracketed trees, as well as conversion to and from @danieldk's conllx format, with and without encoded constituency structure. A number of operations on the trees are also possible, such as filtering specific non-terminals.

There is also another Rust crate for reading bracketed constituency trees, which is inactive, at https://github.com/sjmielke/ptb-reader-rust.
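For readers unfamiliar with bracketed constituency trees, here is a minimal sketch of reading one, assuming PTB-style input such as `(S (NP (DT the) (NN dog)) (VP (VBZ barks)))`. The `Tree` type and the `parse_tree` / `yield_forms` functions are hypothetical and written only for illustration; they are not the API of lumberjack or ptb-reader-rust.

```rust
// Hypothetical tree type for this sketch.
#[derive(Debug, PartialEq)]
enum Tree {
    // A pre-terminal such as (DT the): part-of-speech tag plus word form.
    Terminal { tag: String, form: String },
    // A non-terminal such as (NP ...): label plus child nodes.
    NonTerminal { label: String, children: Vec<Tree> },
}

// Split the input into "(", ")" and whitespace-delimited atoms.
fn tokenize(input: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut atom = String::new();
    for c in input.chars() {
        if c == '(' || c == ')' || c.is_whitespace() {
            if !atom.is_empty() {
                tokens.push(std::mem::take(&mut atom));
            }
            if !c.is_whitespace() {
                tokens.push(c.to_string());
            }
        } else {
            atom.push(c);
        }
    }
    if !atom.is_empty() {
        tokens.push(atom);
    }
    tokens
}

// Parse one node starting at tokens[*idx], advancing *idx past it.
fn parse_node(tokens: &[String], idx: &mut usize) -> Result<Tree, String> {
    if tokens.get(*idx).map(String::as_str) != Some("(") {
        return Err(format!("expected '(' at token {}", *idx));
    }
    *idx += 1;
    let label = match tokens.get(*idx).map(String::as_str) {
        Some("(") | Some(")") | None => {
            return Err(format!("expected a label at token {}", *idx));
        }
        Some(t) => t.to_string(),
    };
    *idx += 1;
    let mut children = Vec::new();
    loop {
        match tokens.get(*idx).map(String::as_str) {
            // A nested bracket starts a child node.
            Some("(") => children.push(parse_node(tokens, idx)?),
            // A closing bracket ends this non-terminal.
            Some(")") => {
                *idx += 1;
                if children.is_empty() {
                    return Err(format!("node '{}' has no children", label));
                }
                return Ok(Tree::NonTerminal { label, children });
            }
            // A bare atom after the label is a word form: (TAG form).
            Some(form) => {
                let form = form.to_string();
                *idx += 1;
                let closed = tokens.get(*idx).map(String::as_str) == Some(")");
                if !children.is_empty() || !closed {
                    return Err(format!("malformed terminal near token {}", *idx));
                }
                *idx += 1;
                return Ok(Tree::Terminal { tag: label, form });
            }
            None => return Err("unexpected end of input".to_string()),
        }
    }
}

// Parse a whole string that contains exactly one tree.
fn parse_tree(input: &str) -> Result<Tree, String> {
    let tokens = tokenize(input);
    let mut idx = 0;
    let tree = parse_node(&tokens, &mut idx)?;
    if idx != tokens.len() {
        return Err("trailing tokens after tree".to_string());
    }
    Ok(tree)
}

// Collect the word forms of a tree from left to right.
fn yield_forms(tree: &Tree) -> Vec<&str> {
    match tree {
        Tree::Terminal { form, .. } => vec![form.as_str()],
        Tree::NonTerminal { children, .. } => {
            children.iter().flat_map(|c| yield_forms(c)).collect()
        }
    }
}
```

This handles only the simplest flavour: it does not cover the empty-label root wrapper found in some treebank files, escaped brackets, or traces, which is presumably part of what distinguishes the "various flavours" mentioned above.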
