Word attributes missing in .vrt file #1

fbanados · 2024-07-29T18:30:31Z

Word Attributes missing:

lemma
dependency
analysis
gloss

As seen in

fbanados · 2024-07-29T18:33:09Z

Fields are currently being presented in two msd fields:

fbanados · 2024-07-29T19:58:28Z

Correcting the field mapping solves most of the issues with word attributes. However, dependencies are not present in the base .vrt file.

fbanados · 2024-07-29T20:23:31Z

Search criteria need to be well-formed in fields. I've changed the config so that at least the following cases work (recreated, following the PDF):

fbanados · 2024-07-29T20:25:35Z

To finish replicating the behaviour shown in the slides, we need to restore the msd and dependency fields in the .vrt file; I do not think they are there now. I believe that is a linguistics problem; the programming part should work now.

aarppe · 2024-07-29T20:59:09Z

This is excellent, as many of the lost features are now back! If this works for the A-W corpus, I can easily run the same analyses for the Bloomfield and Miscellaneous corpora, so that we can add them to our Cree text collection.

The msd (Morphosyntactic Description) field was a historical vestige and may have been reinterpreted as the analysis field. Not all words had a dependency field, though many did.

I can recreate the *.vrt file if you remind me of the column order of the linguistic fields.

fbanados · 2024-07-29T21:04:29Z

Currently the order is word lemma analysis gloss. Let me know where you place the dependency field and I'll update the script: https://github.com/UAlbertaALTLab/korp-config/blob/main/crk_WolfartAhenakew_encode.sh (Note that if you want to run the script locally, you need to change the path. I've committed the file I'm currently using locally; once this is deployed, it will point to the server paths.)

fbanados · 2024-07-29T21:05:20Z

My plan is that, once the new machine is set up, I'll put korp.altlab.dev there first for testing.

fbanados · 2024-07-29T21:08:11Z

Although it does not immediately have priority, we could address UAlbertaALTLab/korp-frontend#29 and UAlbertaALTLab/korp-frontend#24 as well.

aarppe · 2024-07-29T21:15:41Z

How the regular analysis process works is that the various fields get organized as follows, with tabs in between:

token	lemma	analysis	dependency	Gloss	Rest ...
niwâpamâw	wâpamêw	+V+TA+Ind+1Sg+3SgO	@PRED-TA	s/he sees s.o.	...

We could add RW or WN semantic classes, etc. One thing that would be good to sort out, though, is how spaces can be included in the fields without being mistaken for field delimiters, since only tabs are supposed to act as delimiters. For now, I've replaced spaces with  code, and + signs with dots.
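One way to confirm that spaces inside a field are not being treated as delimiters is to split on tabs only and count the fields per token line. The following is a minimal sketch, not part of the existing scripts: the function name `check_vrt_columns` is hypothetical, and it assumes that any non-empty line not starting with `<` is a token line.

```shell
# Hypothetical helper: report token lines that have fewer tab-separated
# fields than expected. Reads a .vrt stream on stdin.
# Assumption: structural lines start with "<"; blank lines are ignored.
check_vrt_columns() {
    # $1 = minimum number of fields expected on a token line (e.g. 5)
    awk -F '\t' -v min="$1" '
        /^</ || /^$/ { next }    # skip structural tags and blank lines
        NF < min { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit bad + 0 }     # non-zero exit status if any line was short
    '
}

# A gloss containing spaces still counts as one field, because only tabs split.
printf 'niwâpamâw\twâpamêw\t+V+TA+Ind+1Sg+3SgO\t@PRED-TA\ts/he sees s.o.\n' |
    check_vrt_columns 5
```

The check prints nothing and exits with status 0 when every token line has at least the expected number of fields.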

aarppe · 2024-07-29T21:16:23Z

UAlbertaALTLab/korp-frontend#24 would be the higher priority.

fbanados · 2024-07-29T21:24:12Z

> How the regular analysis process works is that the various fields get organized as follows, with tabs in between:
>
> token	lemma	analysis	dependency	Gloss	Rest ...
> niwâpamâw	wâpamêw	+V+TA+Ind+1Sg+3SgO	@PRED-TA	s/he sees s.o.	...

That order is ok!
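With that order settled, the positional attributes given to the encoder would have to follow it. The sketch below is an assumption about what the update might look like, not a copy of crk_WolfartAhenakew_encode.sh: it presumes the script calls CWB's `cwb-encode`, and the helper name, paths and registry name are placeholders. It only prints the command, so the attribute order can be inspected without running the encoder.

```shell
# Column order agreed in this thread. The first column (token) is the
# default "word" attribute in cwb-encode, so only the remaining ones are listed.
ATTRS="lemma analysis dependency gloss"

# Hypothetical helper: print a cwb-encode command line for the given paths.
print_encode_cmd() {
    # $1 = input .vrt file, $2 = data directory, $3 = registry file (placeholders)
    cmd="cwb-encode -d $2 -f $1 -R $3 -c utf8 -xsB"
    for a in $ATTRS; do
        cmd="$cmd -P $a"    # one -P per positional attribute, in column order
    done
    # Structural attributes (-S ...) depend on each corpus's metadata; omitted here.
    printf '%s\n' "$cmd"
}

print_encode_cmd corpus.vrt /path/to/data /path/to/registry/corpus
```

Keeping the attribute list in one variable means a later change to the column order touches a single line.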

> We could add RW or WN semantic classes, etc. One thing that would be good to sort out, though, is how spaces can be included in the fields without being mistaken for field delimiters, since only tabs are supposed to act as delimiters. For now, I've replaced spaces with  code, and + signs with dots.

I'm looking into that, as the frontend discusses multiple-value fields which would be what we want for fields like RW/WN/etc.

fbanados · 2024-07-29T21:30:36Z

> UAlbertaALTLab/korp-frontend#24 would be the higher priority.

It should not take me too long to create the config files for a new corpus, but the vrt files in the private repo only have the token field. I can use that if it's sufficient, but maybe we'd want to also have other fields? (analysis, gloss, lemma, etc.)

aarppe · 2024-07-29T21:36:52Z

Yeah, the point is that we have a skeleton *.vrt file with only the tokens and the structural metadata, to which all the layers of linguistic analysis can be added, and the analysis can be rerun when the analyzers improve. The scripts we've got should be able to accomplish the analyses quite quickly. The linguistic analysis fields would be exactly the same as for the Ahenakew-Wolfart corpus. What may differ is the meta-structure of the texts, as the available metadata is different.

aarppe · 2024-07-29T22:11:21Z

There's now a revised version of the analyzed VRT file for the A-W corpus, in: altlab/crk/generated/ahenakew_wolfart_fst+cg+gloss.vrt

This is done by the following sequence:

cat corpora/ahenakew_wolfart.vrt |
  gawk '{ if(match($0, ".+~$")!=0) sub("~$",""); print; }' |
  bin/fst-cg-analyze-vrt.sh analyser-gt-strict.hfstol /Users/arppe/gt/lang-crk/src/cg3/disambiguator.cg3 analyser-gt-relaxed.hfstol /Users/arppe/gt/lang-crk/src/cg3/functions.cg3 generator-gt-strict.hfstol |
  bin/vrt2korp.sh > generated/ahenakew_wolfart_fst+cg+gloss.vrt
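The gawk stage at the start of that sequence removes a trailing `~` from any line that has at least one character before it; a line consisting only of `~` is left as it is. A minimal sketch isolating that step, written with regex literals so that it should behave the same in gawk and other awk implementations (the function name is hypothetical):

```shell
# Hypothetical wrapper around the first pipeline stage:
# drop a line-final "~" only when something precedes it.
strip_trailing_tilde() {
    awk '{ if (match($0, /.+~$/)) sub(/~$/, ""); print }'
}

# Prints "token", then "~" (unchanged), then "plain".
printf 'token~\n~\nplain\n' | strip_trailing_tilde
```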

Some further conversion of special characters to HTML might be needed in the script: bin/vrt2korp.sh.
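If the conversion in question is the escaping of XML special characters in the token fields, one possible shape is sketched below. This is an assumption about what might be added, not the current contents of bin/vrt2korp.sh, and which characters actually need converting is not settled here. The function name is hypothetical. The sketch passes lines starting with `<` through unchanged, on the assumption that they are structural tags, and it would double-escape any entities already present in the input.

```shell
# Hypothetical filter: escape &, < and > on token lines of a .vrt stream.
escape_vrt_fields() {
    awk '
        /^</ { print; next }        # assumed structural tag: leave untouched
        {
            gsub(/&/, "\\&amp;")    # must run first, or the entities below get re-escaped
            gsub(/</, "\\&lt;")
            gsub(/>/, "\\&gt;")
            print
        }
    '
}

printf '<text id="1">\nA&B\tx<y\n' | escape_vrt_fields
```

A token that itself begins with `<` would be skipped by this rule, so a real implementation would need a stricter test for structural lines.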

aarppe · 2024-07-30T21:22:33Z

A first version of the Bloomfield corpus in VRT form is presented here: UAlbertaALTLab/korp-frontend#24 (comment)
