Word attributes missing in .vrt file #1

Open
3 of 4 tasks
fbanados opened this issue Jul 29, 2024 · 15 comments

Comments

@fbanados
Member

fbanados commented Jul 29, 2024

Word Attributes missing:

  • lemma
  • dependency
  • analysis
  • gloss

As seen in
Screenshot 2024-07-29 at 12 29 05 PM
Screenshot 2024-07-29 at 12 29 37 PM
Screenshot 2024-07-29 at 12 29 47 PM

@fbanados
Member Author

Fields are currently being presented in two msd fields:
Screenshot 2024-07-29 at 12 33 03 PM

@fbanados
Member Author

Correcting the mapping of fields solves most of the issues with word attributes. However, dependencies are not present in the base vrt file.

@fbanados
Member Author

Search criteria need to be well-formed in the fields. I've changed the config so that searches at least work in the following cases (recreated from the PDF):
Screenshot 2024-07-29 at 1 59 27 PM
Screenshot 2024-07-29 at 2 16 27 PM
Screenshot 2024-07-29 at 2 18 06 PM
Screenshot 2024-07-29 at 2 19 16 PM
Screenshot 2024-07-29 at 2 20 12 PM
Screenshot 2024-07-29 at 2 22 08 PM
Screenshot 2024-07-29 at 2 23 01 PM

@fbanados
Member Author

fbanados commented Jul 29, 2024

To finish replicating the behaviour shown in the slides, we need to recover the msd and dependency fields into the vrt file. I do not think they are there at the moment. I believe that is a task for the linguists; the programming part should work now.

@aarppe

aarppe commented Jul 29, 2024

This is excellent, as many of the lost features are now back! If this works for the A-W corpus, I can easily run the same analyses for the Bloomfield and Miscellaneous corpora, so that we can add them to our Cree text collection.

The msd (Morphosyntactic Description) field was a historical vestige and may have been reinterpreted as the analysis field. Not all words had a dependency field, though many did.

I can recreate the *.vrt file if you remind me of the column order of the linguistic fields.

@fbanados
Member Author

Currently the order is word, lemma, analysis, gloss. Let me know where you place the dependency field and I'll update the script: https://github.com/UAlbertaALTLab/korp-config/blob/main/crk_WolfartAhenakew_encode.sh (Note that if you want to run the script locally, you need to change the path. I've committed the file I'm currently using locally; once this is deployed, it will of course point to the server paths.)

@fbanados
Member Author

My plan is that, once the new machine is set up, I'll put korp.altlab.dev there first for testing.

@fbanados
Member Author

Although it does not immediately have priority, we could address UAlbertaALTLab/korp-frontend#29 and UAlbertaALTLab/korp-frontend#24 as well.

@aarppe

aarppe commented Jul 29, 2024

How the regular analysis process works is that the various fields get organized as follows, with tabs in between:

token lemma analysis dependency Gloss Rest ...
niwâpamâw wâpamêw +V+TA+Ind+1Sg+3SgO @PRED-TA s/he sees s.o. ...

We could add RW or WN semantic classes, etc. One thing that would be good to sort out, though, is how spaces can be included in the fields without being mistaken for field delimiters, which are supposed to be tabs only. For now, I've replaced spaces with a placeholder code, and + signs with dots.
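
One way to picture the space problem described above, as a minimal sketch rather than what `bin/vrt2korp.sh` actually does: the function below replaces spaces inside each tab-separated field, so that a gloss such as `s/he sees s.o.` stays in a single column. The name `protect_spaces` and the `_` placeholder are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper, not part of the existing scripts.
# Replaces spaces inside tab-separated fields with a placeholder ($1),
# leaving structural lines (those starting with "<") untouched.
# Caveat: "&" and "\" are special in awk replacement strings, so a
# placeholder containing them would need extra escaping.
protect_spaces() {
  awk -v ph="$1" 'BEGIN { FS = OFS = "\t" }
    /^</ { print; next }
    { for (i = 1; i <= NF; i++) gsub(/ /, ph, $i); print }'
}

# Example line from this thread, with real tabs between the five fields.
printf 'niwâpamâw\twâpamêw\t+V+TA+Ind+1Sg+3SgO\t@PRED-TA\ts/he sees s.o.\n' |
  protect_spaces '_'
```

After this step the line still has exactly five tab-separated fields, and the gloss column reads `s/he_sees_s.o.`.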

@aarppe

aarppe commented Jul 29, 2024

UAlbertaALTLab/korp-frontend#24 would be the higher priority.

@fbanados
Member Author

How the regular analysis process works is that the various fields get organized as follows, with tabs in between:

token lemma analysis dependency Gloss Rest ...
niwâpamâw wâpamêw +V+TA+Ind+1Sg+3SgO @PRED-TA s/he sees s.o. ...

That order is ok!
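
With that order settled, the positional attributes could be declared to `cwb-encode` in the same sequence. The snippet below is only a sketch: `p_flags` is a hypothetical helper, and the actual options in `crk_WolfartAhenakew_encode.sh` may differ. It relies on `cwb-encode` treating the first column as the default `word` attribute and taking each further column from a `-P` option, in order.

```shell
#!/bin/sh
# Hypothetical helper: build the -P options for cwb-encode from a column order.
# The first column (the token) is CWB's default "word" attribute and needs no -P.
p_flags() {
  flags=""
  for attr in "$@"; do
    flags="${flags:+$flags }-P $attr"
  done
  printf '%s\n' "$flags"
}

# Columns after the token, in the order agreed above.
p_flags lemma analysis dependency gloss
# prints: -P lemma -P analysis -P dependency -P gloss
```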

We could add RW or WN semantic classes, etc. One thing that would be good to sort out, though, is how spaces can be included in the fields without being mistaken for field delimiters, which are supposed to be tabs only. For now, I've replaced spaces with a placeholder code, and + signs with dots.

I'm looking into that, as the frontend discusses multiple-value fields which would be what we want for fields like RW/WN/etc.

@fbanados
Member Author

UAlbertaALTLab/korp-frontend#24 would be the higher priority.

It should not take me too long to create the config files for a new corpus, but the vrt files in the private repo only have the token field. I can use that if it's sufficient, but maybe we'd want to also have other fields? (analysis, gloss, lemma, etc.)

@aarppe

aarppe commented Jul 29, 2024

Yeah, the point is that we have a skeleton *.vrt file with only the tokens and the structural metadata, on which all the layers of linguistic analysis can be added, and rerun when the analyzers improve. The scripts we've got should be able to accomplish the analyses quite quickly. The linguistic analysis fields would be exactly the same as for the Ahenakew-Wolfart corpus. What would differ potentially is the meta-structure of the texts, as the available metadata is different.
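
Since the skeleton files carry only the token column, a quick count of columns can show whether a given *.vrt file has had the analysis layers added. This is a sketch under assumptions: `check_columns` is a hypothetical name, and it treats any line starting with `<` as structural markup rather than a token line.

```shell
#!/bin/sh
# Hypothetical check: report token lines whose number of tab-separated
# fields differs from the expected count ($1). Exits non-zero if any do.
check_columns() {
  awk -F '\t' -v want="$1" '
    /^</ || /^$/ { next }
    NF != want { printf "line %d: %d field(s)\n", NR, NF; bad = 1 }
    END { exit bad + 0 }'
}
```

It could be run as `check_columns 5 < some_corpus.vrt`, where 5 assumes the token, lemma, analysis, dependency and gloss columns and no further ones.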

@aarppe

aarppe commented Jul 29, 2024

There's now a revised version of the analyzed VRT file for the A-W corpus, in: altlab/crk/generated/ahenakew_wolfart_fst+cg+gloss.vrt

This is done by the following sequence:

cat corpora/ahenakew_wolfart.vrt |
  gawk '{ if(match($0, ".+~$")!=0) sub("~$",""); print; }' |
  bin/fst-cg-analyze-vrt.sh analyser-gt-strict.hfstol \
    /Users/arppe/gt/lang-crk/src/cg3/disambiguator.cg3 \
    analyser-gt-relaxed.hfstol \
    /Users/arppe/gt/lang-crk/src/cg3/functions.cg3 \
    generator-gt-strict.hfstol |
  bin/vrt2korp.sh > generated/ahenakew_wolfart_fst+cg+gloss.vrt

Some further conversion of special characters to HTML might be needed in the script: bin/vrt2korp.sh.
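
For reference, the `gawk` stage at the start of that pipeline only strips a trailing `~` from lines that have other content before it. A standalone sketch of the same logic, using plain `awk` and a hypothetical function name:

```shell
#!/bin/sh
# Same logic as the gawk stage in the pipeline: drop a trailing "~",
# but only when at least one other character precedes it.
strip_trailing_tilde() {
  awk '/.+~$/ { sub(/~$/, "") } { print }'
}

# Three sample lines: a token with a trailing "~", a lone "~", a plain token.
printf 'niwâpamâw~\n~\nniwâpamâw\n' | strip_trailing_tilde
```

The first line loses its `~`, while the lone `~` and the plain token pass through unchanged.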

@aarppe

aarppe commented Jul 30, 2024

A first version of the Bloomfield corpus in VRT form is presented here: UAlbertaALTLab/korp-frontend#24 (comment)
