Reverse Lemmatisation? #89

Open
blhills opened this issue Nov 26, 2020 · 5 comments

Comments

blhills commented Nov 26, 2020

Hey Jan, thanks for the awesome work. I have been using the R package to handle lemmatisation on media corpora for multiple Central and Eastern European languages. However, I am wondering if there is a way to essentially reverse the process.

So I can run this:

library(udpipe)

udmodel <- udpipe_download_model(language = "croatian")

x <- udpipe(x = "izbori izbore izbora izborima", object = udmodel)

x

  doc_id paragraph_id sentence_id                      sentence start end term_id token_id    token lemma upos  xpos
1   doc1            1           1 izbori izbore izbora izborima     1   6       1        1   izbori izbor VERB Vmr3s
2   doc1            1           1 izbori izbore izbora izborima     8  13       2        2   izbore izbor NOUN Ncmpa
3   doc1            1           1 izbori izbore izbora izborima    15  20       3        3   izbora izbor NOUN Ncmpg
4   doc1            1           1 izbori izbore izbora izborima    22  29       4        4 izborima izbor NOUN Ncmpd

But what I would like is a way to do something like:

x <- udpipe(x = "izbor", object = udmodel)

and have it return the list of "izbori, izbore, izbora, izborima"

Is this possible?

Contributor

jwijffels commented Nov 26, 2020

Hello blhills, no, the API currently cannot generate all inflected forms of a lemma. The lemma rules live in the C++ code but are buried deep behind the general API.
Maybe we can ask about this in the MorphoDiTa GitHub repository.

@jwijffels
Contributor

@foxik is there a part of the MorphoDiTa C++ API that generates all possible inflected forms of a lemma, or can this be easily accessed through the UDPipe C++ API?

foxik commented Nov 26, 2020

MorphoDiTa offers such functionality (https://ufal.mff.cuni.cz/morphodita/api-reference#morpho_generate), but it needs a morphological dictionary, which we have only for Czech, Slovak and English. In other words, UDPipe models have no notion of "valid forms for a given lemma": they are designed only for analysis, using rules like "remove -ed" and letting the tagger choose a valid result. Used for generation, these rules create a lot of invalid forms for a given lemma.
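
A toy sketch in R of why inverting an analysis rule over-generates. The single suffix rule and the three lemmas are made up for illustration and are not extracted from any UDPipe model; `strip_ed` and `append_ed` are hypothetical helper names.

```r
# Toy illustration only: the rule and the lemmas below are hypothetical,
# not taken from a UDPipe model.

# Analysis direction: strip a trailing "ed" to propose a lemma candidate.
strip_ed <- function(token) sub("ed$", "", token)

# Generation direction: naively invert the rule by appending "ed".
append_ed <- function(lemma) paste0(lemma, "ed")

# Analysis yields a plausible candidate that a tagger could accept or reject.
candidate <- strip_ed("walked")

# Generation applies the same rule blindly to every lemma.
lemmas <- c("walk", "go", "run")
generated <- append_ed(lemmas)

# "walked" is a real form, but "goed" and "runed" are not valid English.
print(generated)
```

Without a dictionary listing which forms actually exist, nothing in the rule itself can filter out the invalid results.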

@jwijffels
Contributor

Thank you Milan.

@blhills I think the easiest approach is to lemmatise your corpus of news articles and keep the resulting token/lemma combinations.
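
A minimal sketch of that idea in base R, assuming the annotation is a data frame with `token` and `lemma` columns like the udpipe output at the top of this thread. The data frame here is typed in by hand from that output (with one repeated token added) so the example runs without downloading a model; `forms_of` is a hypothetical helper name, not part of udpipe.

```r
# Hand-copied token/lemma pairs from the Croatian example in this thread;
# in practice this would be the data frame returned by udpipe() on the corpus.
anno <- data.frame(
  token = c("izbori", "izbore", "izbora", "izborima", "izbora"),
  lemma = c("izbor", "izbor", "izbor", "izbor", "izbor"),
  stringsAsFactors = FALSE
)

# Build a lookup: one entry per lemma, holding the unique observed forms.
lemma_forms <- lapply(split(tolower(anno$token), anno$lemma), unique)

# Hypothetical helper: return the forms seen for a lemma (empty if unseen).
forms_of <- function(lemma) {
  forms <- lemma_forms[[lemma]]
  if (is.null(forms)) character(0) else forms
}

print(forms_of("izbor"))
```

The lookup only contains forms that actually occur in the annotated corpus, so its coverage depends on how large and varied that corpus is.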

Author

blhills commented Nov 27, 2020

Hmm, yeah, so just build my own dataset of lemmas and inflections and query that dataset when I want to find the appropriate words.

It's one of those things that logically seemed pretty simple, so I thought perhaps I had overlooked a way of doing it in the package.

Anyway, thanks for the help and the great package. It is one of the best tools I have found for the work I'm doing.
