Reverse Lemmatisation? #89
Comments
Hello blhills, no, the API currently cannot generate all inflected forms of a lemma. The lemma rules are in the C++ code, but they sit deep behind the general API.
@foxik, is there a part of the MorphoDiTa C++ API that generates all possible inflected forms of a lemma, or can this be easily accessed through the UDPipe C++ API?
MorphoDiTa offers this functionality (https://ufal.mff.cuni.cz/morphodita/api-reference#morpho_generate), but it needs a morphological dictionary, which we have only for Czech, Slovak and English. In other words, UDPipe models have no notion of "valid forms for a given lemma": they are designed only for analysis, using rules like "remove -ed" and letting the tagger choose a valid result. Applied in the generation direction, these rules create a lot of invalid forms for a given lemma.
Thank you, Milan. @blhills, I think the easiest approach is to run the lemmatisation on your corpus of news articles and keep the resulting token/lemma combinations.
Hmm, yeah, so I just build my own dataset of lemmas and inflections and query that dataset when I want to find the appropriate words. It's one of those things that logically seemed pretty simple, so I thought perhaps I had overlooked a way of doing it in the package. Anyway, thanks for the help and for the great package; it is one of the best tools I have found for the work I'm doing.
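The approach suggested above can be sketched in base R as follows. This is only a minimal illustration: the `annotated` data frame is typed in by hand from the example output in this issue (in practice it would be the `token` and `lemma` columns of the annotation result over the whole corpus), and `inflections_of()` is a hypothetical helper, not a function of the udpipe package.

```r
# Token/lemma pairs copied from the annotation output quoted in this issue.
# Assumption: in real use, this would be the token and lemma columns of the
# data frame returned by annotating the full corpus.
annotated <- data.frame(
  token = c("izbori", "izbore", "izbora", "izborima"),
  lemma = c("izbor", "izbor", "izbor", "izbor"),
  stringsAsFactors = FALSE
)

# Build the lookup: one character vector of distinct observed forms per lemma.
# unique() drops repeats, which will be common in a real corpus.
lemma_forms <- lapply(split(annotated$token, annotated$lemma), unique)

# Hypothetical helper (not part of udpipe): return the forms observed for a
# lemma, or an empty character vector if the lemma never occurred.
inflections_of <- function(lemma, lookup = lemma_forms) {
  forms <- lookup[[lemma]]
  if (is.null(forms)) character(0) else forms
}

inflections_of("izbor")
```

Note that such a lookup only returns forms that actually occurred in the annotated corpus and that the tagger assigned to that lemma, so it inherits any lemmatisation errors and will miss rare inflections.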
Hey Jan, thanks for the awesome work. I have been using the R package to lemmatise media corpora in several Central and Eastern European languages. However, I am wondering whether there is a way to essentially reverse the process.
So I can run the annotation and get output like this:
  doc_id paragraph_id sentence_id sentence                      start end term_id token_id token    lemma upos xpos
1 doc1   1            1           izbori izbore izbora izborima 1     6   1       1        izbori   izbor VERB Vmr3s
2 doc1   1            1           izbori izbore izbora izborima 8     13  2       2        izbore   izbor NOUN Ncmpa
3 doc1   1            1           izbori izbore izbora izborima 15    20  3       3        izbora   izbor NOUN Ncmpg
4 doc1   1            1           izbori izbore izbora izborima 22    29  4       4        izborima izbor NOUN Ncmpd
But what I would like is a way to do something like:
x <- udpipe(x = "izbor", object = udmodel)
and have it return the list of forms "izbori, izbore, izbora, izborima".
Is this possible?