Reverse Lemmatisation? #89

blhills · 2020-11-26T15:00:28Z

Hey Jan, thanks for the awesome work. I have been using the R package to handle lemmatisation on media corpora for multiple Central and Eastern European languages. However, I am wondering if there is a way to essentially reverse the process.

so I can run this:

library(udpipe)

udmodel <- udpipe_download_model(language = "croatian")

x <- udpipe(x = "izbori izbore izbora izborima", object = udmodel)

x

  doc_id paragraph_id sentence_id                      sentence start end term_id token_id    token lemma upos  xpos
1   doc1            1           1 izbori izbore izbora izborima     1   6       1        1   izbori izbor VERB Vmr3s
2   doc1            1           1 izbori izbore izbora izborima     8  13       2        2   izbore izbor NOUN Ncmpa
3   doc1            1           1 izbori izbore izbora izborima    15  20       3        3   izbora izbor NOUN Ncmpg
4   doc1            1           1 izbori izbore izbora izborima    22  29       4        4 izborima izbor NOUN Ncmpd

but what I would like is a way I can do something like

x <- udpipe(x = "izbor", object = udmodel)

and have it return the list of "izbori, izbore, izbora, izborima"

Is this possible?

jwijffels · 2020-11-26T15:23:12Z

Hello blhills, no, it is currently not possible through the API to generate all inflected forms of a lemma. The lemma rules are in the C++ code but buried deep beneath the general API.
Maybe we can ask about this in the MorphoDiTa GitHub repository.

jwijffels · 2020-11-26T16:32:11Z

@foxik is there a part in the morphodita C++ API which allows for generating all possible inflected forms of a lemma or can it be easily accessed on the UDPipe C++ API?

foxik · 2020-11-26T17:10:51Z

MorphoDiTa offers such functionality ( https://ufal.mff.cuni.cz/morphodita/api-reference#morpho_generate ), but it needs a morphological dictionary (which we have only for Czech, Slovak and English). I.e., UDPipe models do not have any idea of "valid forms for a given lemma": they are designed only for analysis using rules like "remove -ed" (letting the tagger choose a valid result); for generation, these rules would create a lot of invalid forms for a given lemma...

jwijffels · 2020-11-27T14:54:07Z

Thank you Milan.

@blhills I think the easiest approach is to run the lemmatisation on your corpus of news articles and keep the resulting token/lemma combinations.
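A minimal sketch of that idea in base R, assuming the annotation is a data frame with `token` and `lemma` columns, as in the `udpipe()` output shown earlier in this thread. The helper name `build_inflection_lookup` is hypothetical and not part of the udpipe package, and the small data frame below only stands in for a real annotated corpus.

```r
# Stand-in for real annotation output; in practice this would come from
# something like x <- udpipe(x = corpus_text, object = udmodel).
x <- data.frame(
  token = c("izbori", "izbore", "izbora", "izborima", "Izbori"),
  lemma = c("izbor", "izbor", "izbor", "izbor", "izbor"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a udpipe function): collect the distinct
# lower-cased token forms observed for each lemma.
build_inflection_lookup <- function(annotation) {
  pairs <- unique(data.frame(
    lemma = tolower(annotation$lemma),
    token = tolower(annotation$token),
    stringsAsFactors = FALSE
  ))
  # Named list: one character vector of observed forms per lemma
  split(pairs$token, pairs$lemma)
}

lookup <- build_inflection_lookup(x)
lookup[["izbor"]]
# "izbori" "izbore" "izbora" "izborima"
```

Such a lookup only contains forms that actually occur in the corpus, and it inherits any tagging or lemmatisation mistakes from the model (note that "izbori" is tagged as VERB in the output above), so it may be worth also keeping `upos` and filtering on it.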

blhills · 2020-11-27T16:00:12Z

Hmm, yeah, so just build out my own dataset of lemmas + inflections and query that dataset when I want to find the appropriate words.

It's one of those things that logically seemed pretty simple, so I thought perhaps I had overlooked a way of doing it in the package.

Anyway, thanks for the help and the great package; it is one of the best tools I have found for the work I'm doing.
