divvunspell having problems with Cyrillic capital letters (it seems) #39
Comments
More examples:
Thus, hfst-lookup and hfst-ospell behave as expected, but divvunspell does not: it does not recognise the correct form, and thus does not suggest it as a correction for the incorrect form either. More examples behaving in the same way: Приобье, Сарымо-Русскинское, Сӯкыръя, Тӯрват, хо̄тпан, ва̄та̄л-хащта̄л, вертолёт, занятиен, школа, колхозыт, коньяк, ляпат. As can be seen, the examples include plain lower-case Russian Cyrillic words as well as words with capital letters and composed letters. These failed forms constitute approximately 5% of a dataset of 607 word pairs.
Exactly the same issue was commented on earlier, in #19, some years ago. The problems (capital letters and composed letters) seem unrelated, but taken together they make spellers for Cyrillic-based languages and languages with composed letters (a large part of our languages) dysfunctional. I thus hope the issue can get some attention.
This is an old problem that was probably patched over in hfst-ospell after it was used as a source for divvunspell: if the error model doesn't know the alphabet needed for the lexicon, it gets confused, cf. giellalt/lang-mns@37e7724
But how come? Why doesn't the error model know the alphabet needed (I take you to mean the capital letters)? There is nothing magic about Cyrillic letters per se, I thought, but evidently there is. Here is the situation:
Looking hard at this, I adjusted mns in line with mns, and now it works. It is still a mystery how sme works without having declared capital letters. Admittedly, it uses .regex and not .txt, but I do not see how that should affect the outcome. In any case, although I invite anyone puzzled by the remaining incongruence to delve into it, I now have a working divvunspell for Cyrillic-based languages, and close the bug.
There's no I actually wrote a hfst-tool |
We don't deduce the alphabet from the lexical FST automatically. We did so in an early version of the speller infra, but then we instead needed to list all the symbols we did NOT want as part of the error model alphabet. That turned out to be even more confusing and counter-intuitive, so what we have now is a system where you need to explicitly (and in some cases implicitly) list all symbols/letters you want to include in the error model alphabet.

By default I always suggest NOT including capital letters. Including them leads to a much bigger error model, and a correspondingly slower speller. At the same time we know that capital letters are almost only found in first position, and letters in that position are (generally) rarely wrong. There is also built-in automatic processing of upper-lower case shifting in the speller code, so that it is only the lexical case that needs consideration.

All of this is to say that the actual need for including upper-case letters in the error model alphabet is usually very small, so small that it is typically not worth the cost (a much bigger error model and a much slower speller, cf. above). That is, think twice before adding upper-case letters, and if you need to, consider whether you need all the upper-case letters, or only a select few known problem letters, as in your example.
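To make the size argument concrete, here is a rough back-of-the-envelope sketch. It assumes a naive, unweighted edit-distance-1 error model in which every ordered pair of distinct letters can be substituted or transposed, and every letter can be inserted or deleted. The real error model generated from editdist.default.txt is weighted and need not have this shape, and the alphabet size of 44 is only an illustrative figure (it matches the 44 capital-letter lines, 46 through 89, mentioned elsewhere in this thread). The function names are made up for this example.

```python
def edit1_operation_counts(alphabet_size: int) -> dict:
    """Count single-edit operations in a naive edit-distance-1 model."""
    n = alphabet_size
    return {
        "substitutions": n * (n - 1),   # ordered pairs of distinct letters
        "transpositions": n * (n - 1),  # swaps of two distinct adjacent letters
        "insertions": n,
        "deletions": n,
    }


def total_operations(alphabet_size: int) -> int:
    """Total number of single-edit operations for the given alphabet size."""
    return sum(edit1_operation_counts(alphabet_size).values())


lower_only = 44          # assumed number of lower-case letters
with_capitals = 2 * 44   # the same letters plus their capitals

print(total_operations(lower_only))     # 3872
print(total_operations(with_capitals))  # 15488, i.e. four times as many
```

Under these assumptions the total is 2n² for an alphabet of n letters, so doubling the alphabet quadruples the number of single-edit operations, and the effect compounds for edit distance 2.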
Well. Before adding the capital letters, the speller did not recognise words written with capital letters (cf. the "to repeat" in my first posting). After adding the capital letters, the speller did recognise such words. Some letters never occur word-initially, but they do occur in words written with capital letters only. I am thus hesitant to follow your advice here.
That does not make any sense. Adding letters to the error model should have no consequence for whether a word is accepted or not by the speller, since that is done by the acceptor only. The error model takes no part in this process (and should not). If you can come up with a reproducible example demonstrating this behaviour, please add it here (or in a new issue) — that is definitely a serious bug, and should be fixed ASAP. But I don't believe it before I see it.
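The expected behaviour described here can be sketched as a toy model. This is not divvunspell's code: the lexicon is a plain Python set standing in for the acceptor FST, and the function names `case_variants` and `is_accepted` are made up for illustration. The point is only that acceptance consults the acceptor (plus case variants of the input) and never the error model.

```python
def case_variants(word: str) -> list:
    """Forms to test against the acceptor, with the original form first."""
    variants = [word]
    if word.isupper():
        # ALL-CAPS input: also try the Capitalised and the lower-case form.
        variants.append(word[0] + word[1:].lower())
        variants.append(word.lower())
    elif word[:1].isupper():
        # Initial capital: also try the form with a lower-case first letter.
        variants.append(word[0].lower() + word[1:])
    # Remove duplicates while keeping the order.
    unique = []
    for variant in variants:
        if variant not in unique:
            unique.append(variant)
    return unique


def is_accepted(word: str, lexicon: set) -> bool:
    """Accept if any case variant is in the lexicon; no error model involved."""
    return any(variant in lexicon for variant in case_variants(word))


lexicon = {"школа", "Эспоо"}  # toy stand-in for the acceptor

print(is_accepted("Школа", lexicon))  # True: lower-cased first letter matches
print(is_accepted("ШКОЛА", lexicon))  # True: all-caps is lower-cased
print(is_accepted("ЭСПОО", lexicon))  # True: all-caps maps to Эспоо
print(is_accepted("эспоо", lexicon))  # False: lexical case is capitalised
```

In this model, changing the error model alphabet cannot change any of the results, which is the behaviour the comment above expects.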
I wrote in my comment:
This includes initial upper-casing and all-caps. That is, from an FST point of view, When generating suggestions, all of |
Not too exciting here, just repeating myself, but in the reversed order. I first take the git version of the mns speller, and run the word for Leningrad (capital Cyrillic L) through it. It is accepted by divvunspell:
I then remove lines 46 through 89 (= all the capital letters) from the file editdist.default.txt, and save the file (I do not check it in, but it can be repeated as just described), and recompile the speller:
I then repeat the same test, with the following result:
The only difference is thus the removal of the capital letters from that file. The word is recognised by the analyser:
There are subtle differences, though:
Whether this is of any help I do not know.
I think Trond is right here: divvunspell fails when part of the alphabet, e.g. the upper-case letters, does not exist in the alphabet of the whole error model. I have been wondering why there is an extra case-handling and extra edit-distance re-weighting step in divvunspell (vs. hfst-ospell) that seemingly duplicates what the error model should be doing, and I now think that it might have been implemented as a partial workaround for this problem.
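One way such behaviour could arise is sketched below. This is purely a hypothetical illustration of the suspicion voiced above, not something taken from divvunspell's source: it assumes that the input is encoded with the error model's symbol table before the acceptor is consulted, so that any input symbol missing from that table makes the lookup fail. The function name and the toy data are invented for this example.

```python
def accepts_via_errmodel_alphabet(word: str, lexicon: set, errmodel_alphabet: set) -> bool:
    """Hypothetical failure mode: unknown input symbols block the lookup."""
    # Assumption: the input is tokenised against the error model alphabet first.
    if any(symbol not in errmodel_alphabet for symbol in word):
        return False
    # Only words that survive tokenisation reach the acceptor.
    return word in lexicon


lexicon = {"Эспоо"}                   # toy stand-in for the acceptor
lower_only = set("эспо")              # error model alphabet without capitals
with_capital = lower_only | {"Э"}     # same alphabet with the capital declared

print(accepts_via_errmodel_alphabet("Эспоо", lexicon, lower_only))    # False
print(accepts_via_errmodel_alphabet("Эспоо", lexicon, with_capital))  # True
```

A speller with this kind of coupling would reject Эспоо until the capital letter is declared in the error model, even though the word is in the acceptor, which matches the symptoms reported in this issue.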
To repeat:
The form Эспоо is in the normative fst, and is recognised by hfst-ospell, but not by divvunspell, even though both spellers access the same freshly compiled version of the speller, mns.zhfst.
The lemma (Эспоо) is from the shared-urj-Cyrl, but that should not be relevant (it is in the resulting fst). The editdist.default.txt file declares the е/э pair (and, on a side note, also the composed long э̄). Capital letters are not declared explicitly: