Multiple acceptors and error models #25
Comments
I have a few ideas based on experiments with, e.g., optimizing sizes in memory and on disk, and also on some experiments with other spelling models and/or word completion. In principle there is a fairly direct tradeoff between space, speed, and the complexity of keeping FSA components apart and performing their compositions or lookups at runtime. So on that end, it might be good to have a generally quite flexible model of parts that get assembled on the fly. One FSA worth considering is a weighting model. This would allow the acceptors to be unweighted, which should theoretically save at least x bytes (however many a float uses) per state and edge, both in memory and on disk. While the weighting model should assign some weight to all strings, it will probably be less complex than the analyser, or it could be another statistical model altogether. In prediction or word completion there is also a place for a kind of morph acceptor model, since we want completions for potentially unfinished word-forms (e.g. compound forms that are bound, and parts of complex words).
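A minimal sketch of the split described above, assuming a plain set of word forms stands in for the unweighted acceptor and a toy length-based function stands in for the weighting model. All names here (`accepts`, `toy_weight`, `rank`) are hypothetical and are not part of divvunspell or HFST; the point is only that acceptance and weighting can live in separate components.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for an unweighted acceptor: it only answers yes or no.
ACCEPTOR = frozenset({"norgga", "árktalaš"})


def accepts(word: str) -> bool:
    """Return True if the (toy) unweighted acceptor recognises the word."""
    return word in ACCEPTOR


def toy_weight(word: str) -> float:
    """Hypothetical weighting model: gives every string a weight (lower is better)."""
    return float(len(word))


def rank(
    candidates: Iterable[str],
    weight: Callable[[str], float] = toy_weight,
) -> List[Tuple[str, float]]:
    """Keep accepted candidates and order them by the separate weighting model."""
    accepted = [(w, weight(w)) for w in candidates if accepts(w)]
    return sorted(accepted, key=lambda pair: pair[1])
```

Because the weighting function is passed in, it could in principle be swapped for a different statistical model without touching the acceptor.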
We now have a working next-word prediction and autocomplete ML model. Do I understand the idea correctly that, on top of this ML model, there should be another one that performs the spellchecking task? For example: input --> checked by the spellchecking ML model --> if the input is OK, the model switches to the ML autocomplete/next-word prediction task; if the input is not correct, the spellchecking model suggests corrections.
I am not sure I understand all of this, but here is what I think should happen:
I am not sure what role the regular spell checker should have beyond verifying suggestions from the ML model. It might be useful to run it against the ML suggestions, but it might just as well be better to only filter the suggestions (that is necessary in any case). We need to test this and see how it behaves :)
Yeah, I think the simplest first approach is to get a decently large n-best list of suggestions from the ML model and run it through the spell-checker, so that we only suggest completions that are probably understandable for the user. If it is autocompletion after the user has typed some letters of the word, and the ML model is only trained on complete word-forms in context from a corpus, there might be a need to account for the user having already misspelled the initial part of the word, since this won't show up in the (gold) corpus. It might also be possible to make a model that maps initial misspellings to correct word-forms, using the corpus of marked-up errors. I'm thinking, e.g., when the user types something like "...uit norgga ark", the autocomplete should probably be able to complete 'árktalaš' or so. I have a feeling this is how Gboard and Swype work for bigger languages; I'm not sure whether an ML model trained on raw text can do that, but maybe? As a comparison, the strictly rule-based or FSA model of completion without context (adding context should be a possible extension) is probably usually composed of at least:
As I understand it, it is just a question of how much of this can be baked into a single ML model. E.g. if the corpus data is correctly spelled and plentiful, it would model the dictionary of correctly spelled words, or morph, character, or other such text-piece combinations, without needing to query the rule-based dictionary. But yeah, in practice we will see when we test stuff :-)
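The "simplest first approach" mentioned in this comment (take an n-best list from the ML model and keep only what the spell-checker accepts) could be sketched roughly as follows. This assumes the ML model returns `(word, score)` pairs with higher scores being better, and that the spell-checker is available as a yes/no function; both interfaces, and the name `filter_nbest`, are hypothetical.

```python
from typing import Callable, List, Tuple


def filter_nbest(
    nbest: List[Tuple[str, float]],
    is_correct: Callable[[str], bool],
    limit: int = 3,
) -> List[str]:
    """Drop ML suggestions the speller rejects, then return the best `limit`."""
    kept = [(word, score) for word, score in nbest if is_correct(word)]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # higher score first
    return [word for word, _ in kept[:limit]]


# Toy data: a made-up n-best list and a made-up lexicon acting as the speller.
TOY_LEXICON = {"árktalaš", "norgga"}
toy_nbest = [("arktalas", 0.5), ("árktalaš", 0.4), ("norgga", 0.1)]
suggestions = filter_nbest(toy_nbest, lambda w: w in TOY_LEXICON)
```

In this toy run the highest-scoring ML suggestion is rejected by the lexicon, which is the case the filtering step is meant to catch.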
When I first made this issue I didn't expect completion and prediction to be available this early, so I believe we now need to change the plans a bit. Here is what I suggest for the next steps:
@flammie's point about misspelled input is very relevant. A variant of his suggestion is to use the current FST error model on any given input, feed the N best corrections to the ML model, and then return the most likely candidate of the lot. A potential problem with this approach is the raw number of candidates from the error model, but that can be remedied by using the --beam option:
time echo ark | hfst-lookup -q -b 7 tools/spellcheckers/errmodel.default.hfst
ark ark 0,000000
ark aqk 6,000000
ark aqrk 6,000000
ark arkq 6,000000
ark arq 6,000000
ark arqk 6,000000
ark qark 6,000000
ark qrk 6,000000
ark árk 6,000000
real 0m0.379s
user 0m0.208s
sys 0m0.166s
Most of these are garbage, but the one we want is also there, and will probably produce good completion suggestions from the ML model. At least worth a try 😄
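A rough sketch of that pipeline: parse the error-model output in the layout shown above (input, candidate, weight with a decimal comma, separated by whitespace), hand each candidate to the ML model, and return the single most likely completion. The function `toy_complete` and its probabilities are made up for illustration; the real ML model's interface is not described in this thread, so that part is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Output copied from the hfst-lookup run above, plus one timing line.
SAMPLE = """\
ark ark 0,000000
ark aqk 6,000000
ark aqrk 6,000000
ark arkq 6,000000
ark arq 6,000000
ark arqk 6,000000
ark qark 6,000000
ark qrk 6,000000
ark árk 6,000000
real 0m0.379s
"""


def parse_lookup(text: str) -> List[Tuple[str, float]]:
    """Extract (candidate, weight) pairs; lines in any other shape are skipped."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. the timing lines printed by `time`
        _source, candidate, weight = parts
        try:
            results.append((candidate, float(weight.replace(",", "."))))
        except ValueError:
            continue
    return results


def best_completion(
    corrections: List[Tuple[str, float]],
    complete: Callable[[str], List[Tuple[str, float]]],
) -> Optional[str]:
    """Ask the (hypothetical) ML model about every correction; keep the likeliest."""
    best, best_p = None, -1.0
    for candidate, _weight in corrections:
        for completion, p in complete(candidate):
            if p > best_p:
                best, best_p = completion, p
    return best


# Hypothetical ML completions with made-up probabilities.
TOY_ML = {"ark": [("ark", 0.1)], "árk": [("árktalaš", 0.8)]}


def toy_complete(prefix: str) -> List[Tuple[str, float]]:
    return TOY_ML.get(prefix, [])
```

This sketch ignores the error-model weight when choosing the final candidate; combining it with the ML probability is one of the things that would need testing.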
Both old ideas and new developments suggest a more flexible approach to acceptors and error models. Below is a list of things discussed in the past, plus new ideas inspired by the ongoing machine learning work by @gusmakali on word completion and prediction. Some of the tasks mentioned in #19 are also relevant to this.
Multiple error models
The idea is that all of the above could be present in one and the same speller archive, with some configuration specifying when to apply which model. A very tentative idea: a machine learning error model will either get it right with the top hypothesis or fail completely (as determined by filtering the hypothesis against the lexicon). So use that one as a first step, then fall back to a hand-tuned error model, and when that fails (it could be written to be on the safe side, i.e. not suggest anything outside a certain set of errors), fall back to the default error model.
Exactly how this should work and interact is very much an open question, but divvunspell should provide the machinery so that linguists can experiment with it to reach an optimal setup for a given language and device type.
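The tentative fallback order described above could be sketched as a simple cascade. This is only an illustration under stated assumptions: each error model is a function from a word to a list of hypotheses, the lexicon is a yes/no function, and the stage list plays the role of the configuration. None of these names or interfaces exist in divvunspell; they are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Model = Callable[[str], List[str]]
# (name, model, top_only): top_only means only the first hypothesis is trusted.
Stage = Tuple[str, Model, bool]


def cascade(
    word: str,
    stages: Sequence[Stage],
    in_lexicon: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Try each error model in order; return the first stage with valid output."""
    for name, model, top_only in stages:
        hypotheses = model(word)
        if top_only:
            hypotheses = hypotheses[:1]
        valid = [h for h in hypotheses if in_lexicon(h)]
        if valid:
            return name, valid
    return None, []


# Toy models with made-up output, only to exercise the fallback order.
TOY_LEXICON = {"árktalaš"}
STAGES: List[Stage] = [
    ("ml", lambda w: ["arktalas", "árktalaš"], True),   # top hypothesis is wrong
    ("hand-tuned", lambda w: [], False),                # stays on the safe side
    ("default", lambda w: ["arktalaq", "árktalaš"], False),
]
stage, suggestions = cascade("arktalaš", STAGES, lambda w: w in TOY_LEXICON)
```

Here the ML stage is rejected because its top hypothesis is not in the lexicon, the hand-tuned stage offers nothing, and the default model supplies the suggestion.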
Multiple acceptors
And possibly other variants too.
There are at least two ideas here:
As part of this work it is probably necessary to rework the zhfst archive format, probably by making the bhfst format the standard, including the JSON config file used there.