Spelling corrector breaks on NUMERIC fields #55

CodeOptimist · 2024-03-21T20:06:21Z

When a user makes a typo in a numeric field of their query, the spelling corrector raises:

  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\searching.py", line 931, in correct_query
    return sqc.correct_query(q, qstring)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\spelling.py", line 327, in correct_query
    sugs = c.suggest(token.text, prefix=prefix, maxdist=maxdist)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\spelling.py", line 66, in suggest
    for item in _suggestions(text, maxdist, prefix):
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\spelling.py", line 111, in _suggestions
    for sug in reader.terms_within(sugfield, text, maxdist, prefix=prefix):
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\codec\base.py", line 364, in find_matches
    match = dfa.next_valid_string(term)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\automata\fsa.py", line 267, in next_valid_string
    for i, label in enumerate(string):
                    ^^^^^^^^^^^^^^^^^
TypeError: 'int' object is not iterable
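
The last frame can be reproduced without Whoosh. This is a minimal sketch, assuming only what the traceback shows: the automaton walks its input one label at a time with enumerate(). `walk_labels` is a hypothetical stand-in, not a Whoosh function.

```python
def walk_labels(string):
    """Hypothetical stand-in for the loop in the last traceback frame."""
    return [(i, label) for i, label in enumerate(string)]


# A text term iterates character by character.
print(walk_labels("abc"))

# A decoded NUMERIC term is an int, which cannot be iterated.
try:
    walk_labels(2024)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```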

If we look here (whoosh/src/whoosh/searching.py, line 914 in d9a3fa2):

    correctors[fieldname] = self.reader().corrector(fieldname)

This doesn't seem right. Are we using ReaderCorrector as the default for all fields? SegmentReader.terms_within() uses Automata.terms_within(), which is based on Levenshtein distance (whoosh/src/whoosh/codec/base.py, line 376 in d9a3fa2):

    dfa = self.levenshtein_dfa(uterm, maxdist, prefix)

and it chokes on the int correctly returned from NUMERIC.from_bytes(), received within W3FieldCursor (whoosh/src/whoosh/codec/whoosh3.py, line 541 in d9a3fa2):

    self._text = self._fieldobj.from_bytes(text)
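
For context, here is a rough sketch of what a maxdist check means for text terms. It uses the textbook dynamic-programming edit distance and compares terms pairwise, which is not how Whoosh does it (the line above builds a DFA instead). `levenshtein` and `within` are illustrative names, not Whoosh APIs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def within(candidates, text, maxdist):
    """Keep the candidates whose edit distance to text is at most maxdist."""
    return [c for c in candidates if levenshtein(c, text) <= maxdist]


print(within(["cat", "cart", "dog"], "cat", 1))
```

Both arguments have to be sequences of characters, so a term that has already been decoded to an int has nothing to compare position by position.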

Pretty sure Levenshtein distance isn't meant to be supported on NUMERIC fields, since their values are stored through to_bytes() (whoosh/src/whoosh/fields.py, line 712 in d9a3fa2):

    def to_bytes(self, x, shift=0):

Though maybe I'm the dumb one.
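
To illustrate why edit distance is a poor fit for numbers encoded as bytes, here is a sketch using a generic sortable encoding. This is an assumption for illustration only and is not claimed to match the exact byte layout that NUMERIC.to_bytes() produces. `sortable_bytes` and `differing_bytes` are hypothetical helpers.

```python
import struct


def sortable_bytes(x: int) -> bytes:
    """Illustrative encoding: signed 32-bit int to bytes that sort numerically.

    Adding 2**31 maps the signed range onto the unsigned range, so the
    big-endian bytes compare in the same order as the integers.
    """
    return struct.pack(">I", x + 2**31)


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where two equal-length byte strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Numerically adjacent values versus values about 16.7 million apart.
near = differing_bytes(sortable_bytes(255), sortable_bytes(256))
far = differing_bytes(sortable_bytes(0), sortable_bytes(16777216))
print(near, far)
```

Under this encoding, 255 and 256 differ in more byte positions than 0 and 16777216 do, so "close in edit distance" says little about "close in value".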

Shouldn't searching.py be something more like:

        # Fill in default corrector objects for fields that don't have a custom
        # one in the "correctors" dictionary
        from whoosh.fields import TEXT  # <-----
        for fieldname, field in self.schema.items():  # <-----
            fieldname = aliases.get(fieldname, fieldname)
            if isinstance(field, TEXT) and fieldname not in correctors:  # <-----
                correctors[fieldname] = self.reader().corrector(fieldname)

Anyway that's the fix I'm using for now. I only need corrections for text fields.
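
The filtering logic can be tried in isolation. This is a standalone sketch, not a patch: the `TEXT` and `NUMERIC` classes are empty stand-ins for the ones in whoosh.fields, and `default_correctors` and `make_corrector` are hypothetical names that replace the method body and self.reader().corrector().

```python
class TEXT:
    """Hypothetical stand-in for whoosh.fields.TEXT."""


class NUMERIC:
    """Hypothetical stand-in for whoosh.fields.NUMERIC."""


def default_correctors(schema_items, make_corrector, correctors=None, aliases=None):
    """Fill in default correctors for TEXT fields that lack a custom one."""
    correctors = dict(correctors or {})
    aliases = aliases or {}
    for fieldname, field in schema_items:
        fieldname = aliases.get(fieldname, fieldname)
        # Only TEXT fields get a default corrector; NUMERIC is skipped.
        if isinstance(field, TEXT) and fieldname not in correctors:
            correctors[fieldname] = make_corrector(fieldname)
    return correctors


schema_items = [("title", TEXT()), ("body", TEXT()), ("year", NUMERIC())]
result = default_correctors(
    schema_items,
    make_corrector=lambda name: f"default:{name}",
    correctors={"body": "custom"},
)
print(result)
```

A custom corrector passed in for a field is kept as-is, and the numeric field never receives a default one, so the Levenshtein path is never reached for it.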
