Description:
The proximity search (~N) in Whoosh shows inconsistent behavior based on the nature of the words in the indexed document. It seems that individual letters and commonly used filler words might be disregarded, whereas semantically meaningful words are counted.
Expected Behavior:
A search for "hello world"~2 should match strings where "hello" and "world" are separated by up to two terms, regardless of the nature or semantic value of the intervening terms.
Actual Behavior:
The behavior of the proximity search appears inconsistent:
It matches strings like "hello X Y Z A B C D world" and "hello to a the but and for this world" even though there are many terms between "hello" and "world".
It does not match strings like "hello add more words to illustrate the problem world", correctly following the ~2 constraint.
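The hypothesis in the description can be sketched in plain Python, without Whoosh. The sketch assumes that single-letter tokens and a small set of filler words are dropped before positions are assigned, so dropped tokens consume no position. The stop list, the minimum token length, and the helper names `surviving_tokens` and `surviving_gap` are all hypothetical and chosen only for illustration; they are not taken from Whoosh's source and may differ from what the library actually does.

```python
import re

# Hypothetical values for illustration only; NOT Whoosh's actual configuration.
ASSUMED_STOP_WORDS = frozenset({"a", "and", "for", "the", "this", "to"})
ASSUMED_MIN_TOKEN_LENGTH = 2


def surviving_tokens(text):
    """Lowercase, split on word characters, and drop short tokens and stop words."""
    return [
        token
        for token in re.findall(r"\w+", text.lower())
        if len(token) >= ASSUMED_MIN_TOKEN_LENGTH
        and token not in ASSUMED_STOP_WORDS
    ]


def surviving_gap(text, first, second):
    """Count the tokens left between `first` and `second` after filtering."""
    tokens = surviving_tokens(text)
    return tokens.index(second) - tokens.index(first) - 1


cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world",
}

for name, content in cases.items():
    raw_gap = len(content.split()) - 2  # terms between "hello" and "world" as written
    print(f"{name}: raw gap = {raw_gap}, gap after filtering = {surviving_gap(content, 'hello', 'world')}")
```

Under these assumptions all three strings have seven intervening terms as written, but only the third keeps a large gap after filtering. That is consistent with the reported results, though it does not confirm the cause.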
Minimal Working Example (MWE):
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser
from whoosh.filedb.filestore import RamStorage

schema = Schema(content=TEXT(stored=True))
storage = RamStorage()


def create_new_index():
    return storage.create_index(schema)


def add_to_index(idx, content):
    writer = idx.writer()
    writer.add_document(content=content)
    writer.commit()


def matches_whoosh(query, indexed_opinion):
    with indexed_opinion.searcher() as searcher:
        parsed_query = QueryParser("content", indexed_opinion.schema).parse(query)
        results = searcher.search(parsed_query)
        return len(results) > 0


# Add test cases and print results
test_cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world"
}

query = '"hello world"~2'
for case_name, content in test_cases.items():
    idx = create_new_index()
    add_to_index(idx, content)
    print(f"{case_name}: {matches_whoosh(query, idx)}")
Environment:
Whoosh version: 2.7.4
Python version: 3.10.0
Operating System: macOS 13.5.1 with Apple M1 Pro
cclauss pushed a commit to cclauss/whoosh-1 that referenced this issue on Feb 9, 2024
# Description
This resolves the code coverage reporting, so the actual source files
will also have coverage reported. I configured my own fork with a token
and you can view the results:
https://app.codecov.io/github/stumpylog/whoosh-reloaded
The main fix is to install using `pip install -e .` or editable.
Otherwise, coverage did not pick up those files as being relevant.
The other small fix was to only run the testing once.
Closes: mchaput#48
# Checklist:
- [x] I have performed a self-review of my own code
- [ ] I have commented my code in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation