@dorianbrown In the seminal paper for this package, the Okapi at TREC-3 paper, and most other places, BM25 is defined over query terms rather than tokens, which would indicate that repeated query tokens should not affect the score. However, that does not seem to be the case in the rank-bm25 library (see rank_bm25/rank_bm25.py, line 117 at commit 329b794).
The user can easily work around this by passing `set(query)`<sup>1</sup> rather than `query` to the `get_scores()` method, but it seems like something the user would expect to happen automatically. At the very least, we may want to document this.
<sup>1</sup> Alternatively, use `list(dict.fromkeys(query))` for a reproducible ordering, since a set's iteration order is not guaranteed and floating-point summation is not associative.
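The workaround in the footnote can be sketched as below. This is a minimal illustration using only the standard library; the helper name `unique_tokens` and the sample query are hypothetical and not part of rank-bm25. The deduplicated list is what one would then pass to `get_scores()`.

```python
def unique_tokens(query_tokens):
    """Drop repeated tokens, keeping the first occurrence of each, in order."""
    # dicts preserve insertion order (Python 3.7+), so the result is
    # deterministic, unlike iterating over a set.
    return list(dict.fromkeys(query_tokens))


# Hypothetical tokenized query with repeated tokens.
query = ["okapi", "bm25", "okapi", "ranking", "bm25"]
deduped = unique_tokens(query)
print(deduped)  # ['okapi', 'bm25', 'ranking']

# Why the order can matter: floating-point addition is not associative,
# so summing the same per-term scores in a different order may give a
# slightly different total.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
```

The differences caused by summation order are tiny, but they can be enough to break exact-equality checks or to reorder documents with near-identical scores.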
In the PyTerrier/Terrier implementation of BM25, the number of times a query term is repeated matters. Repetition is often used to upweight certain query terms. Look how the scoring changes:
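As a library-independent sketch of that effect (this is not PyTerrier output), assume a scorer that adds one contribution per query token, so a repeated token counts once per occurrence. The function `additive_score` and the per-term numbers are made up for illustration; Terrier's actual query term weighting may differ.

```python
from collections import Counter

# Hypothetical per-term contributions for a single document.
term_scores = {"okapi": 1.5, "bm25": 2.0, "ranking": 0.5}


def additive_score(query_tokens, term_scores):
    """Sum one contribution per query token, so repeats count multiple times."""
    counts = Counter(query_tokens)
    # Terms missing from the document contribute nothing.
    return sum(n * term_scores.get(term, 0.0) for term, n in counts.items())


plain = additive_score(["okapi", "bm25"], term_scores)
boosted = additive_score(["okapi", "bm25", "bm25"], term_scores)
print(plain, boosted)  # 3.5 5.5
```

Under this assumption, repeating `bm25` doubles that term's share of the total, which is why repetition can serve as a crude term-weighting mechanism.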