Hello all,

I have noticed that when a token is present in exactly half of the documents, its contribution to the score is 0, which can be reproduced with the following snippet:
corpus = ["This text contains keyword1 and Keyword2",
"That is a text that contains keyword1 and term1",
"Page contains no keywords but contains term1 and term2",
"This text contains no keywords"]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "This is a question about keyword1 & term1"
tokenized_query = query.split()
doc_scores = bm25.get_scores(tokenized_query)
> array([0. , 1.52856224, 0. , 0. ])
In the above example, two documents that contain keyword1 or term1 still get a score of 0, which I believe is unexpected behavior. With N = 4 documents and keyword1 (or term1) occurring in n = 2 of them, the current IDF formula gives log((N - n + 0.5) / (n + 0.5)) = log(2.5 / 2.5) = 0, so those tokens contribute nothing to the score.
In fact, any token with a zero or small positive IDF ends up contributing less to the score than a token with a negative IDF, because negative IDFs are replaced by epsilon * average_idf. I suggest making the per-token score distribution better calibrated by adopting the IDF formula given here: https://en.wikipedia.org/wiki/Okapi_BM25
log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
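For illustration, here is a rough comparison of the two formulas on the corpus above (N = 4, and keyword1 / term1 each appear in 2 documents); the helper function names are only for this example and are not part of rank_bm25:

import math

def idf_current(N, n):
    # rank_bm25's BM25Okapi formula, before the epsilon floor is applied:
    # log((N - n + 0.5) / (n + 0.5))
    return math.log((N - n + 0.5) / (n + 0.5))

def idf_proposed(N, n):
    # Wikipedia variant: log((N - n + 0.5) / (n + 0.5) + 1)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

print(idf_current(4, 2))   # 0.0     -> keyword1/term1 contribute nothing
print(idf_proposed(4, 2))  # ~0.6931 -> keyword1/term1 still contribute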
I resolved the issue by changing the formula for calculating IDF to math.log(self.corpus_size + 1) - math.log(freq + 0.5).
You can change the _calc_idf method of the BM25Okapi class:
def _calc_idf(self, nd):
    """Calculates frequencies of terms in documents and in corpus.
    This algorithm sets a floor on the idf values to eps * average_idf"""
    # collect idf sum to calculate an average idf for epsilon value
    idf_sum = 0
    # collect words with negative idf to set them a special epsilon value.
    # idf can be negative if word is contained in more than half of documents
    negative_idfs = []
    for word, freq in nd.items():
        idf = math.log(self.corpus_size + 1) - math.log(freq + 0.5)
        self.idf[word] = idf
        idf_sum += idf
        if idf < 0:
            negative_idfs.append(word)
    self.average_idf = idf_sum / len(self.idf)
    eps = self.epsilon * self.average_idf
    for word in negative_idfs:
        self.idf[word] = eps
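If you do not want to edit the installed package, one way to try this change is to subclass BM25Okapi and override _calc_idf (the subclass name below is just an example). Since log(corpus_size + 1) - log(freq + 0.5) is always positive for freq <= corpus_size, the negative-idf handling can be dropped in the override:

import math
from rank_bm25 import BM25Okapi

class BM25OkapiLogPlusOne(BM25Okapi):  # example name, not part of rank_bm25
    def _calc_idf(self, nd):
        # log(N + 1) - log(freq + 0.5) is always positive,
        # so no epsilon floor for negative idf values is needed
        for word, freq in nd.items():
            self.idf[word] = math.log(self.corpus_size + 1) - math.log(freq + 0.5)

corpus = ["This text contains keyword1 and Keyword2",
          "That is a text that contains keyword1 and term1",
          "Page contains no keywords but contains term1 and term2",
          "This text contains no keywords"]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25OkapiLogPlusOne(tokenized_corpus)
doc_scores = bm25.get_scores("This is a question about keyword1 & term1".split())
print(doc_scores)  # keyword1 and term1 now carry positive weight, so the
                   # documents containing them no longer score 0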