Hello all,

I have noticed that when a token is present in exactly half of the documents, its contribution to the score is 0, which can be reproduced with the following snippet:
corpus = ["This text contains keyword1 and Keyword2",
"That is a text that contains keyword1 and term1",
"Page contains no keywords but contains term1 and term2",
"This text contains no keywords"]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "This is a question about keyword1 & term1"
tokenized_query = query.split()
doc_scores = bm25.get_scores(tokenized_query)
> array([0. , 1.52856224, 0. , 0. ])
In the above example, two documents that contain keyword1 or term1 still get a score of 0, which I believe is unexpected behavior. With N = 4 documents and keyword1 (or term1) occurring in n = 2 of them, the current IDF formula gives log((N - n + 0.5) / (n + 0.5)) = log(2.5 / 2.5) = 0, so those tokens contribute nothing to the score.
In fact, any token with a zero or small positive IDF ends up contributing less to the score than a token with a negative IDF, because negative IDFs are replaced by epsilon * average_idf. I suggest making the per-token score distribution better calibrated by adopting the IDF formula given here: https://en.wikipedia.org/wiki/Okapi_BM25
log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
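For illustration, here is a rough comparison of the two formulas on the corpus above (N = 4, and keyword1 / term1 each appear in 2 documents); the helper function names are only for this example and are not part of rank_bm25:

import math

def idf_current(N, n):
    # rank_bm25's BM25Okapi formula, before the epsilon floor is applied:
    # log((N - n + 0.5) / (n + 0.5))
    return math.log((N - n + 0.5) / (n + 0.5))

def idf_proposed(N, n):
    # Wikipedia variant: log((N - n + 0.5) / (n + 0.5) + 1)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

print(idf_current(4, 2))   # 0.0     -> keyword1/term1 contribute nothing
print(idf_proposed(4, 2))  # ~0.6931 -> keyword1/term1 still contribute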
I resolved the issue by changing the formula for calculating IDF to math.log(self.corpus_size + 1) - math.log(freq + 0.5).
You can change the _calc_idf method of the BM25Okapi class:
def _calc_idf(self, nd):
    """Calculates frequencies of terms in documents and in corpus.
    This algorithm sets a floor on the idf values to eps * average_idf"""
    # collect idf sum to calculate an average idf for epsilon value
    idf_sum = 0
    # collect words with negative idf to set them a special epsilon value.
    # idf can be negative if word is contained in more than half of documents
    negative_idfs = []
    for word, freq in nd.items():
        idf = math.log(self.corpus_size + 1) - math.log(freq + 0.5)
        self.idf[word] = idf
        idf_sum += idf
        if idf < 0:
            negative_idfs.append(word)
    self.average_idf = idf_sum / len(self.idf)
    eps = self.epsilon * self.average_idf
    for word in negative_idfs:
        self.idf[word] = eps
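If you do not want to edit the installed package, one way to try this change is to subclass BM25Okapi and override _calc_idf (the subclass name below is just an example). Since log(corpus_size + 1) - log(freq + 0.5) is always positive for freq <= corpus_size, the negative-idf handling can be dropped in the override:

import math
from rank_bm25 import BM25Okapi

class BM25OkapiLogPlusOne(BM25Okapi):  # example name, not part of rank_bm25
    def _calc_idf(self, nd):
        # log(N + 1) - log(freq + 0.5) is always positive,
        # so no epsilon floor for negative idf values is needed
        for word, freq in nd.items():
            self.idf[word] = math.log(self.corpus_size + 1) - math.log(freq + 0.5)

corpus = ["This text contains keyword1 and Keyword2",
          "That is a text that contains keyword1 and term1",
          "Page contains no keywords but contains term1 and term2",
          "This text contains no keywords"]
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25OkapiLogPlusOne(tokenized_corpus)
doc_scores = bm25.get_scores("This is a question about keyword1 & term1".split())
print(doc_scores)  # keyword1 and term1 now carry positive weight, so the
                   # documents containing them no longer score 0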