Score is 0 when a token is in exactly 50% of the documents. #39

Open
nkarahan-ing opened this issue May 22, 2024 · 1 comment
Comments

@nkarahan-ing

Hello all,

I have noticed that when a token is present in exactly half of the documents, its contribution to the score is 0, which can be reproduced by the following snippet:

# assuming the rank_bm25 package, which provides BM25Okapi
from rank_bm25 import BM25Okapi

corpus = ["This text contains keyword1 and Keyword2",
          "That is a text that contains keyword1 and term1",
          "Page contains no keywords but contains term1 and term2",
          "This text contains no keywords"]

# whitespace tokenization, case-sensitive
tokenized_corpus = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "This is a question about keyword1 & term1"
tokenized_query = query.split()

doc_scores = bm25.get_scores(tokenized_query)
# doc_scores: array([0.        , 1.52856224, 0.        , 0.        ])

In the above example, two documents containing the same tokens as the query have a score of 0. I believe this is unexpected behavior.
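For reference, a minimal sketch of where the zero comes from. It assumes the IDF is computed as log(N - n + 0.5) - log(n + 0.5), which is my reading of BM25Okapi and is not quoted in this thread; classic_idf is a hypothetical helper name. When n is exactly N / 2, both logarithms take the same argument and cancel:

```python
import math

def classic_idf(corpus_size: int, doc_freq: int) -> float:
    """Assumed original IDF: log(N - n + 0.5) - log(n + 0.5)."""
    return math.log(corpus_size - doc_freq + 0.5) - math.log(doc_freq + 0.5)

# The corpus above has N = 4 documents; keyword1 and term1 each occur in n = 2.
for n in range(1, 5):
    print(f"n={n}: idf={classic_idf(4, n):.4f}")
```

With N = 4 the value is positive for n = 1, exactly zero for n = 2, and negative for n = 3 and n = 4.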
In fact, any token whose IDF is zero or a small positive value will score lower than a token whose IDF is negative, because negative IDFs are replaced with epsilon * average_idf. I suggest adopting the IDF calculation given on Wikipedia, which would make the distribution of scores per token better calibrated:
https://en.wikipedia.org/wiki/Okapi_BM25

log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
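A rough sketch of the ordering problem and of what the suggested formula changes. The function names and the document frequencies are hypothetical, and epsilon=0.25 is assumed to be the library default; the flooring logic mirrors the description above (negative IDFs replaced by epsilon * average_idf, with the average taken over the raw values).

```python
import math

def floored_idfs(doc_freqs, corpus_size, epsilon=0.25):
    """Classic IDF with negative values replaced by epsilon * average_idf."""
    idf = {w: math.log(corpus_size - n + 0.5) - math.log(n + 0.5)
           for w, n in doc_freqs.items()}
    # The average is taken over the raw values, negatives included.
    floor = epsilon * sum(idf.values()) / len(idf)
    return {w: floor if v < 0 else v for w, v in idf.items()}

def wikipedia_idfs(doc_freqs, corpus_size):
    """IDF as given on the Wikipedia page: log((N - n + 0.5) / (n + 0.5) + 1)."""
    return {w: math.log((corpus_size - n + 0.5) / (n + 0.5) + 1)
            for w, n in doc_freqs.items()}

# Hypothetical document frequencies in a 4-document corpus.
doc_freqs = {"rare_a": 1, "rare_b": 1, "half": 2, "common": 3}

floored = floored_idfs(doc_freqs, corpus_size=4)
wiki = wikipedia_idfs(doc_freqs, corpus_size=4)
for word in doc_freqs:
    print(f"{word}: floored={floored[word]:.4f} wikipedia={wiki[word]:.4f}")
```

Under the floor, "common" (3 of 4 documents) ends up with a small positive IDF while "half" (2 of 4) stays at zero, so the more frequent token contributes more. With the Wikipedia formula the values decrease steadily as document frequency grows and never reach zero.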

@danerlt

danerlt commented May 24, 2024

@nkarahan-ing
Below is the formula for calculating IDF from the BM25 Wikipedia page.

[Image: IDF formula from the Wikipedia page]

I resolved the issue by changing the formula for calculating IDF to math.log(self.corpus_size + 1) - math.log(freq + 0.5).
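This is the same quantity as the Wikipedia formula: adding 1 to (N - n + 0.5) / (n + 0.5) gives (N + 1) / (n + 0.5), and the log of a quotient is the difference of the logs. A quick numeric check of that identity (function names are hypothetical, chosen for this sketch):

```python
import math

def wikipedia_idf(corpus_size: int, doc_freq: int) -> float:
    # log((N - n + 0.5) / (n + 0.5) + 1)
    return math.log((corpus_size - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

def simplified_idf(corpus_size: int, doc_freq: int) -> float:
    # log(N + 1) - log(n + 0.5)
    return math.log(corpus_size + 1) - math.log(doc_freq + 0.5)

def forms_agree(corpus_size: int) -> bool:
    """True if both forms match, and stay positive, for every n in 1..N."""
    return all(
        math.isclose(wikipedia_idf(corpus_size, n),
                     simplified_idf(corpus_size, n),
                     rel_tol=1e-9, abs_tol=1e-12)
        and simplified_idf(corpus_size, n) > 0
        for n in range(1, corpus_size + 1)
    )

print(all(forms_agree(N) for N in (4, 100, 1000)))
```

Because n can never exceed N, N + 1 is always larger than n + 0.5, so this IDF is strictly positive even for a token that appears in every document.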

You can change the _calc_idf method of the BM25Okapi class as follows:

    def _calc_idf(self, nd):
        """
        Calculates the idf of each term from its document frequency in nd.
        Uses log(N + 1) - log(n + 0.5), which is always positive.
        The floor of eps * average_idf is kept from the original method.
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # with this formula idf is never negative (n <= N), so the list stays empty
        negative_idfs = []
        for word, freq in nd.items():
            idf = math.log(self.corpus_size + 1) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps
