Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDF in BM25Okapi is not version from Atire #35

Open
msalwen opened this issue Oct 30, 2023 · 2 comments
Open

IDF in BM25Okapi is not version from Atire #35

msalwen opened this issue Oct 30, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@msalwen
Copy link

msalwen commented Oct 30, 2023

In Trotman, Jia & Crane the idf measure is given as log(N/df_t) (see top of page 5), where N is the corpus size and df_t is the number of docs containing term t. This is always non-negative. In the implementation of BM25Okapi, you have used an earlier version of the idf computation (Robertson-Spark Jones, see eq (4) bottom of page 2 in Trotman, Puurula & Burgess), which can become negative and which you handle by setting negative values to eps.

The balance of the score calculation in BM25Okapi follows Atire exactly, and the implementations for BM25L and BM25+ align perfectly with the descriptions in Trotman, Puurula & Burgess. I wonder why the idf implementation for BM25Okapi deviates from the Atire specification.

@dorianbrown
Copy link
Owner

Sorry for the late reply, but this project has a been a little dormant for me.

When trying to address some issues, I started looking into references of the implementation, and I've run into many different variations of the Okapi/ATIRE algorithm. When implementing this one, I'm not sure why I ended up following the Robertson-Spark Jones formulation, but do I understand correctly that log(N/df_t) implementation is the preferred one in this case?

I'm also wondering if it makes sense to include the mentioned variations in some way (i.e. rsj_idf = True for Robertson-Spark Jones), but for that it would be useful to get some feedback from you guys.

@dorianbrown dorianbrown added the bug Something isn't working label Oct 8, 2024
@dorianbrown
Copy link
Owner

Some of the confusion also came from the wikipedia page citing the Robertson-Spark Jones IDF: https://en.wikipedia.org/wiki/Okapi_BM25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants