TF-IDF Entity Sampling Implementation Sensitive to Row Order Due to tokenList Usage #5

Open
Allaa-boutaleb opened this issue Dec 3, 2024 · 1 comment


Hello, thanks for this impressive contribution! There’s just a small issue I’ve noticed when playing around with the code. I’m not sure if this was intended or not.

The current implementation of tfidf_entity sampling in preprocessor.py has a bug that makes column embeddings sensitive to row ordering, which contradicts the intended column-wise nature of TF-IDF entity sampling.

The problem lies in how tokens are selected after TF-IDF scoring. Currently, the code:

  1. Calculates TF-IDF scores and builds a tokenFreq dictionary sorted by score
  2. Then builds the final token sequence by iterating through tokenList rather than through the sorted tokenFreq, so the output follows the original row order:
tokenFreq = dict(sorted(tokenFreq.items(), key=lambda x: (-x[1], str(x[0]))))
# ...
for t in tokenList:
    if t in tokenFreq and t not in tokens:
        tokens += str(t).split(' ')

This creates two issues:

  1. pandas' unique() preserves the order of first appearance, so the tokenList reflects the original row order
  2. When multiple entities have the same TF-IDF score (common in real data), their order in the final serialization depends on row order instead of the tie-break already applied when sorting tokenFreq
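
To make the two points above concrete, here is a minimal sketch that models only the quoted loop. It is not the repository's code: the function name `serialize_in_row_order`, the sample values, and the scores are made up for illustration, `dict.fromkeys` stands in for pandas' unique() (both keep the order of first appearance), and whatever the elided `# ...` step does is left out.

```python
def serialize_in_row_order(column_values, scores):
    """Simplified model of the quoted loop (hypothetical helper, not the repo's code)."""
    # Stand-in for pandas' unique(): dict.fromkeys keeps first-appearance order.
    token_list = list(dict.fromkeys(column_values))
    # Same sort key as the quoted snippet: score descending, then string value.
    token_freq = dict(sorted(scores.items(), key=lambda x: (-x[1], str(x[0]))))
    tokens = []
    # Iteration follows token_list, so the sorted order of token_freq is never used.
    for t in token_list:
        if t in token_freq and t not in tokens:
            tokens += str(t).split(' ')
    return tokens


# Made-up scores; "paris" and "rome" are tied on purpose.
scores = {"paris": 0.5, "rome": 0.5, "oslo": 0.2}
original = ["paris", "rome", "oslo", "paris"]
shuffled = ["oslo", "rome", "paris", "paris"]  # same column content, rows reordered

print(serialize_in_row_order(original, scores))  # ['paris', 'rome', 'oslo']
print(serialize_in_row_order(shuffled, scores))  # ['oslo', 'rome', 'paris']
```

In this simplified model the two row orders yield different token sequences even though the set of values and their scores are identical.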

As a result, the same table with shuffled rows can produce different serializations and thus different embeddings, despite having identical column content. This affects embedding similarity computations between tables that should be identified as identical.
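
One possible way to remove the dependence on row order, sketched under the same assumptions, is to iterate over the entities in sorted score order instead of over tokenList. This is only a suggestion, not a confirmed fix from the maintainers; `serialize_in_score_order` and the sample data are hypothetical, and the elided `# ...` step is again left out.

```python
def serialize_in_score_order(column_values, scores):
    """Hypothetical alternative: order tokens by score, not by row position."""
    present = set(column_values)  # membership only, so row order is discarded here
    # Same ordering rule as the quoted sort: score descending, ties broken by string value.
    ranked = sorted(
        (t for t in scores if t in present),
        key=lambda t: (-scores[t], str(t)),
    )
    tokens = []
    for t in ranked:
        tokens += str(t).split(' ')
    return tokens


# Made-up scores; "paris" and "rome" are tied on purpose.
scores = {"paris": 0.5, "rome": 0.5, "oslo": 0.2}
original = ["paris", "rome", "oslo", "paris"]
shuffled = ["oslo", "rome", "paris", "paris"]  # same column content, rows reordered

print(serialize_in_score_order(original, scores))  # ['paris', 'rome', 'oslo']
print(serialize_in_score_order(shuffled, scores))  # ['paris', 'rome', 'oslo']
```

With this ordering, tied entities fall back to the string tie-break, so both row orders produce the same serialization.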

Could you please elaborate further on this? Just trying to see if this was intended and I didn’t understand it properly, or if it’s an actual mistake. Thanks!
