TF-IDF Entity Sampling Implementation Sensitive to Row Order Due to tokenList Usage #5

Allaa-boutaleb · 2024-12-03T16:34:22Z

Hello, thanks for this impressive contribution! There’s just a small issue I’ve noticed when playing around with the code. I’m not sure if this was intended or not.

The current implementation of tfidf_entity sampling in preprocessor.py has a bug that makes column embeddings sensitive to row ordering, which contradicts the intended column-wise nature of TF-IDF entity sampling.

The problem lies in how tokens are selected after TF-IDF scoring. Currently, the code:

1. Calculates TF-IDF scores and creates a sorted tokenFreq dictionary
2. But then builds the final token sequence by iterating through tokenList, which preserves the original row order:

tokenFreq = dict(sorted(tokenFreq.items(), key=lambda x: (-x[1], str(x[0]))))
# ...
for t in tokenList:
    if t in tokenFreq and t not in tokens:
        tokens += str(t).split(' ')

This creates two issues:

1. pandas' unique() preserves the order of first appearance, so the tokenList reflects the original row order
2. When multiple entities have the same TF-IDF score (common in real data), their ordering in the final serialization depends on row order rather than being deterministic

As a result, the same table with shuffled rows can produce different serializations and thus different embeddings, despite having identical column content. This affects embedding similarity computations between tables that should be identified as identical.
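To make the effect concrete, here is a minimal standalone sketch of what I mean. It is not the code from preprocessor.py: the function names, the sample column, and the scores are all made up for illustration, and `dict.fromkeys` stands in for pandas' unique() so the snippet runs without pandas. The second function shows one possible fix, assuming the intent is to emit entities in the same (-score, str(entity)) order used to sort tokenFreq.

```python
def first_appearance_unique(values):
    # Stand-in for pandas' unique(): distinct values in order of first appearance.
    return list(dict.fromkeys(values))


def serialize_row_order(column, scores):
    # Mirrors the pattern quoted above: iterate the token list in row order.
    tokens = []
    for t in first_appearance_unique(column):
        if t in scores and t not in tokens:
            tokens += str(t).split(' ')
    return tokens


def serialize_by_score(column, scores):
    # Possible fix (an assumption, not the maintainers' code): iterate entities
    # in the sorted score order, so ties are broken by the entity string
    # instead of by where the entity first appears in the table.
    ranked = sorted(
        (t for t in set(column) if t in scores),
        key=lambda t: (-scores[t], str(t)),
    )
    tokens = []
    for t in ranked:
        tokens += str(t).split(' ')
    return tokens


# Hypothetical scores with a tie between "paris" and "new york".
scores = {"paris": 0.5, "new york": 0.5, "rome": 0.2}

# Same column content, rows shuffled.
column_a = ["paris", "new york", "rome", "paris"]
column_b = ["new york", "rome", "paris", "paris"]

print(serialize_row_order(column_a, scores))  # ['paris', 'new', 'york', 'rome']
print(serialize_row_order(column_b, scores))  # ['new', 'york', 'rome', 'paris']
print(serialize_by_score(column_a, scores))   # ['new', 'york', 'paris', 'rome']
print(serialize_by_score(column_b, scores))   # ['new', 'york', 'paris', 'rome']
```

With the row-order loop, the two shuffles of the same column serialize differently; iterating in sorted order gives the same sequence for both. If the real code truncates the sequence to a maximum length, the row-order version could also change which entities survive the cut, not just their order.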

Could you please elaborate further on this? Just trying to see if this was intended and I didn’t understand it properly, or if it’s an actual mistake. Thanks!
