Hello, thanks for this impressive contribution! There’s just a small issue I’ve noticed when playing around with the code. I’m not sure if this was intended or not.
The current implementation of `tfidf_entity` sampling in `preprocessor.py` has a bug that makes column embeddings sensitive to row ordering, which contradicts the intended column-wise nature of TF-IDF entity sampling.
The problem lies in how tokens are selected after TF-IDF scoring. Currently, the code:

1. Calculates TF-IDF scores and creates a sorted `tokenFreq` dictionary
2. But then builds the final token sequence by iterating through `tokenList`, which preserves the original row order

This creates two issues:

1. pandas' `unique()` preserves the order of first appearance, so `tokenList` reflects the original row order
2. When multiple entities have the same TF-IDF score (common in real data), their ordering in the final serialization depends on row order rather than being deterministic
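To make the symptom concrete, here is a minimal, self-contained sketch of the selection logic as I understand it. The names (`serialize_buggy`, `serialize_fixed`) and the use of plain frequency in place of TF-IDF are mine, not the repo's, so treat this as an illustration of the tie-breaking issue rather than the actual implementation:

```python
from collections import Counter

def serialize_buggy(column, budget=2):
    """Mimics the current behavior: score tokens, but select/emit them
    while iterating in first-appearance (row) order, so ties among
    equal-scored tokens depend on how the rows happen to be ordered."""
    # dict.fromkeys stands in for pandas unique(): first-appearance order
    token_list = list(dict.fromkeys(column))
    freq = Counter(column)  # plain frequency stands in for TF-IDF scores
    # keep the top-`budget` tokens; ties resolve by encounter order
    keep = {tok for tok, _ in sorted(freq.items(), key=lambda kv: -kv[1])[:budget]}
    return [tok for tok in token_list if tok in keep]

def serialize_fixed(column, budget=2):
    """Deterministic variant: rank by (score desc, token asc) so neither
    the selected set nor its order depends on row order."""
    freq = Counter(column)
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:budget]]

col      = ["apple", "pear", "plum", "apple"]   # pear and plum tie at score 1
shuffled = ["plum", "pear", "apple", "apple"]   # same column, rows reordered

print(serialize_buggy(col), serialize_buggy(shuffled))  # differ
print(serialize_fixed(col) == serialize_fixed(shuffled))  # True
```

With the buggy variant, shuffling the rows changes which of the tied tokens survives the budget cut and in what order it is emitted, so the serialized column (and hence its embedding) changes; the fixed variant is invariant to row order.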
As a result, the same table with shuffled rows can produce different serializations and thus different embeddings, despite having identical column content. This affects embedding similarity computations between tables that should be identified as identical.
Could you please elaborate further on this? Just trying to see if this was intended and I didn’t understand it properly, or if it’s an actual mistake. Thanks!