Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages] Spark training ranking broken #6713

Closed
sammynammari opened this issue Feb 19, 2021 · 11 comments · Fixed by #11023 or #11047
Closed

[jvm-packages] Spark training ranking broken #6713

sammynammari opened this issue Feb 19, 2021 · 11 comments · Fixed by #11023 or #11047
Labels
LTR Learning to rank

Comments

@sammynammari
Copy link

sammynammari commented Feb 19, 2021

I've discovered a bug in ranking having to do with how XGBoost internally handles group information when repartitioning.

XGBoost will internally repartition based on group info, which involves aggregating based on the group. This expects that within each partition, the data is sorted by groupId.

However, the data will never be sorted by groupId, because prior to this step, XGBoost will internally perform a repartition which shuffles all of the data. As a result, when training, XGBoost will see that almost every group has only a single element in it, and so training will finish in a couple of iterations with ndcg = 1.

The workaround that I've come up with is to make sure in advance that my data is partitioned such that the number of partitions equals the num_workers param, with each partition sorted by groupId. This will internally bypass the repartition that randomly shuffles all of the groups.

It looks like this issue may have been raised before: #3489

I am using Spark 2.4.0 and XGBoost 1.1.2 for Scala 2.11.

@trivialfis
Copy link
Member

Not familiar with the spark package. It seems to be fixed before in the issue you linked, so this might be a regression? Are the tests added in efc4f85 useful in detecting such errors?

@sammynammari
Copy link
Author

sammynammari commented Feb 19, 2021

I don't think the issue I linked was fixed, because the commit added an iterator which expects data to be sorted within partitions.

The first unit test is collecting the entire RDD, flattening it, and then sorting by groupId before calling any asserts. So, all partition information is lost in the unit test. The second unit test just checks if the booster is null. As far as I can tell, neither of these unit tests would detect an issue related to ordering by groups within partitions.

@trivialfis
Copy link
Member

Not sure if it's helpful. On XGBoost's dask interface (another distributed package) I just give up the use of group structure and use query id directly. Users have to sort the data according to query id first, but most of the time that's already done during preprocessing so I think it's fine.

@sammynammari
Copy link
Author

@trivialfis thanks for the suggestion, but from what I can tell, the Spark package requires that a group column be set on the dataframe. I don't think there's any notion of a qid like in the libsvm data format.

@trivialfis trivialfis added the LTR Learning to rank label Mar 30, 2021
@voganrc
Copy link

voganrc commented Jul 9, 2021

+1

I encountered the same issue when upgrading my LTR model from XGBoost 0.82 to 1.0, and @sammynammari's workaround fixes it.

I think a patch needs to be made for the problematic line that was added in this commit, and I'm surprised that more people haven't encountered this bug.

@trivialfis
Copy link
Member

Hi, if I were to replace the group structure by qid would it be a huge problem?

@sammynammari
Copy link
Author

sammynammari commented Jul 16, 2021 via email

@trivialfis
Copy link
Member

I thought groupId and qid were the same concept? What is the difference that you are suggesting?

Yes, they are the same concept. It's easier to simplify the handling logic if each same has an associated qid in sorted order.

@sammynammari
Copy link
Author

Hi, any update here?

@trivialfis
Copy link
Member

I'm revamping the LTR implementation and tracking related issues here https://github.com/dmlc/xgboost/projects/8 .

@wbo4958
Copy link
Contributor

wbo4958 commented Nov 27, 2024

Hi guys, Sorry for late response, I made a PR to fix this issue, could you help test it? #11023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LTR Learning to rank
Projects
None yet
4 participants