[jvm-packages] Spark training ranking broken #6713

sammynammari · 2021-02-19T01:19:41Z

I've discovered a bug in ranking having to do with how XGBoost internally handles group information when repartitioning.

XGBoost will internally repartition based on group info, which involves aggregating based on the group. This expects that within each partition, the data is sorted by groupId.

However, the data will never be sorted by groupId, because prior to this step, XGBoost will internally perform a repartition which shuffles all of the data. As a result, when training, XGBoost will see that almost every group has only a single element in it, and so training will finish in a couple of iterations with ndcg = 1.
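To illustrate the failure mode described above, here is a minimal pure-Python sketch (not XGBoost or Spark code). It assumes the group aggregation behaves like a run-length count over consecutive equal group ids; the helper `group_sizes` and the toy data are hypothetical.

```python
import random
from itertools import groupby


def group_sizes(group_ids):
    """Sizes of consecutive runs of equal group ids.

    This mimics an aggregation that assumes rows of one group are adjacent.
    """
    return [len(list(run)) for _, run in groupby(group_ids)]


# Hypothetical data: 100 queries with 10 rows each, already sorted by group id.
rows = [gid for gid in range(100) for _ in range(10)]
sorted_sizes = group_sizes(rows)

# Simulate a repartition that shuffles rows without regard to group id.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
shuffled_sizes = group_sizes(shuffled)

print("sorted:   groups =", len(sorted_sizes), "largest =", max(sorted_sizes))
print("shuffled: groups =", len(shuffled_sizes), "largest =", max(shuffled_sizes))
```

On sorted input the sketch recovers 100 groups of 10 rows. After the shuffle nearly every row forms its own "group", which is consistent with training stopping early at ndcg = 1.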

The workaround that I've come up with is to make sure in advance that my data is partitioned such that the number of partitions equals the num_workers param, with each partition sorted by groupId. This will internally bypass the repartition that randomly shuffles all of the groups.
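The layout this workaround aims for can be sketched in plain Python (a simulation, not the Spark API). It assumes integer group ids and that every group is kept whole in one partition; `partition_by_group` and the sample rows are hypothetical names used only for illustration.

```python
from itertools import groupby


def partition_by_group(rows, num_workers):
    """Place rows into num_workers partitions, keeping each group together.

    rows is a list of (group_id, features) tuples with integer group ids.
    Each partition is then sorted by group id, so rows of one group are adjacent.
    """
    partitions = [[] for _ in range(num_workers)]
    for gid, features in rows:
        partitions[gid % num_workers].append((gid, features))
    for part in partitions:
        part.sort(key=lambda row: row[0])  # stable sort: in-group order is kept
    return partitions


# Hypothetical unsorted input: (group_id, features)
rows = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e"), (2, "f"), (4, "g")]
partitions = partition_by_group(rows, num_workers=2)

for i, part in enumerate(partitions):
    sizes = [(gid, len(list(run))) for gid, run in groupby(part, key=lambda r: r[0])]
    print("partition", i, "group sizes:", sizes)
```

In Spark the equivalent would be a repartition keyed on the group column into `num_workers` partitions, followed by a sort within each partition. The exact calls depend on the Spark and XGBoost versions in use.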

It looks like this issue may have been raised before: #3489

I am using Spark 2.4.0 and XGBoost 1.1.2 for Scala 2.11.

trivialfis · 2021-02-19T01:29:34Z

Not familiar with the spark package. It seems to be fixed before in the issue you linked, so this might be a regression? Are the tests added in efc4f85 useful in detecting such errors?

sammynammari · 2021-02-19T01:44:51Z

I don't think the issue I linked was fixed, because the commit added an iterator which expects data to be sorted within partitions.

The first unit test is collecting the entire RDD, flattening it, and then sorting by groupId before calling any asserts. So, all partition information is lost in the unit test. The second unit test just checks if the booster is null. As far as I can tell, neither of these unit tests would detect an issue related to ordering by groups within partitions.
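A check that could catch this would have to inspect each partition separately instead of collecting and re-sorting everything. The sketch below shows one such assertion in plain Python. It is not taken from the XGBoost test suite; `groups_intact` is a hypothetical helper, and partitions are modeled as lists of group ids.

```python
from itertools import groupby


def groups_intact(partitions):
    """Return True if every group id forms exactly one contiguous run
    and does not appear in more than one partition."""
    seen = set()
    for part in partitions:
        run_ids = [gid for gid, _ in groupby(part)]
        if len(run_ids) != len(set(run_ids)):
            return False  # a group is broken into several runs in this partition
        if seen.intersection(run_ids):
            return False  # a group is spread over more than one partition
        seen.update(run_ids)
    return True


print(groups_intact([[1, 1, 2], [3, 3]]))  # groups contiguous and not split
print(groups_intact([[1, 2, 1]]))          # group 1 interrupted by group 2
print(groups_intact([[1, 1], [1, 2]]))     # group 1 appears in two partitions
```

Because the check runs before any global sort, it keeps the partition information that the first existing test discards.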

trivialfis · 2021-02-19T02:55:08Z

Not sure if this is helpful, but in XGBoost's dask interface (another distributed package) I gave up on the group structure and use the query id directly. Users have to sort the data by query id first, but most of the time that's already done during preprocessing, so I think it's fine.

sammynammari · 2021-02-19T18:44:47Z

@trivialfis thanks for the suggestion, but from what I can tell, the Spark package requires that a group column be set on the dataframe. I don't think there's any notion of a qid like in the libsvm data format.

voganrc · 2021-07-09T20:10:32Z

+1

I encountered the same issue when upgrading my LTR model from XGBoost 0.82 to 1.0, and @sammynammari's workaround fixes it.

I think a patch needs to be made for the problematic line that was added in this commit, and I'm surprised that more people haven't encountered this bug.

trivialfis · 2021-07-11T06:09:16Z

Hi, if I were to replace the group structure by qid would it be a huge problem?

sammynammari · 2021-07-16T22:28:32Z

Hi, thanks for looking at this again. I’m not sure I understand the question - I thought groupId and qid were the same concept? What is the difference that you are suggesting?

trivialfis · 2021-07-23T08:32:38Z

> I thought groupId and qid were the same concept? What is the difference that you are suggesting?

Yes, they are the same concept. It's easier to simplify the handling logic if each sample has an associated qid and the samples are in sorted order.

sammynammari · 2023-01-25T18:30:44Z

Hi, any update here?

trivialfis · 2023-01-26T02:47:44Z

I'm revamping the LTR implementation and tracking related issues here https://github.com/dmlc/xgboost/projects/8 .

wbo4958 · 2024-11-27T03:23:25Z

Hi all, sorry for the late response. I made a PR to fix this issue; could you help test it? #11023

trivialfis added the LTR Learning to rank label Mar 30, 2021

trivialfis mentioned this issue Nov 19, 2024

[jvm-packages] Fix partition related issue #9491

Open

wbo4958 mentioned this issue Nov 27, 2024

[jvm-packages] LTR: distribute the features with same group into same partition #11023

Merged

trivialfis closed this as completed in #11023 Dec 3, 2024

wbo4958 mentioned this issue Dec 4, 2024

[pyspark] LTR: distribute the features with same group into same partition #11047

Merged
