-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[jvm-packages] Spark training ranking broken #6713
Comments
Not familiar with the spark package. It seems to be fixed before in the issue you linked, so this might be a regression? Are the tests added in efc4f85 useful in detecting such errors? |
I don't think the issue I linked was fixed, because the commit added an iterator which expects data to be sorted within partitions. The first unit test is collecting the entire RDD, flattening it, and then sorting by |
Not sure if it's helpful. On XGBoost's dask interface (another distributed package) I just give up the use of group structure and use query id directly. Users have to sort the data according to query id first, but most of the time that's already done during preprocessing so I think it's fine. |
@trivialfis thanks for the suggestion, but from what I can tell, the Spark package requires that a |
+1 I encountered the same issue when upgrading my LTR model from XGBoost 0.82 to 1.0, and @sammynammari's workaround fixes it. I think a patch needs to be made for the problematic line that was added in this commit, and I'm surprised that more people haven't encountered this bug. |
Hi, if I were to replace the group structure by qid would it be a huge problem? |
Hi, thanks for looking at this again. I’m not sure I understand the question - I thought groupId and qid were the same concept? What is the difference that you are suggesting?
… On Jul 10, 2021, at 11:09 PM, Jiaming Yuan ***@***.***> wrote:
Hi, if I were to replace the group structure by qid would it be a huge problem?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Yes, they are the same concept. It's easier to simplify the handling logic if each same has an associated qid in sorted order. |
Hi, any update here? |
I'm revamping the LTR implementation and tracking related issues here https://github.com/dmlc/xgboost/projects/8 . |
Hi guys, Sorry for late response, I made a PR to fix this issue, could you help test it? #11023 |
I've discovered a bug in ranking having to do with how XGBoost internally handles group information when repartitioning.
XGBoost will internally repartition based on group info, which involves aggregating based on the group. This expects that within each partition, the data is sorted by
groupId
.However, the data will never be sorted by
groupId
, because prior to this step, XGBoost will internally perform a repartition which shuffles all of the data. As a result, when training, XGBoost will see that almost every group has only a single element in it, and so training will finish in a couple of iterations with ndcg = 1.The workaround that I've come up with is to make sure in advance that my data is partitioned such that the number of partitions equals the
num_workers
param, with each partition sorted bygroupId
. This will internally bypass the repartition that randomly shuffles all of the groups.It looks like this issue may have been raised before: #3489
I am using Spark 2.4.0 and XGBoost 1.1.2 for Scala 2.11.
The text was updated successfully, but these errors were encountered: