-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu #10639
Conversation
Hi @trivialfis @hcho3 @dotbg, Could you kindly help review this PR, Thx very much. |
Hi there, The CI got passed, could you @trivialfis @hcho3 @dotbg help review this PR? Thx very much. |
Apologies for the delay, I will look at it this Friday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to go through this rework but I'm afraid it's beyond my expertise. Could you please update the examples and documents so that we can get a feel of what's going on with the new interface? Also, we might need to schedule an online interview, @hcho3 might want to join as well.
Yes, I would love to join. |
Some questions about build artifacts and packaging:
|
@trivialfis Do you have a particular agenda in mind for the online interview? Would it be primarily a Q&A session? I can set up an Outlook invite to choose the best time for all of us. |
For breaking changes PR, I am mostly concerned about the user interface, hence the request for documentation:
And we will need to have a clear picture for both CPU and GPU. As for internal code structure, i don't have any concern. It looks readable to me and I'm fairly sure that @wbo4958 does a better job than me for JVM based projects. If there's no way to even submit an error for breaking change, should we publish a different package on maven? |
Sorry for delay, I was on vacation these days. |
[Bobby] Yes, we still need to build tow libxgboost4j.so
[Bobby] the libxgboost4j.so with cuda will reside in xgboost4j package. xgboost4j-spark-gpu now depends on xgboost4j/xgboost4j-spark
[Bobby] what do you mean how long? |
@trivialfis @hcho3, yeah, let's have a online review session. |
CPU users will complain about the size of xgboost4j package if we choose this route. |
Not exactly. When compiling the CPU versions, there's no cuda things in libxgboost4j.so which will be put into public/nightly places. But for GPU version, we first compile libxgboost4j.so with CUDA and put it into xgboost4j package, but we don't distribute this xgboost4j to the maven repo, instead, we will only distribute xgboost4j-spark-gpu package to the maven, which is a uber package including xgboost4j/xgboost4j-spark. |
Hey @trivialfis @hcho3, CI passed, Could you help review it? Thx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please help address the implication for the following items:
- Remove any ETL operations in XGBoost
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please help address the implication for the following items:
- Remove any ETL operations in XGBoost
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
...xgboost4j-spark-gpu/src/test/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPluginSuite.scala
Outdated
Show resolved
Hide resolved
jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/Utils.scala
Show resolved
Hide resolved
...packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostEstimator.scala
Show resolved
Hide resolved
.../xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/params/TreeBoosterParams.scala
Show resolved
Hide resolved
For example, ETL includes "caching dataset", and collecting group for ranking issue. |
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Show resolved
Hide resolved
@WeichenXu123 @rongou @Craigacp It would be great if we could get some help reviewing the PR when you are available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too familiar with this part of the code base, but overall looks good.
I haven't written any Spark or Scala code since Spark 1.4 so I don't think I'll have much ability to review this PR. |
I've started reviewing the spark-related code, but it takes a while |
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
...xgboost4j-spark-gpu/src/test/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPluginSuite.scala
Outdated
Show resolved
Hide resolved
Hi @trivialfis, I've fixed all the comments. |
@wbo4958 Where is the definition of xgboost ranker? |
@trivialfis, I will add it in the following PR, I didn't want to blow up this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have more to go through but here are a batch of comments (some questions).
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
jvm-packages/xgboost4j-spark-gpu/src/main/java/ml/dmlc/xgboost4j/java/QuantileDMatrix.java
Outdated
Show resolved
Hide resolved
jvm-packages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/QuantileDMatrix.scala
Outdated
Show resolved
Hide resolved
jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostPlugin.scala
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Outdated
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Show resolved
Hide resolved
...ages/xgboost4j-spark-gpu/src/main/scala/ml/dmlc/xgboost4j/scala/spark/GpuXGBoostPlugin.scala
Show resolved
Hide resolved
jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala
Outdated
Show resolved
Hide resolved
For ranking task, I think the library should make sure the records with the same group column go to the same task, otherwise it will impact the model performance as we describe in 9491 . It looks like this PR didn't fix this issue yet. |
I will look into the ranking later. In theory, as long as the partitions for each local workers are sorted, the gradient is considered correct. |
Huge thanks to @dotbg @eordentlich for the reviews and for catching some issues early. I wouldn't be able to do that myself. Could you please help approve the PR for spark hanlding? I don't have any more comments about logic related to XGBoost itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes looks great. 2 more minor comments.
...packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressor.scala
Outdated
Show resolved
Hide resolved
...packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostEstimator.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Hi @trivialfis, could you help merge it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the nice rework! Looking forward to the complete implementation.
About this PR
Features and Fixes
Introduce an abstract XGBoost Estimator:
Update to the latest XGBoost parameters
Address the missing value handling:
Remove any ETL operations in XGBoost
Rework the GPU plugin:
Expand sanity tests for CPU and GPU consistency:
Breaking changes
Remove some unused parameters like Rabit related parameters and train_test_ratio. etc.
Separate XGBoost-spark parameter from the whole XGBoost parameters, and make the constructor of XGBoost estimator only accept the XGBoost parameters instead of parameters for xgboost4j-spark. Eg,
or
Test this PR