-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Call for help] Bringing the R package up-to-date. #9475
Comments
As a side note, at the moment I'm the maintainer of the CRAN package. If there are contributors who want to share this I'm more than happy to assist and change the maintainership. I have some familiarity with R for statistical computing, but it's not my daily drive and it's unlikely that I will gain a deep understanding of the community in near future. |
👋🏻 I'd be happy to help out more with the R package, if you're open to it. I'll try some more small PRs to get familiar with XGBoost generally and the R package specifically before addressing the specific features you've listed. |
@jameslamb Thank you for working on the R package! Feel free to ping me if there's anything I can help. |
@trivialfis I'm interested in tackling the first issue: adding categorical support to the existing R implementation. I should be able to get it done within the next month or two. Are you open to me creating a draft PR in a fork and pinging you + @jameslamb when it's ready? |
Thank you for volunteering! Feel free to ping me. |
Great initiative, thanks a lot @trivialfis :
|
Sure I could help if needed. But I think a first step would be to decide on what the final result should look like, how far are the changes meant to be, and whether we're going to hold off the R release until all changes are complete (e.g. by keeping a separate branch) or introduce them gradually alongside with releases; or maybe even release under a different name like Right now, Changing the interface to be more idiomatic would be nicer for users, and I think this major version jump is the best chance to introduce big changes as there probably won't be any next major release any time soon, if ever. However, there's a lot of packages that depend on So I'd say let's first clarify on what's acceptable to change and what isn't. For example, off the top of my mind, I can think of the following un-idiomatic behaviors that could be improved:
Some of these would involve large breakage in reverse dependencies, while others wouldn't, but I think the biggest gains here are precisely the ones that would cause breakage. |
I am personally in favor of creating |
@hcho3: I wanted to propose {xgboost2} as well, but did not dare :-) |
I'm happy with xgboost2. Should it be a different package or can live side by side with the existing one? |
@trivialfis We can host the R code in this repository, but we'd submit the code as |
I will leave that for the actual implementation to decide. I'm fine with having a different CRAN package or continue to use the existing one with a new interface. |
I see. So we could define new functions suffixed with
What do you think? @david-cortes @mayer79 @jameslamb @dfsnow |
@trivialfis Also, we need to decide whether we want to try to submit the current XGBoost 2.0 to CRAN. My guess is that it would be a lot of work; correct me if I'm wrong. For example, we'd need to submit pull requests to reverse dependencies. The new requirement for limiting multithreading on the test farm is proving to be a headache as well. My personal preference is to not bother with submitting current XGBoost code to CRAN. Let's start making great changes to the R package now, and later we can submit |
This one is solved after many debugging efforts during the 2.0 release. The reverse dependency is a little bit soul-crushing though, which I haven't been able to finish.
I'm totally on board with this, it's just whether @david-cortes and @mayer79 want to take an incremental approach or start anew, or somewhere in the middle with a transition peroid. Either way, I'm happy to assist and will try to support whatever decision is made by the implementor. |
We would probably write the train/cv-API from scratch, but recycle the internals that wrap the C++ functions. I will start to study the current functions this weekend. |
I don't know what's the best way forward here. I see several potential routes. (a) Release under name
(b) Release an
(c) Release under name
(d) Release under name
From these options, I guess (d) is the most "correct" one. (c) is very tempting for me. (a) is suboptimal for everyone but perhaps a feasible compromise. (b) I don't like but I guess it's the way most high-profile packages operate. Regardless of the choice, it'd nevertheless be helpful if we could come up with a roadmap with checklists of all the things that need to be changed, and perhaps a discussion thread to discuss about what changes are needed or welcomed (e.g. to discuss about how to deal with categorical features in the hypothetical formula interface). And of course would be helpful if this roadmap could be kept up-to-date from those discussions and updated as PRs are merged. And also to make a distinction between user-facing and non-user-facing - for example, using inplace predict makes no difference in the external interface, and would thus not block a release or block other PRs, whereas breaking down the model-fitting function into a formula and an x/y interface would be a big user-facing change. As a different topic, I am also wondering about timelines and mode of development. For example, In the case of I'd say introducing a new interface in one go where everyone in this thread makes comments would make things easier, but it's more prone to bugs and hard to review. Not sure what maintainers here would think though. On yet another topic, I am wondering if it'd be possible to get helper features in the C interface from the core xgboost developers in order to improve the R interface. For example, I had a previous (unsuccessful) go at modifying parts of the Likewise, it'd be helpful if for example the C interface could take a parameter As yet another comment, the R interface currently has a divison between |
We can mitigate the disadvantage of this approach by first doing a design review. Contributors would review the proposed interface and see if it is reasonable. As the most impactful changes are user-facing, design review will be useful. Personally, as long as the new interface is more ergonomic and idiomatic R, I would not quibble with minor implementation details too much. I am no expert of R, and I would have to defer to you when it comes to nitty gritty details of programming in R. |
Yes, I am more than happy to help. |
Let's start with this. I will look into the issue. Related #9640 .
I think this will probably be an R-specific issue, considering that it's the only interface that uses a 1-based index. I don't think solving it in C is easier or cleaner than solving it in R. Maybe we can have some utilities in R to handle all indexing. |
@david-cortes Thank you for the very detailed and informative breakdown of various options and obstacles. Here's my two cents: I think we all agree that a huge set of breaking changes is inevitable. It's not quite productive to think about compatibility when the goal is to rewrite everything, The question is how to handle reverse dependencies after we introduce the brand-new interface. This is orthogonal to how we will develop the new R package, we can worry about the CRAN package after we have finalized the new interface. The development of the new package should focus on the package itself and bear no historical burden. So, let's take a step back and focus on the new interface for now. For the actual development, I agree that either a checklist or a design doc would be helpful for everyone to see the direction and status. We can also discuss some technical details in the doc. Feel free to open a new issue for this. (or use google doc if you prefer). For instance, we can sort out the details for the predict function and what's needed to implement it in the design doc.
After having the tracking issue/design doc, I think we should merge PRs progressively even if there's a feature-complete prototype (we can split it up into multiple PRs). This is a good practice for reviews and for organizing code for unit tests. In the end, I think we will follow the usual steps:
I can host online chats using teams if needed, particularly for working on C-related primitives. ;-) |
Just to clarify, I don't mean to break everyone's code. I think we will either launch a new package or launch a new interface without touching the existing one. Among the options listed by @david-cortes , I think we can agree that we will have two interfaces by the end, the question is how to introduce the new one to everyone. |
Created a different issue to discuss details of what the R interface should look like and discuss about some potential issues in the implementation: Would be ideal if people in this thread could take a look and comment there before starting with the coding. |
Given the plan to create an entirely new interface, is there still a reason to work on the existing one? Specifically, I planned to add categorical support sometime before the end of the year, is there still an appetite for that PR? As an aside, while I love the idea of a better maintained/more idiomatic R interface, I'm a bit nervous about a full rewrite. My concerns as a mostly uninformed outside observer are:
Basically, I'm worried that what we'll end up with is a a mostly unmaintained |
It doesn't need to be a full re-write since ideally the higher-level interface would be calling the lower-level interface (which likely won't end up with many changes from current code) in most cases and adding extra things on top. So features like adding categorical features for |
Probably need some more fixes on 1.7: Dear maintainer, Please see the problems shown on Please correct before 2023-12-12 to safely retain your package on CRAN. Best, |
We are maintaining the R package with the best effort, but due to resource limitations, the R package feature has been lagging behind compared to the Python package. Here is a list of things that can be improved. Contributions are appreciated!
factor
type and pass related information into xgboost.QuantileDMatrix
support can help R users to reduce memory usage.DataIter
.The text was updated successfully, but these errors were encountered: