[RF] Random Forest hangs during fit on KDDCup09-Upselling #984
Comments
Hi @Innixma, thanks for creating an issue; the problem will be analyzed.
Hey @Innixma, I gave it a try and trained an RF on this dataset. I do not see the issue you are reporting. I'm measuring 11m41s training time for the optimized version vs. 10m54s on stock. The performance drop is already reported in #1050. Keep in mind that you're running on a huge dataset. I'm using a machine with 250 GB of memory, of which 50 GB were allocated at the time of running the tests. It's very likely that the 32 GB of your EC2 instance just don't cut it.
Thanks @ahuber21! I will revisit this topic once the accuracy drop issue has been resolved, as I can then run the benchmark again and see if the hang still occurs. It is reasonable to suspect a memory issue, and if it occurs again I'll try differently sized instances to see where the breaking point is and whether native RF uses more or less memory than scikit-learn-intelex RF.
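For anyone wanting to check the memory hypothesis before resizing instances, here is a minimal sketch of one way to record wall time and peak resident memory around a single fit call. It uses only the Python standard library and is Unix-only (the `resource` module). The helper name `measure` and the stand-in workload are hypothetical and not part of scikit-learn or scikit-learn-intelex; the real fit call would go where the stand-in is.

```python
import resource
import sys
import time


def measure(fit_fn):
    """Run fit_fn() once and return (result, elapsed_seconds, peak_rss_mib).

    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    result = fit_fn()
    elapsed = time.perf_counter() - start
    # ru_maxrss is the high-water mark for the whole process so far,
    # reported in kilobytes on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return result, elapsed, peak / divisor


def stand_in_fit():
    # Placeholder workload; replace with the actual model fit,
    # e.g. a RandomForestClassifier(...).fit(X, y) call.
    data = [0] * 1_000_000
    return len(data)


if __name__ == "__main__":
    _, seconds, peak_mib = measure(stand_in_fit)
    print(f"fit took {seconds:.2f}s, peak RSS {peak_mib:.0f} MiB")
```

Because the peak value is process-wide and never decreases, each backend (stock scikit-learn and the patched version) should be measured in its own fresh Python process, otherwise the second measurement just repeats the larger of the two peaks.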
Hey @Innixma, the accuracy drop is understood and stems from an optimization we do for performance that does not translate to all use cases. I am working on an API change that will give the user more control so they can trade performance vs. accuracy themselves. But for now, you could run on KDDCup09-Upselling again with 2023.1.1 to see if the performance improved. In a second run, you could modify https://github.com/intel/scikit-learn-intelex/blob/master/daal4py/sklearn/ensemble/_forest.py#L254 and set As I said, we're working on an update to make this less messy and more transparent.
That's great to hear @ahuber21! I will plan to test it once it is part of an official release. We are working on automating our benchmarking logic, which should enable us to test these options relatively easily once it is available in the next few months. We are planning to release AutoGluon v1.0 by EOY. I think our team will focus its efforts on determining which backend we use for each model type prior to the v1.0 release, probably starting around July, and at that point we will do a deep dive comparing native scikit-learn with scikit-learn-intelex and potentially other packages. At minimum this comparison would include RandomForest, ExtraTrees, KNearestNeighbors, and LinearRegression/LogisticRegression, but could potentially include more. The things we are looking for to determine which backend to use will be in the following priority:
Describe the bug
scikit-learn-intelex Random Forest hangs on the KDDCup09 dataset (a large 2 GB binary classification dataset in AutoMLBenchmark)
To Reproduce
Steps to reproduce the behavior:
Hyperparameters:
Note: You likely need to preprocess the dataset before sending it to RF. I am using AutoGluon to do this automatically as I'm testing scikit-learn-intelex RF integration. This may be easier to test in a couple weeks after AutoGluon v0.4.0 releases, as it will include a toggle to enable intelex RF.
Expected behavior
If not using intelex and just sklearn (with same hyperparameters), the model trains quickly and has no issues:
Output/Screenshots
Environment:
pip freeze: