Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KMeans Sparse Init Update #2796

Conversation

md-shafiul-alam
Copy link
Contributor

Description

The pull request fixes the difference in centroids between dense and sparse method for KMeans++ init method.

Changes proposed in this pull request:

  • bug fix daal kmeans++ init for sparse data
  • allow oneDAL to pass nTrials to match with Scikit-learn

@md-shafiul-alam md-shafiul-alam changed the title [bug] KMeans Sparse Init Update KMeans Sparse Init Update May 21, 2024
@md-shafiul-alam
Copy link
Contributor Author

/intelci: run

Copy link
Contributor

@Vika-F Vika-F left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modifications look good to me.
It would be better to add a couple of comments though.

Comment on lines 48 to 53
std::int64_t trial_count = desc.get_local_trials_count();
if (trial_count == -1)
{
const auto additional = std::log(cluster_count);
trial_count = 2 + std::int64_t(additional);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the comment describing the logic behind this. Maybe the link to Sklearn docs.

@@ -292,15 +292,21 @@ public:
const algorithmFPType * const pLastAddedCenter, const algorithmFPType * const aWeights,
const algorithmFPType * const pDistSqBest)
{
algorithmFPType sumOfDist2 = algorithmFPType(0);
size_t csrCursor = 0u;
algorithmFPType sumOfDist2 = algorithmFPType(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the description to updateMinDistForITrials function.
I understand that it was not there before, but I hope that by adding couple of comments at a time we can make oneDAL's code more readable.

Comment on lines +48 to +55
//number of trials to pick each centroid from, 2 + int(ln(cluster_count)) works better than vanilla kmeans++
//https://github.com/scikit-learn/scikit-learn/blob/a63b021310ba13ea39ad3555f550d8aeec3002c5/sklearn/cluster/_kmeans.py#L108
std::int64_t trial_count = desc.get_local_trials_count();
if (trial_count == -1) {
const auto additional = std::log(cluster_count);
trial_count = 2 + std::int64_t(additional);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of improvements this give? It seems this changes original behavior. Please also reflect it in the documentation if not already done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually an original behavior in daal and oneDAL distributed, somehow was missed on oneDAL CPU

Copy link
Contributor

@samir-nasibli samir-nasibli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@md-shafiul-alam it is labeled like bug. Is it buggy or just fixes the bug?

Unfortunately we don't have good description for some labels, but should be addressed further.

@md-shafiul-alam
Copy link
Contributor Author

@md-shafiul-alam it is labeled like bug. Is it buggy or just fixes the bug?

Unfortunately we don't have good description for some labels, but should be addressed further.

This fixes the bug

@md-shafiul-alam
Copy link
Contributor Author

Changes have been incorporated in PR#2815

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants