KMeans Sparse Init Update #2796

md-shafiul-alam · 2024-05-21T13:55:07Z

Description

This pull request fixes a discrepancy between the centroids produced by the dense and sparse code paths of the KMeans++ init method.

Changes proposed in this pull request:

Fix a bug in the DAAL KMeans++ init for sparse data.
Allow oneDAL to pass nTrials so that it matches scikit-learn.
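
The nTrials default discussed in this PR can be sketched as follows. This is a minimal illustration, not the oneDAL code: the helper name `resolve_trial_count` is hypothetical, and treating -1 as "use the default" is taken from the diff under review. The formula mirrors scikit-learn's `2 + int(np.log(n_clusters))`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not part of oneDAL): resolves the number of local trials
// used when picking each KMeans++ centroid.
// Assumption: a requested value of -1 means "fall back to the default".
inline std::int64_t resolve_trial_count(std::int64_t requested, std::int64_t cluster_count) {
    assert(cluster_count >= 1);
    if (requested != -1) {
        return requested; // the caller chose an explicit trial count
    }
    // Natural logarithm, truncated toward zero by the integer conversion.
    const double additional = std::log(static_cast<double>(cluster_count));
    return 2 + static_cast<std::int64_t>(additional);
}
```

With this sketch, 8 clusters give 4 trials (ln 8 is roughly 2.08), and any explicit non-default value is returned unchanged.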

md-shafiul-alam · 2024-05-22T11:56:17Z

/intelci: run

cpp/daal/src/algorithms/kmeans/kmeans_plusplus_init_impl.i

cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp

Vika-F

The modifications look good to me.
It would be better to add a couple of comments though.

Vika-F · 2024-05-24T11:13:50Z

cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp

+    std::int64_t trial_count = desc.get_local_trials_count();
+    if (trial_count == -1)
+    {
+        const auto additional = std::log(cluster_count);
+        trial_count = 2 + std::int64_t(additional);
+    }


Please add a comment describing the logic behind this, perhaps with a link to the scikit-learn docs.

Vika-F · 2024-05-24T11:18:25Z

cpp/daal/src/algorithms/kmeans/kmeans_plusplus_init_impl.i

@@ -292,15 +292,21 @@ public:
                                            const algorithmFPType * const pLastAddedCenter, const algorithmFPType * const aWeights,
                                            const algorithmFPType * const pDistSqBest)
    {
-        algorithmFPType sumOfDist2 = algorithmFPType(0);
-        size_t csrCursor           = 0u;
+        algorithmFPType sumOfDist2            = algorithmFPType(0);


Please add a description of the updateMinDistForITrials function.
I understand that it was not there before, but I hope that by adding a couple of comments at a time we can make oneDAL's code more readable.
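
For readers unfamiliar with the step that a function like this typically performs, here is a minimal sketch of the per-trial evaluation in greedy KMeans++, as done in scikit-learn. It is not the oneDAL kernel: it uses dense row-major data for brevity, all names (`candidate_potential`, `pick_best_trial`) are hypothetical, and applying the weight to the clipped distance is an assumption borrowed from scikit-learn rather than something confirmed by this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: for one candidate center, returns the weighted sum over rows of
// min(dist_sq_best[i], squared distance from row i to the candidate).
// data is row-major with n_rows x n_cols values; empty weights mean "all ones".
double candidate_potential(const std::vector<double>& data, std::size_t n_rows, std::size_t n_cols,
                           const std::vector<double>& candidate,
                           const std::vector<double>& weights,
                           const std::vector<double>& dist_sq_best) {
    assert(data.size() == n_rows * n_cols);
    assert(candidate.size() == n_cols);
    assert(dist_sq_best.size() == n_rows);
    assert(weights.empty() || weights.size() == n_rows);
    double sum = 0.0;
    for (std::size_t i = 0; i < n_rows; ++i) {
        double d2 = 0.0;
        for (std::size_t j = 0; j < n_cols; ++j) {
            const double diff = data[i * n_cols + j] - candidate[j];
            d2 += diff * diff;
        }
        const double w = weights.empty() ? 1.0 : weights[i];
        // Keep the smaller of the current best distance and the distance to the candidate.
        sum += w * std::min(dist_sq_best[i], d2);
    }
    return sum;
}

// Returns the index of the trial with the smallest potential (first one on ties).
std::size_t pick_best_trial(const std::vector<double>& potentials) {
    assert(!potentials.empty());
    std::size_t best = 0;
    for (std::size_t t = 1; t < potentials.size(); ++t) {
        if (potentials[t] < potentials[best]) {
            best = t;
        }
    }
    return best;
}
```

In the greedy scheme, this potential is computed once per trial, and the candidate with the lowest value becomes the next centroid.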

samir-nasibli · 2024-06-19T08:22:34Z

cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp

+    //number of trials to pick each centroid from, 2 + int(ln(cluster_count)) works better than vanilla kmeans++
+    //https://github.com/scikit-learn/scikit-learn/blob/a63b021310ba13ea39ad3555f550d8aeec3002c5/sklearn/cluster/_kmeans.py#L108
+    std::int64_t trial_count = desc.get_local_trials_count();
+    if (trial_count == -1) {
+        const auto additional = std::log(cluster_count);
+        trial_count = 2 + std::int64_t(additional);
+    }
+


What kind of improvement does this give? It seems this changes the original behavior. Please also reflect it in the documentation if that has not already been done.

This is actually the original behavior in DAAL and in distributed oneDAL; it was somehow missed on oneDAL CPU.

samir-nasibli

@md-shafiul-alam it is labeled as a bug. Is the PR itself buggy, or does it fix a bug?

Unfortunately we don't have good descriptions for some labels; that should be addressed separately.

md-shafiul-alam · 2024-06-20T07:34:35Z

This fixes the bug

md-shafiul-alam · 2024-06-20T07:36:00Z

The changes have been incorporated into PR #2815.

bug fix in kmeans sparse init

2f52a21

md-shafiul-alam added the bug label May 21, 2024

md-shafiul-alam changed the title ~~[bug] KMeans Sparse Init Update~~ KMeans Sparse Init Update May 21, 2024

md-shafiul-alam added 2 commits May 21, 2024 09:58

format and error fix

eee6863

align trial_count computation

328407d

md-shafiul-alam marked this pull request as ready for review May 22, 2024 11:52

md-shafiul-alam requested review from Alexsandruss, samir-nasibli and Alexandr-Solovev as code owners May 22, 2024 11:52

md-shafiul-alam requested review from Vika-F, ethanglaser, samir-nasibli, Alexsandruss and Alexandr-Solovev and removed request for samir-nasibli, Alexsandruss and Alexandr-Solovev May 22, 2024 11:53

md-shafiul-alam mentioned this pull request May 22, 2024

KMeans OOP uxlfoundation/scikit-learn-intelex#1770

Merged

8 tasks

Alexandr-Solovev reviewed May 23, 2024

cpp/daal/src/algorithms/kmeans/kmeans_plusplus_init_impl.i (outdated, resolved)

Alexandr-Solovev reviewed May 23, 2024

cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp (outdated, resolved)

Alexandr-Solovev reviewed May 23, 2024

cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp (outdated, resolved)

address review

137dee4

Vika-F reviewed May 24, 2024

md-shafiul-alam added 3 commits May 24, 2024 06:59

add comment and refactor

bc5138b

update comments

a5891d0

update comments

d7c2333

md-shafiul-alam mentioned this pull request Jun 10, 2024

CI: Initial additions for fp32 and windows GPU test support uxlfoundation/scikit-learn-intelex#1778

Merged

samir-nasibli reviewed Jun 19, 2024

md-shafiul-alam closed this Jun 20, 2024
