ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

david-cortes-intel · 2024-11-14T10:30:20Z

Description

As a follow up from #2930

This PR adds a GPU version of the fallback for linear regression based on spectral decomposition, which mirrors the CPU version. It reuses the same tests, but this time executing them with fp32 too.

A few things I wasn't sure about:

Here I'm taking the minimum of the diagonal of the Cholesky factorization using a sycl reduction class, but I'm not sure if that's the usual way in oneDAL.
I am also not 100% sure if it is guaranteed that this operation (minimum of the diagonal) with an 'at_device' at the end would force execution to wait until the operation is completed before returning the value.
There's an operation here for which oneMKL doesn't provide the equivalent LAPACK functionality, which is: "column-major matrix divided by a broadcasted row vector" (i.e. multiply each row of the matrix by a scalar from the vector). I implemented it as a 2D for-loop with a multiplication by the inverse (which is faster but less precise), but I don't know if this is better than a 1D loop that would take a whole row, as rows are contiguous in memory.

PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, PR with docs update doesn't require checkboxes for performance while PR with any change in actual code should have checkboxes and justify how this code change is expected to affect performance (or justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

I have reviewed my changes thoroughly before submitting this pull request.
I have commented my code, particularly in hard-to-understand areas.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have added a respective label(s) to PR if I have a permission for that.
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

Performance

Not applicable (addition is a fallback for cases that would have failed previously).

david-cortes-intel · 2024-11-14T10:33:41Z

/intelci: run

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp

Alexandr-Solovev · 2024-11-14T11:09:39Z

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp


    opt_array<Float> dummy{};
-    auto potrf_event = potrf_factorization<uplo>(queue, nxtx, dummy, { xtx_event });
-    auto potrs_event = potrs_solution<uplo>(queue, nxtx, nxty, dummy, { potrf_event, xty_event });
+    try {


As I remember we use try catch only in tests/examples. Does it make sense to avoid it here?

It's meant to catch an exception thrown by oneMKL in the call to potrf, over which we have no control AFAIK. An alternative would be to modify the wrapper, but it's also used elsewhere in places where the exception is not catched (so a user error is thrown).

I am Ok with having a try in the code; but let's make it more clear for the readers - please add the description of what is happening here - why the system is tried to be solved twice in some cases.

Added a comment about the logic for the try-catch block.

david-cortes-intel · 2024-11-14T16:47:18Z

Oddly, the first version of this function (minimum of the diagonal) that I wrote seems to succeed in some tests, but fail in others (particularly when it it called repeatedly, as in incremental linear regression):

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto diag_min_holder = array<Float>::empty(queue, 1, alloc);
    sycl::event diag_min_event = queue.submit([&](sycl::handler& h) {
        auto min_reduction = sycl::reduction(diag_min_holder.get_mutable_data(), sycl::minimum<>());
        h.depends_on(event_Matrix);
        h.parallel_for(dim_matrix, min_reduction, [=](const auto& i, auto& min_obj) {
            min_obj.combine(Matrix[i * (dim_matrix + 1)]);
        });
    });
    return ndview<Float, 1>::wrap(diag_min_holder).at_device(queue, 0, { diag_min_event });
}

(tried also adding event_Matrix to the dependencies of the return line, but it made no difference)

But this slower version doesn't seem to have any issue:

template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto idx_min_holder = array<std::int64_t>::empty(queue, 1, alloc);
    sycl::event diag_min_event = mkl::blas::column_major::iamin(
        queue,
        dim_matrix,
        Matrix,
        dim_matrix + 1,
        idx_min_holder.get_mutable_data(),
        mkl::index_base::zero,
        { event_Matrix }
    );
    const std::int64_t idx_min = ndview<std::int64_t, 1>::wrap(idx_min_holder).at_device(queue, 0, { diag_min_event });
    return ndview<Float, 1>::wrap(Matrix, dim_matrix * dim_matrix).at_device(queue, idx_min * (dim_matrix + 1), { event_Matrix });
}

Any hints and what might be happening?

david-cortes-intel · 2024-11-14T16:51:25Z

/intelci: run

david-cortes-intel · 2024-11-15T07:25:56Z

Another thing that I'm now not very sure about is: should calls to oneMKL lapack manually call queue.wait_and_throw() in order to catch runtime exceptions from them?

This is what the oneMKL docs mention:

oneMKL error handling relies on the mechanism of C++ exceptions. Should error occur, it will be propagated at the point of a function call where it is caught using standard C++ error handling mechanism.

... which sounds like it would let it wait until the exception can be thrown, but then it leaves me wondering how it handles queueing.

david-cortes-intel · 2024-11-15T07:55:58Z

After some investigation regarding the diagonal_minimum function, turns out the issue with the function for diagonal minimum was that it wasn't initializing the reducer object, despite what the docs mention:
https://github.khronos.org/SYCL_Reference/iface/reduction-variables.html#known-identities

Adding a manual initialization did the trick.

david-cortes-intel · 2024-11-15T08:05:34Z

/intelci: run

david-cortes-intel · 2024-11-15T12:10:18Z

/intelci: run

david-cortes-intel · 2024-11-15T15:53:09Z

/intelci: run

Vika-F

The overall changes look good to me, they just need to be aligned a bit with the codebase. Please see the comments.

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp

Vika-F · 2024-11-18T12:56:21Z

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp


    opt_array<Float> dummy{};
-    auto potrf_event = potrf_factorization<uplo>(queue, nxtx, dummy, { xtx_event });
-    auto potrs_event = potrs_solution<uplo>(queue, nxtx, nxty, dummy, { potrf_event, xty_event });
+    try {


I am Ok with having a try in the code; but let's make it more clear for the readers - please add the description of what is happening here - why the system is tried to be solved twice in some cases.

Vika-F · 2024-11-18T13:09:22Z

Another thing that I'm now not very sure about is: should calls to oneMKL lapack manually call queue.wait_and_throw() in order to catch runtime exceptions from them?

@david-cortes-intel Good point. I think yes, .wait_and_throw() is needed after the last oneMKL function call in try to catch the exception.
Otherwise it can be missed as oneMKL calls run asynchronously on device.

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp

Co-authored-by: Victoriya Fedotova <[email protected]>

david-cortes-intel · 2024-11-18T15:25:44Z

/intelci: run

david-cortes-intel · 2024-11-18T15:26:09Z

Another thing that I'm now not very sure about is: should calls to oneMKL lapack manually call queue.wait_and_throw() in order to catch runtime exceptions from them?

@david-cortes-intel Good point. I think yes, .wait_and_throw() is needed after the last oneMKL function call in try to catch the exception. Otherwise it can be missed as oneMKL calls run asynchronously on device.

Thanks for confirming. Added a manual call to wait_and_throw.

Vika-F · 2024-11-19T12:21:53Z

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp

+        if (diag_min <= 1e-6)
+            throw mkl::lapack::computation_error("", "", 0);


That should be the last portion of comments.

Please provide some descriptive name for 1e-6 threshold. We are trying to reduce the number of magic constants in our code.

Throwing oneMKL exception here looks like a hack to call the code in catch.
It would be better just to call spectral decomposition here and in catch. Maybe re-write it a bit in order not to duplicate the code with array copying.

Added a comment about the threshold and changed the logic of the fallback to a named section.

Maybe I do not understand something, but why this block cannot be just a function? For the sake of simplicity.

goto in DPC++ code looks pretty weird. And besides the performance aspect, the use of goto make the code harder to understand. As the execution order and the call stack can be very hard to track.

Changed it into a function.

Vika-F · 2024-11-20T09:52:26Z

/intelci: run

Vika-F

Let's wait for a [semi-]green CI and LGTM.

david-cortes-intel · 2024-11-21T07:51:44Z

/intelci: run

david-cortes-intel · 2024-11-21T11:06:25Z

/intelci: run

david-cortes-intel · 2024-11-21T16:45:13Z

CI is passing now, will merge.

add spectral decomposition fallback on GPU

b3c4d41

david-cortes-intel added the dpc++ Issue/PR related to DPC++ functionality label Nov 14, 2024

david-cortes-intel requested a review from Vika-F November 14, 2024 10:30

david-cortes-intel requested review from Alexsandruss, samir-nasibli and Alexandr-Solovev as code owners November 14, 2024 10:30

david-cortes-intel marked this pull request as draft November 14, 2024 10:30

Alexandr-Solovev reviewed Nov 14, 2024

View reviewed changes

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp Outdated Show resolved Hide resolved

Alexandr-Solovev reviewed Nov 14, 2024

View reviewed changes

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp Outdated Show resolved Hide resolved

Alexandr-Solovev reviewed Nov 14, 2024

View reviewed changes

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp Show resolved Hide resolved

Alexandr-Solovev reviewed Nov 14, 2024

View reviewed changes

david-cortes-intel mentioned this pull request Nov 14, 2024

ENH: Enable linear regression on GPU with non-PSD systems uxlfoundation/scikit-learn-intelex#2167

Merged

8 tasks

fix failing min diagonal

d528fab

switch back to sycl reduction for diag min

5255e64

try another way

141b65b

use ndarray instead of array for allocation

e88ed47

david-cortes-intel marked this pull request as ready for review November 15, 2024 12:10

avoid copying unused entries in array

a325452

Vika-F reviewed Nov 18, 2024

View reviewed changes

cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp Show resolved Hide resolved

Update cpp/oneapi/dal/backend/primitives/lapack/solve_dpc.cpp

6addc03

Co-authored-by: Victoriya Fedotova <[email protected]>

david-cortes-intel added 6 commits November 18, 2024 15:07

follow internal logic for error messages

8cc9ca1

add comment about the try-catch block

e39f2ac

add manual wait-and-throw call where appropriate

9a4d13e

use internal wrappers for blas calls

9a1f48d

add comment about dimensions of inputs

8b3d1c1

change format of comments

e86b5bd

Vika-F reviewed Nov 19, 2024

View reviewed changes

david-cortes-intel added 4 commits November 19, 2024 13:41

add comment about threshold for diagonal minimum

15b686d

avoid manually throwing MKL error

4cff7b9

remove accidentally commited file

58ebb83

wrap procedure in a function

4e241dc

Vika-F approved these changes Nov 20, 2024

View reviewed changes

david-cortes-intel merged commit 08fd918 into uxlfoundation:main Nov 21, 2024
18 checks passed

david-cortes-intel mentioned this pull request Nov 26, 2024

ENH: Moving Ridge out of preview uxlfoundation/scikit-learn-intelex#2154

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

david-cortes-intel commented Nov 14, 2024 •

edited

Loading

david-cortes-intel commented Nov 14, 2024

Alexandr-Solovev Nov 14, 2024

david-cortes-intel Nov 14, 2024

Vika-F Nov 18, 2024 •

edited

Loading

david-cortes-intel Nov 18, 2024

david-cortes-intel commented Nov 14, 2024 •

edited

Loading

david-cortes-intel commented Nov 14, 2024

david-cortes-intel commented Nov 15, 2024 •

edited

Loading

david-cortes-intel commented Nov 15, 2024

david-cortes-intel commented Nov 15, 2024

david-cortes-intel commented Nov 15, 2024

david-cortes-intel commented Nov 15, 2024

Vika-F left a comment

Vika-F Nov 18, 2024 •

edited

Loading

Vika-F commented Nov 18, 2024

david-cortes-intel commented Nov 18, 2024

david-cortes-intel commented Nov 18, 2024

Vika-F Nov 19, 2024

david-cortes-intel Nov 19, 2024

Vika-F Nov 19, 2024

david-cortes-intel Nov 19, 2024

Vika-F commented Nov 20, 2024

Vika-F left a comment

david-cortes-intel commented Nov 21, 2024

david-cortes-intel commented Nov 21, 2024

david-cortes-intel commented Nov 21, 2024

		if (diag_min <= 1e-6)
		throw mkl::lapack::computation_error("", "", 0);

ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976

Conversation

david-cortes-intel commented Nov 14, 2024 • edited Loading

Description

david-cortes-intel commented Nov 14, 2024

Alexandr-Solovev Nov 14, 2024

Choose a reason for hiding this comment

david-cortes-intel Nov 14, 2024

Choose a reason for hiding this comment

Vika-F Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

david-cortes-intel Nov 18, 2024

Choose a reason for hiding this comment

david-cortes-intel commented Nov 14, 2024 • edited Loading

david-cortes-intel commented Nov 14, 2024

david-cortes-intel commented Nov 15, 2024 • edited Loading

david-cortes-intel commented Nov 15, 2024

david-cortes-intel commented Nov 15, 2024

david-cortes-intel commented Nov 15, 2024

david-cortes-intel commented Nov 15, 2024

Vika-F left a comment

Choose a reason for hiding this comment

Vika-F Nov 18, 2024 • edited Loading

Choose a reason for hiding this comment

Vika-F commented Nov 18, 2024

david-cortes-intel commented Nov 18, 2024

david-cortes-intel commented Nov 18, 2024

Vika-F Nov 19, 2024

Choose a reason for hiding this comment

david-cortes-intel Nov 19, 2024

Choose a reason for hiding this comment

Vika-F Nov 19, 2024

Choose a reason for hiding this comment

david-cortes-intel Nov 19, 2024

Choose a reason for hiding this comment

Vika-F commented Nov 20, 2024

Vika-F left a comment

Choose a reason for hiding this comment

david-cortes-intel commented Nov 21, 2024

david-cortes-intel commented Nov 21, 2024

david-cortes-intel commented Nov 21, 2024

david-cortes-intel commented Nov 14, 2024 •

edited

Loading

Vika-F Nov 18, 2024 •

edited

Loading

david-cortes-intel commented Nov 14, 2024 •

edited

Loading

david-cortes-intel commented Nov 15, 2024 •

edited

Loading

Vika-F Nov 18, 2024 •

edited

Loading