ENH: Implement GPU version of spectral decomposition as linear regression fallback #2976
Conversation
/intelci: run
```cpp
opt_array<Float> dummy{};
auto potrf_event = potrf_factorization<uplo>(queue, nxtx, dummy, { xtx_event });
auto potrs_event = potrs_solution<uplo>(queue, nxtx, nxty, dummy, { potrf_event, xty_event });
try {
```
As I remember, we use try/catch only in tests/examples. Does it make sense to avoid it here?
It's meant to catch an exception thrown by oneMKL in the call to potrf, over which we have no control AFAIK. An alternative would be to modify the wrapper, but it's also used elsewhere in places where the exception is not caught (so a user error is thrown).
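A minimal plain-C++ sketch of that control flow, with made-up stand-ins for the actual oneDAL/oneMKL calls (none of these names are the PR's real API), might look like:

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical exception standing in for mkl::lapack::computation_error.
struct not_positive_definite : std::runtime_error {
    not_positive_definite() : std::runtime_error("matrix not SPD") {}
};

// Stand-in for the potrf/potrs fast path (here a trivial 1x1 "system").
std::vector<double> solve_cholesky(const std::vector<double>& xtx,
                                   const std::vector<double>& xty) {
    if (xtx[0] == 0.0) throw not_positive_definite{}; // factorization fails
    return { xty[0] / xtx[0] };
}

// Stand-in for the spectral-decomposition fallback: a pseudo-inverse style
// solve in which zero eigenvalues simply contribute nothing.
std::vector<double> solve_spectral(const std::vector<double>& xtx,
                                   const std::vector<double>& xty) {
    if (xtx[0] == 0.0) return { 0.0 };
    return { xty[0] / xtx[0] };
}

std::vector<double> solve(const std::vector<double>& xtx,
                          const std::vector<double>& xty) {
    try {
        return solve_cholesky(xtx, xty); // fast path: Cholesky
    }
    catch (const not_positive_definite&) {
        return solve_spectral(xtx, xty); // fallback on rank deficiency
    }
}
```

The point of the try/catch is exactly this: the library call signals rank deficiency by throwing, so the second (slower, more robust) solve only runs when the first one fails.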
I am OK with having a try in the code, but let's make it clearer for the readers: please add a description of what is happening here, i.e. why the system is solved twice in some cases.
Added a comment about the logic for the try-catch block.
Oddly, the first version of this function (minimum of the diagonal) that I wrote seems to succeed in some tests, but fail in others (particularly when it is called repeatedly, as in incremental linear regression):

```cpp
template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto diag_min_holder = array<Float>::empty(queue, 1, alloc);
    sycl::event diag_min_event = queue.submit([&](sycl::handler& h) {
        auto min_reduction = sycl::reduction(diag_min_holder.get_mutable_data(), sycl::minimum<>());
        h.depends_on(event_Matrix);
        h.parallel_for(dim_matrix, min_reduction, [=](const auto& i, auto& min_obj) {
            min_obj.combine(Matrix[i * (dim_matrix + 1)]);
        });
    });
    return ndview<Float, 1>::wrap(diag_min_holder).at_device(queue, 0, { diag_min_event });
}
```

(tried also adding ...) But this slower version doesn't seem to have any issue:

```cpp
template <typename Float>
Float diagonal_minimum(sycl::queue& queue,
                       const Float* Matrix,
                       const std::int64_t dim_matrix,
                       sycl::event& event_Matrix) {
    constexpr auto alloc = sycl::usm::alloc::device;
    auto idx_min_holder = array<std::int64_t>::empty(queue, 1, alloc);
    sycl::event diag_min_event = mkl::blas::column_major::iamin(
        queue,
        dim_matrix,
        Matrix,
        dim_matrix + 1,
        idx_min_holder.get_mutable_data(),
        mkl::index_base::zero,
        { event_Matrix });
    const std::int64_t idx_min =
        ndview<std::int64_t, 1>::wrap(idx_min_holder).at_device(queue, 0, { diag_min_event });
    return ndview<Float, 1>::wrap(Matrix, dim_matrix * dim_matrix)
        .at_device(queue, idx_min * (dim_matrix + 1), { event_Matrix });
}
```

Any hints at what might be happening?
/intelci: run
Another thing that I'm now not very sure about: should calls to oneMKL LAPACK manually call ...? This is what the oneMKL docs mention:
... which sounds like it would let it wait until the exception can be thrown, but then it leaves me wondering how it handles queueing.
After some investigation regarding the ... Adding a manual initialization did the trick.
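That matches how SYCL reductions behave: unless the output is explicitly initialized (or the `initialize_to_identity` property is requested), the reduction combines into whatever value already sits in the output buffer, and freshly allocated device USM is uninitialized. A plain-C++ sketch of the same strided diagonal-minimum, with the accumulator explicitly started at the identity for min (this is an illustration of the access pattern, not the oneDAL code):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Minimum of the diagonal of a square dim x dim matrix stored densely.
// Key detail: the accumulator starts at the identity for min (max()),
// rather than at whatever garbage happened to be in the output buffer,
// which is what an uninitialized SYCL reduction target would contain.
double diagonal_minimum(const std::vector<double>& m, std::size_t dim) {
    double acc = std::numeric_limits<double>::max(); // explicit identity
    for (std::size_t i = 0; i < dim; ++i)
        acc = std::min(acc, m[i * (dim + 1)]); // stride dim+1 walks the diagonal
    return acc;
}
```

The intermittent failures would then be explained by the buffer occasionally containing a value smaller than the true diagonal minimum.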
/intelci: run
/intelci: run
/intelci: run
The overall changes look good to me; they just need to be aligned a bit with the codebase. Please see the comments.
```cpp
opt_array<Float> dummy{};
auto potrf_event = potrf_factorization<uplo>(queue, nxtx, dummy, { xtx_event });
auto potrs_event = potrs_solution<uplo>(queue, nxtx, nxty, dummy, { potrf_event, xty_event });
try {
```
@david-cortes-intel Good point. I think yes,
Co-authored-by: Victoriya Fedotova <[email protected]>
/intelci: run
Thanks for confirming. Added a manual call to
```cpp
if (diag_min <= 1e-6)
    throw mkl::lapack::computation_error("", "", 0);
```
That should be the last portion of comments.
- Please provide some descriptive name for the 1e-6 threshold. We are trying to reduce the number of magic constants in our code.
- Throwing a oneMKL exception here looks like a hack to reach the code in the catch block. It would be better to just call the spectral decomposition here and in the catch. Maybe rewrite it a bit so as not to duplicate the array-copying code.
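For the first point, the fix is just to give the constant a name; for example (this name is made up for illustration, not the one the PR ended up using):

```cpp
// Hypothetical name for the magic number. Diagonal entries of the
// Cholesky factor at or below this value are treated as numerically
// zero, which triggers the spectral-decomposition fallback.
constexpr double spectral_fallback_diagonal_threshold = 1e-6;

bool needs_spectral_fallback(double diag_min) {
    return diag_min <= spectral_fallback_diagonal_threshold;
}
```

A named predicate like this also removes the need to throw an artificial exception just to reach the catch block: the caller can branch on the boolean directly.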
Added a comment about the threshold and changed the logic of the fallback to a named section.
Maybe I do not understand something, but why can't this block just be a function? For the sake of simplicity. goto in DPC++ code looks pretty weird. And besides the performance aspect, the use of goto makes the code harder to understand, as the execution order and the call stack can be very hard to track.
Changed it into a function.
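The shape of that refactor, in a hedged plain-C++ sketch (all names here are hypothetical, not the PR's actual functions): the labeled block previously reached via goto becomes a named function that every call site can invoke directly.

```cpp
#include <vector>

// Hypothetical stub for the spectral-decomposition solve; the real code
// would run an eigendecomposition here instead of returning zeros.
std::vector<double> spectral_fallback(const std::vector<double>& xtx,
                                      const std::vector<double>& xty) {
    return std::vector<double>(xty.size(), 0.0);
}

// Instead of `if (diag_min <= threshold) goto spectral_path;`, the
// fallback is an early return, so control flow reads top to bottom and
// the array-copying logic is not duplicated between the two paths.
std::vector<double> solve(const std::vector<double>& xtx,
                          const std::vector<double>& xty,
                          double diag_min,
                          double threshold) {
    if (diag_min <= threshold)
        return spectral_fallback(xtx, xty);
    // ... Cholesky path would go here; stubbed as a pass-through ...
    return xty;
}
```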
/intelci: run
Let's wait for a [semi-]green CI and LGTM.
/intelci: run
/intelci: run
CI is passing now, will merge.
Description
As a follow-up from #2930, this PR adds a GPU version of the fallback for linear regression based on spectral decomposition, which mirrors the CPU version. It reuses the same tests, but this time executes them with fp32 too.
A few things I wasn't sure about:
PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, a PR with a docs update doesn't require checkboxes for performance, while a PR with any change in actual code should have checkboxes and justify how the code change is expected to affect performance (or the justification should be self-evident).
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing
Performance
Not applicable (addition is a fallback for cases that would have failed previously).