Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CONV] fix naive conv kernel for large tensors #3434

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

bghimireamd
Copy link
Contributor

@bghimireamd bghimireamd commented Dec 12, 2024

for larger tensor I was seeing
gdims[0] : 38,654,705,664
globalWorkSizeX : 4,294,967,295 (max allowed by uint32_t)
MaxGridDimX : 2,147,483,647

gdims[0] was exceeding MaxGridDimX globalWorkSizeX for below driver command.

./bin/MIOpenDriver convbfp16 -n 589824 -c 256 -H 4 -w 34 -k 256 -y 1 -x 3 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m c onv -g 1 -F 1 -t 1

In this PR I reduce the gdims[0] to be set within MaxGridDimX.

Copy link
Collaborator

@BradPepersAMD BradPepersAMD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a unit test for this?

}
else
{
grid_size = (all_workload + block_size - 1) / block_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #2748
It is an integer Ceil() function.
It's just a reminder that the problem still exists.

Comment on lines +376 to +384
size_t all_workload = static_cast<size_t>(group) * n * ho;
if(all_workload <= block_size)
{
grid_size = all_workload;
}
else
{
grid_size = (all_workload + block_size - 1) / block_size;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like the problem is solved here, actually I see a few more problems.
Now it divides total workload by 256 - technically speaking it's just 256 times further from now. Quite far away, but still there.
And since it is dividing total workload by 256, we have 256 times underloaded GPU. Can be a huge performance drop for a wide range of legit tensor sizes, and even it's a naive algorithm, we are using it everywhere in the tests to compute reference data.
The last concern is the kernel itself - it should be aware about that fact that the number of groups can be capped, and it should contain extra loop to handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will implemented the kernel itself to handle the capped number of groups.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that a new kernel should be implemented, or the old one can be changed, or even the old one has already got this support and we should change anything - firstly it should be checked.

Underloaded GPU problem should be fixed too.
Let's imagine - all_workload is 256 and we have a grid size of 256; when it is 257, the grid size suddenly becomes 2.
We have more work but fewer workers.

Copy link
Contributor Author

@bghimireamd bghimireamd Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need to modify the kernel. We could loop over the same kernel, adjusting the chunk size and buffer offsets as needed. This would handle the limitation of uint32_t in hipExtModuleLaunchKernel which currently overflows when we pass a global work size as gridX((589824 *256) *256 ).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).

Looks like the batched_transpose solver also has a version of this issue (and seems somewhat likely we have this issue throughout MIOpen).

For HIP this is a general constraint across any kernel launch I think:

What are the maximum limits of kernel launch parameters?

Product of block.x, block.y, and block.z should be less than 1024. Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.

I think we might need to come up with a general solution for this, and make sure it's implemented broadly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).

We don't, kind of. There are some places where the kernel is aware about number of workgroups limit and sometimes the number of workgroups is capped by some value like 4096. That's mostly it.

I think we might need to come up with a general solution for this, and make sure it's implemented broadly.

I'm not sure if thatcan be easily implemented. The main reason is that: the number of workgroups heavily depends on the algorithm and, the most important, on the kernel itself, and sometimes it even comes from heuristics.
Putting some hardlimit in the library will not resolve the problem, and it can even do a bad stuff like previously runtime explicitly failed to launch the kernel, but now it will be silently capped, launched and produced a wrong result, which will be much harded to notice, especially when you run production code without any verifications.

@CAHEK7
Copy link
Contributor

CAHEK7 commented Dec 13, 2024

can we add a unit test for this?

It's barely possible, at least without affecting the current development and CI, and at least the test should be added as a special "huge tensor" tests - specifically tailored case for specific machines.

  1. we may have memory problems on CI machines - not every GPU can support so huge tensors (probably it's not relevant for MI200+ cards)
  2. we will have a problem with tensor initialization on the host - probably the problem does not exist for convolutions, since Artem replaced CPU based tensor init with GPU based (and probably for some particular datatypes), but all the other algorithms do not use GPU initialization
  3. we will have problems with verification - even if we initialize everything on gpu and compute everything on gpu, we still have to copy the results and the reference data back to CPU and run slow CPU based verification
  4. it's almost like 3 - since it's a naive algorithm, we are using CPU for it's verification. CPU convolution is kind of slow for huge tensors

@bghimireamd
Copy link
Contributor Author

can we add a unit test for this?

It's barely possible, at least without affecting the current development and CI, and at least the test should be added as a special "huge tensor" tests - specifically tailored case for specific machines.

  1. we may have memory problems on CI machines - not every GPU can support so huge tensors (probably it's not relevant for MI200+ cards)
  2. we will have a problem with tensor initialization on the host - probably the problem does not exist for convolutions, since Artem replaced CPU based tensor init with GPU based (and probably for some particular datatypes), but all the other algorithms do not use GPU initialization
  3. we will have problems with verification - even if we initialize everything on gpu and compute everything on gpu, we still have to copy the results and the reference data back to CPU and run slow CPU based verification
  4. it's almost like 3 - since it's a naive algorithm, we are using CPU for it's verification. CPU convolution is kind of slow for huge tensors

We do have https://github.com/ROCm/MIOpen/blob/develop/test/gpu_reference_kernel.cpp

@CAHEK7
Copy link
Contributor

CAHEK7 commented Dec 13, 2024

We do have https://github.com/ROCm/MIOpen/blob/develop/test/gpu_reference_kernel.cpp

Yes, that's a naive CPU single threaded ultra slow verification for naive GPU algorithm. That test is not about "huge" tensors, it has exactly those problems which I described.

@bghimireamd
Copy link
Contributor Author

can we add a unit test for this?

It's barely possible, at least without affecting the current development and CI, and at least the test should be added as a special "huge tensor" tests - specifically tailored case for specific machines.

  1. we may have memory problems on CI machines - not every GPU can support so huge tensors (probably it's not relevant for MI200+ cards)
  2. we will have a problem with tensor initialization on the host - probably the problem does not exist for convolutions, since Artem replaced CPU based tensor init with GPU based (and probably for some particular datatypes), but all the other algorithms do not use GPU initialization
  3. we will have problems with verification - even if we initialize everything on gpu and compute everything on gpu, we still have to copy the results and the reference data back to CPU and run slow CPU based verification
  4. it's almost like 3 - since it's a naive algorithm, we are using CPU for it's verification. CPU convolution is kind of slow for huge tensors

Yes, we do need to do the slow cpu run. I can the test a nightly run.

@CAHEK7
Copy link
Contributor

CAHEK7 commented Dec 17, 2024

Yes, we do need to do the slow cpu run. I can the test a nightly run.

I'm not sure that we do need. It depends on the way how we treat the reference data.

For example, when two algorithms have a consensus during a manual run, we can assume that they produce the same data.
Following that logic, we can do the following steps:

  1. manually run naive GPU vs naive CPU and be sure that both are producing the same output
  2. add the same case, but using naive GPU and non-naive GPU algorithms - assuming that if naive GPU produced correct result on step 1 and naive and non-naive GPUs are doing the same, it means that everything is fine.

The only case when it can get broken is when we simultaneously and exactly in the same way break naive and non-naive implementations - in that case both algorithms produce the same wrong result, having the test passed. In the other cases, the test will indicate that either naive or non-naive version is broken, and that's enough to start manually checking everything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants