fix(library): Propagate upstream Marlin kernel fix #366

Open
wants to merge 3 commits into
base: main

Conversation

@ahadnagy ahadnagy commented Jan 6, 2025

What does this PR do?

Fixes #332

TL;DR: There was a data race bug in the Marlin kernel. This fix adds a separate shared memory region for the final reduction tree. Unfortunately, this affects the minimum hardware requirements for the kernel: it won't work on GPUs with compute capability < 8.0.
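
As a rough illustration of the kind of hazard described here (a toy model in Python, not the actual kernel code or the upstream patch), the sketch below replays one fixed interleaving of two warps that share a block's memory. The function `run`, the buffer sizes, and the values are all hypothetical; the only point is that writing reduction partials into a region that another warp is still reading corrupts the result, while a dedicated region does not.

```python
def run(separate_reduction_region):
    """Toy model of two warps sharing one block's shared memory.

    Hypothetical layout and values, chosen only to show the aliasing hazard.
    """
    staged = [1.0, 2.0, 3.0, 4.0]  # operands staged by the load pipeline
    if separate_reduction_region:
        smem = staged + [0.0] * 4  # extra slots reserved for the reduction
        red_base = 4
    else:
        smem = list(staged)        # reduction scratch aliases the operands
        red_base = 0

    # Fixed interleaving: the fast warp finishes first and publishes its
    # partial sum for the reduction tree...
    fast_partial = 100.0
    smem[red_base] = fast_partial
    # ...while the slow warp is still reading the staged operands.
    slow_partial = sum(smem[0:4])

    # Final reduction step: combine both partials.
    return slow_partial + smem[red_base]


print(run(separate_reduction_region=True))   # 110.0, the expected result
print(run(separate_reduction_region=False))  # 209.0, corrupted by the overwrite
```

The dedicated region costs extra shared memory per block, which is why the fix raises the kernel's shared memory footprint.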

Before submitting

  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you run all tests locally and make sure they pass?
  • Did you write any new necessary tests?

@ahadnagy ahadnagy requested a review from dacorvo as a code owner January 6, 2025 22:19
@ahadnagy ahadnagy marked this pull request as draft January 6, 2025 22:19
@ahadnagy ahadnagy marked this pull request as ready for review January 6, 2025 22:29
@ahadnagy ahadnagy changed the title from "fix(library): Propagate upstream Marlin kernel fix (WIP)" to "fix(library): Propagate upstream Marlin kernel fix" Jan 6, 2025
Increase shared mem. size

Fix shared mem. size, re-activate test

Remove debugging-related stuff

ahadnagy (Author) commented Jan 7, 2025

@dacorvo It seems like it's not gonna work on A10s due to their lower shared memory size. Is that a hard requirement for the library?

dacorvo (Collaborator) commented Jan 7, 2025

> @dacorvo It seems like it's not gonna work on A10s due to their lower shared memory size. Is that a hard requirement for the library?

Pretty much, yes, since one of the main use cases for quantization is to be able to run bigger models on smaller devices.

ahadnagy (Author) commented Jan 7, 2025

Okay, in that case it'll be necessary to reduce the tile size. I'll check what vLLM does on this front and whether that works on A10s at all. IIRC, their CI runs on L40s.
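
To make the tile-size trade-off concrete, here is a minimal Python sketch that picks the largest tile width fitting a device's per-block shared memory limit. The per-block limits are the opt-in maxima listed in the CUDA C++ Programming Guide for each compute capability. Everything else is an assumption for illustration: the footprint model in `smem_bytes`, the helper `largest_fitting_tile_n`, and the tile shapes are hypothetical and do not reflect Marlin's or vLLM's actual layout.

```python
# Maximum shared memory per thread block (opt-in), in KiB, by compute
# capability, as listed in the CUDA C++ Programming Guide.
MAX_SMEM_PER_BLOCK_KIB = {
    (8, 0): 163,  # e.g. A100
    (8, 6): 99,   # e.g. A10
    (8, 9): 99,   # e.g. L40
}


def smem_bytes(tile_m, tile_n, tile_k, stages, separate_reduction=True):
    """Hypothetical footprint model, NOT the kernel's real layout."""
    a_tile = tile_m * tile_k * 2     # fp16 activations, 2 bytes each
    b_tile = tile_k * tile_n // 2    # int4 weights, two per byte
    pipeline = stages * (a_tile + b_tile)
    reduction = tile_m * tile_n * 4  # fp32 partial sums
    if separate_reduction:
        return pipeline + reduction   # dedicated reduction region
    return max(pipeline, reduction)   # reduction aliases the pipeline buffers


def largest_fitting_tile_n(cc, candidates, tile_m, tile_k, stages,
                           separate_reduction=True):
    """Return the widest candidate tile_n that fits on `cc`, or None."""
    limit = MAX_SMEM_PER_BLOCK_KIB[cc] * 1024
    fitting = [
        n for n in candidates
        if smem_bytes(tile_m, n, tile_k, stages, separate_reduction) <= limit
    ]
    return max(fitting) if fitting else None


widths = [64, 128, 256, 512]
print(largest_fitting_tile_n((8, 0), widths, tile_m=64, tile_k=64, stages=4))  # 256
print(largest_fitting_tile_n((8, 6), widths, tile_m=64, tile_k=64, stages=4))  # 128
```

Under these made-up numbers, the dedicated reduction region pushes the 256-wide tile past the 99 KiB limit of compute capability 8.6, so a smaller tile is needed there, while the same tile still fits within 163 KiB on 8.0.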

Development

Successfully merging this pull request may close these issues.

Corrupted outputs with Marlin int4 kernels as parallelization increases
2 participants