fix: propagate precision correctly to enable non-bf16 inference #165
base: main
Conversation
thanks a lot for this! one minor comment, ready to merge o/w!
fam/llm/fast_model.py (Outdated)

@@ -188,8 +188,8 @@ def __init__(self, config: ModelArgs):
     total_head_dim = (config.n_head + 2 * config.n_local_heads) * config.head_dim
     # key, query, value projections for all heads, but in a batch
-    self.wqkv = nn.Linear(config.dim, total_head_dim, bias=False)
     self.wo = nn.Linear(config.dim, config.dim, bias=False)
+    self.wqkv = nn.Linear(config.dim, total_head_dim, bias=False, dtype=config.dtype)
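For context on what the highlighted change does (an editorial note, not part of the review): nn.Linear allocates its parameters in the global default dtype unless a dtype is passed explicitly, so without dtype=config.dtype these projections can end up in a different precision than the rest of the model. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# Without an explicit dtype, nn.Linear allocates its weights in the global
# default dtype (normally float32).
proj_default = nn.Linear(2048, 2048, bias=False)
print(proj_default.weight.dtype)  # torch.float32

# Passing dtype= creates the weights directly in the requested precision,
# which is what dtype=config.dtype does in the diff above.
proj_half = nn.Linear(2048, 2048, bias=False, dtype=torch.float16)
print(proj_half.weight.dtype)  # torch.float16
```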
why're these required? I think with the fix here, this shouldn't be needed?
The following line was throwing an error due to the use of mixed dtypes: q was float16 but k and v were bfloat16.

y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)

Error:

TorchRuntimeError: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(2, 16, s0, 128)), FakeTensor(..., device='cuda:0', size=(2, 16, 2048, 128), dtype=torch.float16), FakeTensor(..., device='cuda:0', size=(2, 16, 2048, 128), dtype=torch.float16)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(1, 1, s0, 2048), dtype=torch.bool), 'dropout_p': 0.0}): Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: c10::Half and value.dtype: c10::Half instead.

However, I have just checked and this change alone did not solve the issue; it only worked after I also ran the code with torch dynamo disabled (export TORCHDYNAMO_DISABLE=1). It may be caused by one of the operations applied to the key and value tensors; I suspect k = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1), because it is the only one performed on k and v but not on q.
If you prefer, I can remove this change from the PR; if I or someone else finds the root cause and a way to solve it, we can open another PR.
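To illustrate the failure mode described in the comment above (an editorial sketch, not code from the PR): F.scaled_dot_product_attention requires query, key, and value to share one dtype, so the same error reproduces outside torch.compile as well, and disappears once the three tensors agree.

```python
import torch
import torch.nn.functional as F

# Shapes are illustrative; only the dtypes matter for this error.
q = torch.randn(2, 16, 8, 128)                        # float32, like the query in the trace
k = torch.randn(2, 16, 8, 128, dtype=torch.float16)   # half, like the cached key
v = torch.randn(2, 16, 8, 128, dtype=torch.float16)

try:
    F.scaled_dot_product_attention(q, k, v)
except RuntimeError as err:
    print(err)  # "Expected query, key, and value to have the same dtype..."

# Casting k and v to q's dtype (or keeping every projection and cache in the
# same precision) avoids the mismatch.
out = F.scaled_dot_product_attention(q, k.to(q.dtype), v.to(q.dtype))
print(out.dtype)  # torch.float32
```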
So queries are the right dtype but keys and values are not? That sounds like it might be related to the kv-cache not being the right dtype... but we seem to be setting it correctly here... did you already have a look there?
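A hedged sketch of the kind of cache the comment above points at, in the style of a gpt-fast KVCache; the class and attribute names are illustrative, not necessarily the repository's exact implementation. If these buffers are allocated with a hard-coded dtype, writes from half-precision activations are silently cast, and the values read back for attention carry the buffer's dtype rather than the query's.

```python
import torch
import torch.nn as nn

class KVCache(nn.Module):
    """Preallocated key/value cache; its dtype must match the model's activations."""

    def __init__(self, max_batch_size, max_seq_len, n_heads, head_dim,
                 dtype=torch.bfloat16):
        super().__init__()
        cache_shape = (max_batch_size, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(cache_shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(cache_shape, dtype=dtype))

    def update(self, input_pos, k_val, v_val):
        # Indexed assignment casts k_val/v_val to the buffer dtype, so if the
        # cache was built as bfloat16 while the model runs in float16, the
        # returned k/v no longer match q's dtype.
        self.k_cache[:, :, input_pos] = k_val
        self.v_cache[:, :, input_pos] = v_val
        return self.k_cache, self.v_cache
```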
That's right. I can see that the kv-cache runs when I execute the code, so it is likely what changes the dtypes, which, according to the code you referenced, should not happen. If you agree, I'll remove this part of the PR and investigate a bit further tomorrow to try to fix this other issue.
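One way to pin this down during that investigation (a hypothetical debug helper, not part of the PR): print the dtypes seen right before the attention call, for both the projected tensors and the cache buffers, so the step that changes precision becomes visible. The call site shown in the comment assumes a gpt-fast-style attention module.

```python
import torch

def report_dtypes(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  **buffers: torch.Tensor) -> None:
    """Print the dtypes seen right before scaled_dot_product_attention."""
    print("q:", q.dtype, "k:", k.dtype, "v:", v.dtype)
    for name, buf in buffers.items():
        print(f"{name}:", buf.dtype)

# e.g. inside Attention.forward, before the attention call (names are hypothetical):
# report_dtypes(q, k, v, k_cache=self.kv_cache.k_cache, v_cache=self.kv_cache.v_cache)
```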
…precision correctly to enable non-bf16 inference
This reverts commit f01a9ce.
I have reverted the last commit since it was not required for this fix.
This PR fixes some incompatibilities that I encountered when instantiating TTS from fam/llm/fast_inference.py on older and less powerful GPUs (e.g. the Google Colab T4 GPU). fam/llm/fast_inference_utils.py was putting the model on the device (cuda) with dtype.bfloat16 instead of using the precision parameter, which contains the selected dtype (by default float16 or bfloat16, depending on the GPU architecture). The linear layer of the Attention class in fam/llm/fast_model.py was also missing the dtype definition using the one provided in the config.
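A minimal sketch of the behaviour described above, under the assumption that the names mirror the description rather than the repository's exact code: select the precision from the GPU's capabilities once, then pass that same dtype both when moving the model to the device and into the model config, instead of hard-coding bfloat16.

```python
import torch

def select_precision() -> torch.dtype:
    # bfloat16 needs Ampere or newer; older GPUs such as the Colab T4 fall back to float16.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

precision = select_precision()
device = "cuda" if torch.cuda.is_available() else "cpu"

# The fix amounts to forwarding `precision` everywhere instead of torch.bfloat16:
#   model = model.to(device=device, dtype=precision)
#   config = ModelArgs(..., dtype=precision)   # so the Attention projections match
```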