How to convert a bfloat16 model to fp8? #661

zhipengChen · 2025-02-14T07:56:16Z

I use the code below, from kernel.py, to convert bfloat16 to fp8, but I cannot load the converted fp8 model with vLLM.

from typing import Tuple

import torch
import triton
import triton.language as tl


@triton.jit
def act_quant_kernel(x_ptr, y_ptr, s_ptr, BLOCK_SIZE: tl.constexpr):
    # Each program instance quantizes one block of BLOCK_SIZE contiguous elements.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offs).to(tl.bfloat16)
    # 448 is the largest finite value of float8_e4m3fn, so the block maximum maps to it.
    s = tl.max(tl.abs(x)) / 448.
    y = x / s
    y = y.to(y_ptr.dtype.element_ty)
    tl.store(y_ptr + offs, y)
    tl.store(s_ptr + pid, s)


def act_quant(x: torch.Tensor, block_size: int = 128) -> Tuple[torch.Tensor, torch.Tensor]:
    assert x.is_contiguous()
    assert x.size(-1) % block_size == 0
    # y holds the fp8 values; s holds one scale per block along the last dimension.
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    s = x.new_empty(*x.size()[:-1], x.size(-1) // block_size, dtype=torch.bfloat16)
    grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']), )
    act_quant_kernel[grid](x, y, s, BLOCK_SIZE=block_size)
    return y, s
