Initial implementation of disaggregated attention and qkv projection #1433 (Closed)
* Add scripts for evaluation
* Add absolute request rate value
* Fix script for target arrival rate
* Fix cpp req rate benchmark
* Update to use new dataset
* Fix infinite loop
* Update
* Add data

Co-authored-by: Remi Delacourt <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
yingchen21 force-pushed the attn-qkv-proj branch from 07d8e67 to 94e1563 on July 10, 2024 05:32
Description of changes:
This PR moves the qkv projection (and the output projection) out of the attention operator and into separate Dense layers, so that LoRA can be applied to the attention projections. The attention operator no longer holds any tensor weights: it takes the projected qkv as input and outputs the raw attention result (without the output projection applied), whose dimension is vdim * num_kv_heads rather than the hidden dimension. A sketch of this layout follows.
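As a rough illustration only (not FlexFlow's actual API), here is a minimal PyTorch-style sketch of the layout described above: the qkv and output projections live in ordinary dense (Linear) layers that a LoRA adapter can target like any other Linear, while the attention operator itself is weight-free and returns the raw attention result of width vdim * num_kv_heads. All class and parameter names here (WeightlessAttention, DisaggregatedAttentionBlock, etc.) are hypothetical.

```python
# Hypothetical sketch of the disaggregated layout; names do not come from this repo.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightlessAttention(nn.Module):
    """Attention operator with no tensor weights: it consumes already-projected
    q/k/v and returns the raw attention result of width num_kv_heads * vdim,
    leaving the output projection to a separate dense layer."""
    def __init__(self, num_kv_heads: int, vdim: int):
        super().__init__()
        self.num_kv_heads, self.vdim = num_kv_heads, vdim

    def forward(self, q, k, v):
        # q, k: (batch, seq, num_kv_heads * head_dim); v: (batch, seq, num_kv_heads * vdim)
        b, s, _ = q.shape
        q = q.view(b, s, self.num_kv_heads, -1).transpose(1, 2)
        k = k.view(b, s, self.num_kv_heads, -1).transpose(1, 2)
        v = v.view(b, s, self.num_kv_heads, self.vdim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # raw attention, no W_o
        # Output width is num_kv_heads * vdim, not the hidden dimension.
        return out.transpose(1, 2).reshape(b, s, self.num_kv_heads * self.vdim)

class DisaggregatedAttentionBlock(nn.Module):
    """QKV and output projections as ordinary dense layers, outside the
    attention operator, so LoRA can wrap them like any other Linear."""
    def __init__(self, hidden: int, num_kv_heads: int, head_dim: int, vdim: int):
        super().__init__()
        self.wq = nn.Linear(hidden, num_kv_heads * head_dim, bias=False)
        self.wk = nn.Linear(hidden, num_kv_heads * head_dim, bias=False)
        self.wv = nn.Linear(hidden, num_kv_heads * vdim, bias=False)
        self.attn = WeightlessAttention(num_kv_heads, vdim)
        self.wo = nn.Linear(num_kv_heads * vdim, hidden, bias=False)

    def forward(self, x):
        raw = self.attn(self.wq(x), self.wk(x), self.wv(x))
        return self.wo(raw)  # output projection applied outside the operator

# Example usage (illustrative shapes):
# block = DisaggregatedAttentionBlock(hidden=512, num_kv_heads=8, head_dim=64, vdim=64)
# y = block(torch.randn(2, 16, 512))  # (batch=2, seq=16, hidden=512)
```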
Related Issues:
Linked Issues:
Issues closed by this PR: