Add tensor parallelism for RWKV #1237
base: main
Conversation
* add asserts and fix post training readme; precommit (Co-authored-by: Quentin Anthony <[email protected]>)
* fix typo; fix neoxargs usage test; skip conversion test due to multiprocessing issue; precommit (Co-authored-by: Quentin Anthony <[email protected]>)
* Add ERROR logging prefix and sort alphabetically; fix comment
configs/neox_arguments.md (Outdated)
@@ -843,6 +843,29 @@ Model Arguments

- **dim_att**: int
We should either have unified args (across mamba, rwkv, transformers) for these, or prepend these args with whatever block type they're targeting (e.g. rwkv_dim_att).
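As a sketch of the second option only (these prefixed names are hypothetical and not part of this PR), the RWKV-specific sizes could be namespaced in the argument dataclass along these lines:

from dataclasses import dataclass


@dataclass
class RWKVArgsSketch:
    # Hypothetical block-type-prefixed variants of the RWKV-specific sizes;
    # in the real codebase these would live on NeoXArgsModel.
    rwkv_dim_att: int = None
    rwkv_dim_ffn: int = None
    rwkv_head_size: int = None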
"num_layers": 24, | ||
"hidden_size": 1024, | ||
"num_attention_heads": 16, # head_size = dim_att / num_attention_heads. |
Similar comment here. Calling these attention heads is highly misleading.
I kind of disagree, as RWKV code generally references time mixing as attention, and the RWKV kernel is often called a type of "linear attention." But I can add a bunch of configs to decouple RWKV and transformer config options; this will just create a lot of config args that have essentially the same purpose, in my opinion.
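For concreteness, the head-size relationship in the config above works out as follows (assuming dim_att defaults to hidden_size when not set explicitly, which is the usual RWKV convention):

hidden_size = 1024
num_attention_heads = 16
dim_att = hidden_size  # assumption: dim_att falls back to hidden_size when unset
head_size = dim_att // num_attention_heads
print(head_size)  # 64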
except ModuleNotFoundError:
    print(
        "Unable to import RWKV FLA kernels. Install them from our requirements/requirements-rwkv.txt, \
        or directly from https://github.com/sustcsonglin/flash-linear-attention.git, or use CUDA kernels."
This last point, "or use CUDA kernels", is confusing. Can you add a "by doing xyz" so that users know what you mean?
reminder^
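For illustration, the fallback the message hints at could be spelled out roughly like this; the import target and the wiring to the PR's rwkv_fla flag are assumptions, not the PR's actual code:

HAVE_FLA = True
try:
    # Illustrative import; the exact fla entry points depend on the installed version.
    import fla  # noqa: F401
except ModuleNotFoundError:
    HAVE_FLA = False
    print(
        "Unable to import RWKV FLA kernels. Install them from our "
        "requirements/requirements-rwkv.txt, or directly from "
        "https://github.com/sustcsonglin/flash-linear-attention.git, "
        "or set rwkv_fla: false in your config to use the CUDA kernels."
    )


def use_fla(neox_args):
    # Fall back to the CUDA kernels when FLA is unavailable or disabled.
    return HAVE_FLA and neox_args.rwkv_fla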
@@ -104,7 +126,7 @@ class RWKV_TimeMix(nn.Module):
TODO: fix jit compiling.
Is this based on the parser issue we discussed? I think it's worth testing just-jit, reordered jit, and the heuristics I suggested before merging with this TODO.
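For reference, the upstream RWKV code gates TorchScript behind an environment flag roughly like the sketch below (not this PR's code); the reordered-jit and heuristic variants mentioned above would replace the simple toggle:

import os

import torch
import torch.nn as nn

# Sketch: toggle TorchScript compilation via an env var so the jit path can be
# switched off while the compile issue in the TODO is being debugged.
JIT_ON = os.environ.get("RWKV_JIT_ON", "0") == "1"
MyModule = torch.jit.ScriptModule if JIT_ON else nn.Module
MyFunction = torch.jit.script_method if JIT_ON else (lambda f: f)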
self.ffn = RWKV_ChannelMix(neox_args, layer_number)
self.ffn = ParallelRWKV_ChannelMix(neox_args, layer_number, init_method=init_method)

if neox_args.attention_dropout > 0:
Another attention arg for RWKV. Can we decouple attention dropout from RWKV?
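One possible decoupling, sketched with a hypothetical rwkv_dropout argument that is not part of this PR:

import torch.nn as nn


def build_rwkv_dropout(neox_args):
    # 'rwkv_dropout' is hypothetical; fall back to the shared attention_dropout
    # when it is not set, so existing configs keep working.
    p = getattr(neox_args, "rwkv_dropout", None)
    if p is None:
        p = neox_args.attention_dropout
    return nn.Dropout(p=p) if p > 0 else nn.Identity()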
megatron/neox_arguments/arguments.py (Outdated)
WARNING = f"{YELLOW}[WARNING]{END}" | ||
|
||
### Formatted logging prefixes ### | ||
ERROR = f"{RED}[ERROR]{END} " |
I don't think we've properly merged this branch onto upstream main, since this is tracking as a change. Please do this.
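For context on this hunk, the YELLOW/RED/END names are presumably ANSI escape sequences along these lines (a sketch of the convention, not the file's actual definitions):

# Common ANSI escape codes used for colored logging prefixes.
YELLOW = "\033[93m"
RED = "\033[91m"
END = "\033[0m"

WARNING = f"{YELLOW}[WARNING]{END}"
ERROR = f"{RED}[ERROR]{END} "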
@@ -277,6 +277,11 @@ class NeoXArgsModel(NeoXArgsTemplate):
}
"""

rwkv_fla: bool = False
Regen neox_arguments.md, since this isn't showing up there.
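Assuming neox_arguments.md is regenerated from per-field docstrings (the convention visible in the hunk above), the new flag would also need one; suggested wording only:

rwkv_fla: bool = False
"""
Use the Triton flash-linear-attention (FLA) kernels for RWKV instead of the
CUDA kernels.
"""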
megatron/training.py (Outdated)
@@ -406,6 +406,9 @@ def get_batch(neox_args, data_iterator):
datatype=datatype,
)
elif neox_args.train_impl == "kto":
assert (
I think these will also go away with a proper rebase onto latest main
…ype' option was removed (#1309): fix 'intermediate_size' in Llama configuration files after the 'mlp_type' option was removed; config adjustments for llama and gated activations; pre-commit (Co-authored-by: jahatef <[email protected]>, Quentin Anthony <[email protected]>)
* Python 3.10 support: Python 3.10 support was added in #1122; update wording on torch and python (Co-authored-by: Quentin Anthony <[email protected]>)
* adds pyproject files and tests; formatting and add dev packages to dev req files; improve req testing (Co-authored-by: Quentin Anthony <[email protected]>)
Adds a tensor parallel implementation for RWKV, and support for the Triton FLA (flash-linear-attention) implementation in GPT-NeoX.
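To make the tensor-parallel part concrete, below is a minimal sketch of how an RWKV channel mix can be split across ranks with Megatron-style column/row-parallel linears. It assumes upstream GPT-NeoX's mpu.ColumnParallelLinear/RowParallelLinear interface (taking neox_args and returning an (output, bias) tuple), omits RWKV's token-shift mixing, and is not the PR's actual ParallelRWKV_ChannelMix.

import torch
import torch.nn as nn

from megatron import mpu  # assumption: upstream GPT-NeoX's model-parallel utilities


class ParallelChannelMixSketch(nn.Module):
    """Tensor-parallel RWKV channel mix (token-shift mixing omitted).

    The key projection is split column-wise across model-parallel ranks and the
    value projection row-wise, so the only communication in the forward pass is
    the all-reduce inside RowParallelLinear.
    """

    def __init__(self, neox_args, init_method):
        super().__init__()
        # 'dim_ffn' may or may not exist in the config; fall back to 4x hidden.
        ffn_dim = getattr(neox_args, "dim_ffn", None) or 4 * neox_args.hidden_size

        self.key = mpu.ColumnParallelLinear(
            neox_args=neox_args,
            input_size=neox_args.hidden_size,
            output_size=ffn_dim,
            bias=False,
            init_method=init_method,
            gather_output=False,  # keep the per-rank shard local
        )
        self.value = mpu.RowParallelLinear(
            neox_args=neox_args,
            input_size=ffn_dim,
            output_size=neox_args.hidden_size,
            bias=False,
            init_method=init_method,
            input_is_parallel=True,  # consumes the column-parallel shard directly
        )
        # Kept unsharded here for simplicity; a full implementation could
        # parallelize the receptance projection as well.
        self.receptance = nn.Linear(
            neox_args.hidden_size, neox_args.hidden_size, bias=False
        )

    def forward(self, x):
        k, _ = self.key(x)               # [..., ffn_dim / world_size]
        k = torch.square(torch.relu(k))  # RWKV's squared-ReLU activation
        kv, _ = self.value(k)            # all-reduced back to [..., hidden_size]
        return torch.sigmoid(self.receptance(x)) * kv

Splitting the key projection column-wise and the value projection row-wise mirrors how Megatron parallelizes the transformer MLP, which keeps the forward pass down to a single all-reduce per channel mix.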