[muP] Rework #1087

Open
wants to merge 109 commits into
base: main
0d921f7
changed ordering for setting up norm_factor
lintangsutawika Dec 1, 2023
abee54d
Update NeoXArgs docs automatically
invalid-email-address Dec 1, 2023
a08c3ef
updated muP args to the minimum required
lintangsutawika Dec 1, 2023
c35e830
calculate m_width
lintangsutawika Dec 1, 2023
2807e52
Merge branch 'main' of https://github.com/EleutherAI/gpt-neox into re…
lintangsutawika Dec 1, 2023
2d127df
Merge branch 'rework-mup' of https://github.com/EleutherAI/gpt-neox i…
lintangsutawika Dec 1, 2023
81fdc4d
Update NeoXArgs docs automatically
invalid-email-address Dec 1, 2023
7d6b246
changed ordering for setting up norm_factor
lintangsutawika Dec 1, 2023
a0d1929
updated muP args to the minimum required
lintangsutawika Dec 1, 2023
d63b3b8
calculate m_width
lintangsutawika Dec 1, 2023
9be82fe
Update NeoXArgs docs automatically
invalid-email-address Dec 1, 2023
66214d9
removed redundant line
lintangsutawika Dec 1, 2023
17b7183
removed redundant lines
lintangsutawika Dec 1, 2023
a6bad07
Update NeoXArgs docs automatically
invalid-email-address Dec 1, 2023
63984bd
removed redundant lines
lintangsutawika Dec 1, 2023
02687a8
Merge branch 'rework-mup' of https://github.com/EleutherAI/gpt-neox i…
lintangsutawika Dec 1, 2023
11114e2
Update NeoXArgs docs automatically
invalid-email-address Dec 1, 2023
05c4de3
modify init with mup
lintangsutawika Dec 1, 2023
71a91e4
divide logits by the m_width
lintangsutawika Dec 1, 2023
99c8ce0
moved position of mup parameters being processed
lintangsutawika Dec 1, 2023
b253ab6
add note
lintangsutawika Dec 1, 2023
1919499
made param groups to hold flag for mup scaling
lintangsutawika Dec 6, 2023
17678e0
lr scale
lintangsutawika Dec 6, 2023
2bd5ae6
update config
lintangsutawika Dec 6, 2023
6642291
adjust process of mup variables
lintangsutawika Dec 6, 2023
8be6c66
remove calling save_base_shapes
lintangsutawika Dec 18, 2023
c9fb18b
lr adjustments is done in train_step to address lr being reset due to…
lintangsutawika Dec 18, 2023
795371c
lr scaling for mup is moved here instead
lintangsutawika Dec 18, 2023
087beee
removed mup usage for coord check
lintangsutawika Jan 3, 2024
16d04b1
merged with main
lintangsutawika Jan 3, 2024
e7b7bf6
latest update on coord check implementation
lintangsutawika Jan 24, 2024
8dea9ce
fix merge conflict
lintangsutawika Feb 2, 2024
3664eba
changed `mup_m_width` to `mup_width_multiplier`
lintangsutawika Feb 2, 2024
6a46247
fixed notations
lintangsutawika Feb 2, 2024
7439f9a
correct scale
lintangsutawika Feb 2, 2024
5b2d31c
m_emb * embed(X)
lintangsutawika Feb 2, 2024
98caa82
removed mup rescale in the layers
lintangsutawika Feb 2, 2024
5c99637
removed mup rescale in the layers
lintangsutawika Feb 2, 2024
a636f06
adjust mup_m_emb to mup_embedding_multiplier
lintangsutawika Feb 2, 2024
39190c5
add multiplier mup_output_multiplier
lintangsutawika Feb 20, 2024
2489cc0
reorder model loading
lintangsutawika Feb 20, 2024
23b8776
removed comments
lintangsutawika Feb 20, 2024
10e935e
removed comments
lintangsutawika Feb 20, 2024
a0aca99
implement full process
lintangsutawika Feb 20, 2024
9472b35
set neox_args.iteration to 0 for coord_check mode
lintangsutawika Feb 21, 2024
5c5f2df
move mup_width_multiplier init
lintangsutawika Feb 21, 2024
7eca3e7
mup_coord_check returns 2 df
lintangsutawika Feb 21, 2024
c9a3a65
can run
lintangsutawika Feb 21, 2024
a7877d4
remove commehts
lintangsutawika Feb 22, 2024
bd9d399
add hooks
lintangsutawika Feb 22, 2024
fe180d3
remove comments
lintangsutawika Feb 22, 2024
b240c19
uncomment activation data
lintangsutawika Feb 22, 2024
93b4241
plot coords
lintangsutawika Feb 22, 2024
d4899fc
removed variables, add way to plot only from rank 0
lintangsutawika Feb 22, 2024
f589e29
changed key name in dict
lintangsutawika Feb 22, 2024
8261e0d
remove print
lintangsutawika Feb 22, 2024
25aa786
fix how width_multiplier is applied
lintangsutawika Feb 22, 2024
4d246a1
updated plot config
lintangsutawika Feb 22, 2024
84c5380
update files
lintangsutawika Feb 26, 2024
b2f1101
Merge branch 'main' into rework-mup
lintangsutawika Feb 26, 2024
42d4cde
Update NeoXArgs docs automatically
invalid-email-address Feb 26, 2024
4c477d5
init function, add input embedding different initialization
lintangsutawika Feb 27, 2024
64dc4c5
Merge branch 'rework-mup' of https://github.com/EleutherAI/gpt-neox i…
lintangsutawika Feb 27, 2024
65c103e
changeoutput layer to normal
lintangsutawika Feb 27, 2024
08b5d40
change from mean to std
lintangsutawika Feb 27, 2024
2ca94a8
double attention head for every hidden size doubled
lintangsutawika Feb 27, 2024
7483246
Merge branch 'main' into rework-mup
lintangsutawika Feb 27, 2024
497485c
Update NeoXArgs docs automatically
invalid-email-address Feb 27, 2024
34fb7ca
added args
lintangsutawika Feb 27, 2024
2d53f1f
simplify coordcheck
lintangsutawika Feb 27, 2024
7897610
seperate sp and mup configs
lintangsutawika Feb 27, 2024
4f39209
perform coordcheck for sp and mup seperately
lintangsutawika Feb 27, 2024
5f84a3f
Update NeoXArgs docs automatically
invalid-email-address Feb 27, 2024
479b854
update
lintangsutawika Feb 28, 2024
21a7e32
update how params are sorted
lintangsutawika Feb 28, 2024
bb2e0c9
remove unused comments
lintangsutawika Feb 28, 2024
bf1ce06
adjust
lintangsutawika Feb 29, 2024
50a3dba
simplify
lintangsutawika Feb 29, 2024
c4c1660
fix mup embedding multiplier
lintangsutawika Feb 29, 2024
1c35911
embeddingpipe fix init
lintangsutawika Feb 29, 2024
84be4d4
changed how manual seed is loaded
lintangsutawika Feb 29, 2024
fbb4daf
removed musgd and other changces
lintangsutawika Feb 29, 2024
fa142ff
update config
lintangsutawika Feb 29, 2024
ad2336f
fixed how params are sorted
lintangsutawika Feb 29, 2024
fe73bc3
update how seed is computed
lintangsutawika Feb 29, 2024
a3bd44c
update to follow pre-commit format
lintangsutawika Feb 29, 2024
56b6c9b
update from main
lintangsutawika Feb 29, 2024
2365fd5
update
lintangsutawika Feb 29, 2024
e8639a0
Update NeoXArgs docs automatically
invalid-email-address Feb 29, 2024
47e1438
fix lr weighting
lintangsutawika Mar 5, 2024
a064f9b
hard set to 1.0 if neox_args.use_mup is false
lintangsutawika Mar 5, 2024
b0da27a
Merge branch 'main' into rework-mup
Quentin-Anthony Apr 21, 2024
6fe55f4
Update NeoXArgs docs automatically
invalid-email-address Apr 21, 2024
8bf8bcd
add new parameters
lintangsutawika May 2, 2024
7f0b033
add parameter checks
lintangsutawika May 2, 2024
f802869
updates to argument processing for mup
lintangsutawika May 2, 2024
cc71104
add data save and descriptions being printed
lintangsutawika May 2, 2024
c8feb39
update mup
lintangsutawika May 2, 2024
b6b3a02
update seed
lintangsutawika May 2, 2024
847e892
remove print text
lintangsutawika May 2, 2024
1b0027c
fixed kv
lintangsutawika May 2, 2024
055596f
update
lintangsutawika May 2, 2024
fabb45b
update dewcriptions being printed
lintangsutawika May 2, 2024
5ccf693
removed unused lines
lintangsutawika May 2, 2024
9dd583b
Merge branch 'rework-mup' of https://github.com/EleutherAI/gpt-neox i…
lintangsutawika May 2, 2024
6a8ad71
Merge branch 'main' into rework-mup
lintangsutawika May 2, 2024
485cad4
Update NeoXArgs docs automatically
invalid-email-address May 2, 2024
c291906
Merge branch 'main' into rework-mup
Quentin-Anthony Sep 23, 2024
1ac9add
Merge branch 'main' into rework-mup
Quentin-Anthony Oct 8, 2024
103 changes: 103 additions & 0 deletions configs/coord_check_mup.yml
@@ -0,0 +1,103 @@
{
# parallelism settings
"pipe_parallel_size": 1,
"model_parallel_size": 1,

# model settings
"num_layers": 2,
"num_attention_heads": 4,
"seq_length": 2048,
"max_position_embeddings": 2048,
"pos_emb": "rotary",
"rotary_pct": 0.25,
"no_weight_tying": true,
"gpt_j_residual": true,
"output_layer_parallelism": "column",

# these should provide some speedup but takes a while to build, set to true if desired
"scaled_upper_triang_masked_softmax_fusion": true,
"bias_gelu_fusion": true,

# # init methods
# "init_method": "small_init",
# "output_layer_init_method": "wang_init",

# init methods
"init_method": "normal",
"output_layer_init_method": "scaled_normal",

# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"betas": [0.9, 0.95],
"eps": 1.0e-8,
}
},
"lr_decay_style": constant,
"warmup": 0,

# for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 1260000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1260000000,
"contiguous_gradients": true,
"cpu_offload": false
},

# batch / data settings
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 1,
"data_impl": "mmap",
"num_workers": 1,

# activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

# regularization
"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

# precision settings
"precision": "fp32",
# "fp16": {
# "fp16": true,
# "enabled": true,
# "loss_scale": 0,
# "loss_scale_window": 1000,
# "hysteresis": 2,
# "min_loss_scale": 1
# },

# misc. training settings
"train_iters": 10,
"log_interval": 1,
"distributed_backend": "nccl",

"coord_check": true,
"coord_check_nsteps": 5,
"coord_check_nseeds": 1,
"use_mup": true,
# base lr
"mup_lr": 0.01,
# base sigma
"mup_std": 0.08,
# base size
"mup_d_model_base": 256,
"mup_hidden_size": 256,

"tokenizer_type": "HFTokenizer",
"vocab-file": "/mnt/ssd-1/lintang/09-mup-neox/20B_tokenizer.json",
"data-path": "/mnt/ssd-1/lintang/09-mup-neox/data/enwik8/enwik8_text_document",
"mup_save": "/mnt/ssd-1/lintang/09-mup-neox/mup_results",

}
101 changes: 101 additions & 0 deletions configs/coord_check_sp.yml
@@ -0,0 +1,101 @@
{
# parallelism settings
"pipe_parallel_size": 1,
"model_parallel_size": 1,

# model settings
"num_layers": 2,
"num_attention_heads": 4,
"seq_length": 2048,
"max_position_embeddings": 2048,
"pos_emb": "rotary",
"rotary_pct": 0.25,
"no_weight_tying": true,
"gpt_j_residual": true,
"output_layer_parallelism": "column",

# these should provide some speedup but takes a while to build, set to true if desired
"scaled_upper_triang_masked_softmax_fusion": true,
"bias_gelu_fusion": true,

# # init methods
# "init_method": "small_init",
# "output_layer_init_method": "wang_init",

# init methods
"init_method": "normal",
"output_layer_init_method": "scaled_normal",

# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.01,
"betas": [0.9, 0.95],
"eps": 1.0e-8,
}
},
"lr_decay_style": constant,
"warmup": 0,

# for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 1260000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1260000000,
"contiguous_gradients": true,
"cpu_offload": false
},

# batch / data settings
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 1,
"data_impl": "mmap",
"num_workers": 1,

# activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

# regularization
"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

# precision settings
"precision": "fp32",
# "fp16": {
# "fp16": true,
# "enabled": true,
# "loss_scale": 0,
# "loss_scale_window": 1000,
# "hysteresis": 2,
# "min_loss_scale": 1
# },

# misc. training settings
"train_iters": 10,
"log_interval": 1,
"distributed_backend": "nccl",

"coord_check": true,
"coord_check_nsteps": 5,
"coord_check_nseeds": 1,
# "use_mup": true,
# base sigma
"init_method_std": 0.08,
# base size
"hidden_size": 256,

"tokenizer_type": "HFTokenizer",
"vocab-file": "/mnt/ssd-1/lintang/09-mup-neox/20B_tokenizer.json",
"data-path": "/mnt/ssd-1/lintang/09-mup-neox/data/enwik8/enwik8_text_document",
"mup_save": "/mnt/ssd-1/lintang/09-mup-neox/mup_results",

}
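The two coordinate-check configs above differ mainly in whether `use_mup` is set, with the same base std (0.08) and base width (256). The sketch below illustrates what a coordinate check is meant to reveal: how the scale of a layer's pre-activations changes with width. It is not the PR's implementation. It assumes the common muP convention that hidden-weight std shrinks as `1/sqrt(width_multiplier)`, and `preactivation_std` is a hypothetical helper that uses only the standard library.

```python
import math
import random
import statistics

MUP_STD = 0.08        # "mup_std" / "init_method_std" in the configs above
D_MODEL_BASE = 256    # "mup_d_model_base" in the muP config above


def preactivation_std(d_model, use_mup, n_out=256, seed=0):
    """Std of the coordinates of y = W x for a random unit-scale input x.

    Hypothetical helper: only n_out output coordinates are sampled so the
    pure-Python loop stays cheap.
    """
    rng = random.Random(seed)
    width_multiplier = d_model / D_MODEL_BASE
    # Assumed muP rule for hidden weights: std scales as 1/sqrt(width_multiplier).
    # Standard parametrization (SP) keeps the std fixed at every width.
    std = MUP_STD / math.sqrt(width_multiplier) if use_mup else MUP_STD
    x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
    y = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(n_out)]
    return statistics.pstdev(y)


for d_model in (256, 512, 1024):
    sp = preactivation_std(d_model, use_mup=False)
    mup = preactivation_std(d_model, use_mup=True)
    print(f"d_model={d_model:5d}  SP std={sp:.2f}  muP std={mup:.2f}")
```

Under these assumptions the SP column should grow roughly with the square root of the width, while the muP column should stay roughly flat, which is the pattern a coordinate check plot is expected to show.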
86 changes: 62 additions & 24 deletions configs/neox_arguments.md
@@ -327,6 +327,7 @@ Model Arguments
  Default = None

  Transformer hidden size.
+ When using muP, this is d_model



@@ -641,6 +642,7 @@ Model Arguments
  Default = 0.02

  Standard deviation of the zero mean normal distribution used for weight initialization.
+ When using muP this is the base std



@@ -934,6 +936,7 @@ Optimizer Arguments
  Default = None

  Max Learning rate during training
+ When using muP, this is the base learning rate



@@ -2179,7 +2182,42 @@ Training Arguments

  Default = False

- Whether to use Microsoft's Mup https://github.com/microsoft/mup
+ Whether to use muP
+
+
+
+ - **mup_save**: str
+
+ Default = None
+
+ Path to save results when using muP
+
+
+
+ - **mup_lr**: float
+
+ Default = None
+
+ An alias parameter for lr,
+ if not None will override lr
+
+
+
+ - **mup_std**: float
+
+ Default = None
+
+ An alias parameter for init_method_std,
+ if not None will override init_method_std
+
+
+
+ - **mup_hidden_size**: int
+
+ Default = None
+
+ An alias parameter for hidden_size,
+ if not None will override hidden_size



@@ -2191,68 +2229,68 @@ Training Arguments



- - **save_base_shapes**: bool
+ - **coord_check_nsteps**: int

- Default = False
+ Default = 10

- Whether to save base shapes for mup. This will save the shapes to the path specified in base-shapes-file.
+ Number of steps to do for the coordinate check



- - **base_shapes_file**: str
+ - **coord_check_nseeds**: int

- Default = None
+ Default = 5

- Path to the base shapes to save to/load from
+ Number of repetition for each size in coordinate check



- - **mup_init_scale**: float
+ - **save_base_shapes**: bool

- Default = 1.0
+ Default = False

- Initialization scale: All the parameters are multiplied by this value
+ Whether to save base shapes for mup. This will save the shapes to the path specified in base-shapes-file.



- - **mup_attn_temp**: float
+ - **base_shapes_file**: str

- Default = 1.0
+ Default = None

- Attention temperature: Reciprocal of the multiplier applied to the input to attention softmax
+ Path to the base shapes to save to/load from



- - **mup_output_temp**: float
+ - **mup_embedding_multiplier**: float

  Default = 1.0

- Output temperature: Reciprocal of the multiplier applied to the input to softmax that
- produces the distribution over output tokens.
+ Embedding output multiplier



- - **mup_embedding_mult**: float
+ - **mup_output_multiplier**: float

  Default = 1.0

- Scalar by which we multiply the output of the embedding layer
+ Output logits multiplier



- - **mup_rp_embedding_mult**: float
+ - **mup_width_multiplier**: float

- Default = 1.0
+ Default = None

- Scalar by which we multiply vectors representing relative position
+ Manually set the layer width multiplier (d_model/d_model,base)



- - **mup_width_scale**: int
+ - **mup_d_model_base**: int

- Default = 2
+ Default = 256

- What to scale width by when creating the delta model for mup
+ d_model,base
+ Proxy (base) model's layer width
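The argument descriptions above imply a small resolution step: each `mup_*` alias overrides its standard counterpart when set, and the width multiplier is `d_model / d_model,base` unless given manually. The following is a minimal sketch of that logic, not the PR's actual argument-processing code. `MupArgs` and `resolve_mup_args` are hypothetical names, and forcing the multiplier to 1.0 when `use_mup` is false is an assumption based on the commit message "hard set to 1.0 if neox_args.use_mup is false".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MupArgs:
    """Hypothetical stand-in for the subset of NeoXArgs documented above."""

    use_mup: bool = False
    lr: Optional[float] = None
    init_method_std: float = 0.02
    hidden_size: Optional[int] = None
    mup_lr: Optional[float] = None
    mup_std: Optional[float] = None
    mup_hidden_size: Optional[int] = None
    mup_d_model_base: int = 256
    mup_width_multiplier: Optional[float] = None


def resolve_mup_args(args: MupArgs) -> MupArgs:
    """Apply the alias overrides, then derive the width multiplier."""
    # Each mup_* alias overrides its standard counterpart when it is set.
    if args.mup_lr is not None:
        args.lr = args.mup_lr
    if args.mup_std is not None:
        args.init_method_std = args.mup_std
    if args.mup_hidden_size is not None:
        args.hidden_size = args.mup_hidden_size

    if not args.use_mup:
        # Assumption: without muP the multiplier is neutral.
        args.mup_width_multiplier = 1.0
    elif args.mup_width_multiplier is None:
        # d_model / d_model,base, unless the user set the multiplier manually.
        args.mup_width_multiplier = args.hidden_size / args.mup_d_model_base
    return args


args = resolve_mup_args(
    MupArgs(use_mup=True, mup_lr=0.01, mup_std=0.08, mup_hidden_size=1024)
)
print(args.lr, args.init_method_std, args.hidden_size, args.mup_width_multiplier)
```

With a hidden size of 1024 and the default base width of 256, the derived width multiplier is 4.0. The sketch does not validate that `hidden_size` is set when muP is enabled; a real implementation would need such a check.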



6 changes: 4 additions & 2 deletions megatron/learning_rates.py
@@ -37,6 +37,7 @@ def __init__(
         use_checkpoint_lr_scheduler=True,
         override_lr_scheduler=False,
         use_mup=False,
+        mup_width_multiplier=1,
     ):

         # Class values.
@@ -51,6 +52,7 @@ def __init__(
         self.override_lr_scheduler = override_lr_scheduler
         self.use_checkpoint_lr_scheduler = use_checkpoint_lr_scheduler
         self.use_mup = use_mup
+        self.mup_width_multiplier = mup_width_multiplier
         if self.override_lr_scheduler:
             assert not self.use_checkpoint_lr_scheduler, (
                 "both override and " "use-checkpoint are set."
@@ -95,8 +97,8 @@ def step(self, step_num=None):
         self.num_iters = step_num
         new_lr = self.get_lr()
         for group in self.optimizer.param_groups:
-            if self.use_mup and "width_mult" in group:
-                group["lr"] = new_lr / group["width_mult"]
+            if self.use_mup and ("lr_adjust" in group) and group["lr_adjust"] is True:
+                group["lr"] = new_lr / self.mup_width_multiplier
             else:
                 group["lr"] = new_lr
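The change to `step()` replaces a per-group `width_mult` with a single scheduler-level `mup_width_multiplier`, applied only to groups flagged with `lr_adjust`. The sketch below reproduces that branch on plain dictionaries so it runs without an optimizer. `apply_mup_lr` and the group names are hypothetical; which parameters actually receive `lr_adjust` is decided elsewhere in the PR and is not shown here.

```python
def apply_mup_lr(param_groups, new_lr, use_mup, mup_width_multiplier):
    """Set the lr of each group, mirroring the updated loop in step()."""
    for group in param_groups:
        # Only groups explicitly flagged with lr_adjust=True are rescaled.
        if use_mup and ("lr_adjust" in group) and group["lr_adjust"] is True:
            group["lr"] = new_lr / mup_width_multiplier
        else:
            group["lr"] = new_lr
    return param_groups


# Hypothetical parameter groups, for illustration only.
groups = [
    {"name": "flagged", "lr_adjust": True},
    {"name": "unflagged", "lr_adjust": False},
    {"name": "no_flag"},
]
apply_mup_lr(groups, new_lr=0.01, use_mup=True, mup_width_multiplier=4.0)
for group in groups:
    print(group["name"], group["lr"])
```

Groups that lack the flag, or carry it as `False`, keep the scheduler's learning rate unchanged, so the default `mup_width_multiplier=1` together with `use_mup=False` leaves existing behavior intact.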
