Enable rocm-support #353

Open · wants to merge 43 commits into base: main

Commits (43):
ebb79c8  Squash 3 commits to 1 (luukkonenr, Oct 5, 2022)
21c90de  Add --no-layer-norm-fusion argument (spyysalo, Oct 21, 2022)
e048713  Add --no-optimizer-fusion argument (spyysalo, Oct 21, 2022)
18e2c65  Bugfix (thanks to Thomas Wang for catching this) (spyysalo, Oct 21, 2022)
9b7cd05  Fix the bug of FusedLayerNorm on ROCm (#96) (hubertlu-tw, Nov 18, 2022)
277e1d3  Revert cherry-picked changes to .py (spyysalo, Nov 18, 2022)
2963cae  Add LUMI eval compat (Muennighoff, Nov 22, 2022)
32f039c  Update tasks (Muennighoff, Nov 22, 2022)
fdd57c4  Merge pull request #1 from bigscience-workshop/lumi_eval (spyysalo, Nov 22, 2022)
2ca2338  add inverse_sqrt lr decay style (NouamaneTazi, Nov 22, 2022)
ad60932  fix no warmup case (NouamaneTazi, Nov 22, 2022)
0823ad8  use t5x formula (NouamaneTazi, Nov 23, 2022)
a093db6  avoid num_steps > decay_steps case (NouamaneTazi, Nov 23, 2022)
b4601b9  remove casting as math.sqrt does that (NouamaneTazi, Nov 23, 2022)
4dae139  add lr-warmup-style argument taking "constant" or "linear" values (NouamaneTazi, Nov 23, 2022)
5fbb1dd  refactor num_steps_ (NouamaneTazi, Nov 23, 2022)
6299fb2  docs (NouamaneTazi, Nov 23, 2022)
4e86650  fix formulas (NouamaneTazi, Nov 24, 2022)
50c6935  fix formula (NouamaneTazi, Nov 24, 2022)
5c642dd  correct comment (NouamaneTazi, Nov 28, 2022)
1b14a28  note about replicating t5x (NouamaneTazi, Nov 28, 2022)
5e811b6  Merge pull request #2 from NouamaneTazi/inverse-sqrt-lr (NouamaneTazi, Nov 28, 2022)
5365f41  quick fix for upper triang masked softmax cuda kernel for seq_len < 8192 (NouamaneTazi, Dec 6, 2022)
9874963  Merge pull request #3 from NouamaneTazi/large-seqlen-kernels (spyysalo, Dec 7, 2022)
c41cc5e  Use torch.multiprocessing.set_start_method('spawn') (spyysalo, Dec 7, 2022)
6732bc9  skip_warmup on __setstate__ (spyysalo, Dec 9, 2022)
ab29faf  Copy preliminary UL2 (Muennighoff, Dec 28, 2022)
9328ad2  DeepSpeed compat (Muennighoff, Dec 29, 2022)
351f4f2  DS Group compat (Muennighoff, Dec 30, 2022)
abc19b8  Adapt eval for denoiser (Muennighoff, Dec 30, 2022)
816c32d  Simpler padding (Muennighoff, Jan 3, 2023)
bdbd54a  Fix sampling (Muennighoff, Jan 3, 2023)
cacf267  Switch padding (Muennighoff, Jan 3, 2023)
4769132  Merge pull request #4 from TurkuNLP/ul2 (spyysalo, Jan 4, 2023)
557b09c  Upate sampling (Muennighoff, Jan 23, 2023)
a6f69bf  Update UL2 (Muennighoff, Jan 24, 2023)
d0d277f  Add get_samples_mapping (Muennighoff, Jan 24, 2023)
3f29df8  Import math (Muennighoff, Jan 24, 2023)
5207386  Fix prefixlm (Muennighoff, Feb 6, 2023)
9490e50  tmp (Muennighoff, May 19, 2023)
9c8d02c  Merge branch 'main' into tmp (Muennighoff, May 19, 2023)
6936afb  Revert UL2 Tokenizer Changes (Muennighoff, May 19, 2023)
a1088c1  Merge pull request #7 from TurkuNLP/tmp (Muennighoff, May 19, 2023)
10 changes: 10 additions & 0 deletions .gitignore
@@ -1,3 +1,13 @@
## HIP-compiled kernels etc.
*hip*
#
local_examples/
logs/
trash/
kb-runs-gpt/
ds_configs/
gpt2-tokenizer/
smi-output/
# tests
# megatron autogenerated indices
tests/data/*/*npy
1 change: 1 addition & 0 deletions examples/run_evalharness_deepspeed.md
@@ -15,6 +15,7 @@ Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `b
start-prod
pip install best-download==0.0.7
pip install git+https://github.com/EleutherAI/lm-evaluation-harness
pip install --upgrade scipy
```

2. Pre-download needed datasets
113 changes: 113 additions & 0 deletions examples/run_evalharness_lumi.sh
@@ -0,0 +1,113 @@
#!/bin/bash
#SBATCH --exclude=nid005159
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH -p eap
#SBATCH -t 2-0:00:00
#SBATCH --gpus-per-node=mi250:1
#SBATCH --exclusive=user
#SBATCH --hint=nomultithread
#SBATCH --account=project_462000119
#SBATCH -o logs/%j.out
#SBATCH -e logs/%j.err

# if run without sbatch, invoke here
if [ -z "${SLURM_JOB_ID:-}" ]; then
    mkdir -p logs
    sbatch "$0"
    exit
fi

set -euo pipefail

# symlink logs/latest_eval.out and logs/latest_eval.err
ln -f -s $SLURM_JOB_ID.out logs/latest_eval.out
ln -f -s $SLURM_JOB_ID.err logs/latest_eval.err

# Data
CHECKPOINT_PATH=/scratch/project_462000119/muennighoff/nov-2022-optimization/checkpoints/global_step10
VARIANT=global_step10

export HF_DATASETS_OFFLINE=1
export HF_DATASETS_CACHE=/scratch/project_462000119/ds_cache

VOCAB_FILE="gpt2/vocab.json"
MERGE_FILE="gpt2/merges.txt"

PP_SIZE=1
TP_SIZE=1
# different from the training MICRO_BATCH_SIZE - no optim memory, so can do bigger BS
# make as big as it can fit into gpu w/o OOM, but not too close to 100%
EVAL_MICRO_BATCH_SIZE=1
MICRO_BS_MULTIPLIER=1

# Model parameters
SEQ_LEN=2048

# Dummy arguments
MEGATRON_REQUIRED_ARGS=" \
--num-layers -1 \
--hidden-size -1 \
--num-attention-heads -1 \
--seq-length -1 \
--max-position-embeddings -1 \
"

ZERO_STAGE=0

mkdir -p ds_configs
DS_CONFIG_PATH="ds_configs/$SLURM_JOB_ID.json"

cat <<EOF > $DS_CONFIG_PATH
{
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": $ZERO_STAGE
    },
    "bf16": {
        "enabled": true
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
EOF

DEEPSPEED_ARGS=" \
--deepspeed \
--deepspeed_config $DS_CONFIG_PATH \
--zero-stage $ZERO_STAGE \
"

CMD="Megatron-DeepSpeed/tasks/eval_harness/evaluate.py \
--load $CHECKPOINT_PATH \
--results_path $VARIANT-results.json \
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--micro-batch-size $EVAL_MICRO_BATCH_SIZE \
--no-load-optim \
--no-load-rng \
--bf16 \
--inference \
--seq-length $SEQ_LEN \
--task_list copa,piqa,rte,winogrande,hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions \
--intermed_results \
--adaptive_seq_len \
--micro_bs_multiplier $MICRO_BS_MULTIPLIER \
$MEGATRON_REQUIRED_ARGS \
$DEEPSPEED_ARGS \
"

echo $CMD

echo "START $SLURM_JOBID: $(date)"

srun --label launch.sh $CMD

echo "END $SLURM_JOBID: $(date)"

1 change: 1 addition & 0 deletions finetune_t0_non_causal_decoder.py
@@ -33,6 +33,7 @@ def model_provider(pre_process=True, post_process=True):
            enabled=args.zero_stage == 3,
            mpu=mpu):
        if args.deepspeed:
            args.pretrain_causal_attention = False
            model = GPTModelPipe(
                num_tokentypes=0,
                parallel_output=True,
Expand Down
64 changes: 61 additions & 3 deletions megatron/arguments.py
@@ -24,7 +24,7 @@
import torch
import deepspeed

-from megatron.enums import PositionEmbeddingType
+from megatron.enums import PositionEmbeddingType, UL2ModelType
import megatron
from megatron.logging import log_levels

@@ -49,6 +49,7 @@ def parse_args(extra_args_provider=None, defaults={},
parser = _add_autoresume_args(parser)
parser = _add_biencoder_args(parser)
parser = _add_vit_args(parser)
parser = _add_ul2_args(parser)
parser = _add_logging_args(parser)
parser = _add_zero_args(parser)
parser = _add_memoryopt_args(parser)
@@ -309,6 +310,17 @@ def parse_args(extra_args_provider=None, defaults={},
"skip train iterations should be specified as two numbers, i.e. start-end"
)
args.skip_train_iteration_range = skip_train_iteration_range

    args.ul2_model_type = UL2ModelType(args.ul2_model_type)
    if (
        args.ul2_model_type is not UL2ModelType.ENCODER_DECODER
        and args.decoder_seq_length is not None
    ):
        print(
            f'WARNING: `--decoder_seq_length` is ignored when '
            f'`--ul2-model-type` is not '
            f'"{UL2ModelType.ENCODER_DECODER.value}"!'
        )

if args.use_bnb_optimizer:
try:
@@ -549,6 +561,12 @@ def _add_training_args(parser):
group.add_argument('--no-bias-dropout-fusion', action='store_false',
help='Disable bias and dropout fusion.',
dest='bias_dropout_fusion')
group.add_argument('--no-layer-norm-fusion', action='store_false',
help='Disable fused layer norm.',
dest='layer_norm_fusion')
group.add_argument('--no-optimizer-fusion', action='store_false',
help='Disable FusedAdam/FusedSGD.',
dest='optimizer_fusion')
group.add_argument('--optimizer', type=str, default='adam',
choices=['adam', 'sgd'],
help='Optimizer function')
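The two `--no-*-fusion` flags exist mainly for ROCm, where the fused kernels were a source of bugs (see commit 9b7cd05 above). A hypothetical launch fragment using them; only the flags come from this diff, while the script name and argument array are placeholders:

```bash
# Sketch: fall back to unfused layer norm and an unfused optimizer,
# e.g. when the HIP-compiled fused kernels misbehave on ROCm.
python pretrain_gpt.py \
    --no-layer-norm-fusion \
    --no-optimizer-fusion \
    "${OTHER_ARGS[@]}"   # remaining model/data arguments (placeholder)
```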
@@ -604,7 +622,7 @@ def _add_learning_rate_args(parser):
'and initial warmup, the learning rate at each '
'iteration would be different.')
group.add_argument('--lr-decay-style', type=str, default='linear',
-choices=['constant', 'linear', 'cosine'],
+choices=['constant', 'linear', 'cosine', 'inverse_sqrt'],
help='Learning rate decay function.')
group.add_argument('--lr-decay-iters', type=int, default=None,
help='number of iterations to decay learning rate over,'
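Per the commit history ("use t5x formula", "note about replicating t5x"), the new `inverse_sqrt` style follows the t5x rsqrt schedule. A minimal sketch of that formula, assuming the usual t5x convention of clamping the step at the warmup boundary; the real logic lives in Megatron's learning-rate scheduler and may differ in detail:

```python
import math

def inverse_sqrt_lr(max_lr, step, warmup_steps, decay_steps):
    """Hypothetical helper sketching a t5x-style rsqrt decay."""
    step = min(step, decay_steps)      # avoid num_steps > decay_steps (a093db6)
    step = max(step, warmup_steps, 1)  # clamp so decay starts from max_lr
    return max_lr * math.sqrt(max(warmup_steps, 1)) / math.sqrt(step)
```

At `step == warmup_steps` this returns exactly `max_lr`, then decays as `1/sqrt(step)` until `decay_steps` is reached.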
@@ -615,6 +633,9 @@
group.add_argument('--lr-decay-tokens', type=int, default=None,
help='number of tokens to decay learning rate over,'
' If not None will override iter/sample-based decay')
group.add_argument('--lr-warmup-style', type=str, default='linear',
choices=['constant', 'linear'], help='Learning rate '
'warmup function.')
group.add_argument('--lr-warmup-fraction', type=float, default=None,
help='fraction of lr-warmup-(iters/samples) to use '
'for warmup (as a float)')
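The semantics of the new `--lr-warmup-style` are a reasonable guess from the t5x recipe the surrounding commits replicate: `linear` ramps the rate up from zero, while `constant` presumably holds the target rate through warmup. A sketch under that assumption:

```python
def warmup_lr(max_lr, step, warmup_steps, style='linear'):
    # Sketch only; the 'constant' behaviour is an assumption,
    # not read off the implementation.
    if warmup_steps == 0 or step >= warmup_steps:
        return max_lr
    if style == 'constant':
        return max_lr                     # hold max_lr through warmup
    return max_lr * step / warmup_steps   # linear ramp from zero
```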
@@ -643,7 +664,8 @@ def _add_learning_rate_args(parser):
'from checkpoint and ignore input arguments.')
group.add_argument('--universal-checkpoint', action='store_true',
help='Loading a universal format checkpoint.')

group.add_argument('--reset-progress', action='store_true', default=None,
help='Reset iteration to 0 & do not load args.')
return parser


@@ -1023,6 +1045,42 @@ def _add_vit_args(parser):

return parser

def _add_ul2_args(parser):
group = parser.add_argument_group(title="UL2")

group.add_argument('--ul2-model-type', type=str, default='ED',
choices=['ED', 'ND', 'CD'],
help='What type of model to use for UL2 pretraining. '
'ED = encoder-decoder; ND = non-causal decoder-only; '
'CD = causal decoder-only')
group.add_argument('--ul2-denoiser-ratios', nargs='+', type=float,
default=None,
help='Probability of each denoising objective to be '
'selected. Uniform distribution by default.')
group.add_argument('--ul2-denoisers', nargs='+', type=str,
default=['R', 'R', 'S', 'X', 'X', 'X', 'X'],
choices=['R', 'S', 'X'],
help='What type of UL2 denoising objective the other '
'UL2 configurations refer to.')
group.add_argument('--ul2-mean-span-lengths', nargs='+', type=float,
default=[3, 8, 0.25, 3, 8, 64, 64],
help='Mean length for sampling span lengths. '
'Numbers < 1 indicate a mean length of the sequence '
'length times that number.')
group.add_argument('--ul2-mask-ratios', nargs='+', type=float,
default=[0.15, 0.15, 0.25, 0.5, 0.5, 0.15, 0.5],
help='Ratio of masked tokens in the full sequence.')
group.add_argument('--ul2-r-denoiser-token', type=str, default='[R]',
help='What token to prepend for the UL2 R-denoising '
'objective.')
group.add_argument('--ul2-s-denoiser-token', type=str, default='[S]',
help='What token to prepend for the UL2 S-denoising '
'objective.')
group.add_argument('--ul2-x-denoiser-token', type=str, default='[X]',
help='What token to prepend for the UL2 X-denoising '
'objective.')

return parser

def _add_zero_args(parser):
"""Text generate arguments."""
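Tying the new UL2 options together, a hypothetical pretraining invocation might look like the fragment below. The flag values are just the defaults from `_add_ul2_args` (a mixture of two R-, one S-, and four X-denoisers); the script name and argument array are placeholders:

```bash
# Sketch: UL2 pretraining with a non-causal decoder (ND) and the
# default denoiser mixture defined above.
python pretrain_gpt.py \
    --ul2-model-type ND \
    --ul2-denoisers R R S X X X X \
    --ul2-mean-span-lengths 3 8 0.25 3 8 64 64 \
    --ul2-mask-ratios 0.15 0.15 0.25 0.5 0.5 0.15 0.5 \
    "${OTHER_ARGS[@]}"   # remaining model/data arguments (placeholder)
```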
6 changes: 3 additions & 3 deletions megatron/checkpointing.py
@@ -342,7 +342,7 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
set_checkpoint_version(state_dict.get('checkpoint_version', 0))

# Set iteration.
-if args.finetune or release:
+if args.finetune or release or args.reset_progress:
iteration = 0
else:
try:
@@ -361,7 +361,7 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
# Check arguments.
assert args.consumed_train_samples == 0
assert args.consumed_valid_samples == 0
-if 'args' in state_dict:
+if 'args' in state_dict and not args.reset_progress:
checkpoint_args = state_dict['args']
if not args.universal_checkpoint:
check_checkpoint_args(checkpoint_args)
@@ -480,4 +480,4 @@ def _checkpoint_info():
return {
"padded_vocab_size": args.padded_vocab_size,
"original_vocab_size": tokenizer.vocab_size,
-}
+}
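With the checkpointing changes above, `--reset-progress` lets a run reuse trained weights while restarting the schedule from iteration 0 and skipping the saved-args consistency check. A hypothetical fragment; script name and argument array are placeholders:

```bash
# Sketch: warm-start from an existing checkpoint with a fresh
# iteration counter and without restoring the stored arguments.
python pretrain_gpt.py \
    --load "$CHECKPOINT_PATH" \
    --reset-progress \
    "${OTHER_ARGS[@]}"   # remaining arguments (placeholder)
```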