v0.31.0: Better support for sharded state dict with FSDP and Bugfixes
Core
- Set timeout default to PyTorch defaults based on backend by @muellerzr in #2758
- fix duplicate elements in split_between_processes by @hkunzhe in #2781
- Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
- Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
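
To illustrate what "no duplicate elements" means for the split_between_processes fix above, here is a minimal pure-Python sketch of dividing a list across processes so that every item is assigned to exactly one rank. The helper name split_for_process is hypothetical and this is not the library's implementation; it only assumes that items are handed out in contiguous chunks, with earlier ranks absorbing any remainder.

```python
def split_for_process(items, num_processes, process_index):
    """Return the contiguous slice of `items` owned by `process_index`.

    Hypothetical helper for illustration only; not part of accelerate.
    """
    base, extra = divmod(len(items), num_processes)
    # The first `extra` ranks each take one additional item, so the
    # start offset shifts by one for every earlier rank that did so.
    start = process_index * base + min(process_index, extra)
    end = start + base + (1 if process_index < extra else 0)
    return items[start:end]


items = list(range(10))
shards = [split_for_process(items, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Concatenating the shards in rank order gives back the original list, so no element is repeated or dropped. When there are fewer items than processes, the trailing ranks in this sketch simply receive an empty list.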
FSDP
- Introduce shard-merging util for FSDP by @muellerzr in #2772
- Enable sharded state dict + offload to cpu resume by @muellerzr in #2762
- Enable config for fsdp activation checkpointing by @helloworld1 in #2779
Megatron
- Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
What's Changed
- Add feature to allow redirecting std streams into log files when using torchrun as the launcher. by @lyuwen in #2740
- Update modeling.py by adding try-catch section to skip the unavailable devices by @MeVeryHandsome in #2681
- Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity by @statelesshz in #2748
- Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in #2730
- LOMO / FIX: Support multiple optimizers by @younesbelkada in #2745
- Fix max_memory assignment by @SunMarc in #2751
- Fix duplicate environment variable check in multi-cpu condition by @yhna940 in #2752
- Simplify CLI args validation and ensure CLI args take precedence over config file. by @Iain-S in #2757
- Fix sagemaker config by @muellerzr in #2753
- fix cpu omp num threads set by @jiqing-feng in #2755
- Revert "Simplify CLI args validation and ensure CLI args take precedence over config file." by @muellerzr in #2763
- Enable sharded cpu resume by @muellerzr in #2762
- Sets default to PyTorch defaults based on backend by @muellerzr in #2758
- optimize get_module_leaves speed by @BBuf in #2756
- fix minor typo by @TemryL in #2767
- Fix small edge case in get_module_leaves by @SunMarc in #2774
- Skip deepspeed test by @SunMarc in #2776
- Enable config for fsdp activation checkpointing by @helloworld1 in #2779
- Add arg from CLI to fix failing test by @muellerzr in #2783
- Skip tied weights disk offload test by @SunMarc in #2782
- Introduce shard-merging util for FSDP by @muellerzr in #2772
- FIX / FSDP : Guard fsdp utils for earlier PyTorch versions by @younesbelkada in #2794
- Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
- Fixup CLI test by @muellerzr in #2796
- fix duplicate elements in split_between_processes by @hkunzhe in #2781
- Add Elastic Launch Support to notebook_launcher by @yhna940 in #2788
- Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
- Fix type in accelerator.py by @qgallouedec in #2800
- fix comet ml test by @SunMarc in #2804
- New template by @muellerzr in #2808
- Fix access error for torch.mps when using torch==1.13.1 on macOS by @SunMarc in #2806
- 4-bit quantization meta device bias loading bug by @SunMarc in #2805
- State dictionary retrieval from offloaded modules by @blbadger in #2619
- add cuda dep for a test by @SunMarc in #2820
- Remove out-dated xpu device check code in get_balanced_memory by @faaany in #2826
- Fix DeepSpeed config validation error by changing stage3_prefetch_bucket_size value to an integer by @adk9 in #2814
- Improve test speeds by up to 30% in multi-gpu settings by @muellerzr in #2830
- monitor-interval, take 2 by @muellerzr in #2833
- Optimize the megatron plugin by @zhangsheng377 in #2822
- fix fstr format by @Jintao-Huang in #2810
New Contributors
- @lyuwen made their first contribution in #2740
- @MeVeryHandsome made their first contribution in #2681
- @luowyang made their first contribution in #2730
- @Iain-S made their first contribution in #2757
- @BBuf made their first contribution in #2756
- @TemryL made their first contribution in #2767
- @helloworld1 made their first contribution in #2779
- @hkunzhe made their first contribution in #2781
- @adk9 made their first contribution in #2814
- @Jintao-Huang made their first contribution in #2810
Full Changelog: v0.30.1...v0.31.0