forked from instructlab/training
dev commit #5
Open
RobotSail wants to merge 29 commits into log-print-dev from log-print-dev-2
Conversation
Thank you for your contribution! Please make sure to review our contribution guidelines.
This commit adds a new E2E job meant to test integration of training library changes with the CLI's "full" train pipeline to prevent any regressions. It also updates the relevant Mergify configuration. Signed-off-by: Nathan Weinberg <[email protected]>
e2e: replace old small job with new medium job
was being incorrectly labeled as 'small'. Signed-off-by: Nathan Weinberg <[email protected]>
was still using the old AMI from the previous job. Signed-off-by: Nathan Weinberg <[email protected]>
was still using the old instance type. Signed-off-by: Nathan Weinberg <[email protected]>
fix: incorrect label for AWS medium runner
Currently, the training library does not exit when an error is encountered within the training loop (invoked through torchrun). This commit updates that functionality so we correctly return an exit code of 1 on child failure. Additionally, this commit adds the `make fix` command, which automatically fixes all trivial issues picked up by ruff. Signed-off-by: Oleg S <[email protected]>
chore: add exit code & tox fix
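The exit-code behavior this commit describes can be sketched as follows. This is a hypothetical helper, not the library's actual code: the parent process inspects the child's return code (for example, from a torchrun launch) and maps any failure to exit code 1 instead of silently continuing.

```python
import subprocess
import sys


def run_distributed_training(cmd: list[str]) -> int:
    """Hypothetical wrapper: launch the training child process and map
    any non-zero child exit to a parent exit code of 1."""
    result = subprocess.run(cmd)
    return 0 if result.returncode == 0 else 1


if __name__ == "__main__":
    # The parent now fails loudly whenever the child fails.
    sys.exit(run_distributed_training(sys.argv[1:]))
```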
Signed-off-by: Nathan Weinberg <[email protected]>
ci: grant HF_TOKEN access to the medium-size E2E CI job
This commit adds a new workflow to the Training repo. It will run a nightly cron job to test the current 'main' branch of Training against the current 'main' branch of the CLI (instructlab). Signed-off-by: Nathan Weinberg <[email protected]>
Signed-off-by: Nathan Weinberg <[email protected]>
RobotSail force-pushed the log-print-dev-2 branch 2 times, most recently from 0b599b4 to accaaae on October 23, 2024 at 13:21
ci: add large-size E2E CI job
Signed-off-by: Nathan Weinberg <[email protected]>
fix: add working directory config to steps in large E2E CI job
Signed-off-by: Nathan Weinberg <[email protected]>
fix: add remaining missing working directory configs
During development it's convenient to be able to run full distributed training, even on a smaller dataset, just to make sure that nothing obviously fails. This also exercises support for flash attention on the machine it runs on, as well as for Granite models. Signed-off-by: James Kunstle <[email protected]>
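A development smoke run like the one described might take the shape below. Treat this as an invocation sketch only: the script name, dataset path, and `--data-path` flag are placeholders, not the repo's actual CLI; only torchrun's own flags are real.

```shell
# Single-node, 2-GPU smoke run over a small dataset: enough to surface
# obvious failures in distributed startup, flash attention, and Granite
# support without running a full-length training job.
torchrun --nnodes=1 --nproc_per_node=2 main_ds.py \
  --data-path ./sample-data/small.jsonl
```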
Signed-off-by: Nathan Weinberg <[email protected]>
ci: use org variable for AWS EC2 AMI in E2E CI jobs
also adds '-v' to 'pip install' so we can see environment variable info for debugging issues related to installation. Signed-off-by: Nathan Weinberg <[email protected]>
ci: convert med E2E CI job to L4 GPU
RobotSail force-pushed the log-print-dev-2 branch 5 times, most recently from 1d0676f to d18659d on October 25, 2024 at 15:51
RobotSail force-pushed the log-print-dev-2 branch 3 times, most recently from 8038986 to 38c5c1c on October 25, 2024 at 17:00
Updating the data collator for models with HF padding-free support, adding support for the upcoming Granite HF model class, and updating flags/interface accordingly.

* only compute lengths in the token dataset when they are not already present in the dataset. Signed-off-by: aldo pareja-cardona <[email protected]>
* Refactor padding function to support position_ids for FlashAttention:
  - Added `supports_flash_attention` function to check GPU compatibility for FlashAttention.
  - Updated `make_collate_fn` to return `position_ids` instead of `attention_mask` when FlashAttention is supported.
  - Integrated the new padding logic into `setup_dataloader` to ensure compatibility with both Granite and non-Granite configurations.
  - Ensured backward compatibility by maintaining the original padding logic for GPUs that do not support FlashAttention.
  - Updated `main_ds.py` to use the new `supports_flash_attention` check for determining padding strategy.
  Signed-off-by: aldo pareja-cardona <[email protected]>
* logging the global gradnorm now. Signed-off-by: aldo pareja-cardona <[email protected]>
* fixing deepspeed because it's not working with the scheduler we want. Signed-off-by: aldo pareja-cardona <[email protected]>
* fixing accelerate lr_scheduler. Signed-off-by: aldo pareja-cardona <[email protected]>
* fixing accelerate lr_scheduler. Signed-off-by: aldo pareja-cardona <[email protected]>
* samples seen was broken because now the samples are a single line. Signed-off-by: aldo pareja-cardona <[email protected]>
* find packing is wrong because when flash attention is supported, padding should not be used when building the buckets. Signed-off-by: aldo pareja-cardona <[email protected]>
* black formatting. Signed-off-by: aldo pareja-cardona <[email protected]>
* it should not fail on granite 8b models anymore. Signed-off-by: aldo pareja-cardona <[email protected]>
* linting. Signed-off-by: aldo pareja-cardona <[email protected]>
* linting. Signed-off-by: aldo pareja-cardona <[email protected]>
* bug on padding when creating the multipack sampler. Signed-off-by: aldo pareja-cardona <[email protected]>
* linter. Signed-off-by: aldo pareja-cardona <[email protected]>
* linter. Signed-off-by: aldo pareja-cardona <[email protected]>
* Change old padding-free and granite flags to use_dolomite. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Add safeguards and checks for flash attention when enabled/disabled. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Rework flash attention checks for better modularity. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Fix arg name. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Update transformers to a version with Granite model class. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Adding stateguards for dolomite and granite and model path check. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Missing update. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Clean up early validation checks and move to utils. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Fix spelling mistake. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Include AMD in flash attn check. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Re-add is_padding_free with deprecation warning. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Make use_dolomite default false. Signed-off-by: Mustafa Eyceoz <[email protected]>
* this is needed because the tag <MASK> is too common and some datasets will fail. Signed-off-by: Mustafa Eyceoz <[email protected]>
* added a warning in case the special tokens used for data processing are present in the dataset. Signed-off-by: Mustafa Eyceoz <[email protected]>
* added a warning in case the special tokens used for data processing are present in the dataset. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Update valid data filter. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Fix ruff formatting. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Apply review feedback. Signed-off-by: Mustafa Eyceoz <[email protected]>
* Added comments. Signed-off-by: Mustafa Eyceoz <[email protected]>

Signed-off-by: aldo pareja-cardona <[email protected]>
Signed-off-by: Mustafa Eyceoz <[email protected]>
Co-authored-by: aldo pareja-cardona <[email protected]>
Co-authored-by: Mustafa Eyceoz <[email protected]>
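The padding-free collation idea above can be sketched as follows. This is a simplified, list-based illustration (the library's actual `make_collate_fn` operates on tensors and more fields), assuming the key mechanism: position_ids that restart at 0 for each sequence let a flash-attention kernel recover sequence boundaries from a single packed row, so no padded attention_mask is needed.

```python
def collate_padding_free(batch):
    """Illustrative sketch, not the library's actual collator:
    concatenate all sequences into one packed row and emit position_ids
    that restart at 0 at each sequence start, marking the boundaries."""
    # Flatten every sequence into one packed row of token ids.
    input_ids = [tok for seq in batch for tok in seq]
    # Positions run 0..len(seq)-1 for each sequence, back to back;
    # each reset to 0 marks a new sequence boundary.
    position_ids = [pos for seq in batch for pos in range(len(seq))]
    return {"input_ids": input_ids, "position_ids": position_ids}
```

For example, two sequences of lengths 3 and 2 collate into one row of 5 tokens whose position_ids drop back to 0 where the second sequence begins.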
RobotSail force-pushed the log-print-dev-2 branch 9 times, most recently from e048e29 to ad43827 on October 25, 2024 at 18:57
adds basic smoketests for main_ds and data_process CLI args
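A smoketest of that kind can be sketched as below; `smoke_check` and the reliance on `--help` are assumptions for illustration, not the repo's actual test code. The idea is simply that each CLI entry point should import cleanly and register its argument surface.

```python
import subprocess
import sys


def smoke_check(module: str) -> bool:
    """Hypothetical smoketest helper: a CLI module passes if
    `python -m <module> --help` exits 0, which proves the module
    imports and its argument parser at least builds."""
    result = subprocess.run(
        [sys.executable, "-m", module, "--help"],
        capture_output=True,
    )
    return result.returncode == 0
```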
RobotSail force-pushed the log-print-dev-2 branch 9 times, most recently from b585a1b to 8c3e237 on October 25, 2024 at 21:42
Signed-off-by: Oleg S <[email protected]>
Signed-off-by: Oleg S <[email protected]>
RobotSail force-pushed the log-print-dev-2 branch from 8c3e237 to 24a5658 on October 25, 2024 at 21:43
This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
No description provided.