
dev commit #5

Open · wants to merge 29 commits into log-print-dev

Conversation

RobotSail (Owner)

No description provided.


Thank you for your contribution! Please make sure to review our contribution guidelines.

nathan-weinberg and others added 12 commits October 16, 2024 15:41
this commit adds a new E2E job meant to test integration
of training library changes with the CLI's "full" train
pipeline to prevent any regressions

it also updates the relevant mergify configuration

Signed-off-by: Nathan Weinberg <[email protected]>
e2e: replace old small job with new medium job
was being incorrectly labeled as 'small'

Signed-off-by: Nathan Weinberg <[email protected]>
was still using the old AMI from the previous job

Signed-off-by: Nathan Weinberg <[email protected]>
was still using the old instance type

Signed-off-by: Nathan Weinberg <[email protected]>
fix: incorrect label for AWS medium runner
Currently, the training library does not exit when an error is encountered
within the training loop (invoked through torchrun). This commit updates
that functionality so we correctly return an exit code of 1 on child failure.

Additionally, this commit adds the `make fix` command, which
automatically fixes all trivial issues picked up by ruff

Signed-off-by: Oleg S <[email protected]>
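
For illustration, a minimal sketch of the exit-code behavior described above, assuming the training loop is launched as a torchrun child process; the helper name and command-line arguments are hypothetical, not the library's actual API:

```python
import subprocess
import sys


def run_training_subprocess(torchrun_args: list[str]) -> None:
    """Launch the torchrun child process and propagate its failure.

    If any worker exits non-zero, torchrun itself exits non-zero, and we
    surface that as an exit code of 1 instead of silently continuing.
    """
    command = ["torchrun", *torchrun_args, "main_ds.py"]
    result = subprocess.run(command, check=False)
    if result.returncode != 0:
        print(f"training subprocess failed with code {result.returncode}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    # e.g. a single-node, two-GPU run; arguments are purely illustrative
    run_training_subprocess(["--nnodes=1", "--nproc-per-node=2"])
```
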
ci: grant HF_TOKEN access to the medium-size E2E CI job
this commit adds a new workflow to the Training repo
it will run a nightly cron job to test the current
'main' branch of Training against the current
'main' branch of the CLI (instructlab)

Signed-off-by: Nathan Weinberg <[email protected]>
Signed-off-by: Nathan Weinberg <[email protected]>
@RobotSail force-pushed the log-print-dev-2 branch 2 times, most recently from 0b599b4 to accaaae on October 23, 2024 at 13:21
mergify bot and others added 10 commits October 23, 2024 17:46
fix: add working directory config to steps in large E2E CI job
fix: add remaining missing working directory configs
During development it's convenient to be able to run full distributed
training, even on a smaller dataset, just to make sure that nothing
obviously fails. This will also capture support for flash attention on
the machine that it's run on, and for granite models.

Signed-off-by: James Kunstle <[email protected]>
ci: use org variable for AWS EC2 AMI in E2E CI jobs
also adds '-v' to 'pip install' so we can see
environment variable info for debugging issues
related to installation

Signed-off-by: Nathan Weinberg <[email protected]>
ci: convert med E2E CI job to L4 GPU
@RobotSail force-pushed the log-print-dev-2 branch 5 times, most recently from 1d0676f to d18659d on October 25, 2024 at 15:51
@RobotSail force-pushed the log-print-dev-2 branch 3 times, most recently from 8038986 to 38c5c1c on October 25, 2024 at 17:00
Updating the data collator for models with HF padding-free support, adding support for the upcoming Granite HF model class, and updating flags/interface accordingly.

------------------------------------------------

* only compute lengths in the token dataset when they are not already present in the dataset

Signed-off-by: aldo pareja-cardona <[email protected]>
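
As a rough sketch of that change, assuming a Hugging Face `datasets.Dataset` with an `input_ids` column and an optional precomputed length column (the column name `len` is an assumption):

```python
from datasets import Dataset


def ensure_lengths(dataset: Dataset, length_column: str = "len") -> Dataset:
    """Add a per-sample token length column only if it is not already there."""
    if length_column in dataset.column_names:
        return dataset  # lengths were precomputed upstream, reuse them
    return dataset.map(lambda sample: {length_column: len(sample["input_ids"])})


if __name__ == "__main__":
    ds = Dataset.from_dict({"input_ids": [[1, 2, 3], [4, 5]]})
    print(ensure_lengths(ds)["len"])  # [3, 2]
```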

* Refactor padding function to support position_ids for FlashAttention

- Added `supports_flash_attention` function to check GPU compatibility for FlashAttention.
- Updated `make_collate_fn` to return `position_ids` instead of `attention_mask` when FlashAttention is supported.
- Integrated the new padding logic into `setup_dataloader` to ensure compatibility with both Granite and non-Granite configurations.
- Ensured backward compatibility by maintaining the original padding logic for GPUs that do not support FlashAttention.
- Updated `main_ds.py` to use the new `supports_flash_attention` check for determining padding strategy.

Signed-off-by: aldo pareja-cardona <[email protected]>
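
A condensed, illustrative sketch of the two pieces described above; the library's actual implementations differ, and the compute-capability threshold and collator details here are assumptions (flash attention generally requires an Ampere-or-newer NVIDIA GPU):

```python
import torch


def supports_flash_attention(device_id: int = 0) -> bool:
    """Rough capability check: CUDA device with compute capability >= 8.0 (Ampere+).

    The library also considers AMD GPUs (see a later commit); that branch is
    omitted here for brevity.
    """
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_id)
    return major >= 8


def collate_flash(batch: list[dict]) -> dict:
    """Padding-free collation: concatenate samples and emit position_ids
    (restarting at 0 for each sample) instead of an attention_mask."""
    input_ids = [tok for sample in batch for tok in sample["input_ids"]]
    labels = [tok for sample in batch for tok in sample["labels"]]
    position_ids = [i for sample in batch for i in range(len(sample["input_ids"]))]
    return {
        "input_ids": torch.tensor([input_ids]),
        "labels": torch.tensor([labels]),
        "position_ids": torch.tensor([position_ids]),
    }


def collate_padded(batch: list[dict], pad_token_id: int) -> dict:
    """Fallback for GPUs without flash attention: right-pad and build an attention_mask."""
    max_len = max(len(s["input_ids"]) for s in batch)
    input_ids, labels, attention_mask = [], [], []
    for s in batch:
        pad = max_len - len(s["input_ids"])
        input_ids.append(s["input_ids"] + [pad_token_id] * pad)
        labels.append(s["labels"] + [-100] * pad)
        attention_mask.append([1] * len(s["input_ids"]) + [0] * pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_mask),
    }
```

In a setup along these lines, `setup_dataloader` would pick between the two collators based on the `supports_flash_attention` result, keeping the padded path as the backward-compatible fallback.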

* logging the global gradnorm now

Signed-off-by: aldo pareja-cardona <[email protected]>
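
For context, one common way to obtain and log the global gradient norm is to reuse the value returned by PyTorch's gradient-clipping utility; a generic, self-contained sketch, not the library's exact code:

```python
import torch

model = torch.nn.Linear(8, 2)          # stand-in for the real model
loss = model(torch.randn(4, 8)).sum()  # stand-in for the real loss
loss.backward()

# clip_grad_norm_ returns the total (global) norm across all parameters,
# which can then be logged alongside the loss each step
global_grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"global grad norm: {global_grad_norm.item():.4f}")
```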

* fixing deepspeed because it's not working with the scheduler we want

Signed-off-by: aldo pareja-cardona <[email protected]>

* fixing accelerate lr_scheduler

Signed-off-by: aldo pareja-cardona <[email protected]>

* fixing accelerate lr_scheduler

Signed-off-by: aldo pareja-cardona <[email protected]>

* the samples-seen count was broken because the samples are now a single line

Signed-off-by: aldo pareja-cardona <[email protected]>

* the find-packing logic was wrong because, when flash attention is supported, padding should not be used when building the buckets

Signed-off-by: aldo pareja-cardona <[email protected]>
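
A toy illustration of why that matters, under the assumption that bucket capacity is computed from per-sample lengths; the real multipack sampler logic is more involved:

```python
def effective_length(sample_len: int, use_padding_free: bool, pad_to_multiple: int = 8) -> int:
    """Length a sample contributes to a packing bucket.

    With padding-free (flash attention) collation, samples are concatenated
    as-is, so the raw token count is what matters; otherwise the padded
    length should be used so buckets do not overflow after padding.
    The pad_to_multiple value here is purely illustrative.
    """
    if use_padding_free:
        return sample_len
    return ((sample_len + pad_to_multiple - 1) // pad_to_multiple) * pad_to_multiple


print(effective_length(13, use_padding_free=True))   # 13
print(effective_length(13, use_padding_free=False))  # 16
```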

* black formatting

Signed-off-by: aldo pareja-cardona <[email protected]>

* it should not fail on granite 8b models anymore

Signed-off-by: aldo pareja-cardona <[email protected]>

* linting

Signed-off-by: aldo pareja-cardona <[email protected]>

* linting

Signed-off-by: aldo pareja-cardona <[email protected]>

* bug on padding when creating the multipack sampler

Signed-off-by: aldo pareja-cardona <[email protected]>

* linter

Signed-off-by: aldo pareja-cardona <[email protected]>

* linter

Signed-off-by: aldo pareja-cardona <[email protected]>

* Change old padding-free and granite flags to use_dolomite

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Add safeguards and checks for flash attention when enabled/disabled

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Rework flash attention checks for better modularity

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Fix arg name

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Update transformers to a version with Granite model class

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Adding stateguards for dolomite and granite and model path check

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Missing update

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Clean up early validation checks and move to utils

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Fix spelling mistake

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Include AMD in flash attn check

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Re-add is_padding_free with deprecation warning

Signed-off-by: Mustafa Eyceoz <[email protected]>
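
A sketch of what such a deprecation shim might look like; the flag names follow the ones mentioned in these commits, but the function and warning text are illustrative:

```python
import warnings


def resolve_dolomite_flag(use_dolomite: bool = False, is_padding_free: bool | None = None) -> bool:
    """Map the deprecated is_padding_free flag onto use_dolomite with a warning."""
    if is_padding_free is not None:
        warnings.warn(
            "is_padding_free is deprecated; use use_dolomite instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return is_padding_free
    return use_dolomite


print(resolve_dolomite_flag(is_padding_free=True))  # warns, then returns True
```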

* Make use_dolomite default false

Signed-off-by: Mustafa Eyceoz <[email protected]>

* this is needed because the tag <MASK> is too common and some datasets will fail

Signed-off-by: Mustafa Eyceoz <[email protected]>

* added a warning in case the special tokens used for data processing are present in the dataset

Signed-off-by: Mustafa Eyceoz <[email protected]>

* added a warning in case the special tokens used for data processing are present in the dataset

Signed-off-by: Mustafa Eyceoz <[email protected]>
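
A minimal sketch of that kind of check; the reserved token list and column name are assumptions, not the library's actual values:

```python
import logging

logger = logging.getLogger(__name__)

# tokens reserved by the data-processing step; the exact set is an assumption
RESERVED_TOKENS = ["<|user|>", "<|assistant|>", "<MASK>"]


def warn_on_reserved_tokens(samples: list[dict], text_column: str = "messages") -> None:
    """Warn if a reserved special token already appears in the raw data,
    since it would collide with the masking/processing logic."""
    for idx, sample in enumerate(samples):
        text = str(sample.get(text_column, ""))
        for token in RESERVED_TOKENS:
            if token in text:
                logger.warning(
                    "sample %d contains reserved token %r; processing may be incorrect",
                    idx,
                    token,
                )
```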

* Update valid data filter

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Fix ruff formatting

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Apply review feedback

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Added comments

Signed-off-by: Mustafa Eyceoz <[email protected]>

---------

Signed-off-by: aldo pareja-cardona <[email protected]>
Signed-off-by: Mustafa Eyceoz <[email protected]>
Co-authored-by: aldo pareja-cardona <[email protected]>
Co-authored-by: Mustafa Eyceoz <[email protected]>
@RobotSail force-pushed the log-print-dev-2 branch 9 times, most recently from e048e29 to ad43827 on October 25, 2024 at 18:57
adds basic smoketests for main_ds and data_process CLI args
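
A hedged sketch of what such a smoketest could look like; the module paths and the use of `--help` are assumptions about the CLI surface, not necessarily what this PR adds:

```python
# test_cli_smoke.py -- a rough pytest sketch, not the PR's actual tests
import subprocess
import sys

import pytest


@pytest.mark.parametrize(
    "module",
    ["instructlab.training.main_ds", "instructlab.training.data_process"],
)
def test_module_help_exits_cleanly(module: str) -> None:
    """--help should parse args and exit 0 without touching GPUs or data."""
    result = subprocess.run(
        [sys.executable, "-m", module, "--help"],
        capture_output=True,
        text=True,
        check=False,
    )
    assert result.returncode == 0, result.stderr
```
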
@RobotSail force-pushed the log-print-dev-2 branch 9 times, most recently from b585a1b to 8c3e237 on October 25, 2024 at 21:42

This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
