
dev commit #5

Open · wants to merge 29 commits into log-print-dev

Conversation

RobotSail (Owner)

No description provided.


Thank you for your contribution! Please make sure to review our contribution guidelines.

nathan-weinberg and others added 12 commits October 16, 2024 15:41
this commit adds a new E2E job meant to test integration
of training library changes with the CLI's "full" train
pipeline to prevent any regressions

it also updates the relevant mergify configuration

Signed-off-by: Nathan Weinberg <[email protected]>
e2e: replace old small job with new medium job
was being incorrectly labeled as 'small'

Signed-off-by: Nathan Weinberg <[email protected]>
was still using the old AMI from the previous job

Signed-off-by: Nathan Weinberg <[email protected]>
was still using the old instance type

Signed-off-by: Nathan Weinberg <[email protected]>
fix: incorrect label for AWS medium runner
Currently, the training library does not exit when an error is encountered
within the training loop (invoked through torchrun). This commit updates
that functionality so we correctly return an exit code of 1 on child failure.

Additionally, this commit adds the `make fix` command, which
automatically fixes all trivial issues picked up by ruff

Signed-off-by: Oleg S <[email protected]>
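
For illustration, a minimal sketch of the exit-code behavior described above, assuming the training loop is launched as a torchrun child process; the helper name and command-line arguments are hypothetical, not the library's actual API:

```python
import subprocess
import sys


def run_training_subprocess(torchrun_args: list[str]) -> None:
    """Launch the torchrun child process and propagate its failure.

    If any worker exits non-zero, torchrun itself exits non-zero, and we
    surface that as an exit code of 1 instead of silently continuing.
    """
    command = ["torchrun", *torchrun_args, "main_ds.py"]
    result = subprocess.run(command, check=False)
    if result.returncode != 0:
        print(f"training subprocess failed with code {result.returncode}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    # e.g. a single-node, two-GPU run; arguments are purely illustrative
    run_training_subprocess(["--nnodes=1", "--nproc-per-node=2"])
```
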
ci: grant HF_TOKEN access to the medium-size E2E CI job
this commit adds a new workflow to the Training repo
it will run a nightly cron job to test the current
'main' branch of Training against the current
'main' branch of the CLI (instructlab)

Signed-off-by: Nathan Weinberg <[email protected]>
Signed-off-by: Nathan Weinberg <[email protected]>
@RobotSail force-pushed the log-print-dev-2 branch 2 times, most recently from 0b599b4 to accaaae on October 23, 2024 at 13:21
mergify bot and others added 10 commits October 23, 2024 17:46
fix: add working directory config to steps in large E2E CI job
fix: add remaining missing working directory configs
During development it's convenient to be able to run full distributed
training, even on a smaller dataset, just to make sure that nothing
obviously fails. This will also capture support for flash attention on
the machine that it's run on, and for granite models.

Signed-off-by: James Kunstle <[email protected]>
ci: use org variable for AWS EC2 AMI in E2E CI jobs
also adds '-v' to 'pip install' so we can see
environment variable info for debugging issues
related to installation

Signed-off-by: Nathan Weinberg <[email protected]>
ci: convert med E2E CI job to L4 GPU
@RobotSail force-pushed the log-print-dev-2 branch 5 times, most recently from 1d0676f to d18659d on October 25, 2024 at 15:51
@RobotSail force-pushed the log-print-dev-2 branch 3 times, most recently from 8038986 to 38c5c1c on October 25, 2024 at 17:00
Updating the data collator for models with HF padding-free support, adding support for the upcoming Granite HF model class, and updating flags/interface accordingly.

------------------------------------------------

* only compute lengths in the token dataset when they are not already present in the dataset

Signed-off-by: aldo pareja-cardona <[email protected]>
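
As a rough sketch of that change, assuming a Hugging Face `datasets.Dataset` with an `input_ids` column and an optional precomputed length column (the column name `len` is an assumption):

```python
from datasets import Dataset


def ensure_lengths(dataset: Dataset, length_column: str = "len") -> Dataset:
    """Add a per-sample token length column only if it is not already there."""
    if length_column in dataset.column_names:
        return dataset  # lengths were precomputed upstream, reuse them
    return dataset.map(lambda sample: {length_column: len(sample["input_ids"])})


if __name__ == "__main__":
    ds = Dataset.from_dict({"input_ids": [[1, 2, 3], [4, 5]]})
    print(ensure_lengths(ds)["len"])  # [3, 2]
```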

* Refactor padding function to support position_ids for FlashAttention

- Added `supports_flash_attention` function to check GPU compatibility for FlashAttention.
- Updated `make_collate_fn` to return `position_ids` instead of `attention_mask` when FlashAttention is supported.
- Integrated the new padding logic into `setup_dataloader` to ensure compatibility with both Granite and non-Granite configurations.
- Ensured backward compatibility by maintaining the original padding logic for GPUs that do not support FlashAttention.
- Updated `main_ds.py` to use the new `supports_flash_attention` check for determining padding strategy.

Signed-off-by: aldo pareja-cardona <[email protected]>
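
A condensed, illustrative sketch of the two pieces described above; the library's actual implementations differ, and the compute-capability threshold and collator details here are assumptions (flash attention generally requires an Ampere-or-newer NVIDIA GPU):

```python
import torch


def supports_flash_attention(device_id: int = 0) -> bool:
    """Rough capability check: CUDA device with compute capability >= 8.0 (Ampere+).

    The library also considers AMD GPUs (see a later commit); that branch is
    omitted here for brevity.
    """
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_id)
    return major >= 8


def collate_flash(batch: list[dict]) -> dict:
    """Padding-free collation: concatenate samples and emit position_ids
    (restarting at 0 for each sample) instead of an attention_mask."""
    input_ids = [tok for sample in batch for tok in sample["input_ids"]]
    labels = [tok for sample in batch for tok in sample["labels"]]
    position_ids = [i for sample in batch for i in range(len(sample["input_ids"]))]
    return {
        "input_ids": torch.tensor([input_ids]),
        "labels": torch.tensor([labels]),
        "position_ids": torch.tensor([position_ids]),
    }


def collate_padded(batch: list[dict], pad_token_id: int) -> dict:
    """Fallback for GPUs without flash attention: right-pad and build an attention_mask."""
    max_len = max(len(s["input_ids"]) for s in batch)
    input_ids, labels, attention_mask = [], [], []
    for s in batch:
        pad = max_len - len(s["input_ids"])
        input_ids.append(s["input_ids"] + [pad_token_id] * pad)
        labels.append(s["labels"] + [-100] * pad)
        attention_mask.append([1] * len(s["input_ids"]) + [0] * pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_mask),
    }
```

In a setup along these lines, `setup_dataloader` would pick between the two collators based on the `supports_flash_attention` result, keeping the padded path as the backward-compatible fallback.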

* logging the global gradnorm now

Signed-off-by: aldo pareja-cardona <[email protected]>
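
For context, one common way to obtain and log the global gradient norm is to reuse the value returned by PyTorch's gradient-clipping utility; a generic, self-contained sketch, not the library's exact code:

```python
import torch

model = torch.nn.Linear(8, 2)          # stand-in for the real model
loss = model(torch.randn(4, 8)).sum()  # stand-in for the real loss
loss.backward()

# clip_grad_norm_ returns the total (global) norm across all parameters,
# which can then be logged alongside the loss each step
global_grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"global grad norm: {global_grad_norm.item():.4f}")
```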

* fixing deepspeed because it's not working with the scheduler we want

Signed-off-by: aldo pareja-cardona <[email protected]>

* fixing accelerate lr_scheduler

Signed-off-by: aldo pareja-cardona <[email protected]>

* fixing accelerate lr_scheduler

Signed-off-by: aldo pareja-cardona <[email protected]>

* the samples-seen count was broken because the samples are now a single line

Signed-off-by: aldo pareja-cardona <[email protected]>

* the find-packing logic was wrong because, when flash attention is supported, padding should not be used when building the buckets

Signed-off-by: aldo pareja-cardona <[email protected]>
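
A toy illustration of why that matters, under the assumption that bucket capacity is computed from per-sample lengths; the real multipack sampler logic is more involved:

```python
def effective_length(sample_len: int, use_padding_free: bool, pad_to_multiple: int = 8) -> int:
    """Length a sample contributes to a packing bucket.

    With padding-free (flash attention) collation, samples are concatenated
    as-is, so the raw token count is what matters; otherwise the padded
    length should be used so buckets do not overflow after padding.
    The pad_to_multiple value here is purely illustrative.
    """
    if use_padding_free:
        return sample_len
    return ((sample_len + pad_to_multiple - 1) // pad_to_multiple) * pad_to_multiple


print(effective_length(13, use_padding_free=True))   # 13
print(effective_length(13, use_padding_free=False))  # 16
```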

* black formatting

Signed-off-by: aldo pareja-cardona <[email protected]>

* it should not fail on granite 8b models anymore

Signed-off-by: aldo pareja-cardona <[email protected]>

* linting

Signed-off-by: aldo pareja-cardona <[email protected]>

* linting

Signed-off-by: aldo pareja-cardona <[email protected]>

* bug on padding when creating the multipack sampler

Signed-off-by: aldo pareja-cardona <[email protected]>

* linter

Signed-off-by: aldo pareja-cardona <[email protected]>

* linter

Signed-off-by: aldo pareja-cardona <[email protected]>

* Change old padding-free and granite flags to use_dolomite

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Add safeguards and checks for flash attention when enabled/disabled

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Rework flash attention checks for better modularity

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Fix arg name

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Update transformers to a version with Granite model class

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Adding stateguards for dolomite and granite and model path check

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Missing update

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Clean up early validation checks and move to utils

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Fix spelling mistake

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Include AMD in flash attn check

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Re-add is_padding_free with deprecation warning

Signed-off-by: Mustafa Eyceoz <[email protected]>
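
A sketch of what such a deprecation shim might look like; the flag names follow the ones mentioned in these commits, but the function and warning text are illustrative:

```python
import warnings


def resolve_dolomite_flag(use_dolomite: bool = False, is_padding_free: bool | None = None) -> bool:
    """Map the deprecated is_padding_free flag onto use_dolomite with a warning."""
    if is_padding_free is not None:
        warnings.warn(
            "is_padding_free is deprecated; use use_dolomite instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return is_padding_free
    return use_dolomite


print(resolve_dolomite_flag(is_padding_free=True))  # warns, then returns True
```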

* Make use_dolomite default false

Signed-off-by: Mustafa Eyceoz <[email protected]>

* this is needed because the tag <MASK> is too common and some datasets will fail

Signed-off-by: Mustafa Eyceoz <[email protected]>

* added a warning in case the special tokens used for data processing are present in the dataset

Signed-off-by: Mustafa Eyceoz <[email protected]>

* added a warning in case the special tokens used for data processing are present in the dataset

Signed-off-by: Mustafa Eyceoz <[email protected]>
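
A minimal sketch of that kind of check; the reserved token list and column name are assumptions, not the library's actual values:

```python
import logging

logger = logging.getLogger(__name__)

# tokens reserved by the data-processing step; the exact set is an assumption
RESERVED_TOKENS = ["<|user|>", "<|assistant|>", "<MASK>"]


def warn_on_reserved_tokens(samples: list[dict], text_column: str = "messages") -> None:
    """Warn if a reserved special token already appears in the raw data,
    since it would collide with the masking/processing logic."""
    for idx, sample in enumerate(samples):
        text = str(sample.get(text_column, ""))
        for token in RESERVED_TOKENS:
            if token in text:
                logger.warning(
                    "sample %d contains reserved token %r; processing may be incorrect",
                    idx,
                    token,
                )
```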

* Update valid data filter

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Fix ruff formatting

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Apply review feedback

Signed-off-by: Mustafa Eyceoz <[email protected]>

* Added comments

Signed-off-by: Mustafa Eyceoz <[email protected]>

---------

Signed-off-by: aldo pareja-cardona <[email protected]>
Signed-off-by: Mustafa Eyceoz <[email protected]>
Co-authored-by: aldo pareja-cardona <[email protected]>
Co-authored-by: Mustafa Eyceoz <[email protected]>
@RobotSail force-pushed the log-print-dev-2 branch 9 times, most recently from e048e29 to ad43827 on October 25, 2024 at 18:57
adds basic smoketests for main_ds and data_process CLI args
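
A hedged sketch of what such a smoketest could look like; the module paths and the use of `--help` are assumptions about the CLI surface, not necessarily what this PR adds:

```python
# test_cli_smoke.py -- a rough pytest sketch, not the PR's actual tests
import subprocess
import sys

import pytest


@pytest.mark.parametrize(
    "module",
    ["instructlab.training.main_ds", "instructlab.training.data_process"],
)
def test_module_help_exits_cleanly(module: str) -> None:
    """--help should parse args and exit 0 without touching GPUs or data."""
    result = subprocess.run(
        [sys.executable, "-m", module, "--help"],
        capture_output=True,
        text=True,
        check=False,
    )
    assert result.returncode == 0, result.stderr
```
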
@RobotSail force-pushed the log-print-dev-2 branch 9 times, most recently from b585a1b to 8c3e237 on October 25, 2024 at 21:42

This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
