
feat(test/neuron-training): Add New E2E Test Harness for Distributed BERT Training on Neuron (Trainium) #558

Merged
merged 8 commits into from
Jan 13, 2025

Conversation

mattcjo
Contributor

@mattcjo mattcjo commented Jan 10, 2025

Issue #, if available:

Description of changes:
This pull request introduces a brand-new end-to-end (E2E) test that validates distributed BERT pretraining on AWS Trainium (Neuron) nodes using the MPI operator. Previously, there was no “neuron-training” test in the repository. Key aspects of the newly added test:

  1. Templated MPIJob Manifest

    • Dynamically renders a neuron-bert-training.yaml manifest with placeholders for node type, image, resource requests (NeuronCore, EFA), and replica counts.
  2. Two-Step “Job Wait” Logic

    • After applying the MPIJob, the harness first waits for the MPI operator to create the “launcher” Kubernetes Job.
    • Once the Job resource exists, the test then waits for it to reach a Succeeded state, ensuring the distributed BERT training completes successfully.
  3. Throughput and Epoch-Time Metrics

    • Upon success, the test gathers logs from the launcher pods.
    • The logs are parsed for rank-level throughput and epoch timing, which get aggregated and printed as final metrics.
  4. Fully Automated E2E Workflow

    • The harness covers everything from applying the device plugin manifests and verifying the MPI operator is healthy, through applying the training manifest, to validating that distributed BERT training completed on Neuron-based instances.

This new test provides confidence that multi-node Neuron clusters can successfully run an MPI-based BERT pretraining job end to end, including resource scheduling, environment setup, and performance metrics gathering.

Testing

go test -timeout 60m -v . -run TestNeuronTraining \
  -bertTrainingImage={{ACCOUNT_ID}}.dkr.ecr.us-east-1.amazonaws.com/aws-k8s-tester/neuron-training:latest \
  -efaEnabled=true \
  -nodeType=trn1.32xlarge
2025/01/10 05:06:39 Starting tests...
2025/01/10 05:06:39 Applying Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:06:42 Successfully applied Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:06:42 Waiting for MPI Operator deployment to be available.
2025/01/10 05:06:47 MPI Operator deployment is available.
2025/01/10 05:06:47 Waiting for Neuron Device Plugin daemonset to be ready.
2025/01/10 05:06:52 Neuron Device Plugin daemonset is ready.
2025/01/10 05:06:52 Waiting for EFA Device Plugin daemonset to be ready.
2025/01/10 05:06:57 EFA Device Plugin daemonset is ready.
2025/01/10 05:06:57 [INFO] Processing node ip-192-168-145-196.ec2.internal
2025/01/10 05:06:57 [INFO] Processing node ip-192-168-163-18.ec2.internal
2025/01/10 05:06:57 [INFO] Total Nodes: 2
2025/01/10 05:06:57 [INFO] Total Neuron Count: 32, Neuron Per Node: 16
2025/01/10 05:06:57 [INFO] Total Neuron Core Count: 64, Neuron Core Per Node: 32
2025/01/10 05:06:57 [INFO] Total EFA Count: 16, EFA Per Node: 8
=== RUN   TestNeuronTraining
=== RUN   TestNeuronTraining/neuron-training
2025/01/10 05:06:57 Applying rendered Neuron training manifest.
2025/01/10 05:06:57 Successfully applied Neuron training manifest.
=== RUN   TestNeuronTraining/neuron-training/Neuron_training_Job_succeeds
2025/01/10 05:06:57 Waiting for the 'neuron-training-launcher' Job resource to be created...
2025/01/10 05:07:02 Job 'neuron-training-launcher' is created in the cluster.
2025/01/10 05:07:02 Waiting for 'neuron-training-launcher' Job to succeed...
2025/01/10 05:24:08 Job 'neuron-training-launcher' succeeded!
2025/01/10 05:24:08 == Raw Logs from the launcher pods ==
2025/01/10 05:24:08 Launcher: whoami => ubuntu
Launcher: Starting SSH...
 * Starting OpenBSD Secure Shell server sshd
   ...done.
Launcher: Running mpirun for BERT training...
[sudo] password for ubuntu: Warning: Permanently added 'neuron-training-worker-0.neuron-training.default.svc,192.168.136.5' (ECDSA) to the list of known hosts.
Warning: Permanently added 'neuron-training-worker-1.neuron-training.default.svc,192.168.137.41' (ECDSA) to the list of known hosts.
[1,1]<stdout>:Starting train.py with rank=1, world_size=2
[1,1]<stdout>:Rank 1 using device: xla:0
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,1]<stderr>:  warnings.warn(
[1,0]<stdout>:Starting train.py with rank=0, world_size=2
[1,0]<stdout>:Rank 0 using device: xla:0
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,0]<stderr>:  warnings.warn(
[1,0]<stdout>:Rank 0: Model & tokenizer loaded.
[1,0]<stdout>:Creating dummy data: 100 samples, max_length=128
[1,1]<stdout>:Rank 1: Model & tokenizer loaded.
[1,1]<stdout>:Creating dummy data: 100 samples, max_length=128
[1,0]<stdout>:Rank 0 - Starting training for 5 epochs...
[1,0]<stdout>:Rank 0 - Epoch 1/5
[1,1]<stdout>:Rank 1 - Starting training for 5 epochs...
[1,1]<stdout>:Rank 1 - Epoch 1/5
[1,0]<stdout>:2025-01-10 05:07:14.000716:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,0]<stdout>:2025-01-10 05:07:14.000718:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/6515ac7a-7a98-481c-a258-0e9348cd3e0b/model.MODULE_1878053911276658100+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/6515ac7a-7a98-481c-a258-0e9348cd3e0b/model.MODULE_1878053911276658100+d7517139.neff --verbose=35
[1,1]<stdout>:2025-01-10 05:07:14.000768:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,1]<stdout>:2025-01-10 05:07:14.000770:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/af10be9d-5767-4432-b208-e44706bac50a/model.MODULE_1878053911276658100+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/af10be9d-5767-4432-b208-e44706bac50a/model.MODULE_1878053911276658100+d7517139.neff --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:2025-01-10 05:12:12.000837:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,1]<stdout>:2025-01-10 05:12:12.000838:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/75644802-fd80-4c04-a4dc-f25676e4eb84/model.MODULE_14279532823474606908+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/75644802-fd80-4c04-a4dc-f25676e4eb84/model.MODULE_14279532823474606908+d7517139.neff --verbose=35
[1,1]<stdout>:.[1,0]<stdout>:2025-01-10 05:12:30.000743:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,0]<stdout>:2025-01-10 05:12:30.000745:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/fabb4467-e8c9-4a2e-99b4-db92a885176d/model.MODULE_14279532823474606908+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/fabb4467-e8c9-4a2e-99b4-db92a885176d/model.MODULE_14279532823474606908+d7517139.neff --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:2025-01-10 05:18:11.000729:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,1]<stdout>:2025-01-10 05:18:11.000731:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/88d605e5-f0d4-4089-ba0d-57428e272f8d/model.MODULE_13495939635158976076+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/88d605e5-f0d4-4089-ba0d-57428e272f8d/model.MODULE_13495939635158976076+d7517139.neff --verbose=35
mpile_workdir/fde07ac1-6d2b-44f4-8313-87574455b56e/model.MODULE_13495939635158976076+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/fde07ac1-6d2b-44f4-8313-87574455b56e/model.MODULE_13495939635158976076+d7517139.neff --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,1]<stdout>:Rank 1 - Epoch 1 done in 979.31s
[1,1]<stdout>:Rank 1 - Epoch 2/5
[1,1]<stdout>:Rank 1 - Epoch 2 done in 0.89s
[1,1]<stdout>:Rank 1 - Epoch 3/5
[1,1]<stdout>:Rank 1 - Epoch 3 done in 0.88s
[1,1]<stdout>:Rank 1 - Epoch 4/5
[1,1]<stdout>:Rank 1 - Epoch 4 done in 0.88s
[1,1]<stdout>:Rank 1 - Epoch 5/5
[1,1]<stdout>:Rank 1 - Epoch 5 done in 0.88s
[1,1]<stdout>:Rank 1 - All epochs complete in 982.84s
[1,1]<stdout>:Rank 1 - local_samples=250.0, total_time=982.84s, local_throughput=0.25 samples/s, avg_epoch_time=196.57s
[1,1]<stdout>:Rank 1 training complete. Exiting main().
[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,0]<stdout>:Rank 0 - Epoch 1 done in 1004.21s
[1,0]<stdout>:Rank 0 - Epoch 2/5
[1,0]<stdout>:Rank 0 - Epoch 2 done in 0.87s
[1,0]<stdout>:Rank 0 - Epoch 3/5
[1,0]<stdout>:Rank 0 - Epoch 3 done in 0.87s
[1,0]<stdout>:Rank 0 - Epoch 4/5
[1,0]<stdout>:Rank 0 - Epoch 4 done in 0.86s
[1,0]<stdout>:Rank 0 - Epoch 5/5
[1,0]<stdout>:Rank 0 - Epoch 5 done in 0.86s
[1,0]<stdout>:Rank 0 - All epochs complete in 1007.67s
[1,0]<stdout>:Rank 0 - local_samples=250.0, total_time=1007.67s, local_throughput=0.25 samples/s, avg_epoch_time=201.53s
[1,0]<stdout>:Rank 0 training complete. Exiting main().

2025/01/10 05:24:08 No throughput lines found. Possibly missing in logs.
2025/01/10 05:24:08 No epoch time lines found. Possibly missing in logs.
--- PASS: TestNeuronTraining (1030.58s)
    --- PASS: TestNeuronTraining/neuron-training (1030.58s)
        --- PASS: TestNeuronTraining/neuron-training/Neuron_training_Job_succeeds (1030.37s)
PASS
2025/01/10 05:24:08 Deleting Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:24:10 Successfully deleted Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:24:10 Tests finished with exit code 0
ok      github.com/aws/aws-k8s-tester/test/cases/neuron-training        1051.048s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mselim00
Contributor

mselim00 commented Jan 13, 2025

With changes:

$ go test -v ./test/cases/neuron-training/... -tags=e2e -run TestBertTraining --timeout=0 -args -bertTrainingImage=<ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/aws-k8s-tester/neuron-training:latest --efaEnabled=true --nodeType=trn1.32xlarge
2025/01/13 18:46:35 Starting tests...
2025/01/13 18:46:35 Applying Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 18:46:36 Successfully applied Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 18:46:36 Waiting for MPI Operator deployment to be available.
2025/01/13 18:46:41 MPI Operator deployment is available.
2025/01/13 18:46:41 Waiting for Neuron Device Plugin daemonset to be ready.
2025/01/13 18:46:46 Neuron Device Plugin daemonset is ready.
2025/01/13 18:46:46 Waiting for EFA Device Plugin daemonset to be ready.
2025/01/13 18:46:51 EFA Device Plugin daemonset is ready.
2025/01/13 18:46:51 [INFO] Processing node ip-172-31-56-35.us-west-2.compute.internal
2025/01/13 18:46:51 [INFO] Processing node ip-172-31-59-221.us-west-2.compute.internal
2025/01/13 18:46:51 [INFO] Total Nodes: 2
2025/01/13 18:46:51 [INFO] Total Neuron Count: 32, Neuron Per Node: 16
2025/01/13 18:46:51 [INFO] Total Neuron Core Count: 64, Neuron Core Per Node: 32
2025/01/13 18:46:51 [INFO] Total EFA Count: 16, EFA Per Node: 8
=== RUN   TestBertTraining
=== RUN   TestBertTraining/neuron-training
2025/01/13 18:46:51 Applying rendered Neuron training manifest.
2025/01/13 18:46:51 Successfully applied Neuron training manifest.
=== RUN   TestBertTraining/neuron-training/Neuron_training_Job_succeeds
2025/01/13 18:46:51 Waiting for the 'neuron-training-launcher' Job resource to be created...
2025/01/13 18:46:56 Job 'neuron-training-launcher' is created in the cluster.
2025/01/13 18:46:56 Waiting for 'neuron-training-launcher' Job to succeed...
2025/01/13 19:16:31 Job 'neuron-training-launcher' succeeded!
2025/01/13 19:16:31 == Raw Logs from the launcher pods ==
2025/01/13 19:16:31 Launcher: whoami => ubuntu
Launcher: Starting SSH...
 * Starting OpenBSD Secure Shell server sshd
   ...done.
Launcher: Running mpirun for BERT training...
[sudo] password for ubuntu: Warning: Permanently added 'neuron-training-worker-0.neuron-training.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'neuron-training-worker-1.neuron-training.default.svc' (ED25519) to the list of known hosts.
[1,1]<stderr>:WARNING:root:Found libneuronpjrt.so. Setting PJRT_DEVICE=NEURON.
[1,0]<stderr>:WARNING:root:Found libneuronpjrt.so. Setting PJRT_DEVICE=NEURON.
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,1]<stderr>:  _torch_pytree._register_pytree_node(
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,1]<stderr>:  _torch_pytree._register_pytree_node(
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,0]<stderr>:  _torch_pytree._register_pytree_node(
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,0]<stderr>:  _torch_pytree._register_pytree_node(
[1,1]<stdout>:Starting train.py with rank=1, world_size=2
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.
[1,1]<stderr>:  return fn(*args, **kwargs)
[1,1]<stdout>:[Rank 1] using device: xla:0
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,1]<stderr>:  warnings.warn(
[1,0]<stdout>:Starting train.py with rank=0, world_size=2
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.
[1,0]<stderr>:  return fn(*args, **kwargs)
[1,0]<stdout>:[Rank 0] using device: xla:0
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,0]<stderr>:  warnings.warn(
[1,0]<stdout>:[Rank 0]: Model & tokenizer loaded.
[1,0]<stdout>:Creating dummy data: 1000 samples, max_length=128
[1,1]<stdout>:[Rank 1]: Model & tokenizer loaded.
[1,1]<stdout>:Creating dummy data: 1000 samples, max_length=128
[1,0]<stdout>:Rank 0 - Starting warmup
[1,1]<stdout>:Rank 1 - Starting warmup
[1,0]<stdout>:2025-01-13 18:51:00.000853:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/e307b151-a4e9-4b8b-a657-5a8f1f9a3703/model.MODULE_13709537945729123956+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/e307b151-a4e9-4b8b-a657-5a8f1f9a3703/model.MODULE_13709537945729123956+e30acd3a.neff --target=trn1 --verbose=35
[1,1]<stdout>:2025-01-13 18:51:00.000907:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/c8280224-a507-4ead-a50b-4f1b79d44ae1/model.MODULE_13709537945729123956+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/c8280224-a507-4ead-a50b-4f1b79d44ae1/model.MODULE_13709537945729123956+e30acd3a.neff --target=trn1 --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:2025-01-13 18:59:27.000331:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/2c334b06-0b0b-4530-b635-269a807dc564/model.MODULE_3375684979178417899+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/2c334b06-0b0b-4530-b635-269a807dc564/model.MODULE_3375684979178417899+e30acd3a.neff --target=trn1 --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:2025-01-13 18:59:31.000879:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/f326751b-b38e-4e6f-9baa-1f3c46f5446f/model.MODULE_3375684979178417899+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/f326751b-b38e-4e6f-9baa-1f3c46f5446f/model.MODULE_3375684979178417899+e30acd3a.neff --target=trn1 --verbose=35
[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:[Rank 1] - Epoch 0, Step 10, Loss=5.5402
[1,1]<stdout>:2025-01-13 19:08:45.000297:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/cccb0c98-079a-4c5e-a979-0479f94cc213/model.MODULE_12564306245150408481+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/cccb0c98-079a-4c5e-a979-0479f94cc213/model.MODULE_12564306245150408481+e30acd3a.neff --target=trn1 --verbose=35
[1,1]<stdout>:.[1,0]<stdout>:[Rank 0] - Epoch 0, Step 10, Loss=7.5797
[1,0]<stdout>:2025-01-13 19:08:49.000708:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/728c65e8-96ed-4230-899e-9afacf343ade/model.MODULE_12564306245150408481+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/728c65e8-96ed-4230-899e-9afacf343ade/model.MODULE_12564306245150408481+e30acd3a.neff --target=trn1 --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:Rank 1 - Finished warmup in 1490.44s
[1,1]<stdout>:Rank 1 - Starting training for 5 epochs...
[1,1]<stdout>:[Rank 1] - Epoch 1/5
[1,1]<stdout>:[Rank 1] - Epoch 1, Step 10, Loss=3.5816
[1,1]<stdout>:[Rank 1] - Epoch 1 done in 3.91s
[1,1]<stdout>:[Rank 1] - Epoch 2/5
[1,1]<stdout>:[Rank 1] - Epoch 2, Step 10, Loss=3.4553
[1,1]<stdout>:[Rank 1] - Epoch 2 done in 3.88s
[1,1]<stdout>:[Rank 1] - Epoch 3/5
[1,1]<stdout>:[Rank 1] - Epoch 3, Step 10, Loss=3.4267
[1,1]<stdout>:[Rank 1] - Epoch 3 done in 3.89s
[1,1]<stdout>:[Rank 1] - Epoch 4/5
[1,1]<stdout>:[Rank 1] - Epoch 4, Step 10, Loss=3.4234
[1,1]<stdout>:[Rank 1] - Epoch 4 done in 3.90s
[1,1]<stdout>:[Rank 1] - Epoch 5/5
[1,0]<stdout>:Rank 0 - Finished warmup in 1506.43s
[1,0]<stdout>:Rank 0 - Starting training for 5 epochs...
[1,0]<stdout>:[Rank 0] - Epoch 1/5
[1,1]<stdout>:[Rank 1] - Epoch 5, Step 10, Loss=3.3973
[1,0]<stdout>:[Rank 0] - Epoch 1, Step 10, Loss=3.7818
[1,1]<stdout>:[Rank 1] - Epoch 5 done in 3.89s
[1,1]<stdout>:[Rank 1] - All epochs complete in 19.48s
[1,1]<stdout>:[Rank 1] - local_samples=2500.0, total_time=19.48s, local_throughput=128.36 samples/s, local_avg_epoch_time=3.90s
[1,1]<stdout>:[Rank 1] training complete. Exiting main().
[1,0]<stdout>:[Rank 0] - Epoch 1 done in 3.92s
[1,0]<stdout>:[Rank 0] - Epoch 2/5
[1,0]<stdout>:[Rank 0] - Epoch 2, Step 10, Loss=3.4994
[1,0]<stdout>:[Rank 0] - Epoch 2 done in 3.90s
[1,0]<stdout>:[Rank 0] - Epoch 3/5
[1,0]<stdout>:[Rank 0] - Epoch 3, Step 10, Loss=3.4227
[1,0]<stdout>:[Rank 0] - Epoch 3 done in 3.91s
[1,0]<stdout>:[Rank 0] - Epoch 4/5
[1,0]<stdout>:[Rank 0] - Epoch 4, Step 10, Loss=3.4861
[1,0]<stdout>:[Rank 0] - Epoch 4 done in 3.91s
[1,0]<stdout>:[Rank 0] - Epoch 5/5
[1,0]<stdout>:[Rank 0] - Epoch 5, Step 10, Loss=3.4878
[1,0]<stdout>:[Rank 0] - Epoch 5 done in 3.90s
[1,0]<stdout>:[Rank 0] - All epochs complete in 19.54s
[1,0]<stdout>:[Rank 0] - local_samples=2500.0, total_time=19.54s, local_throughput=127.93 samples/s, local_avg_epoch_time=3.91s
[1,0]<stdout>:[Rank 0] training complete. Exiting main().

2025/01/13 19:16:31 Parsed throughput from 2 ranks. Total=256.29 samples/s, Average=128.15 samples/s
2025/01/13 19:16:31 Average Throughput: 128.15 samples/second
2025/01/13 19:16:31 Parsed average epoch time from 2 ranks. Sum=7.81s, Average=3.91s
--- PASS: TestBertTraining (1780.08s)
    --- PASS: TestBertTraining/neuron-training (1780.08s)
        --- PASS: TestBertTraining/neuron-training/Neuron_training_Job_succeeds (1780.05s)
PASS
2025/01/13 19:16:31 Deleting Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 19:16:32 Successfully deleted Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 19:16:32 Tests finished with exit code 0
ok      github.com/aws/aws-k8s-tester/test/cases/neuron-training        1796.990s
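The rank-level parsing and aggregation shown in this run (Total=256.29 samples/s, Average=128.15 samples/s across 2 ranks) can be sketched as below. The regexes and function name are illustrative assumptions, not the harness's actual patterns; they are keyed to the `local_throughput=...` and `local_avg_epoch_time=...` summary lines visible in the log above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Regexes keyed to the per-rank summary lines emitted by train.py, e.g.
// "[Rank 0] - ... local_throughput=127.93 samples/s, local_avg_epoch_time=3.91s".
var (
	throughputRe = regexp.MustCompile(`local_throughput=([0-9.]+) samples/s`)
	epochTimeRe  = regexp.MustCompile(`local_avg_epoch_time=([0-9.]+)s`)
)

// aggregate scans the launcher log lines, collects each rank's throughput
// and average epoch time, and returns the total throughput, the average
// throughput per rank, and the average epoch time per rank. Lines without
// a match are skipped, which is why a format drift between the trainer
// and the parser yields "No throughput lines found".
func aggregate(logLines []string) (totalTput, avgTput, avgEpoch float64) {
	var tputs, epochs []float64
	for _, line := range logLines {
		if m := throughputRe.FindStringSubmatch(line); m != nil {
			v, _ := strconv.ParseFloat(m[1], 64)
			tputs = append(tputs, v)
		}
		if m := epochTimeRe.FindStringSubmatch(line); m != nil {
			v, _ := strconv.ParseFloat(m[1], 64)
			epochs = append(epochs, v)
		}
	}
	for _, v := range tputs {
		totalTput += v
	}
	if len(tputs) > 0 {
		avgTput = totalTput / float64(len(tputs))
	}
	var sumEpoch float64
	for _, v := range epochs {
		sumEpoch += v
	}
	if len(epochs) > 0 {
		avgEpoch = sumEpoch / float64(len(epochs))
	}
	return
}

func main() {
	// The two summary lines from the run above.
	lines := []string{
		"[1,1]<stdout>:[Rank 1] - local_samples=2500.0, total_time=19.48s, local_throughput=128.36 samples/s, local_avg_epoch_time=3.90s",
		"[1,0]<stdout>:[Rank 0] - local_samples=2500.0, total_time=19.54s, local_throughput=127.93 samples/s, local_avg_epoch_time=3.91s",
	}
	total, avg, epoch := aggregate(lines)
	fmt.Printf("Total=%.2f samples/s, Average=%.2f samples/s, AvgEpochTime=%.2fs\n", total, avg, epoch)
}
```

Note how the earlier run's output ("No throughput lines found") is consistent with this design: the first version of the trainer printed `avg_epoch_time=` rather than `local_avg_epoch_time=`, so a strict regex would silently match zero ranks.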

@mselim00 mselim00 changed the title [WIP] (feat/test) Add New E2E Test Harness for Distributed BERT Training on Neuron (Trainium) feat(test/neuron-training): Add New E2E Test Harness for Distributed BERT Training on Neuron (Trainium) Jan 13, 2025
Contributor

@mselim00 mselim00 left a comment


lgtm

@mselim00 mselim00 merged commit 8688c5f into aws:main Jan 13, 2025
7 checks passed