
feat(test/neuron-training): Add New E2E Test Harness for Distributed BERT Training on Neuron (Trainium) #558

Merged
merged 8 commits into from
Jan 13, 2025

Conversation

mattcjo
Contributor

@mattcjo mattcjo commented Jan 10, 2025

Issue #, if available:

Description of changes:
This pull request introduces a brand-new end-to-end (E2E) test that validates distributed BERT pretraining on AWS Trainium (Neuron) nodes using the MPI operator. Previously, there was no “neuron-training” test in the repository. Key aspects of the newly added test:

  1. Templated MPIJob Manifest

    • Dynamically renders a neuron-bert-training.yaml manifest with placeholders for node type, image, resource requests (NeuronCore, EFA), and replica counts.
  2. Two-Step “Job Wait” Logic

    • After applying the MPIJob, the harness first waits for the MPI operator to create the “launcher” Kubernetes Job.
    • Once the Job resource exists, the test then waits for it to reach a Succeeded state, ensuring the distributed BERT training completes successfully.
  3. Throughput and Epoch-Time Metrics

    • Upon success, the test gathers logs from the launcher pods.
    • The logs are parsed for rank-level throughput and epoch timing, which get aggregated and printed as final metrics.
  4. Fully Automated E2E Workflow

    • The harness covers everything from applying the device plugin manifests and verifying the MPI operator is healthy, through applying the training manifest, to validating that distributed BERT training completed on Neuron-based instances.

This new test provides confidence that multi-node Neuron clusters can successfully run an MPI-based BERT pretraining job end to end, including resource scheduling, environment setup, and performance metrics gathering.

Testing

go test -timeout 60m -v . -run TestNeuronTraining \
  -bertTrainingImage={{ACCOUNT_ID}}.dkr.ecr.us-east-1.amazonaws.com/aws-k8s-tester/neuron-training:latest \
  -efaEnabled=true \
  -nodeType=trn1.32xlarge
2025/01/10 05:06:39 Starting tests...
2025/01/10 05:06:39 Applying Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:06:42 Successfully applied Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:06:42 Waiting for MPI Operator deployment to be available.
2025/01/10 05:06:47 MPI Operator deployment is available.
2025/01/10 05:06:47 Waiting for Neuron Device Plugin daemonset to be ready.
2025/01/10 05:06:52 Neuron Device Plugin daemonset is ready.
2025/01/10 05:06:52 Waiting for EFA Device Plugin daemonset to be ready.
2025/01/10 05:06:57 EFA Device Plugin daemonset is ready.
2025/01/10 05:06:57 [INFO] Processing node ip-192-168-145-196.ec2.internal
2025/01/10 05:06:57 [INFO] Processing node ip-192-168-163-18.ec2.internal
2025/01/10 05:06:57 [INFO] Total Nodes: 2
2025/01/10 05:06:57 [INFO] Total Neuron Count: 32, Neuron Per Node: 16
2025/01/10 05:06:57 [INFO] Total Neuron Core Count: 64, Neuron Core Per Node: 32
2025/01/10 05:06:57 [INFO] Total EFA Count: 16, EFA Per Node: 8
=== RUN   TestNeuronTraining
=== RUN   TestNeuronTraining/neuron-training
2025/01/10 05:06:57 Applying rendered Neuron training manifest.
2025/01/10 05:06:57 Successfully applied Neuron training manifest.
=== RUN   TestNeuronTraining/neuron-training/Neuron_training_Job_succeeds
2025/01/10 05:06:57 Waiting for the 'neuron-training-launcher' Job resource to be created...
2025/01/10 05:07:02 Job 'neuron-training-launcher' is created in the cluster.
2025/01/10 05:07:02 Waiting for 'neuron-training-launcher' Job to succeed...
2025/01/10 05:24:08 Job 'neuron-training-launcher' succeeded!
2025/01/10 05:24:08 == Raw Logs from the launcher pods ==
2025/01/10 05:24:08 Launcher: whoami => ubuntu
Launcher: Starting SSH...
 * Starting OpenBSD Secure Shell server sshd
   ...done.
Launcher: Running mpirun for BERT training...
[sudo] password for ubuntu: Warning: Permanently added 'neuron-training-worker-0.neuron-training.default.svc,192.168.136.5' (ECDSA) to the list of known hosts.
Warning: Permanently added 'neuron-training-worker-1.neuron-training.default.svc,192.168.137.41' (ECDSA) to the list of known hosts.
[1,1]<stdout>:Starting train.py with rank=1, world_size=2
[1,1]<stdout>:Rank 1 using device: xla:0
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,1]<stderr>:  warnings.warn(
[1,0]<stdout>:Starting train.py with rank=0, world_size=2
[1,0]<stdout>:Rank 0 using device: xla:0
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,0]<stderr>:  warnings.warn(
[1,0]<stdout>:Rank 0: Model & tokenizer loaded.
[1,0]<stdout>:Creating dummy data: 100 samples, max_length=128
[1,1]<stdout>:Rank 1: Model & tokenizer loaded.
[1,1]<stdout>:Creating dummy data: 100 samples, max_length=128
[1,0]<stdout>:Rank 0 - Starting training for 5 epochs...
[1,0]<stdout>:Rank 0 - Epoch 1/5
[1,1]<stdout>:Rank 1 - Starting training for 5 epochs...
[1,1]<stdout>:Rank 1 - Epoch 1/5
[1,0]<stdout>:2025-01-10 05:07:14.000716:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,0]<stdout>:2025-01-10 05:07:14.000718:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/6515ac7a-7a98-481c-a258-0e9348cd3e0b/model.MODULE_1878053911276658100+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/6515ac7a-7a98-481c-a258-0e9348cd3e0b/model.MODULE_1878053911276658100+d7517139.neff --verbose=35
[1,1]<stdout>:2025-01-10 05:07:14.000768:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,1]<stdout>:2025-01-10 05:07:14.000770:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/af10be9d-5767-4432-b208-e44706bac50a/model.MODULE_1878053911276658100+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/af10be9d-5767-4432-b208-e44706bac50a/model.MODULE_1878053911276658100+d7517139.neff --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:2025-01-10 05:12:12.000837:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,1]<stdout>:2025-01-10 05:12:12.000838:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/75644802-fd80-4c04-a4dc-f25676e4eb84/model.MODULE_14279532823474606908+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/75644802-fd80-4c04-a4dc-f25676e4eb84/model.MODULE_14279532823474606908+d7517139.neff --verbose=35
[1,1]<stdout>:.[1,0]<stdout>:2025-01-10 05:12:30.000743:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,0]<stdout>:2025-01-10 05:12:30.000745:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/fabb4467-e8c9-4a2e-99b4-db92a885176d/model.MODULE_14279532823474606908+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/fabb4467-e8c9-4a2e-99b4-db92a885176d/model.MODULE_14279532823474606908+d7517139.neff --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:2025-01-10 05:18:11.000729:  38  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
[1,1]<stdout>:2025-01-10 05:18:11.000731:  38  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/88d605e5-f0d4-4089-ba0d-57428e272f8d/model.MODULE_13495939635158976076+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/88d605e5-f0d4-4089-ba0d-57428e272f8d/model.MODULE_13495939635158976076+d7517139.neff --verbose=35
mpile_workdir/fde07ac1-6d2b-44f4-8313-87574455b56e/model.MODULE_13495939635158976076+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/fde07ac1-6d2b-44f4-8313-87574455b56e/model.MODULE_13495939635158976076+d7517139.neff --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,1]<stdout>:Rank 1 - Epoch 1 done in 979.31s
[1,1]<stdout>:Rank 1 - Epoch 2/5
[1,1]<stdout>:Rank 1 - Epoch 2 done in 0.89s
[1,1]<stdout>:Rank 1 - Epoch 3/5
[1,1]<stdout>:Rank 1 - Epoch 3 done in 0.88s
[1,1]<stdout>:Rank 1 - Epoch 4/5
[1,1]<stdout>:Rank 1 - Epoch 4 done in 0.88s
[1,1]<stdout>:Rank 1 - Epoch 5/5
[1,1]<stdout>:Rank 1 - Epoch 5 done in 0.88s
[1,1]<stdout>:Rank 1 - All epochs complete in 982.84s
[1,1]<stdout>:Rank 1 - local_samples=250.0, total_time=982.84s, local_throughput=0.25 samples/s, avg_epoch_time=196.57s
[1,1]<stdout>:Rank 1 training complete. Exiting main().
[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,0]<stdout>:Rank 0 - Epoch 1 done in 1004.21s
[1,0]<stdout>:Rank 0 - Epoch 2/5
[1,0]<stdout>:Rank 0 - Epoch 2 done in 0.87s
[1,0]<stdout>:Rank 0 - Epoch 3/5
[1,0]<stdout>:Rank 0 - Epoch 3 done in 0.87s
[1,0]<stdout>:Rank 0 - Epoch 4/5
[1,0]<stdout>:Rank 0 - Epoch 4 done in 0.86s
[1,0]<stdout>:Rank 0 - Epoch 5/5
[1,0]<stdout>:Rank 0 - Epoch 5 done in 0.86s
[1,0]<stdout>:Rank 0 - All epochs complete in 1007.67s
[1,0]<stdout>:Rank 0 - local_samples=250.0, total_time=1007.67s, local_throughput=0.25 samples/s, avg_epoch_time=201.53s
[1,0]<stdout>:Rank 0 training complete. Exiting main().

2025/01/10 05:24:08 No throughput lines found. Possibly missing in logs.
2025/01/10 05:24:08 No epoch time lines found. Possibly missing in logs.
--- PASS: TestNeuronTraining (1030.58s)
    --- PASS: TestNeuronTraining/neuron-training (1030.58s)
        --- PASS: TestNeuronTraining/neuron-training/Neuron_training_Job_succeeds (1030.37s)
PASS
2025/01/10 05:24:08 Deleting Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:24:10 Successfully deleted Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/10 05:24:10 Tests finished with exit code 0
ok      github.com/aws/aws-k8s-tester/test/cases/neuron-training        1051.048s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mselim00
Contributor

mselim00 commented Jan 13, 2025

With changes:

$ go test -v ./test/cases/neuron-training/... -tags=e2e -run TestBertTraining --timeout=0 -args -bertTrainingImage=<ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/aws-k8s-tester/neuron-training:latest --efaEnabled=true --nodeType=trn1.32xlarge
2025/01/13 18:46:35 Starting tests...
2025/01/13 18:46:35 Applying Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 18:46:36 Successfully applied Neuron device plugin RBAC, Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 18:46:36 Waiting for MPI Operator deployment to be available.
2025/01/13 18:46:41 MPI Operator deployment is available.
2025/01/13 18:46:41 Waiting for Neuron Device Plugin daemonset to be ready.
2025/01/13 18:46:46 Neuron Device Plugin daemonset is ready.
2025/01/13 18:46:46 Waiting for EFA Device Plugin daemonset to be ready.
2025/01/13 18:46:51 EFA Device Plugin daemonset is ready.
2025/01/13 18:46:51 [INFO] Processing node ip-172-31-56-35.us-west-2.compute.internal
2025/01/13 18:46:51 [INFO] Processing node ip-172-31-59-221.us-west-2.compute.internal
2025/01/13 18:46:51 [INFO] Total Nodes: 2
2025/01/13 18:46:51 [INFO] Total Neuron Count: 32, Neuron Per Node: 16
2025/01/13 18:46:51 [INFO] Total Neuron Core Count: 64, Neuron Core Per Node: 32
2025/01/13 18:46:51 [INFO] Total EFA Count: 16, EFA Per Node: 8
=== RUN   TestBertTraining
=== RUN   TestBertTraining/neuron-training
2025/01/13 18:46:51 Applying rendered Neuron training manifest.
2025/01/13 18:46:51 Successfully applied Neuron training manifest.
=== RUN   TestBertTraining/neuron-training/Neuron_training_Job_succeeds
2025/01/13 18:46:51 Waiting for the 'neuron-training-launcher' Job resource to be created...
2025/01/13 18:46:56 Job 'neuron-training-launcher' is created in the cluster.
2025/01/13 18:46:56 Waiting for 'neuron-training-launcher' Job to succeed...
2025/01/13 19:16:31 Job 'neuron-training-launcher' succeeded!
2025/01/13 19:16:31 == Raw Logs from the launcher pods ==
2025/01/13 19:16:31 Launcher: whoami => ubuntu
Launcher: Starting SSH...
 * Starting OpenBSD Secure Shell server sshd
   ...done.
Launcher: Running mpirun for BERT training...
[sudo] password for ubuntu: Warning: Permanently added 'neuron-training-worker-0.neuron-training.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'neuron-training-worker-1.neuron-training.default.svc' (ED25519) to the list of known hosts.
[1,1]<stderr>:WARNING:root:Found libneuronpjrt.so. Setting PJRT_DEVICE=NEURON.
[1,0]<stderr>:WARNING:root:Found libneuronpjrt.so. Setting PJRT_DEVICE=NEURON.
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,1]<stderr>:  _torch_pytree._register_pytree_node(
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,1]<stderr>:  _torch_pytree._register_pytree_node(
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:441: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,0]<stderr>:  _torch_pytree._register_pytree_node(
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/transformers/utils/generic.py:309: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
[1,0]<stderr>:  _torch_pytree._register_pytree_node(
[1,1]<stdout>:Starting train.py with rank=1, world_size=2
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.
[1,1]<stderr>:  return fn(*args, **kwargs)
[1,1]<stdout>:[Rank 1] using device: xla:0
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,1]<stderr>:  warnings.warn(
[1,0]<stdout>:Starting train.py with rank=0, world_size=2
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: Set seed for `privateuseone` device does not take effect, please add API's `_is_in_bad_fork` and `manual_seed_all` to `privateuseone` device module.
[1,0]<stderr>:  return fn(*args, **kwargs)
[1,0]<stdout>:[Rank 0] using device: xla:0
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,0]<stderr>:  warnings.warn(
[1,0]<stdout>:[Rank 0]: Model & tokenizer loaded.
[1,0]<stdout>:Creating dummy data: 1000 samples, max_length=128
[1,1]<stdout>:[Rank 1]: Model & tokenizer loaded.
[1,1]<stdout>:Creating dummy data: 1000 samples, max_length=128
[1,0]<stdout>:Rank 0 - Starting warmup
[1,1]<stdout>:Rank 1 - Starting warmup
[1,0]<stdout>:2025-01-13 18:51:00.000853:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/e307b151-a4e9-4b8b-a657-5a8f1f9a3703/model.MODULE_13709537945729123956+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/e307b151-a4e9-4b8b-a657-5a8f1f9a3703/model.MODULE_13709537945729123956+e30acd3a.neff --target=trn1 --verbose=35
[1,1]<stdout>:2025-01-13 18:51:00.000907:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/c8280224-a507-4ead-a50b-4f1b79d44ae1/model.MODULE_13709537945729123956+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/c8280224-a507-4ead-a50b-4f1b79d44ae1/model.MODULE_13709537945729123956+e30acd3a.neff --target=trn1 --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:2025-01-13 18:59:27.000331:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/2c334b06-0b0b-4530-b635-269a807dc564/model.MODULE_3375684979178417899+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/2c334b06-0b0b-4530-b635-269a807dc564/model.MODULE_3375684979178417899+e30acd3a.neff --target=trn1 --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:2025-01-13 18:59:31.000879:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/f326751b-b38e-4e6f-9baa-1f3c46f5446f/model.MODULE_3375684979178417899+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/f326751b-b38e-4e6f-9baa-1f3c46f5446f/model.MODULE_3375684979178417899+e30acd3a.neff --target=trn1 --verbose=35
[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:[Rank 1] - Epoch 0, Step 10, Loss=5.5402
[1,1]<stdout>:2025-01-13 19:08:45.000297:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/cccb0c98-079a-4c5e-a979-0479f94cc213/model.MODULE_12564306245150408481+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/cccb0c98-079a-4c5e-a979-0479f94cc213/model.MODULE_12564306245150408481+e30acd3a.neff --target=trn1 --verbose=35
[1,1]<stdout>:.[1,0]<stdout>:[Rank 0] - Epoch 0, Step 10, Loss=7.5797
[1,0]<stdout>:2025-01-13 19:08:49.000708:  37  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/728c65e8-96ed-4230-899e-9afacf343ade/model.MODULE_12564306245150408481+e30acd3a.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/728c65e8-96ed-4230-899e-9afacf343ade/model.MODULE_12564306245150408481+e30acd3a.neff --target=trn1 --verbose=35
[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,0]<stdout>:.[1,1]<stdout>:.[1,1]<stdout>:
[1,1]<stdout>:Compiler status PASS
[1,0]<stdout>:.[1,0]<stdout>:
[1,0]<stdout>:Compiler status PASS
[1,1]<stdout>:Rank 1 - Finished warmup in 1490.44s
[1,1]<stdout>:Rank 1 - Starting training for 5 epochs...
[1,1]<stdout>:[Rank 1] - Epoch 1/5
[1,1]<stdout>:[Rank 1] - Epoch 1, Step 10, Loss=3.5816
[1,1]<stdout>:[Rank 1] - Epoch 1 done in 3.91s
[1,1]<stdout>:[Rank 1] - Epoch 2/5
[1,1]<stdout>:[Rank 1] - Epoch 2, Step 10, Loss=3.4553
[1,1]<stdout>:[Rank 1] - Epoch 2 done in 3.88s
[1,1]<stdout>:[Rank 1] - Epoch 3/5
[1,1]<stdout>:[Rank 1] - Epoch 3, Step 10, Loss=3.4267
[1,1]<stdout>:[Rank 1] - Epoch 3 done in 3.89s
[1,1]<stdout>:[Rank 1] - Epoch 4/5
[1,1]<stdout>:[Rank 1] - Epoch 4, Step 10, Loss=3.4234
[1,1]<stdout>:[Rank 1] - Epoch 4 done in 3.90s
[1,1]<stdout>:[Rank 1] - Epoch 5/5
[1,0]<stdout>:Rank 0 - Finished warmup in 1506.43s
[1,0]<stdout>:Rank 0 - Starting training for 5 epochs...
[1,0]<stdout>:[Rank 0] - Epoch 1/5
[1,1]<stdout>:[Rank 1] - Epoch 5, Step 10, Loss=3.3973
[1,0]<stdout>:[Rank 0] - Epoch 1, Step 10, Loss=3.7818
[1,1]<stdout>:[Rank 1] - Epoch 5 done in 3.89s
[1,1]<stdout>:[Rank 1] - All epochs complete in 19.48s
[1,1]<stdout>:[Rank 1] - local_samples=2500.0, total_time=19.48s, local_throughput=128.36 samples/s, local_avg_epoch_time=3.90s
[1,1]<stdout>:[Rank 1] training complete. Exiting main().
[1,0]<stdout>:[Rank 0] - Epoch 1 done in 3.92s
[1,0]<stdout>:[Rank 0] - Epoch 2/5
[1,0]<stdout>:[Rank 0] - Epoch 2, Step 10, Loss=3.4994
[1,0]<stdout>:[Rank 0] - Epoch 2 done in 3.90s
[1,0]<stdout>:[Rank 0] - Epoch 3/5
[1,0]<stdout>:[Rank 0] - Epoch 3, Step 10, Loss=3.4227
[1,0]<stdout>:[Rank 0] - Epoch 3 done in 3.91s
[1,0]<stdout>:[Rank 0] - Epoch 4/5
[1,0]<stdout>:[Rank 0] - Epoch 4, Step 10, Loss=3.4861
[1,0]<stdout>:[Rank 0] - Epoch 4 done in 3.91s
[1,0]<stdout>:[Rank 0] - Epoch 5/5
[1,0]<stdout>:[Rank 0] - Epoch 5, Step 10, Loss=3.4878
[1,0]<stdout>:[Rank 0] - Epoch 5 done in 3.90s
[1,0]<stdout>:[Rank 0] - All epochs complete in 19.54s
[1,0]<stdout>:[Rank 0] - local_samples=2500.0, total_time=19.54s, local_throughput=127.93 samples/s, local_avg_epoch_time=3.91s
[1,0]<stdout>:[Rank 0] training complete. Exiting main().

2025/01/13 19:16:31 Parsed throughput from 2 ranks. Total=256.29 samples/s, Average=128.15 samples/s
2025/01/13 19:16:31 Average Throughput: 128.15 samples/second
2025/01/13 19:16:31 Parsed average epoch time from 2 ranks. Sum=7.81s, Average=3.91s
--- PASS: TestBertTraining (1780.08s)
    --- PASS: TestBertTraining/neuron-training (1780.08s)
        --- PASS: TestBertTraining/neuron-training/Neuron_training_Job_succeeds (1780.05s)
PASS
2025/01/13 19:16:31 Deleting Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 19:16:32 Successfully deleted Neuron device plugin, MPI operator, and EFA device plugin manifests.
2025/01/13 19:16:32 Tests finished with exit code 0
ok      github.com/aws/aws-k8s-tester/test/cases/neuron-training        1796.990s
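The rank-level parsing and aggregation shown in this run (Total=256.29 samples/s, Average=128.15 samples/s across 2 ranks) can be sketched as below. The regexes and function name are illustrative assumptions, not the harness's actual patterns; they are keyed to the `local_throughput=...` and `local_avg_epoch_time=...` summary lines visible in the log above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Regexes keyed to the per-rank summary lines emitted by train.py, e.g.
// "[Rank 0] - ... local_throughput=127.93 samples/s, local_avg_epoch_time=3.91s".
var (
	throughputRe = regexp.MustCompile(`local_throughput=([0-9.]+) samples/s`)
	epochTimeRe  = regexp.MustCompile(`local_avg_epoch_time=([0-9.]+)s`)
)

// aggregate scans the launcher log lines, collects each rank's throughput
// and average epoch time, and returns the total throughput, the average
// throughput per rank, and the average epoch time per rank. Lines without
// a match are skipped, which is why a format drift between the trainer
// and the parser yields "No throughput lines found".
func aggregate(logLines []string) (totalTput, avgTput, avgEpoch float64) {
	var tputs, epochs []float64
	for _, line := range logLines {
		if m := throughputRe.FindStringSubmatch(line); m != nil {
			v, _ := strconv.ParseFloat(m[1], 64)
			tputs = append(tputs, v)
		}
		if m := epochTimeRe.FindStringSubmatch(line); m != nil {
			v, _ := strconv.ParseFloat(m[1], 64)
			epochs = append(epochs, v)
		}
	}
	for _, v := range tputs {
		totalTput += v
	}
	if len(tputs) > 0 {
		avgTput = totalTput / float64(len(tputs))
	}
	var sumEpoch float64
	for _, v := range epochs {
		sumEpoch += v
	}
	if len(epochs) > 0 {
		avgEpoch = sumEpoch / float64(len(epochs))
	}
	return
}

func main() {
	// The two summary lines from the run above.
	lines := []string{
		"[1,1]<stdout>:[Rank 1] - local_samples=2500.0, total_time=19.48s, local_throughput=128.36 samples/s, local_avg_epoch_time=3.90s",
		"[1,0]<stdout>:[Rank 0] - local_samples=2500.0, total_time=19.54s, local_throughput=127.93 samples/s, local_avg_epoch_time=3.91s",
	}
	total, avg, epoch := aggregate(lines)
	fmt.Printf("Total=%.2f samples/s, Average=%.2f samples/s, AvgEpochTime=%.2fs\n", total, avg, epoch)
}
```

Note how the earlier run's output ("No throughput lines found") is consistent with this design: the first version of the trainer printed `avg_epoch_time=` rather than `local_avg_epoch_time=`, so a strict regex would silently match zero ranks.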

@mselim00 mselim00 changed the title [WIP] (feat/test) Add New E2E Test Harness for Distributed BERT Training on Neuron (Trainium) feat(test/neuron-training): Add New E2E Test Harness for Distributed BERT Training on Neuron (Trainium) Jan 13, 2025
Contributor

@mselim00 mselim00 left a comment


lgtm

@mselim00 mselim00 merged commit 8688c5f into aws:main Jan 13, 2025
7 checks passed