Parallelize neuron training processes for each neuron core #566

mselim00 · 2025-01-21T23:18:29Z

Enables full multi-processing across all neuron cores, and corrects an earlier issue where world size wasn't being correctly determined (i.e., each process was in its own process group). Changes from mpirun to torchrun.
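For reference, a minimal sketch of the world-size arithmetic this change relies on, assuming one training process per Neuron core. The function and parameter names are illustrative and do not come from the PR; the 2-node, 32-cores-per-node figures are inferred from the 64 ranks reported later in this thread.

```go
package main

import "fmt"

// worldSize returns the number of training processes expected across the
// whole job when one process is launched per Neuron core.
// Both parameter names are hypothetical, chosen for this sketch only.
func worldSize(nodes, coresPerNode int) int {
	return nodes * coresPerNode
}

func main() {
	// Assumed values: 2 nodes with 32 cores each.
	fmt.Println(worldSize(2, 32))
}
```

If every process ends up in its own process group, each one reports a world size of 1 instead of this product, which is the symptom described above.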

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mselim00 · 2025-01-27T18:21:26Z

Will publish test results in a bit

mselim00 · 2025-01-27T20:08:10Z

Ran the test on 2 nodes locally with go test

2025/01/27 19:50:34 Parsed throughput from 56 ranks. Total=2798.46 samples/s, Average=49.97 samples/s
2025/01/27 19:50:34 Average Throughput: 49.97 samples/second
2025/01/27 19:50:34 Parsed average epoch time from 56 ranks. Sum=17.36s, Average=0.31s
--- PASS: TestBertTraining (675.16s)
    --- PASS: TestBertTraining/bert-training (675.16s)
        --- PASS: TestBertTraining/bert-training/Neuron_training_Job_succeeds (675.14s)
PASS

There's still an issue collecting the throughput info. This run parsed 56 ranks; other runs parse a different, seemingly arbitrary number. I thought this was a regex issue, which I've fixed, but the problem persists. I might work on this separately, though: the ranks we do parse are probably representative of the group, and afaict the NVIDIA training test currently only parses from the master process.

mattcjo · 2025-01-27T23:17:15Z

test/cases/neuron-training/manifests/training-comm-service.yaml

+apiVersion: v1
+kind: Service
+metadata:
+  name: training
+  labels:
+    app: training
+spec:
+  clusterIP: None
+  selector:
+    job-name: bert-training


Is explicit service creation required for torchrun?

Yeah, this service is required so that we can dynamically determine the master node's IP via bert-training-0.training in the job spec.
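A rough sketch of how that name is formed and how it could feed a torchrun rendezvous endpoint. It assumes the Job runs in Indexed completion mode (pod hostnames of the form `<job-name>-<index>`) and that the pods' subdomain matches the headless Service name. The helper names, the port, and the exact flag set are assumptions for illustration; the PR's actual job spec is not shown in this thread.

```go
package main

import "fmt"

// masterAddr builds the in-namespace DNS name of one pod of an Indexed Job
// that sits behind a headless Service: "<job-name>-<index>.<service-name>".
func masterAddr(jobName, serviceName string, index int) string {
	return fmt.Sprintf("%s-%d.%s", jobName, index, serviceName)
}

// torchrunArgs assembles rendezvous flags for torchrun. The selection of
// flags here is an assumption, not the PR's actual command line.
func torchrunArgs(nnodes, nprocPerNode int, addr string, port int) []string {
	return []string{
		fmt.Sprintf("--nnodes=%d", nnodes),
		fmt.Sprintf("--nproc_per_node=%d", nprocPerNode),
		"--rdzv_backend=c10d",
		fmt.Sprintf("--rdzv_endpoint=%s:%d", addr, port),
	}
}

func main() {
	addr := masterAddr("bert-training", "training", 0)
	// 29500 is an assumed port for this sketch.
	fmt.Println(torchrunArgs(2, 32, addr, 29500))
}
```

Because the Service sets `clusterIP: None`, DNS resolves the name straight to the pod's IP rather than to a virtual service IP, so every worker can reach pod 0 without knowing its address in advance.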

mattcjo · 2025-01-27T23:19:05Z

@mselim00 This is slightly concerning. Are you able to confirm that the expected number of processes is running even if the metrics seem off?

mselim00 · 2025-01-27T23:27:15Z

Yep, I manually checked that we have logs from all 64 ranks, that all of them print those metrics, and that all of them print the training complete log line. I'm not sure as to the root cause atm, just know that it's probably not a RegEx issue at this point.

mattcjo

LGTM. Approving since CI check failure is unrelated. Merge once fixed.

mselim00 · 2025-01-28T06:37:35Z

Fixed parsing... it was a regex issue, sort of. The match rule just didn't account for multiple processes printing to the same line.

2025/01/28 06:27:34 Parsed throughput from 64 ranks. Total=3446.16 samples/s, Average=53.85 samples/s
2025/01/28 06:27:34 Average Throughput: 53.85 samples/second
2025/01/28 06:27:34 Parsed average epoch time from 64 ranks. Sum=18.56s, Average=0.29s
--- PASS: TestBertTraining (896.10s)
    --- PASS: TestBertTraining/bert-training (896.10s)
        --- PASS: TestBertTraining/bert-training/Neuron_training_Job_succeeds (895.71s)
PASS
2025/01/28 06:27:34 Deleting Neuron device plugin and EFA device plugin manifests.
2025/01/28 06:27:35 Successfully deleted Neuron device plugin and EFA device plugin manifests.
2025/01/28 06:27:35 Tests finished with exit code 0
ok      github.com/aws/aws-k8s-tester/test/cases/neuron-training        908.868s
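One way a match rule can undercount when several ranks write to the same line is sketched below. The log format and both patterns are hypothetical, since the thread does not show the real regex: a line-anchored pattern sees at most one metric per line, while an unanchored pattern finds every occurrence.

```go
package main

import (
	"fmt"
	"regexp"
)

// countMatches reports how many times re matches in logs.
func countMatches(re *regexp.Regexp, logs string) int {
	return len(re.FindAllStringSubmatch(logs, -1))
}

// Hypothetical patterns. The anchored one requires the metric to occupy a
// whole line; the unanchored one matches anywhere in the text.
var (
	anchored   = regexp.MustCompile(`(?m)^throughput: ([0-9.]+) samples/s$`)
	unanchored = regexp.MustCompile(`throughput: ([0-9.]+) samples/s`)
)

func main() {
	// Two ranks raced and wrote onto the first line; a third wrote alone.
	logs := "throughput: 50.0 samples/sthroughput: 52.0 samples/s\n" +
		"throughput: 54.0 samples/s\n"

	fmt.Println("anchored:", countMatches(anchored, logs))
	fmt.Println("unanchored:", countMatches(unanchored, logs))
}
```

With this input the anchored pattern skips the shared line entirely, which would be consistent with a rank count that is lower than expected and varies from run to run.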

mselim00 · 2025-01-28T17:33:49Z

Force-pushed #570 onto this PR to unblock the build.

mattcjo · 2025-01-28T18:27:03Z

test/cases/neuron-training/bert_training_test.go

+func aggregateMetricFromLogs(metricRegex *regexp.Regexp, logs string) (avg float64, sum float64, count int) {
+	matches := metricRegex.FindAllStringSubmatch(logs, -1)
+	for _, match := range matches {
+		val, err := strconv.ParseFloat(match[1], 64)
+		if err == nil {
+			sum += val
+			count++


This is nice.

mselim00 force-pushed the neuron-training branch 2 times, most recently from 874c16d to 57d3591 Compare January 27, 2025 18:20

mselim00 requested review from mattcjo and wwvela January 27, 2025 18:22

wwvela reviewed Jan 27, 2025

wwvela reviewed Jan 27, 2025

mattcjo reviewed Jan 27, 2025

mattcjo approved these changes Jan 28, 2025

mselim00 changed the title from "[WIP] Parallelize training processes for each neuron core" to "Parallelize neuron training processes for each neuron core" Jan 28, 2025

mselim00 mentioned this pull request Jan 28, 2025

Fix kubetest2 build by replacing opencensus vanity url #569

Closed

mselim00 force-pushed the neuron-training branch 2 times, most recently from bf4a484 to 8d635dc Compare January 28, 2025 06:36

mselim00 force-pushed the neuron-training branch 3 times, most recently from 74ffd5d to beea290 Compare January 28, 2025 06:57

mselim00 requested a review from mattcjo January 28, 2025 16:54

mselim00 added 9 commits January 28, 2025 17:32

Run 1 neuron training proc per neuron core

6d21b0f

Switch to elastic launch, scale process count to neuron core count

1db3d79

Enable worker <-> master collective communication

bd24aad

Fix epoch time parsing, bump sdk version, cleanup

edc0788

Fix rank parsing, clean up excess logs

5d9a6fa

Fix average epoch/throughput regexs for multiprocessing

3479c1e

Rename neuron training to bert training

6512d72

Formatting fix

efaaf17

Fix metrics parsing and unspecified nodetype handling

dbcf2c2

mselim00 force-pushed the neuron-training branch from beea290 to dbcf2c2 Compare January 28, 2025 17:33

mattcjo approved these changes Jan 28, 2025

mselim00 merged commit 0bdfd85 into aws:main Jan 28, 2025
9 checks passed
