update tests (#111)
Signed-off-by: Dmitry Shmulevich <[email protected]>
dmitsh authored Nov 6, 2024
1 parent bf6f029 commit 39ed040
Showing 10 changed files with 18 additions and 27 deletions.
2 changes: 1 addition & 1 deletion docs/examples/kueue/kueue.md
@@ -3,7 +3,7 @@
Install `kueue` by following these [instructions](https://kueue.sigs.k8s.io/docs/installation/):

```bash
- KUEUE_VERSION=v0.8.0
+ KUEUE_VERSION=v0.9.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml

kubectl apply -f charts/overrides/kueue/priority.yaml
2 changes: 1 addition & 1 deletion resources/benchmarks/README.md
@@ -36,7 +36,7 @@ To run the benchmark test for Run:ai

## Scaling Benchmark Test

- The scaling benchmark workflow operates on 500 virtual GPU nodes with two workflows. The first [workflow](scaling/workflows/run-test-multi.yaml) submits a job with 500 replicas; the second [workflow](scaling/workflows/run-test-single.yaml) submits a batch of 500 single-node jobs.
+ The scaling benchmark workflow operates on 700 virtual GPU nodes with two workflows. The first [workflow](scaling/workflows/run-test-multi.yaml) submits a job with 700 replicas; the second [workflow](scaling/workflows/run-test-single.yaml) submits a batch of 700 single-node jobs.

### Example

9 changes: 0 additions & 9 deletions resources/benchmarks/nwtopo/templates/runai/mpijob.yaml
@@ -43,15 +43,6 @@ spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- - weight: 20
- podAffinityTerm:
- labelSelector:
- matchExpressions:
- - key: app
- operator: In
- values:
- - {{._NAME_}}
- topologyKey: net-layer-3
- weight: 70
podAffinityTerm:
labelSelector:
4 changes: 2 additions & 2 deletions resources/benchmarks/scaling/workflows/config-nodes.yaml
@@ -13,14 +13,14 @@
# limitations under the License.

name: config-nodes
- description: create 500 virtual GPU nodes
+ description: create 700 virtual GPU nodes
tasks:
- id: configure
type: Configure
params:
nodes:
- type: dgxa100.80g
- count: 500
+ count: 700
labels:
nvidia.com/gpu.count: "8"
timeout: 5m
@@ -40,5 +40,5 @@ tasks:
submitacl: '*'
resources:
max:
- {memory: 360Gi, vcore: 50000m, nvidia.com/gpu: 4000}
+ {memory: 360Gi, vcore: 70000m, nvidia.com/gpu: 5600}
timeout: 1m
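
The new queue limits appear to track the larger node pool. The following is a minimal sketch of that arithmetic, assuming each virtual node exposes 8 GPUs (per the `nvidia.com/gpu.count: "8"` label in `config-nodes.yaml`) and a per-node vcore share of 100m. The 100m figure is inferred from the old values (50000m across 500 nodes) and is not stated in the commit; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Recompute the queue maximums from the node count.
set -euo pipefail

nodes=700            # node count from config-nodes.yaml
gpus_per_node=8      # from the nvidia.com/gpu.count label
vcore_per_node_m=100 # assumption: inferred from 50000m / 500 nodes

total_gpus=$((nodes * gpus_per_node))
total_vcore_m=$((nodes * vcore_per_node_m))

echo "nvidia.com/gpu: ${total_gpus}"
echo "vcore: ${total_vcore_m}m"
```

With `nodes=700` this yields 5600 GPUs and 70000m vcore, matching the updated `max` entry; the `memory` limit is unchanged by the commit.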
6 changes: 3 additions & 3 deletions resources/benchmarks/scaling/workflows/run-test-multi.yaml
@@ -13,13 +13,13 @@
# limitations under the License.

name: test-scaling-multi-node-job
- description: deploy a 500-replicas job
+ description: deploy a 700-replicas job
tasks:
- id: job
type: SubmitObj
params:
refTaskId: register
count: 1
params:
- replicas: 500
- ttl: 2m
+ replicas: 700
+ ttl: 5m
6 changes: 3 additions & 3 deletions resources/benchmarks/scaling/workflows/run-test-single.yaml
@@ -13,13 +13,13 @@
# limitations under the License.

name: test-scaling-single-node-jobs
- description: deploy 500 single-replica jobs
+ description: deploy 700 single-replica jobs
tasks:
- id: job
type: SubmitObj
params:
refTaskId: register
- count: 500
+ count: 700
params:
replicas: 1
- ttl: 2m
+ ttl: 5m
6 changes: 3 additions & 3 deletions resources/benchmarks/scaling/workflows/runai-test-multi.yaml
@@ -13,13 +13,13 @@
# limitations under the License.

name: test-scaling
- description: deploy a 500-replicas job
+ description: deploy a 700-replicas job
tasks:
- id: job
type: SubmitObj
params:
refTaskId: register-mpi
count: 1
params:
- workers: 499
- ttl: 2m
+ workers: 699
+ ttl: 5m
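
The `workers` value is one less than the replica count in the description. A plausible reading, which is an assumption and not stated in the commit, is that the MPI job runs one launcher pod alongside its workers, so the two together make up the 700 replicas. A short sketch of that relationship:

```shell
#!/usr/bin/env bash
# Derive the worker count for an MPI job from the total replica count.
set -euo pipefail

replicas=700 # total pods named in the workflow description
launchers=1  # assumption: a single launcher pod per MPI job

workers=$((replicas - launchers))
echo "workers: ${workers}"
```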
6 changes: 3 additions & 3 deletions resources/benchmarks/scaling/workflows/runai-test-single.yaml
@@ -13,12 +13,12 @@
# limitations under the License.

name: test-scaling
- description: deploy 500 single-replica jobs
+ description: deploy 700 single-replica jobs
tasks:
- id: job
type: SubmitObj
params:
refTaskId: register-trainingworkload
- count: 500
+ count: 700
params:
- ttl: 2m
+ ttl: 5m
2 changes: 1 addition & 1 deletion scripts/env.sh
@@ -131,7 +131,7 @@ function deploy_jobset() {
}

# https://github.com/kubernetes-sigs/kueue
- KUEUE_VERSION=v0.8.1
+ KUEUE_VERSION=v0.9.0

function deploy_kueue() {
printGreen Deploying kueue
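
Before this commit the docs pinned `v0.8.0` while `scripts/env.sh` pinned `v0.8.1`. A small check along the following lines could flag that kind of drift. This is only a sketch: `kueue_version_in` and `check_kueue_versions` are hypothetical helpers, not part of the repository, and the pattern assumes each file contains a plain `KUEUE_VERSION=<value>` line.

```shell
#!/usr/bin/env bash
# Compare the Kueue version pinned in two files.
set -euo pipefail

# Print the value of the first KUEUE_VERSION=... line in a file.
kueue_version_in() {
  sed -n '/^[[:space:]]*KUEUE_VERSION=/{s/^[[:space:]]*KUEUE_VERSION=//p;q;}' "$1"
}

# Hypothetical helper: succeed only when both files pin the same version.
check_kueue_versions() {
  local docs_file="$1" env_file="$2"
  local docs_version env_version
  docs_version="$(kueue_version_in "$docs_file")"
  env_version="$(kueue_version_in "$env_file")"
  if [ -z "$docs_version" ] || [ -z "$env_version" ]; then
    echo "KUEUE_VERSION not found in one of the files" >&2
    return 2
  fi
  if [ "$docs_version" != "$env_version" ]; then
    echo "mismatch: ${docs_file}=${docs_version} ${env_file}=${env_version}" >&2
    return 1
  fi
  echo "ok: ${docs_version}"
}
```

From the repository root it could be called as `check_kueue_versions docs/examples/kueue/kueue.md scripts/env.sh`.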