
DDL training jobs

Contains some experimental data

Description of the DDL training jobs in dataset.xlsx

dataset.xlsx contains a total of 8692 rows of data; each row represents a submitted job.

The meaning of each column is as follows (a short loading sketch follows the list):

  • Job type: There are 27 types of distributed jobs in total; the model names used are unknown.

  • Running time(second): The time from the start to the end of the job in the cluster

  • Ps: The number of parameter servers used in the job

  • Worker: The number of workers used in the job

  • Epoch: The number of epochs used in job training

  • Accuracy: The accuracy of the final output of the job

  • Loss: The loss value corresponding to the highest accuracy

  • GPU: The number of GPUs used in the job

  • Memory(GB): The amount of memory used in the job

  • CPU: The number of CPUs used in the job

  • Batch Size: The amount of data processed by a job in a batch

  • Learning rate: One of the training hyperparameters of the job

  • Step: The number of steps for job training

  • Throughput: The amount of data processed by the job per unit time

  • Resubmit: Whether the job is a resubmission: 0 means the job was submitted repeatedly, and 1 means this is the first submission of the job

  • Predictability: Whether the job is predictable: 0 means the job is predictable, and 1 means it is unpredictable
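
The sketch below shows one way to load and sanity-check the spreadsheet with pandas. It assumes pandas (with an Excel engine such as openpyxl) is installed, that dataset.xlsx sits next to the script, and that the spreadsheet headers match the column names listed above; adjust the names if they differ.

```python
# Minimal loading sketch for dataset.xlsx (assumes pandas + an Excel engine
# such as openpyxl are installed; column names are assumed to match the
# headers listed above).
import pandas as pd

df = pd.read_excel("dataset.xlsx")

# Sanity checks: row count and column names.
print(len(df))               # expected: 8692
print(df.columns.tolist())   # compare against the column list above

# Example slice: first-time submissions that are marked predictable.
subset = df[(df["Resubmit"] == 1) & (df["Predictability"] == 0)]
print(f"First-time, predictable jobs: {len(subset)}")
```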

Distribution characteristics of the DDL training jobs in dataset.xlsx

  • Running time distribution

[Figure: running time distribution]

  • Ratio of predictable and unpredictable jobs

[Figure: ratio of predictable and unpredictable jobs]
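
As a rough illustration, the two summaries above could be recomputed from the spreadsheet as in the sketch below. It assumes pandas and matplotlib are installed and that the relevant headers are spelled "Running time(second)" and "Predictability" as in the column list; this is not the original plotting code.

```python
# Sketch for reproducing the two distribution summaries (assumes pandas and
# matplotlib are installed; header spellings are assumptions based on the
# column list above).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("dataset.xlsx")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of job running times (seconds).
df["Running time(second)"].plot.hist(bins=50, ax=ax1)
ax1.set_xlabel("Running time (seconds)")
ax1.set_title("Running time distribution")

# Share of predictable (0) vs. unpredictable (1) jobs.
counts = df["Predictability"].value_counts().rename(
    {0: "predictable", 1: "unpredictable"})
counts.plot.pie(autopct="%.1f%%", ax=ax2)
ax2.set_ylabel("")
ax2.set_title("Predictable vs. unpredictable jobs")

plt.tight_layout()
plt.show()
```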
