DDL training jobs

Contains some experimental data

Description of DDL training jobs in the dataset.xlsx

dataset.xlsx contains 8692 rows in total, and each row represents one submitted job.

The meaning of each column is:

  • Job type: The type of the distributed job. There are 27 types in total; the names of the models they use are not known.

  • Running time(second): The time, in seconds, from the start to the end of the job in the cluster

  • Ps: The number of parameter servers used in the job

  • Worker: The number of workers used in the job

  • Epoch: The number of epochs used in job training

  • Accuracy: The accuracy of the final output of the job

  • Loss: The loss value recorded at the point of highest accuracy

  • GPU: The number of GPUs used in the job

  • Memory(GB): The amount of memory, in GB, used in the job

  • CPU: The number of CPUs used in the job

  • Batch Size: The amount of data processed by a job in a batch

  • Learning rate: The learning rate, one of the hyperparameters of the job

  • Step: The number of steps for job training

  • Throughput: The amount of data processed by the job per unit time

  • Resubmit: Whether the job was resubmitted. 0 means the job was resubmitted; 1 means it is the job's first submission.

  • Predictability: Whether the job is predictable. 0 means the job is predictable; 1 means it is unpredictable.
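
The 0/1 encoding of the last two columns is easy to misread, so here is a minimal sketch of how they might be decoded. It assumes the spreadsheet headers match the column names listed above exactly (not verified against the file) and that pandas with an Excel engine such as openpyxl is installed for the loading step. The helper names `load_rows` and `decode_flags` are hypothetical and not part of this repository.

```python
def load_rows(path="dataset.xlsx"):
    """Load the spreadsheet as a list of dicts, one dict per submitted job.

    Assumption: pandas and an Excel engine (e.g. openpyxl) are installed.
    """
    import pandas as pd  # imported lazily so the helpers below work without it
    return pd.read_excel(path).to_dict(orient="records")


# Encodings as described in the column list above.
RESUBMIT_LABELS = {0: "resubmitted", 1: "first submission"}
PREDICTABILITY_LABELS = {0: "predictable", 1: "unpredictable"}


def decode_flags(row):
    """Translate the 0/1 flag columns of one job row into readable labels.

    Assumption: the row keys are exactly "Resubmit" and "Predictability".
    """
    return {
        "resubmit": RESUBMIT_LABELS[int(row["Resubmit"])],
        "predictability": PREDICTABILITY_LABELS[int(row["Predictability"])],
    }
```

With the file present, `decode_flags(load_rows()[0])` would return the labels for the first job.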

Distribution characteristics of DDL training jobs in the dataset.xlsx

  • Running time distribution

[Figure: running time distribution]

  • Ratio of predictable and unpredictable jobs

[Figure: ratio of predictable and unpredictable jobs]
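
The two distributions listed above could be recomputed from the rows roughly as follows. This is a sketch under assumptions: each row is a dict keyed by the column names from the description, and the bucket edges are illustrative values chosen here, not the ones used for the figures. The function names `running_time_histogram` and `predictable_fraction` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def running_time_histogram(rows, edges=(60, 600, 3600, 36000)):
    """Count jobs per running-time bucket.

    `edges` are bucket boundaries in seconds (illustrative assumption).
    Bucket 0 holds times below edges[0], bucket i holds times in
    [edges[i-1], edges[i]), and the last bucket holds times >= edges[-1].
    """
    counts = Counter()
    for row in rows:
        counts[bisect_right(edges, row["Running time(second)"])] += 1
    return [counts[i] for i in range(len(edges) + 1)]


def predictable_fraction(rows):
    """Return the share of jobs whose Predictability flag is 0 (predictable)."""
    rows = list(rows)
    if not rows:
        return 0.0
    predictable = sum(1 for row in rows if int(row["Predictability"]) == 0)
    return predictable / len(rows)
```

Both functions accept any iterable of row dicts, for example the output of `pandas.read_excel("dataset.xlsx").to_dict(orient="records")`.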