DDL training jobs

This repository contains experimental data on distributed deep learning (DDL) training jobs.

Description of DDL training jobs in dataset.xlsx

dataset.xlsx contains a total of 8692 rows of data, and each row represents one submitted job.

The meaning of each column is:

Job type: The type of the distributed job. There are 27 types in total; the model name used by each type is unknown.
Running time(second): The time, in seconds, from the start to the end of the job in the cluster
Ps: The number of parameter servers used in the job
Worker: The number of workers used in the job
Epoch: The number of epochs used in job training
Accuracy: The accuracy of the final output of the job
Loss: The loss value corresponding to the highest accuracy
GPU: The number of GPUs used in the job
Memory(GB): The amount of memory, in GB, used in the job
CPU: The number of CPUs used in the job
Batch Size: The amount of data processed by a job in a batch
Learning rate: The learning rate used in job training, one of the job's hyperparameters
Step: The number of steps for job training
Throughput: The amount of data processed by the job per unit time
Resubmit: Whether the job was submitted repeatedly. 0 means the job was submitted repeatedly, and 1 means it is the first submission of the job
Predictability: Whether the job is predictable. 0 means it is a predictable job, and 1 means it is an unpredictable job
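
Because both 0/1 columns use 0 for the "yes" case (0 = resubmitted, 0 = predictable), they are easy to misread. The following is a minimal sketch of decoding them, not part of the dataset itself. It assumes each row is available as a dictionary keyed by the column names listed above (for example, obtained with pandas via `read_excel("dataset.xlsx").to_dict("records")`, assuming the spreadsheet headers match these names exactly). The helper names `decode_flags` and `first_time_predictable` and the sample rows are hypothetical.

```python
# Flag encodings as stated in the column descriptions above.
RESUBMIT_LABELS = {0: "resubmitted", 1: "first submission"}
PREDICTABILITY_LABELS = {0: "predictable", 1: "unpredictable"}


def decode_flags(row):
    """Return a copy of `row` with readable labels added for the two 0/1 columns."""
    decoded = dict(row)
    decoded["resubmit_label"] = RESUBMIT_LABELS[int(row["Resubmit"])]
    decoded["predictability_label"] = PREDICTABILITY_LABELS[int(row["Predictability"])]
    return decoded


def first_time_predictable(rows):
    """Keep jobs that are first submissions (Resubmit == 1) and predictable (Predictability == 0)."""
    return [
        r for r in rows
        if int(r["Resubmit"]) == 1 and int(r["Predictability"]) == 0
    ]


# Hypothetical rows for illustration only; these values are not taken from dataset.xlsx.
sample_rows = [
    {"Resubmit": 1, "Predictability": 0},
    {"Resubmit": 0, "Predictability": 1},
]
decoded = [decode_flags(r) for r in sample_rows]
selected = first_time_predictable(sample_rows)
```

Keeping the label mappings in one place means the inverted encoding only has to be read correctly once.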