Package v1 is the v1 version of the API.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
JobCondition describes the state of the job at a certain point.
Field | Description |
---|---|
|
Type of job condition. |
|
Status of the condition, one of True, False, Unknown. |
|
The reason for the condition’s last transition. |
|
A human-readable message indicating details about the transition. |
|
The last time this condition was updated. |
|
Last time the condition transitioned from one status to another. |
JobConditionType defines the possible types of conditions reported in a JobStatus.
JobStatus represents the current observed state of the training Job.
Field | Description |
---|---|
|
Conditions is an array of current observed job conditions. |
|
ReplicaStatuses is a map from ReplicaType to ReplicaStatus that specifies the status of each replica. |
|
Represents the time when the job was acknowledged by the job controller. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC. |
|
Represents the time when the job was completed. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC. |
|
Represents the last time the job was reconciled. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC. |
ReplicaSpec is a description of the replica.
Field | Description |
---|---|
|
Replicas is the desired number of replicas of the given template. If unspecified, defaults to 1. |
|
Template is the object that describes the pod that will be created for this replica. The RestartPolicy in PodTemplateSpec is overridden by the RestartPolicy in ReplicaSpec. |
|
Restart policy for all replicas within the job. One of Always, OnFailure, Never, and ExitCode. Defaults to Never. |
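
The defaults stated in the field descriptions above (one replica, restart policy Never) could be applied as in the following sketch. The struct is a simplified local stand-in that omits the pod template, and the field names, constant names, and the `ApplyDefaults` helper are assumptions. Note that the RestartPolicy type description names a different default, so the choice of Never here follows the field description only.

```go
package main

import "fmt"

// RestartPolicy holds one of the four documented values.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// ReplicaSpec is a simplified stand-in; the pod template is omitted
// and the field names are assumptions.
type ReplicaSpec struct {
	Replicas      *int32 // nil means "unspecified"
	RestartPolicy RestartPolicy
}

// ApplyDefaults (hypothetical helper) fills unspecified fields with the
// defaults stated in the field descriptions.
func ApplyDefaults(spec *ReplicaSpec) {
	if spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
	if spec.RestartPolicy == "" {
		spec.RestartPolicy = RestartPolicyNever
	}
}

func main() {
	spec := ReplicaSpec{}
	ApplyDefaults(&spec)
	fmt.Println(*spec.Replicas, spec.RestartPolicy)
}
```
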
ReplicaStatus represents the current observed state of the replica.
Field | Description |
---|---|
|
The number of actively running pods. |
|
The number of pods which reached phase Succeeded. |
|
The number of pods which reached phase Failed. |
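
Because JobStatus keeps one ReplicaStatus per ReplicaType, a job-wide view requires summing the per-replica counters. The sketch below shows one way to do that. The field names, the `Totals` helper, and the `"Master"` and `"Worker"` replica types are assumptions, since each operator defines its own set of ReplicaTypes.

```go
package main

import "fmt"

// ReplicaType names a kind of replica; the values are operator-specific.
type ReplicaType string

// ReplicaStatus is a local stand-in with assumed field names.
type ReplicaStatus struct {
	Active    int32 // actively running pods
	Succeeded int32 // pods that reached phase Succeeded
	Failed    int32 // pods that reached phase Failed
}

// Totals (hypothetical helper) sums the counters across all replica types.
func Totals(statuses map[ReplicaType]ReplicaStatus) ReplicaStatus {
	var total ReplicaStatus
	for _, s := range statuses {
		total.Active += s.Active
		total.Succeeded += s.Succeeded
		total.Failed += s.Failed
	}
	return total
}

func main() {
	statuses := map[ReplicaType]ReplicaStatus{
		"Master": {Active: 1},
		"Worker": {Active: 2, Succeeded: 1, Failed: 1},
	}
	fmt.Printf("%+v\n", Totals(statuses))
}
```
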
ReplicaType represents the type of the replica. Each operator needs to define its own set of ReplicaTypes.
RestartPolicy describes how the replicas should be restarted. Only one of the following restart policies may be specified. If none of the following policies is specified, the default one is RestartPolicyAlways.
RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.
Field | Description |
---|---|
|
CleanPodPolicy defines the policy to kill pods after the job completes. Defaults to Running. |
|
TTLSecondsAfterFinished is the TTL to clean up jobs. It may take extra ReconcilePeriod seconds for the cleanup, since reconcile gets called periodically. Defaults to infinite. |
|
Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; the value must be a positive integer. |
|
Optional number of retries before marking this job failed. |
|
SchedulingPolicy defines the policy related to scheduling, e.g. gang-scheduling. |
SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example minAvailable for gang-scheduling.
Field | Description |
---|---|
|