Preparing release of v0.3.0 (#414)
Also
- Updated Makefile to use new version
- extra notes for developers
alculquicondor authored Sep 1, 2021
1 parent 5fca328 commit b9141c0
Showing 9 changed files with 180 additions and 34 deletions.
70 changes: 57 additions & 13 deletions CONTRIBUTING.md
@@ -14,34 +14,78 @@ git clone https://github.com/${GITHUB_USER}/mpi-operator.git

## Install Dependencies

We use Go v1.13+ for development and use [Go Modules](https://blog.golang.org/using-go-modules) to download and install the dependencies.
We use Go v1.15+ for development and use [Go Modules](https://blog.golang.org/using-go-modules) to download and install the dependencies.

## Run Unit Test
## Controller versions

You can execute all the unit tests via `go test ./...`.
The main module `github.com/kubeflow/mpi-operator` contains the code of the legacy
controllers `v1alpha1`, `v1alpha2` and `v1`.

The newest iteration of the controller is in the module `github.com/kubeflow/mpi-operator/v2`.
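
The split between the two modules can be pictured as a small lookup. This is only an illustrative sketch: the function name `module_for_controller` is hypothetical and not part of the repository, while the version names and module paths are the ones listed above.

```bash
# Hypothetical helper (not part of the repository): print the Go module
# that holds the code for a given controller version.
module_for_controller() {
  case "$1" in
    v1alpha1|v1alpha2|v1)
      # Legacy controllers live in the main module.
      echo "github.com/kubeflow/mpi-operator" ;;
    v2)
      # The newest controller lives in the v2 module.
      echo "github.com/kubeflow/mpi-operator/v2" ;;
    *)
      echo "unknown controller version: $1" >&2
      return 1 ;;
  esac
}

module_for_controller v1
module_for_controller v2
```

Keeping the mapping in one place like this makes it easy to see which module a change belongs to before running its tests.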

## Run tests

### Unit and integration tests

You can execute all the unit and integration tests via `make test`.

If you only wish to run the tests for the v2 controller, you can run `make test_v2`.

You can find the unit tests in the same folders as the functional code.

You can find the integration tests in a separate directory, `v2/test/integration`.
Integration tests make use of a real kube-apiserver to test the interaction of
the controller with a real Kubernetes API. In these tests, other components
are not running, including `kubelet` or `kube-controller-manager`.

Consider adding an integration test if your feature makes new API calls.

### E2E tests

E2E tests run against a real cluster. In our tests, we create a cluster using
[kind](https://kind.sigs.k8s.io/docs/user/quick-start/).

You can run the tests with `make test_e2e`.

If desired, you can run the tests against any existing cluster. Just make sure
that credentials for the cluster are present in `${HOME}/.kube/config` and run:

```bash
USE_EXISTING_CLUSTER=true make test_e2e
```
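
Before pointing the E2E tests at an existing cluster, it can help to confirm that the credentials file is where the tests expect it. The following is a minimal sketch, not part of the repository: the function name `kubeconfig_present` is hypothetical, and the only path assumed is the `${HOME}/.kube/config` location mentioned above.

```bash
# Hypothetical pre-flight check (not part of the repository).
# Succeeds if the given kubeconfig path (default: ${HOME}/.kube/config)
# is a readable, non-empty file.
kubeconfig_present() {
  config_path="${1:-${HOME}/.kube/config}"
  [ -r "${config_path}" ] && [ -s "${config_path}" ]
}

if kubeconfig_present; then
  echo "kubeconfig found; you can run: USE_EXISTING_CLUSTER=true make test_e2e"
else
  echo "no readable kubeconfig at ${HOME}/.kube/config" >&2
fi
```

The check only looks at the file itself; it does not verify that the credentials are valid for the cluster.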

## Check Code Style

We use [golangci-lint](https://github.com/golangci/golangci-lint) to check issues on code style. Please also check out [this wiki](https://github.com/golang/go/wiki/CodeReviewComments) for some additional instructions on code review.
We use [golangci-lint](https://github.com/golangci/golangci-lint) to check issues on code style.
Please also check out [this wiki](https://github.com/golang/go/wiki/CodeReviewComments) for some additional instructions on code review.

You can run formatter and linter with:

```bash
make fmt lint
```

## Run

You have to build the image and deploy the standalone YAMLs in a cluster.

```shell
CONTROLLER_VERSION=v1 RELEASE_VERSION=latest make images
kubectl apply -k manifests/overlays/standalone
```bash
make images dev_manifest
kubectl apply -k manifests/overlays/dev
```

If you need to use a different registry, you can do:
You can build an image using a legacy controller by setting `CONTROLLER_VERSION`:

```shell
IMAGE_NAME=example.com/mpi-operator CONTROLLER_VERSION=v1 RELEASE_VERSION=latest make images
```bash
make CONTROLLER_VERSION=v1 images dev_manifest
kubectl apply -k manifests/overlays/dev
```

Next, modify the line `newName` in `manifests/overlays/standalone/kustomization.yaml`
to match the image name. After pushing the image to the registry, you can apply
the YAMLs the same way as before.
If you need to use a different registry, or a different tag, you can do:

```bash
make IMAGE_NAME=example.com/mpi-operator CONTROLLER_VERSION=v1 RELEASE_VERSION=dev images dev_manifest
```

To look at the controller's logs, you can do:

6 changes: 3 additions & 3 deletions Makefile
@@ -3,8 +3,8 @@ REPO_PATH="github.com/kubeflow/mpi-operator"
REL_OSARCH="linux/amd64"
GitSHA=`git rev-parse HEAD`
Date=`date "+%Y-%m-%d %H:%M:%S"`
RELEASE_VERSION?=v0.2.2
CONTROLLER_VERSION?=v1alpha2
RELEASE_VERSION?=v0.3.0
CONTROLLER_VERSION?=v2
IMG_BUILDER=docker
LD_FLAGS=" \
-X '${REPO_PATH}/pkg/version.GitSHA=${GitSHA}' \
@@ -14,7 +14,7 @@ LD_FLAGS_V2=" \
-X '${REPO_PATH}/v2/pkg/version.GitSHA=${GitSHA}' \
-X '${REPO_PATH}/v2/pkg/version.Built=${Date}' \
-X '${REPO_PATH}/v2/pkg/version.Version=${RELEASE_VERSION}'"
IMAGE_NAME?=kubeflow/mpi-operator
IMAGE_NAME?=mpioperator/mpi-operator
KUBEBUILDER_ASSETS_PATH := $(dir $(abspath $(firstword $(MAKEFILE_LIST))))bin/kubebuilder/bin
KIND_VERSION=v0.11.1
# This kubectl version supports -k for kustomization.
47 changes: 36 additions & 11 deletions README.md
@@ -12,7 +12,7 @@ You can deploy the operator with default settings by running the following comma
```shell
git clone https://github.com/kubeflow/mpi-operator
cd mpi-operator
kubectl create -f deploy/v1alpha2/mpi-operator.yaml
kubectl apply -f deploy/v2beta1/mpi-operator.yaml
```

Alternatively, follow the [getting started guide](https://www.kubeflow.org/docs/started/getting-started/) to deploy Kubeflow.
@@ -34,16 +34,22 @@ mpijobs.kubeflow.org 4d
...
```

If it is not included you can add it as follows using [kustomize](https://github.com/kubernetes-sigs/kustomize):
If it is not included, you can add it as follows using [kustomize](https://github.com/kubernetes-sigs/kustomize):

```bash
git clone https://github.com/kubeflow/mpi-operator
cd mpi-operator/manifests
kustomize build overlays/kubeflow | kubectl apply -f -
cd mpi-operator
kustomize build manifests/overlays/kubeflow | kubectl apply -f -
```

Note that since Kubernetes v1.14, `kustomize` became a subcommand in `kubectl` so you can also run the following command instead:

Since Kubernetes v1.21, you can use:

```bash
kubectl apply -k manifests/overlays/kubeflow
```

```bash
kubectl kustomize base | kubectl apply -f -
```
@@ -53,13 +59,13 @@ kubectl kustomize base | kubectl apply -f -
You can create an MPI job by defining an `MPIJob` config file. See [TensorFlow benchmark example](https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml) config file for launching a multi-node TensorFlow benchmark training job. You may change the config file based on your requirements.

```
cat examples/v1alpha2/tensorflow-benchmarks.yaml
cat examples/v2beta1/tensorflow-benchmarks.yaml
```

Deploy the `MPIJob` resource to start training:

```
kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml
kubectl apply -f examples/v2beta1/tensorflow-benchmarks.yaml
```

## Monitoring an MPI Job
@@ -71,7 +77,7 @@ kubectl get -o yaml mpijobs tensorflow-benchmarks
```

```
apiVersion: kubeflow.org/v1alpha2
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2019-07-09T22:15:51Z"
@@ -82,7 +88,8 @@ metadata:
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
spec:
  cleanPodPolicy: Running
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
@@ -199,6 +206,12 @@ Variables: horovod
total images/sec: 308.27
```

For a sample that uses Intel MPI, see:

```bash
cat examples/pi/pi-intel.yaml
```

## Exposed Metrics

| Metric name | Metric type | Description | Labels |
@@ -213,9 +226,21 @@ total images/sec: 308.27
With [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics), one can join metrics by labels.
For example `kube_pod_info * on(pod,namespace) group_left label_replace(mpi_operator_job_infos, "pod", "$0", "launcher", ".*")`

# Docker Images
## Docker Images

Docker images are built and pushed automatically to [mpioperator on Dockerhub](https://hub.docker.com/u/mpioperator). You can use the following Dockerfiles to build the images yourself:
We push Docker images to [mpioperator on Dockerhub](https://hub.docker.com/u/mpioperator) for every release.
You can use the following Dockerfile to build the image yourself:

- [mpi-operator](https://github.com/kubeflow/mpi-operator/blob/master/Dockerfile)
- [kubectl-delivery](https://github.com/kubeflow/mpi-operator/blob/master/cmd/kubectl-delivery/Dockerfile)

Alternatively, you can build the image using make:

```bash
make RELEASE_VERSION=dev images
```

This will produce an image with the tag `mpioperator/mpi-operator:dev`.
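
The tag follows from two Makefile variables. As a rough sketch, assuming the image reference is simply `IMAGE_NAME:RELEASE_VERSION` (the function name `image_ref` is hypothetical; the defaults are the `IMAGE_NAME` and `RELEASE_VERSION` values set in this commit's Makefile):

```bash
# Hypothetical helper mirroring the Makefile defaults from this commit.
# Assumption: the image is tagged as IMAGE_NAME:RELEASE_VERSION.
image_ref() {
  echo "${IMAGE_NAME:-mpioperator/mpi-operator}:${RELEASE_VERSION:-v0.3.0}"
}

# Defaults only.
image_ref
# Override the tag, as in `make RELEASE_VERSION=dev images`.
(RELEASE_VERSION=dev; image_ref)
# Override both the registry/name and the tag.
(IMAGE_NAME=example.com/mpi-operator; RELEASE_VERSION=dev; image_ref)
```

The overrides run in subshells so that they do not leak into later calls.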

## Contributing

Learn more in [CONTRIBUTING](https://github.com/kubeflow/mpi-operator/blob/master/CONTRIBUTING.md).
15 changes: 15 additions & 0 deletions RELEASE.md
@@ -1,5 +1,20 @@
# MPI Operator Releases

## Release v0.3.0

* Scalability improvements
  * Worker start up no longer issues requests to kube-apiserver.
  * Dropped kubectl-delivery init container, reducing stress on kube-apiserver.
* Support for Intel MPI.
* Support for `runPolicy` (`ttlSecondsAfterFinished`, `activeDeadlineSeconds`, `backoffLimit`)
  by using a k8s Job for the launcher.
* Samples for plain MPI applications.
* Production readiness improvements:
  * Increased coverage throughout unit, integration and E2E tests.
  * More robust API validation.
  * Revisited v2beta1 MPIJob API.
  * Using fully-qualified label names, for consistency with other kubeflow operators.

## Release v0.2.3

### Enhancements
5 changes: 3 additions & 2 deletions examples/horovod/tensorflow-mnist.yaml
@@ -1,10 +1,11 @@
apiVersion: kubeflow.org/v1
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
11 changes: 10 additions & 1 deletion examples/pi/README.md
@@ -7,13 +7,22 @@ Then, it calculates an approximate value for pi.

## How to build Image

For OpenMPI:

```bash
docker build -t mpi-pi .
```

For Intel MPI:

```bash
docker build -t mpi-pi . -f intel.Dockerfile
```

## Create MPIJob

Modify `pi.yaml` to set up the image name from your own registry.
Modify `pi.yaml` (for OpenMPI) or `pi-intel.yaml` (for Intel MPI) to set up the
image name from your own registry.

Then, run:

4 changes: 2 additions & 2 deletions examples/pi/pi-intel.yaml
@@ -14,7 +14,7 @@ spec:
      template:
        spec:
          containers:
          - image: kubeflow/mpi-pi:intel
          - image: mpioperator/mpi-pi:intel
            imagePullPolicy: Always
            name: mpi-launcher
            securityContext:
@@ -33,7 +33,7 @@ spec:
      template:
        spec:
          containers:
          - image: kubeflow/mpi-pi:intel
          - image: mpioperator/mpi-pi:intel
            imagePullPolicy: Always
            name: mpi-worker
            securityContext:
4 changes: 2 additions & 2 deletions examples/pi/pi.yaml
@@ -14,7 +14,7 @@ spec:
      template:
        spec:
          containers:
          - image: kubeflow/mpi-pi
          - image: mpioperator/mpi-pi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
@@ -33,7 +33,7 @@ spec:
      template:
        spec:
          containers:
          - image: kubeflow/mpi-pi
          - image: mpioperator/mpi-pi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
52 changes: 52 additions & 0 deletions examples/v2beta1/tensorflow-benchmarks.yaml
@@ -0,0 +1,52 @@
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
