Welcome to the Rhodes JupyterHub environment! This document contains:
- An overview of JupyterHub and the JupyterHub-on-Kubernetes architecture.
- Information for faculty users of the JupyterHub environment.
- Information about how to configure and maintain the environment.
- Troubleshooting information.
- Documentation of the initial cluster design.
Here are some quick links, a full table of contents follows:
- Notebook environment
- Admin page
- Kubernetes cluster
- CS Program GCP Project
- Working with GitHub/nbgitpuller (this document)
- Monitoring dashboards (this document)
- Viewing the cluster (this document)
- Administering user servers (this document)
- Zero to Jupyterhub
(Created by gh-md-toc)
- Rhodes Jupyter Environment
- Table of Contents
- Cluster Architecture and Deployment
- Teaching in the environment
- Configuring the environment
- Regular maintainence
- Troubleshooting and Administration
- Initial Cluster Setup
- Known issues
First, a brief overview of how JupyterHub works:
- There is a JupyterHub process running to manage single-user Jupyter servers. Users connect to this process and authenticate. After authenticating, a Jupyter server is spawned for that user.
- When users are running notebooks or managing their files, etc., they are communicating directly with their server.
- The main JupyterHub process can cull inactive servers, be used for server administration for admin users, etc.
In the Rhodes CS GCP project, there is a Kubernetes cluster running the JupyterHub deployment for the program. Kubernetes is a tool for running and managing containers (in our case, Docker containers) on a cluster of physical machines.
If you are not familiar with containers, think of them as lightweight virtual machines. A container image is the "disk image" for that VM, containing the binaries that the virtual machine will run, the filesystem, etc.
In this cluster, there is a set of nodes/"physical" machines (GCE VMs) that is running the main JupyterHub server. When a user logs in, a user server is spawned by JupyterHub. This is a container running the Jupyter server.
If there is not enough physical resources to run the user server, another node is added to the cluster. When user servers time out and are reaped, creating idle resources, these nodes are given back to GCE.
Users' home directories are stored on GCE persistent disks that are attached to their containers when the containers are started. There is more detail about this later in the document.
The JupyterHub deployment is configured in two ways:
-
The Helm chart that configures the JupyterHub application. Helm is package manager for Kubernetes. A Helm chart is a configuration of Kubernetes processes/resources. We use the chart template provided by the JupyterHub project (the repository is here), which configures the JupyterHub server, the network proxy, etc.
We supply values to fill in this template and customize our deployment of JupyterHub (
config/config.yaml
). This configuration includes which docker container to run, how many resources should be provided to a server, how authentication works, etc. -
The Docker container images that run user servers. The Jupyter project maintains a set of container images. We layer some libraries on top of the Jupyter scipy stack for COMP141 and therefore have a custom container image for COMP141. This is configured in
config/Dockerfile.comp141
and is hosted in the GCP project here with the namegcr.io/rhodes-cs/nb_server_comp141
.In this guide, there are instructions for configuring, building, and pushing custom container images.
Authentication is done via OneLogin. We have integrated with the Rhodes OneLogin production instance so that students and faculty can log in using their Rhodes credentials. The cluster is open to all Rhodes students, though we plan to clean up storage periodically.
Students and faculty use Rhodes OneLogin to log into their accounts. All Rhodes students have access to the cluster.
On first login, a user's storage is initialized. Note that student work is stored on separate disks; by default, the instructor cannot access the student filesystem, except through using the admin panel (although you can manually mount a student's disk, see this section for more information).
If you want to start or stop your server once logged in, click on "Control Panel" -- this will let you stop your currently-running server, and then restart it. You may need to advise students to do this if they haven't restarted in a long time and need to pick up library/file changes.
Students may need help with their environment, or you may need to peek at student work.
To access a student server, you have to be an administrator (configured in
config/config.yaml
). To access the admin panel, click "Control Panel" then
"Admin" in the header bar of the environment. Here, you can see all student
accounts and servers. Click "access server" to access a student's server (you
can also start or stop student servers from here as well).
Note that even though there's a scary "authorize" button, your activity is not visible to the student.
For Fall '23: All users automatically sync this Github repo on server start.
The deployment is configured to automatically update Rhodes-specific libraries
when a user logs in, by running pip install
on this
repository.
Distributing files to students is done by using nbgitpuller, a tool that automatically syncs files in a user's directory with a GitHub repository. The tool is smart enough to do conflict resolution and hides the machinery of cloning/updating a repo from the student.
To distribute a repository to a student, first make the repository public and generate a URL for nbgitpuller using this generator. Some configuration has been pre-populated. The generated url can then be distributed to students (e.g., through Canvas).
It is also possible to configure JupyterHub to automatically run nbgitpuller when a user's server starts. Instructions are here.
WARNING: It is a best practice to not delete or rename files in git
after they have been cloned by students. The merge behavior for gitpuller
is
permissive, but has difficulty resolving some conflicts, particularly conflicts
where a file has been created/renamed in one branch and deleted in another.
For your sanity, don't move or remove files after they are in students'
hands! See Known issues, particularly
#5.
END WARNING
Then, make sure you clone the repo to a separate directory than the
nbgitpuller
synced one (currently comp141-fa23
):
$ git clone https://github.com/Rhodes-CS-Department/comp141-fa23.git comp141-development
When making local changes to be reviewed/to shared resources, use a branch:
$ git checkout -b <my local branch>
...
$ git commit -m 'my changes'
$ git push --set-upstream origin <my local branch> # use your token to login if using 2fa
...
$ git checkout main
$ git pull
If you had changes in main
but wanted to actually be working in a branch:
$ git stash # stash changes
$ git checkout -b <branch>
$ git stash pop # unstash changes in new branch
We use Rhodes-specific Python libraries for many assignments, which can be found in this repo.
In order to reduce startup time, these libraries are installed in the Docker image, rather than on server start. This means that changes to the 141 libraries require a rebuild of the Docker image and a version tag update followed by a helm upgrade.
If you want to use additional libraries for your course, they must be added to the server docker image. Please file an issue for the student admins to install the library on the deployed image, or follow the instructions in the Customizing the Docker Image section.
We use okpy for student notebook submissions. Since okpy only supports Google for authentication, students need to log in to okpy with a Google account. In the past, we had students create Google accounts that corresponded to their Rhodes email (see here, for example) in order to access the environment. Since we have moved to OneLogin, students no longer do this.
You can still require students to do so, or you can collect email accounts from them. Then, you can use a setup like this for students in your course:
Okpy is difficult to use with multiple instructors in the same course shell.
However, okpy config files require a course shell endpoint in order to submit
student work. The c1.notebooks
library in the
comp141-libraries
allows instructors to distribute a template okpy config file with assigments.
Instead of distributing a foo.ok
config file for assignment foo
, instructors
will distribute a template file. The COMP141 library will create the correct
foo.ok
config file from the template.
Template files must be named .template.ok
(notice the .
) and can contain an
endpoint placeholder value (<#endpoint#>
). For example:
{
"name": "Program 08",
"endpoint": "<#endpoint#>/project08",
"src": [
"P8.ipynb",
"imgfilter.py"
],
"tests": {
"tests/q*.py": "ok_test"
},
"protocols": [
"file_contents",
"grading",
"backup"
]
}
This allows for course instructors to distribute a single template file per project/lab, and for students to fill in the endpoint for their specific instructor (allowing multiple instructors to have different endpoints).
The endpoint options should be JSON-encoded and can be distributed in a file
named .options
or will be pulled from the following url:
https://storage.googleapis.com/comp141-public/options.json.
When a student runs the ok_login
code...
from cs1.notebooks import *
ok_login('p8.ok')
they will be prompted to select an endpoint and the file p8.ok
will be dropped
in the directory with the correct endpoint.
Endpoint selection is cached and will be used to automatically populate future template assignments (e.g., this needs to be done once per semester).
See this PR for more info.
- It is preferable to use the
cs1.notebooks
wrapper for a lot of things, it automates logging the student in if they are logged out. - Use the lock cells feature to lock cells you don't want students to edit.
- Use dependent cell execution if you want to make sure that cells are executed in a certain order (e.g., an earlier cell is a precondition for a later one).
-
Install
gcloud
via its install page and log in to the Rhodes CS project.gcloud init gcloud config set project rhodes-cs gcloud auth login
You should have received an email when I added you to the Rhodes GCP project. Use that email to log in.
-
Install
kubectl
by runninggcloud components install kubectl
. -
Generate a
kubeconfig
entry for the JupyterHub cluster:gcloud container clusters get-credentials jupyter --zone=us-central1-a
This will allow you to control the cluster with the
kubectl
command line tool. You can read the explanation here. -
Give your account permissions to perform all admin actions necessary by running
./scripts/cluster_permissions.sh your-google-account
(one time). -
Install Docker.
-
Run
gcloud auth configure-docker
in order to be able to push togcr.io
. -
Install Helm following the instruction here, or on MacOS, run
brew install helm
if you are using homebrew.
config/confg.yaml
contains the config for the JupyterHub instance. Any
configuration changes for the cluster should modify this file.
Note that your helm version must be at least 3.2 in order to be compatible with the current JupyterHub helm chart.
-
One time set up: Make Helm aware of the JupyterHub Helm chart repo:
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
-
When you want to update the config, pull the latest changes from the helm repo.
helm repo update
-
Tokens: No secrets are stored in this repo; all tokens should be stored elsewhere in a
values.yaml
file. The path to this file is a parameter tohelm_upgrade.sh
(the next step). -
Now, to push changes to
config.yaml
to the cluster, runhelm upgrade
using the included script:./scripts/helm_upgrade.sh -s=path/to/secrets.yaml -d
Note that the
-d
flag indicates tha this is a dry run and only the config will be printed. To actually run the update, remove-d
.This is the step that actually installs JupyterHub, so it might take a little bit.
Warning: This will create a brief outage where the service is unavailable, so don't make config changes frequently.
As you can see in config/config.yaml
under the singleuser
node, we use a
custom Docker container with tools installed that we want.
The Docker container used for COMP141 user servers is defined in
config/Dockerfile.comp141
. Any changes to the user's image must be made by
editing this file.
Make sure that you have installed Docker.
In config/Dockerfile.comp141
, make the desired changes.
Then, when you want to build and push the image, run:
./scripts/docker_build.sh comp141
This script will do the Docker build and prompt whether you want to push the newly-built image.
If you choose to test locally and push later, you can run the following to continue the push.
./scripts/docker_push.sh comp141
Important: Don't forget to update config.yaml
with the new version tag and
run helm_upgrade.sh
to push the config.
You can confirm that the image was updated with docker image ls
.
To test the container image locally, you can run the Docker image with the
following (the -p
flag forwards the container's port 8888 to the local port
8888):
docker run -p 8888:8888 nb_server_comp141
If you want to do development of course materials or libraries that are stored locally, you can mount your local filesystem to the user home directory on the locally-running container image with the following.
docker run -p 8888:8888 -v /path/to/dir/to/mount:/home/jovyan/work nb_server_comp141
If you are developing libraries and experimenting with them in a notebook, it is helpful to auto-reload them on changes:
%load_ext autoreload
%autoreload 2
Note: It is preferrable to use the scripts scripts/
for building and
pushing Docker images! Only do this if you know what you're doing.
To build the image:
docker build -t nb_server_comp141 -f config/Dockerfile.comp141 config/
Next we need to publish the image to the Rhodes container registry, so that Kubernetes will start pulling the new image.
If you haven't configured Docker for GCP container registry, run the following:
gcloud auth configure-docker
Now, choose a release candidate version for the new image. The version should be
of the form YYYY_MM_DDrcVV
where VV
is used for multiple versions per day
(start with 00).
export RC_VERSION=YYYY_MM_DDrcVV
docker tag nb_server_comp141:latest gcr.io/rhodes-cs/nb_server_comp141:$RC_VERSION
docker image ls
docker push gcr.io/rhodes-cs/nb_server_comp141:$RC_VERSION
Now you should see the container when you run gcloud container images list
.
Additionally, you should see the the new release candidate tag when you run
gcloud container images list-tags gcr.io/rhodes-cs/nb_server_comp141
:
gcloud container images list-tags gcr.io/rhodes-cs/nb_server_comp141
DIGEST TAGS TIMESTAMP
48d50b385e28 2020_02_16rc01,latest 2021-02-16T15:54:46
fde1b612abad 2021-02-14T18:06:17
e05babc291c6 2021-01-24T21:32:16
6d4d44ff86d5 2020-12-19T18:13:00
If you would like to create additional images for students to use (e.g., for
different courses, create a new Dockerfile in config/
with the appropriate
suffix.
Then, follow the same procedure for building/deploying above, but use the new suffix.
The first time this is done, you will have to follow the TODO in
config.yaml
using the instructions
here.
In the scripts/tools/
directory there is a script for culling idle profiles.
This can be used to delete users between semesters. Culling a user will also
clean the PVC, which in turn will enable Kubernetes to reap their disk (so,
removing a user eventually remove their storage).
The general outline for updating between semesters is the following:
- Create a new monorepo for course content.
- Update the JupyterHub config to pull the new monorepo on server start.
- Update the Helm chart to pickup any new release (optional).
- Rebuild the Docker image to pick up any library updates.
- Push your changes to deploy.
- Update the public
options.json
file.
#44 is an example of an update for a new semester.
The current method for delivering content is to have a monorepo that is synced upon server startup (see the distributing files section for discussion of how this works).
For a new semester, we create a single monorepo for the semester (e.g., summer
'22, spring
'22, etc.). The only
config change required for this is to update postStart
lifecycle hook in
config/config.yaml
to pull the new repo.
The JupyerHub Helm chart
repository lists the
stable and dev releases. To update the hub version, select a version, follow the
instructions for configuring your
environment to sync the
chart repo and then update the --version
flag in scripts/helm_upgrade.sh
with the desired version.
Follow the instructions here to configure and push the Docker image.
Note that this pulls in the latest release of jupyter/scipy-notebook
, so
performing this regularly keeps the actual Jupyter version up to date. It also
pulls in any security updates for the underlying Ubuntu container. See the
Docker stacks
documentation for
details on the Jupyter Docker images.
Make sure to bump the version number in config/config.yaml
to select the new
image release.
Run scripts/helm_upgrade.sh -s=PATH_TO_SECRETS.yaml
to deploy!
The options.json
file lists the okpy paths where student submissions should be sent. It can be viewed publicly at [https://storage.googleapis.com/comp141-public/options.json]. It needs to be updated to show the current instructors of the course, and to use the current semester.
Example contents:
{
"Cowan": "rhodes/cs141-cowan/fa24",
"King": "rhodes/cs141-king/fa24",
"Welsh": "rhodes/cs141-welsh/fa24"
}
Process for updating:
- Download the current
options.json
file:gcloud storage cp gs://comp141-public/options.json .
. - Edit the downloaded
options.json
file. - Upload the modified
options.json
file:gcloud storage cp options.json gs://comp141-public/options.json
. - Check that
options.json
has been updated by navigating to [https://storage.googleapis.com/comp141-public/options.json].
You can use the kubectl
command line tool to view the cluster state, or you
can use the GCP
UI. In the
jupyter
cluster, you can view the provisioned nodes, and persistent disks.
kubectl
has a few verbs that one should know in order to view and debug a
cluster:
kubectl get [api-resource]
-- list objects of typeapi-resource
(usekubectl api-resources
to list them all). The main ones for our purposes areservice
,deployment
,pod
, andnode
.kubectl describe [api-resource] [resource-name]
-- view details for the particular resource (or all resources of the given type, if no resource name is given), including events.kubectl logs [pod-name]
-- view logs (STDOUT) for a particular pod.kubectl exec [pod-name] -- [command]
-- run a command within the given pod. Add the-ti
flags toexec
to run an interactive command (e.g.,bash
).kubectl proxy
-- open a proxy to the cluster to access the cluster API and the API of all resources (through http verbs).
[lang@eschaton ~]$ kubectl -n jhub get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.3.245.111 <none> 8081/TCP 4h19m
proxy-api ClusterIP 10.3.251.117 <none> 8001/TCP 4h19m
proxy-http ClusterIP 10.3.241.81 <none> 8000/TCP 73m
proxy-public LoadBalancer 10.3.243.111 35.225.189.212 443:31835/TCP,80:32140/TCP 4h19m
Services are
conceptual APIs with an endpoint, and are how different applications within the
same k8s cluster can find and communicate with one another. A service can be
backed by one or more containers/pods. In our deployment, the only services are
those for the hub itself (pointing to the hub
deployment) and the JupyterHub
API and HTTP proxies that proxy traffic for a particular user to a particular
server (both pointing to the proxy
deployment).
Anything that will be exposed to the outside world requires a service. Here, only the public proxy is exposed publically. This proxy will route HTTPS to internal services (i.e., hub traffic to the hub and user traffic to user servers).
The proxy-public
service points to autohttps
, which is running
Traefik to route requests to internal services.
Note that services no not correspond 1:1 with pods or even deployments. Services use labels and selectors to match pods.
[lang@eschaton ~]$ kubectl -n jhub get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
autohttps 1/1 1 1 73m
hub 1/1 1 1 4h20m
proxy 1/1 1 1 4h20m
user-scheduler 2/2 2 2 4h20m
A deployment is a set of replicas of pods. In this configuration, we have one replica in each deployment, except for two pods for the user to node scheduler, which are running leader election.
A pod is an instance of a
running container (or set of grouped containers). Each pod is assigned a single
IP address and can be thought of conceptually as a single virtual machine. Pods
run on "physical" VMs called
nodes, with multiple
pods per node. You can see the restrictions on co-residency in config/config.yaml
.
[lang@eschaton ~/dsrc/jupyter]$ kubectl -n jhub get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
autohttps-b59c8d84f-8hnzz 2/2 Running 0 43h 10.0.0.25 gke-jupyter-default-pool-23a4322a-qdpl <none> <none>
continuous-image-puller-4hjxt 1/1 Running 0 2d 10.0.1.3 gke-jupyter-user-pool-175a3536-chnf <none> <none>
hub-6cc79f5bb-wp794 1/1 Running 0 45h 10.0.0.22 gke-jupyter-default-pool-23a4322a-qdpl <none> <none>
jupyter-langma-40gmail-2ecom 1/1 Running 0 2m34s 10.0.1.13 gke-jupyter-user-pool-175a3536-chnf <none> <none>
proxy-5784b6b988-568ct 1/1 Running 0 2d 10.0.0.14 gke-jupyter-default-pool-23a4322a-qdpl <none> <none>
user-placeholder-0 1/1 Running 0 2d 10.0.1.4 gke-jupyter-user-pool-175a3536-chnf <none> <none>
user-placeholder-1 1/1 Running 0 2d 10.0.1.2 gke-jupyter-user-pool-175a3536-chnf <none> <none>
user-placeholder-2 1/1 Running 0 2d 10.0.1.5 gke-jupyter-user-pool-175a3536-chnf <none> <none>
user-scheduler-7588c8977d-9n9z4 1/1 Running 0 2d 10.0.0.13 gke-jupyter-default-pool-23a4322a-qdpl <none> <none>
user-scheduler-7588c8977d-v4knn 1/1 Running 0 2d 10.0.0.12 gke-jupyter-default-pool-23a4322a-qdpl <none> <none>
Note the -o wide
flag. This shows what nodes each pod is running on, as well
as their IP addresses.
You can see from the response that there is one Jupyter server running (for
[email protected]
). The hub
pod is the JupyterHub server. autohttps
and
proxy
are the SSL certificate creator and network proxy. user-scheduler
has
two instances (running leader election) and will map users to nodes (VMs). The
user-placeholder
pods are dummy containers that ensure that there are free
slots for new arrivals.
Pod state and logs:
If you want to view details about a pod, including recent events, you can run
kubectl describe
to view pod details:
kubectl -n jhub describe pod <pod name>
The describe
command can also be used with services and other resources.
The recent events listed by describe are useful for finding errors. To dig
deeper, you cna use kubectl logs
to view logs (or use the Logs UI in GCP).
In our configuration, users have a GCE Persistent Disk that is their home directory. Persistent disks can be attached to any VM running in GCP and then mounted onto the Linux filesystem.
When a user first logs in, a 1G storage claim (persistent volume
claim) is
created, and the Kubernetes master creates a persistent disk to satisfy that
claim. The size of the claim is configured in singleusesr/storage/capacity
in
config/config.yaml
.
When a pod is created for the user's server, the disk is attached to the node running the pod, and the volume is mounted as the user's home directory.
[lang@eschaton ~]$ kubectl -n jhub get persistentvolumeclaim
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
claim-langma-40gmail-2ecom Bound pvc-8df5bbf0-e119-461d-90d1-51cfd81168e5 1Gi RWO standard 6m49s
hub-db-dir Bound pvc-58ed8e5b-6b4e-45c2-aef1-7c70b89c3a35 1Gi RWO standard 4h23m
[lang@eschaton ~]$ kubectl -n jhub get persistentvolume
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-58ed8e5b-6b4e-45c2-aef1-7c70b89c3a35 1Gi RWO Delete Bound jhub/hub-db-dir standard 4h23m
pvc-8df5bbf0-e119-461d-90d1-51cfd81168e5 1Gi RWO Delete Bound jhub/claim-langma-40gmail-2ecom standard 6m55s
Here you can see that since only I have logged in, there are only two claims: one for the JupyterHub user and one for me. They are backed by two GCE persistent disks.
When volume claims are deleted, so are the corresponding disks (reclaim policy).
Note that right now there is not a great way to delete PVCs when a user is
deleted. There is a
discussion
about this and issues tracked
here and
here.
As of 12/30/21, we have deployed a new JupyterHub version that includes a hook
for deleting PVCs when a user is deleted. Users can be deleted via the UI or via
the tool in scripts/tools
(see the section on culling
users).
You can view logs in two ways: using kubectl
or using the Cloud Log
viewer.
To view the logs for a particular pod, you can list pods, and then get its logs:
kubectl get pods -n jhub
kubectl logs <pod name> -n jhub
Note: Logs for the post-startup script are not collected, since these are
commands run inside the pod, after it has started. The logs for the pip install
of the COMP141 libraries and nbgitpuller
are redirected to
/tmp/startup_logs
which can be viewed by logging into the student's server.
These do not persist across pod restarts.
See this issue for
context. If postStart
fails, this would be published as an event for the pod;
however, we swallow failures in our startup script, and therefore no events are
published.
The easiest way to administer user servers is via the admin page. This allows you to log in to user servers, kill and restart them, etc.
Once you've logged in to a user server, you can run a terminal through the JupyterHub interface.
If, however, you want to ssh into their pod without going through their
Jupyter server, you can execute commands on the pod using kubectl exec
.
For example, to run an interactive bash shell, run:
kubectl -n jhub exec -it [pod name] /bin/bash
Use --
to run command with flags:
[lang@eschaton ~]$ kubectl -n jhub exec jupyter-langma-40gmail-2ecom -- ls -lh
total 20K
drwxrws--- 2 root users 16K Dec 23 00:48 lost+found
-rw-r--r-- 1 jovyan users 72 Dec 24 21:21 Test.ipynb
It is possible for a user to make changes to their home directory in a way that prevents their server from starting. If, for example, they prevent nbgitpuller from running successfully, this may cause the startup script for the server to fail. This is tracked in #5.
In that case, you might want to access a user's file system. In order to do this, you will need to create a temporary virtual machine, attach the user's home directory to it, and modify their file system.
First, find the disk that corresponds to the user:
[lang@eschaton ~/courses/141]$ kubectl -n jhub get pvc | grep langma
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
claim-langma-40gmail-2ecom Bound pvc-8df5bbf0-e119-461d-90d1-51cfd81168e5 1Gi RWO standard 30d
The pvc-<GUID>
is the name of the user's disk. To access it, create a VM
instance and then follow the procedure for adding a disk to the VM (outlined
here.
Note that in those instructions, you will be attaching a new disk, but in our
case, you will want to choose an existing disk, and use the disk name you found
above.
After this step, you can ssh into the new VM and mount/edit the disk.
$ sudo su
$ mkdir debug
$ mount /dev/sdb debug
$ cd debug
$ ...
kubectl -n jhub get deployment
kubectl -n jhub rollout restart deployment <deployment name>
If user servers cannot schedule themselves, users will see an error message indicating that there are not enough nodes to schedule their pod. When this happens, the user pool needs to be scaled up. Autoscaling should take care of this normally, but if manual intervention is required, you can manually scale the pool up.
Using gcloud
:
gcloud container clusters resize \
jupyter \
--node-pool user-pool \
--num-nodes [new size] \
--zone us-central1-a
It will take a few moments for the pool to scale up.
Using the UI:
- Go to the User Pool config in the Cloud Console.
- Click "EDIT" and manually change the "Number of nodes" field to a higher number.
- Click "Save." It will take a few moments for the node pool to scale up.
Both the GCP page and the Cloud Monitoring dashboard has telemetry data about the cluster.
There is also a dashboard monitor that was installed alongside the cluster with the following:
# Install dashboard
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml
# Create admin service account for dashboard, and give it admin permissions
kubectl create serviceaccount admin-user -n kubernetes-dashboard
kubectl create clusterrolebinding dash-cluster-admin-binding \
--clusterrole=cluster-admin \
--serviceaccount=kubernetes-dashboard:admin-user
To view and login to the dashboard, run the following:
kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret | grep admin-user | awk '{print $1}')
Copy the token
from the results. This is the bearer token of the dashboard
service account.
Then run kubectl proxy
to create a gateway between your local machine and the
Kubernetes API server. The dashboard is accessible at
http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/.
Use the token you copied to log in. Select the namespace jhub
from the
namespace drop down to view the JupyterHub telemetry.
There is Python code to remove users (and PVCs) under scripts/tools/cull_users
.
To run, you will need to grant yourself an OAuth token to use for the script to take action on your behalf. From the token page click "Request new API token" and copy the token that is generated.
Then, to run the script, first run in dry run mode (default), to see that the users that will be deleted are expected:
pipenv run python cull_users.py --token YOUR_OAUTH_TOKEN
Then, run the script in non-dry run mode:
pipenv run python cull_users.py --token YOUR_OAUTH_TOKEN --no_dry_run
The Zero-to-JupyterHub docs have information about debugging, and other FAQ topics.
The initial cluster setup was done using the Zero to JupyterHub with Kubernetes guide.
Outside of the guide, there are a few cloud resources that I manually set up.
- Appropriate APIs are already enabled for the project (GKE, logging, containers, etc.).
- There's a static external IP address provisioned that is in use by the
cluster's proxy server. This is configured in
config/config.yaml
. You can see this here.
- Create the Kubernetes cluster by running
./scripts/create_cluster.sh
. - You can verify that the cluster exists by running
kubectl get node
- Create a pool of user nodes by running
./scripts/create_node_pool.sh
(you might have to update your Cloud SDK to install beta components).
The cluster creation script enrolls the cluster in the GKE stable
release
channel, which will cause the cluster to receive automatic version updates from
GCP. This in turn results in the node pools receiving image version updates.
- If a user deletes a file, then the file is deleted from the
nbgitpuller
origin repo, this can make a user's server inaccessible until the conflict is fixed. Can be mitigated by switching from a single repo to per-assignment repos (at least then the server is accessible).- Tracked in issue #5.
- Mitigation is to fix the user directory/repo (see this section).