What is it?
Run:AI is a job scheduler, somewhat similar to Slurm or Sun Grid Engine. It manages a cluster, which is a set of computers (individually called “nodes”), and distributes jobs to them. It only starts a job on a node which has the necessary resources, provided the person submitting the job specifies what is needed, and it can hold jobs in a queue and run them when enough resources become available.
Instead of running data analysis software directly, Run:AI runs “containers”. Containers are often described as “packages of software that contain all of the necessary elements to run in any environment”; they are a bit like very lightweight VMs. The idea is to find or create a container image which contains the data analysis software you want to use and then do the analysis in a container created from that image.
Freely available images exist for several popular programs such as Jupyter Notebooks, TensorFlow, and PyTorch. More specialized images mentioned in research papers may also be available for download. Some images supply just a basic operating system (e.g. Ubuntu 22.04) and expect you to install your own software. Before you can do any real work in Run:AI you need to decide which image you will use.
In our clusters, Run:AI runs on top of Kubernetes, which in turn uses the containerd container runtime (originally developed by Docker). If you are familiar with Kubernetes then learning Run:AI should be easy: many Run:AI commands are intentionally similar to Kubernetes “kubectl” commands, and you may still want to use kubectl at times. If you are familiar with Docker and its “docker” command-line tool then some of the same concepts apply to Run:AI and some commands are similar. However, the “docker” command and the “dockerd” server do not work with our clusters and are not installed on most nodes. They may be installed on some dedicated development nodes to make it easier for you to build your own container images.
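For example, if you want to look at your jobs at the Kubernetes level, something like the following should work. Run:AI normally places each project’s jobs in a namespace named after the project; “myproject” below is a placeholder for your own project name:
kubectl get pods -n runai-myproject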
Logging in
There are two interfaces to a Run:AI cluster. One is a command-line program called “runai”; the other is a web interface.
To use the “runai” program, SSH to a cluster login server and run it there. It is important to run “runai” on the login node because we keep that copy of the command in sync with the version of Run:AI the cluster is running. You can use this command to do things such as start and stop jobs, suspend and resume jobs, attach to containers so that you are effectively logged into the container, and execute commands in a container without attaching. If you want to do any scripting, or anything which requires the kubectl command, then you should use the command-line interface.
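For example, assuming you have a job named “test1”, commands along these lines cover the day-to-day operations just described (the exact syntax can vary between runai versions; run runai help on the login node to check):
runai attach test1              # attach your terminal to the running container
runai bash test1                # open a shell inside the container
runai exec test1 -- nvidia-smi  # run a single command without attaching
runai suspend job test1         # pause the job
runai resume job test1          # resume it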
Currently the only cluster login server for the Locust Cluster is locust-login.seas.upenn.edu
To use the web interface, connect to https://clustername.run.ai (substituting the name of the cluster for “clustername”) and log in at that site. The web interface has some nice monitoring tools (based on Grafana) and offers essentially the same features as the runai command. It also has a “clone job” button, an easy way to make job templates, and an easy way to connect to containers which run web-based tools (like Jupyter notebooks).
Currently the web interface for the Locust Cluster is at https://locust.run.ai
Running an analysis using the command line
The Basics
The command to submit a job to Run:AI is runai submit. Additional arguments to this command should include a name for the job, the name of the container image you want to run, and other options such as --interactive for a job which you want to be able to log into. Example:
runai submit test1 -i gcr.io/run-ai-demo/quickstart -g 0.5
This would submit a job named “test1” which uses the image “gcr.io/run-ai-demo/quickstart” and half of one GPU.
runai list
This would list the jobs which you have running.
runai delete job test1
This would delete the “test1” job. The running container would end and then be removed. It would no longer appear in runai list.
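Two other commands that are useful from the start:
runai describe job test1   # detailed status, including why a job is still pending
runai logs test1           # show the job's console output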
Three types of jobs
There are three types of Run:AI jobs.
- Interactive — Jobs which you will log into, either in a shell or in an interface like a Jupyter Notebook.
- Training — Jobs which just run in the background doing calculations. Called “training” because they are often used to train neural networks.
- Inference — Specialized jobs which take a model (usually a neural network created with a training job), feed it some data, and output the results.
Most people will only use Interactive and Training jobs. The default type of job is Training. If you want to run an Interactive job then you need to add --interactive to the runai submit command.
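As a sketch, the same image could be submitted both ways (the image name and script path here are placeholders):
runai submit train1 -i ubuntu:22.04 --command -- python3 /path/to/train.py
runai submit dev1 -i ubuntu:22.04 --interactive --command -- sleep infinity
The sleep infinity command simply keeps the interactive container alive so that you can log into it, for example with runai bash dev1.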
Reserving resources
When you start a job you should tell Run:AI how many CPUs and GPUs and how much memory the job needs. The defaults are zero GPUs, 0.1 CPU, and 100 MiB of memory. If you want to request more resources then add these flags to your “runai submit” command:
--cpu The minimum number of CPUs required to run the job
--cpu-limit The maximum number of CPUs to use
--gpu The exact number of GPUs to use
--memory The minimum amount of memory required to run the job
--memory-limit The maximum amount of memory to use
See the Allocation of CPU and Memory documentation at the Run:AI website for more information.
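For example, a request for one GPU, four to eight CPUs, and 16 to 32 GiB of memory might look like this (the image and script are placeholders; memory sizes take suffixes such as M and G):
runai submit train2 -i tensorflow/tensorflow:latest-gpu -g 1 --cpu 4 --cpu-limit 8 --memory 16G --memory-limit 32G --command -- python /workspace/train.py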
To request a specific type of GPU use the --node-type flag followed by the type of GPU. The type is the marketing model number in all lowercase letters, for example: a5000, a6000, a100, or l40.
It is absolutely vital to request exactly the number of GPUs which you want to use. As for CPU and memory, please request as much as you need using --cpu and --memory, and no more than that if possible. If the cluster is busy then the fewer resources you request, the more quickly your job will get out of the queue and start running.
Example: Jupyter Notebooks
To run a Jupyter Notebook, find an appropriate image, then start a job with the appropriate port-forwarding flags (see the example below). Also add --interactive since it is an interactive job.
runai submit --name jupyter-1 -i jupyter/scipy-notebook --interactive --service-type external-url,port=8888 -v ${HOME}:/home/jovyan --command -- start-notebook.sh --NotebookApp.base_url='/${RUNAI_PROJECT}/${RUNAI_JOB_NAME}' --NotebookApp.token=''
This command will start a JupyterLab instance in a container on a compute node, with your cluster home directory mapped to the default user’s home directory inside the container. Once submitted, select your job in the Run:AI web interface and click the “Connect” button to connect to your JupyterLab instance.
Kubernetes experts might know about NodePorts and PortForwards. These can also be used, but they are more complicated and require keeping track of which ports are in use on various systems in the cluster, so we do not recommend them.