What is it?
Run:AI is a job scheduler which is somewhat similar to SLURM or Sun Grid Engine. It manages a cluster, which is a set of computers (individually called “nodes”), and distributes jobs to them. It makes sure to only start jobs on systems which have the necessary resources, at least if the person submitting the job specifies what is needed. It can also leave jobs in a queue and then run them when enough resources are available.
Instead of running data analysis software directly, Run:AI runs “containers”. Containers are often described as “packages of software that contain all of the necessary elements to run in any environment”. They are a bit like very lightweight VMs. The idea is to find or create a container image which has the data analysis software which you want to use and then do the analysis in a container created from that image.
There are freely available images for several popular programs such as Jupyter Notebooks, TensorFlow, and PyTorch. More specialized images mentioned in research papers may be available for download. Some images just supply a basic operating system (e.g. Ubuntu 22.04) and expect you to install your own software. Before you can do any real work in Run:AI you need to figure out which image you will use.
In our clusters, Run:AI runs on top of Kubernetes, which in turn uses the containerd container runtime (originally developed by Docker). If you are familiar with Kubernetes then learning Run:AI should be easy. Many Run:AI commands are intentionally similar to Kubernetes “kubectl” commands, and in fact you may still want to use kubectl at times. If you are familiar with Docker and its “docker” command-line tool then some of the same concepts apply to Run:AI and some commands are similar. However, the “docker” command and the “dockerd” server do not work with our clusters and are not installed on most nodes. They may be installed on some dedicated development nodes to make it easier for you to build your own container images.
There are two interfaces to a Run:AI cluster. One is a command-line program called “runai”, the other is a web interface.
To use the “runai” program you should SSH to a cluster login server and run it there. It is important that you use the “runai” command from the login node because we keep that version of the “runai” command in sync with the version of Run:AI we are using. You can use this command to do things such as start and stop jobs, suspend and resume jobs, attach to containers so that you are effectively logged into the container, and execute commands in a container without attaching. If you want to do any scripting or wish to do anything which requires the kubectl command then you should use the command-line interface.
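As a sketch of what a session from the login node might look like, here are some common “runai” subcommands. The job name “dev1” and the image are just examples; run “runai help” on the login node for the authoritative command list for our installed version.

```shell
# Submit an interactive job named dev1 (image name is just an example)
runai submit dev1 -i ubuntu:22.04 --interactive

# Show your jobs and their status
runai list

# Open a shell inside the running container (attach to it)
runai bash dev1

# Run a single command in the container without attaching
runai exec dev1 -- nvidia-smi

# View the job's console output
runai logs dev1

# Remove the job when finished
runai delete job dev1
```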
Currently the only cluster login server for the Locust Cluster is locust-login.seas.upenn.edu
To use the web interface you should connect to https://clustername.run.ai substituting the name of the cluster for “clustername” and log in at that site. The web-based interface has some nice monitoring tools (based on Grafana) and includes pretty much the same features as the runai command. The web interface also has a “clone job” button, an easy way to make job templates, plus an easy way to connect to containers which run web-based tools (like Jupyter notebooks).
Currently the web interface for the Locust Cluster is at https://locust.run.ai
Running an analysis using the command line
The command to submit a job to Run:AI is “runai submit”. Additional arguments to this command should include a name for the job, the name of the container image you want to run, and other arguments such as “--interactive” for a job which you want to be able to log into. Example:
runai submit test1 -i gcr.io/run-ai-demo/quickstart -g 0.5
This would submit a job named “test1” which uses the image “gcr.io/run-ai-demo/quickstart” and half of one GPU.
runai list
This would list the jobs which you have running.
runai delete job test1
This would delete the “test1” job. The running container would end and then be removed. It would no longer appear in “runai list”.
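If a job sits in the queue or behaves unexpectedly, one way to get more detail is the “describe” subcommand (shown here with the “test1” job from the example above):

```shell
# Show detailed status, resource requests, and recent events for a job
runai describe job test1
```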
Three types of jobs
There are three types of Run:AI jobs.
- Interactive — Jobs which you will log into, either in a shell or in an interface like a Jupyter Notebook.
- Training — Jobs which just run in the background doing calculations. Called “training” because they are often used to train neural networks.
- Inference — Specialized jobs which take a model (usually a neural network created with a training job), feed it some data, and output the results.
Most people will only use Interactive and Training jobs. The default type of job is Training. If you want to run an Interactive job then you need to add “--interactive” to the “runai submit” command.
When you start a job you should tell it how many CPUs and GPUs and how much memory you want the job to have. The defaults are zero GPUs, 0.1 CPU, and 100 MiB of memory. If you want to request more resources then you should add these flags to your “runai submit” command:
--cpu The minimum number of CPUs required to run the job
--cpu-limit The maximum number of CPUs to use
--gpu The exact number of GPUs to use
--memory The minimum amount of memory required to run the job
--memory-limit The maximum amount of memory to use
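Putting those flags together, a training job that needs at least four CPUs, one GPU, and 16 GiB of memory might be submitted like this (the job name and image are hypothetical, and the exact quantity syntax should be checked against “runai submit --help” on the login node):

```shell
runai submit train1 -i pytorch/pytorch \
  --gpu 1 \
  --cpu 4 --cpu-limit 8 \
  --memory 16G --memory-limit 32G
```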
See the Allocation of CPU and Memory documentation at the Run:AI website for more information.
To request a specific type of GPU use the --node-type flag followed by the type of GPU. The type is the marketing model number with only lowercase letters. For example: a5000, a6000, a100, or l40.
It is absolutely vital to request exactly the number of GPUs which you want to use. For CPU and memory, please request as much as you need using --cpu and --memory, and no more than that if possible. If the cluster is busy then the fewer resources you request the sooner your job will get out of the queue and start running.
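For instance, a job that specifically needs an A100 could be submitted like this (the job name and image are again just examples):

```shell
runai submit train-a100 -i pytorch/pytorch --gpu 1 --node-type a100
```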
Example: Jupyter Notebooks
To run a Jupyter Notebook you should find an appropriate image, then start a job while adding the “--jupyter” flag. Also add “--interactive” since it is an interactive job.
runai submit --name jupyter-1 -i jupyter/scipy-notebook --interactive --jupyter
Running this command will create two things: a container running a Jupyter Notebook on port 8888, and a Service which lets you access port 8888 on the container. You can find the IP and port using a kubectl command:
kubectl get service
NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
jupyter-1   ClusterIP   10.105.90.93   <none>        8888/TCP   5s
This will print out all of the services which you have access to. One will have the same name as your notebook. Use the IP address and port listed for that service when connecting to the Jupyter Notebook. Note that the IP will be different every time and that different Jupyter Notebook images might use ports other than 8888.
As it happens the IP will be internal to the cluster, so you will need an SSH tunnel. Let’s say that the gateway login server for the cluster is called cluster-login.seas.upenn.edu. From a Mac or Linux computer you would set up an SSH tunnel like this:
ssh -L 28888:10.105.90.93:8888 username@cluster-login.seas.upenn.edu
Then if you connect a web browser to http://localhost:28888 you will get to your Jupyter Notebook. Note that port 28888 is arbitrary; you just have to choose a port which is not being used on your computer. The IP and the other port are the ones from “kubectl get service”. Of course username is your username on the cluster, probably the same as your UPenn PennKey.
Kubernetes experts might know about NodePorts and PortForwards. These can also be used but they are more complicated and require keeping track of which ports are being used on various systems in the cluster so we do not recommend using them.