In our environment, Artificial Intelligence (AI) clusters serve workflows where development flexibility and isolation are paramount and tasks are not tightly coupled. These problem sets (highly concurrent vector operations on independent data subsets) are typically well suited to GPU platforms, which accelerate them through their SIMD-style execution model.
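To illustrate the workload shape, the sketch below uses JAX (chosen here only as an example framework; any GPU-capable array library would do) to apply the same per-subset operation across a batch of independent data subsets. When a GPU is visible, the batched computation maps naturally onto the device's SIMD lanes:

```python
# Minimal sketch of a highly concurrent vector operation over independent
# data subsets. JAX is assumed purely for illustration; it dispatches to a
# GPU automatically when one is available.
import jax
import jax.numpy as jnp

def normalize(row):
    # Per-subset work: each row is rescaled independently of every other row.
    return (row - row.mean()) / (row.std() + 1e-8)

# 10,000 independent data subsets of 512 elements each (synthetic data).
data = jax.random.normal(jax.random.PRNGKey(0), (10_000, 512))

# vmap vectorizes across subsets; jit compiles the batch into one kernel
# that the hardware executes in SIMD/SIMT fashion.
batched_normalize = jax.jit(jax.vmap(normalize))
result = batched_normalize(data)
print(result.shape, jax.devices())  # confirms output shape and backing device
```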
On these clusters we run either the SLURM scheduler or a Kubernetes orchestrator over an underlying container platform managed via Run:AI, a combined resource manager and scheduler. Run:AI manages jobs and CPU/GPU resources, scheduling them across time and against resource characteristics such as available memory, CPU cores, and GPU cores (whole and fractional).
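The toy sketch below, which is not Run:AI's actual algorithm, illustrates what it means to match a job's requirements against those resource characteristics, including fractional GPU requests. Node and job values are made up for the example:

```python
# Toy first-fit placement sketch -- NOT Run:AI's real scheduler -- showing
# how jobs can be matched against memory, CPU cores, and whole/fractional
# GPU capacity.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    mem_gb: float     # free memory
    cpu_cores: float  # free CPU cores
    gpus: float       # free GPU capacity; 0.5 means half a GPU is free

@dataclass
class Job:
    name: str
    mem_gb: float
    cpu_cores: float
    gpus: float       # may be fractional, e.g. 0.25 of one GPU

def place(job: Job, nodes: list[Node]) -> str | None:
    """Return the first node with enough free resources, reserving them."""
    for node in nodes:
        if (node.mem_gb >= job.mem_gb
                and node.cpu_cores >= job.cpu_cores
                and node.gpus >= job.gpus):
            node.mem_gb -= job.mem_gb
            node.cpu_cores -= job.cpu_cores
            node.gpus -= job.gpus
            return node.name
    return None  # a real scheduler would instead queue the job for later

nodes = [Node("gpu01", mem_gb=256, cpu_cores=64, gpus=4.0)]
for job in [Job("train", 64, 16, 1.0), Job("infer", 8, 2, 0.25)]:
    print(job.name, "->", place(job, nodes))
```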
Containers decouple the application from the host OS (we currently support Rocky Linux, a derivative of Red Hat Enterprise Linux). Users can therefore vary libraries and other package loadouts, run on an entirely different Linux guest OS if the need arises, and pin the whole software environment to specific versions validated against their application, yielding consistent, repeatable results.
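Since repeatability hinges on those pinned versions, a containerized job might begin with a sanity check like the sketch below. The package names and versions are hypothetical placeholders, not the contents of any actual image:

```python
# Hedged sketch: a startup check a containerized job could run to confirm
# the guest OS and pinned library versions match what was validated.
import platform
import sys
from importlib import metadata

EXPECTED = {"numpy": "1.26.4"}  # hypothetical pins baked into the image

try:
    os_name = platform.freedesktop_os_release().get("PRETTY_NAME", "unknown")
except OSError:
    os_name = platform.platform()  # fallback if /etc/os-release is absent

print("guest OS :", os_name)
print("python   :", sys.version.split()[0])

for pkg, want in EXPECTED.items():
    have = metadata.version(pkg)
    status = "OK" if have == want else f"MISMATCH (expected {want})"
    print(f"{pkg:8s}: {have} {status}")
```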