Introducing SUNK: A Slurm on Kubernetes Implementation for HPC and Large Scale AI

SUNK, an implementation of Slurm on Kubernetes, will be made open-source in early 2024. Learn how it works.

In the HPC and AI world, there are two kings: Slurm for scheduling and orchestrating massively parallel jobs, and Kubernetes for running production applications like inference.

However, many companies face a choice between them, or end up managing two separate clusters.

Developed by SchedMD, the Slurm Workload Manager (commonly called Slurm) is the de facto scheduler for many HPC workloads, used by leading researchers, academics, and AI companies across the world. However, it’s designed for batch jobs with a finite lifespan.

Kubernetes, on the other hand, was built for long-running workloads like inference. As a result, the native Kubernetes scheduler isn’t as popular as Slurm for batch workloads and lacks some of its functionality. Other Kubernetes-based schedulers, such as Volcano and YuniKorn, aim to provide Slurm-like scheduling capabilities for Kubernetes but face an uphill battle trying to unseat the vast knowledge base built around Slurm.

Some companies try to bridge this gap by running both Slurm and Kubernetes, but operating the two side by side is a major challenge. Each cluster runs with its own pool of compute to manage and operate, sometimes in different clouds with separate storage, too.

While this lets companies separate workloads between Slurm and Kubernetes, resource management remains completely separate. Any communication or collaboration between the two requires manual work, which takes significant time and effort.

Two popular platforms. Two separate pools for compute.

Two separate solutions to manage and own.

To truly combine the strengths of both solutions, X-Cloud has been working on an implementation of Slurm on Kubernetes that effectively syncs the two.

Introducing SUNK (“SlUrm oN Kubernetes”)

SUNK is an open-source project (to be released in early 2024) that brings Kubernetes containerized deployments and Git Ops to Slurm and integrates a Slurm scheduler plugin to Kubernetes.

In essence, SUNK integrates Slurm as a Kubernetes scheduler and allows Slurm jobs to run inside Kubernetes. This creates a more seamless experience, supports both burst and batch workloads on the same central platform, and allows developers to leverage the resource management of Slurm on Kubernetes.

Managing Slurm and Kubernetes separately not only increases overall complexity, it also greatly reduces your flexibility to choose what kinds of workloads run across your compute. In other words, it’s harder to maximize utilization of GPU resources.

By deploying a Slurm cluster on top of Kubernetes with SUNK, both sides share the same pool of compute, and you have the flexibility to use that compute seamlessly from either Kubernetes or Slurm.

Two solutions. One platform. One pool of compute.

“One ring to rule them all” kinda vibe.

Why did X-Cloud choose to create this?

The simple answer: client efficiency. When you’re running very large and expensive HPC clusters, getting as close as possible to 100% utilization is critical. Any time the compute you are paying for sits idle is costly.

X-Cloud is built entirely on top of Kubernetes, where clients each have a single point of entry and management for their cluster. But we realized that many clients who preferred Slurm would manage it separately or ask us whether we had a Slurm integration.

X-Cloud is all about efficiency. It’s the reason why we say we’re purpose-built for GPU-intensive use cases.

We wanted to enable clients to leverage the benefits of Slurm while maintaining the integrity of our system and its ease of use (that is, no managing separate clusters). Since that solution didn’t exist, we decided to build it.

Features of SUNK

Configuration and deployment: By deploying a Slurm cluster as a set of Kubernetes resources, we can deploy it with a highly customizable Helm chart. This unlocks the large ecosystem of Kubernetes-based GitOps workflows and all the features that come with them. Other benefits include:

  • Easy tracking and configuration of prolog and epilog scripts
  • Quick deployment of staging clusters
  • Support for s6 scripts and services
  • Configurable authentication schemes, including LDAP through a companion OpenLDAP Helm chart or a third-party solution (Authentik, GAuth, etc.)
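As an illustrative sketch, a GitOps-style deployment of such a chart might look as follows. The chart reference and every values key shown here are hypothetical placeholders (the SUNK chart has not yet been released), but the pattern of tracking prolog/epilog scripts and partition definitions in a reviewed values file is the point:

```yaml
# values.yaml -- hypothetical schema; the released chart may differ
slurm:
  prolog:
    script: |            # prolog tracked in Git alongside the chart values
      #!/bin/bash
      echo "job ${SLURM_JOB_ID} starting on $(hostname)"
  epilog:
    script: |
      #!/bin/bash
      echo "job ${SLURM_JOB_ID} finished"
login:
  replicas: 2            # login nodes run as replicated Kubernetes pods
compute:
  partitions:
    - name: a100         # one partition per GPU node flavor
      replicas: 8
      gres: gpu:8
```

A staging cluster can then be stamped out with a standard `helm upgrade --install` against a staging values file, and every change to the cluster configuration goes through Git review and is trivially rolled back.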

Kubernetes Integration: Once you deploy SUNK, you get all the normal benefits from running on Kubernetes, like:

  • Fast scheduling
  • Containerization
  • High availability of control plane services
  • Dynamic node scaling
  • Resource management with requests and limits
  • Shared filesystem via PersistentVolumeClaim resources

This also includes a custom Slurm Kubernetes scheduler (for scheduling native Kubernetes workloads via Slurm scheduler), which enables you to dynamically shift a single pool of compute between Slurm jobs (bursty workloads) and Kubernetes workloads (serverless workloads).
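Kubernetes supports pluggable schedulers natively: a pod opts in by setting `spec.schedulerName`, and pods without it keep using the default scheduler, so both kinds of workloads can coexist on the same nodes. A sketch of how a native Kubernetes workload might be routed through a Slurm-backed scheduler (the scheduler name and image are hypothetical placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burst-job                        # hypothetical
spec:
  schedulerName: slurm-scheduler         # hypothetical name of the Slurm scheduler plugin
  containers:
    - name: worker
      image: example.com/worker:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```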

State management: By running on top of Kubernetes, you also get more control over state management, including:

  • Dynamic nodes with two-way syncing of state between Kubernetes and Slurm
  • Automatic topology generation
  • Support for Pyxis container execution
  • GRES support and auto-identification
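To make the last two items concrete, automatic topology generation and GRES auto-identification correspond to Slurm’s standard `topology.conf` and `gres.conf` files, which an operator would otherwise maintain by hand. A sketch of what generated output might look like (switch names, node names, and GPU details are hypothetical):

```
# topology.conf -- network fabric layout, so Slurm can place jobs topology-aware
SwitchName=leaf1 Nodes=gpu-node-[01-04]
SwitchName=leaf2 Nodes=gpu-node-[05-08]
SwitchName=spine Switches=leaf[1-2]

# gres.conf -- generic resources (here, GPUs) available on each node
NodeName=gpu-node-[01-08] Name=gpu Type=a100 File=/dev/nvidia[0-7]
```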