Quickstart - SkyShift with Slurm#
Congratulations on setting up SkyShift! You’re on your way to simplifying and enhancing your job management and scheduling on Slurm clusters. Let’s dive into how you can add, manage clusters, and submit jobs efficiently.
In this guide, we’ll cover the following topics:
Adding Slurm Cluster information in a simple YAML format
Adding and removing Slurm Clusters from SkyShift.
Creating and running SkyShift jobs on Slurm.
Prerequisites: - SkyShift API Server and Manager: Setup Guide.
Interfacing with Slurm Clusters#
Setting up the Config File#
With SkyShift, integrating a remote Slurm Cluster into your workflow is straightforward.
Copy the following YAML into ~/.skyconf/slurm_config.yaml
file:
# Example of Slurm cluster accessed through CLI.
SlurmCluster1:
interface: cli # Interfacing method SkyShift will use
access_config: # Fields needed for SkyShift to reach the Cluster
hostname: llama.millennium.berkeley.edu # SSH hostname of the Slurm node
user: mluo # SSH Username
ssh_key: ~/berzerkeley # Local path to a RSA key
# password: whatever
This defines a cluster with the following components:
SlurmCluster1
: name of the Slurm Cluster, this needs to be unique for each cluster, but can be whatever you choose.interface
: the method SkyShift will use to access the cluster. Currently only cli is supported.access_config
: the login/security parameters needed for SkyShift to access the Slurm cluster.hostname
: SSH hostname used on the Slurm node.user
: username on the Slurm node.ssh_key
: local absolute path to private SSH key for remote authentiation with the Slurm node.password
: in the advent the Slurm cluster does not support SSH key authentiation, this password field can be added for password authentication(not recommended).
Once these fields are populated with your Slurm cluster’s information, we can now attach it to SkyShift!
Attaching the Remote Slurm Cluster#
With the configuration file fully populated with the details needed to access the Slurm Cluster, let’s attach it to SkyShift.
Upon launch, SkyShift will automatically discover and attempt to attach all Slurm Clusters defined inside of slurm_config.yaml.
To launch SkyShift, run in skyshift/
:
./launch_skyshift.sh
If SkyShift is already running, attach the Slurm Cluster using the unique name given to it in the configuration step.
skyctl create cluster SlurmCluster1 --manager slurm
Checking Cluster Status#
Before deploying a job to the Slurm Cluster, let’s double check the status of your configured clusters, simply run:
> skyctl get clusters
You’ll see an output similar to the following, providing a snapshot of your clusters’ resources and their status for job provisioning:
NAME MANAGER RESOURCES STATUS
SlurmCluster1 slurm cpus: 520.0/600.0 READY
memory: 1235171.0/3868184.0 MiB
P100: 8.0/8.0
cluster3 k8s cpus: 1.83/2.0 READY
memory: 6035.6/7954.6 MiB
Now you’re ready to deploy jobs to your Slurm Cluster through SkyShift!
Submitting Jobs to Slurm#
Submitting jobs to Slurm follows the same process as a standard SkyShift job, let’s submit a simple test job to the Slurm Cluster.
Monitoring Your Job#
To check the status of your jobs and ensure they’re running as expected:
> skyctl get jobs
NAME CLUSTER REPLICAS RESOURCES NAMESPACE STATUS
my-test-job SlurmCluster1 2/2 cpus: 1 default RUNNING
memory: 128.0 MiB
You’ll see details about each job, including the cluster it’s running on, resources allocated, and its current status.
Detaching the Cluster#
If you need to remove a cluster from SkyShift, the process is just as simple:
skyctl delete cluster SlurmCluster1
Note
If SkyShift is relaunched, it will automatically discover and attempt to attach all clusters defined in the configuration file. To prevent a cluster from being reattached, it will have to be removed from the configuration file.
After detaching, you can verify the status of the remaining clusters with skyctl get clusters
to see
the updated list.
Now that you’re equipped with the basics of managing clusters and jobs in SkyShift using Slurm, you can start harnessing the full potential of your Slurm clusters. SkyShift is designed to make your computational tasks easier, more efficient, and scalable. Happy computing!