vLLM Serving Example#

This example demonstrates how an application like LLM serving using [vLLM](https://vllm.ai/) can be deployed and scaled using Skyshift This example uses two k8s clusters with a single node with 1 GPU. We will be deploying two replicas of Llama 3.1 8B model over vLLM, and Skyshift will deploy the replicas across two clusters and load balance the requests between the two replicas.

## Pre-requisite

Two K8s clusters:
- Cluster1 with 1 nvidia GPU
- Cluster2 with 1 nvidia GPU
For instance, lets assume the GPU type is nvidia L4

$ skyctl get clusters
⠙ Fetching clusters
NAME       MANAGER    LABELS    RESOURCES                  STATUS    AGE
cluster1   k8                   cpus: 14.29/15.88          READY     49s
                                memory: 70.09 GB/71.92 GB
                                disk: 87.05 GB/87.05 GB
                                L4: 1.0/1.0
cluster2   k8                   cpus: 14.29/15.88          READY     49s
                                memory: 70.09 GB/71.92 GB
                                disk: 87.05 GB/87.05 GB
                                L4: 1.0/1.0

Start Skyshift using the commands listed in Setup Guide.
Add the two clusters to Skyshift

skyctl create cluster --manager k8 cluster1

skyctl create cluster --manager k8 cluster2

Create a link between the two clusters

skyctl create link -s cluster1 -t cluster2 clink

Submit the vllm job with two replicas to Skyshift

export SKYSHIFT=<Path to cloned Skyshift directory>

Add the HuggingFace Hub token if the model is gated.

yq eval '.spec.envs.HUGGING_FACE_HUB_TOKEN = "your_token_value"' -i $SKYSHIFT/examples/vllm-demo/vllm.yaml

Submit the job.

skyctl apply -f $SKYSHIFT/examples/vllm-demo/vllm.yaml

Note

Verify if the job is in running state and the replicas are distributed across cluster1 and cluster2

$ skyctl get jobs
⠙ Fetching jobs
NAME    CLUSTER    REPLICAS    RESOURCES         NAMESPACE    STATUS    AGE
vllm    cluster1   1/1         cpus: 4.0         default      RUNNING   5s
                               memory: 12.00 GB
                               L4: 1.0
vllm    cluster2   1/1         cpus: 4.0         default      RUNNING   5s
                               memory: 12.00 GB
                               L4: 1.0
✔ Fetching jobs completed successfully.

Create a service for the vllm job

skyctl apply -f $SKYSHIFT/examples/vllm-demo/vllm_service.yaml

Now, retrieve the vllm-service’s IP

skyctl get svc vllm-service

export VLLM_SERVICE='Use the IP address from the above command'

Note

Alternatively, use the following command to get the IP/host:

export VLLM_SERVICE=`kubectl get svc vllm-service --context cluster1 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'`

Use .status.loadBalancer.ingress[0].hostname in the above command if the cloud k8s service allocates hostname instead of IP.

Now, test it out

curl -X POST "$VLLM_SERVICE/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [
                    {"role": "user", "content": "San Francisco is a "}
            ]
    }' | jq

Finally, Cleanup

./cleanup.sh