How to Autoscale Kubernetes Pods Based on GPU

There are plenty of resources on the internet about scaling Kubernetes pods based on CPU, but when it comes to scaling pods based on GPU, it’s hard to find a concise breakdown that outlines each step and how to test it. In this article, we outline the steps to scale Kubernetes pods based on GPU metrics. These steps were performed on an AKS (Azure Kubernetes Service) cluster, but they work well with most cloud service providers as well as with self-managed clusters.

For this tutorial, you will need an API key. Contact us to download yours. 


Step 0: Prerequisites

Kubernetes cluster

You’ll need to have a Kubernetes cluster up and running for this tutorial. To set up an AKS cluster, see this guide from Azure.

Note: Your cluster should have at least two GPU-enabled nodes.
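
If you’re on AKS and don’t have GPU nodes yet, adding a GPU node pool looks roughly like the following (a sketch using the Azure CLI; the resource group, cluster name, node pool name, and VM size are placeholders you’ll need to adapt):

az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-count 2 \
  --node-vm-size Standard_NC6s_v3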

Kubectl

To manage Kubernetes resources, set up the kubectl command line client.

Here is a guide to install kubectl if you haven’t installed it already.

Helm

Helm is used to manage the packaging, configuration, and deployment of resources to the Kubernetes cluster. We’ll use Helm throughout this tutorial.

Use this guide and follow your OS-specific installation instructions.

Step 1: Install metrics server

Now that the prerequisites are installed and set up, we’ll move ahead with installing the Kubernetes plugins and tools needed to set up autoscaling based on GPU metrics.

The metrics server collects resource metrics from the kubelet on each node and exposes them through the Kubernetes Metrics API. Most cloud distributions of Kubernetes (e.g. AKS), as well as local distributions, already have metrics-server installed. If you’re not sure, follow the instructions below to check and, if needed, install it.

  1. To check if you have metrics-server running:

     kubectl get pods -A | grep metrics-server

     If metrics-server is installed, you should see output like this:

     kube-system  metrics-server-774f99dbf4-tjw6l

  2. If you don’t have it installed, use the following command to install it:

     kubectl apply -f \
     https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
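
As a quick sanity check that the Metrics API is serving data (a generic verification step, not GPU-specific), you can ask for node resource usage:

kubectl top nodes

If this returns CPU and memory figures for your nodes, metrics-server is working.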


Step 2: Install nvidia device plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to run GPU-enabled containers in your cluster.

Install it using the following command:

				
kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
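
Once the plugin’s DaemonSet pods are running, each GPU node should advertise an nvidia.com/gpu resource. A quick way to check (a simple grep; the reported count depends on your node sizes):

kubectl describe nodes | grep nvidia.com/gpu

You should see a non-zero nvidia.com/gpu count under both Capacity and Allocatable for your GPU nodes.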

To learn more about the NVIDIA device plugin, see the NVIDIA/k8s-device-plugin repository.

Step 3: Install dcgm exporter

DCGM-Exporter collects GPU telemetry using Go bindings for NVIDIA DCGM and lets you monitor the health and utilization of your GPUs. It exposes an easy-to-consume HTTP endpoint (/metrics) for monitoring tools like Prometheus.

Run the following command to install dcgm-exporter:

				
kubectl create -f \
https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml


Once it is running, you can try to query its /metrics endpoint.

First, forward port 9400 of the dcgm-exporter service (run this command in a separate terminal):

				
					kubectl port-forward svc/dcgm-exporter 9400:9400
				
			


Then query the /metrics endpoint:

				
					curl localhost:9400/metrics
				
			

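The response is in the Prometheus exposition format. The exact set of metrics depends on your GPUs and dcgm-exporter version, but you should see entries roughly like the following, including the DCGM_FI_DEV_GPU_UTIL gauge we’ll use for autoscaling later:

# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",device="nvidia0",...} 0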

Step 4: Install kube-prometheus-stack

Next, install the Prometheus stack using a customized kube-prometheus-stack.values file. This values file contains a few changes suggested by NVIDIA (to make Prometheus reachable from outside the cluster) and an additionalScrapeConfigs entry that creates a job to scrape the metrics exported by dcgm-exporter.

We’ll generate the kube-prometheus-stack.values file below and then make these edits to it.

Add & update the helm repo:

				
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

helm repo update


Once the Helm repo is set up, export the chart’s default values so we can modify them.

				
					
helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values
				
			


In the Prometheus instance section of the chart, update the service type from ClusterIP to NodePort. This change exposes the Prometheus server on each node’s IP at port 30090.

				
					From:
 ## Port to expose on each node
 ## Only used if service.type is 'NodePort'
 ##
 nodePort: 30090
 
 ## Loadbalancer IP
 ## Only use if service.type is "loadbalancer"
 loadBalancerIP: ""
 loadBalancerSourceRanges: []
 ## Service type
 ##
 type: ClusterIP


To:
 ## Port to expose on each node
 ## Only used if service.type is 'NodePort'
 ##
 nodePort: 30090
 
 ## Loadbalancer IP
 ## Only use if service.type is "loadbalancer"
 loadBalancerIP: ""
 loadBalancerSourceRanges: []
 ## Service type
 ##
 type: NodePort
				
			


Update the value of serviceMonitorSelectorNilUsesHelmValues to false.

				
					## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector
## will cause the prometheus resource to be created with selectors based on
## values in the helm deployment, which will also match the servicemonitors created
##

serviceMonitorSelectorNilUsesHelmValues: false
				
			


Add the following scrape config to the additionalScrapeConfigs section of the values file.

				
					
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
				
			


Once your values file is ready, install kube-prometheus-stack via Helm.

				
helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values /tmp/kube-prometheus-stack.values


After installation is finished, your output should look like this.

				
					NAME: kube-prometheus-stack-1652691100
LAST DEPLOYED: Mon May 16 14:22:12 2022
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
 kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1652691100"
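
Before moving on, you can optionally confirm that Prometheus is scraping the gpu-metrics job. One way to do this (a sketch; prometheus-operated is the headless service the Prometheus Operator typically creates, and the query path is the standard Prometheus HTTP API):

kubectl port-forward -n prometheus svc/prometheus-operated 9090:9090

# In another terminal, query the GPU utilization metric directly:
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'

A non-empty result list means the dcgm-exporter metrics are reaching Prometheus.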
				
			


Step 5: Install prometheus-adapter

Now we’ll install the prometheus-adapter. The adapter gathers the available metrics from Prometheus at regular intervals and exposes them through the Kubernetes custom metrics API, which is what the HorizontalPodAutoscaler will read.

				
prometheus_service=$(kubectl get svc -nprometheus -lapp=kube-prometheus-stack-prometheus -ojsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

helm upgrade \
  --install prometheus-adapter prometheus-community/prometheus-adapter \
  --set rbac.create=true,prometheus.url=http://${prometheus_service}.prometheus.svc.cluster.local,prometheus.port=9090


This will take a moment to set up. Once it’s up, the adapter should start serving the GPU metrics through the custom metrics API.
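
To verify that the adapter exposes the GPU metric we care about, you can query the custom metrics API directly (kubectl’s raw API access; the metric appears once Prometheus has scraped dcgm-exporter at least once):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | grep -i DCGM_FI_DEV_GPU_UTIL

If the metric name shows up in the output, the HPA we create next will be able to read it.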

Step 6: Create a HPA which scales based on GPU

Now that all the pieces are in place, create a HorizontalPodAutoscaler and configure it to scale on the basis of the GPU utilization metric (DCGM_FI_DEV_GPU_UTIL).

				
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  maxReplicas: 3  # Update this accordingly
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  metrics:
  - type: Pods  # scale based on gpu
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 80

Other GPU metrics are available besides DCGM_FI_DEV_GPU_UTIL. You can find a complete list of available metrics in the DCGM-Exporter docs.

Step 7: Create a LoadBalancer service (Optional)

This is an optional step to expose your app to the web. If you set up your cluster with a cloud service provider, there’s a good chance it will allocate a public IP address for a LoadBalancer service, which you can use to interact with your application. Alternatively, you can create a service of type NodePort and access your app through that.

				
apiVersion: v1
kind: Service
metadata:
  name: app-ip
  labels:
    component: app
spec:
  type: LoadBalancer
  selector:
    component: app
  ports:
    - name: http
      port: 80
      targetPort: 8080

In this configuration, we assume that the app listens on port 8080 inside the container, and we map it to port 80 of the service.

Step 8: Putting it all together

Now that we have all the external pieces that we need, let’s create a Kubernetes manifest file and save it as autoscaling-demo.yml.

For demonstration, we’ll use the container image of the deid application, Private AI’s container-based de-identification system. You can use any GPU-based application of your choice.

				
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      component: app
  template:
    metadata:
      labels:
        component: app
    spec:
      containers:
        - name: app
          securityContext:
            capabilities: # SYS_ADMIN capability is needed for DCGM Exporter
              add:
                - SYS_ADMIN
          resources:
            limits:
              nvidia.com/gpu: 1
          image: privateai/deid:2.11full_gpu # You can use any GPU based image

---
apiVersion: v1
kind: Service
metadata:
  name: app-ip
  labels:
    component: app
spec:
  type: LoadBalancer
  selector:
    component: app
  ports:
    - name: http
      port: 80
      targetPort: 8080 # The port might be different for your application

---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  maxReplicas: 2 # Update this according to your desired number of replicas
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: 30


Step 9: Create a deployment

Run the kubectl create command with the manifest file you saved to create your deployment.

				
kubectl create -f autoscaling-demo.yml


Once your deployment is complete, you should be able to see the running status of the pods and our HorizontalPodAutoscaler, which will scale based on GPU utilization.

To check the status of the pods:

				
					$ kubectl get pods

NAME                                 READY   STATUS             RESTARTS   AGE
dcgm-exporter-6bjn8                  1/1     Running            0          3h37m
dcgm-exporter-xmn74                  1/1     Running            0          3h37m
my-gpu-app-675b967d56-q7swb          1/1     Running            0          12m
prometheus-adapter-6696b6d76-g2csx   1/1     Running            0          104m
				
			


To check the status of the HorizontalPodAutoscaler:

				
					$ kubectl get hpa

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-gpu-app   Deployment/my-gpu-app   0/30      1         2         1          2m15s
				
			


Getting your public/external IP:

				
$ kubectl get svc

NAME                 TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
app-ip               LoadBalancer   10.0.208.227   20.233.60.124   80:31074/TCP   15s
dcgm-exporter        ClusterIP      10.0.116.180   <none>          9400/TCP       3h55m
kubernetes           ClusterIP      10.0.0.1       <none>          443/TCP        4h26m
prometheus-adapter   ClusterIP      10.0.12.96     <none>          443/TCP        122m
			


In this example, 20.233.60.124 is the external IP of the app-ip service.

Step 10: Test autoscaling

Increase the GPU utilization by making requests to the application. When the average GPU utilization (target) crosses 30, the threshold we set, you’ll observe that the application scales up and spins up another pod.

Making a request to your app

Here we are making requests to the /deidentify_text endpoint of our deid container. You can make requests to any resource that utilizes the GPU.

				
for ((i=1;i<=10;i++)); do
  curl -X POST http://20.233.60.124/deidentify_text \
    -H 'content-type: application/json' \
    -d '{"text": ["My name is John and my friend is Grace", "I live in Berlin"], "unique_pii_markers": false, "key": ""}' &
done

Need an API key? Contact us to download yours.

Meanwhile, keep observing the status of the HorizontalPodAutoscaler. When the GPU utilization (target) crosses 30, the autoscaler will automatically spin up another pod.
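
If you prefer a live view, you can watch the HPA update in place using kubectl’s standard --watch flag:

kubectl get hpa my-gpu-app --watch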

				
$ kubectl get hpa

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-gpu-app   Deployment/my-gpu-app   40/30     1         2         2          30m

				
			

Check the status of the pods; you’ll notice that another my-gpu-app pod has been spun up by our autoscaler.
				
					$ kubectl get pods

NAME                                 READY   STATUS             RESTARTS   AGE
dcgm-exporter-6bjn8                  1/1     Running            0          3h37m
dcgm-exporter-xmn74                  1/1     Running            0          3h37m
my-gpu-app-675b967d56-q7swb          1/1     Running            0          30m
my-gpu-app-572f924e36-q7swb          1/1     Running            0          5m
prometheus-adapter-6696b6d76-g2csx   1/1     Running            0          104m

				
			


Additional resources for Kubernetes GPU deployment

Interested in receiving more tech tips like autoscaling Kubernetes pods based on GPU? Sign up for Private AI’s mailing list to get notified about the latest information on machine learning deployment, privacy, and more.
