Article Goals
A common request is how to keep CPU-only pods from being scheduled on your expensive GPU nodes. This article explains how we make this possible and the process you can follow to enable GPU node protection in your own environments.
How Do We Protect GPU Nodes?
We leverage the ExtendedResourceToleration admission controller to add a toleration for nvidia.com/gpu:NoSchedule to pods that request the nvidia.com/gpu extended resource. The node is tainted with nvidia.com/gpu:NoSchedule, which prevents pods without that toleration from being scheduled on it. We have also added the nvidia.com/gpu:NoSchedule toleration to Rok components to ensure they keep running on the GPU nodes where they are required.
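One way to confirm that the admission controller did its job is to inspect a GPU pod's tolerations after it has been admitted. The sketch below assumes you have kubectl access to the cluster; the helper name pod_tolerations and its arguments are hypothetical, not part of any product CLI. For a pod that requests nvidia.com/gpu, the output would be expected to include a toleration with key nvidia.com/gpu, operator Exists, and effect NoSchedule.

```shell
# Hypothetical helper (illustrative name): print the tolerations of a pod
# as JSON so you can look for the nvidia.com/gpu entry.
# Usage: pod_tolerations <pod name> [namespace]
pod_tolerations() {
  kubectl get pod "$1" --namespace "${2:-default}" \
    -o jsonpath='{.spec.tolerations}'
}
```

For example, pod_tolerations <pod name> <namespace> prints the toleration list of that pod; the namespace defaults to default when omitted.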
Protecting GPU Nodes in Your Environment(s)
- Create a dedicated GPU node group and add the appropriate taint. Follow the GPU node group section of our official docs.
If you already have a GPU node, you must taint the node like so:
kubectl taint node <node name> nvidia.com/gpu:NoSchedule
You should see the taint when describing the node. You can untaint the node by appending a trailing "-" to the taint:
kubectl taint node <node name> nvidia.com/gpu:NoSchedule-
We don't recommend tainting individual nodes; we include the process so you understand how it works at the individual node level. You should taint the node group instead, as described below.
Kubernetes node taints can be applied to new and existing managed node groups using the AWS Management Console or through the Amazon EKS API.
In the management console, go to EKS -> Clusters -> <cluster name> -> Node Group: <node group name>.
Edit the node group, add a taint with key nvidia.com/gpu and effect NoSchedule, and click the "Save changes" button at the bottom of the page.
You can also update the node group from the command line:
aws eks update-nodegroup-config --cluster-name <cluster-name> --nodegroup-name <node-group-name> --taints 'addOrUpdateTaints=[{key="nvidia.com/gpu",effect="NO_SCHEDULE"}]'
To remove the taint via the command line, use:
aws eks update-nodegroup-config --cluster-name <cluster-name> --nodegroup-name <node-group-name> --taints 'removeTaints=[{key="nvidia.com/gpu",effect="NO_SCHEDULE"}]'
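Whichever route you use to apply the taint, it can help to confirm that it actually landed. The following is a minimal sketch, assuming kubectl and the AWS CLI are both configured for the cluster; the function names node_taints and nodegroup_taints are hypothetical.

```shell
# Hypothetical helpers for checking taints; the names are illustrative only.

# List every node alongside the keys of any taints on it.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Show the taints configured on an EKS managed node group.
# Usage: nodegroup_taints <cluster-name> <node-group-name>
nodegroup_taints() {
  aws eks describe-nodegroup \
    --cluster-name "$1" --nodegroup-name "$2" \
    --query 'nodegroup.taints'
}
```

GPU nodes should list nvidia.com/gpu among their taint keys, and the node group query should show the same key with the NO_SCHEDULE effect.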
- Request a GPU for your pipeline step with Kale, in a pod manifest:
container:
  ...
  resources:
    limits:
      nvidia.com/gpu: 2
or via Kubeflow Pipelines directly. To learn how to do this, please reference the tutorials in the official Kubeflow documentation.
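To try the whole flow end to end, a small test pod that requests a single GPU can be useful. The sketch below only writes a manifest to disk; the pod name, file name, and image placeholder are hypothetical, and it assumes the chosen image ships the nvidia-smi binary.

```shell
# Write a minimal Pod manifest that requests one GPU. The pod name, file
# name, and image placeholder are hypothetical; substitute your own.
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: <a CUDA-enabled image>
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Then submit it with: kubectl apply -f gpu-smoke-test.yaml
```

Because the pod requests nvidia.com/gpu, it should receive the toleration automatically and be schedulable on a tainted GPU node, while a pod without that request should stay off those nodes.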