Blog

Bootstrapping an NVIDIA GPU in K3s

With the recent proliferation of local AI models, both LLMs and image generation models, I decided to dust off my old gaming machine and join it into my home-lab K3s cluster.

This wasn’t as straightforward out of the box as I anticipated, so I thought I’d document the process here.

Assumptions

  • Your graphics card is an NVIDIA GPU
  • You’re running Ubuntu (24.04 LTS in this example)
  • You already have a functioning K3s cluster; cluster set-up is not covered here

Node Configuration

Drivers

Ubuntu provides a handy utility that recommends a stable (generally LTS) driver package for your GPU hardware:

$ ubuntu-drivers devices 2>/dev/null | grep "recommended" | awk '{print $3}'
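To see what that pipeline extracts, here is an illustrative line of ubuntu-drivers devices output run through the same awk expression (the driver version is an example and will differ on your machine):

```shell
# Illustrative output line from `ubuntu-drivers devices`; the version
# number is an example only.
line='driver   : nvidia-driver-550 - distro non-free recommended'

# awk '{print $3}' selects the third whitespace-separated field,
# i.e. the driver package name.
echo "$line" | awk '{print $3}'   # prints: nvidia-driver-550
```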

Alternatively, if you wish to live a little closer to the bleeding edge, you can view the packages available on your system using:

$ apt search nvidia-driver- | grep -E "^nvidia-driver-[0-9]*(-server)?-open"

and then:

$ sudo apt update && sudo apt install -y nvidia-driver-<YOUR_CHOSEN_VERSION>

If your Ubuntu install is headless, pick the -server variant; if, like me, you’re adding a workstation to your cluster, pick the non-server variant.

NVIDIA Container Toolkit

Add the NVIDIA container toolkit apt repository and its signing key:

$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
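The sed expression in that pipeline injects a signed-by attribute pointing at the keyring we just wrote. An illustrative deb line before and after (representative of what the list file contains):

```shell
# A representative deb line from nvidia-container-toolkit.list
# (single quotes keep $(ARCH) literal, as it appears in the file):
echo 'deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /' | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
# prints: deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
```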

Again, if you like living on the bleeding edge, you can enable the experimental packages:

$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

Then update the package index (so apt picks up the newly added repository) and install the toolkit:

$ sudo apt-get update && sudo apt-get install -y \
      nvidia-container-toolkit \
      nvidia-container-toolkit-base \
      libnvidia-container-tools \
      libnvidia-container1

Joining the cluster

Execute the following to join your node to the cluster. Note that we’re labeling the node as having a GPU.

$ curl -sfL https://get.k3s.io | K3S_URL=https://<APISERVER_IP>:6443 \
K3S_TOKEN="<YOUR_TOKEN>" \
sh -s - agent \
  --node-label "hardware=gpu"
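K3s can also read these options from a config file, which survives re-running the installer. An equivalent sketch using the same placeholders (this is the standard K3s agent config location):

```yaml
# /etc/rancher/k3s/config.yaml on the agent node
server: https://<APISERVER_IP>:6443
token: <YOUR_TOKEN>
node-label:
  - hardware=gpu
```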

Container Runtime

Lastly, we configure the container runtime using nvidia-ctk:

$ sudo nvidia-ctk runtime configure --runtime=containerd \
  --config=/var/lib/rancher/k3s/agent/etc/containerd/config.toml

And restart K3s:

$ sudo systemctl restart k3s-agent
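For reference, nvidia-ctk appends a runtime entry to that config.toml roughly along these lines (the exact plugin path depends on your containerd version, so treat this as a sketch rather than something to copy):

```toml
# Fragment added by `nvidia-ctk runtime configure` (illustrative)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```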

Cluster Configuration

First, we configure a RuntimeClass that references our newly configured nvidia container runtime.

$ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

Then we deploy the NVIDIA k8s-device-plugin:

$ export VERSION=$(curl -s https://api.github.com/repos/NVIDIA/k8s-device-plugin/releases/latest | jq -r .tag_name)
$ kubectl apply -f "https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/$VERSION/deployments/static/nvidia-device-plugin.yml"

This, however, needs some modifications:

$ kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/nodeSelector", "value": {"hardware": "gpu"}},
  {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=nvml"]},
  {"op": "add", "path": "/spec/template/spec/containers/0/securityContext", "value": {"privileged": true}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [
    {"name": "device-plugin", "mountPath": "/var/lib/kubelet/device-plugins"},
    {"name": "nvidia-libs", "mountPath": "/usr/lib/x86_64-linux-gnu", "readOnly": true},
    {"name": "dev", "mountPath": "/dev"}
  ]},
  {"op": "add", "path": "/spec/template/spec/volumes", "value": [
    {"name": "device-plugin", "hostPath": {"path": "/var/lib/kubelet/device-plugins", "type": "Directory"}},
    {"name": "nvidia-libs", "hostPath": {"path": "/usr/lib/x86_64-linux-gnu", "type": "Directory"}},
    {"name": "dev", "hostPath": {"path": "/dev"}}
  ]}
]'

This does the following:

  • Only provisions the device plugin Pods on nodes labeled with hardware=gpu
  • Enforces use of the NVIDIA Management Library (NVML) for device discovery
  • Runs the Pod as privileged (NOTE: don’t do this in production, but for my home-lab it’s fine)
  • Mounts /dev to give the plugin access to the requisite device nodes
  • Mounts in nvidia-libs so that the plugin has access to libnvidia-ml.so on the host

Now we can validate with:

$ kubectl run gpu-check --rm -it \
  --restart=Never \
  --image=nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
  --overrides='{"spec": {"runtimeClassName": "nvidia"}, "nodeSelector": {"hardware": "gpu"}}' \
  -- nvidia-smi

If everything has worked correctly you should see something like:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:2D:00.0 Off |                  N/A |
|  0%   28C    P8              2W /  320W |     217MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Summary

That’s a wrap; if the above command succeeded then you’re ready to run your choice of self-hosted LLMs, Stable Diffusion workers, or even local automated video transcription models.
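For actual workloads, a container claims a GPU through the nvidia.com/gpu extended resource advertised by the device plugin. A minimal sketch (the Pod name and command are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload          # illustrative name
spec:
  runtimeClassName: nvidia     # the RuntimeClass created earlier
  nodeSelector:
    hardware: gpu              # schedule onto the labeled GPU node
  containers:
    - name: app
      image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # claim one GPU from the device plugin
```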