With the recent proliferation of local AI models, both LLMs and image generation models, I decided to dust off my old gaming machine and join it into my home-lab K3s cluster.
This wasn’t as straightforward out of the box as I anticipated, so I thought I’d document the process here.
Assumptions
- Your graphics card is an NVIDIA GPU
- You’re running Ubuntu (24.04 LTS in this example)
- You already have a functioning K3s cluster; cluster set-up is not covered here.
Node Configuration
Drivers
Ubuntu provides a handy utility that recommends a stable (generally LTS) driver package for your GPU hardware:
$ ubuntu-drivers devices 2>/dev/null | grep "recommended" | awk '{print $3}'
Alternatively, if you wish to live a little closer to the bleeding edge you can view the packages available on your system using:
$ apt search nvidia-driver- | grep -E "^nvidia-driver-[0-9]*(-server)?-open"
and then:
$ sudo apt update && sudo apt install -y nvidia-driver-<YOUR_CHOSEN_VERSION>
If your Ubuntu install is headless, pick the -server variant; but if, like me, you’re adding a workstation to your cluster, pick the non-server variant.
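After installing (and rebooting, if the kernel module changed), it's worth a quick sanity check on the host before involving Kubernetes at all:

```shell
# Confirm the kernel module is loaded and the driver can talk to the GPU.
# You should see your card listed along with the driver and CUDA versions.
$ nvidia-smi
```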
NVIDIA Container Toolkit
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Again, if you like living on the bleeding edge, you can enable the experimental packages:
$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
Then install the following:
$ sudo apt-get install -y \
nvidia-container-toolkit \
nvidia-container-toolkit-base \
libnvidia-container-tools \
libnvidia-container1
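Once installed, you can confirm the toolkit is working (a quick sanity check; your version numbers will differ):

```shell
# Print the toolkit CLI version
$ nvidia-ctk --version

# Query the driver through libnvidia-container; this should report
# your driver/CUDA versions and detected devices
$ nvidia-container-cli info
```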
Joining the Cluster
Execute the following to join your node to the cluster. Note that we label the node as having a GPU.
$ curl -sfL https://get.k3s.io | K3S_URL=https://<APISERVER_IP>:6443 \
K3S_TOKEN="<YOUR_TOKEN>" \
sh -s - agent \
--node-label "hardware=gpu"
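Once the agent comes up, you can confirm from any machine with kubectl access that the node registered and carries the label (the node name will be whatever your host is called):

```shell
# List only nodes carrying the GPU label we set on join
$ kubectl get nodes -l hardware=gpu
```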
Container Runtime
Lastly, we must configure the container runtime using nvidia-ctk:
$ sudo nvidia-ctk runtime configure --runtime=containerd \
--config=/var/lib/rancher/k3s/agent/etc/containerd/config.toml
And restart the K3s agent:
$ sudo systemctl restart k3s-agent
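If you want to double-check what nvidia-ctk wrote, the nvidia runtime should now appear in the generated containerd config (the exact TOML layout varies between K3s and toolkit versions):

```shell
# Look for the nvidia runtime entry and its binary path
$ grep -A 2 'nvidia' /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```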
Cluster Configuration
First, we create a RuntimeClass that references our newly configured NVIDIA container runtime.
$ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
EOF
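You can confirm the RuntimeClass was created and points at the right handler:

```shell
# Should show the 'nvidia' RuntimeClass with handler 'nvidia'
$ kubectl get runtimeclass nvidia
```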
Then we configure the NVIDIA k8s-device-plugin:
$ export VERSION=$(curl -s https://api.github.com/repos/NVIDIA/k8s-device-plugin/releases/latest | jq -r .tag_name)
$ kubectl apply -f "https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/$VERSION/deployments/static/nvidia-device-plugin.yml"
This, however, needs some modifications:
$ kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --type='json' -p='[
{"op": "add", "path": "/spec/template/spec/nodeSelector", "value": {"hardware": "gpu"}},
{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=nvml"]},
{"op": "add", "path": "/spec/template/spec/containers/0/securityContext", "value": {"privileged": true}},
{"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [
{"name": "device-plugin", "mountPath": "/var/lib/kubelet/device-plugins"},
{"name": "nvidia-libs", "mountPath": "/usr/lib/x86_64-linux-gnu", "readOnly": true},
{"name": "dev", "mountPath": "/dev"}
]},
{"op": "add", "path": "/spec/template/spec/volumes", "value": [
{"name": "device-plugin", "hostPath": {"path": "/var/lib/kubelet/device-plugins", "type": "Directory"}},
{"name": "nvidia-libs", "hostPath": {"path": "/usr/lib/x86_64-linux-gnu", "type": "Directory"}},
{"name": "dev", "hostPath": {"path": "/dev"}}
]}
]'
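Once the patched DaemonSet pods are running, the node should advertise an nvidia.com/gpu resource. A quick check (the pod label selector below matches NVIDIA's static manifest; adjust it if your deployment differs):

```shell
# Confirm the device plugin pod is Running on the GPU node
$ kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

# Print the GPU capacity advertised by labeled nodes; expect "1"
# (or however many GPUs the node has)
$ kubectl get nodes -l hardware=gpu \
    -o jsonpath='{.items[*].status.capacity.nvidia\.com/gpu}'
```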
This does the following:
- Only provisions the device plugin Pods on nodes labeled with hardware=gpu
- Enforces use of the NVIDIA Management Library (NVML) for device discovery
- Runs the Pod as privileged (NOTE: don’t do this in production, but for my home-lab it’s fine)
- Mounts /dev to give the plugin access to the requisite device nodes
- Mounts in nvidia-libs so that the plugin has access to libnvidia-ml.so on the host
Now we can validate with:
$ kubectl run gpu-check --rm -it \
--restart=Never \
--image=nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
--overrides='{"spec": {"runtimeClassName": "nvidia"}, "nodeSelector": {"hardware": "gpu"}}' \
-- nvidia-smi
If everything has worked correctly, you should see something like:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Off | 00000000:2D:00.0 Off | N/A |
| 0% 28C P8 2W / 320W | 217MiB / 10240MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Summary
That’s a wrap! If the command above succeeded, you’re ready to run your choice of self-hosted LLMs, Stable Diffusion workers, or even local automated video transcription models.
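As a first real workload, here’s a minimal sketch of a Pod that actually requests the GPU through the device plugin. The Pod name is a hypothetical example and the CUDA image tag is an assumption; pick whichever base image suits your workload:

```shell
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload          # hypothetical example name
spec:
  runtimeClassName: nvidia    # the RuntimeClass created earlier
  nodeSelector:
    hardware: gpu             # the label we set when joining the node
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1     # request one GPU from the device plugin
EOF
```

Unlike the earlier gpu-check one-liner, declaring nvidia.com/gpu in the resource limits lets the scheduler account for GPU capacity, so two such Pods won’t be packed onto a single-GPU node.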