GPU quickstart
This guide walks through provisioning GPU bare metal servers as private nodes in a tenant cluster. It covers GPU-specific prerequisites, node type configuration, and workload verification.
The Getting Started guide covers NodeProvider setup, BMC credentials, and server registration.
Prerequisites
Complete the Getting Started guide first. You need a working NodeProvider with Metal3 deployed and at least one `BareMetalHost` in `available` state.
You also need:
- GPU servers that have passed Metal3 inspection and are in `available` state.
- An OS image with GPU drivers pre-installed, a standard cloud image with a cloud-init script that installs them, or a standard cloud image with the NVIDIA GPU Operator installed post-provision. The GPU Operator is the most common approach for production fleets.
- SSH keys configured as `SSHKey` resources if you need post-provision server access.
1. Label your GPU servers
Add labels to your BareMetalHost resources. The platform uses these labels to match servers to node types.
kubectl label baremetalhost server-01 -n metal3-system \
gpu-model=h100 \
gpu-count=8
These label keys are useful for GPU fleets:
- `gpu-model` — GPU model, such as `h100`, `a100`, or `l40s`.
- `gpu-count` — Number of GPUs per server.
- `rack` or `datacenter` — Location for topology-aware scheduling.
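Location labels pay off in step 3, where a node type's selector can pin a pool to one rack. A sketch using the same `matchLabels` syntax (the `rack: r12` value is illustrative):

```yaml
bareMetalHosts:
  selector:
    matchLabels:
      gpu-model: h100
      rack: r12   # illustrative: only match H100 servers in rack r12
```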
2. Create an OSImage
An OSImage resource stores the OS image URL and checksum. Multiple node types can reference the same OSImage by name.
apiVersion: management.loft.sh/v1
kind: OSImage
metadata:
  name: ubuntu-noble-gpu
spec:
  properties:
    metal3.vcluster.com/image-url: https://your-registry.example.com/ubuntu-noble-gpu-amd64.img
    metal3.vcluster.com/image-checksum: sha256:abc123...
    metal3.vcluster.com/image-checksum-type: sha256
kubectl apply -f os-image.yaml
If you don't have a GPU-ready image, use a standard cloud image and install the NVIDIA driver via `vcluster.com/user-data`. See Configuration for user data options.
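A minimal cloud-init sketch for that approach, assuming an Ubuntu 24.04 (Noble) image. The driver package name varies by distribution and GPU generation, and attaching this via `vcluster.com/user-data` is covered in Configuration:

```yaml
#cloud-config
# Install the NVIDIA server driver on first boot, then sanity-check it.
# nvidia-driver-550-server is illustrative; pick the package for your GPU.
packages:
  - nvidia-driver-550-server
runcmd:
  - [nvidia-smi]
```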
3. Configure a GPU node type
Add a GPU node type to your NodeProvider. The resources field defines what the platform advertises to the Kubernetes scheduler.
nodeTypes:
- name: "h100-8x"
  displayName: "H100 8x GPU"
  resources:
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: "8"
  bareMetalHosts:
    selector:
      matchLabels:
        gpu-model: h100
        gpu-count: "8"
  properties:
    vcluster.com/os-image: ubuntu-noble-gpu
    vcluster.com/ssh-keys: my-ssh-key
The platform validates CPU and memory against the hardware inventory collected during Metal3 inspection, but it does not validate GPU resources against hardware. The `nvidia.com/gpu` value in `resources` is exactly what the scheduler sees, so make sure it matches the physical GPU count.
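Since `nodeTypes` is a list, a mixed fleet can define several pools that reference the same OSImage from step 2. A hypothetical L40S pool (all values illustrative) might look like:

```yaml
nodeTypes:
- name: "l40s-4x"              # illustrative second pool
  displayName: "L40S 4x GPU"
  resources:
    cpu: "32"                  # validated against inspected hardware
    memory: 128Gi
    nvidia.com/gpu: "4"        # not validated; must match physical GPU count
  bareMetalHosts:
    selector:
      matchLabels:
        gpu-model: l40s
        gpu-count: "4"
  properties:
    vcluster.com/os-image: ubuntu-noble-gpu   # same OSImage as the H100 pool
```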
4. Create a tenant cluster with GPU private nodes
Configure a vCluster to use the GPU node type as a private node.
privateNodes:
  enabled: true
autoNodes:
- provider: metal3-provider
  static:
  - name: gpu-nodes
    quantity: 1
    nodeTypeSelector:
    - property: vcluster.com/node-type
      value: h100-8x
The platform selects an available h100-8x server, installs the OS, and joins it to the tenant cluster as a worker node.
5. Verify GPU access
Once the node joins, verify the GPU is visible to the scheduler.
vcluster connect my-cluster
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'
The output should show `"nvidia.com/gpu": "8"` in the node capacity. If it's missing, check that the NVIDIA device plugin or GPU Operator is installed on the node.
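Scheduler capacity alone doesn't prove a container can reach the driver. A throwaway pod that requests one GPU and runs `nvidia-smi` is a common smoke test (a sketch; the CUDA image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"  # claim one GPU from the node's capacity
```

If the pod completes and its logs list the GPUs, the driver, device plugin, and container runtime are wired up end to end.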
What the platform handles
- Selecting a `BareMetalHost` that matches the node type label selector.
- Allocating an IP and generating network configuration.
- Generating cloud-init and joining the server to the tenant cluster.
- Releasing the IP and returning the server to the pool on deprovision.
What you manage
- Enrolling `BareMetalHost` resources and applying GPU labels.
- Providing an OS image with GPU driver support, or installing drivers via cloud-init.
- Deploying the NVIDIA device plugin or GPU Operator inside the tenant cluster.
- Setting the correct `nvidia.com/gpu` count in the node type to match the hardware.
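For the last two items, the NVIDIA GPU Operator's standard Helm install, run against the tenant cluster after `vcluster connect`, is a reasonable starting point. The chart repo and namespace below are the NVIDIA-documented defaults, not platform requirements:

```shell
# Install the NVIDIA GPU Operator into the tenant cluster.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# If your OS image already ships the driver, add: --set driver.enabled=false
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```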