# GPU Fleet Operations
This guide covers fleet design and day-2 operations for GPU bare metal servers managed by vMetal. For initial NodeProvider setup and provisioning a single GPU server, start with the GPU Quickstart.
## Label strategy
Labels on BareMetalHost resources drive node type selection. The platform's label selector matches servers to node types, so a consistent labeling scheme is the foundation of a manageable fleet.
### Recommended labels
| Label | Example values | Purpose |
|---|---|---|
| `gpu-model` | `h100`, `a100`, `l40s` | GPU model identifier |
| `gpu-count` | `8`, `4`, `2` | Number of GPUs per server |
| `rack` | `rack-a`, `rack-b` | Physical location |
| `datacenter` | `us-east-1`, `eu-west-1` | Site identifier |
| `nvlink-domain` | `nvlink-0`, `nvlink-1` | NVLink fabric group for multi-server NVLink topologies |
Apply labels when registering servers:
```shell
kubectl label baremetalhost server-01 -n metal3-system \
  gpu-model=h100 \
  gpu-count=8 \
  rack=rack-a \
  datacenter=us-east-1
```
### Updating labels without re-provisioning
You can update labels on a BareMetalHost at any time. Label changes do not trigger re-provisioning. The platform reads labels only when selecting a server for a new Machine claim.
```shell
kubectl label baremetalhost server-01 -n metal3-system nvlink-domain=nvlink-0 --overwrite
```
Labels on a claimed server have no effect until you delete the Machine and provision a new one.
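Before defining a node type, it can help to preview which servers its selector would match. `kubectl get` accepts the same equality-based label selectors (the label values here are illustrative):

```shell
# List BareMetalHosts that a selector on gpu-model and gpu-count would match
kubectl get baremetalhost -n metal3-system \
  -l gpu-model=h100,gpu-count=8 \
  -o wide
```

An empty result means no registered server carries that label combination, so a node type with that selector could never provision.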
## Node type design
### One node type per hardware profile
Define a separate node type for each distinct GPU hardware profile. Servers with different GPU models, GPU counts, or CPU configurations should have distinct node types with matching selectors.
```yaml
nodeTypes:
  - name: "h100-8x"
    displayName: "H100 8-GPU"
    resources:
      cpu: "96"
      memory: 768Gi
      nvidia.com/gpu: "8"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
          gpu-count: "8"
    properties:
      vcluster.com/os-image: ubuntu-noble-gpu
  - name: "a100-4x"
    displayName: "A100 4-GPU"
    resources:
      cpu: "64"
      memory: 256Gi
      nvidia.com/gpu: "4"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: a100
          gpu-count: "4"
    properties:
      vcluster.com/os-image: ubuntu-noble-gpu
```
### Per-tenant node types
AI Clouds often dedicate capacity to specific tenants by combining hardware labels with a tenant label. Add a tenant label to the BareMetalHost resources assigned to that tenant:
```shell
kubectl label baremetalhost server-01 -n metal3-system tenant=customer-a
```
Then define a node type whose selector includes both hardware and tenant attributes:
```yaml
nodeTypes:
  - name: "customer-a-h100-8x"
    displayName: "Customer A H100 8-GPU"
    resources:
      cpu: "96"
      memory: 768Gi
      nvidia.com/gpu: "8"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
          gpu-count: "8"
          tenant: customer-a
    properties:
      vcluster.com/os-image: ubuntu-noble-gpu
```
Only servers labeled `tenant: customer-a` match this node type. Servers assigned to other tenants are not eligible, even if their hardware labels match.
Node types are also composable. You can define them along orthogonal dimensions. For example, you might create one set by hardware profile and another set by tenant. A server labeled with both dimensions matches both types simultaneously. Tenants select from the pool that combines whichever dimensions apply to their request.
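As a sketch of two orthogonal dimensions (the names are illustrative; `resources` and `properties` are omitted for brevity), a server labeled `gpu-model: h100`, `gpu-count: "8"`, and `tenant: customer-a` is eligible for both of these types:

```yaml
nodeTypes:
  # Hardware dimension: any H100 8-GPU server, regardless of tenant
  - name: "h100-8x"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
          gpu-count: "8"
  # Tenant dimension: only Customer A's H100 servers, any GPU count
  - name: "customer-a-h100"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
          tenant: customer-a
```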
## Cost configuration
Each node type has an associated cost. Karpenter uses cost to select the cheapest matching node type when a provisioning request matches multiple types. For GPU node types, the default cost contribution is 10000 per GPU resource unit, so an 8-GPU node type contributes 80000 by default.
Override cost explicitly when you want to control scheduling preference independently of resource counts:
```yaml
nodeTypes:
  - name: "h100-8x"
    cost: 500000
    resources:
      cpu: "96"
      memory: 768Gi
      nvidia.com/gpu: "8"
```
A lower cost makes the node type preferred over higher-cost alternatives when both match.
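As an illustration (the cost values here are arbitrary, and selectors and resources are omitted for brevity), giving the A100 type the lower cost steers requests toward it whenever both types satisfy the request:

```yaml
nodeTypes:
  - name: "a100-4x"
    cost: 200000   # preferred when both types match a request
  - name: "h100-8x"
    cost: 500000   # chosen only when a100-4x cannot satisfy the request
```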
## Scheduling behavior
### GPU resources inside a tenant cluster
When a bare metal server joins a tenant cluster as a private node, the platform sets the node's capacity based on the node type's resources field. Workloads inside the tenant cluster schedule to the node using standard Kubernetes resource requests.
```yaml
resources:
  limits:
    nvidia.com/gpu: "1"
```
For GPU scheduling to work, the NVIDIA device plugin or GPU Operator must run inside the tenant cluster. Without it, the node does not advertise `nvidia.com/gpu` capacity to the scheduler, even if the node type specifies it.
### NVIDIA device plugin vs. GPU Operator
The NVIDIA device plugin is a lightweight DaemonSet that discovers GPUs and exposes them to the Kubernetes scheduler. Deploy it directly when you control the driver installation and want minimal overhead.
The NVIDIA GPU Operator manages the device plugin alongside driver installation, container runtime configuration, and monitoring. It is the most common approach for production fleets where the OS image does not include pre-installed drivers.
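A typical GPU Operator install uses NVIDIA's published Helm chart; verify the chart version and values against NVIDIA's documentation for your driver strategy, and note that the release name and namespace below are conventions, not requirements:

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```

If your OS image already ships the NVIDIA driver, the operator's driver installation can typically be disabled via chart values so only the device plugin and supporting components are deployed.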
## Fleet lifecycle
### Rotating OS images
To update the OS image across a fleet, create a new `OSImage` resource and update the node type to reference it. Do not edit the existing `OSImage` in place: updating an existing resource does not trigger re-provisioning for running Machines.
1. Create a new `OSImage` resource with the updated URL and checksum.
2. Update the node type's `vcluster.com/os-image` property to reference the new `OSImage`.
The node type change causes drift. The platform detects that running Machines no longer match the node type and rolls them over automatically, deprovisioning and re-provisioning servers with the updated image.
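You can follow the rollover from the management cluster; `kubectl get baremetalhost` shows each Metal3 server's current provisioning state:

```shell
# Watch servers cycle through deprovisioning and re-provisioning
kubectl get baremetalhost -n metal3-system -w
```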
## Heterogeneous GPU fleets
A single NodeProvider can manage servers with different GPU models. Define one node type per hardware profile and use label selectors to route requests to the right servers. The platform selects the cheapest matching type when multiple types satisfy a request.
```yaml
nodeTypes:
  - name: "h100-8x"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
  - name: "a100-4x"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: a100
```
Tenants request a specific node type by name using `nodeTypeSelector`. If the tenant omits a selector, the platform picks the cheapest available type.
## Server pinning
Pin a Machine to a specific physical server when a workload requires a particular server, such as a node in a specific NVLink fabric or a server with a known hardware configuration.
Set `metal3.vcluster.com/server-name` on the Machine, not on the node type:
```yaml
privateNodes:
  enabled: true
  autoNodes:
    - provider: metal3-provider
      static:
        - name: pinned-gpu
          quantity: 1
          nodeTypeSelector:
            - property: vcluster.com/node-type
              value: h100-8x
          properties:
            metal3.vcluster.com/server-name: server-01
```
The platform claims only the named server. If that server is unavailable, provisioning fails rather than selecting a different server.
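To confirm the claim landed on the intended server, inspect the BareMetalHost's `spec.consumerRef`, the standard Metal3 field recording which consumer currently holds the host:

```shell
# Prints the name of the consumer that claimed server-01 (empty if unclaimed)
kubectl get baremetalhost server-01 -n metal3-system \
  -o jsonpath='{.spec.consumerRef.name}'
```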