
GPU presentation modes

vMetal provisions physical servers. It does not virtualize GPUs, partition GPUs, or install a specific accelerator driver by itself. The server hardware, OS image, driver stack, and vendor device plugin or Dynamic Resource Allocation driver determine GPU presentation.

When a vMetal Machine joins a tenant cluster as a private node, Kubernetes sees the installed OS and device stack.

Responsibility boundary

| Layer | Responsible component |
| --- | --- |
| Server power, BMC communication, inspection, PXE boot, OS installation | Metal3, Ironic, and vMetal |
| OS image selection and cloud-init user data | vMetal OSImage, NodeProvider, NodeType, and Machine properties |
| GPU driver, kernel modules, container runtime configuration | OS image, cloud-init, or vendor operator |
| GPU discovery and Kubernetes resource advertisement | Vendor device plugin, GPU Operator, or DRA driver |
| Tenant Kubernetes API, workload scheduling, and object sync | vCluster |

For the tenant-cluster compatibility layer, see GPU and Accelerator Support in vCluster.

Bare-metal passthrough

Bare-metal passthrough is the default vMetal model. There is no hypervisor between the workload node and the physical GPU. The provisioned OS sees the PCIe devices directly. The vendor driver and device plugin expose those devices to Kubernetes.

Use this model for dedicated GPU servers, AI training nodes, inference nodes, and other workloads that need direct device access and predictable performance.
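Because the provisioned OS sees the PCIe devices directly, a quick check on the node itself can confirm that the GPUs are visible before you troubleshoot drivers or plugins. The following is a minimal sketch, not part of vMetal: `list_gpu_pci` is a hypothetical helper name, and the PCI class strings it matches are an assumption that covers common GPU and accelerator classes. Your hardware may report a different class.

```shell
# list_gpu_pci: hypothetical helper that filters `lspci -nn` output (read from
# stdin) down to lines whose PCI class usually indicates a GPU or accelerator.
# The class list is an assumption; extend it if your devices report another class.
list_gpu_pci() {
  grep -Ei 'VGA compatible controller|3D controller|Display controller|Processing accelerators'
}

# Usage on the provisioned node (for example over SSH):
#   lspci -nn | list_gpu_pci
```

If the expected devices do not appear here, the problem is below the driver layer (hardware, firmware, or PCIe topology) rather than in the Kubernetes integration.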

NVIDIA vGPU

NVIDIA vGPU requires the NVIDIA vGPU software stack, compatible hardware, and the appropriate NVIDIA licensing. vMetal does not create vGPU profiles or configure the vGPU manager. If you need vGPU, include the required NVIDIA vGPU host or guest driver configuration in the OS image or post-provision setup. Then install the Kubernetes components that expose the resulting resources to the tenant cluster.

Use this model only when your NVIDIA vGPU architecture is already defined at the OS and driver layer.

NVIDIA MIG

Configure NVIDIA Multi-Instance GPU (MIG) through the NVIDIA driver stack and GPU Operator or device plugin. vMetal can provision the node image that contains the required driver support. The NVIDIA components control MIG strategy, partitioning, and exposed resource names.

After MIG is configured, Kubernetes workloads request whichever resources the NVIDIA plugin advertises: either the full-GPU resource name or MIG-specific resource names, depending on the MIG strategy.
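As an illustration of a MIG-specific request, the sketch below writes a Pod manifest that asks for one MIG slice. It assumes the NVIDIA plugin uses the mixed MIG strategy and advertises `nvidia.com/mig-1g.5gb`; with the single strategy, MIG devices are typically advertised as `nvidia.com/gpu` instead. The Pod name, file name, and `<cuda-image>` placeholder are hypothetical, so check `kubectl describe node` for the names your nodes actually expose.

```shell
# Write a hypothetical Pod manifest that requests one MIG slice.
# Assumption: the node advertises nvidia.com/mig-1g.5gb (mixed MIG strategy).
cat > mig-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mig-example              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: <cuda-image>        # placeholder: any image that ships nvidia-smi
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
EOF
```

After replacing the image placeholder, apply the file in the tenant cluster with `kubectl apply -f mig-pod.yaml`. The Pod stays Pending if no node advertises the requested resource name.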

AMD and other accelerators

vMetal can provision AMD GPU servers and other accelerator servers the same way it provisions NVIDIA servers. The provisioned node must install the correct vendor driver. It must also run a Kubernetes integration that advertises the device.

Examples include:

  • AMD GPUs exposed by the AMD GPU device plugin or AMD GPU Operator as resources such as amd.com/gpu.
  • Vendor-specific AI accelerators exposed through a device plugin.
  • FPGAs, DPUs, NICs, or other PCIe devices exposed through a device plugin or DRA driver.

For device-plugin based accelerators, use the vendor's resource name in the NodeType resources field and in workload requests. For DRA-based accelerators, install the vendor DRA driver and configure the required DRA objects, such as DeviceClass, ResourceClaim, and ResourceClaimTemplate.

OS image guidance

Choose one of these patterns:

  • Build a GPU-ready OS image with the driver stack pre-installed.
  • Use a standard OS image and install drivers through vcluster.com/user-data during cloud-init.
  • Install the vendor operator after the node joins the tenant cluster.

For production fleets, prefer repeatable OS images or deterministic cloud-init configuration. Changing an OSImage resource does not modify running Machines. To roll out a new driver or presentation mode, create a new OSImage. Then update the NodeType to reference it and let the platform roll Machines to the new image.
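For the cloud-init pattern, the user data can be as small as a package install plus a module load. The sketch below writes such a cloud-config file. The file name and both angle-bracket placeholders are hypothetical, and the right packages depend on your distribution and vendor. How the content is attached through vcluster.com/user-data is not shown here.

```shell
# Write a hypothetical cloud-config that installs a vendor driver at first boot.
# Both <...> values are placeholders, not real package or module names.
cat > gpu-user-data.yaml <<'EOF'
#cloud-config
package_update: true
packages:
  - <vendor-driver-package>
runcmd:
  - [modprobe, <vendor-kernel-module>]
EOF
```

Pinning exact package versions in this file keeps the result deterministic across Machines, which is the property the guidance above asks for.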

See GPU Fleet Operations for the rollout workflow.

Validation

After the node joins the tenant cluster, validate the presentation mode from Kubernetes:

kubectl describe node <node-name>
kubectl get node <node-name> -o jsonpath='{.status.capacity}'

The advertised resource names should match the driver and plugin configuration. If the expected extended resource is missing, troubleshoot the OS image, driver, container runtime configuration, and vendor plugin first. For DRA, validate the vendor DRA driver and DRA objects instead of looking only for an extended resource in node capacity.
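To check a single extended resource rather than reading the whole capacity map, you can query it directly. The helper below is a sketch: `node_resource_count` is a hypothetical name, and it relies on kubectl's jsonpath syntax, in which dots inside a key must be escaped with a backslash. The commented commands at the end list the DRA object types by their standard resource names and assume the DRA API is enabled in the cluster.

```shell
# node_resource_count NODE RESOURCE
# Hypothetical helper: prints the capacity value of one extended resource,
# or nothing if the node does not advertise it.
node_resource_count() {
  node="$1"
  # Escape dots so jsonpath treats the resource name as a single key.
  escaped=$(printf '%s' "$2" | sed 's/\./\\./g')
  kubectl get node "$node" -o jsonpath="{.status.capacity.${escaped}}"
}

# Usage:
#   node_resource_count <node-name> nvidia.com/gpu
#   node_resource_count <node-name> amd.com/gpu
#
# For DRA-based devices, inspect the DRA objects instead:
#   kubectl get deviceclasses
#   kubectl get resourceslices
#   kubectl get resourceclaims --all-namespaces
```

An empty result for a device-plugin resource usually means the plugin is not running or did not find a working driver on that node.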

External references