
Troubleshooting

This guide covers how to diagnose and resolve common vMetal failures. Start by identifying the state your server is in (see BareMetalHost States), then jump to the matching section below.

Most commands on this page use my-metal3-namespace and my-baremetalhost-name as placeholders. Replace them with your own namespace and server name before running a command.
Figure: troubleshooting flowchart with numbered checks for a server stuck at registering, inspecting, or provisioning. The sections below cover each check.

Common failures

Stuck in registering

The BareMetalHost (BMH) stays in registering when the Bare Metal Operator (BMO) cannot reach the server's baseboard management controller (BMC).

BMC unreachable

Check BMO logs for connection errors:

kubectl logs statefulset/metal3 -n my-metal3-namespace -c metal3 | grep my-baremetalhost-name

Verify:

  • The BMC address in spec.bmc.address is reachable from the control plane cluster.
  • Firewall rules allow traffic from the Metal3 pod to the BMC port (UDP 623 for IPMI, TCP 443 for Redfish).
  • The Secret referenced by spec.bmc.credentialsName contains valid username and password keys.
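The checks in the list above can be gathered in one place. The following is a minimal sketch, not a supported tool: bmc_host and check_bmc_inputs are hypothetical helper names. It assumes the BMC address uses an IPv4 address or hostname (bracketed IPv6 is not handled) and that the credentials Secret lives in the same namespace as the BareMetalHost.

```shell
# Extract the host part of a BMC address such as
# "redfish://10.0.0.5/redfish/v1/Systems/1" or "ipmi://10.0.0.5:623".
# Assumption: IPv4 or hostname only; bracketed IPv6 is not handled.
bmc_host() {
  addr="${1#*://}"             # drop the scheme, if present
  addr="${addr%%/*}"           # drop any path
  printf '%s\n' "${addr%%:*}"  # drop any port
}

# Print the BMC host and report whether the Secret has both expected keys.
# $1: BareMetalHost name, $2: namespace
check_bmc_inputs() {
  addr=$(kubectl get baremetalhost "$1" -n "$2" -o jsonpath='{.spec.bmc.address}')
  secret=$(kubectl get baremetalhost "$1" -n "$2" -o jsonpath='{.spec.bmc.credentialsName}')
  echo "BMC host: $(bmc_host "$addr")"
  for key in username password; do
    # Only test for presence; the value itself is never printed.
    value=$(kubectl get secret "$secret" -n "$2" -o jsonpath="{.data.$key}")
    if [ -n "$value" ]; then
      echo "Secret $secret has key: $key"
    else
      echo "Secret $secret is missing key: $key"
    fi
  done
}
```

Run check_bmc_inputs my-baremetalhost-name my-metal3-namespace, then test the printed host from a node or pod in the control plane cluster.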

Stuck in inspecting

The BMH stays in inspecting when the server cannot boot into the IPA ramdisk and phone home to Ironic. Work through these checks in order. Each one narrows the failure to a specific layer.

Check BMC console

The BMC console shows the server's boot sequence directly. Because it is out-of-band, it works even if the vMetal setup is misconfigured.

Access the console through the BMC web interface or using ipmitool:

ipmitool -I lanplus -H <bmc-address> -U <user> -P <pass> sol activate

You should see this sequence:

  1. The server powers on or reboots (triggered by Metal3 over BMC).
  2. The firmware attempts a network boot on the provisioning NIC.
  3. The NIC gets a DHCP lease.
  4. The server downloads a kernel and initramfs over TFTP or HTTP.

If the server does not power on or reboot, the problem is BMC access. Check the BMC address, network reachability from the Metal3 pod, and the credentials in the Secret referenced by spec.bmc.credentialsName.

If the server powers on but stalls during network boot, check the DHCP server logs next.

Check DHCP server logs

The DHCP server logs every request and response it sees. It is the most targeted tool for diagnosing a stuck network boot.

kubectl logs -n my-metal3-namespace -l app=dhcp-proxy

The logs include:

  • DHCP requests from the server, including requests from MACs that do not match any registered BareMetalHost. These help identify a wrong spec.bootMACAddress.
  • DHCP responses sent back to the server.
  • TFTP requests for the bootloader.
  • HTTP requests for the kernel and initramfs images.

If no entries appear for the server, the DHCP broadcast is not reaching the pod. Check that the DHCP server is attached to the provisioning network using a Multus NetworkAttachmentDefinition, or that it is running with hostNetwork: true for flat networks.
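To narrow the DHCP logs to a single server, you can filter them by MAC address. The sketch below uses hypothetical helper names (normalize_mac, dhcp_logs_for_mac) and assumes the logs print MAC addresses in colon-separated form, which may not match your log format. Note that kubectl logs with a label selector returns only the last 10 lines per pod unless you pass --tail, so the sketch requests the full log.

```shell
# Convert a MAC address such as "AA-BB-CC-DD-EE-FF" to "aa:bb:cc:dd:ee:ff".
normalize_mac() {
  printf '%s\n' "$1" | tr 'A-Z-' 'a-z:'
}

# Show DHCP server log lines that mention a given MAC address.
# $1: namespace, $2: MAC address in dash or colon notation
# Assumption: the log prints MACs colon-separated; case is ignored.
dhcp_logs_for_mac() {
  kubectl logs -n "$1" -l app=dhcp-proxy --tail=-1 \
    | grep -i -- "$(normalize_mac "$2")"
}
```

For example, dhcp_logs_for_mac my-metal3-namespace AA-BB-CC-DD-EE-FF prints nothing if the server's requests never reached the pod.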

Check ramdisk logs

The ramdisk-logs container tails IPA log files as they arrive. If no logs appear, the server never contacted Ironic.

kubectl logs statefulset/metal3 -n my-metal3-namespace -c ramdisk-logs

Check Ironic logs

kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic

Look for errors related to the server's MAC address or IPMI/Redfish operations.

Other causes

  • Wrong boot MAC. The spec.bootMACAddress field must match the NIC used for PXE boot. The DHCP server logs show requests from unmatched MACs, which makes this easy to spot.
  • IPA ramdisk not downloaded. The ramdisk-downloader init container fetches the IPA image on startup. A failed init container blocks the entire Metal3 pod. Run kubectl describe pod metal3-0 -n <namespace> and check the init container status.
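As an alternative to reading the full kubectl describe output, the init container status can be listed directly. This is a rough sketch with hypothetical helper names; it relies on Kubernetes reporting ready as true for an init container only after it completes successfully.

```shell
# Print "<name><TAB><ready>" for each init container of the metal3-0 pod.
# $1: namespace
list_init_status() {
  kubectl get pod metal3-0 -n "$1" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'
}

# Read "<name><TAB><ready>" lines on stdin and print the names of
# init containers that have not completed successfully.
not_ready_init_containers() {
  awk -F '\t' '$2 != "true" { print $1 }'
}
```

Run list_init_status my-metal3-namespace | not_ready_init_containers, then inspect each listed container with kubectl logs metal3-0 -n <namespace> -c <container-name>.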

Stuck in provisioning

The BMH stays in provisioning when Ironic cannot complete the OS installation.

Check Ironic logs

kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic

Common causes

  • Image URL unreachable. The server's IPA agent downloads the OS image directly, so the URL must be reachable from the bare metal server, not just from the Metal3 pod. As a first check, test the URL from the Metal3 pod using an ephemeral debug container: kubectl debug -it metal3-0 -n <namespace> --image=curlimages/curl -- curl -I <image-url>. A successful response rules out a bad URL, but it does not prove that the server itself can reach it.
  • Checksum mismatch. Ironic verifies the downloaded image against spec.image.checksum. If the checksum is wrong, provisioning fails with a checksum error in the Ironic logs. Recheck the checksum against the actual file.
  • Wrong rootDeviceHints. If a server has multiple disks, Ironic may target the wrong one. Set spec.rootDeviceHints on the BareMetalHost to select the correct disk by size, WWN, or device name.
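To recheck a checksum against the actual file, download the image and compare digests locally. The sketch below assumes spec.image.checksum holds a literal SHA-256 hex digest. If your configuration uses a checksum URL or a different algorithm, adapt it accordingly. checksum_matches is a hypothetical helper name, and sha256sum is the GNU coreutils tool (on macOS, shasum -a 256 is the usual equivalent).

```shell
# Succeed if the SHA-256 digest of a file equals the expected value.
# $1: path to the downloaded image, $2: expected SHA-256 hex digest
checksum_matches() {
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  [ "$actual" = "$2" ]
}

# Example usage (placeholders as elsewhere on this page):
#   expected=$(kubectl get baremetalhost my-baremetalhost-name \
#     -n my-metal3-namespace -o jsonpath='{.spec.image.checksum}')
#   curl -fsSLo /tmp/os-image <image-url>
#   checksum_matches /tmp/os-image "$expected" && echo match || echo mismatch
```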

Force a re-provision

If provisioning is stuck and you want Ironic to retry, delete the Machine from the vCluster Platform UI or API. The platform deprovisions the BMH and returns it to available. A new Machine request then triggers a fresh provisioning attempt.

Do not manually edit the BMH spec.image field while a Machine holds the claim. The platform manages that field and will reconcile it on the next cycle.

Stuck in deprovisioning

When a Machine is deleted, the platform removes the image reference from the BMH and sets it offline. The BMO then triggers a cleaning cycle that wipes the disk and returns the server to available.

What cleaning does

The BMO powers on the server via BMC, boots a cleaning ramdisk, and runs automated disk erasure steps. The BMC must remain reachable from Metal3 throughout this process. The cleaning ramdisk must also be able to reach Ironic and the DHCP proxy when booting.

Cleaning can take several minutes on servers with many disks or large disk capacities.
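Because cleaning can run for a while, it may help to poll the BMH until it returns to available rather than checking by hand. This is a minimal sketch: wait_for_bmh_state is a hypothetical helper, the 10-second interval is arbitrary, and it assumes the state is exposed at .status.provisioning.state as in upstream Metal3.

```shell
# Poll a BareMetalHost until it reaches the wanted state or a timeout expires.
# $1: BMH name, $2: namespace, $3: wanted state, $4: timeout in seconds (default 600)
# Returns 0 if the state was reached, 1 on timeout.
wait_for_bmh_state() {
  name=$1; ns=$2; want=$3; timeout=${4:-600}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    state=$(kubectl get baremetalhost "$name" -n "$ns" \
      -o jsonpath='{.status.provisioning.state}')
    if [ "$state" = "$want" ]; then
      return 0
    fi
    sleep 10
    elapsed=$((elapsed + 10))
  done
  return 1
}
```

For example, wait_for_bmh_state my-baremetalhost-name my-metal3-namespace available 1800 waits up to 30 minutes. If it times out, check the BMO and Ironic logs.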

Check BMO and Ironic logs

kubectl logs statefulset/metal3 -n my-metal3-namespace -c metal3 | grep -iE "clean|deprovi"
kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic | grep -i "clean"

Server permanently unavailable

If a server is physically removed or permanently unavailable, delete the BareMetalHost resource. The platform detects the deletion and marks the associated NodeClaim as failed. The IP is released on the next reconcile cycle.

Error state

A BMH in error state has stopped retrying. Read the error details first.

kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.status.errorMessage}'
kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.status.errorType}'

When to retry vs. when to recreate

  • Retry when the error is transient: network blip or temporary BMC unresponsiveness. Annotate the BMH to trigger a retry:
kubectl annotate baremetalhost my-baremetalhost-name -n my-metal3-namespace \
baremetalhost.metal3.io/reboot='{"force":false}' --overwrite
  • Fix credentials and retry when the error is wrong BMC credentials. Update the Secret with the correct values. The BMO picks up the change on its next reconcile without reprovisioning the server. Then annotate the BMH to trigger a retry using the command above.

  • Abort and create a new Machine when the error is a checksum mismatch or image URL failure. The image URL and checksum are written to Ironic's database at provisioning time. Updating the OSImage resource does not affect an in-flight Machine. Delete the Machine from the vCluster Platform UI or API to abort provisioning. The platform deprovisions the BMH and returns it to available. Create a new Machine with corrected image configuration to restart. To avoid waiting for the full deprovisioning cycle, you can delete the Machine immediately after the BMH reaches deprovisioning.

  • Delete and recreate the BMH when the underlying hardware configuration is wrong, such as a bad BMC address or an incorrect boot MAC. Fix the problem, then delete and recreate the BMH. The platform detects the missing resource on its next reconcile and marks the associated NodeClaim as failed.

IP allocation issues

vMetal allocates an IP address per Machine from a CIDR or explicit IP range configured on the NodeProvider. The platform stores allocations in a ConfigMap named ipam-default in its management namespace (typically loft). If Machine creation fails with a failed to allocate IP error, start by inspecting this ConfigMap.

"No available IPs" / CIDR exhausted

The error appears in the loft platform logs and in Machine events:

failed to allocate IP: CIDR exhausted; no free addresses left

or, for explicit IP ranges:

failed to allocate IP: all ranges exhausted; no free addresses left

Inspect current allocations

kubectl get configmap ipam-default -n loft -o yaml

The ConfigMap data is a flat map of <ip>: NodeClaim/<namespace>/<name> entries. Each entry is one allocated address.

Example output:

data:
  10.10.1.2: NodeClaim/loft-p-myproject/machine-abc
  10.10.1.3: NodeClaim/loft-p-myproject/machine-def

Leaked IPs from failed deprovision

If the platform deleted a Machine but did not release its IP (for example, a crash during cleanup), the entry remains in the ConfigMap. The corresponding NodeClaim no longer exists. Remove stale entries manually:

  1. List all active NodeClaims:

    kubectl get nodeclaims -A
  2. Compare against the ConfigMap entries. Any entry whose NodeClaim no longer exists is stale.

  3. Edit the ConfigMap to remove stale entries:

    kubectl edit configmap ipam-default -n loft
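The comparison in step 2 can be scripted when the ConfigMap is large. The following sketch uses hypothetical helper names (stale_ipam_entries, find_stale_ips) and assumes every entry has the form <ip>: NodeClaim/<namespace>/<name>. It only reports candidates and changes nothing, so review the output before editing the ConfigMap.

```shell
# Print allocation lines whose NodeClaim is not in the active list.
# $1: file of "<ip>: NodeClaim/<namespace>/<name>" lines
# $2: file of "<namespace>/<name>" lines for NodeClaims that still exist
stale_ipam_entries() {
  awk '
    FILENAME == ARGV[1] { active[$0] = 1; next }  # first file: active NodeClaims
    {
      owner = $2
      sub(/^NodeClaim\//, "", owner)              # leaves "<namespace>/<name>"
      if (!(owner in active)) print $1, $2
    }
  ' "$2" "$1"
}

# Collect both inputs from the cluster and print stale candidates.
# Assumption: the ConfigMap is ipam-default in the loft namespace.
find_stale_ips() {
  tmp=$(mktemp -d)
  kubectl get configmap ipam-default -n loft \
    -o go-template='{{range $ip, $owner := .data}}{{$ip}}: {{$owner}}{{"\n"}}{{end}}' \
    > "$tmp/allocations"
  kubectl get nodeclaims -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
    > "$tmp/active"
  stale_ipam_entries "$tmp/allocations" "$tmp/active"
  rm -rf "$tmp"
}
```

An entry may also look stale briefly while a Machine is still being created, so re-run the check before removing anything.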

CIDR too small

If the number of Machines regularly exceeds the available address space, expand the IP range in the NodeProvider configuration. Set the range using the metal3.vcluster.com/network-cidr or metal3.vcluster.com/network-ip-range property on the NodeProvider or NodeType.
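To judge whether a CIDR is large enough, compare its size with the number of entries in the ipam-default ConfigMap. The sketch below computes the total number of addresses in an IPv4 CIDR. cidr_size is a hypothetical helper, and the result is an upper bound: the allocator may reserve some addresses (for example network, broadcast, or gateway addresses), which this page does not specify.

```shell
# Print the total number of addresses in an IPv4 CIDR, e.g. 10.10.1.0/24.
# Assumption: IPv4 only; reserved addresses are not subtracted.
cidr_size() {
  prefix=${1##*/}                  # text after the slash
  echo $(( 1 << (32 - prefix) ))   # 2^(32 - prefix)
}
```

For example, cidr_size 10.10.1.0/24 prints 256, so a fleet that regularly approaches 250 Machines would need a wider range.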

Webhook readiness

The Metal3 ValidatingWebhook validates every BareMetalHost create and update. The webhook runs inside the metal3 container of the metal3 StatefulSet and serves on port 9443. The API server routes requests through the metal3-webhook-service Service.

The webhook configuration uses failurePolicy: Fail by default. If the webhook endpoint is unreachable when you create a BMH, the API server rejects the request with a connection error. The BMH is never created. Servers appear unregistered until the webhook comes up.

Check webhook readiness

Check whether the BMO container is ready:

kubectl get pod metal3-0 -n my-metal3-namespace -o jsonpath='{.status.containerStatuses[?(@.name=="metal3")].ready}'

Check the readiness probe directly:

kubectl exec -n my-metal3-namespace metal3-0 -c metal3 -- \
wget -qO- http://localhost:9440/readyz

A healthy response returns ok.

Check the ValidatingWebhookConfiguration for the current CA bundle and service endpoint:

kubectl get validatingwebhookconfiguration metal3-validating-webhook-configuration -o yaml

Log locations

All Metal3 components run as containers inside the metal3 StatefulSet on the control plane cluster. The namespace matches the one configured in the NodeProvider's clusterRef.

Component | Container | Log command
Ironic | ironic | kubectl logs statefulset/metal3 -n <namespace> -c ironic
Bare Metal Operator | metal3 | kubectl logs statefulset/metal3 -n <namespace> -c metal3
IPA ramdisk output | ramdisk-logs | kubectl logs statefulset/metal3 -n <namespace> -c ramdisk-logs

Platform-level events

Machine and NodeClaim events appear on the management cluster. To see events for a specific Machine:

kubectl describe machine my-baremetalhost-name -n my-metal3-namespace

For NodeClaims:

kubectl get nodeclaims -A
kubectl describe nodeclaim my-baremetalhost-name -n my-metal3-namespace

These events report why vMetal could not claim a BMH, such as no available hosts, an IPAM failure, or an image resolution error. Check them first, before digging into component logs.