Troubleshooting
This guide covers how to diagnose and resolve common vMetal failures. Start by identifying the state your server is in using BareMetalHost States, then jump to the matching section below.
Commands on this page use my-metal3-namespace and my-baremetalhost-name as placeholders. Replace them with your Metal3 namespace and BareMetalHost name.
Common failures
Stuck in registering
The BareMetalHost (BMH) stays in registering when the Bare Metal Operator (BMO) cannot reach the server's baseboard management controller (BMC).
BMC unreachable
Check BMO logs for connection errors:
kubectl logs statefulset/metal3 -n my-metal3-namespace -c metal3 | grep my-baremetalhost-name
Verify:
- The BMC address in spec.bmc.address is reachable from the control plane cluster.
- Firewall rules allow traffic from the Metal3 pod to the BMC port (623 for IPMI, 443 for Redfish).
- The Secret referenced by spec.bmc.credentialsName contains valid username and password keys.
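The Secret check in the last item can be scripted. The sketch below is one possible approach and is not part of vMetal: check_bmc_secret is a hypothetical helper name, and it assumes the Secret lives in the same namespace as the BareMetalHost. It reports only whether each key is present, so no credential values are printed.

```shell
# Report whether a Secret carries non-empty username and password keys.
# Hypothetical helper; assumes kubectl points at the control plane cluster.
# Usage: check_bmc_secret <namespace> <secret-name>
check_bmc_secret() {
  for key in username password; do
    # .data.<key> is the base64-encoded value; empty output means the key is absent or empty
    value=$(kubectl get secret "$2" -n "$1" -o jsonpath="{.data.$key}")
    if [ -n "$value" ]; then
      echo "$key: present"
    else
      echo "$key: MISSING"
    fi
  done
}
```

To find the Secret name, read spec.bmc.credentialsName from the BMH, then run check_bmc_secret my-metal3-namespace <secret-name>. A key that is present can still hold wrong credentials. This check only rules out a missing or empty key.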
Stuck in inspecting
The BMH stays in inspecting when the server cannot boot into the Ironic Python Agent (IPA) ramdisk and report back to Ironic. Work through these checks in order. Each one narrows the failure to a specific layer.
Check BMC console
The BMC console shows the server's boot sequence directly. Because it is out-of-band, it works even if the vMetal setup is misconfigured.
Access the console through the BMC web interface or using ipmitool:
ipmitool -I lanplus -H <bmc-address> -U <user> -P <pass> sol activate
You should see this sequence:
- The server powers on or reboots (triggered by Metal3 over BMC).
- The firmware attempts a network boot on the provisioning NIC.
- The NIC gets a DHCP lease.
- The server downloads a kernel and initramfs over TFTP or HTTP.
If the server does not power on or reboot, the problem is BMC access. Check the BMC address, network reachability from the Metal3 pod, and the credentials in the Secret referenced by spec.bmc.credentialsName.
If the server powers on but stalls during network boot, check the DHCP server logs next.
Check DHCP server logs
The DHCP server logs every request and response it sees. It is the most targeted tool for diagnosing a stuck network boot.
kubectl logs -n my-metal3-namespace -l app=dhcp-proxy
The logs include:
- DHCP requests from the server, including requests from MACs that do not match any registered BareMetalHost. These help identify a wrong spec.bootMACAddress.
- DHCP responses sent back to the server.
- TFTP requests for the bootloader.
- HTTP requests for the kernel and initramfs images.
If no entries appear for the server, the DHCP broadcast is not reaching the pod. Check that the DHCP server is attached to the provisioning network using a Multus NetworkAttachmentDefinition, or that it is running with hostNetwork: true for flat networks.
Check ramdisk logs
The ramdisk-logs container tails IPA log files as they arrive. If no logs appear, the server never contacted Ironic.
kubectl logs statefulset/metal3 -n my-metal3-namespace -c ramdisk-logs
Check Ironic logs
kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic
Look for errors related to the server's MAC address or IPMI/Redfish operations.
Other causes
- Wrong boot MAC. The spec.bootMACAddress field must match the NIC used for PXE boot. The DHCP server logs show requests from unmatched MACs, which makes this easy to spot.
- IPA ramdisk not downloaded. The ramdisk-downloader init container fetches the IPA image on startup. A failed init container blocks the entire Metal3 pod. Run kubectl describe pod metal3-0 -n <namespace> and check the init container status.
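Comparing MAC addresses by eye is error-prone because tools print them with different casing and separators. The following is a minimal sketch for that comparison. normalize_mac and compare_boot_mac are hypothetical helper names, and the observed MAC is something you copy from the DHCP server logs or the BMC console yourself.

```shell
# Normalize a MAC address: lowercase hex digits, colons as separators.
normalize_mac() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-/:/g'
}

# Compare the MAC configured on the BMH with one observed on the network.
# Usage: compare_boot_mac <configured-mac> <observed-mac>
compare_boot_mac() {
  configured=$(normalize_mac "$1")
  observed=$(normalize_mac "$2")
  if [ "$configured" = "$observed" ]; then
    echo "match: $configured"
  else
    echo "mismatch: BMH has $configured, server sent $observed"
    return 1
  fi
}
```

The configured value can be read with kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.spec.bootMACAddress}' and passed as the first argument.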
Stuck in provisioning
The BMH stays in provisioning when Ironic cannot complete the OS installation.
Check Ironic logs
kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic
Common causes
- Image URL unreachable. The server's IPA agent downloads the OS image directly. It must be reachable from the bare metal server, not just from the Metal3 pod. Test whether the URL is reachable from the Metal3 pod, where Ironic runs, using an ephemeral debug container:
kubectl debug -it metal3-0 -n <namespace> --image=curlimages/curl -- curl -I <image-url>
A successful response here does not prove that the server itself can reach the URL.
- Checksum mismatch. Ironic verifies the downloaded image against spec.image.checksum. If the checksum is wrong, provisioning fails with a checksum error in the Ironic logs. Recheck the checksum against the actual file.
- Wrong rootDeviceHints. If a server has multiple disks, Ironic may target the wrong one. Set spec.rootDeviceHints on the BareMetalHost to target the correct disk by size, WWN, or device name.
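Rechecking the checksum against the actual file can be done locally once you have downloaded the image. The sketch below assumes the expected value is a SHA-256 hex digest and that sha256sum (GNU coreutils) is available. verify_sha256 is a hypothetical helper name. If your checksum uses a different algorithm or points to a checksum file, adapt it accordingly.

```shell
# Compare a local image file's SHA-256 digest against the expected value.
# Usage: verify_sha256 <image-file> <expected-hex-digest>
verify_sha256() {
  # sha256sum prints "<digest>  <filename>"; keep only the digest
  actual=$(sha256sum "$1" | awk '{print $1}')
  # accept the expected digest in either upper or lower case
  expected=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  if [ "$actual" = "$expected" ]; then
    echo "checksum OK"
  else
    echo "checksum MISMATCH: expected $expected, got $actual"
    return 1
  fi
}
```

Pass the value you configured for the image as the second argument. A mismatch here means the configured checksum, not the download path, is the likely cause.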
Force a re-provision
If provisioning is stuck and you want Ironic to retry, delete the Machine from the vCluster Platform UI or API. The platform deprovisions the BMH and returns it to available. A new Machine request then triggers a fresh provisioning attempt.
Do not manually edit the BMH spec.image field while a Machine holds the claim. The platform manages that field and will reconcile it on the next cycle.
Stuck in deprovisioning
When a Machine is deleted, the platform removes the image reference from the BMH and sets it offline. The BMO then triggers a cleaning cycle that wipes the disk and returns the server to available.
What cleaning does
The BMO powers on the server via BMC, boots a cleaning ramdisk, and runs automated disk erasure steps. The BMC must remain reachable from Metal3 throughout this process. The cleaning ramdisk must also be able to reach Ironic and the DHCP proxy when booting.
Cleaning can take several minutes on servers with many disks or large disk capacities.
Check BMO and Ironic logs
kubectl logs statefulset/metal3 -n my-metal3-namespace -c metal3 | grep -i "clean\|deprovi"
kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic | grep -i "clean"
Server permanently unavailable
If a server is physically removed or permanently unavailable, delete the BareMetalHost resource. The platform detects the deletion and marks the associated NodeClaim as failed. The IP is released on the next reconcile cycle.
Error state
A BMH in error state has stopped retrying. Read the error details first.
kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.status.errorMessage}'
kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.status.errorType}'
When to retry vs. when to recreate
- Retry when the error is transient: network blip or temporary BMC unresponsiveness. Annotate the BMH to trigger a retry:
kubectl annotate baremetalhost my-baremetalhost-name -n my-metal3-namespace \
baremetalhost.metal3.io/reboot='{"force":false}' --overwrite
- Fix credentials and retry when the error is wrong BMC credentials. Update the Secret with the correct values. The BMO picks up the change on its next reconcile without reprovisioning the server. Then annotate the BMH to trigger a retry using the command above.
- Abort and create a new Machine when the error is a checksum mismatch or image URL failure. The image URL and checksum are written to Ironic's database at provisioning time. Updating the OSImage resource does not affect an in-flight Machine. Delete the Machine from the vCluster Platform UI or API to abort provisioning. The platform deprovisions the BMH and returns it to available. Create a new Machine with corrected image configuration to restart. To avoid waiting for the full deprovisioning cycle, you can delete the Machine immediately after the BMH reaches deprovisioning.
- Delete and recreate the BMH when the underlying hardware configuration is wrong, such as a bad BMC address or incorrect boot MAC. Fix the problem, then delete and recreate the BMH. The platform detects the missing resource on its next reconcile and marks the associated NodeClaim as failed.
IP allocation issues
vMetal allocates an IP address per Machine from a CIDR or explicit IP range configured on the NodeProvider. The platform stores allocations in a ConfigMap named ipam-default in the loft management namespace (typically loft). If Machine creation fails with a failed to allocate IP error, check here.
"No available IPs" / CIDR exhausted
The error appears in the loft platform logs and in Machine events:
failed to allocate IP: CIDR exhausted; no free addresses left
or, for explicit IP ranges:
failed to allocate IP: all ranges exhausted; no free addresses left
Inspect current allocations
kubectl get configmap ipam-default -n loft -o yaml
The ConfigMap data is a flat map of <ip>: NodeClaim/<namespace>/<name> entries. Each entry is one allocated address.
Example output:
data:
  10.10.1.2: NodeClaim/loft-p-myproject/machine-abc
  10.10.1.3: NodeClaim/loft-p-myproject/machine-def
Leaked IPs from failed deprovision
If the platform deleted a Machine but did not release its IP (for example, a crash during cleanup), the entry remains in the ConfigMap. The corresponding NodeClaim no longer exists. Remove stale entries manually:
- List all active NodeClaims:
kubectl get nodeclaims -A
- Compare against the ConfigMap entries. Any entry whose NodeClaim no longer exists is stale.
- Edit the ConfigMap to remove stale entries:
kubectl edit configmap ipam-default -n loft
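With many allocations, the comparison step is easier to script than to do by hand. The following is a rough sketch, not a platform tool. find_stale_ips is a hypothetical helper that works on two text files, and it assumes the entry format shown in the example output above. It only lists candidates and does not modify the ConfigMap.

```shell
# List allocation entries whose NodeClaim is not in the active set.
# $1: file of "<ip>: NodeClaim/<namespace>/<name>" lines (ConfigMap data)
# $2: file of "<namespace>/<name>" lines (active NodeClaims)
find_stale_ips() {
  while IFS= read -r line; do
    # Strip everything up to and including ": NodeClaim/" to get <namespace>/<name>
    ref=${line#*: NodeClaim/}
    # Skip lines that are not allocation entries (nothing was stripped)
    [ "$ref" = "$line" ] && continue
    # Print the entry if no active NodeClaim matches it exactly
    if ! grep -qxF "$ref" "$2"; then
      printf '%s\n' "$line"
    fi
  done < "$1"
}

# One way to produce the two input files (run against the management cluster):
#   kubectl get configmap ipam-default -n loft \
#     -o go-template='{{range $ip, $owner := .data}}{{$ip}}: {{$owner}}{{"\n"}}{{end}}' > allocations.txt
#   kubectl get nodeclaims -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' > active.txt
#   find_stale_ips allocations.txt active.txt
```

Review the output before removing anything. A NodeClaim created between the two kubectl calls would not appear in active.txt, so its allocation could be listed as stale by mistake.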
CIDR too small
If the number of Machines regularly exceeds the available address space, expand the IP range in the NodeProvider configuration. Set the range using the metal3.vcluster.com/network-cidr or metal3.vcluster.com/network-ip-range property on the NodeProvider or NodeType.
Webhook readiness
The Metal3 ValidatingWebhook validates every BareMetalHost create and update. The webhook runs inside the metal3 container of the metal3 StatefulSet and serves on port 9443. The API server routes requests through the metal3-webhook-service Service.
The webhook configuration uses failurePolicy: Fail by default. If the webhook endpoint is unreachable when you create a BMH, the API server rejects the request with a connection error. The BMH is never created. Servers appear unregistered until the webhook comes up.
Check webhook readiness
Check whether the BMO container is ready:
kubectl get pod metal3-0 -n my-metal3-namespace -o jsonpath='{.status.containerStatuses[?(@.name=="metal3")].ready}'
Check the readiness probe directly:
kubectl exec -n my-metal3-namespace metal3-0 -c metal3 -- \
wget -qO- http://localhost:9440/readyz
A healthy response returns ok.
Check the ValidatingWebhookConfiguration for the current CA bundle and service endpoint:
kubectl get validatingwebhookconfiguration metal3-validating-webhook-configuration -o yaml
Log locations
All Metal3 components run as containers inside the metal3 StatefulSet on the control plane cluster. The namespace matches the one configured in the NodeProvider's clusterRef.
| Component | Container | Log command |
|---|---|---|
| Ironic | ironic | kubectl logs statefulset/metal3 -n <namespace> -c ironic |
| Bare Metal Operator | metal3 | kubectl logs statefulset/metal3 -n <namespace> -c metal3 |
| IPA ramdisk output | ramdisk-logs | kubectl logs statefulset/metal3 -n <namespace> -c ramdisk-logs |
Platform-level events
Machine and NodeClaim events appear on the management cluster. To see events for a specific Machine:
kubectl describe machine my-baremetalhost-name -n my-metal3-namespace
For NodeClaims:
kubectl get nodeclaims -A
kubectl describe nodeclaim my-baremetalhost-name -n my-metal3-namespace
These events report why vMetal could not claim a BMH, such as no available hosts, an IPAM failure, or an image resolution error. You can check them before digging into component logs.