Troubleshooting

This guide covers how to diagnose and resolve common vMetal failures. Start by identifying the state your server is in using BareMetalHost States, then jump to the matching section below.

The commands on this page use my-metal3-namespace and my-baremetalhost-name as placeholders. Replace them with your Metal3 namespace and BareMetalHost name.

Common failures

Stuck in `registering`

The BareMetalHost (BMH) stays in registering when the Bare Metal Operator (BMO) cannot reach the server's baseboard management controller (BMC).

BMC unreachable

Check BMO logs for connection errors:

kubectl logs statefulset/metal3 -n my-metal3-namespace -c metal3 | grep my-baremetalhost-name

Verify:

The BMC address in spec.bmc.address is reachable from the control plane cluster.
Firewall rules allow traffic from the Metal3 pod to the BMC port (UDP 623 for IPMI, TCP 443 for Redfish).
The Secret referenced by spec.bmc.credentialsName contains valid username and password keys.
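The checks above can be partly scripted. The following is a minimal sketch, not part of vMetal: bmc_host and check_bmc_secret are hypothetical helper names, the address formats shown in the comments are assumptions about what spec.bmc.address contains, and the host parsing does not handle bracketed IPv6 addresses.

```shell
# bmc_host <address>: print the host part of a BMC address such as
# "redfish://10.0.0.5/redfish/v1/Systems/1" or "ipmi://10.0.0.5:623".
bmc_host() {
  bh_addr="$1"
  bh_addr="${bh_addr#*://}"   # drop the scheme, if present
  bh_addr="${bh_addr%%/*}"    # drop the path
  bh_addr="${bh_addr%%:*}"    # drop the port (IPv4 and hostnames only)
  printf '%s\n' "$bh_addr"
}

# check_bmc_secret <namespace> <secret-name>: report whether the
# username and password keys exist and are non-empty.
check_bmc_secret() {
  cbs_rc=0
  for cbs_key in username password; do
    cbs_value="$(kubectl get secret "$2" -n "$1" -o "jsonpath={.data.$cbs_key}")"
    if [ -n "$cbs_value" ]; then
      echo "$cbs_key: present"
    else
      echo "$cbs_key: missing"
      cbs_rc=1
    fi
  done
  return "$cbs_rc"
}

# Usage (requires kubectl access to the control plane cluster):
#   addr="$(kubectl get baremetalhost my-baremetalhost-name \
#     -n my-metal3-namespace -o jsonpath='{.spec.bmc.address}')"
#   bmc_host "$addr"
#   check_bmc_secret my-metal3-namespace <secret-name>
```

The helper only confirms that the keys exist. It does not verify that the BMC accepts the credentials.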

Stuck in `inspecting`

The BMH stays in inspecting when the server cannot boot into the Ironic Python Agent (IPA) ramdisk and report back to Ironic. Work through these checks in order. Each one narrows the failure to a specific layer.

Check BMC console

The BMC console shows the server's boot sequence directly. Because it is out-of-band, it works even if the vMetal setup is misconfigured.

Access the console through the BMC web interface or using ipmitool:

ipmitool -I lanplus -H <bmc-address> -U <user> -P <pass> sol activate

You should see this sequence:

The server powers on or reboots (triggered by Metal3 over BMC).
The firmware attempts a network boot on the provisioning NIC.
The NIC gets a DHCP lease.
The server downloads a kernel and initramfs over TFTP or HTTP.

If the server does not power on or reboot, the problem is BMC access. Check the BMC address, network reachability from the Metal3 pod, and the credentials in the Secret referenced by spec.bmc.credentialsName.

If the server powers on but stalls during network boot, check the DHCP server logs next.
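Before opening a serial-over-LAN session, it can help to confirm that the BMC answers at all. The sketch below wraps ipmitool's chassis power status query. bmc_power_is_on is a hypothetical helper name, and the sketch assumes the BMC speaks IPMI over LAN and that the password is exported as IPMI_PASSWORD, which ipmitool reads when given the -E option.

```shell
# bmc_power_is_on <bmc-address> <user>
# Succeeds when the BMC reports that chassis power is on.
# The password comes from the IPMI_PASSWORD environment variable (-E),
# so it does not appear on the command line.
bmc_power_is_on() {
  ipmitool -I lanplus -H "$1" -U "$2" -E chassis power status \
    | grep -q "Chassis Power is on"
}

# Usage:
#   export IPMI_PASSWORD='<pass>'
#   if bmc_power_is_on <bmc-address> <user>; then
#     echo "server is powered on"
#   else
#     echo "server is off, or the BMC did not answer"
#   fi
```

A failure here points at BMC access rather than at network boot, which matches the first branch described above.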

Check DHCP server logs

The DHCP server logs every request and response it sees. It is the most targeted tool for diagnosing a stuck network boot.

kubectl logs -n my-metal3-namespace -l app=dhcp-proxy

The logs include:

DHCP requests from the server, including requests from MACs that do not match any registered BareMetalHost. These help identify a wrong spec.bootMACAddress.
DHCP responses sent back to the server.
TFTP requests for the bootloader.
HTTP requests for the kernel and initramfs images.

If no entries appear for the server, the DHCP broadcast is not reaching the pod. Check that the DHCP server is attached to the provisioning network using a Multus NetworkAttachmentDefinition, or that it is running with hostNetwork: true for flat networks.
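To narrow the DHCP log to one server, a small filter on the boot MAC can help. This is a sketch under assumptions: filter_by_mac is a hypothetical helper, and it assumes the log lines print MAC addresses with either ":" or "-" as the separator. Note that kubectl logs prints only the last 10 lines per pod when a label selector is used, so the usage example passes --tail=-1 to get the full log.

```shell
# filter_by_mac <mac>: print stdin lines that mention the MAC address,
# ignoring case and accepting ":" or "-" between octets.
filter_by_mac() {
  # aa:bb:cc:dd:ee:ff becomes the pattern aa[-:]bb[-:]cc[-:]dd[-:]ee[-:]ff
  fbm_pattern="$(printf '%s' "$1" | sed 's/[-:]/[-:]/g')"
  grep -i -e "$fbm_pattern"
}

# Usage:
#   mac="$(kubectl get baremetalhost my-baremetalhost-name \
#     -n my-metal3-namespace -o jsonpath='{.spec.bootMACAddress}')"
#   kubectl logs -n my-metal3-namespace -l app=dhcp-proxy --tail=-1 \
#     | filter_by_mac "$mac"
```

If the filter returns nothing but the unfiltered log shows requests from another MAC, spec.bootMACAddress is a likely suspect.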

Check ramdisk logs

The ramdisk-logs container tails IPA log files as they arrive. If no logs appear, the server never contacted Ironic.

kubectl logs statefulset/metal3 -n my-metal3-namespace -c ramdisk-logs

Check Ironic logs

kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic

Look for errors related to the server's MAC address or IPMI/Redfish operations.

Other causes

Wrong boot MAC. The spec.bootMACAddress field must match the NIC used for PXE boot. The DHCP server logs show requests from unmatched MACs, which makes this easy to spot.
IPA ramdisk not downloaded. The ramdisk-downloader init container fetches the IPA image on startup. A failed init container blocks the entire Metal3 pod. Run kubectl describe pod metal3-0 -n <namespace> and check the init container status.

Stuck in `provisioning`

The BMH stays in provisioning when Ironic cannot complete the OS installation.

Check Ironic logs

kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic

Common causes

Image URL unreachable. The IPA agent on the server downloads the OS image directly, so the URL must be reachable from the bare metal server, not just from the Metal3 pod. As a first check, test the URL from the Metal3 pod using an ephemeral debug container: kubectl debug -it metal3-0 -n <namespace> --image=curlimages/curl -- curl -I <image-url>. A successful response from the pod does not prove that the server's provisioning network can reach the URL.
Checksum mismatch. Ironic verifies the downloaded image against spec.image.checksum. If the checksum is wrong, provisioning fails with a checksum error in the Ironic logs. Recheck the checksum against the actual file.
Wrong rootDeviceHints. If a server has multiple disks, Ironic may install to the wrong one. Set spec.rootDeviceHints on the BareMetalHost to select the correct disk by size, WWN, or device name.
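To recheck a checksum against the actual file, you can hash a local copy of the image and compare. The sketch below assumes the checksum is SHA-256 and that sha256sum (GNU coreutils) is installed. checksum_matches is a hypothetical helper name. If your image uses a different checksum type, swap in the matching tool.

```shell
# checksum_matches <file> <expected-sha256>
# Prints "match" and succeeds when the file's SHA-256 equals the
# expected value (compared case-insensitively).
checksum_matches() {
  cm_actual="$(sha256sum "$1" | awk '{print $1}')"
  cm_expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  if [ "$cm_actual" = "$cm_expected" ]; then
    echo "match"
  else
    echo "mismatch: file has $cm_actual"
    return 1
  fi
}

# Usage:
#   curl -fLo image.img <image-url>
#   checksum_matches image.img <value-of-spec.image.checksum>
```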

Force a re-provision

If provisioning is stuck and you want Ironic to retry, delete the Machine from the vCluster Platform UI or API. The platform deprovisions the BMH and returns it to available. A new Machine request then triggers a fresh provisioning attempt.

Do not manually edit the BMH spec.image field while a Machine holds the claim. The platform manages that field and will reconcile it on the next cycle.

Stuck in `deprovisioning`

When a Machine is deleted, the platform removes the image reference from the BMH and sets it offline. The BMO then triggers a cleaning cycle that wipes the disk and returns the server to available.

What cleaning does

The BMO powers on the server via BMC, boots a cleaning ramdisk, and runs automated disk erasure steps. The BMC must remain reachable from Metal3 throughout this process. The cleaning ramdisk must also be able to reach Ironic and the DHCP proxy when booting.

Cleaning can take several minutes on servers with many disks or large disk capacities.

Check BMO and Ironic logs

kubectl logs statefulset/metal3 -n my-metal3-namespace -c metal3 | grep -i "clean\|deprovi"
kubectl logs statefulset/metal3 -n my-metal3-namespace -c ironic | grep -i "clean"
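Because cleaning can take a while, polling the BMH state is often more practical than rereading logs. The following sketch waits until the host reaches a target state or a timeout expires. wait_for_bmh_state and POLL_INTERVAL are hypothetical names, and the sketch assumes the state is exposed at .status.provisioning.state on the BareMetalHost.

```shell
# wait_for_bmh_state <bmh-name> <namespace> <state> [timeout-seconds]
# Polls the BMH until it reports <state>. Default timeout: 600 seconds.
wait_for_bmh_state() {
  wfs_deadline=$(( $(date +%s) + ${4:-600} ))
  while :; do
    wfs_state="$(kubectl get baremetalhost "$1" -n "$2" \
      -o jsonpath='{.status.provisioning.state}')"
    if [ "$wfs_state" = "$3" ]; then
      echo "reached $3"
      return 0
    fi
    if [ "$(date +%s)" -ge "$wfs_deadline" ]; then
      echo "timed out in state: ${wfs_state:-unknown}"
      return 1
    fi
    sleep "${POLL_INTERVAL:-10}"   # seconds between polls
  done
}

# Usage:
#   wait_for_bmh_state my-baremetalhost-name my-metal3-namespace available 1800
```

If the function times out while the state is still deprovisioning, the BMO and Ironic logs above are the next place to look.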

Server permanently unavailable

If a server is physically removed or permanently unavailable, delete the BareMetalHost resource. The platform detects the deletion and marks the associated NodeClaim as failed. The IP is released on the next reconcile cycle.

Error state

A BMH in error state has stopped retrying. Read the error details first.

kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.status.errorMessage}'
kubectl get baremetalhost my-baremetalhost-name -n my-metal3-namespace -o jsonpath='{.status.errorType}'

When to retry vs. when to recreate

Retry when the error is transient, such as a network blip or temporary BMC unresponsiveness. Annotate the BMH to trigger a retry:

kubectl annotate baremetalhost my-baremetalhost-name -n my-metal3-namespace \
baremetalhost.metal3.io/reboot='{"force":false}' --overwrite

Fix credentials and retry when the error is wrong BMC credentials. Update the Secret with the correct values. The BMO picks up the change on its next reconcile without reprovisioning the server. Then annotate the BMH to trigger a retry using the command above.
Abort and create a new Machine when the error is a checksum mismatch or image URL failure. The image URL and checksum are written to Ironic's database at provisioning time. Updating the OSImage resource does not affect an in-flight Machine. Delete the Machine from the vCluster Platform UI or API to abort provisioning. The platform deprovisions the BMH and returns it to available. Create a new Machine with corrected image configuration to restart. To avoid waiting for the full deprovisioning cycle, you can delete the Machine immediately after the BMH reaches deprovisioning.
Delete and recreate the BMH when the underlying hardware configuration is wrong, such as a bad BMC address or an incorrect boot MAC. Delete the BMH, then recreate it with the corrected values. The platform detects the missing resource on its next reconcile and marks the associated NodeClaim as failed.

IP allocation issues

vMetal allocates an IP address per Machine from a CIDR or explicit IP range configured on the NodeProvider. The platform stores allocations in a ConfigMap named ipam-default in the platform's management namespace (typically loft). If Machine creation fails with a failed to allocate IP error, inspect this ConfigMap.

"No available IPs" / CIDR exhausted

The error appears in the loft platform logs and in Machine events:

failed to allocate IP: CIDR exhausted; no free addresses left

or, for explicit IP ranges:

failed to allocate IP: all ranges exhausted; no free addresses left

Inspect current allocations

kubectl get configmap ipam-default -n loft -o yaml

The ConfigMap data is a flat map of <ip>: NodeClaim/<namespace>/<name> entries. Each entry is one allocated address.

Example output:

data:
  10.10.1.2: NodeClaim/loft-p-myproject/machine-abc
  10.10.1.3: NodeClaim/loft-p-myproject/machine-def

Leaked IPs from failed deprovision

If the platform deleted a Machine but did not release its IP (for example, after a crash during cleanup), the entry remains in the ConfigMap even though the corresponding NodeClaim no longer exists. Remove stale entries manually:

List all active NodeClaims:
```
kubectl get nodeclaims -A
```
Compare against the ConfigMap entries. Any entry whose NodeClaim no longer exists is stale.

Edit the ConfigMap to remove stale entries:

kubectl edit configmap ipam-default -n loft
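The comparison step can be scripted instead of done by eye. This is a minimal sketch: stale_allocations is a hypothetical helper, and the two input files are assumed to be produced by the commands shown in the usage comments. It only lists candidates. Review them before editing the ConfigMap.

```shell
# stale_allocations <allocations-file> <claims-file>
#   allocations-file lines: "<ip> NodeClaim/<namespace>/<name>"
#   claims-file lines:      "<namespace>/<name>"
# Prints every allocation whose NodeClaim is not in the claims file.
stale_allocations() {
  awk -v claims="$2" '
    BEGIN {
      # Load active NodeClaims as "<namespace>/<name>" keys.
      while ((getline line < claims) > 0) live[line] = 1
    }
    NF >= 2 {
      owner = $2
      sub(/^NodeClaim\//, "", owner)
      if (!(owner in live)) print $1, $2
    }
  ' "$1"
}

# Usage:
#   kubectl get configmap ipam-default -n loft \
#     -o go-template='{{range $ip, $owner := .data}}{{$ip}} {{$owner}}{{"\n"}}{{end}}' \
#     > allocations.txt
#   kubectl get nodeclaims -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
#     > claims.txt
#   stale_allocations allocations.txt claims.txt
```

Take both snapshots close together. A Machine created between the two commands could otherwise show up as a false positive.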

CIDR too small

If the number of Machines regularly exceeds the available address space, expand the IP range in the NodeProvider configuration. Set the range using the metal3.vcluster.com/network-cidr or metal3.vcluster.com/network-ip-range property on the NodeProvider or NodeType.
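To judge whether a CIDR is large enough, compare its size with the number of entries in the ipam-default ConfigMap. The arithmetic is sketched below for IPv4. cidr_size and free_addresses are hypothetical helpers, and the result is the raw block size: whether vMetal reserves any addresses in the block (network, broadcast, gateway) is not covered here, so treat the number as an upper bound.

```shell
# cidr_size <cidr>: total number of IPv4 addresses in the block,
# computed as 2^(32 - prefix length).
cidr_size() {
  cs_prefix="${1##*/}"
  echo $(( 1 << (32 - cs_prefix) ))
}

# free_addresses <cidr> <allocated-count>: block size minus the
# number of addresses already allocated.
free_addresses() {
  echo $(( $(cidr_size "$1") - $2 ))
}

# Usage: with the two allocations from the example output above
#   free_addresses 10.10.1.0/24 2
```

For example, a /24 holds 256 addresses and a /28 holds 16, so moving from /28 to /24 raises the upper bound by 240 Machines.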

Webhook readiness

The Metal3 ValidatingWebhook validates every BareMetalHost create and update. The webhook runs inside the metal3 container of the metal3 StatefulSet and serves on port 9443. The API server routes requests through the metal3-webhook-service Service.

The webhook configuration uses failurePolicy: Fail by default. If the webhook endpoint is unreachable when you create a BMH, the API server rejects the request with a connection error. The BMH is never created. Servers appear unregistered until the webhook comes up.

Check webhook readiness

Check whether the BMO container is ready:

kubectl get pod metal3-0 -n my-metal3-namespace -o jsonpath='{.status.containerStatuses[?(@.name=="metal3")].ready}'

Check the readiness probe directly:

kubectl exec -n my-metal3-namespace metal3-0 -c metal3 -- \
wget -qO- http://localhost:9440/readyz

A healthy response returns ok.

Check the ValidatingWebhookConfiguration for the current CA bundle and service endpoint:

kubectl get validatingwebhookconfiguration metal3-validating-webhook-configuration -o yaml
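Since the API server reaches the webhook through the metal3-webhook-service Service, it can also be worth checking that the Service has ready endpoints. The sketch below is an assumption-laden helper, not a vMetal command: webhook_endpoints is a hypothetical name, and it assumes the Service lives in the same namespace as the metal3 StatefulSet.

```shell
# webhook_endpoints <namespace>
# Prints the ready endpoint IPs behind metal3-webhook-service, or
# fails when the Service has none.
webhook_endpoints() {
  we_ips="$(kubectl get endpoints metal3-webhook-service -n "$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  if [ -z "$we_ips" ]; then
    echo "no ready endpoints"
    return 1
  fi
  echo "ready endpoints: $we_ips"
}

# Usage:
#   webhook_endpoints my-metal3-namespace
```

An empty result while the metal3 container reports ready suggests a Service selector or port mismatch rather than a failing BMO.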

Log locations

All Metal3 components run as containers inside the metal3 StatefulSet on the control plane cluster. The namespace matches the one configured in the NodeProvider's clusterRef.

| Component | Container | Log command |
| --- | --- | --- |
| Ironic | `ironic` | `kubectl logs statefulset/metal3 -n <namespace> -c ironic` |
| Bare Metal Operator | `metal3` | `kubectl logs statefulset/metal3 -n <namespace> -c metal3` |
| IPA ramdisk output | `ramdisk-logs` | `kubectl logs statefulset/metal3 -n <namespace> -c ramdisk-logs` |

Platform-level events

Machine and NodeClaim events appear on the management cluster. To see events for a specific Machine:

kubectl describe machine <machine-name> -n <namespace>

For NodeClaims:

kubectl get nodeclaims -A

kubectl describe nodeclaim <nodeclaim-name> -n <namespace>

These events report why vMetal could not claim a BMH, such as no available hosts, an IPAM failure, or an image resolution error. You can check them before digging into component logs.
