Tuesday 11 February 2014

Dealing with VMware Permanent Device Loss (PDL) on an HA cluster

I recently updated the firmware on our Dell EqualLogic SAN, which hosts our VMware VMFS volumes, from 6.0.6 H2 to 7.0.1. During the update, a couple of our PS6110XV arrays refused to restart through the GUI, citing this error:

Array firmware update from version V6.0.6 to V7.0.1 failed. Reason: The array cannot be restarted using the GUI because of a RAID issue. Use the CLI 'restart' command to continue.

Some research suggested that this error was benign (the result of a bug in firmware 6.0.6) and that I could safely follow its instructions and restart the array at the CLI. Unfortunately, things didn't go smoothly...

I normally keep a continuous ping running against the array's network addresses whilst restarting it, to make sure it comes back. It typically drops 4-5 pings during a controller failover, meaning the array is off the network for 15-20 seconds, so I started getting concerned when it still wasn't responding after a minute. It finally returned to service after nearly five minutes.
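
For the curious, the monitoring is nothing clever - just ping with a timestamp bolted on so the length of the outage is obvious afterwards. A minimal sketch, run from a Linux admin box rather than the array itself, with a placeholder address standing in for the array's real IP:

# One line per second: a timestamp plus whether the array answered a single ping
# (placeholder address; -W 1 caps the wait for each reply at one second)
while true; do
  echo "$(date '+%H:%M:%S') $(ping -c 1 -W 1 10.1.1.10 >/dev/null 2>&1 && echo up || echo DOWN)"
  sleep 1
done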

During this time I was keeping an eye on our ESXi hosts to make sure they were behaving, with my fingers crossed that there was little enough I/O going on that they wouldn't notice their iSCSI volumes disappearing from under them. Sadly, it was not to be - several hosts noticed that some of their volumes had vanished, and after a while the affected datastores were marked as 'Failed / Offline'. /var/log/hostd.log on those hosts showed a number of errors along the lines of:

Device naa.6090a0c800b6fd7093250547a20180db has been removed or is permanently inaccessible. Affected datastores (if any): "storage3".
NotifyIfAccessibleChanged -- notify that datastore 52593748-41a1a7ae-6bfc-b8ca3a647bc0 at path /vmfs/volumes/52593748-41a1a7ae-6bfc-b8ca3a647bc0 now has accessibility of false due to PermanentDeviceLoss
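
Before touching anything, it's worth confirming what the host actually thinks has happened. Something along these lines (reusing the device ID and log path from the errors above; the exact output wording varies between ESXi builds) shows the state of the device and which datastores sit on top of it:

# Look for PDL messages in the host daemon log
grep -i "permanently inaccessible" /var/log/hostd.log

# Check the state of the device named in the error; a device in PDL typically
# shows up as dead rather than merely unreachable
esxcli storage core device list -d naa.6090a0c800b6fd7093250547a20180db

# Map VMFS datastores to their backing devices to see what else is affected
esxcli storage vmfs extent list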

VMware have a KB article on how to recover from this very situation. Unfortunately, step 1 of their instructions is to power off every VM on the affected datastore and unregister it from vCenter, which is pretty drastic when you're talking about 30-40 VMs! As it happens, the affected hosts in my case were in an HA cluster, so I was able to cheat by following a different process...

  1. In the vSphere client, each affected VM will be marked with a speech-bubble question icon. Click each one in turn; you'll be told its backing storage has gone away and asked whether you want to Retry or Cancel the I/O. Choose Cancel - the VM will grey out and be marked as inaccessible.
  2. Log onto the host that was previously running the VM via SSH.
  3. Run vim-cmd vmsvc/getallvms, look for the VMs you know to be affected plus any listed as invalid, and note their VM IDs (steps 3-6 are strung together in the sketch after this list).
  4. For each one in turn, run vim-cmd vmsvc/unregister <vmid>, inserting the appropriate VM ID from step 3.
  5. If any unregister operation complains that the VM is powered on, you may be able to run vim-cmd vmsvc/power.off <vmid>, then try unregistering again. If this doesn't help, you'll need to find the VM's world ID by running esxcli vm process list, then force it to end by running esxcli vm process kill --type=force --world-id=<World ID>. You should then be able to unregister it.
  6. Once you've unregistered all of the VMs on a datastore, the host should eventually bring it back online. If it doesn't, you may need to force a rescan, either through the vSphere client or by running esxcfg-rescan -A; if the GUI rescan doesn't seem to do anything, the CLI is more likely to give you an error to work from. I had issues with a couple of datastores where I couldn't unregister some templates, and ended up having to remove those from the vCenter inventory and add them back.
  7. Watch as HA automatically restarts all unregistered VMs on any hosts that still have access to the affected datastores.
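
For reference, here is the whole of steps 3-6 strung together as run from the SSH session - just the commands already given above, with <vmid> and <World ID> standing in for the values you noted:

# Step 3: list registered VMs and note the IDs of affected/invalid ones
vim-cmd vmsvc/getallvms

# Step 4: unregister each affected VM (repeat per VM ID)
vim-cmd vmsvc/unregister <vmid>

# Step 5: if the unregister complains the VM is still powered on...
vim-cmd vmsvc/power.off <vmid>

# ...and if that fails too, find its world ID and kill it outright
esxcli vm process list
esxcli vm process kill --type=force --world-id=<World ID>

# Step 6: force a storage rescan if the datastore doesn't come back by itself
esxcfg-rescan -A
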
You'll probably need to deal with filesystem corruption on a number of VMs, but at least you haven't had to re-add them all to the inventory by hand!

Once you've finished, you should read this post on Josh Odgers' blog for details of a couple of advanced settings you should configure so the cluster can recover from this automatically in future, and this post on Yellow-Bricks for details of a change to the relevant option in ESXi 5.5.
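
For the record, these are my notes on the settings those two posts cover - but do check the posts themselves, as the names and mechanism changed between releases and I'm quoting them from memory:

# Per host (pre-5.5): Disk.terminateVMOnPDLDefault = True, set in
# /etc/vmware/settings, so the host kills VMs whose storage goes into PDL
#
# Per HA cluster (advanced options): das.maskCleanShutdownEnabled = True,
# so HA treats those killed VMs as failed rather than cleanly shut down
# and restarts them elsewhere
#
# ESXi 5.5: the host-side option moves to the advanced setting
# VMkernel.Boot.terminateVMOnPDL (the Yellow-Bricks post covers this change)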
