Tuesday 11 February 2014

Dealing with VMware Permanent Device Loss (PDL) on an HA cluster

I recently updated the firmware on our Dell EqualLogic SAN, which is used for VMware VMFS volumes, from 6.0.6 H2 to 7.0.1. During the update, a couple of our PS6110XV arrays refused to restart through the GUI, citing the error:

Array firmware update from version V6.0.6 to V7.0.1 failed. Reason: The array cannot be restarted using the GUI because of a RAID issue. Use the CLI 'restart' command to continue.

Some research indicated that this error was apparently benign, a result of a bug in firmware 6.0.6, so I should have no problems following its instructions to restart the array at the CLI. Unfortunately, things didn't go smoothly...

I normally keep a continuous ping going to the array's network addresses whilst restarting it, to make sure it comes back. It typically drops 4-5 pings during a controller failover, indicating that the array is off the network for 15-20 seconds, so I started getting concerned when it still wasn't responding after a minute. It finally returned to service after nearly five minutes.
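In case it's useful, the monitoring itself is nothing clever - something like this from a Linux or Mac management box (the address is a placeholder for your array's group or member IP; on Windows, ping -t does much the same job):

# Timestamped continuous ping, so you can see exactly how long the array
# is off the network during the restart. 10.0.0.10 is a placeholder IP.
ping 10.0.0.10 | while read reply; do
    echo "$(date '+%H:%M:%S') $reply"
done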

During this time, I was keeping an eye on our ESXi hosts to make sure they were behaving, with fingers crossed that there was little enough I/O going on that they wouldn't notice their iSCSI volumes disappearing from under them. Sadly, it was not to be - a number of hosts noticed that a selection of their volumes had vanished, and after a while these were marked as 'Failed / Offline'. /var/log/hostd.log on the affected hosts showed a number of errors along the lines of:

Device naa.6090a0c800b6fd7093250547a20180db has been removed or is permanently inaccessible. Affected datastores (if any): "storage3".
NotifyIfAccessibleChanged -- notify that datastore 52593748-41a1a7ae-6bfc-b8ca3a647bc0 at path /vmfs/volumes/52593748-41a1a7ae-6bfc-b8ca3a647bc0 now has accessibility of false due to PermanentDeviceLoss
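A quick way of confirming that a host has hit PDL rather than a transient path problem is to grep hostd.log for these messages; the device state can also be checked with esxcli. Something like the following over SSH on the host, where the naa ID is the one taken from the log above (substitute your own):

# Look for PDL messages in the host daemon log:
grep -E 'permanently inaccessible|PermanentDeviceLoss' /var/log/hostd.log

# Check the state of the device named in the log - once it's in PDL, the
# Status field should show it as dead rather than on:
esxcli storage core device list -d naa.6090a0c800b6fd7093250547a20180db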

VMware have a KB article on how to recover from this very situation. Unfortunately, step 1 in their instructions indicates that all VMs on the affected datastore must be powered off and unregistered from vCenter, which is pretty drastic when you're talking about 30-40 VMs! As it happens, the affected hosts in my case were in an HA cluster, so I was able to cheat by following a different process...

  1. In the vSphere client, each affected VM will be marked with a speech bubble question icon. Click each one in turn. You'll be told its backing storage has gone away and asked if you want to Retry or Cancel I/O. You should Cancel. The VM will grey out and be marked as inaccessible.
  2. Log on to the host that was previously running the VM via SSH.
  3. Run vim-cmd vmsvc/getallvms. Look for VMs you know to be affected and any VMs listed as invalid, and note their VM IDs (the commands for steps 3-6 are gathered into a sketch after this list).
  4. For each one in turn, run vim-cmd vmsvc/unregister <vmid>, inserting the appropriate VM ID from step 3.
  5. If any unregister operation complains that the VM is powered on, you may be able to run vim-cmd vmsvc/power.off <vmid>, then try unregistering again. If this doesn't help, you'll need to find the VM's world ID by running esxcli vm process list, then force it to end by running esxcli vm process kill --type=force --world-id=<World ID>. You should then be able to unregister it.
  6. Once you've unregistered all the VMs on a datastore, the host should eventually reconnect it. If it doesn't, you may need to force a rescan, either through the vSphere client or by running esxcfg-rescan -A; if the GUI rescan doesn't work, the CLI version is at least more likely to give you an error to work from. I had issues with a couple of datastores where I couldn't unregister some templates, and ended up having to remove those from the vCenter inventory and add them back.
  7. Watch as HA automatically restarts all unregistered VMs on any hosts that still have access to the affected datastores.
You'll probably need to deal with filesystem corruption on a number of VMs, but at least you haven't had to re-add them all to the inventory by hand!
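For reference, the command-line side of steps 3-6 boils down to something like this; the VM IDs and World IDs are placeholders to be filled in from the output of the list commands:

# Run over SSH on the affected host.
vim-cmd vmsvc/getallvms                                     # step 3: note the Vmid of each affected or invalid VM
vim-cmd vmsvc/unregister <vmid>                             # step 4: remove it from the host's inventory

# Step 5, only needed if unregistering complains that the VM is powered on:
vim-cmd vmsvc/power.off <vmid>
esxcli vm process list                                      # find the World ID if power.off doesn't help
esxcli vm process kill --type=force --world-id=<World ID>
vim-cmd vmsvc/unregister <vmid>                             # then unregister again

# Step 6: force a rescan if the datastore doesn't come back on its own.
esxcfg-rescan -A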

Once you've finished, you should read this post on Josh Odgers' blog for details of a couple of advanced settings you should configure to allow the cluster to recover from this automatically in the future, and this post on Yellow-Bricks for details of a change to the option you need to configure for automatic recovery in ESXi 5.5.
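For quick reference - and do check the linked posts for the authoritative detail, as the names below are from memory and vary slightly between versions - the settings in question are along these lines:

# ESXi 5.0 U1 / 5.1: per-host setting in /etc/vmware/settings, telling the
# host to terminate VMs whose storage has gone into PDL:
disk.terminateVMOnPDLDefault = True

# HA cluster advanced option in vCenter, so that HA will restart VMs that
# were terminated in response to a PDL:
das.maskCleanShutdownEnabled = True

# ESXi 5.5: the per-host setting above is replaced by the advanced system
# setting VMkernel.Boot.terminateVMOnPDL - this is the change covered in
# the Yellow-Bricks post.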

When VDS portgroups escape...

Earlier today, I was attempting to move a couple of vSphere Distributed Switches into folders in my vCenter Networking inventory. One moved successfully, but when I tried to move the other, I was greeted with the error, "These vDSes failed to move." Descriptive, right?

Unfortunately, whilst the vDS itself didn't move, some of the port groups it contained did. They went from being child nodes of the parent vDS to being child nodes of the folder I was trying to move the vDS into. Getting them back under their parent vDS turned out to require some slight hacking of the vCenter database, so I'm documenting what I did here in case it's helpful to anyone else.

To fix the problem, I pointed SQL Server Management Studio at the vCenter database and took a look at the [VPX_ENTITY] table. This contains a row for every item in the vCenter inventory, together with a type ID and a parent ID. To find all vDS port groups, the SQL query you need is:

SELECT [ID], [NAME], [PARENT_ID] FROM [dbo].[VPX_ENTITY] WHERE [TYPE_ID] = 15

Take a look at the entities returned by this and compare one of the port groups still correctly shown under the vDS with one of the escapees; you'll find that the PARENT_ID differs between the two. If you query the table for the row whose ID equals the PARENT_ID you've identified as correct, you'll find it corresponds to the folder containing the vDS, not to the vDS itself; it looks like the way the port groups are displayed under the vDS is handled by the vSphere client rather than something that happens by virtue of their hierarchy in the database.
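For example, taking 264 as a made-up stand-in for the PARENT_ID of a port group that's still correctly placed, a quick look at the parent row makes the point:

-- 264 is a placeholder; use the PARENT_ID of a port group that is still
-- correctly shown under the vDS.
SELECT [ID], [NAME], [TYPE_ID], [PARENT_ID]
FROM [dbo].[VPX_ENTITY]
WHERE [ID] = 264
-- The row returned is the folder containing the vDS, not the vDS itself.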

You'll need to run a SQL UPDATE query to fix the rows with the wrong PARENT_ID. If all of the escaped port groups have ended up in the same folder, you can just note the correct and incorrect PARENT_ID values and run something like

UPDATE [dbo].[VPX_ENTITY] SET [PARENT_ID] = <CORRECT_PARENT_ID> WHERE [TYPE_ID] = 15 AND [PARENT_ID] = <INCORRECT_PARENT_ID>

substituting the correct and incorrect parent IDs as appropriate. If not, make a list of the escaped port groups' IDs and run something like

UPDATE [dbo].[VPX_ENTITY] SET [PARENT_ID] = <CORRECT_PARENT_ID> WHERE [ID] IN (<id1>, <id2>, <id3>, ...)

substituting the correct parent ID and entity IDs as appropriate. Once you've done this, you'll need to restart the VMware VirtualCenter service on your vCenter server.
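Given that this is unsupported surgery on the vCenter database, it's worth taking a backup first; you could also wrap the update in a transaction so you can sanity-check the number of rows affected before committing - something like:

BEGIN TRANSACTION

UPDATE [dbo].[VPX_ENTITY]
SET [PARENT_ID] = <CORRECT_PARENT_ID>
WHERE [TYPE_ID] = 15 AND [PARENT_ID] = <INCORRECT_PARENT_ID>

-- This should match the number of escaped port groups:
SELECT @@ROWCOUNT AS [RowsUpdated]

-- COMMIT TRANSACTION     -- if the count looks right
-- ROLLBACK TRANSACTION   -- if it doesn't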

By the way, the root cause of the original problem turned out to be that one of the existing vDSes in the destination folder had a port group with the same name as a port group on the vDS I was moving. As we've seen from the database structure, the port group entities are children of the vDS's parent folder, not of the vDS itself, so I'm surmising that the client got part-way through moving the port groups of the vDS being moved, hit a name collision with a port group on the existing vDS in the folder, and failed to clean up after itself properly!
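As a postscript, if you want to check for this kind of collision before moving a vDS in future, a query along these lines will show any clashing port group names. The IDs are placeholders (look them up in [VPX_ENTITY] first), and bear in mind it will also pick up port groups belonging to any other vDSes that share the source folder:

-- 100 = ID of the destination folder, 200 = ID of the folder currently
-- containing the vDS you want to move (both placeholder values).
SELECT [NAME] FROM [dbo].[VPX_ENTITY] WHERE [TYPE_ID] = 15 AND [PARENT_ID] = 100
INTERSECT
SELECT [NAME] FROM [dbo].[VPX_ENTITY] WHERE [TYPE_ID] = 15 AND [PARENT_ID] = 200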