Tuesday, 5 August 2014

Office 365 / Exchange Online hybrid configuration and moderated distribution groups

We're testing out Office 365 in hybrid configuration with our on-premises Exchange 2010 organization, using Exchange 2013 hybrid servers in order to give us access to on-premises public folders from Exchange Online mailboxes. We've hit a few teething problems along the way, one of which was getting moderated distribution groups to work between Exchange Online and on-premises Exchange.

I had to make a number of adjustments to the configuration put in place by the hybrid configuration wizard (HCW) to get this up and running, so I thought I'd document what I did in case anyone else finds it useful. Throughout, placeholder values such as TenantName and domainname.tld need to be replaced with appropriate values for your organization.

On-premises changes

All changes were made using the Exchange Management Shell.

Create a remote domain for the actual tenant domain

The HCW creates a remote domain in the on-premises organization for the hybrid coexistence domain (TenantName.mail.onmicrosoft.com), but emails regarding moderated groups come from an arbitration mailbox in the tenant with email address SystemMailbox{bb558c35-97f1-4cb9-8ff7-d53741dc928c}@TenantName.onmicrosoft.com, so I had to add an additional remote domain:

New-RemoteDomain -Name "Hybrid Domain - TenantName.onmicrosoft.com" -DomainName tenantname.onmicrosoft.com

Update hybrid remote domain settings

I'd manually created the on-premises connector for our hybrid coexistence domain, so it didn't have the required settings to allow things like internal out-of-office messages to be passed. I updated the settings for both Exchange Online domains. Note that the option to enable TNEF is crucial - without this, you won't get the voting buttons on emails from the tenant, which breaks approval altogether!

Set-RemoteDomain -Identity "Hybrid*" -IsInternal $true -TargetDeliveryDomain $true -AllowedOOFType InternalLegacy -MeetingForwardNotificationEnabled $true -TrustedMailOutboundEnabled $true -TrustedMailInboundEnabled $true -UseSimpleDisplayName $true -TNEFEnabled $true

Update Office 365 send connector

The last on-premises change was to update the Office 365 send connector to include the tenant domain, otherwise responses to moderation requests were NDR'd with an 'Authentication Required' message.

Set-SendConnector "Outbound to Office 365" -AddressSpaces @{Add="tenantname.onmicrosoft.com"}

Exchange Online changes

All changes were made using PowerShell connected to the Exchange Online provider.

Create a remote domain for each hybrid domain

It's necessary to create a remote domain in the tenant for each SMTP domain included in a ProxyAddress in your on-premises organization. You can see which domains Exchange Online is aware of using Get-AcceptedDomain; for each one that's not a tenant or coexistence domain, you'll need to run the following, substituting each of your domains in turn.

New-RemoteDomain -Name "Hybrid Domain - domainname.tld" -DomainName domainname.tld

Update hybrid remote domain settings

Finally, you'll need to update the Exchange Online remote domains in the same way that you updated the on-premises domains. Conveniently, if you created them as suggested above, you can run the exact same PowerShell command in Exchange Online. Again, TNEFEnabled is crucial to get the voting buttons displayed - if you leave it at the default of $null, messages get converted to HTML and the voting buttons are lost.

Set-RemoteDomain -Identity "Hybrid*" -IsInternal $true -TargetDeliveryDomain $true -AllowedOOFType InternalLegacy -MeetingForwardNotificationEnabled $true -TrustedMailOutboundEnabled $true -TrustedMailInboundEnabled $true -UseSimpleDisplayName $true -TNEFEnabled $true

At this point, everything should spring into life. Test by emailing an on-premises moderated group from an Exchange Online mailbox and see if you get a moderation message with voting buttons that you can approve.

Tuesday, 11 February 2014

Dealing with VMware Permanent Device Loss (PDL) on an HA cluster

I recently updated the firmware on our Dell EqualLogic SAN, which is used for VMware VMFS volumes, from 6.0.6 H2 to 7.0.1. During the update, a couple of our PS6110XV arrays refused to restart through the GUI, citing error:

Array firmware update from version V6.0.6 to V7.0.1 failed. Reason: The array cannot be restarted using the GUI because of a RAID issue. Use the CLI 'restart' command to continue.

Some research indicated that this error was apparently benign, a result of a bug in firmware 6.0.6, so I should have no problems following its instructions to restart the array at the CLI. Unfortunately, things didn't go smoothly...

I normally have a continuous ping going to the array's network addresses whilst restarting to make sure it comes back. As a rule, it typically drops 4-5 pings during a controller failover, indicating that the array is off the network for 15-20 seconds. I therefore started getting concerned when it still wasn't responding to ping after a minute. It finally returned to service after nearly five minutes.

During this time, I was keeping an eye on our ESXi hosts to make sure they were behaving, keeping my fingers crossed that there was sufficiently little I/O going on that they wouldn't notice their iSCSI volumes disappearing from under them. Sadly, it was not to be - a number of hosts noticed that a selection of their volumes had vanished. After a while, they were marked as 'Failed / Offline'. /var/log/hostd.log on the affected hosts showed a number of errors along the lines of:

Device naa.6090a0c800b6fd7093250547a20180db has been removed or is permanently inaccessible. Affected datastores (if any): "storage3".
NotifyIfAccessibleChanged -- notify that datastore 52593748-41a1a7ae-6bfc-b8ca3a647bc0 at path /vmfs/volumes/52593748-41a1a7ae-6bfc-b8ca3a647bc0 now has accessibility of false due to PermanentDeviceLoss
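
If you want a quick list of which datastores a host thinks it has lost, something like the following rough sketch can pull the names out of the log. It assumes the message format shown above (a single quoted datastore name after "Affected datastores (if any):") and that sed and sort are available in the host's shell; the function name pdl_datastores is my own invention, not a VMware tool.

```shell
#!/bin/sh
# Hypothetical helper: list the datastore names mentioned in PDL messages,
# one per line, de-duplicated. Reads the named log files, or standard input
# if no files are given. Only the first quoted name on each line is captured.
pdl_datastores() {
    sed -n 's/.*permanently inaccessible\. Affected datastores (if any): "\([^"]*\)".*/\1/p' "$@" | sort -u
}

# Example, on an affected host:
#   pdl_datastores /var/log/hostd.log
```

Lines that don't match the pattern are ignored, so it's safe to point it at the whole log.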

VMware have a KB article on how to recover from this very situation. Unfortunately, step 1 in their instructions indicates that all VMs on the affected datastore must be powered off and unregistered from vCenter, which is pretty drastic when you're talking about 30-40 VMs! As it happens, the affected hosts in my case were in an HA cluster, so I was able to cheat by following a different process...

  1. In the vSphere client, each affected VM will be marked with a speech bubble question icon. Click each one in turn. You'll be told its backing storage has gone away and asked if you want to Retry or Cancel I/O. You should Cancel. The VM will grey out and be marked as inaccessible.
  2. Log onto the host that was previously running the VM by SSH.
  3. Run vim-cmd vmsvc/getallvms. Look for VMs you know to be affected and any VMs listed as invalid and note their VM ID.
  4. For each one in turn, run vim-cmd vmsvc/unregister <vmid>, inserting the appropriate VM ID from step 3.
  5. If any unregister operation complains that the VM is powered on, you may be able to run vim-cmd vmsvc/power.off <vmid>, then try unregistering again. If this doesn't help, you'll need to find the VM's world ID by running esxcli vm process list, then force it to end by running esxcli vm process kill --type=force --world-id=<World ID>. You should then be able to unregister it.
  6. Once you've unregistered all VMs on a datastore, the host will eventually reconnect to it. If it doesn't, you may need to force a rescan, either through the vSphere client or by running esxcfg-rescan -A. If the GUI rescan doesn't work, try the CLI, which is more likely to report an error explaining why. I had issues with a couple of datastores where I couldn't unregister some templates, so I had to remove these from the vCenter inventory and add them back.
  7. Watch as HA automatically restarts all unregistered VMs on any hosts that still have access to the affected datastores.
You'll probably need to deal with filesystem corruption on a number of VMs, but at least you haven't had to re-add them all to the inventory by hand!
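
With 30-40 VMs, picking IDs out of the getallvms listing by eye gets tedious. Steps 3 and 4 can be roughly sketched as a small filter like the one below. It assumes that getallvms prints one row per VM, starting with the numeric VM ID and containing the datastore name in square brackets, and that invalid VMs are reported as lines like Skipping invalid VM '12'. Check this against your own output before trusting it. The function name affected_vmids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `vim-cmd vmsvc/getallvms` output on stdin and
# print the IDs of VMs whose files live on the datastore named in $1, plus
# the IDs of any VMs reported as invalid.
affected_vmids() {
    awk -v ds="[$1]" '
        # Assumed format for invalid VMs: Skipping invalid VM <quoted id>
        /^Skipping invalid VM/ {
            id = $NF
            gsub(/[^0-9]/, "", id)      # strip the surrounding quotes
            if (id != "") print id
            next
        }
        # Normal rows start with a numeric VM ID and name "[datastore] path"
        $1 ~ /^[0-9]+$/ && index($0, ds " ") { print $1 }
    '
}

# Example, on an affected host (review the ID list before unregistering!):
#   vim-cmd vmsvc/getallvms | affected_vmids storage3 | while read -r id; do
#       vim-cmd vmsvc/unregister "$id"
#   done
```

Multi-line annotations in the listing could in principle confuse the filter, so it's worth printing the IDs and sanity-checking them first rather than piping straight into unregister.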

Once you've finished, you should read this post on Josh Odgers' blog for details of a couple of advanced settings you should configure to allow the cluster to automatically recover from this in the future, and this post on Yellow-Bricks for details of a change to the option you need to configure for automatic recovery in ESXi 5.5.

When VDS portgroups escape...

Earlier today, I was attempting to move a couple of vSphere Distributed Switches into folders in my vCenter Networking inventory. One moved successfully, but when I tried to move the other, I was greeted with the error, "These vDSes failed to move." Descriptive, right?

Unfortunately, whilst the vDS itself didn't move, some of the portgroups it contained did. They went from being child nodes of the parent vDS to being child nodes of the folder I was trying to move the vDS into. Getting them back under their parent vDS turned out to require some slight hacking of the vCenter database, so I'm documenting what I did here in case it's helpful to anyone else.

To fix the problem, I pointed SQL Server Management Studio at the vCenter database and took a look at the [VPX_ENTITY] table. This contains a row for every item in the vCenter inventory, together with a type ID and a parent. To find all vDS port groups, the SQL query you need is:

SELECT [ID], [NAME], [PARENT_ID] FROM [dbo].[VPX_ENTITY] WHERE [TYPE_ID] = 15

Take a look at the entities returned by this and compare one of the port groups that is still correctly under the vDS with one of the escapees; you'll find that the PARENT_ID differs between the two. If you query the table for the row whose ID equals the PARENT_ID you've identified as correct, you'll find it corresponds to the folder containing the vDS, not to the vDS itself. It looks like displaying the port groups under the vDS is something handled by the vSphere client, rather than something that happens by virtue of their hierarchy in the database.

You'll need to run a SQL UPDATE query to fix the rows with the wrong PARENT_ID. If all of the escaped port groups are in the same folder, you could just make a note of the corresponding PARENT_ID and run something like

UPDATE [dbo].[VPX_ENTITY] SET [PARENT_ID] = <CORRECT_PARENT_ID> WHERE [TYPE_ID] = 15 AND [PARENT_ID] = <INCORRECT_PARENT_ID>

updating the correct and incorrect parent IDs as appropriate. If not, make a list of the IDs and run something like

UPDATE [dbo].[VPX_ENTITY] SET [PARENT_ID] = <CORRECT_PARENT_ID> WHERE [ID] IN (id1, id2, id3...)

updating the correct parent ID and entity IDs as appropriate. Once you've done this, you'll need to restart the VMware VirtualCenter service on your vCenter server.

By the way, the root cause of the original problem turned out to be that one of the existing vDSes in the destination folder had a port group with the same name as a port group on the vDS I was moving. As we've seen from the database structure, the port group entities are children of the vDS's parent folder, not of the vDS itself. I'm surmising that the client got part-way through moving the port groups of the moving vDS, hit a name collision with a port group on the existing vDS in the folder, and failed to clean up after itself properly!