Hello,
We are currently migrating from Exchange 2007 to Exchange 2013; coexistence has been implemented and 20% of our mailboxes have been migrated.
In the past week or so I have had two occurrences where migration batches containing a large number of small mailboxes appear to have caused one or all of the Exchange 2013 servers to fall over. These batches are all started through PowerShell from a CSV containing each mailbox's primary email address and target database, so that a single batch can target multiple databases:
New-MigrationBatch -Local -Name $BatchName -CSVData ([System.IO.File]::ReadAllBytes($CSV)) -BadItemLimit 100 -NotificationEmails $AdminEmail -AutoStart
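For context, the surrounding script looks roughly like this; the variable values and CSV contents below are placeholder examples rather than our real names:

$BatchName  = "Wave5-SmallMailboxes"               # example batch name
$CSV        = "C:\Migration\Wave5.csv"             # example path to the batch CSV
$AdminEmail = "exchange-admins@ourdomain.local"    # example notification address

# CSV layout (one row per mailbox; TargetDatabase lets one batch spread over several databases):
#   EmailAddress,TargetDatabase
#   user1@ourdomain.local,EX13-DB01
#   user2@ourdomain.local,EX13-DB02

New-MigrationBatch -Local -Name $BatchName `
    -CSVData ([System.IO.File]::ReadAllBytes($CSV)) `
    -BadItemLimit 100 -NotificationEmails $AdminEmail -AutoStart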
In addition, the concurrent mailbox move limit has been left at the default of 20. In both occurrences of this issue the batches contained target databases on three Exchange 2013 servers, meaning, as I understand it, we can have up to 60 synchronisations in progress at any one time during the batch.
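If it is relevant, the per-server MRS limits can be double-checked along these lines; the config path and key names below are my assumptions based on the defaults, so corrections are welcome:

# Read the Mailbox Replication Service throttling settings from its config file on each 2013 server
$cfg = Join-Path $env:ExchangeInstallPath "Bin\MsExchangeMailboxReplication.exe.config"
Select-String -Path $cfg -Pattern "MaxActiveMovesPerTargetServer|MaxActiveMovesPerTargetMDB|MaxTotalRequestsPerMRS"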
The initial occurrence of this was a migration batch of 408 users, all of whom have small mailboxes, so the entire batch totalled only 43GB. Roughly two hours after the batch had begun its initial sync, our service desk began to receive reports of mail delays. On investigation it appeared that the submission queue on one of the three target servers was backing up with messages that were unable to connect to the target databases on that server for delivery. Worried that the migration batch was the cause, we stopped the job, and within about 15 minutes everything had returned to normal. The batch was then deleted, split into three separate batches of roughly 130 users each based on target server, and re-run to identify whether this was an issue with the target server that had the problem; however, all three completed without issue when run separately.
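For reference, the queue backlog was visible with something like this (the server name is a placeholder):

# Show the largest queues on a target server, including the submission queue
Get-Queue -Server "EX13-01" |
    Sort-Object MessageCount -Descending |
    Format-Table Identity, DeliveryType, Status, MessageCount, LastError -AutoSize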
The second occurrence of the issue has been far more severe. In this case the batch was 120 mailboxes (again all small, totalling 17GB for the entire batch), as we had drawn the conclusion from the previous issue that smaller batches were better. This time, roughly an hour after the start of the synchronisation, all three target servers began to become unresponsive in varying degrees:
- Users on all three servers were disconnected from Outlook
- One would not load ECP, however this degraded to none loading ECP as time went on
- SMTP continued to process initially, however this gradually began to fail on each server
- Exchange Management Shell would not load on two servers, the remaining server would hang processing any EMS commands
- One of the three would not accept any new RDP connections and the majority of applications would not run
- All three, however, showed no noticeable problems from a resource point of view; CPU, memory and disk latency were all normal.
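In case the cause shows up there next time, the per-server component and managed availability state can be snapshotted with something along these lines (the server name is a placeholder):

# Capture which Exchange 2013 components and health sets are degraded on a target server
$srv = "EX13-01"   # placeholder server name
Get-ServerComponentState -Identity $srv | Format-Table Component, State -AutoSize
Get-HealthReport -Identity $srv | Where-Object { $_.AlertValue -ne "Healthy" } |
    Format-Table HealthSet, AlertValue, LastTransitionTime -AutoSize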
Based on the experience of the previous issue, the first thing we did was stop the suspected migration batch; however, up until the point where ECP and EMS stopped functioning, none of the move requests went into a Stopping or Suspended state, and in turn this had no corrective impact on the issue.
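For clarity, this is the kind of thing we ran to stop the batch and then see what state the underlying move requests were actually in (the batch name is a placeholder):

# Stop the suspected batch, then summarise the state of its in-progress move requests
Stop-MigrationBatch -Identity "Wave5-SmallMailboxes"   # placeholder batch name

Get-MoveRequest -MoveStatus InProgress |
    Get-MoveRequestStatistics |
    Group-Object StatusDetail |
    Format-Table Name, Count -AutoSize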
On the surface it initially appeared that IIS was unhappy on all three target servers; an iisreset, however, had no impact.
We took the view that restarting the worst-impacted server was the only course of action for that device. This reboot took a lot longer than normal but did restore connections to mailboxes on that server, so the other more severely impacted server was also rebooted.
During these reboots the Exchange Search service was stopped on the least-impacted server. This led to EMS commands completing, and a manual suspension of the move requests was done. That server, however, continued to be unable to offer any client connectivity or access to ECP, so it ended up being rebooted once the others had returned.
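In case the exact steps matter, the service stop and suspension were roughly as follows; MSExchangeFastSearch is the 2013 search service name as I understand it:

# Stop the search service on the least-impacted server, then suspend in-progress moves
Stop-Service -Name MSExchangeFastSearch   # "Microsoft Exchange Search"

Get-MoveRequest -MoveStatus InProgress | Suspend-MoveRequest -Confirm:$false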
I have concerns now, as I am unable to track down why this issue happened. My suspicion is that the number of frequent and concurrent move requests doing their initial sync on such small mailboxes is causing one of the transport services to go into a tailspin and take other services out along the way; that said, no services crashed and there was no unusually high resource usage from any of the Exchange services during these events. I have also been toying with the idea that it may be related to the indexing of mailboxes as they drop into a 'Synced' state, and the number of indexing jobs running based on how quickly the mailboxes are syncing. That would explain the delay between the batch starting and the symptoms appearing, and why stopping the Search service seemed to somewhat alleviate some of the symptoms. If this were the case, however, I would have thought noderunner.exe would have been chewing up CPU permanently, whereas it only appeared to spike up the resource tables intermittently during the course of the problems.
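If the indexing theory is worth chasing, this is the sort of check I can run on each target server the next time a batch is in flight (the server name is a placeholder):

# Check the content index state of every database copy on a target server
Get-MailboxDatabaseCopyStatus -Server "EX13-01" |
    Format-Table Name, Status, ContentIndexState -AutoSize

# Watch the search worker processes (noderunner.exe) for CPU/memory spikes
Get-Process -Name noderunner |
    Sort-Object CPU -Descending |
    Format-Table Id, CPU, WorkingSet64, StartTime -AutoSize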
Is this likely to simply be a concurrency issue with the move requests, whether down to the number syncing at once or the total number sitting open? Or is there something I'm missing here?
Thanks for any assistance anyone can offer.