Questions on Exchange corruption
Hi
Environment: Exchange 2007 SP2 on Windows Server 2008 SP2. CCR/SCC clustering in place with SCR.
Example scenario: Due to a driver issue, we receive 10xx alerts on our Exchange servers indicating that there is some corruption within the database. From my understanding, CCR is better in this scenario as the corruption can't be replicated to the passive node due to the Inspector directory. With SCC, on the other hand, there is only one set of data, so my possible options are:
i. Move mailboxes on affected DB to another new DB
ii. Restore from backup
iii. Run ESEUtil /p to hard repair the database (last resort).
I was hoping someone could answer some queries I had on this:
1. Am I correct that most types of data corruption cannot be replicated with CCR/SCR technology? In which case, if the error happened on a CCR cluster, would the best option be to fail over to the passive node?
2. If this happened on an SCC cluster, then the 'move mailbox' idea is the best? However, how can we ensure that we don't carry any corruption across/have any data loss when the mailboxes are moved?
3. If we did decide to use ESEUtil /p, it would be safest to back up the database first. However, I always thought backups would fail when attempting to back up a database that had corruption?
As I mentioned, this is an example scenario.
September 11th, 2011 12:20am
Logical corruption can certainly be replicated, physical, not so much.
Regardless, my first step for logical corruption would be isinteg, followed by a reseed.
For physical, assuming the store was still up, I would be moving mailboxes to a new store that has already been replicated.
Restoring from backups would be the next step if the first two weren't possible.
eseutil /p would be pretty much off the table, but if I did run it, I would move mailboxes from the repaired store to a new one that has been replicated.
September 11th, 2011 1:01am
Hi Andy
Thanks! Some more questions :-)
"Logical corruption can certainly be replicated, physical, not so much." > How can we tell if the corruption is logical or physical? I guess physical corruption is caused by a hardware fault (driver etc), whereas logical? Looking at
http://support.microsoft.com/kb/314917, it seems to me that:
1018/1019: Generally physical
1022: Database error
Would I be correct? How can we be sure? And if the error was 1018/1019, you don't recommend failing over to the passive CCR node to try and resolve?
"....I would be moving mailboxes to a new store that has already been replicated."
Do you know a command or switch in Exchange 2007 that we can use that will prevent data loss for any mailbox moves? That is, if there is likely to be any data loss for a particular mailbox, then exclude that mailbox from any moves?
Finally, how are we alerted for these errors? I know there are errors in the Event Log, but how often do they appear?
September 11th, 2011 1:17am
Yes, for physical corruption, I would fail over to the other node and mount. Sorry, I was probably typing that too fast!
If that didn't work, then I would be thinking about moving mailboxes.
You'll know when it's physical: those errors (-1018 etc.) are exactly the kind of things you will see.
Logical corruption is probably a little harder to diagnose. It could range from nothing but event log errors that you can ignore all the way to store crashes. Here is an example of a logical corruption that gets replicated to both copies of the store in 2007:
http://support.microsoft.com/kb/959135
In this case, the quick fix was to run isinteg against the store after moving the mailbox that was causing the problem to another isolated store. Once isinteg was run, a reseed of the passive node was required.
I don't know of any switch that does that, but to be honest, if there are items that are corrupt you don't want them anyway. Those are generally calendar items.
September 11th, 2011 4:10pm
Oh, to your question:
"Do you know a command or switch in Exchange 2007 that we can use that will prevent data loss for any mailbox moves? That is, if there is likely to be any data loss for a particular mailbox, then exclude that mailbox from any moves?"
If you set data loss to 0 on mailbox moves, and there are corrupt items, then the mailbox won't be moved. So I guess in a roundabout way, that accomplishes what you want, but it doesn't identify the corrupt items, and exmerge/export to a PST won't export those corrupt items either. So typically you set moves to allow for some item corruption, otherwise you may never be able to move those mailboxes.
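To make the trade-off concrete, here is a small Python sketch of how a per-mailbox corrupt-item limit behaves. It is only a model, not Exchange code: the `Mailbox` class, the `plan_moves` function and the corrupt-item counts are all made up for illustration. As far as I know, the matching setting in Exchange 2007 is the `-BadItemLimit` parameter of `Move-Mailbox`, but check that against the cmdlet help before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Mailbox:
    name: str
    corrupt_items: int  # hypothetical count of items the move cannot read


def plan_moves(mailboxes, bad_item_limit=0):
    """Split mailboxes into those that would move and those that would be refused.

    A mailbox moves only if its corrupt item count is within the limit;
    the corrupt items themselves are left behind either way.
    """
    moved, refused = [], []
    for mbx in mailboxes:
        if mbx.corrupt_items <= bad_item_limit:
            moved.append(mbx.name)
        else:
            refused.append(mbx.name)
    return moved, refused


# Made-up sample data: one clean mailbox, two with corrupt items.
mailboxes = [Mailbox("alice", 0), Mailbox("bob", 3), Mailbox("carol", 12)]

strict = plan_moves(mailboxes, bad_item_limit=0)   # zero tolerated data loss
lenient = plan_moves(mailboxes, bad_item_limit=5)  # tolerate a few bad items

print("limit 0:", strict)
print("limit 5:", lenient)
```

With a limit of 0 only the clean mailbox moves. Raising the limit lets "bob" move at the cost of his three unreadable items, which is the compromise described above.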
September 11th, 2011 4:16pm
Thanks Andy.
In terms of logical v physical corruption, are the 10xx events always physical corruption then?
September 11th, 2011 4:44pm
"In terms of logical v physical corruption, are the 10xx events always physical corruption then?"
Yes, it's pretty safe to say that if you ever see those -1018, -1019 or -1022 errors, it's physical.
(Physical errors are pretty rare with today's hardware, but can still happen.)
Error correction was added beginning with Exchange 2003 SP1:
http://support.microsoft.com/kb/867626
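If you want to flag these automatically, the rule of thumb above could be sketched in Python roughly like this. The JET names are the standard ESE names for those three codes. The function name, the regular expression and the sample message text are my own assumptions for illustration, so adapt them to however you export your event log.

```python
import re

# ESE error codes discussed in this thread, with their JET names.
PHYSICAL_ESE_ERRORS = {
    -1018: "JET_errReadVerifyFailure",   # page checksum mismatch
    -1019: "JET_errPageNotInitialized",  # page expected to hold data is blank
    -1022: "JET_errDiskIO",              # disk I/O error prevented page access
}

# Matches a standalone negative code such as "-1018" inside message text.
_CODE = re.compile(r"(?<![\w.])-10\d\d\b")


def physical_errors_in(message):
    """Return the physical-corruption codes mentioned in one event message."""
    codes = {int(match) for match in _CODE.findall(message)}
    return sorted((c for c in codes if c in PHYSICAL_ESE_ERRORS), reverse=True)


# Illustrative message text only, not copied from a real server.
sample = "The read operation will fail with error -1018 (0xfffffc06)."
for code in physical_errors_in(sample):
    print(code, PHYSICAL_ESE_ERRORS[code])
```

Anything this returns is worth acting on. An empty list only means none of these three codes appeared in that message, not that the store is healthy.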
September 11th, 2011 4:57pm
Thanks and final question :-)
If we did see those errors you mention in the Event Log, does that mean there definitely IS corruption and we should take action (failover, move mailboxes to another store etc) immediately rather than wait for users to notice?
September 11th, 2011 10:12pm
Do you mean errors like the ones mentioned in this article?
http://support.microsoft.com/kb/959135
In that case the store crashes, and failing over to the passive node crashes as well since the problem is replicated.
If the store doesn't crash, then yes, you could fail over to see if the errors follow.
September 11th, 2011 11:04pm
The link you mentioned above is for logical corruption, isn't it? From your comments before, I take it that simply failing over (if we're using CCR) is not going to help with that situation, and we're stuffed if we're using SCC, so the only option is to move mailboxes out or restore? (And call PSS of course :-) )
I was more referring to the 10## errors mentioned in
http://support.microsoft.com/kb/314917 which I believe are physical. How do we know there is actually corruption there? Is there a command we can run to tell us for definite, so we don't move mailboxes or fail over for no reason (ESEUtil /k for instance), preferably without having to dismount the store to run it?
September 11th, 2011 11:49pm
If there is physical corruption that can't be corrected, the store won't mount. That's when you would fail over and attempt to mount the other copy.
September 12th, 2011 12:56am
Sure, but it is possible to have physical corruption and the store not dismount, isn't it? Is using ESEUtil /k the only way to verify that there was actually some corruption (as opposed to the alert being a false alarm)?
September 12th, 2011 1:04am
Yep, it's possible to have -1018/-1019/-1022 errors and the store will remain up. No need to verify with eseutil, those errors aren't lying.
That goes back to the failover option. If you fail over and there are no errors, then you fix the hardware issue on the node that was throwing errors and then reseed from the active copy.
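Pulling the thread together, the order of steps could be written down as a rough Python sketch. This just encodes the advice given in this thread, not an official procedure. The function name, its parameters and the step wording are made up for illustration.

```python
def next_steps(corruption, has_passive_copy, store_mounted=True):
    """Return the ordered steps suggested in this thread.

    corruption: "logical" or "physical"
    has_passive_copy: True for CCR, False for SCC (single copy of the data)
    store_mounted: whether the affected store is still up
    """
    if corruption not in ("logical", "physical"):
        raise ValueError("corruption must be 'logical' or 'physical'")

    if corruption == "logical":
        # Logical corruption replicates, so failing over does not help.
        steps = ["run isinteg against the affected store"]
        if has_passive_copy:
            steps.append("reseed the passive copy")
        return steps

    steps = []
    if has_passive_copy:
        steps.append("fail over to the passive node and mount the store there")
        steps.append("if the errors stop: fix the hardware on the faulty node, then reseed it")
    if store_mounted:
        steps.append("move mailboxes to a new store")
    steps.append("restore from backup")
    steps.append("eseutil /p as a last resort, then move mailboxes off the repaired store")
    return steps


# Example: -1018 errors on a CCR cluster while the store is still up.
for number, step in enumerate(next_steps("physical", has_passive_copy=True), start=1):
    print(number, step)
```

Each later step is a fallback for when the earlier ones are not possible or do not clear the errors.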
September 12th, 2011 1:42am