Exchange 2003 Public Folder Replication 100% CPU
Hi all, I have been struggling with this issue for a couple of days now, so this is my cry for help! This might be a bit long, so apologies for that and thank you if you make it all the way through.
We have an aging Exchange 2003 server in the organisation, which I decided to start decommissioning by moving the public folders off it and onto a 2007 SP3 machine. I decided to do this by setting up replication, which we didn't previously use - we only have one public folder store (small company).
The public folder hierarchy is in a bit of a state, as a couple of folders got a spam message stuck in them which was then duplicated by our CRM system. The result is a normal-looking tree except for 2 folders which have 40k or so messages in them, of which about 4k are legit.
I knew this was outside of normal operation, but was hoping I could run a dedupe tool on the store after replication - it wouldn't run on the old server for some reason.
It's worth noting that at this point the system functioned completely normally.
I enabled replication and went home for the weekend, when thankfully the office is closed. I came back on Monday to find the old Exchange server churning away at 100% CPU for store.exe, and virtually completely unresponsive. It's also our SMTP server, and external mail was completely locked up as well. I immediately hit the "stop all replication" button while I investigated. This made no difference, despite logging the correct event log message. I then marked all folders as not for replication in ESM, again making no difference. I then unmounted the store, which resulted in normal mail flow.
That’s pretty much where I am now. With the store mounted, the system is completely unusable, with the CPU spiking to 100% usage every couple of seconds. As soon as the store is mounted, "messages awaiting submission" immediately fills with thousands of replication messages, a mix of "Public Folder Content" messages and backfill requests. These are being sent from the problematic server to the working one, as if it’s ignoring my "stop replication" command. If you manually clear the queue, no further messages are created but the problem persists.
I’ve spent a bit of time trying to resolve the issue, and have got a licence for DigiScope to get a better view of the store when it’s unmounted. It appears to me that it has some kind of internal replication queue that it’s trying to work down. When it’s unmounted there are literally hundreds of thousands of messages that look like the ones in the "awaiting submission" queue, sat in the internal folder "Internal/ReplSendFolder".
Anyway, that’s my issue. I’m hoping I can get it mounted again without bogging down the server, so I can run DigiScope against it to remove the messages that I presume caused the error to begin with, then enable replication again to complete the migration. At the moment that’s looking a long way away!
I have enabled diagnostic logging for public folders, and the log looks a bit like this after mount:
Database Start - 9523
Outgoing replication message issued
Content broadcasts will not occur, replication is disabled
Incoming replication requests
Backfill responses will not be issued, replication is disabled
More incoming, some outgoing
Backfill requests
So, I’m pretty confused. It doesn’t look like replication has stopped at all, and for the record I ran the pause replication command on the 2007 SP3 server as well.
Any advice much appreciated, I am now pretty much baffled. Thank you for reading!
Edit: Also, the store is in a clean shutdown state, isinteg shows no fixes required, and I defragged it recently.
September 24th, 2010 6:11am
This might help: disable AV on the PF store, and also stop system message generation.
September 25th, 2010 11:27am
I've tried stopping AV and anything else that isn't necessary for public folders, but this didn't make much of a difference - it made things slightly more responsive, but mail flow was still basically halted.
I'll try stopping system message generation as well, and see where that gets me.
Does anyone have any idea what the "ReplSendFolder" I mentioned is, and how to clear it (if that's possible, or wise)?
September 27th, 2010 4:22am
Figured I might as well post an update on this.
I ended up working with MS Exchange support on this for a while. The ReplSendFolder is, as I understand it, an internal queue of replication messages that have been generated but are awaiting processing / sending.
In my case, there were so many messages in this queue that it would take pretty much months for them all to send out, and there was no way of clearing it - it's an internal folder, only visible in an unmounted edb.
So, I had to use a 3rd party tool (in my case, DigiScope) to open the unmounted edb, delete as many of the corrupted messages as humanly possible, then export the whole thing to PST and re-import it on a new server.
Once the store was in the state described at the beginning, this was pretty much the only option, as processing the messages would take forever, and there is no way to clear them other than processing them. So it became a data recovery exercise from a corrupted database.
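For anyone wanting to sanity-check a "this will take months" claim against their own backlog, here is a rough back-of-the-envelope sketch in Python. The backlog size and per-message processing time are made-up placeholders, not figures measured on this server, and the function name `drain_time` is purely illustrative. It also assumes a constant processing rate and no new replication messages arriving, which is optimistic.

```python
from datetime import timedelta


def drain_time(backlog_messages: int, seconds_per_message: float) -> timedelta:
    """Estimate how long a fixed backlog takes to clear at a constant rate.

    Assumes no new messages are queued while the backlog drains.
    """
    if backlog_messages < 0:
        raise ValueError("backlog_messages must not be negative")
    if seconds_per_message <= 0:
        raise ValueError("seconds_per_message must be positive")
    return timedelta(seconds=backlog_messages * seconds_per_message)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (NOT measured values):
    backlog = 500_000        # messages sitting in the internal queue
    per_message = 10         # seconds the store needs per message
    estimate = drain_time(backlog, per_message)
    print(f"Estimated time to drain: about {estimate.days} days")
```

Plugging in your own queue count (for example, from the offline view of the edb) and an observed send rate gives a quick idea of whether waiting it out is realistic or whether an export and re-import is the only practical route.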
Maybe this will help someone in a similar situation, good luck!
October 6th, 2010 9:34am