We hit a bug at work and, despite a lot of searching, I could not find anywhere that still accepts bug submissions for Exchange 2010. Since most software builds on the foundation of the past, if the replication service is implemented the same way in Exchange 2013, the same bug could lead to a full system failure there as well.
The issue: something happened in the replication service such that it still accepted connections, and in its code it replaced the "Copy Queue Length" value with a LARGE NUMBER (fail closed), in the ballpark of 922 quadrillion.
The service never updates this value because it is stuck in this half-alive state. Eventually some timeout occurs, and now the GOOD servers think their copy is ludicrously behind. This causes the GOOD servers to attempt to fail over, as they believe their copy from the BAD server has failed, or perhaps another check fails...
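To illustrate why a placeholder value like this is dangerous, here is a minimal sketch of the difference between treating the number as a real backlog and treating it as "state unknown". This is not Exchange's actual logic, which I have not seen. The function names and both thresholds are my own assumptions, chosen only to show the idea.

```python
# Hypothetical sketch: none of these names or thresholds come from Exchange.
SUSPECT_THRESHOLD = 10**15   # assumption: no real copy queue holds a quadrillion logs
BEHIND_THRESHOLD = 10        # assumption: more than 10 queued logs counts as "behind"


def naive_classify(length):
    """Treat every reported copy queue length as a real backlog."""
    return "behind" if length > BEHIND_THRESHOLD else "healthy"


def classify_copy_queue(length):
    """Treat implausible values as 'unknown' instead of as a huge backlog."""
    if length < 0 or length >= SUSPECT_THRESHOLD:
        return "unknown"
    if length > BEHIND_THRESHOLD:
        return "behind"
    return "healthy"


reported = 922 * 10**15  # the ballpark value described above
print(naive_classify(reported))       # placeholder is read as a real backlog
print(classify_copy_queue(reported))  # placeholder is flagged as unknown
```

A check that can return "unknown" at least gives whatever makes the failover decision a chance to ask for a second opinion instead of trusting a number nobody is updating.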
[6 DAG pieces on 3 servers, 4 pieces per server]
In my scenario Server 3 holds:
1 part of server 2 <= Tells server 2 you're bad - don't fail over
1 part of server 1 <= Tells server 1 you're bad - don't fail over
2 parts of its own <= "good" - though dead replication service
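For anyone trying to picture the layout, here is one arrangement that matches the numbers above (6 databases, 2 copies each, 4 copies per server). The database names and exactly which server holds which second copy are my assumptions. Only the counts and what Server 3 holds come from my actual setup.

```python
from collections import Counter

# Hypothetical layout: database names and exact placement are assumptions.
# Each entry is database -> (server holding its own copy, server holding the other copy).
LAYOUT = {
    "DB1": ("Server1", "Server3"),
    "DB2": ("Server1", "Server2"),
    "DB3": ("Server2", "Server3"),
    "DB4": ("Server2", "Server1"),
    "DB5": ("Server3", "Server1"),
    "DB6": ("Server3", "Server2"),
}


def copies_per_server(layout):
    """Count how many database copies each server holds."""
    counts = Counter()
    for owner, partner in layout.values():
        counts[owner] += 1
        counts[partner] += 1
    return dict(counts)


def held_by(layout, server):
    """Return (own databases, databases held on behalf of other servers)."""
    own = sorted(db for db, (owner, _) in layout.items() if owner == server)
    foreign = sorted(db for db, (_, partner) in layout.items() if partner == server)
    return own, foreign


print(copies_per_server(LAYOUT))
print(held_by(LAYOUT, "Server3"))
```

In this arrangement every database has a copy that depends on Server 3 or on a server that replicates with it, which is why one half-dead replication service touched everything.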
I do not know whether this is exploitable by simply putting a listener in place of the replication service that accepts connections and never responds. However, in my opinion it is a "bug" or flaw that 1 bad server can supersede two good servers because of a partial service crash. Simply stopping the replication service on Server 3 allowed Servers 1 and 2 to take over without issue. I then restarted the replication service and have had no issue since.
Since I wasted an hour looking for where to submit this, this forum will hopefully do. I am very much a white hat, but I dislike how non-simple it is to disclose this information to the creator. I want my hour back for their non-intuitive process, or maybe they simply don't care about Exchange 2010.
- Dan
- Edited by FFFrk 12 hours 52 minutes ago