Latency leads to full system Failure in Exchange 2010 - Resolved in 2013? (Network Steve Forum)


My workplace encountered a bug and, not for lack of trying, I could not find anywhere that still accepts bug submissions for Exchange 2010. Since most software builds on the foundation of earlier versions, if the replication service is implemented the same way in Exchange 2013, then it could lead to a full system failure there as well.

The issue is that something happened in the REPLICATION service such that it accepted a connection and, in its code, set a variable to a LARGE NUMBER (fail closed) as a stand-in for "Copy Queue Length", in the ballpark of 922 quadrillion.
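To make the "large number as a placeholder" idea concrete, here is a minimal Python sketch of how a monitoring script might tell a real backlog from a placeholder. The cutoff `SENTINEL_THRESHOLD` and the function name are hypothetical and not part of Exchange. The figure I saw shares its leading digits with the largest signed 64-bit integer (about 9.22 x 10^18), which hints at where it comes from, but that is my assumption and not something I have confirmed.

```python
# Largest signed 64-bit integer: 9,223,372,036,854,775,807.
INT64_MAX = 2**63 - 1

# Hypothetical cutoff (an assumption, not an Exchange constant): no real
# backlog of log files would plausibly get anywhere near this size.
SENTINEL_THRESHOLD = 10**15


def classify_copy_queue_length(value):
    """Classify a reported copy queue length as current, behind, or unknown."""
    if value < 0:
        raise ValueError("copy queue length cannot be negative")
    if value >= SENTINEL_THRESHOLD:
        # Too large to be a real count, so treat it as "no data".
        return "unknown"
    return "behind" if value > 0 else "current"
```

With this check, a value in the ballpark of 922 quadrillion is reported as "unknown" instead of being read as a copy that is that many logs behind.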

The service never updates this value because it is stuck in a half-running quasi-state, a timeout occurs, and now the GOOD servers think their copy is ludicrously far behind. This causes the GOOD servers to attempt to fail over, as they believe their copy from the BAD server has failed, or perhaps another check fails...
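The behaviour described above can be sketched in Python as follows. This is a toy model under my own assumptions, not Exchange code: `read_queue_length`, `copy_looks_failed`, the `limit` of 10, and the placeholder value are all hypothetical. It only shows how a timeout that fails closed to a huge number looks like a failed copy, while a timeout reported as "no data" does not.

```python
import queue

# Assumed placeholder value; the real one used by the service is not known to me.
FAIL_CLOSED_VALUE = 2**63 - 1


def read_queue_length(updates, timeout_s, fail_closed=True):
    """Wait for a status update; on timeout return a placeholder or None."""
    try:
        return updates.get(timeout=timeout_s)
    except queue.Empty:
        # The hung service never posted a value before the timeout.
        return FAIL_CLOSED_VALUE if fail_closed else None


def copy_looks_failed(length, limit=10):
    """Hypothetical health check: None means 'no data', not 'failed'."""
    return length is not None and length > limit


hung_service = queue.Queue()  # nothing is ever put on this queue
fail_closed = read_queue_length(hung_service, timeout_s=0.01)
no_data = read_queue_length(hung_service, timeout_s=0.01, fail_closed=False)
print(copy_looks_failed(fail_closed), copy_looks_failed(no_data))
```

The design question is which default is safer: the fail-closed value makes every peer of the hung server look broken, whereas returning "no data" leaves room to check the reporting server itself first.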

[6 DAG pieces on 3 servers, 4 pieces per server]

In my scenario Server 3 holds:

1 part of server 2 <= Tells server 2 you're bad - don't fail over

1 part of server 1. <= Tells server 1 you're bad - don't fail over

2 parts of its own <= "good" - though with a dead replication service
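The layout can be written out as a small Python sketch to show how much of the DAG a single server touches. The database names and the exact pairing of servers are my assumption; they are simply one arrangement consistent with 6 databases, 2 copies each, and 4 copies per server, with Server 3 holding one copy from each of the other two servers plus two of its own.

```python
from collections import Counter

# Assumed layout: database -> (server with the active copy, server with the other copy).
LAYOUT = {
    "DB1": ("S1", "S2"),
    "DB2": ("S1", "S3"),
    "DB3": ("S2", "S1"),
    "DB4": ("S2", "S3"),
    "DB5": ("S3", "S1"),
    "DB6": ("S3", "S2"),
}


def copies_per_server(layout):
    """Count how many database copies each server holds."""
    return Counter(server for holders in layout.values() for server in holders)


def databases_touching(layout, server):
    """List the databases that have any copy on the given server."""
    return sorted(db for db, holders in layout.items() if server in holders)


print(dict(copies_per_server(LAYOUT)))
print(databases_touching(LAYOUT, "S3"))
```

In this arrangement Server 3 is involved in 4 of the 6 databases, which is why one half-dead replication service can drag down servers that are otherwise healthy.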

I do not know whether this is exploitable by simply putting a listener in place of the replication service that accepts connections and does not respond. However, in my opinion it is a "bug" or flaw to allow 1 bad server to supersede two good servers based on a partial service crash. Simply stopping the replication service on Server 3 allowed Servers 1 and 2 to take over without issue. I then restarted the replication service and have had no issues since.
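For anyone who wants to reproduce the "accepts the connection but never answers" state in a lab, here is a minimal Python sketch on localhost. It does not speak any Exchange protocol; the request text, function names, and return strings are made up for illustration. It only shows that a client needs its own timeout to tell a silent peer apart from a healthy one.

```python
import socket
import threading


def start_silent_listener():
    """Listen on an ephemeral localhost port; accept one connection, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_and_stall():
        conn, _ = server.accept()
        held.append(conn)  # hold the connection open without sending anything

    threading.Thread(target=accept_and_stall, daemon=True).start()
    return server, held


def probe(port, timeout_s):
    """Connect, send a made-up request, and report how the peer behaves."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as client:
        client.sendall(b"status?\n")
        try:
            data = client.recv(64)
        except socket.timeout:
            return "connected-but-silent"
        return "answered" if data else "closed"


server, held = start_silent_listener()
print(probe(server.getsockname()[1], timeout_s=0.2))
```

A connect-only health check would call this peer healthy, which is the gap I am describing: the port is open, but nothing behind it is doing any work.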

Since I wasted an hour looking for where to submit this, this forum will hopefully do. I am very much a white hat, but I dislike how complicated it is to disclose this information to the vendor. I want my hour back for their non-intuitive process, or maybe they simply don't care about Exchange 2010.

- Dan

Edited by FFFrk 12 hours 52 minutes ago

April 27th, 2015 2:31pm

Please note that "LATENCY" in my topic title does not mean network latency but inter-process latency on a single host; I am using it to mean that a timeout occurs without a value being returned.


April 27th, 2015 2:31pm

This topic is archived. No further replies will be accepted.