CAS load balancing
Hello,
I have a two-node DAG of Exchange 2013 multi-role servers. The user mailbox is on node one, mounted and in a healthy status. If for some reason the back-end MSExchangeOWAAppPool is stopped on node one, the user cannot access the mailbox, even though the node two multi-role server is up and functional. Health checks on the (NetScaler) load balancer still direct OWA requests to node one even though the pool has failed and the node is effectively down. If I remove the load balancer it behaves the same way. Is this component-level failure not detected in Exchange 2013?
September 2nd, 2015 9:22am
I would think this is something Managed Availability should catch, moving the mailbox database to a different server.
When you access http://servername/owa/healthcheck.htm, does it show everything up for that server? That's really what the NetScaler is checking against.
How long did you let it run in that state?
September 2nd, 2015 12:56pm
I'm not sure whether your external load balancer will detect this type of failure and stop routing users to the partially failed server. The best thing to do would be to find out why the OWA app pool is failing and troubleshoot that.
Another option is to run a script that removes the server from the load balancer should there be issues with app pools, services, etc., but this may be quite a bit of work.
The IIS logs and event logs should give you an idea why the app pool is failing.
Thanks.
September 2nd, 2015 1:22pm
When I stop the OWA app pool, the health check shows "HTTP Error 503. The service is unavailable." The problem is that the DB doesn't fail over to the other node. If I down the server, or the NIC (VM), the DB fails over automatically and is OK. Should the DB fail over if the app pool fails? If it should, what should I look at to change the behaviour?
September 2nd, 2015 1:37pm
How long do you keep the app pool stopped for?
September 2nd, 2015 1:48pm
I wouldn't expect an app pool failure to cause the DAG to fail over. The DAG provides high availability for the mailbox server role, not the Client Access server role.
The load balancer needs to be informed to stop sending requests to the CAS should there be a partial failure of that server.
Thanks.
September 2nd, 2015 2:10pm
A CAS app pool failure may not cause a DAG failover, but regardless you need to set your load balancer to check the health of the CAS and mark it down if it's not responding, so no clients are routed to it:
http://blogs.technet.com/b/exchange/archive/2014/03/05/load-balancing-in-exchange-2013.aspx
To ensure that load balancers do not route traffic to a Client Access server that Managed Availability has marked as offline, load balancer health probes must be configured to check
<virtualdirectory>/healthcheck.htm (e.g., https://mail.contoso.com/owa/healthcheck.htm). Note that
healthcheck.htm does not actually exist within the virtual directories; it is generated in-memory based on the component state of the protocol in question.
If the load balancer health probe receives a 200 status response, then the protocol is up; if the load balancer receives a different status code, then Managed Availability has marked that protocol instance down on the Client Access server. As a result, the
load balancer should also consider that end point down and remove the Client Access server from the applicable load balancing pool.
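The rule in that excerpt can be sketched as a small pool-building function. The node names and status values below are illustrative only, assuming per-protocol probes of healthcheck.htm as described above: a server leaves only the pools whose probe returned a non-200 status.

```python
# Sketch of the per-protocol pool logic described above: each protocol
# gets its own pool, and a server is excluded only from the pools whose
# healthcheck.htm probe did not return HTTP 200. Data is illustrative.
def build_pools(health):
    """health: {server: {protocol: http_status}} -> {protocol: [servers]}"""
    pools = {}
    for server, protocols in health.items():
        for protocol, status in protocols.items():
            pools.setdefault(protocol, [])
            if status == 200:  # anything else: Managed Availability marked it down
                pools[protocol].append(server)
    return {p: sorted(s) for p, s in pools.items()}

example = {
    "node1": {"owa": 503, "ecp": 200},   # OWA app pool stopped on node1
    "node2": {"owa": 200, "ecp": 200},
}
# build_pools(example) -> {"owa": ["node2"], "ecp": ["node1", "node2"]}
```

Note how node1 stays in the ECP pool: only the failed protocol's endpoint is taken out of rotation, which matches the "applicable load balancing pool" wording above.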
September 2nd, 2015 2:25pm
I am manually stopping the OWA app pool on one of the DAG servers for testing; mind you, this is my pre-prod test lab. I have my NetScaler set up per http://danielruiz.net/2015/05/26/exchange-2013-layer-7-single-namespace-loadbalancing-with-citrix-netscaler/comment-page-1/
and the probe does mark the node as down.
Forgetting the load balancer, internally I cannot bring up an OWA session from another CAS server directly to the node with the app pool failure. I'm sure this is by design, but I'm wondering why the active DB doesn't fail over for this type of failure?
September 2nd, 2015 2:54pm
Also, if a CAS app pool failure doesn't cause a DB failover, why have the CAS role on the same server as the MBX role per best practices (simplified DAG with AutoReseed and a load balancer)? If the mailbox database in question doesn't move, requests will keep going to the failed server.
Should I have two separate CAS servers load balanced?
September 2nd, 2015 3:01pm
The best practice is to keep both CAS and MBX on the same server, but I believe this is for simplicity.
The CAS and MBX roles work independently. For example, if either CAS server has a problem, the other CAS should be able to provide access to the mailboxes no matter which MBX server has the database mounted. In other words, a CAS failure doesn't require a DAG failover, and this is why app pools don't cause DAG failovers.
September 2nd, 2015 4:24pm
That would be true whether the CAS role was separate from the mailbox role or multi-role. The load balancer logic is what keeps clients from connecting to the failed CAS. Even if the database had failed over, clients could still be directed to the failing CAS.
Multi-role is best.
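A rough Python sketch of that routing logic, with hypothetical node names, shows why the CAS choice is independent of where the active database copy sits: the load balancer picks any healthy CAS, and that CAS proxies to whichever server holds the active copy, which may be the very node with the broken app pool.

```python
# Sketch (hypothetical names): the LB picks any healthy CAS; that CAS
# proxies to the server holding the active database copy, which can be
# a different node, including one whose own OWA app pool is down.
def route(healthy_cas, active_db_owner):
    """Return (cas, mailbox_server) a client request will traverse."""
    if not healthy_cas:
        raise RuntimeError("no healthy CAS available")
    cas = healthy_cas[0]           # LB choice: any healthy CAS works
    return cas, active_db_owner    # CAS proxies to the active-copy owner

# node1's OWA pool is down, DB still active on node1:
# route(healthy_cas=["node2"], active_db_owner="node1")
# -> ("node2", "node1")  -- node2 serves OWA, proxies to node1's DB
```

This is why no DAG failover is needed for an app pool failure: as long as the load balancer stops sending clients to the broken CAS, the surviving CAS still reaches the mounted database on either node.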
September 2nd, 2015 5:58pm
Multi-role has indeed been the recommendation since Exchange 2010.
It gives design simplicity and removes concerns such as the MBX-to-CAS ratio, for example.
In Exchange 2016 this is simply how you will deploy Exchange: the roles are combined.
September 3rd, 2015 1:52pm