DSAccess woes
G'day all -
I'm an Active Directory guy trying to tackle an Exchange issue, so please forgive me if my terminology is not right.
I work for a relatively large organisation, and in some countries we have recurring incidents where Exchange becomes unavailable - users can't log on and so forth. The postmasters have to stop the MS Exchange store service and reboot the Exchange systems for rapid resolution. There's one site where the problem is most acute, and two Exchange servers there have it regularly. The setup is rather standard - Exchange 2003 on Windows 2003 with fewer than 1000 mailboxes per server (40GB RAM, iSCSI with NetApp storage for the stores, NetApp SnapDrive, Antigen, ~25% of mailboxes are BlackBerry-enabled).
The best I have from the logs is this:
Event 2103 MSExchangeDSAccess
Process EMSMTA.EXE (PID=9999). All Global Catalog Servers in use are not responding
<long list of domain controllers follows>
There are some other events that occur while those Exchange servers have the problem:
Event 2085 MSExchangeDSAccess
Process EMSMTA.EXE (PID=9999). No Global Catalog server is up in the local site 'USERSITE'.
Event 9098 MSExchangeSA
The MAD Monitoring thread was unable to read its configuration from the DS, error '0x80004005'.
And a few more - anything that can complain will complain about my AD not being available. If we point the Exchange servers to certain local DCs (in the same rack), the issue doesn't go away. It still happens, although it appears to recover faster than when discovery is enabled, so user impact is reduced but not avoided - sometimes users are disconnected during busy hours.
Here's my predicament: I'm certain that Active Directory works just fine. We have completed all checks possible against the DS and I'm pretty sure that connectivity is not a problem - nothing stands out in the captures of traffic between Exchange and AD. Site topology could use some optimisation but is generally not a problem, and it is eliminated as a factor by pointing to local DCs anyway.
I'd like to identify the root cause of this. What else can I check? Could those problems be related to WMI? Once, when all users were disconnected, I tried to restart WMI and the service wouldn't stop - it hung in the stopping state. Could this be related to storage?
This is a really interesting issue (although it involves 9-year-old technology). Please help. All reasonable ideas appreciated.
-= F1 is the Key =-
March 6th, 2012 1:42am
On Tue, 6 Mar 2012 06:42:42 +0000, S. Pidgorny wrote:
>[...]
The first thing I'm going to point out is that 40GB of RAM is 36GB too
much for Exchange 2003. I hope you meant 4GB. If not, someone's burned
a pile of money!
Let's assume that you have the /3GB switch in the boot.ini file (and
if you don't, the presence of that much memory on the machine will
have the same effect). Right away the non-paged pool memory available
to the O/S is about half of what it would be if you didn't use the
/3GB switch. So, I'm going to suggest that you first monitor the paged
and non-paged memory pools for exhaustion.
1000 mailboxes isn't anywhere near the 3,800 number where Exchange
starts to choke, but the number of I/Os that may be outstanding may be
sucking up that limited amount of non-paged memory pool.
You may also have a memory leak in a driver. If you have a file-based
Anti-Virus on the machines that driver may be the culprit (by design,
not necessarily a bug). Microsoft has tools to monitor the memory
pools. Using ProcessExplorer (together with the MS debugging package)
can also be helpful in finding memory and handle leaks.
For AD stuff you should be familiar with dcdiag and netdiag. When the
problem's happening can you log on to the server? If you can, does
dcdiag or netdiag detect any problems?
---
Rich Matheisen
MCSE+I, Exchange MVP
March 6th, 2012 5:40pm
G'day -
Thank you for the suggestions! You are correct, that's 4GB of RAM. I couldn't see any symptoms of a memory leak into system areas. There are no 2019/2020 warnings from the Srv service, I can log on with domain credentials, and traffic captures show uninterrupted communication with the domain controllers. We have eliminated the Symantec AV software that is notorious for gobbling up the pools - the issue persisted.
Active Directory is healthy and communications appear to be okay, even though Microsoft CTS on the case were adamant that the problem is in that space.
I have found something that looks like a root cause indicator and a workaround. As the issues occur, some of the errors logged by MSExchangeDSAccess mention the wmiprvse.exe process:
Event Type: Error
Event Source: MSExchangeDSAccess
Event Category: Topology
Event ID: 2104
Description: Process WMIPRVSE.EXE -EMBEDDING (PID=9999). All the DS Servers in domain are not responding.
Event Type: Warning
Event Source: MSExchangeDSAccess
Event Category: Topology
Event ID: 2121
Description: Process WMIPRVSE.EXE -EMBEDDING (PID=6088). DSAccess is unable to connect to any domain controller in domain emea.example.net although DNS was successfully queried for the service location (SRV) resource record used to locate a
domain controller for that domain.
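One quick way to see which processes are logging these DSAccess errors is to tally the "Process ..." prefix of each event description. What follows is only a minimal Python sketch, assuming the descriptions have already been exported as plain text. The function name `process_of` and the `samples` list are made up for illustration, and the sample strings are shortened copies of the events quoted in this thread.

```python
import re
from collections import Counter

# Matches the "Process NAME [arguments] (PID=n)" prefix used by
# MSExchangeDSAccess event descriptions, e.g.
# "Process WMIPRVSE.EXE -EMBEDDING (PID=6088)."
PROCESS_RE = re.compile(r"Process\s+(\S+).*?\(PID=(\d+)\)")


def process_of(description):
    """Return (process name, PID) from a description, or None if absent."""
    match = PROCESS_RE.search(description)
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2))


# Hypothetical input: shortened descriptions copied from this thread.
samples = [
    "Process EMSMTA.EXE (PID=9999). All Global Catalog Servers in use are not responding",
    "Process WMIPRVSE.EXE -EMBEDDING (PID=9999). All the DS Servers in domain are not responding.",
    "Process WMIPRVSE.EXE -EMBEDDING (PID=6088). DSAccess is unable to connect to any domain controller",
    "The MAD Monitoring thread was unable to read its configuration from the DS, error '0x80004005'.",
]

# Count events per process name, skipping descriptions without the prefix.
parsed = [process_of(s) for s in samples]
counts = Counter(name for name, _pid in filter(None, parsed))

for name, total in counts.most_common():
    print(name, total)
```

Grouping by process name rather than PID is deliberate here, since the PID changes every time the process is restarted.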
While that happens, I cannot query DSAccess via WMI (as per KB 313711); I receive a WBEM_E_PROVIDER_LOAD_FAILURE error:
WINMGMTS:{authenticationLevel=pkt,impersonationLevel=impersonate}!
\\EXCHANGESRVR\ROOT\MicrosoftExchangeV2:Exchange_DSAccessDC
C:\tmp\ex\queryDSAccess.vbs(12, 1) (null): 0x80041013
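For anyone matching the hex codes in this thread to names by hand, here is a minimal Python sketch. It assumes the standard 32-bit HRESULT layout (severity in the top bit, an 11-bit facility, a 16-bit code). The function name `decode_hresult` and the small lookup table are illustrative only and are not part of any Exchange or WMI tooling; the table lists just a handful of well-known values.

```python
# Illustrative lookup of a few well-known codes; not an exhaustive table.
KNOWN_CODES = {
    0x80004005: "E_FAIL",
    0x80041001: "WBEM_E_FAILED",
    0x80041002: "WBEM_E_NOT_FOUND",
    0x80041003: "WBEM_E_ACCESS_DENIED",
    0x80041013: "WBEM_E_PROVIDER_LOAD_FAILURE",
}


def decode_hresult(value):
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    value &= 0xFFFFFFFF
    return {
        "failed": bool(value >> 31),        # bit 31: 1 means failure
        "facility": (value >> 16) & 0x7FF,  # bits 16-26
        "code": value & 0xFFFF,             # bits 0-15
        "name": KNOWN_CODES.get(value, "unknown"),
    }


# The two codes that appear in the event text and script output above.
for raw in ("0x80041013", "0x80004005"):
    info = decode_hresult(int(raw, 16))
    print(raw, info["name"], "facility", info["facility"], "code", hex(info["code"]))
```

The facility field is the useful part when a code is not in the table: facility 4 (FACILITY_ITF) means the code is interface-specific, which is where the WMI (WBEM) errors live, while facility 0 covers generic errors such as E_FAIL.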
When I kill the wmiprvse.exe process, DSAccess is reinitialised:
Event Type: Information
Event Source: MSExchangeDSAccess
Event Category: General
Event ID: 2068
Description: Process WMIPRVSE.EXE -EMBEDDING (PID=99999). DSAccess initialized successfully.
From that point on, all errors from MSExchangeDSAccess and some others disappear from the logs, and I can query DSAccess via WMI. So I think the issue is with WMI, but I still cannot quite pinpoint the underlying cause.
-= F1 is the Key =-
March 9th, 2012 10:42pm
To find the cause you may well have to take a procdump at the time of the issue. If you have a case open with MS, then ask them to take that route.
Sukh
March 9th, 2012 11:22pm
On Sat, 10 Mar 2012 03:42:45 +0000, S. Pidgorny wrote:
>[...]
WBEM_E_PROVIDER_LOAD_FAILURE means that WMI was unable to load a
provider. Since stopping/starting WMI seems to "cure" the problem I
don't think the problem has anything to do with the normal errors
associated with unregistered DLLs or missing/damaged registry values.
This might provide some help:
http://msdn.microsoft.com/en-us/library/aa392570(v=vs.85).aspx
"...The system drops cache entries through the cache aging process,
loss of RPC connectivity, user control, or due to some change in the
provider registration"
The provider registration hasn't changed. Nobody's messed with
dcomcnfg. The DSAccess is used every 15 minutes, so it's probably not
aging out of the cache. Loss of RPC connectivity is the likely
culprit, even if it works again a few seconds later. What remains a
mystery is why the provider can't be loaded again!
This describes an easy way to get at that
MSFT_WmiProvider_LoadOperationFailureEvent class information:
http://blogs.infosupport.com/win32_network-adapter-provider-load-failure/
Good luck with WMI spelunking. Whenever presented with WMI problems I
like King Arthur's advice in Monty Python and the Holy Grail: Run
away! Run away! :-D
---
Rich Matheisen
MCSE+I, Exchange MVP
March 10th, 2012 3:11pm
Tis but a scratch! It's just a flesh wound!
March 10th, 2012 3:21pm
Looks like there was a DSAccess hotfix that is applicable here - "The Wmiprvse.exe process crashes on an Exchange Server 2003 server" (http://support.microsoft.com/kb/947485). Not exactly my symptoms, but newer code may help.
-= F1 is the Key =-
March 13th, 2012 10:51pm
In case anyone's interested: the above hotfix did not resolve the issue in my environment. So I have a workaround (terminating the wmiprvse.exe process that hosts DSAccess) but no cause or fix. I wonder if current versions of Exchange also rely on that process.
-= F1 is the Key =-
April 1st, 2012 12:32am