All Agents in the SCOM server grayed out
Hi All,
We encountered a strange thing today: most of our SCOM agents under Agent Managed are grayed out. We tried pinging some of the clients and they respond successfully.
We have also restarted the SCOM application and database servers. Please advise what the issue could be.
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 28th, 2011 6:17am
Stupid question, but do you have Management Servers handling the agents?
Are these healthy?
/Roger
This posting is provided "AS IS" with no warranties, and confers no rights.
April 28th, 2011 6:35am
First check the state of your management servers. Check whether anybody placed the management servers, and specifically the RMS, in maintenance mode.
Bob Cornelissen - BICTT (My BICTT Blog)
April 28th, 2011 6:43am
Yes, we do have an MS handling the agents, and it is showing Critical status.
Thanks and Regards, Gaurav Jain
April 28th, 2011 6:44am
In that case, first focus on that separate MS and see what is wrong with it. Check the Health Explorer for the box, the Operations Manager event log and so on. It is likely a problem with that one box.
Bob Cornelissen - BICTT (My BICTT Blog)
April 28th, 2011 6:48am
Well, there is your explanation for why the agents are grey.
If you have more than one management server, you should have agent failover set up.
If you do have failover set up, you need to examine why it is not happening.
Now, why is the MS Critical?
/Roger
This posting is provided "AS IS" with no warranties, and confers no rights.
April 28th, 2011 6:49am
How can I check the RMS service status?
Thanks and Regards, Gaurav Jain
April 28th, 2011 7:05am
In the SCOM console, go to Monitoring - Operations Manager - Management Server - Management Server Health State. Make sure all the management servers listed in both panes there are green. If not, check what is going on by opening the Health Explorer on these. Fixing the RMS would be the first priority, followed by the other management servers. Of course, you can also follow up with the event log and services.msc to check whether the 3 System Center services are running on the RMS.
Bob Cornelissen - BICTT (My BICTT Blog)
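A minimal sketch of the services.msc check described above, done from a command prompt on the RMS instead. It asks `sc query` for each service and parses the STATE line. The short service names `HealthService`, `OMSDK` and `OMCFG` are assumptions about what the three System Center services are called on an OpsMgr 2007 RMS; confirm them in services.msc before relying on this.

```python
import re
import subprocess

# Assumed short service names for the three System Center services on an
# OpsMgr 2007 RMS; confirm them in services.msc before relying on this list.
RMS_SERVICES = ["HealthService", "OMSDK", "OMCFG"]


def parse_sc_state(sc_output):
    """Return the state word (e.g. 'RUNNING') from `sc query` output, or None."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def service_state(name):
    """Run `sc query <name>` (Windows only) and return the parsed state."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return parse_sc_state(result.stdout)


def report(services=RMS_SERVICES):
    """Print one line per service; call this on the RMS itself."""
    for name in services:
        print(f"{name}: {service_state(name) or 'not found'}")
```

Calling `report()` on the RMS prints one line per service. Anything other than RUNNING is worth a look in the Operations Manager event log.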
April 28th, 2011 7:09am
Yes, we do have MS server handling agents and it is showing Critical Status
Thanks and Regards, Gaurav Jain
Hi
Check the Operations Manager event log on the Management Server handling the agents to see whether it shows any problems.
Also, right-click the management server and open Health Explorer to see what the issue is. Just because it is critical doesn't mean that it is a major issue. There are lots of monitors that can create a critical alert.
You might also want to consider flushing the agent health state on the MS, though be aware that you might lose a small amount of data if you do this.
This is done by:
- stopping the System Center Management service
- renaming the “\Program Files\System Center Operations Manager 2007\Health Service State” folder, e.g. by adding a suffix of "old" to the folder name
- starting the System Center Management service again
If everything starts OK, you can delete the folder that you renamed.
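The three steps above can be sketched as a small script. This is only an illustration under stated assumptions: `HealthService` is assumed to be the short name of the System Center Management service, the path assumes a default install on drive C:, and the script must run elevated on the server itself.

```python
import os
import subprocess
import time

# Assumptions: "HealthService" is the short name of the System Center
# Management service, and the default install path below applies.
SERVICE_NAME = "HealthService"
STATE_DIR = r"C:\Program Files\System Center Operations Manager 2007\Health Service State"


def rename_state_folder(state_dir, suffix=".old"):
    """Rename the state folder by appending a suffix; return the new path."""
    target = state_dir + suffix
    if os.path.exists(target):
        # Avoid clobbering an earlier backup by adding a timestamp.
        target = f"{state_dir}{suffix}.{int(time.time())}"
    os.rename(state_dir, target)
    return target


def flush_health_state(state_dir=STATE_DIR, service=SERVICE_NAME):
    """Stop the service, rename the state folder, start the service again."""
    subprocess.run(["net", "stop", service], check=True)
    try:
        backup = rename_state_folder(state_dir)
    finally:
        # Start the service even if the rename failed, so the server
        # is not left with its health service stopped.
        subprocess.run(["net", "start", service], check=True)
    return backup
```

Renaming rather than deleting keeps the old folder around until the service has been seen to start cleanly.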
Cheers
Graham
View OpsMgr tips and tricks at
http://systemcentersolutions.wordpress.com/
April 28th, 2011 7:42am
Hi Graham,
I found that the health status of the MS is back to Healthy. I am checking the Operations Manager logs.
Thanks and Regards, Gaurav Jain
April 28th, 2011 8:56am
I also found event ID 10801.
Thanks and Regards, Gaurav Jain
April 28th, 2011 9:31am
It seems like you have an issue with your OpsDB.
Check whether the database has enough free space left. It can take a while before the queues are emptied into the OpsDB.
Do you also see event ID 33333 on your servers?
Maybe post one of the events here so we can check further and find clues as to whether it's an issue with a stale object or the database itself.
It's doing common things uncommonly well that brings success.
April 28th, 2011 10:05am
Hi Dieter,
FYI, event ID 10801 says:
Discovery data couldn't be inserted to the database. This could have happened because of one of the following reasons:
- Discovery data is stale. The discovery data is generated by an MP recently deleted.
- Database connectivity problems or database running out of space.
- Discovery data received is not valid.
The following details should help to further diagnose:
DiscoveryId: dcffd7d6-3090-13a9-5a62-e295e1f8d9c8
HealthServiceId: e5dc4ef9-7641-8bd4-7b6b-7517d9b3053a
Health service ( MCDCSFPDC004.kldc.APMEA.McDcorp ) should not generate data about this managed type ( Microsoft.Windows.Computer ) object Id ( F4C5852D-0DB3-25C3-3A42-16DBD8FAE651 )..
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
Event ID: 33333
Data Access Layer rejected retry on SqlError:
Request: p_ManagedEntityInsert -- (BaseManagedEntityId=f4c5852d-0db3-25c3-3a42-16dbd8fae651), (TypedManagedEntityId=f4c5852d-0db3-25c3-3a42-16dbd8fae651), (ManagedTypeId=ea99500d-8d52-fc52-b5a5-10dcd1e9d2bd), (FullName=Microsoft.Windows.Computer:MCDCSFPDC002.APMEA.McDcorp),
(Path=), (Name=MCDCSFPDC002.APMEA.McDcorp), (TopLevelHostEntityId=f4c5852d-0db3-25c3-3a42-16dbd8fae651), (DiscoverySourceId=0329ebc8-e990-8584-c695-ccd8bba6931e), (HealthServiceEntityId=e5dc4ef9-7641-8bd4-7b6b-7517d9b3053a), (PerformHealthServiceCheck=True),
(TimeGenerated=4/28/2011 12:24:30 PM), (RETURN_VALUE=1)
Class: 16
Number: 777980008
Message: Health service ( MCDCSFPDC004.kldc.APMEA.McDcorp ) should not generate data about this managed type ( Microsoft.Windows.Computer ) object Id ( F4C5852D-0DB3-25C3-3A42-16DBD8FAE651 ).
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Thanks and Regards, Gaurav Jain
April 28th, 2011 10:29am
Hi Gaurav,
Could you start by enabling agent proxy for the MCDCSFPDC004.kldc.APMEA.McDcorp server? In the SCOM console, go to Administration - Agent Managed (assuming it is an agent; if it is a management server, you will find it under Management Servers instead). Find the agent, open its properties, go to the second tab and make sure the checkbox is selected. Also do this for domain controllers, Exchange boxes, ISA servers, SCOM servers, SCCM servers and cluster servers (NLB and failover clustering). See if this error disappears after that, and then move on to the next error you see.
Bob Cornelissen - BICTT (My BICTT Blog)
April 28th, 2011 10:46am
Hi Bob,
As suggested, I checked MCDCSFPDC004.kldc.APMEA.McDcorp and found that it is in a Healthy state and is not grayed out.
The checkbox on the second tab (Security) for MCDCSFPDC004.kldc.APMEA.McDcorp is unchecked.
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 28th, 2011 11:08am
Please tick that checkbox on the second tab of that agent's properties. So enable that one, please.
Bob Cornelissen - BICTT (My BICTT Blog)
April 28th, 2011 11:10am
Bob,
One more thing I would like to mention: this particular agent is in the KLDC domain, and none of the agents in this domain are grayed out. It is the agents in the other domains, AUS and AP, that are grayed out.
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 28th, 2011 11:18am
Yeah, but those are not all related. The event you posted above is about that specific agent trying to talk on behalf of another entity. It needs to have that proxy checkbox set in order to do that, and for you to lose that one specific event in your logs. (A domain controller agent talks on behalf of the domain, a cluster node talks on behalf of the cluster resources... that's what the proxy checkbox is for.) The source machine does not turn grey because of this; it's just that you get lots of alerts and monitoring is not completely functional. You will see that it helps.
Next, what other events do you see? Are all the agents in a certain domain grey, or just some?
Is there any relationship between the grey agents and one specific management server? (Meaning that the others, connected to another MS, would be green in most cases.)
Bob Cornelissen - BICTT (My BICTT Blog)
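To find which agents need the proxy checkbox, the "should not generate data about this managed type" text from events 10801 and 33333 can be scanned for the reporting health service. A rough sketch, assuming the event descriptions have already been exported as plain strings (how they are exported is left open; the function name is illustrative):

```python
import re

# Pattern for the message text seen in events 10801 and 33333 in this thread.
PROXY_MESSAGE = re.compile(
    r"Health service \( (?P<agent>\S+) \) should not generate data about "
    r"this managed type \( (?P<managed_type>\S+) \) "
    r"object Id \( (?P<object_id>[0-9A-Fa-f-]+) \)"
)


def agents_needing_proxy(event_messages):
    """Return the sorted, de-duplicated agent names found in the messages."""
    agents = set()
    for message in event_messages:
        for match in PROXY_MESSAGE.finditer(message):
            agents.add(match.group("agent"))
    return sorted(agents)
```

Each name returned is a candidate for enabling agent proxy; whether it should be enabled is still a judgment call per server role.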
April 28th, 2011 11:28am
Hi
Are these domains in the same forest? If not, have you configured a forest-level trust? Or are you using certificates?
http://technet.microsoft.com/en-us/library/bb735408.aspx
Cheers
Graham
PS: Regarding agent proxy, you'll need to enable it for a number of management packs, e.g. AD (if this is a domain controller), Exchange and also clusters (enable it on the physical nodes), plus others.
View OpsMgr tips and tricks at
http://systemcentersolutions.wordpress.com/
April 28th, 2011 11:28am
Thanks Bob for the information.
The events I can see are 20022, 10103, 35169, 33334, etc.
Event ID: 31552
Failed to store data in the Data Warehouse.
Exception 'SqlException': A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow
remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {836F1D56-3331-EAAF-38FF-26A1A9FBA69F}
Management group: APMEA01
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Well, all agents in AUS and AP are gray. All domains share a single MS.
Thanks and Regards, Gaurav Jain
April 28th, 2011 11:36am
Hi Graham,
Is there any risk in flushing the agent health state?
FYI, the issue still persists.
regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 28th, 2011 11:39am
And, of course, the very first thing to do is read the announcements above the forums - please see the article on troubleshooting gray agent state.
Microsoft Corporation
April 28th, 2011 11:45am
Hi
You will lose whatever data is in the cache - so it is not something that I would do lightly or recommend having to do frequently. But it won't break anything that isn't already broken. It will just force a refresh of all the agent configuration.
What monitor is showing critical? Is it just a CPU or Memory Monitor? If you right click the Management Server and choose Health Explorer, you can see the monitor(s) that are unhealthy.
Cheers
Graham
View OpsMgr tips and tricks at
http://systemcentersolutions.wordpress.com/
April 28th, 2011 11:47am
Can you set up a connection from a server that is "grey" to any of the management servers using the SCOM port:
telnet <management server name> 5723
That way, at least you know for sure that it is not network related.
Regards,
Marc Klaver
http://jama00.wordpress.com/
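Where telnet is not installed on the agent machine, the same check can be sketched in a few lines of Python. This only tests that a TCP connection to port 5723 can be opened from the agent towards the management server; it says nothing about authentication. The function name is illustrative.

```python
import socket


def can_connect(host, port=5723, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run it on a grey agent with the management server's FQDN as `host`. A False result points to a firewall, routing or name resolution problem rather than an agent problem.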
April 28th, 2011 11:47am
Graham,
I flushed the agent health state by stopping the service and renaming the folder.
The issue still persists. FYI, the MS is in a healthy state.
There are three domains in the forest (AUS, AP and KL) with this common MS.
The KL domain agents are not grayed out; however, the clients in the other domains are.
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 28th, 2011 12:00pm
Hi Marc,
You mean from any grayed-out agent to the MS?
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 28th, 2011 12:01pm
Yes :)
Regards,
Marc Klaver
http://jama00.wordpress.com/
April 28th, 2011 12:23pm
Well, I don't have remote access to the agents.
I did try from the SCOM server (MS) to an agent and got the following results:
C:\WINDOWS>telnet 10.64.89.124 5723
Connecting To 10.64.89.124...Could not open connection to the host, on port 5723: Connect failed
C:\WINDOWS>ping controller0093.aus.apmea.mcdcorp
Pinging controller0093.aus.apmea.mcdcorp [10.64.89.124] with 32 bytes of data:
Reply from 10.64.89.124: bytes=32 time=131ms TTL=119
Reply from 10.64.89.124: bytes=32 time=129ms TTL=119
Reply from 10.64.89.124: bytes=32 time=138ms TTL=119
Reply from 10.64.89.124: bytes=32 time=130ms TTL=119
Thanks and Regards, Gaurav Jain
April 28th, 2011 12:46pm
And you can also reset the cache of the agent: stop the agent, delete c:\program files\system center operations manager 2007\health service state\*.* and start it again. Watch the Operations Manager event log on that agent for the next 5 minutes. It should start establishing a connection, find out that it has to download all the management packs (lots of 1201 events) and continue from there. See if there are any errors in that.
Bob Cornelissen - BICTT (My BICTT Blog)
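The delete step is easy to get wrong, so here is a small sketch that empties a folder while leaving the folder itself in place. It assumes the agent service has already been stopped and that the caller passes the full path of the Health Service State folder; the function name is illustrative.

```python
import os
import shutil


def clear_folder_contents(folder):
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    for entry in os.listdir(folder):
        path = os.path.join(folder, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # subfolder: remove it with its contents
        else:
            os.remove(path)      # plain file or link
        removed += 1
    return removed
```

Only the contents go; passing the agent's install folder instead of the state folder would of course wipe the agent, so double-check the path first.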
April 28th, 2011 12:49pm
Hi All,
I found the KB article http://support.microsoft.com/kb/981263.
What do you think, could this hotfix resolve the issue?
Thanks and Regards, Gaurav Jain
April 28th, 2011 12:49pm
Bob,
I found event ID 29104:
OpsMgr Config Service failed to send the dirty state notifications to the dirty OpsMgr Health Services. This may be happening because the Root OpsMgr Health Service is not running.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Thanks and Regards, Gaurav Jain
April 28th, 2011 12:57pm
Graham,
Found some ops manager logs with Event ID: 20022
The health service {2BD436BB-7BEE-CD05-201E-4D0DC68C9024} running on host AEDXBDC01.ap.APMEA.McDcorp and serving management group APMEA01 with id {E59B9ECA-E516-1BF5-48EA-386678D48B94} is not heartbeating.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Event ID: 10103
In PerfDataSource, could not find counter OpsMgr DW Synchronization Module, Data Items/sec, All Instances in Snapshot. Unable to submit Performance value. Module will not be unloaded.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.CollectionRule.Performance.Synchronization.DataItemsPerSecond
Instance name: MCDCSFPWEB010.kldc.APMEA.McDcorp
Instance ID: {6B811AC7-B264-4ACA-4DFB-7F188794864B}
Management group: APMEA01
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
Event ID: 29104
OpsMgr Config Service failed to send the dirty state notifications to the dirty OpsMgr Health Services. This may be happening because the Root OpsMgr Health Service is not running.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
Event ID: 10103
In PerfDataSource, could not find counter OpsMgr DB Write Action Modules, Avg. Batch Size, All Instances in Snapshot. Unable to submit Performance value. Module will not be unloaded.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.HealthService.CollectionRule.Performance.AvgBatchSize
Instance name: MCDCSFPWEB010.kldc.APMEA.McDcorp
Instance ID: {6B811AC7-B264-4ACA-4DFB-7F188794864B}
Management group: APMEA01
For more information, see Help and Support Center at
Event ID: 26319
An exception was thrown while processing Connect for session id uuid:543dc59e-dc56-4c23-9b06-1a8cf1614e47;id=17.
Exception Message: The creator of this fault did not specify a Reason.
Full Exception: System.ServiceModel.FaultException`1[Microsoft.EnterpriseManagement.Common.UnauthorizedAccessMonitoringException]: The creator of this fault did not specify a Reason. (Fault Detail is equal to Microsoft.EnterpriseManagement.Common.UnauthorizedAccessMonitoringException:
The user APMEA\APMEA-vspoljarec does not have sufficient permission to perform the operation.).
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
Your suggestions, please.
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
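With this many event IDs in play, a small lookup table helps keep them apart. The summaries below are condensed from the event texts quoted in this thread, not from official documentation, and IDs that were only mentioned by number are deliberately left out.

```python
# Summaries are condensed from the event texts quoted in this thread, not from
# official documentation; IDs that were only listed by number are omitted.
EVENT_SUMMARIES = {
    10801: "Discovery data couldn't be inserted to the database",
    33333: "Data Access Layer rejected retry on SqlError",
    31552: "Failed to store data in the Data Warehouse",
    20022: "A health service is not heartbeating",
    10103: "PerfDataSource could not find a performance counter",
    29104: "Config Service failed to send dirty state notifications",
    26319: "Exception while processing Connect (insufficient permission)",
}


def summarize(event_ids):
    """Return (event_id, summary) pairs, marking IDs not quoted in the thread."""
    return [
        (eid, EVENT_SUMMARIES.get(eid, "not quoted in this thread"))
        for eid in event_ids
    ]
```

Of these, 20022 is the one that corresponds directly to grey agents; the others describe database, proxy and permission problems.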
April 28th, 2011 1:10pm
The other way around does not work. You must try this from an agent. Perhaps you can ask the local administrator to test this?
Regards,
Marc Klaver
http://jama00.wordpress.com/
April 28th, 2011 2:07pm
I would say: delete C:\Program Files\System Center Operations Manager 2007\Health Service State\*.*
Otherwise you just deleted the agent itself :)
I usually rename the "Health Service State" folder to "Health Service State.old", just because I am paranoid that I might delete something I shouldn't.
Regards,
Marc Klaver
http://jama00.wordpress.com/
April 28th, 2011 2:12pm
Oh, good catch, Marc! I wrote it down too quickly. I will edit that in my response as well. Somebody walked in just as I was typing it.
Bob Cornelissen - BICTT (My BICTT Blog)
April 28th, 2011 2:16pm
Could you confirm whether all the agents in domains AP and AUS are greyed out, or just some of them?
If all, are the firewall ports open between the domains?
Also, in which domain are the core OpsMgr servers located?
View OpsMgr tips and tricks at
http://systemcentersolutions.wordpress.com/
April 29th, 2011 4:52am
Graham,
All agents in AUS and AP are grayed out.
Yes, the firewall ports are open. The server is in the KLDC domain, where none of the agents are grayed out.
Thanks,
Gaurav
Thanks and Regards, Gaurav Jain
April 29th, 2011 7:07am
If it is all the agents in those domains, then it is very unlikely to be an agent-specific issue. Are you able to log on to a server in either the AUS or AP domain?
If so, run telnet managementservername 5723. Does this connect? If not, it is likely to be a firewall issue.
If this works, then how were the agents installed? Manually? AD integration? Push from the OpsMgr console?
Are the agents listed under Administration, Device Management, Pending Management? If so, can you approve them?
Cheers
Graham
View OpsMgr tips and tricks at
http://systemcentersolutions.wordpress.com/
April 29th, 2011 7:23am
Graham,
Earlier today we ran a port query from the SCOM server to a client:
Query from SCOM Server to Client 10.252.216.42 on tcp5723
=============================================
Starting portqry.exe -n 10.252.216.42 -e 5723 -p TCP ...
Querying target system called:
10.252.216.42
Attempting to resolve IP address to a name...
IP address resolved to sgunspcd33.ap.apmea.mcdcorp
querying...
TCP port 5723 (unknown service): NOT LISTENING
portqry.exe -n 10.252.216.42 -e 5723 -p TCP exits with return code 0x00000001.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
telnet from Client 10.252.216.42 on tcp5723 to SCOM Server
Connection is established > no issue on port tcp5723 from client to server
The agents were installed manually and are listed under Administration, Device Management --> Agent Managed.
Thanks,
Gaurav
Thanks and Regards, Gaurav Jain
April 29th, 2011 7:48am
Well, the direction that is most interesting is the one from the agent towards the server, and that works, as you say.
So those agents are not listed in Administration, Device Management --> Pending Management, right? No updates or actions for those agents waiting there?
Wow, what to check next...
Alright, in Administration, Device Management --> Management Servers, open the properties of that management server (actually, of all of them) and go to the Security tab. Check that the checkbox for "Allow this server to act as a proxy..." is set. It should be turned on, since as you say all machines connect to the same MS, both the green and the grey ones.
On the agent side you could double-check the settings. If you go to Add/Remove Programs and find the SCOM agent, you can click Change. Take the Modify option. Next, take the Modify Management Group option and first of all check that the management group name is correct there (case sensitive!). Next check the management server (it should be the FQDN of the correct server) and port 5723. If those are all OK and exactly what you expect them to be, then you can cancel this wizard (we are not making changes). This is just to make sure all those names are spelled correctly, and the name that is specified should resolve in DNS to what you expect it to be. We are checking this just in case this setting somehow got changed by some automated process or whatever.
Bob Cornelissen - BICTT (My BICTT Blog)
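The comparison part of that agent-side check can be sketched as below. The values still have to be read off the agent's Modify Management Group wizard by hand; the function names and the idea of passing in an expected IP address are illustrative assumptions, not part of any SCOM tooling.

```python
import socket


def resolve_addresses(fqdn):
    """Return the sorted IP addresses the name resolves to (empty if it fails)."""
    try:
        infos = socket.getaddrinfo(fqdn, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def check_agent_settings(configured_group, expected_group, server_fqdn, expected_ip):
    """Compare the values read off the agent wizard with what is expected."""
    problems = []
    if configured_group != expected_group:  # the name is case sensitive
        problems.append("management group name differs")
    addresses = resolve_addresses(server_fqdn)
    if expected_ip not in addresses:
        problems.append(f"{server_fqdn} resolves to {addresses or 'nothing'}")
    return problems
```

An empty list means the management group name matches exactly and the configured server name resolves to the expected address from that agent's point of view.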
April 29th, 2011 11:47am
Hi Bob,
Firstly, thank you for the suggestions on this issue.
Correct, Bob. If I check Pending Management, I find only one agent, with installation in progress. The rest of the agents are under Agent Managed.
As mentioned earlier, we have clients from different domains, i.e. APMEA, AUS, AP and KLDC. Of these, the KLDC agents are not grayed out; the rest are.
Talking about the MS, we have only one server, which is in the KLDC domain. The checkbox under Server Proxy for "Allow this server to act as a proxy..." is already ticked.
Regarding the agent-side troubleshooting you asked for: out of 57 clients, 8 are green and the rest are gray. I will ask an onsite team to check the details you suggested.
Regards,
Gaurav
Thanks and Regards, Gaurav Jain
April 29th, 2011 12:44pm
Hi!
I'm a bit curious about "The user APMEA\APMEA-vspoljarec does not have sufficient permission to perform the operation".
What account is this?
Do you have similar accounts in the other domains?
Are these accounts OK? Passwords not expired?
/Roger
This posting is provided "AS IS" with no warranties, and confers no rights.
April 30th, 2011 2:26am
Hi Roger,
I just checked AD; yes, these accounts are fine and the passwords have not expired.
These are enterprise admin accounts.
Thanks,
Gaurav
Thanks and Regards, Gaurav Jain
April 30th, 2011 3:30pm
Hi Graham,
Could you please advise on this ongoing issue?
Regards,
Gaurav Jain
Thanks and Regards, Gaurav Jain
May 2nd, 2011 6:57am
Hi Gaurav, have the onsite guys checked the settings on a few of the SCOM clients? Is the server name the correct FQDN of the management server, and can the agent resolve the name of that box correctly? Can it telnet to that management server on port 5723? And of course, are the other settings correct as well (management group name is case sensitive, port 5723, agent running as LocalSystem)?
What specific events are in the Operations Manager event log on those clients right after they start up? (So restart the System Center Management service.)
Bob Cornelissen - BICTT (My BICTT Blog)
May 2nd, 2011 7:09am