Management Server grayed out (Network Steve Forum)

Hello,

One of our management servers remains grayed out. The RMS is fine.

I restarted the System Center Management service. The management server came back green and healthy, but after 30 minutes it was grayed out again. I then rebooted the management server, with the same result: green and healthy at first, grayed out again after 30 minutes. It is not stable! Any ideas?

It started a little after 7 pm last night. The Operations Manager log has since filled up and overwritten everything from that time, so the first error still available is from about 6:35 a.m. this morning:

Log Name: Operations Manager
Source: HealthService
Date: 5/5/2011 6:32:56 AM
Event ID: 4506
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: opmgrms1.ad
Description: Data was dropped due to too much outstanding data in rule "Microsoft.Linux.SLES.11.LogicalDisk.DiskReadsPerSecond.Collection" running for instance "/" with id:"{12945CB4-9927-6A00-572E-2CC215817557}" in management group "SCOM-MED".

and

Log Name: Operations Manager
Source: HealthService
Date: 5/5/2011 1:08:22 PM
Event ID: 2115
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: opmgrms1.ad.medctr.ucla.edu
Description: A Bind Data Source in Management Group SCOM-MED has posted items to the workflow, but has not received a response in 9840 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
Instance : opmgrms1.ad.medctr.ucla.edu
Instance Id : {EA878F39-4DF3-0145-F7C2-50BE6A431D96}

There is nothing in the System log or in the Application log. The agents using this management server look okay.

I am looking at http://blogs.technet.com/b/kevinholman/archive/2008/04/21/event-id-2115-a-bind-data-source-in-management-group.aspx
With the new threshold found in the article, the server took longer to go back to gray, but the 2115 events are still coming in.

Thanks,
Dom
System Center Operations Manager 2007 / System Center Configuration Manager 2007 R2 / Forefront Client Security / Forefront Identity Manager
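To see which workflows are behind the 2115 warnings once the log entries have been exported as text, something like the following Python sketch could tally them. It is only an illustration: it assumes the descriptions are already available as plain strings (how they are exported is not covered here), and the function name `summarize_2115` is made up for this example.

```python
import re
from collections import Counter

# Matches the wait time and workflow id in the description text of event 2115.
EVENT_2115 = re.compile(
    r"has not received a response in (?P<seconds>\d+) seconds\."
    r".*?Workflow Id\s*:\s*(?P<workflow>\S+)",
    re.DOTALL,
)


def summarize_2115(descriptions):
    """Map each workflow id to (number of events, longest wait in seconds)."""
    counts = Counter()
    longest = {}
    for text in descriptions:
        match = EVENT_2115.search(text)
        if match is None:
            continue  # not a 2115 description, or an unexpected layout
        workflow = match.group("workflow")
        counts[workflow] += 1
        longest[workflow] = max(int(match.group("seconds")), longest.get(workflow, 0))
    return {workflow: (counts[workflow], longest[workflow]) for workflow in counts}


# Description text copied from the event quoted in the post above.
sample = (
    "A Bind Data Source in Management Group SCOM-MED has posted items to the "
    "workflow, but has not received a response in 9840 seconds. This indicates "
    "a performance or functional problem with the workflow. "
    "Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange "
    "Instance : opmgrms1.ad.medctr.ucla.edu"
)

for workflow, (count, seconds) in summarize_2115([sample]).items():
    print(f"{workflow}: {count} event(s), longest wait {seconds} s")
```

A steadily growing longest wait for one workflow would point at a specific write path, while similar counts across all workflows would suggest a general bottleneck.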

May 5th, 2011 3:48pm

Hello Dan,

There has been no new management pack for the last three months. Over the past two years we suppressed 95% of the alerts and tuned the system, and for the past 8 months it has been slow but working.

What is strange is that everything worked better on the old hardware: slow, but working. The RMS server hosted its own database, the data warehouse hosted its own as well, and it worked! Now the hardware is supposedly faster and better, but we have issues. The OperationsManager database is on a SQL cluster, off the RMS, and we have more issues! It is faster during the 10-30 minutes it works after the service is restarted, but it seems to create more traffic than before.

I am working on these rules now:

Rule: Data Warehouse performance collection: writer average batch processing time
Target: Data Warehouse Connection Server
Object: OpsMgr DW Writer Module
Counter: Avg. Batch Processing Time, ms

Rule: Collect Operations Manager DB Write Action Modules\Avg. Processing Time
Description: Collects Avg. Batch Size performance counter.
Target: Collection Server
Object: OpsMgr DB Write Action Modules
Counter: Avg. Processing Time

I have created a dashboard with four views covering the 2 servers (OM, DW) x 2 rules (Avg. Batch Size, Avg. Batch Processing Time).

Thanks,
Dom

May 5th, 2011 5:38pm

Have you added more agents? Or, just a guess: do you have workflows running on the RMS? The HP blade hardware MP? Are you using the RMS as a watcher node?

Microsoft Corporation

May 5th, 2011 6:49pm

May 5th, 2011 7:21pm

May 5th, 2011 7:23pm

Hi Dan,

1. The number of agents has grown from 550 to 600 within the last 6 months.
2. I will have to check this. There are certainly some workflows running on the RMS, and I will need to identify them. Most of the workflows run on the management servers (the ones that are grayed out; a new one started failing today!).
3. Yes. We have had the HP blade hardware MP in place for about 12 months.
4. I think you caught the bottleneck. The RMS was used as a watcher node to run the manual ping (really not a good idea!), and since the list has been expanded it might be the issue, even though these alerts are still coming in properly and are the only ones doing so. I will move the watcher node somewhere else.

But that change happened at least three weeks ago, and the issue only appeared last night! Also, the servers that are grayed out are the management servers, not the root management server (the watcher node). Why is this happening when we have better-performing hardware than before and a better architecture, with the Operations Manager SQL database off the RMS? Something is still strange, because apart from the hardware change for the SQL database, nothing has changed recently.

Using the dashboard I am trying to confirm this is the issue, but I now need to expand the dashboard to each instance of each counter:

- discoverywriteitemmodule
- eventwritemodule
- performancesignaturewritemodule
- performancewritemodule
- sqlwritemodule
- statechangewritemodule

I am also checking all the 2115 sources, as it seems that not only the data warehouse but other workflows are involved too:

Microsoft.SystemCenter.CollectAlerts
Microsoft.SystemCenter.CollectDiscoveryData
Microsoft.SystemCenter.CollectEventData
Microsoft.SystemCenter.CollectPerformanceData
Microsoft.SystemCenter.CollectPublishedEntityState
Microsoft.SystemCenter.CollectSignatureData
Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
Microsoft.SystemCenter.DataWarehouse.CollectEventData
Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData

33 events of each within 1 hour!

I am reducing the number of machines pinged to see whether the management servers come back up. I have also placed a request for VMs to be created and dedicated to the watcher nodes. With 1,000 ping items (including 600 servers), I will check how many I need. I think at least 3, but if I do cross pinging I might need 6 or 7. Am I right? I saw an article recommending only 99 items per watcher node. Is that okay? The MS engineer had planned for 700-750 for us during his installation mission.

Thanks,
Dom
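The watcher node question is mostly arithmetic, so a rough Python sketch may help compare the figures mentioned in the post (99 items per node from the article, 700-750 from the installation plan). It assumes that "cross pinging" means each target is pinged from two watcher nodes, which the post does not actually state; `watcher_nodes_needed` is a name invented for this example.

```python
import math


def watcher_nodes_needed(targets, per_node_limit, watchers_per_target=1):
    """Smallest number of watcher nodes that keeps every node at or under its limit."""
    return math.ceil(targets * watchers_per_target / per_node_limit)


TARGETS = 1000  # ping items mentioned in the post

for limit in (99, 700, 750):
    single = watcher_nodes_needed(TARGETS, limit)
    # Assumption: cross pinging = every target checked from two watcher nodes.
    cross = watcher_nodes_needed(TARGETS, limit, watchers_per_target=2)
    print(f"limit {limit} per node: {single} node(s), {cross} with cross pinging")
```

Under these assumptions the answer ranges from 2 nodes to more than 20, so the per-node limit that the hardware can really sustain matters far more than the exact target count.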

May 5th, 2011 7:25pm

For the second event involved in this graying out (Event ID 4506), only one MP seems to be involved, the Cross-Platform MP:

Data was dropped due to too much outstanding data in rule "Microsoft.Linux.SLES.11.LogicalDisk.FreeMegabytes.Collection" running for instance "/opt/IBM" with id:"{925DF086-DC2F-C099-8AB4-0577428FB6AB}" in management group "SCOM-MED".

Data was dropped due to too much outstanding data in rule "Microsoft.Linux.RHEL.5.LogicalDisk.DiskBytesPerSecond.Collection" running for instance "/" with id:"{72553262-DE70-8DBC-202D-9292ED9679DB}" in management group "SCOM-MED".

All of the events have the prefix Microsoft.Linux.SLES.11.LogicalDisk.... or Microsoft.Linux.RHEL.5.LogicalDisk....

Still checking ...

Thanks,
Dominique
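The observation that every 4506 event carries one of two rule prefixes can be checked mechanically. The Python sketch below groups the descriptions by the first five dot-separated parts of the rule name; the function name `rule_families` and the choice of five parts are assumptions made for this illustration, and the sample strings are shortened copies of the two events quoted above.

```python
import re
from collections import Counter

# Pulls the rule name out of an event 4506 description.
RULE_NAME = re.compile(r'in rule "(?P<rule>[^"]+)"')


def rule_families(descriptions, depth=5):
    """Count events by the first `depth` dot-separated parts of the rule name."""
    families = Counter()
    for text in descriptions:
        match = RULE_NAME.search(text)
        if match:
            parts = match.group("rule").split(".")
            families[".".join(parts[:depth])] += 1
    return families


samples = [
    'Data was dropped due to too much outstanding data in rule '
    '"Microsoft.Linux.SLES.11.LogicalDisk.FreeMegabytes.Collection" '
    'running for instance "/opt/IBM"',
    'Data was dropped due to too much outstanding data in rule '
    '"Microsoft.Linux.RHEL.5.LogicalDisk.DiskBytesPerSecond.Collection" '
    'running for instance "/"',
]

for family, count in rule_families(samples).most_common():
    print(f"{count:4d}  {family}")
```

If any family outside the two LogicalDisk prefixes shows up in a larger export, the problem is not limited to the cross-platform disk collection rules.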

May 6th, 2011 4:13pm

May 7th, 2011 5:49am

Hi Dom,

Cross-platform monitoring is also a bigger hit on the management server than, for instance, Windows agents. Consider moving it to another management server rather than the RMS if there are more than a few cross-platform agents. How many cross-platform hosts are you monitoring?

Pinging a thousand boxes from the RMS is also one of the possible bottlenecks, as the RMS is doing all the work. To start with, you might consider not pinging the servers you already monitor with SCOM agents. Next, you could consider increasing the sample interval somehow (for example, instead of every 2 minutes, move it up to 3 or 4 minutes between samples). And of course, have another management server do most of this work rather than the RMS.

Bob Cornelissen - BICTT (My BICTT Blog)

Hi Bob,

1. Cross-platform hosts through their MP: 100. Cross-platform hosts through the Multi-Host Ping MP: 250.
2. Yes, correct. I discovered this and am trying to move the watcher nodes to other servers dedicated to this, but so far they do not show up in the Ping Watcher Role view, even though the registry has been updated properly on each watcher node.

I need to ping the servers because our environment does not "believe" in the "heartbeat failure" and "failed to connect to computer" alerts used previously, so I had to fall back on a simple MP built with the Multi-Host Ping MP 3.0 from System Central.

The ping is set to 300. I think this means ms, so it is already 5 minutes...

I also have several management servers, and they seem to be the ones not handling the load: even though the RMS is the watcher node, it is still up and pinging, while the individual MSs are grayed out and no longer working. I am trying to add several servers as watcher nodes, but I might need to add the management server function to them as well, as it does not seem to work for now. I have another thread open:

http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/4665376b-e98c-47e6-b04b-ea43ec0cbb44

Thanks,
Dom

May 7th, 2011 6:21pm

Hi Dom,

It is unfortunate that they don't believe in the SCOM monitoring for heartbeats and failed-to-connect, which actually goes further than just a ping (ping is the very last thing to stop on a machine and the first to start). But anyway, if you cannot convince them of that, you may end up with double monitoring (by using the ping MP). I think even a small change in interval will do a lot. Perhaps go to 360 for the ping (and yes, the value you see is in seconds).

For the ping MP, I have seen that you have another thread open, so I am sure you will get through that one there.

I would advise moving the cross-platform monitoring to a dedicated management server, because it has a high number of workflows running. Remember to distribute the Run As accounts to that management server as well, otherwise it will not monitor those boxes.

Bob Cornelissen - BICTT (My BICTT Blog)
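As a back-of-the-envelope check on the interval change, the short Python sketch below compares the average ping rate at 300 and 360 seconds for the roughly 1,000 targets mentioned earlier in the thread. It assumes one ping check per target per interval, spread evenly over time; the real MP may send several echo requests per check, so treat the absolute numbers as illustrative only.

```python
def pings_per_second(targets, interval_seconds):
    """Average number of ping checks started per second."""
    return targets / interval_seconds


TARGETS = 1000  # figure from earlier in the thread

before = pings_per_second(TARGETS, 300)
after = pings_per_second(TARGETS, 360)
reduction = 1 - after / before

print(f"{before:.2f} -> {after:.2f} checks/s ({reduction:.1%} fewer)")
```

The relative saving depends only on the two intervals (1 - 300/360, about one sixth), not on the number of targets.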

May 8th, 2011 5:11am

May 9th, 2011 12:12am

May 9th, 2011 1:32pm

Hi Dominique,

I am sure that offloading the cross-platform monitoring to a different box will help a lot, because of the number of workflows it needs to keep in the air. So how big is your environment? You state that your evaluation will be for 250 cross-platform boxes. Or will this be the total amount?

All the usual arguments about physical versus virtual apply in this case as well. I have used virtual machines to run a very high load of workflows (for example, monitoring VMware clusters with a lot of objects). My preference for larger deployments is physical, simply because it is dedicated hardware, not shared with other virtual machines. And a dual quad-core with a good amount of RAM is in most cases easier to get on physical hardware.

The sizing guide will tell you roughly how much you can load onto one machine. I think it was 500 cross-platform agents on one MS box. But don't forget that you might start with basic monitoring and then go beyond it, for instance by also picking up additional hardware information through SNMP (also a lot of workflows there!), by monitoring a lot of custom processes or log file entries, or by running third-party MPs (Novell, Bridgeways and others). All of this increases the load in the end, so it really comes down to testing: put load on the machine and see how much it can take in your specific case. If you add a lot of additional monitoring, you might want to place fewer than 500 agents on one MS. Add them in increments of, for instance, 30 at a time and watch the counters on the MS.

Bob Cornelissen - BICTT (My BICTT Blog)
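The "30 at a time" rollout can be planned ahead as a simple batch list. The following Python sketch splits a host list into batches; the host names are hypothetical placeholders, the count of 250 is the figure discussed in the thread, and the actual move to the new management server (plus checking the counters between batches) is left out.

```python
def rollout_batches(hosts, batch_size=30):
    """Yield successive batches of hosts to move to the dedicated management server."""
    for start in range(0, len(hosts), batch_size):
        yield hosts[start:start + batch_size]


# Hypothetical host names standing in for the 250 cross-platform boxes.
hosts = [f"unix-host-{n:03d}" for n in range(1, 251)]
batches = list(rollout_batches(hosts))

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} hosts, {batch[0]} .. {batch[-1]}")
```

Pausing after each batch until the workflow and module counters settle keeps it clear which increment pushed the server past what it can handle.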

May 9th, 2011 1:54pm

Hi Bob,

- Cross-platform monitoring: 250 servers
- Total environment: 1,000 servers

I would use a VM because it is easy for me to get one deployed. It would be:

- 4 GB RAM
- 2 CPU Dual 3.00 GHz
- Network speed: 1 Gb/s
- Microsoft Windows Server 2008 R2 Enterprise 64-bit
- Drive: 50 GB

I too would prefer a physical machine, but our process will take longer.

Let me check for the sizing guide. I have:

http://aspoc.net/archives/2007/10/16/opsmgr-2007-database-and-data-warehouse-size-calculator/
http://blogs.technet.com/b/momteam/archive/2007/10/15/opsmgr-2007-database-and-data-warehouse-size-calculator.aspx

or the huge one:

http://www.microsoft.com/downloads/en/confirmation.aspx?FamilyID=B0E059E9-9F19-47B9-8B01-E864AEBF210C

I am not sure whether there is a section for the Cross Platform MP alone. I will check the web site for Cross Platform itself. I think scenario 3 might fit this specific server, but it would be for Unix or Linux only and a management server only (no database):

Role: Management Server
Hardware:
• 2 disk RAID 1
• 4 GB RAM
• Dual Proc

So far I have fewer than 500 Linux/Unix platforms, so hopefully a VM can handle this load. Otherwise I will need more memory, as for a management server memory seems to be the only parameter that changes between scenarios 3, 4, 5 and 6 (from 4 GB RAM to 8 GB RAM). That should let this management server handle between 500 and 1,000 servers, leaving aside the workflow load. Or maybe I could refurbish an existing server that has been decommissioned from its original purpose.

Yes, I will definitely need more than basic monitoring. SNMP will also be on the list sooner or later. Performance: processor, memory, etc. nworks... I will add only 30 at a time and see how it works.

Thanks,
Dom

May 9th, 2011 5:29pm

Hi Bob,

I have two threads that could definitely be linked:

http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/4665376b-e98c-47e6-b04b-ea43ec0cbb44

Role: Management Server
Hardware:
• 2 disk RAID 1
• 8 GB RAM
• Dual Proc

I might go physical for the cross-platform monitoring and stay on a VM for the Ping MP. What do you think?

Thanks,
Dom

May 9th, 2011 5:52pm

Hi,

Based on my research, I would like to suggest the following:

1. Clear the HealthService queue on the problematic server:
   1) Stop the System Center Management service.
   2) Go to C:\Program Files\System Center Operations Manager 2007\ and rename the “Health Service State” folder.
   3) Restart the System Center Management service.

2. Check the antivirus exclusion settings:

   Antivirus exclusions for Operations Manager 2007
   http://blogs.msdn.com/b/nickmac/archive/2008/07/18/antivirus-exclusions-for-operations-manager-2007.aspx

   Antivirus Exclusions for MOM and OpsMgr
   http://blogs.technet.com/b/kevinholman/archive/2007/12/12/antivirus-exclusions-for-mom-and-opsmgr.aspx

3. Please also try the methods in the following post:

   SCOM: How to troubleshoot gray agent states in System Center Operations Manager 2007
   http://blogs.technet.com/b/schadinio/archive/2010/07/20/scom-how-to-troubleshoot-gray-agent-states-in-system-center-operations-manager-2007.aspx

Meanwhile, I would like to share the following with you for your reference:

The new and improved guide on HealthService Restarts. Aka – agents bouncing their own HealthService
http://blogs.technet.com/b/kevinholman/archive/2009/12/21/the-new-and-improved-guide-on-healthservice-restarts-aka-agents-bouncing-their-own-healthservice.aspx

Hope this helps. Thanks.

Nicholas Li - MSFT

Hello Nicholas,

1. This is already a common practice here and has been used several times, but it does not seem to work in this case.
2. I will put all these exclusions in place and see how it affects the issue.
3. I read this KB, but it does not really help: all the steps have been done and seem to work for a while, and then it fails again.

The last item, from Kevin, is set up now; I am waiting for the alerts. My thresholds are already set to the values defined in the articles:

Monitor: Health Service Handle Count Threshold
Class Management Server 10000
Management Server Agent 10000
Exchange 2007 Computer Group 5000
Group Management Server Computer Group 30000

So there is no problem with the overrides. I am waiting for alerts on:

- Health Service Handle Count
- Health Service Private Bytes
- Monitoring Host Handle Count
- Monitoring Host Private Bytes

After 5 minutes two management servers became healthy. Then one of them went gray again after 30 minutes, but I did not get any alert in the console!

The Ping MP has been removed: no alerts, no emails, no members in the PingTarget and PingWatcher roles.

Back to the 2115 and 4506 events.

Thanks,
Dom
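Since the queue-clearing procedure from Nicholas's step 1 (stop the service, rename the folder, restart the service) gets repeated often in this kind of troubleshooting, it could be scripted roughly as in the Python sketch below. This is an illustration under assumptions, not a supported tool: it assumes the service's short name is `HealthService`, uses the Windows `net stop` / `net start` commands, and the function name and the timestamped backup suffix are invented for this example. It would need to run from an elevated prompt on the management server itself.

```python
import subprocess
import time
from pathlib import Path

# Assumption: short name of the service shown as "System Center Management".
SERVICE = "HealthService"


def clear_health_service_state(install_dir, run=subprocess.run):
    """Stop the service, rename the state folder, start the service again.

    Returns the path of the renamed (backup) folder.
    """
    state = Path(install_dir) / "Health Service State"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = state.with_name(f"{state.name}.old-{stamp}")

    run(["net", "stop", SERVICE], check=True)
    try:
        state.rename(backup)  # the service rebuilds the folder on start
    finally:
        # Always try to bring the service back, even if the rename failed.
        run(["net", "start", SERVICE], check=True)
    return backup
```

It would be called with the install directory from the post, for example `clear_health_service_state(r"C:\Program Files\System Center Operations Manager 2007")`. Renaming rather than deleting keeps the old state folder around in case it is needed for later analysis.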

May 9th, 2011 6:51pm

Yes, I think you can do that. Also, for the ping MP you could start at a lower number and increase it if the server can handle it.

Bob Cornelissen - BICTT (My BICTT Blog)

May 10th, 2011 1:47am

If you monitor from a VM, I would at least go in steps. If the VM host does not run a lot of other intensive virtual guests, then it is doable. Monitoring VMware also takes a big hit, as it monitors a lot of objects once you get to larger numbers of hosts and guests. Nworks is the most scalable and stable solution for this and in my opinion well worth it (I have been a big fan for years).

It is not just about memory. The server will need more and more of it as you add more monitoring, but in the end it is about the number of workflows and other things it needs to do at any given time. At some point it just cannot handle the load, and it is not a given that this will happen at 100% CPU or 100% memory usage. Keep an eye on the performance counters already available in Monitoring -> OpsMgr MP, under the management packs folder.

Bob Cornelissen - BICTT (My BICTT Blog)

May 10th, 2011 1:56am

Hello Bob,

Surprise this morning: the registry entries I deleted yesterday to remove the ping from the RMS are back. I will have to check what happened. I don't think there is any automatic process, is there?

Monitoring > Operations Manager > Management Server Performance > Workflow Count
RMS: 7.239 (flat for days...)
MS1, MS2, DMS1: 0

Monitoring > Operations Manager > Management Server Performance > Active File Uploads
Average 70, peak 939! RMS only.

Monitoring > Operations Manager > Management Server Performance > Console and Connection Count
6 to 12. RMS only.

I don't see any counters for memory and/or CPU in this folder!

Thanks,
Dom

May 10th, 2011 12:21pm

Hi Dom,

The memory and CPU counters belong to the Windows Server 200x management pack and can be found there (or at the top-level Computers view: right-click the computer and select open performance view). Besides the workflow count, the module count is also interesting. What is interesting is that the management servers have a zero workflow count. I don't know why the registry entries are back. Perhaps you have ghosts.

Bob Cornelissen - BICTT (My BICTT Blog)

May 10th, 2011 12:45pm

Hi Bob,

Memory and CPU are empty in the performance views for all servers, so I checked the Health Explorer. It shows Healthy, but there are no state change events at all on the RMS and the MSs! The other entity health monitors are Healthy as well, but with no state change events whatsoever. Strange. Availability has its state change events filled in with no problem. The monitors are enabled.

Module Count: 29,000+. The MSs are all at 0. That might be because they are grayed out and do not report, but the one that is healthy is at 0 too! I don't seem to capture any performance data on the MSs, only on the RMS itself.

Thanks,
Dom

May 10th, 2011 2:15pm

Hi Bob,

I am thinking of removing the Cross-Platform Management Pack for 1-2 days to see whether the management server is able to come back. I would just delete it from Administration > Management Packs: right-click and delete all MPs containing UNIX or Linux in their description. That is about 24 items. Is there a better way to do it?

Thanks,
Dom

May 11th, 2011 8:03pm

You could put the Unix boxes in maintenance mode.

Bob Cornelissen - BICTT (My BICTT Blog)

May 12th, 2011 3:08am

This will stop the workflows for RHEL... cross-platform MP... Will it be sufficient to stop all traffic for the UNIX/Linux machines? I did it (Monitoring > Unix/Linux Servers). I will restart the MS that is always grayed out to see how it behaves now.

I am opening a new thread, as the server is grayed out again with all Unix/Linux servers in maintenance mode:

http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/d6e3eb4a-a3a2-444d-8971-811b90e77e5e

Thanks,
Dom

May 12th, 2011 4:30pm

This topic is archived. No further replies will be accepted.