By John Alvord, IBM Corporation
An IBM Tivoli Monitoring [ITM] site was hard down and the customer believed that the Quarterly earnings report process was seriously at risk. During a 20 hour long day spent recovering files I discovered that once again a customer had abused some MS_Offline type situations. That created an unstable Tivoli Enterprise Monitoring System [TEMS] environment and after a crash the site was not operational. The MS_Offline abuse was one of the two factors that led to this result.
ITM has a product-provided situation called MS_Offline. It fires when an agent stops providing node status updates, commonly called heartbeats. It is a great feature whether you have 200 agents or 20,000 agents. You get awareness of down conditions quickly and can then take steps to recover.
Many customers create their own varieties of MS_Offline. For example Windows agent offlines can be routed to one particular group and Unix OS Agent offlines to another group.
There are great dangers lurking in MS_Offline situation usage. Some commonly used settings can stress the TEMS severely to the point of failure. In addition, the exact logic behind MS_Offline has not been documented and so assumptions and myths have arisen. Severe problems cause excess CPU use at the least and often full TEMS failures.
If you are not interested in the gory details skip to the end where there is a Recommendations section of what to do and not do to avoid severe problems.
We will start by showing exactly how MS_Offline works from the top down. After that we will go bottom up.
A section below was added concerning ITM 623 FP5 and ITM 630 FP3, which make some fundamental improvements during remote TEMS recycles and FTO switches.
MS_Offline Top Down
You can view the MS_Offline situation in the Situation Editor under the All Managed Systems category. The Formula looks like this:
( Status == *OFFLINE AND Reason != 'FA')
The detailed formula shows the underlying table INODESTS:
( #'INODESTS.O4ONLINE' == *OFFLINE AND #'INODESTS.ONLINE' != 'FA')
When the MS_Offline situation is started, the above formula is translated into SQL that looks like this in the SITDB table:
SELECT ORIGINNODE, HOSTADDR, HOSTINFO, THRUNODE, NODE, PRODUCT, ONLINE, O4ONLINE, GBLTMSTMP, NODETYPE, VERSION
WHERE SYSTEM.PARMA("SITNAME", "MS_Offline", 10) AND
SYSTEM.PARMA("NUM_VERSION", "0", 1) AND
SYSTEM.PARMA("LSTDATE", "1120919102949000", 16) AND
SYSTEM.PARMA("SITINFO", "TFWD=Y;OV=N;TDST=0;", 19) AND
INODESTS.O4ONLINE = 'N' AND
INODESTS.ONLINE <> 'FA' ;
All of the related attributes are captured at the same moment and that is seen in the SELECT columns. The SYSTEM.PARMA values are passed into the routines that process the situation. The last two tests are the formula translated into SQL.
During situation start logic the SQL above is Lodged in the TEMS dataserver [SQL processor]. That means the SQL is compiled to make sure all the tables and columns are known. The “WHERE” clause is compiled into two binary objects that are used to filter rows.
When everything is validated, the result is added to the list of situations which will run periodically based on the Sampling Interval.
Most situations we think about run at Agents on other systems. MS_Offline is a probe based situation. Each time it is evaluated every row of the table is checked. The row may be emitted as a potential result depending on the filter which is derived from the WHERE clause and other factors.
The situation monitor logic [SITMON] evaluates the result rows and may produce an event. For example, if a managed system had not been offline and is now seen in a result, an event is created. If the managed system had already been reported offline, the result is ignored.
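The two steps just described, the probe that checks every row and the SITMON logic that only raises an event on a new offline, can be sketched in a few lines of Python. This is an illustrative model only, not TEMS code: the names NodeStatus, probe and sitmon are made up for the example, and the "Y" value for an online node is an assumption (the SQL above only shows 'N' for offline). Event closing is not modelled.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    node: str
    o4online: str  # "N" = offline as in the SQL above; "Y" = online is assumed
    reason: str    # the ONLINE column; "FA" during the startup grace period


def probe(rows):
    """Check every row and emit the ones that pass the WHERE-clause filter."""
    return [r for r in rows if r.o4online == "N" and r.reason != "FA"]


def sitmon(results, already_open):
    """Create events only for nodes not already reported offline."""
    new_events = [r.node for r in results if r.node not in already_open]
    already_open.update(new_events)
    return new_events


rows = [
    NodeStatus("host1:NT", "Y", ""),
    NodeStatus("host2:LZ", "N", ""),
    NodeStatus("host3:UX", "N", "FA"),  # filtered out by the Reason test
]
open_events = set()
first = sitmon(probe(rows), open_events)   # new offline, so an event
second = sitmon(probe(rows), open_events)  # same result again, ignored
print(first, second)
```

Note that probe touches every row on every evaluation even though only one row produces an event. That full scan is where the cost comes from.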
You can already see there is a substantial impact from running this sort of situation. If there are 10,000 agents, the table in QA1DNSAV.DB will be about 8 MB. Reading that number of rows with that amount of data every minute will cost a lot of CPU. If there are 15 such situations, the TEMS can be seriously dragged down. The worst case I ever saw was 87 MS_Offline situations, and the TEMS failed about once a day.
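To put rough numbers on that, here is a back-of-the-envelope calculation in Python. It assumes each MS_Offline type situation performs its own full scan of the table at every sampling interval; whether the TEMS combines some of those scans internally is not covered here, so treat the result as an upper-bound style estimate. The function name scan_load is hypothetical.

```python
def scan_load(agents, table_bytes, situations, interval_minutes):
    """Rows checked and bytes read per hour, assuming one full scan per
    situation per sampling interval (an assumption, see the text)."""
    scans_per_hour = situations * 60 / interval_minutes
    return agents * scans_per_hour, table_bytes * scans_per_hour


# Figures from the article: 10,000 agents, about 8 MB, 15 situations, 1 minute.
rows_per_hour, bytes_per_hour = scan_load(
    agents=10_000,
    table_bytes=8 * 1024 * 1024,
    situations=15,
    interval_minutes=1,
)
print(f"{rows_per_hour:,.0f} rows checked per hour")
print(f"{bytes_per_hour / 2**30:.1f} GiB read per hour")
```

Moving the same 15 situations to a 15 minute sampling interval divides both figures by 15.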
MS_Offline and Persist
Persist is a situation setting whereby a result must occur Persist number of times in a row before it is recognized as an event. It is useful for tracking processes that can flip in and out of an alert zone while everything is actually normal. When the condition occurs Persist times in a row, something is bad and an event is generated.
MS_Offline with Persist greater than one has a stunningly high impact. When Persist is off or 1, the dataserver will emit a single result row per offline agent. 10,000 agents might result in 40-50 results. When Persist is greater than one, the dataserver will emit a result for every single agent no matter what its status is. That is required to allow the "in a row" logic to operate.
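The difference in result volume can be illustrated with a small Python model. This is a sketch of the behavior described above, not the dataserver's actual implementation; emitted_rows and update_streaks are hypothetical names, and the 45 offline agents are simply a value in the 40-50 range mentioned in the text.

```python
def emitted_rows(statuses, persist):
    """statuses maps node name -> True if offline. Returns the rows emitted
    for one sample."""
    if persist <= 1:
        return [node for node, offline in statuses.items() if offline]
    # With Persist > 1 every agent is emitted so that the consecutive
    # ("in a row") count can be maintained for each one.
    return list(statuses)


def update_streaks(streaks, statuses, persist):
    """Track consecutive offline samples; return nodes that just reached
    the Persist threshold."""
    fired = []
    for node, offline in statuses.items():
        streaks[node] = streaks.get(node, 0) + 1 if offline else 0
        if streaks[node] == persist:
            fired.append(node)
    return fired


statuses = {f"agent{i}": i < 45 for i in range(10_000)}  # 45 offline agents
print(len(emitted_rows(statuses, persist=1)))  # rows for offline agents only
print(len(emitted_rows(statuses, persist=3)))  # rows for every agent

streaks = {}
for sample in range(3):
    fired = update_streaks(streaks, statuses, persist=3)
print(len(fired))  # events raised on the third consecutive sample
```

In this model the same 45 events come out either way. Persist greater than one only delays them, while multiplying the rows handled on every sample by more than two hundred.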
This workload is enough to cripple many TEMS environments. I have seen this a dozen times over the last few years. Please do NOT make that choice. At best you chew up a lot of extra CPU. At worst your TEMS goes unstable, storage use grows and eventually TEMS fails. Then both you and IBM Support spend a lot of time figuring it out and recovering.
Just Say NO!!!
I worked with one site that created an MS_Online situation that was equally destructive for exactly the same reason: it produced excessive results.
In every single case I worked with, stability and reasonable CPU was only achieved by setting Persist to one or off after a few days or some painful months. Please skip that particular experience.
The Situation Audit blog post produces a number of MS_Offline advisories including Persist>1.
MS_Offline bottom up – startup Phase
At hub TEMS startup, the table TNODESAV is used to populate the INODESTS table. When it is populated, the Reason attribute is set to “FA” [Framework Architecture]. The theory is that agents are given a period of time to send in node status. Otherwise they would all appear as offline. There is little value in a flood of offlines and then a parallel flood of onlines a few minutes later.
Twelve minutes later, the Reason attribute is set to blank. This can be configured by the TEMS environment variable CMS_FA_GEN_CLEANUP, which has a minimum value of 0 [minutes] and a maximum of 120. From what I can tell this has never been documented, and I am sure it has never been tested. So experiment at your own risk.
Because of this startup condition, every MS_Offline type situation must include this test:
*AND *VALUE ManagedSystem.Reason *NE FA
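A simplified Python model shows why the test matters during the startup window. It assumes every row starts offline with Reason "FA", that a row goes online when its agent first checks in, and that the Reason of rows still offline is blanked once the cleanup time passes. The function name offline_matches and the check-in pattern are invented for the illustration.

```python
def offline_matches(checkin_minutes, now, with_reason_test, cleanup_minutes=12):
    """Count rows matching the offline formula `now` minutes after startup.

    checkin_minutes holds, per agent, the minute it first sends node status,
    or None if the agent is really down. cleanup_minutes plays the role of
    CMS_FA_GEN_CLEANUP (default 12).
    """
    matches = 0
    for checkin in checkin_minutes:
        online = checkin is not None and checkin <= now
        reason = "FA" if (not online and now < cleanup_minutes) else ""
        if not online and (reason != "FA" or not with_reason_test):
            matches += 1
    return matches


# 1,000 healthy agents checking in over the first 10 minutes, 5 really down.
checkins = [i % 10 for i in range(1000)] + [None] * 5

print(offline_matches(checkins, now=1, with_reason_test=False))   # startup flood
print(offline_matches(checkins, now=1, with_reason_test=True))    # suppressed
print(offline_matches(checkins, now=12, with_reason_test=True))   # real offlines
```

Without the Reason test the situation reports hundreds of agents one minute after startup, and any attached action command would run for each of them.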
That is especially important if you have action commands with the situation. A hub TEMS can easily crash if thousands of action commands run at the same time. I have worked on TEMSes that crashed every time they started for exactly this reason. See more about action commands later.
MS_Offline bottom up – single hub TEMS
Agents are configured with a heartbeat time – CTIRA_HEARTBEAT – which defaults to 10 minutes. On the first time at startup, the Agent will attempt to connect through two heartbeat cycles. After that it tries potential TEMS servers defined in the CT_CMSLIST every 10 minutes. If it is connected to a non-first choice, the agent periodically tries to connect to the first choice. That way the agents will workload balance the remote TEMSes over time.
When contact is made, the agent sends a node status update which identifies itself – type of agent, time, etc. It also sends the CTIRA_HEARTBEAT value. The TEMS records that information in the INODESTS table. It also sets a timer at the heartbeat time plus a grace period of three minutes.
If the agent sends another node status update before the default 13 minutes, the old timer is canceled and a new one is created. If the agent does not send a new node status, the timer expires. At that point the hub TEMS marks the agent as offline in the INODESTS table.
If the agent does a graceful shutdown, it is marked offline immediately.
At that point the MS_Offline situation sees the Offline status and generates a status and, the first time, an event.
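The timer handling in this section can be sketched as follows. This is a minimal Python model using the defaults from the text (10 minute heartbeat, 3 minute grace period); the class HeartbeatTracker and its method names are hypothetical and do not correspond to TEMS internals.

```python
GRACE_SECONDS = 3 * 60  # grace period added to the agent's heartbeat time


class HeartbeatTracker:
    def __init__(self):
        self.deadline = {}    # node -> time at which its timer expires
        self.offline = set()  # nodes currently marked offline

    def node_status(self, node, now, heartbeat_seconds=600):
        """Node status update: cancel the old timer and start a new one."""
        self.deadline[node] = now + heartbeat_seconds + GRACE_SECONDS
        self.offline.discard(node)

    def graceful_shutdown(self, node):
        """A graceful shutdown marks the node offline immediately."""
        self.deadline.pop(node, None)
        self.offline.add(node)

    def expire(self, now):
        """Mark every node offline whose timer has run out."""
        for node, deadline in list(self.deadline.items()):
            if now >= deadline:
                del self.deadline[node]
                self.offline.add(node)


tracker = HeartbeatTracker()
tracker.node_status("silent_agent", now=0)
tracker.node_status("healthy_agent", now=0)
tracker.node_status("healthy_agent", now=600)  # next heartbeat arrives in time
tracker.expire(now=780)                        # 13 minutes after the start
print(sorted(tracker.offline))
```

Only the agent that missed its update is marked offline at the 13 minute point. The healthy agent's timer was replaced by a new one that expires 13 minutes after its second update.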
MS_Offline bottom up – remote TEMS processing
In large environments there will be many remote TEMS – each connecting to 1000 or more agents. Remote TEMSes are subnode agents. In some ways they are like agents, but they have responsibility over other agents. Every remote TEMS must keep track of online and offline status and propagate that to the hub TEMS as appropriate.
A remote TEMS behaves much like a hub TEMS. It gets the node status update and propagates it to the hub TEMS. It tracks the heartbeat time with the same 3 minute grace period. When an agent does not give an update, the timer expires and the remote TEMS reaches over and updates the hub TEMS INODESTS table. There is some logic to achieve this efficiently.
The TEMSes have a similar heartbeat process but use 3 minutes plus a 3 minute grace time. If a TEMS is judged to be offline, then all its agents are switched to offline at the same time. That is the same logic used for all subnode agents. Other examples include Agentless Agent servers and Universal Agent. When the server goes offline, all its agents are set to offline at the same time.
One result is that you can have a sudden flood of Offline events in a case like this, followed by Offline event closes as the agents switch to another remote TEMS and connect. There is no workaround for this processing as currently implemented.
ITM 623 FP5 and ITM 630 FP3 changes in remote TEMS processing and FTO switching
At these levels, when a remote TEMS goes offline the agents are given a grace period to reconnect via another remote TEMS. No change is needed in existing situations. This is the same grace period used during hub TEMS startup and the same environment variable CMS_FA_GEN_CLEANUP can be used to change the grace period from the default 12 minutes. One large customer chose 30 minutes.
The same logic is used for FTO switches when the backup hub TEMS takes on the primary role.
These changes will dramatically reduce the number of spurious MS_Offline type events. This eliminates any need for the Persist>1 configuration mentioned elsewhere.
The record of the change is here:
APAR IV52579 – ALLOW GRACE PERIOD FOR OFFLINE EVENTS WHEN RTEMS GOES OFFLINE.
It is documented in the ITM 623 FP5 readme. It is not documented in the ITM 630 FP3 readme because the APAR was inherited from the ITM 623 level. The FTO improvement is not documented because it was uncovered during rigorous QA testing and not by a customer. No APARs are created for those cases. Several other minor issues were improved such as avoiding unnecessary MS_Offline events when a managing agent [like Agentless Agent components] goes offline and its managed agents are out of touch temporarily.
MS_Offline and Action commands
ITM Agent action commands run in parallel. If you define an action command and a remote TEMS goes offline for whatever reason, you may well experience a TEMS failure as hundreds of action commands attempt to run at the same time. This can and has caused TEMS failures in the past. In one case a customer stopped an Agentless Agent server [for maintenance] with 900 agents. 900 Agents went offline at the same time. 900 action commands to send emails were performed at the same time. The TEMS crashed as it exceeded the maximum process space.
Please consider carefully before using action commands in MS_Offline situations.
Agent timeout factors Details
If an agent is attempting to connect and the system running the TEMS is stopped, the timeout is almost immediate.
If an agent is attempting to connect and the system is running but the TEMS is not started, there will be a system-dependent delay of perhaps 10-20 milliseconds.
If an agent has connected to a TEMS and performs a normal Remote Procedure Call for the first time, there is a 60 second delay before failure is declared.
If an agent has connected to a TEMS and performs a normal Remote Procedure Call for the second or subsequent time, there is a 30 second delay before failure is declared.
The first time connecting, an agent will attempt a connection and then retry through 2 heartbeats before switching. After the first time, there is a single retry before switching.
Sample Timeout Estimates
How long after an outage does it take before an MS_Offline event is generated?
If the agent is set up by default, the CTIRA_HEARTBEAT is 10 minutes. Including the grace period of 3 minutes, the maximum time for the remote TEMS to recognize the lack of node status update is 13 minutes. At that point the hub TEMS INODESTS table will be updated. If MS_Offline runs every 15 minutes, that implies a maximum of 15 more minutes before an event is generated.
Maximum delay = 28 minutes
Average delay = 14 minutes
Minimum delay = 0 minutes.
The minimum delay would be when the agent was just about to send a node status update at +12:55 and then failed. The remote TEMS timer fired a few seconds later and by chance the MS_Offline was evaluated just a few seconds after that. It is simpler to think of it as 28 minutes maximum and 14 minutes average.
These times can be pushed lower at the cost of 1) more overhead and 2) more false alerts. The false alerts arise because the network may start working again or some blocking process may complete. With the default settings, when the agent sends node status again, say at 15 minutes, it will be marked online. When the MS_Offline is evaluated some minutes later, the temporary offline condition will never be seen.
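The estimate above is simple enough to put into a small Python helper for trying other settings. It follows the article's arithmetic exactly, including taking the average as half the maximum; the function name offline_delay_minutes is made up for the example.

```python
def offline_delay_minutes(heartbeat=10, grace=3, sampling_interval=15):
    """Delay between an outage and the MS_Offline event, in minutes,
    using the article's arithmetic: maximum is heartbeat + grace +
    sampling interval, average is half of that, minimum is zero."""
    maximum = heartbeat + grace + sampling_interval
    return {"maximum": maximum, "average": maximum / 2, "minimum": 0}


print(offline_delay_minutes())  # defaults discussed in the text
print(offline_delay_minutes(heartbeat=5, sampling_interval=5))  # faster, costlier
```

Shorter heartbeat and sampling values bring the numbers down, but every reduction means more node status traffic and more frequent full scans of the node status table.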
There is often a desire for very fast identification of issues. MS_Offline is not well suited to that goal. It would be much better to use a ping or tcp_ping to access a specific service. Then you can identify issues down to the sub-minute range.
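As an illustration of the tcp_ping idea, here is a minimal Python check that tries to open a TCP connection to a service and reports how long it took. It is a generic sketch using the standard socket module, not an ITM feature; the function name tcp_ping and the host and port in the usage lines are placeholders to replace with the service you care about.

```python
import socket
import time


def tcp_ping(host, port, timeout=2.0):
    """Return the TCP connect time in seconds, or None if the service
    cannot be reached within the timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return None


if __name__ == "__main__":
    # Placeholder target: substitute the host and port of a real service.
    elapsed = tcp_ping("127.0.0.1", 1918)
    print("unreachable" if elapsed is None else f"connected in {elapsed:.3f}s")
```

Run from a scheduler every 30 seconds or so, a check like this detects a dead service far sooner than any reasonable MS_Offline sampling interval, and it costs the TEMS nothing.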
You can observe the MS_Offline and Persist>1 effect yourself. Just run with Persist=1 and later with Persist>1, and capture the total CPU time used over each period. In one case I used the AIX topas command. The difference was 50-70% CPU with Persist>1 versus 10-15% with Persist=1. This is NOT a reasonable use of CPU time.
Recommendations
Avoid heavy use of MS_Offline type situations. NEVER use an MS_Online type situation. If you do use them, make sure the sampling interval is reasonable, such as 15 minutes. If you need quicker response to a failed system or process, use something like the OS Agent Ping attributes.
This document has presented some details about the MS_Offline and recommendations for its usage. Please update this blog with comments or questions.
2013/04/26: The environment variable to control FA cleanup time was corrected to CMS_FA_GEN_CLEANUP. The error message when the value exceeds the defined limits incorrectly specified the wrong environment variable name.
2014/10/09: Added notes about the APAR which corrected the problem of temporary offline events after a remote TEMS recycle.
Note: Photo is a California Condor taken on my roof a few years ago.
The photo means that MS_Offline is a heavy and potentially dangerous tool. You should know everything about it before making heavy use.