Sitworld: Rational Choices for Situation Sampling Intervals


John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

I was working on a customer remote Tivoli Enterprise Monitoring Server [TEMS] performance problem. The remote TEMS was showing high CPU utilization and was crashing every couple of days.

The TEMS Audit report showed that results data from Agents were flowing in at 8 megabytes a minute. That is roughly 10 times a sustainable rate for typical servers, so the remote TEMS instability was expected. It was an unusual report. Usually one or a half dozen situations dominate the workload. This time there were many situations sending at a fairly high rate.

Overview

I reviewed the situation definitions. The sampling interval on most had been set to 1 minute. As you likely know, the situations provided with the IBM Tivoli Monitoring [ITM] product have a default sampling interval of 15 minutes.

I also noted that many situations were in a true state. When a situation formula is false, a zero-row negative result is sent and then nothing more. When a situation formula is true, the result row(s) are resent every sampling interval.
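To get a feel for why the interval matters so much, here is a rough sketch of the steady-state result traffic from situations that stay true. The workload figures [2,000 open situation instances, one row each, 500 bytes per row] are made up for illustration and are not taken from the customer's TEMS Audit report.

```python
def result_bytes_per_minute(open_situations: int, rows_per_result: int,
                            row_bytes: int, interval_minutes: float) -> float:
    """Approximate bytes per minute arriving at the TEMS from true situations.

    Each open situation instance resends its result rows once per sampling
    interval, so the traffic is inversely proportional to the interval.
    """
    return open_situations * rows_per_result * row_bytes / interval_minutes


# Hypothetical workload, for illustration only.
at_1_min = result_bytes_per_minute(2000, 1, 500, interval_minutes=1)
at_15_min = result_bytes_per_minute(2000, 1, 500, interval_minutes=15)

print(f"1 minute interval:  {at_1_min:,.0f} bytes/minute")
print(f"15 minute interval: {at_15_min:,.0f} bytes/minute")
print(f"ratio: {at_1_min / at_15_min:.1f}x")
```

Whatever the real row sizes are, moving the same open situations from a 15 minute interval to a 1 minute interval multiplies this part of the workload by 15.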

Clearly in this environment the choice made had been to detect issues very fast. They had not gone as far as 30 seconds, but a 1 minute sampling interval is very short indeed.

A common life experience came to mind. A family is driving to see friends several hours away. After some time a bored child in the back seat starts asking every minute:

“Are we there yet?”

The driver gets more and more frustrated until the trip is over. That child is an agent, reporting his state over and over. The driver is the TEMS, receiving the data and really able to do nothing more. The trip ends when the condition [driving] ends in the fullness of time. This is exactly what was happening on a large scale at this customer.

Situations, Agents and TEMS – Deep Dive

There will always be a delay between an issue occurring and situation evaluation that detects it. The situation author chooses that time as the sampling interval. The default time is 15 minutes. The choice is between

  1. low interval time and rapid response and higher CPU cost
  2. higher interval time and slower response and lower CPU cost.

Most customers accept an ITM agent using some CPU percentage. However those same customers would never agree to a 25% CPU usage by the ITM agent. There is a balancing act between the two choices.

There is a second delay factor which must be accounted for. The TEMS the agent reports to also has a process that looks for results using the exact same sampling interval. It typically starts up after all of the agents are started so in an ideal case, the results are ready for processing when the TEMS process starts. However, if any agent was not started when the situation was started, its results will arrive later. For example, assume a sampling interval of 30 minutes, with the times below shown as minutes:seconds.

  1. time 00:00 – situations start on all but agent123:KUX [which is not running yet]
  2. time 00:00 – Agents evaluate situation and send in results.
  3. time 00:30 – situation SQL starts at the TEMS – which processes the results and then events are created or not, as usual
  4. time 05:00 – agent123:KUX starts up
  5. time 05:05 – agent123:KUX sends in results which are stored
  6. time 30:00 – most agents send in results
  7. time 30:30 – TEMS situation processing starts and processes the pending results as usual.
  8. time 35:05 – agent123:KUX sends in results which are stored

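In this case the agent123:KUX result that arrived at 05:05 is not processed until 30:30, some 25 minutes later. Because the agent and the TEMS keep the same offset from each other, that agent's results will always be processed about 25 minutes after they arrive. The effect is easy to see with long sampling intervals.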
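The timeline above can be reproduced with a few lines of arithmetic. This is only a simplified model built from the timeline itself, assuming the TEMS situation processing runs at a fixed offset [00:30] and then once every sampling interval; the function names are mine, not anything in ITM.

```python
def next_tems_run(arrival_s: int, tems_offset_s: int, interval_s: int) -> int:
    """First TEMS processing time at or after a result arrives (seconds)."""
    elapsed = arrival_s - tems_offset_s
    # Integer ceiling division: number of whole intervals to wait past the offset.
    cycles = max(0, -(-elapsed // interval_s))
    return tems_offset_s + cycles * interval_s


def processing_lag(arrival_s: int, tems_offset_s: int, interval_s: int) -> int:
    """Seconds a stored result waits before the TEMS processes it."""
    return next_tems_run(arrival_s, tems_offset_s, interval_s) - arrival_s


INTERVAL = 30 * 60      # 30 minute sampling interval
TEMS_OFFSET = 30        # TEMS processing runs at 00:30, 30:30, ...

on_time = processing_lag(0, TEMS_OFFSET, INTERVAL)            # agents that sent at 00:00
late = processing_lag(5 * 60 + 5, TEMS_OFFSET, INTERVAL)      # agent123:KUX at 05:05

print(f"on-time agents wait {on_time} seconds")
print(f"agent123:KUX waits {late // 60} minutes {late % 60} seconds")
```

The late agent's lag is the same on every cycle because both sides repeat on the same interval, so the offset between them never changes.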

Because of the dual TEMS/Agent factor, for a sampled situation, the maximum delay between a condition and the situation event is twice the sampling interval. Here is an example close to that worst case:

  1. time 00:00 – situations start on all but agent123:KUX [which is not running yet]
  2. time 00:00 – Agents evaluate situation and send in results.
  3. time 00:30 – situation SQL starts at the TEMS – which processes the results and then events are created or not, as usual
  4. time 05:00 – agent123:KUX starts up
  5. time 05:05 – agent123:KUX sends in 0 result meaning no issue
  6. time 06:00 – issue exists at agent agent123:KUX
  7. time 30:00 – most agents send in results
  8. time 30:30 – TEMS situation processing starts and processes the pending results as usual.
  9. time 35:05 – agent123:KUX sends in results on new issue which are stored
  10. time 60:30 – situation SQL runs at the TEMS and the issue at 06:00 generates an event

In this example the issue at agent123:KUX does not produce an event until some 54 minutes after it arose. This double sampling interval maximum delay needs to be taken into account in any planning.
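Here is a small sketch of the same reasoning for an issue rather than a result: the issue waits for the agent's next sample, and that result then waits for the next TEMS processing run. It is a simplified model with hypothetical function names, assuming both sides repeat strictly on the sampling interval from their own starting offsets.

```python
def next_tick(t_s: int, offset_s: int, interval_s: int) -> int:
    """First time at or after t_s on a schedule of offset, offset+interval, ..."""
    elapsed = t_s - offset_s
    cycles = max(0, -(-elapsed // interval_s))   # integer ceiling division
    return offset_s + cycles * interval_s


def event_time(issue_s: int, agent_offset_s: int,
               tems_offset_s: int, interval_s: int) -> int:
    """Time the TEMS would raise the event for an issue arising at issue_s."""
    sampled = next_tick(issue_s, agent_offset_s, interval_s)   # agent notices
    return next_tick(sampled, tems_offset_s, interval_s)       # TEMS processes


def mmss(seconds: int) -> str:
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


INTERVAL = 30 * 60

# The example above: agent samples at 05:05, TEMS runs at 00:30, issue at 06:00.
issue = 6 * 60
event = event_time(issue, agent_offset_s=305, tems_offset_s=30, interval_s=INTERVAL)
print("event at", mmss(event), "delay", mmss(event - issue))

# Near the worst case: the agent samples one second after each TEMS run
# and the issue arises one second after that sample.
worst = event_time(32, agent_offset_s=31, tems_offset_s=30, interval_s=INTERVAL) - 32
print("near worst-case delay", mmss(worst), "of a possible", mmss(2 * INTERVAL))
```

In the second case the delay comes out two seconds short of twice the sampling interval, which is why planning should use the doubled figure.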

Business Rules on Recovery Action

Any issue should have a business rule which states how to resolve the case and a standard recovery time. There are two components to that time: detection time and human action time.

If the human action time will take one hour, the business rule target time must be more. The maximum human time could be more than one hour based on staffing and competing priorities. For this purpose let’s focus on one hour.

Maximum detection time for ITM situations is the sampling interval times two. If you set the interval to 30 minutes, the actual time between issue and detection will be 60 minutes maximum and 30 minutes average.

When the business rule target time is chosen, it certainly cannot be 1 hour, since that leaves no room for detection time.
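The budget is simple enough to write down. The sketch below uses the rule from this section [maximum detection time is twice the sampling interval] and works it in both directions; the function names are just for illustration.

```python
def minimum_target_minutes(interval_minutes: float, human_minutes: float) -> float:
    """Smallest business rule target that covers worst-case detection plus action."""
    return 2 * interval_minutes + human_minutes


def largest_interval_minutes(target_minutes: float, human_minutes: float) -> float:
    """Longest sampling interval that still fits inside the target time."""
    slack = target_minutes - human_minutes
    if slack <= 0:
        raise ValueError("target time leaves no room for detection")
    return slack / 2


# 30 minute interval and one hour of human action time:
print(minimum_target_minutes(30, 60))      # 120 minutes

# A 90 minute target with one hour of human action time:
print(largest_interval_minutes(90, 60))    # 15.0 minutes
```

A target of exactly one hour with one hour of human action time raises the error, which is the point of the paragraph above.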

If you set the sampling interval too low, you can destabilize the remote TEMS as in this case.

The customer should make a rational choice for detection and recovery action that satisfies the business rules.

ITM and Business Rule Choices

The user has many choices:

  1. Decrease the Situation sampling interval
  2. Decrease the human action time by increasing staffing or improved efficiency.
  3. Increase the business rule target time
  4. Change the detection rules to reduce events.

Choice (1) has the severe downside of increasing the remote TEMS overhead. In this particular case the overhead was so high that the remote TEMS became unstable. This could be resolved by using a more powerful server. It could also be resolved by creating eight remote TEMS and splitting the agents between them equally. That sounds pretty expensive.

Choice (2) has the downside of the cost of hiring more staff. It could also be handled by developing an automation process to do the work. That solution probably cannot use ITM services since the remote TEMS would be unstable in some conditions. Both those solutions sound expensive.

Choice (3) has the downside of a somewhat slower reaction time. That might mean a target resolution time set to 1.5 hours. With that change the situation sampling interval could be set to 15 minutes. The remote TEMS would then be stable. The upside is a stable remote TEMS and no additional expenses.

Choice (4) would require some analysis and decision. Here are some examples:

  1. If you have a check to make sure that on some Linux servers either rpcbind or portmap is running, maybe that should be a once a day check since the issue almost never arises after the system starts up.
  2. If there is a check for a mount point 10% free, maybe a 5% test is just as good.
  3. If you have a check against Unix load factor and you check for both 15 minute running average and 5 minute running average, maybe only one check is actually needed.
  4. Choices can be isolated to agent subsets. Very critical agents could get a lower sampling interval while others get a standard 30 minutes.

The challenge here is that a short sampling interval on an open situation can cause remote TEMS overhead. At best you will see excessive CPU use and at worst you will see a TEMS failure. Those should be managed and a rational balance should be chosen.

Short Sampling Interval and Remote TEMS Auto-close of Events

You may see a situation event close and reopen with short Sampling Interval even when the underlying condition remains unchanged.

On the Agent side much of the situation and other activity [real time data, historical data collection, historical data export] happens in a single-threaded, one-at-a-time model. Thus it is not unusual at a low level to see a 3-5 minute delay between when the situation should have run and when it actually runs and sends results to the TEMS. It is not that common but it does happen.

On the TEMS side, if no results are received for 3 times the sampling interval, the TEMS arbitrarily closes the situation event. When the next result is received [a few minutes later] a new situation event is generated. There is a way to affect this timing using the TEMS environment variable IRA_MIN_NO_DATA_WAIT_TIME. However, you should use the very latest maintenance level because some defects were found recently.
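A quick sketch shows why this bites only at short intervals. It models just the rule described above, under my own assumption that the silent period is measured from the last result received, so a delayed evaluation leaves a gap of one interval plus the delay. It does not model IRA_MIN_NO_DATA_WAIT_TIME at all.

```python
def event_would_auto_close(interval_minutes: float, agent_delay_minutes: float) -> bool:
    """True if a delayed agent evaluation leaves the TEMS without results
    for at least 3 sampling intervals (simplified model, see assumptions above).
    """
    gap = interval_minutes + agent_delay_minutes   # time since the last result
    return gap >= 3 * interval_minutes


for interval in (1, 5, 15):
    for delay in (3, 5):
        closes = event_would_auto_close(interval, delay)
        print(f"interval {interval:>2} min, agent delay {delay} min: "
              f"{'closes and reopens' if closes else 'stays open'}")
```

Under this model the 3-5 minute agent delays mentioned earlier are enough to close an event at a 1 minute interval, while the same delays go unnoticed at 5 or 15 minutes.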

ITM Weird Science

While researching this issue, I discovered a public IBM document [see here] which shows exactly how to change the sampling interval down as low as 1 second. It is oriented toward Historical Data UAdvisor situations but it would logically work on any case.

DO NOT try this on your production ITM. I expect Agent side processing would break down and the previous section on Auto-close of events would run wild. But for a bit of fun you can look it over and on a test system do some weird science experiments. I can guarantee you that there is no Development or QA testing of this case! You can let me know here, but don’t even think about opening a PMR!!

Summary

I have presented the factors you should consider when rationally selecting a sampling interval for an ITM Situation. After you make the informed choices, you can achieve a balanced ITM environment with nothing under serious stress and business rules that are achievable. And use TEMS Audit from time to time to keep the environment tuned for efficient operation.

Sitworld: Table of Contents

Note: Hummingbirds after sunset with Pacific Ocean fog creeping into the valley.

 
