Sitworld: ITM TEMS Stress Tester Experiment

ITM TEMS Stress Tester Experiment, Version 0.10000

By John Alvord

IBM Corporation, 21 March 2013

Overview

There are times when the TEMS comes under heavy stress. There are sometimes severe outcomes such as a crash, out of storage failures and loss of communication with other TEMSes and agents. Often there are few outward signs before the failure. These sorts of failures are very costly in lost time and support efforts.

 We might avoid some failures if we could measure stress using Analytics and thereby predict potential failures. Often logs contain data to explain the condition. Other times simple traces can be used to identify workload stressors. [see Note 1]

Theory

The present experiment is based on the observation that periodic processing in the TEMS becomes more erratic as stress grows. A periodic process is driven by a timer-based situation, IBM_TEMS_stress, with a sampling interval of some minutes. The situation is distributed to a TEMS and has an action command which runs a Perl script. The Perl script determines the current time to microsecond resolution and computes two statistics over a sliding window:

Mean: the average time between processing cycles

Standard Deviation: the statistical variation in the time between processing cycles

The average time stays close to the target because, after each cycle completes, the next target evaluation time is set to achieve the original goal. Thus if the sampling interval is 60 seconds and an evaluation completes at 63 seconds, the next target is 57 seconds away. The average stays on target, but under stress the variation grows wider.
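The arithmetic of that 60-second example can be sketched in a few lines of Perl. This is only an illustration: the subroutine name next_delay is hypothetical, and nothing here is taken from the TEMS scheduler or from itm_stress.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: given the nominal sampling interval and how long the
# last cycle actually took (both in seconds), return the delay until the next
# target so that the long-run average stays on the nominal interval.
sub next_delay {
    my ($interval, $actual) = @_;
    my $delay = $interval - ($actual - $interval);   # give back the overshoot
    return $delay < 0 ? 0 : $delay;                  # never schedule in the past
}

printf "%d\n", next_delay(60, 63);   # cycle ran 3 seconds late; prints 57
```

The two intervals, 63 and 57 seconds, still average 60, which is why the mean alone says little about stress while the spread between intervals does.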

The “sliding window” means that only a recent set of measurements is used, so that recent conditions are given more importance. Once the window limit is reached, the oldest observation is dropped as each new one is added.
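A minimal sketch of a sliding window mean and standard deviation follows. It recomputes both statistics from the stored intervals on every call, which may differ from the method used in itm_stress.pl (see the comments in that script for its actual logic). The name add_interval, the window of 5 and the sample values are illustrative assumptions, as is the choice of the sample standard deviation with an n - 1 divisor.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my $window = 5;     # ASSUMPTION: small window for illustration
my @intervals;      # most recent intervals in seconds, oldest first

# Add one observed interval and return (count, mean, standard deviation).
sub add_interval {
    my ($seconds) = @_;
    push @intervals, $seconds;
    shift @intervals while @intervals > $window;    # drop the oldest entries
    my $n     = @intervals;
    my $mean  = sum(@intervals) / $n;
    my $sumsq = sum(map { ($_ - $mean) ** 2 } @intervals);
    my $sd    = $n > 1 ? sqrt($sumsq / ($n - 1)) : 0;
    return ($n, $mean, $sd);
}

# Made-up intervals: steady at first, then one late cycle and its correction.
for my $t (120.0, 120.1, 119.9, 120.0, 123.0, 117.0, 120.0) {
    my ($n, $mean, $sd) = add_interval($t);
    printf "n=%d mean=%.3f sd=%.3f\n", $n, $mean, $sd;
}
```

In this made-up data the mean barely moves when the late cycle arrives, while the standard deviation rises sharply, which is the behaviour the experiment relies on.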

  

Package Parts

There are two program objects included here: a situation and a Perl program which you can access using the following link.

An example of the situation is included: a file IBM_TEMS_stress.xml which can be loaded using this command:

 ./tacmd createSit -i IBM_TEMS_stress.xml

The situation example uses a 2 minute sampling interval. You may adjust this time as needed, but it must stay coordinated with the action command script. On a fairly powerful AIX server, each invocation was measured at 0.30 CPU seconds.

The action command is set to run on every true cycle. In addition, the action command must run on the TEMS since this is a probe based situation [Note 2].

The processing example program, itm_stress.pl, is also included. If you do discover issues and make corrections, please return them to the author for general usage. The logic for the sliding window calculations is documented in the comments, with references.

You can also access the files via this GitHub repository: https://github.com/jalvo2014/stresstest

Tailoring for the environment

 Toward the beginning of itm_stress.pl there are several constants you may change. Here is what the section looks like:

my $local_dir = "/tmp/";                      # location of stress files
$local_dir = "c:" . $local_dir if $gWin == 1; # Stress file in Windows
my $local_file = "stress.txt";                # current data file
my $local_log  = "stress.log";                # Progress log
my $local_window = 60;                        # number of entries in sliding window

$local_dir is the directory where the two output files are kept.

$local_file is the current calculation file and keeps all data needed for the logic.

$local_log is the closest thing to an output file right now.

$local_window is the number of entries in the sliding window.

The test for Windows needs work.

The $local_window value and the situation sampling interval together define the span of time being measured. The default here is 60 cycles of 2 minutes each, or 120 minutes.

Sample Stress Log Data

Here is an example stress.log entry:

 Added entry 1363759724.27927 60 120.000050981842 0.00122896460560101 120.000627994537 0.00456397946637675 1130320010844000

Here is what the data items mean:

#1  1363759724.27927      Epoch seconds and microseconds
#2  60                    Number of samples, up to the window limit
#3  120.000050981842      Average time in the window
#4  0.00122896460560101   Sum of differences squared
#5  120.000627994537      Most recent difference
#6  0.00456397946637675   Standard deviation estimate
#7  1130320010844000      ITM timestamp

             

In this case the TEMS was not under any significant stress, so the standard deviation was very low. Epoch time is the standard POSIX measurement: the number of seconds since 1 Jan 1970 UTC.
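For anyone post-processing stress.log, the sample entry can be split into its seven data items as sketched below. The field labels and the subroutine name parse_entry are illustrative assumptions, not identifiers from itm_stress.pl. The last lines recompute the standard deviation from items #2 and #4 assuming an n - 1 divisor; for this sample the result agrees with item #6.

```perl
use strict;
use warnings;

my $line = 'Added entry 1363759724.27927 60 120.000050981842 '
         . '0.00122896460560101 120.000627994537 0.00456397946637675 '
         . '1130320010844000';

# ASSUMPTION: illustrative labels for data items #1 through #7.
my @names = qw(epoch samples mean sum_sq last_diff std_dev itm_stamp);

# Return a hash reference of named fields, or nothing if the line does not
# look like an "Added entry" record with seven data items.
sub parse_entry {
    my ($text) = @_;
    my @f = split ' ', $text;
    return unless @f == 9 && $f[0] eq 'Added' && $f[1] eq 'entry';
    my %rec;
    @rec{@names} = @f[2 .. 8];
    return \%rec;
}

my $rec = parse_entry($line);
printf "%-10s %s\n", $_, $rec->{$_} for @names;

# Cross-check: sqrt(sum_sq / (n - 1)) should reproduce the logged std_dev.
my $check = sqrt($rec->{sum_sq} / ($rec->{samples} - 1));
printf "recomputed std dev: %.8f\n", $check;
```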

  Summary

The hypothesis is that the standard deviation will be small for a TEMS under little stress and will increase when the stress is increased. That has been true in some unit testing environments.

 The present goal is to run this process in some customer environments and see what conclusions can be found. If the conclusions bear out the theory, then the standard deviation can be used as a trigger for an internal alarm using one of several mechanisms such as a monitored log message, a marker file or a universal message. This alarm could show long before a failure condition.
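As one possible shape for such a trigger, here is a minimal sketch of the marker file mechanism. The threshold value, the marker file name and the subroutine name check_stress are all hypothetical; a realistic threshold would have to come from the customer observations described above.

```perl
use strict;
use warnings;
use File::Spec;

my $threshold   = 5.0;    # ASSUMPTION: seconds; not a measured or recommended value
my $marker_file = File::Spec->catfile(File::Spec->tmpdir(),
                                      'tems_stress.alarm');   # hypothetical name

# Create the marker file and return 1 when the standard deviation exceeds the
# threshold; otherwise remove any stale marker and return 0.
sub check_stress {
    my ($std_dev) = @_;
    if ($std_dev > $threshold) {
        open(my $fh, '>', $marker_file)
            or die "cannot write $marker_file: $!";
        print {$fh} "std_dev=$std_dev threshold=$threshold\n";
        close($fh);
        return 1;
    }
    unlink $marker_file if -e $marker_file;
    return 0;
}

# The standard deviation from the sample log entry is far below the threshold.
print check_stress(0.00456397946637675) ? "ALARM\n" : "ok\n";
```

A separate situation or log monitor could then watch for the marker file; that part is outside this sketch.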

Original publish date: 2013/3/20


Future Extensions

1) This situation could run on an agent with the action performed at a reporting TEMS. In this way communications stress could be measured as well.

2) Using a Workflow policy, this could run on a remote TEMS and the command run on the hub TEMS. In this way the combined hub/remote TEMS stress could be measured.

 3) If the theory proves valid and useful, the standard deviation could be used to generate alerts in several different ways.

 

Note 1:  See TEMS audit process technical note:

Sitworld: TEMS Audit Process and Tool

 
Note 2:  Situations that run at a remote agent are known as Intelligent Remote Agents. Situations run as called functions from the TEMS dataserver are known as probe based situations. This is a timer based situation running on the TEMS and is probe based. In such cases, the action command must be defined to run on the TEMS.

 
