ITM TEMS Stress Tester Experiment, Version 0.10000
By John Alvord
IBM Corporation, 21 March 2013
Overview
There are times when the TEMS comes under heavy stress. The outcomes can be severe: a crash, out-of-storage failures, or loss of communication with other TEMSes and agents. Often there are few outward signs before the failure. These failures are very costly in lost time and support effort.
We might avoid some failures if we could measure stress using Analytics and thereby predict potential failures. Often logs contain data to explain the condition. Other times simple traces can be used to identify workload stressors. [see Note 1]
Theory
The present experiment is based on the observation that periodic processing in a TEMS becomes more erratic as stress grows. The periodic process is driven by a timer-based situation, IBM_TEMS_stress, with a sampling interval of some minutes. It is distributed to a TEMS. The situation has an action command which runs a Perl script. The Perl script determines the current time to the microsecond and computes two statistics over a sliding window:
Mean – Average time between processing intervals
Standard Deviation – The statistical variation between processing intervals
The average time stays close to the target because, after each cycle completes, the next evaluation time is set to achieve the original goal. Thus if the sampling interval is 60 seconds and an evaluation completes at 63 seconds, the next target is 57 seconds away. The average holds steady but the variation grows wider.
The “sliding window” means that only a recent set of measurements is used, so that recent conditions are given more importance. Once the window limit is reached, the oldest observation is dropped each time a new one is added.
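The sliding-window calculation can be sketched in a few lines of Perl. This is a minimal illustration, not the logic from itm_stress.pl: the helper names window_add and window_stats are hypothetical, the statistics are recomputed from the retained intervals on every call, and the sample standard deviation (divisor n-1) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: append an interval and drop the oldest ones
# once the window limit is exceeded.
sub window_add {
    my ($window, $limit, $interval) = @_;
    push @$window, $interval;
    shift @$window while @$window > $limit;
    return scalar @$window;
}

# Hypothetical helper: mean and sample standard deviation (n-1 divisor)
# of the intervals currently held in the window.
sub window_stats {
    my ($window) = @_;
    my $n = scalar @$window;
    return (undef, undef) if $n == 0;
    my $sum = 0;
    $sum += $_ for @$window;
    my $mean = $sum / $n;
    return ($mean, 0) if $n < 2;
    my $sumsq = 0;
    $sumsq += ($_ - $mean) ** 2 for @$window;
    return ($mean, sqrt($sumsq / ($n - 1)));
}

# Intervals in seconds for a 60-second target: a late cycle (63) is
# followed by a shortened one (57), so the mean holds while the spread grows.
my @window;
window_add(\@window, 4, $_) for (60, 60, 63, 57, 60);
my ($mean, $sd) = window_stats(\@window);
printf "samples=%d mean=%.3f sd=%.3f\n", scalar(@window), $mean, $sd;
```

With a window limit of 4, the first 60 is dropped when the fifth interval arrives, leaving 60, 63, 57 and 60. The mean is still exactly 60 seconds, while the standard deviation is no longer zero.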
Package Parts
An example of the situation is included in the file IBM_TEMS_stress.xml, which can be loaded using this command:
./tacmd createSit -i IBM_TEMS_stress.xml
The situation example uses a 2 minute sampling interval. You may adjust this time as needed, but it must stay coordinated with the action command script. On a fairly powerful AIX server the script was measured at 0.30 CPU seconds per invocation.
The action command is set to run on every true cycle. In addition, the action command must run on the TEMS, since this is a probe-based situation [Note 2].
The example processing program, itm_stress.pl, is also included. If you discover issues and make corrections, please return them to the author for general usage. The logic for the sliding window calculations is documented in the comments, with references.
The files are also available in this GitHub repository: https://github.com/jalvo2014/stresstest
Tailoring for the environment
Toward the beginning of itm_stress.pl there are several constants you may change. Here is what the section looks like:
my $local_dir = "/tmp/";                          # location of stress files
$local_dir = "c:" . $local_dir if $gWin == 1;     # Stress file in Windows
my $local_file = "stress.txt";                    # current data file
my $local_log = "stress.log";                     # Progress log
my $local_window = 60;                            # number of entries in sliding window
$local_dir is the directory where the two output files are kept.
$local_file is the current calculation file and keeps all data needed for the logic.
$local_log is the closest thing to an output file right now.
$local_window is the number of entries in the sliding window.
The test for Windows needs work.
The $local_window value and the situation sampling interval define how long a time will be measured. The default here is 60 cycles of 2 minutes each or 120 minutes.
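One way the platform test and the measured time span could be expressed is sketched below. $^O is Perl's built-in operating system name and is "MSWin32" on native Windows builds. How itm_stress.pl actually sets $gWin is not shown here, so the assignment below is an assumption, and $sampling_minutes is a hypothetical variable that is not part of the script.

```perl
use strict;
use warnings;

# Assumption: derive the Windows flag from $^O, which holds the operating
# system name Perl was built for ("MSWin32" on native Windows builds).
my $gWin = ($^O eq 'MSWin32') ? 1 : 0;

my $local_dir = "/tmp/";                       # location of stress files
$local_dir = "c:" . $local_dir if $gWin == 1;  # same path on the C: drive
my $local_window = 60;                         # entries in the sliding window

# Hypothetical: sampling interval in minutes; must match the situation.
my $sampling_minutes = 2;
my $span_minutes = $local_window * $sampling_minutes;

print "stress files in $local_dir, window spans $span_minutes minutes\n";
```

If the sampling interval in the situation is changed, only $sampling_minutes needs to follow it in this sketch. The window then covers $local_window times that many minutes.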
Sample Stress Log Data
Here is an example stress.log entry:
Added entry 1363759724.27927 60 120.000050981842 0.00122896460560101 120.000627994537 0.00456397946637675 1130320010844000
Here is what the data items mean:
Item | Value | Meaning
#1 | 1363759724.27927 | Epoch seconds and microseconds
#2 | 60 | Number of samples, up to the window limit
#3 | 120.000050981842 | Average time in the window
#4 | 0.00122896460560101 | Sum of squared differences
#5 | 120.000627994537 | Most recent difference
#6 | 0.00456397946637675 | Standard deviation estimate
#7 | 1130320010844000 | ITM timestamp
In this case the TEMS was not under any significant stress, so the standard deviation was very low. Epoch time is the standard POSIX measure: the number of seconds since 1 January 1970 UTC.
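A stress.log entry like the sample can be split into its seven items with a few lines of Perl. This is a sketch only: the helper parse_stress_entry and the hash keys are hypothetical names, and the line layout is assumed to be exactly the "Added entry" form shown in the sample. As a cross-check, the sketch recomputes the standard deviation as the square root of item 4 divided by (samples - 1). The sample values agree with that relationship, though the comments in itm_stress.pl remain the authoritative description of the calculation.

```perl
use strict;
use warnings;

# Hypothetical helper: split one "Added entry" line into named fields.
# The key names are illustrative; the order follows items #1 to #7.
sub parse_stress_entry {
    my ($line) = @_;
    my @f = split ' ', $line;
    return unless @f == 9 && $f[0] eq 'Added' && $f[1] eq 'entry';
    my %entry;
    @entry{qw(epoch samples mean sum_sq last_diff std_dev itm_stamp)} = @f[2 .. 8];
    return \%entry;
}

my $line = 'Added entry 1363759724.27927 60 120.000050981842 '
         . '0.00122896460560101 120.000627994537 0.00456397946637675 '
         . '1130320010844000';
my $e = parse_stress_entry($line);

# Cross-check: in the sample, item 6 matches sqrt(item 4 / (samples - 1)).
my $check = sqrt($e->{sum_sq} / ($e->{samples} - 1));
printf "mean=%.6f std_dev=%.8f recomputed=%.8f\n",
    $e->{mean}, $e->{std_dev}, $check;
```

Lines that do not match the expected layout make the helper return nothing, so a caller can skip them.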
Summary
The hypothesis is that the standard deviation will be small for a TEMS under little stress and will increase when the stress is increased. That has been true in some unit testing environments.
The present goal is to run this process in some customer environments and see what conclusions can be found. If the conclusions bear out the theory, then the standard deviation can be used as a trigger for an internal alarm using one of several mechanisms such as a monitored log message, a marker file or a universal message. This alarm could show long before a failure condition.
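The marker-file variant of such an alarm might look like the sketch below. The threshold of 5 seconds, the marker file name and the helper name check_stress are all hypothetical. No threshold has been established yet, and a realistic value would have to come from the customer measurements described above.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: create a marker file when the standard deviation
# exceeds the threshold and remove it once the value drops back.
# Returns 1 if the alarm is raised, 0 otherwise.
sub check_stress {
    my ($std_dev, $threshold, $marker) = @_;
    if ($std_dev > $threshold) {
        open(my $fh, '>', $marker) or die "cannot write $marker: $!";
        print {$fh} "TEMS stress: std_dev=$std_dev threshold=$threshold\n";
        close($fh) or die "cannot close $marker: $!";
        return 1;
    }
    unlink $marker if -e $marker;
    return 0;
}

# Assumed values, for illustration only.
my $marker    = File::Spec->catfile(File::Spec->tmpdir(), 'tems_stress.alarm');
my $threshold = 5;    # seconds

my $quiet = check_stress(0.0046, $threshold, $marker);
my $busy  = check_stress(12.5,   $threshold, $marker);
print "quiet=$quiet busy=$busy marker=",
    (-e $marker ? "present" : "absent"), "\n";
unlink $marker;       # clean up after the demonstration
```

A monitoring situation could then watch for the marker file. Removing the file when the value recovers keeps the alarm from staying raised indefinitely.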
Original publish date: 2013/3/20
Future Extensions
1) This situation could run on an agent with the action performed at a reporting TEMS. In this way communications stress could be measured as well.
2) Using a Workflow policy, this could run on a remote TEMS and the command run on the hub TEMS. In this way the combined hub/remote TEMS stress could be measured.
3) If the theory proves valid and useful, the standard deviation could be used to generate alerts in several different ways.
Note 1: See TEMS audit process technical note:
Sitworld: TEMS Audit Process and Tool