John Alvord, IBM Corporation
jalvord@us.ibm.com
The following information has been updated and included in the new TEMS Audit distribution which is documented in a blog post here.
Overview
TEMS Audit is a program that summarizes a TEMS diagnostic log when traces have been set to record workload. The goal is to identify workload elements that cause high processing load and instability. Once those elements are identified, the workload or configuration can be changed to produce a stable TEMS. A teammate recently pointed out that, when first using the tool, he needed help knowing what was important. To make the tool easier to use, an advisory section has been added.
The installation, run parameters and usage information are described in this blog post:
Sitworld: TEMS Audit Process and Tool
After the program runs, the result is a comma-separated values (CSV) file that can be viewed in a spreadsheet. The file contains a series of reports, and the technote contains a detailed listing of the contents. This guide shows how to use the advisory messages.
The diagnostic trace data available will control what reports and alerts are shown.
The following section has been superseded by the above blog post. The advisory message explanations are now part of the report itself.
Advisory Messages – Overview
The title lines are followed by advisory statements, which are in turn followed by the detailed reports. Here is an example:
Advisory: 12 Filter object too big situations
Advisory: Results bytes per minute 1056% higher then nominal [500000]
Advisory: IBMUX_FileMonitoring2_Warn high rate 5554041 [96.02%]
Each advisory refers to a specific area of concern that should be investigated. Many of the advisory messages are based on nominal [normal/expected] values and calculated rates. Here are the nominal values:
# rate nominal notes
results 500000 result bytes/min
trace 1000000 trace bytes/min
soap 30 SOAP requests per minute
workload 50 Objects totaling workload if results>nominal
remotesql 1200 remote SQL failures during these first seconds
The values are set in temsaud.pl and can be overridden using a temsaud.ini file in the above form. Here is an extended discussion of what each one means.
- results
During the diagnostic trace capture, each set of results is recorded in a diagnostic log. This number is the average number of result bytes per minute. 500,000 bytes per minute or lower has been seen in many environments that run stably for long periods. The value could be made larger or smaller based on system performance and competing workloads. The largest number I ever saw was 92 megabytes/minute, and that remote TEMS was extremely unstable.
- trace
The impact of tracing itself is rarely a factor. A typical environment with no diagnostic tracing shows 2,000 bytes per minute. In one case a customer was accidentally running with KBB_RAS1=ALL on a remote TEMS and the trace volume was 192 megabytes per minute. The remote TEMS was unstable. The nominal value chosen was about 0.5% of that value.
- soap
These are usually SQL requests from tacmd or from user-written scripts or programs. SQL in general is a resource-intensive process. Severe problems have been seen when the SOAP rate is 60 per minute or higher. The raw number doesn't reflect the true impact, since the amount of data returned also matters, but this value was selected to give some degree of warning.
- workload
If the result rate is higher than nominal, the workload elements that make up this percent of the total are displayed. These are often situations but can be real time data requests from TEP workspace views or SQL. Very often just a few situations make up the bulk of the total.
- remotesql
Most work between the TEMS tasks is performed using remote SQL. By default this has a 600 second timeout. During remote TEMS initialization several large tables are copied from the hub TEMS. If the 600 second timeout occurs, the remote TEMS will be unstable. The remotesql nominal value is the number of seconds after startup during which a failure is highlighted as needing more investigation, since it may be one of these startup failures.
You can change the nominal values by updating the temsaud.ini file and placing it in the current directory where temsaud.pl runs.
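As an illustration of how overrides in the "rate nominal notes" form could be layered on top of the built-in values, here is a minimal Python sketch. It is not the parsing logic of temsaud.pl, which is a Perl program and may accept other directives. The function name load_nominals and the override values 750000 and 60 are assumptions made for the example.

```python
# Defaults copied from the nominal table in this post.
DEFAULT_NOMINALS = {
    "results": 500000,
    "trace": 1000000,
    "soap": 30,
    "workload": 50,
    "remotesql": 1200,
}


def load_nominals(ini_text, defaults=DEFAULT_NOMINALS):
    """Return the defaults overridden by 'name value [notes]' lines in ini_text."""
    nominals = dict(defaults)
    for raw_line in ini_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments such as the '# rate nominal notes' header.
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 2)  # name, value, optional free-text notes
        if len(fields) < 2 or fields[0] not in nominals:
            continue  # this sketch ignores anything it does not recognize
        try:
            nominals[fields[0]] = int(fields[1])
        except ValueError:
            continue  # ignore non-numeric values
    return nominals


# Hypothetical override file: raise the results and soap nominals only.
example_ini = """\
# rate nominal notes
results 750000 result bytes/min
soap 60 SOAP requests per minute
"""

nominals = load_nominals(example_ini)
print(nominals["results"], nominals["soap"], nominals["trace"])
```

Values that are not mentioned in the override text keep their defaults, so a site only needs to list the nominals it wants to change.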
Advisory Message – No Expert Advisory messages
There were no conditions that suggested a problem to research and resolve.
Advisory Message – <num> Filter object too big situations and/or reports
Situations or SQL for real time data will usually have a WHERE clause to define what data is required. The WHERE clause is compiled into binary objects [plan and pool]. If either is more than 32767 bytes long, those objects are not transmitted to the agent. In that case the agent runs in unfiltered mode and can send a river of data back to the TEMS, causing instability.
The recovery action is to rework the situations or SQL. Divide situations into multiple situations with fewer tests. For SQL, use fewer tests and filter the data after retrieval.
Advisory Message – Results bytes per minute <percent> higher then nominal <max>
A TEMS workload [remote in large environments] is largely driven by the incoming results workload. This message warns if the observed incoming load is higher than the nominal value.
The recovery action has three phases. First, examine the workload and make changes to increase efficiency. Second, work at reducing any competing processes or run on a more powerful system. Third, create another remote TEMS and split the agents between them.
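To make the percentage in this message concrete, here is a small Python sketch of the arithmetic. It assumes that "percent higher" means the excess over nominal divided by nominal, truncated to a whole number. The observed value of 5,784,254 bytes per minute is not taken from a real report: it was back-computed from the example advisories near the top of this post (5554041 being 96.02% of the total), and under this assumption it reproduces the 1056% figure shown there. The function name is hypothetical.

```python
def percent_over_nominal(observed, nominal):
    """Whole-number percent by which observed exceeds nominal (0 if it does not)."""
    if observed <= nominal:
        return 0
    # Integer arithmetic truncates any fractional percent.
    return (observed - nominal) * 100 // nominal


nominal_results = 500000     # results nominal from the table in this post
observed_results = 5784254   # assumed value, back-computed from the example advisories

pct = percent_over_nominal(observed_results, nominal_results)
print(pct)
```

Read this way, a result rate more than eleven times the nominal value is what stands behind a 1056% advisory.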
Advisory Message – Trace bytes per minute <num> higher then nominal <num>
Tracing at an intensive level can cause TEMS instability. The first number is the actual trace bytes per minute observed and the second is the nominal.
The recovery action is to review tracing. If it is not needed, set tracing to the default “error” and recycle the TEMS.
Advisory Message – <num> early remote SQL failures
Early remote SQL failures can result in TEMS instability. The usual cause is a high latency link between the hub TEMS and the remote TEMS.
The recovery action is to set a TEMS environment variable KDS_SYNDRQ_TIMEOUT to a higher value such as 1800 (seconds). See this technote for more details:
A second approach is to create a new hub TEMS at the distant location.
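The idea of an "early" failure can be sketched in Python as follows: a remote SQL failure counts as early if it occurs within the remotesql nominal number of seconds after TEMS startup. This is only an illustration of the rule described in this post, not the logic in temsaud.pl. The function name and all timestamps are hypothetical.

```python
from datetime import datetime


def early_failures(start_time, failure_times, nominal_seconds=1200):
    """Return the failures that occurred within nominal_seconds of startup."""
    return [
        t for t in failure_times
        if 0 <= (t - start_time).total_seconds() <= nominal_seconds
    ]


# Hypothetical startup time and remote SQL failure times.
start = datetime(2024, 1, 1, 8, 0, 0)
failures = [
    datetime(2024, 1, 1, 8, 10, 5),   # 605 seconds after startup
    datetime(2024, 1, 1, 8, 19, 59),  # 1199 seconds after startup
    datetime(2024, 1, 1, 9, 30, 0),   # 5400 seconds after startup
]

early = early_failures(start, failures)
print(len(early))
```

With the 1200 second nominal, the first two failures are flagged and the third is treated as ordinary runtime noise.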
Advisory Message – <situation> high rate <rate> [<percent>]
If the result rate is high, this advisory message is produced which tells you the situation involved, the situation result rate and the percent of total results.
The recovery action is to stop the situation and rework it to be more efficient.
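The percent in brackets, together with the workload nominal, can be illustrated with a short Python sketch: sort the workload elements by result volume and list them until the listed elements account for the workload percent of the total. The situation name and its 5554041 value come from the example advisory in this post and are treated here as result bytes per minute. The other two entries and the function name are hypothetical, chosen so that the first situation's share matches the 96.02% shown in the example. The actual selection logic in temsaud.pl may differ.

```python
def top_workload(result_bytes, cutoff_percent=50):
    """List the largest elements until they cover cutoff_percent of the total."""
    total = sum(result_bytes.values())
    if total == 0:
        return []
    rows = []
    cumulative = 0.0
    ordered = sorted(result_bytes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if cumulative >= cutoff_percent:
            break  # the listed elements already cover the requested share
        share = 100.0 * size / total
        cumulative += share
        rows.append((name, size, round(share, 2)))
    return rows


workload = {
    "IBMUX_FileMonitoring2_Warn": 5554041,  # from the example advisory
    "hypothetical_situation_a": 150000,     # assumed for illustration
    "hypothetical_situation_b": 80213,      # assumed for illustration
}

rows = top_workload(workload)
for name, size, share in rows:
    print(f"{name} {size} [{share:.2f}%]")
```

Here a single situation covers far more than half of the total, which is the "just a few situations compose the bulk" pattern described earlier.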
Advisory Message – <situation> possible truncated results – max result <max>
If the maximum returned result for a situation is near 16 megabytes, that suggests result data is being truncated and events may be lost. A situation formula running at the agent but evaluated at the TEMS [for example if using a TIME test] might accumulate 30 megabytes of data to return to the TEMS. However, the agent side logic silently truncates the data to a maximum number of rows so that the total is less than 16 megabytes. That can lead to events that are not created.
The recovery action is to rework the situation to avoid the condition.
Advisory Message – SOAP requests per minute <num> higher then nominal <num>
SOAP requests cause high TEMS resource usage. If the rate is higher than nominal this message is produced as a warning.
The recovery action is to rework the SOAP processes to reduce the frequency of SOAP calls.
Summary
This post explains how to use a new TEMS Audit facility to improve TEMS stability and efficiency.
Photo Notes: A Gorilla Reflecting Sadly on the Daily News
My wife’s parents gifted her with this many years ago.