Version 2.38000 3 March 2023
John Alvord, IBM Corporation
jalvord@us.ibm.com
Inspiration
There have been cases every year where a TEMS was running with high enough CPU/Storage resource usage that the customer was concerned. In some cases, the TEMS experienced a steady storage growth and failure after some days. This condition is usually caused by workloads, configuration or environmental conditions. Situations are the most common workload but SOAP calls, historical data collection and Portal Client workspaces can be involved. In addition, there may be important error messages in the diagnostic log.
Background
In 2008 I began work on a year-old customer problem that took until Spring 2010 to resolve. The conclusion was that a certain type of situation caused a severe storage fragmentation and a TEMS failure in 10-12 days. The customer decided to recycle the affected remote TEMSes once a week. That was probably the single most expensive PMR [for IBM and customer] I ever worked on.
Some 6 months later in Dec 2010 I had a customer with six AIX servers running at 95% utilization and all from remote TEMS processing. The solution was presented but the customer was unconvinced. I wrote a Perl program to summarize the results in a spreadsheet file. The customer was convinced, made the changes and those six systems dropped to 10% utilization. In March 2011, I published the process and tool as a technical note and it is now widely used.
Recent changes
Since the last published version 1.37000 [28 advisories] 38 new advisories have been added and many report sections to explain the environment further. At level 1.64000 the documentation has been reworked for simplicity and clarity.
- The TEMS Audit report itself includes an explanation of each advisory and that is not duplicated in this document. The temsaud.pl program can be viewed to see each advisory explanation.
- The separate word file documents have been eliminated and information added here, revised and extended.
- Each report section is documented along with needed traces.
These changes will continue. The source in every case is based on diagnosing a specific customer issue and when possible adds an advisory or report section which would speed the diagnostic effort – even allowing the customer to correct the condition whenever possible. In recent months there have been several advisories and/or report sections added each month.
TL;DR – Too Long; Didn’t Read
Sometimes you want immediate results and to heck with the details. Here are the instant use instructions. Unpack the most recent zip file and copy the temsaud.pl and testaud.ini files to a convenient location where Perl is installed [all Unix and Linux environments] and on the system where a TEMS is running. Let’s pick /tmp and the TEMS is installed at the default directory. Now run this command
perl temsaud.pl -v -logpath /opt/IBM/ITM/logs
and if there should happen to be more than one candidate inventory file,
perl temsaud.pl -v -logpath /opt/IBM/ITM/log hostname_ms_kdsmain.inv
and then view the report file temsaud.csv. The end of that file includes an explanation of each advisory. If you want more information about the TEMS workload then add the needed traces documented later.
Linux and Unix systems usually have Perl installed.
On Windows, you can install the community edition at www.activestate.com. You can also just copy the needed files to a system where Perl is installed.
For a z/OS TEMS trace, capture the RKLVLOG sysout log into a file and run temsaud.pl with the -z option. I worked to make that automatic but have not found the secret sauce.
Easy access to TEMS Audit report file – Just ask for it!
If you have a Case or PMR and send in a pdcollect of a TEMS, a TEMS Audit report will usually be created. [There are rare cases where files are missing and the report cannot be created]. You can request that the TEMS Audit report be sent to you by the support engineer working on the Case/PMR.
Installation
Perl is usually preinstalled in Linux/Unix systems. . For Windows you may need to install from www.activestate.com or any other source. Alternatively you can copy the needed log files to a system where Perl is installed. The program only uses Perl core services, No CPAN modules are needed.
TEMS Audit has been tested with
This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread
zLinux with Perl v5.8.7
Files
There are two TEMS Audit program files in a zip file temsaud.2.34000.
1) temsaud.pl Perl program
2) temsaud.ini Option file for setting nominal values
Install them somewhere convenient.
Run Time Options.
There are many but temsaud.pl does not require any of them.
Options:
-b Show HEARTBEATs in Managed System report
-cmdall When cmd trace is present, report all commands
-expslot <mins> Historical export report, display slot size – default 60 mins
-flip Produce detailed report on agent location flipping, ITM 630 FP7
-h display help information
-jitall produce advisory on minor level heartbeat jitter
-jitter produce jitter reports, default no jitter reports
-logpath Directory path to TEMS logs – default current directory
-o Specify the output report filename
-odir Specify directory where output files will be stored
-ossdir What directory to write send status report
-nohdr Remove variable header lines – used for regression testing
-rd Result detail reports
-ri <secs> result workload report by interval, default 60 seconds
-rdslot Seconds in result detail slots, default 60 seconds
-sqldetail produce a SQL detail report by source and table involved
-sr SOAP Report minute by minute – soap_detail.txt
-ss produce a report send status – used for multiple remote TEMSes
-sum produce a one line summary file
-v Produce some progress messages in STDERR
-work Copy files to work directory before analyzing.
Environment variable Windows – TEMP, Linux/Unix tmp
-workpath Directory path to work directory, default is the system
-z z/OS RKLVLOG log
-eph Capture ephemeral file report
-ephdir Where to store ephemeral file report
-dup Create a DEDUP.CSV file for use in DUP2DO package.
The remaining parameter [if present] is a log file specification. Often no specification is needed. It will look at the -logpath setting and determine the current logs from the inventory file. Sometimes there is more than one candidate and you can specify it like
Hostname_ms_kdsmain.inv
You can also specify log files to study as a partial name such as
ibms98htems_ms_52052f96
That setting analyzes multiple log sections with the same file name starting string. That can be extremely useful of the .inv file is missing or was not sent with the diagnostic files to be studied. It is also the way to analyze a diagnostic log set that is not the most current.
In hands-free operation, do not specify a specific file. Instead use -logpath to specify a path to a normal TEMS logs directory or leave it off for the current directory. In hands-free mode the directory is examined to determine the current active log segments. If there are multiple .inv files the one with the most recent diagnostic logs is chosen. The log segments are processed in the sequential time order. When -work option is set, the diagnostic logs are copied to a work directory to avoid the issue where a log is deleted while being used. Specify the. inv file in case there are more than one candidate.
Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain messages. That said there are times when you will need to manually assemble a single combined log to analyze. That can be true if traces are turned on dynamically. Some of the reports use the time to calculate rates and isolating just the traced log makes that more accurate.
The default output report is temsaud.csv. Use the -o option to specify your own name for the report.
One section of the report is advisories. If the maximum advisory impact level is equal or greater than 50, the program ends with exit code 1 and otherwise 0. That can be used in shell files to control other actions – like sending email.
Run examples:
perl temsaud.pl cmsall.log
perl temsaud.pl -v
Reports and Diagnostic tracing needed
The reports are intended to be self-contained and easy to use. With no tracing beyond the default ERROR you get a lot of useful information. To get full value you need to set tracing. See post on how to set tracing. The following shows a report and how to get extra value by setting diagnostic tracing.
Header
This labels the report and TEMS Audit version number, when run, the runtime parameters, the logs used [if -v is specified] and a line which is the TEMS nodeid and *LOCAL for hub TEMS or *REMOTE for remote TEMS. Occasionally the last line is missing because not in the logs. The header can be suppressed using the -nohdr option.
TEMS Audit report v1.63
Start: 2017-05-24 15:44:38
Runtime parameters: -sum -rd -flip -odir /ecurep/tmp/ITM_Health/Awn0gXM2ky/ -logpath /ecurep/pmr/4/8/48002,080,678/2017-05-24/48002.080.678.pdcollect-w94486.tar.Z_unpack/autopdzip/autopd/IBM/ITM/logs/logs w94486_ms_59243128
“working on 0 /ecurep/pmr/4/8/48002,080,678/2017-05-24/48002.080.678.pdcollect-w94486.tar.Z_unpack/autopdzip/autopd/IBM/ITM/logs/logs/w94486_ms_59243128-01.log”
HUB_w94486 *LOCAL
Advisories
The header is followed by advisory messages. These are indications of errors and warnings from the diagnostic log. The advisories are sorted in reverse order by impact. These impact values selected based on years of experience however feel free to use you own judgement. These are Advisory and not The Law. The following is just one recently created. There are 56 advisories at ths\is writing and more are added all the time.
Advisory Message Report – *NOTE* See advisory notes at report end
Impact,Advisory Code,Object,Advisory,
200,TEMSAUDIT1052E,HUB,Hub TEMS has lost connection to HUB 84 times
100,TEMSAUDIT1050W,PCB,Agent connection churning on [1] systems total[5] – See following report
90,TEMSAUDIT1055W,Agent,Agent Location Flipping Changes Detected [11] – See following report
90,TEMSAUDIT1061E,TEMS,Situation [ddb_sanprb_x07c_aix_gen] with unknown attribute [K07_SANPATH.MonitoringSolution] – [*IF ( ( *VALUE K07_SANPATH.RC *EQ ’20’ ) *OR ( *VALUE K07_SANPATH.RC *EQ ’21’ ) *OR ( *VALUE K07_SANPATH.RC *EQ ’22’ ) *OR ( *VALUE K07_SANPATH.RC *EQ ’23’ ) ) *AND *VALUE K07_SANPATH.MonitoringSolution *EQ ‘aix’]
80,TEMSAUDIT1016W,Onlines,3 Agents with repeated onlines
80,TEMSAUDIT1035W,diagnostic,7 worrying diagnostic messages – see later report section
80,TEMSAUDIT1054W,SOAP,SOAP Errors Types [2] Detected – See following report
50,TEMSAUDIT1021W,timeout,4 Agent time out messages
0,TEMSAUDIT1005W,Stack,ulimit stack [4294967296] is above nominal [10240] (kbytes)
The pattern of each advisory is an impact level, an advisory identification, a category and a text message. Some reference later report sections. The first one above
200,TEMSAUDIT1052E,HUB,Hub TEMS has lost connection to HUB 84 times
Impact: 200 The highest current level.
ID: TEMSAUDIT1052E a sequential number
Category: HUB the hub TEMS is involved
Text: Hub TEMS has lost connection to HUB 84 times
At the end of the temsaud.csv report you find the following explanations of each advisory. Here is the one for the issue above.
Advisory code: TEMSAUDIT1052E
Impact:200
Text: Hub TEMS has lost connection to HUB count times
Tracing: error
(58D77E39.0002-11A0:ko4mgtsk.cpp,133,”ManagedTask::sendStatus”) Connection to HUB lost – stopping situation status insert
Meaning: This is produced when one hub TEMS task can no longer connect with the dataserver function of the hub TEMS. This is a severe error which is only recovered with a hub TEMS recycle.
The hub TEMS will not recover without a recycle.
This is a most severe issue and needs prompt diagnosis. It can be caused by many reasons including TEMS database problems, network problems and excessive workload.
Recovery plan: Contact IBM support for aid in diagnosis.
<end of advisory explanation>
Reports
The reports are documented in the temsaud.csv report at the end. Here is an example:
TEMSAUDIT1067E
Text: Unable to get attributes for table <table> [<count> times]
Tracing: error
(591ACD2E.0000-7C:kshcat.cpp,296,”RetrieveTableByTableName”) Unable to get attributes for table tree TOBJACCL
Meaning: The SOAP request failed. This hints of an out of storage
issue, a broken attribute collection or mssing application
support files.
Recovery plan: Work with IBM Support to resolve issue.
Tracing Guide in a Separate Document
The Tracing guide is documented in a separate post because of a document size limit:
sitworld: TEMS Audit Tracing Guide Tracing Guide Appendix
The tracing appendix is in a separate post because of a document size limit. It shows exactly how to set diagnostic traces for TEMS Audit both statically and dynamically.
Summary
This document shows how to collect performance data from a TEMS. As a result you can make operations more efficient and complete.
Versions:
If you wish to track recent updates access the github project https://github.com/jalvo2014/temsaud.
Here are recently published versions. The github repository has change logs if you are interested.
temsaud.2.38.000
Handle kdsvlunx warning
temsaud.2.37.000
Handle .inv case with “cms” in the filename, should be ignored;
Alert on some failures to resolve target addresses
temsaud.2.34000
Add advisory for database file errno seen
add advisories for apparent TEMS crashes
Add advisory for TCP SSL handshake stalls
Add report on duplicate Sub-node agents
Extend advisory on index errors
Add window duplicate in minutes
temsaud.2.33000
Modify advisory about same Primary/Mirror TEMS nodeids;
Add advisory on non-suid kdsvlunx;
Add result lengths for zTPF agents;
Change crit warn on large files to more than 4 gigs;
Add crit if trace rate is more than 10 times nominal
advisory about bad rawname messages
Add advisory when FTO hub TEMS have same nodeid
Add duplicate agent time average to judge impact
Add advisory for HOSTS hostname duplicated
Add advisory for large files and report section
make open/write do overwrites in Windows
temsaud.2.30000
track and report Stage 2 Global Access Table progress
Warn if SDA different between the two FTO hub TEMSes
temsaud.2.29000
Add Stage II logs to report044
Add advisories on hub/remote resync
improve newpcb/online logic heuristics
temsaud.2.28000
Ignore foreign address of :::*
Add advisory 1140E when TEMS too busy
Add advisory 1141W when TEMS/Agent coordination fails
collect report085 duplicate possibles
in report085 look back only 3 lines
2.27000
Extend REPORT076 information with Node counts when possible
2.26000
newPCB versus agent online report
2.25000 temsaud.2.25000
Add better duplicate agent report082
Add report about rpc errors
Summary
Identify and correct workload and configuration problems. I encourage anyone to share success stories, enhancement requests or problems found.
16 thoughts on “Sitworld: TEMS Audit Process and Tool”