Sitworld: TEPS Audit

cruise_ship_101

Version 0.53000 15 December 2015

John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on twitter

Inspiration

A recent case involved very slow Portal Client response time and general TEMS instability including failures. After some study the root cause turned out to be duplicate agent names.

ITM architecture depends  unique agent names – no duplications. However, there is no logic that enforces that rule and indeed duplicate agent names occur in normal processing. For example, if an agent with two remote TEMS configured loses contact with one remote TEMS because of some communications outage, it can connect via a secondary remote TEMS. During that reconnection logic, the hub TEMS will note the duplicate agent name and accept it. Later on the first remote TEMS will notice the agent going offline [node status update missing], update the hub TEMS the agent is offline and the hub TEMS does the right thing and ignores it. So that duplicate agent name case is a transient normal condition.

However, this case involved some 15 MQ agents all of which had been configured with the same name. This causes the duplicate name logic to occur steadily, 15 times every 10 minutes by default. This project reports duplicate names using a very simple TEPS trace and a report program.

Background

The TEMS must do work to manage the duplicate agent condition: stopping situations on the earlier agent instance and starting on the new agent instance. The TEPS maintains a topology of the ITM Agents and that topology map must be recalculated with every new arrival. Navigator Updates Pending alerts in Portal Client are also seen. In this case the effect was so strong that Portal Client was experienced very poor performance. In addition and worst, only one agent at a time was actually running situations. As the agents swapped in and out of connection situations kept starting and that produced a steady stream of situation events all apparently from the “same” agent and duplicating earlier events.

I worked on this once before  Diagnosing the “Navigator Updates Pending” message and that technote has been very effective. However it requires manual work and analysis to correct.

I also added a potential duplicate agent report to the most recent 1.16000 level of Sitworld: ITM Database Health Checker. That report identifies some cases [agents reporting to different remote TEMSs and Agents reporting with different maintenance levels] but not others [agents with same maintenance level reporting to same remote TEMS]. Therefore something more was needed,

This project works with a TEPS trace and produces a full report on duplicate agent names. That works because the TEPS receives a stream of Situation Status Updates as each agent goes online and perhaps later offline.

At 0.53000 a summary of incoming events is provided. This can be very useful in diagnosing a TEPS that is overloaded with “too many events”. More importantly, when there are event floods that usually means a violation of the prime ITM design: situation events should reflect rare and unusual events which can be worked on to prevent in the future. A violation causes excess work at the agent, the TEMS, the TEPS and the people who attempt to manage the event workload.

Preparing for the install

Perl is usually pre-installed in Linux/Unix systems. . For Windows you may need to install from www.activestate.com or any other source. The program only uses Perl core services and no CPAN modules are needed.

TEPS Audit has been tested with

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

A zip file is found found TEPS_Audit.0.57000. There is one file tepsaud.pl. Install it somewhere convenient.

Run Time Options

Options:

-h                           display help information

-v                           Produce some progress messages in STDERR

-log                        Produce a log file, default yes

-logpath                  Directory path to TEMS logs – default current directory

-nohdr                    Remove variable header lines – used for regression testing

-istone                    Show instance=1 cases

The remaining parameter [if present] is a log file specification. The log files to study can be specified as a partial name such as HUB_ibms98htems_ms_52052f96. That setting analyzes multiple log sections with the same file name starting string.

In hands-free operation do not specify a specific file. Instead use –logpath to specify a path to a normal TEPS logs directory or leave it off for the current directory. In this mode the directory is examined to determine the current active log segments. The log segments are processed in the sequential time order.

Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for certain error messages.

The report is written to tepsaud.csv in the current directory.

Run command examples:

perl tepsaud.pl cmsall.log 

perl tepsaud.pl -v

Configuring TEPS trace for Duplicate Agent Report and Situation Event report

The diagnostic trace required is this: ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

See Appendix 1 on how to enable that trace. Generally a one hour trace is sufficient for this purpose.

In addition, if you have any of the following set in cq.ini

KFW_CMW_DETECT_AGENT_ADDR_CHANGE=N

KFW_CMW_DETECT_AGENT_HOSTNAME_CHANGE=N

KFW_CMW_DETECT_AGENT_PROPERTY_CHANGE=N

remove or comment them out and restart the TEPS before beginning the capture. They suppress the raw data we need to create the report.

TEPS Audit Report

The report has a lot of details, so a screen image is tough to understand.

There are two sections. The first is the advisory messages [eventually there will be more messages].

Column Description Example
Impact Importance 100
Advisory code documentation TEPSAUD1002E
Object relates to – agent in this case ibm_test:LZ
Message Node has connected via 62 ip addresses or versions or thrunodes: Likely duplicate name configuration error

The TEPS Node Status Update Report follows. Each agent has a first line

Column Description Example
Agent Name Managed System Name ibm_test:LZ
Node Status Count 291
Different Instances 62

Instances represent how the agent presented including four pieces of data

Agent Name

Thrunode [often remote TEMS]

Affinities [type of agent]

Hostaddr [including ip address

Column Description Example
(blank) to line up with header line
Node Status Count for this instance 7
Online Count for this instance 7
Offline Count for this instance 0
Product Two character Agent Code LZ
Version of Agent 06.23.03
Version addendum Other version information A=00:lx8266;C=06.23.03.00:lx8266;G=06.23.03.00:lx8266;
Thrunode How the agent connected REMOTE_ibmtest101
HostAddr ip address and hostname ip.spipe:#1.2.3.4[7757]<NM>ibm_test</NM>
HostInfo type of host Linux~
Affinities Type of agent and TEMA %IBM.STATIC134          000000000I000Jyw0a7
Timestamps When update occurred 1150911113710000:1150911114710000: …

IP and HOSTNAME and Property change report

There are cases where TEPS produces a shorter statement about conditions where an agent changes ip address, host name, or affinities change. If these are spotted a report section is generated for each. For example

The IP Change report starts with a header line

Column Description Example
Node Managed System Name ibm390:LZ

The first line is followed by instances to represent the observed change

Column Description Example
(blank) to line up with first line
IP Address ip address of agent 127.0.0.1
Count number of times seen 2

In these cases, the from ip address and the to ip address are counted separately. Both are involved and need to be investigated. If the from ip address is blank, the report is ignored since that is the common condition when an agent first connects.

There are similar reports for hostname changes and property changes. These often overlap and solving one case will solve all of them.

Situation Event Report

The TEPS Situation Event Report follows. Each situation name has a first line

Column Description Example
Situation Name Managed System Name ibm_test_situation
Situation event count Count of status 291

The first line is followed by instances to represent how and when the situation event status was observed

Column Description Example
(blank) to line up with header line
Node Agent where result calculated Primary:IBM-XXX1397-S:NT
Type Sampled=0, Pure=1 0
Atomize option distinguishing element arc_IBM_22
Count number of status 9
Open Count situation event opens 5
Close Count situation event closes 4
FullName display Long situation name ibm_test_situation
Sitmon Node which TEMS made event REMOTE_-IBM0067-S
Time_Status Time and event status 1151221143708000_Y:1151221143938003_N

The Time_Status is the ITM time stamp when processed by the TEMS and the Situation Event status. Y=Open, N=Close, S=Start, P=Stop and some rarer cases

Recovery Action Plan – Duplicate Agents

The first case was extreme. There were 115 agent names which showed with more than one connection instance. The next worst two were 62 and 54 connection instances.

The recovery action was very straightforward although laborious. Proper running of ITM requires that each agent have a different agent name. Each case needs to be investigated and corrected. For example in the case above – ibm_test:LZ – each of the 62 systems must be logged on and have the configuration manually changed and the agent recycled. In most cases, CTIRA_HOSTNAME is not needed. The hostname of the system is naturally different and without CTIRA_HOSTNAME specified the agent name will reflect the name of the agent. If you do to need to pick a CTIRA_HOSTNAME – it must be selected so all agents have unique names. In this case it is probably specified in the <installdir>/config directory in the lz.ini file or possibly in the lz.environment file.  Edit that file, delete that CTIRA_HOSTNAME line or specify it so the names are unique.

Do that for every single case of duplicate names… using the report.

When you think you have made progress, redo the TEPS trace  and see if any items remain. As the duplicate agent conditions are resolved ITM will begin running better and better.

The final step is to determine why this duplication exists. Often a “golden system image” is created and then cloned. Part of that process *MUST* be to configure the new agent so it has a unique name. Investigate that process thoroughly to avoid future problems.

This case was extreme, but I have seen a few such cases in almost every large ITM environment.

Recovery Action Plan – Excess Situation Events

In another case, the external symptom was Portal Client giving bad response times and sometimes failing. The Situation Status report showed that 70% of the situation event workload was coming from a single situation. It was alerting on cases where Windows services were in a non-running state. The summary showed that the events were opening and closing quite often. The situation was running on a lot of Windows systems and the combined effect was to bury the TEPS. At the same time, the situation was generating a lot of situation opens and quickly event closes. The recovery plan was to rethink that situation so it only alerted on rare and exceptional cases. The plan was to make the situation less complex by splitting into multiple situations, increase the persist count, and add more tests to ignore common conditions that did not need an alert.

Like many real life cases, once you have the information about what is driving the workload you can figure out how to make things better. In this case there were 550+ active situations and just a single one was causing the problem.

Summary

Identify and correct agent duplicate name configuration issues. If you find any anomalies which are hard to correct, please contact the author.

Versions:

Here are recently published versions, In case there is a problem at one level you can always back up.

TEPS_Audit.0.57000

Avoid a divide by zero condition

Sitworld: Table of Contents

Appendix 1 – Turning Tracing On and Off

A trace filter looks like this

ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

The first token is a general level of tracing. The unit: specifies a local source unit. An optional Entry= specifies a function name.

From ITM 623 FP2 and ITM 630 GA there is a tacmd function to turn on tracing. You must do a tacmd login first.

Linux/Unix

Turn on trace:

Linux/Unix: ./tacmd settrace -m <tepsname>  -p KBB_RAS1 -o ‘ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

Windows: tacmd settrace -m <tepsname>  -p KBB_RAS1 -o “ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

[one long line]

Turn off trace

Linux/Unix: ./tacmd settrace -m <tepsname>  -p KBB_RAS1 –r

Windows: tacmd settrace -m <tepsname>  -p KBB_RAS1 -r

A second way is to use the service console. If you are unfamiliar see this technote:

Dynamically modify trace settings for an IBM Tivoli Monitoring component 

Turn On:

ras1 set ERROR (UNIT:ctcmw IN ER ) (UNIT:kv4 IN ER)

Turn off:

ras1 set ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)

The general rule is that an “any” specification will remove the effect of any prior set(s). The er [error] should be added back.

Sometimes you cannot use the diagnostic console. Here is what you do for the different platforms.

Linux/Unix: 

Add this to the TEPS ini – which has the name cq.ini

KBB_RAS1= ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)

Windows:

In MTEMS GUI

– right click on TEMS line

– Select Advanced

– Select Edit Trace Parms

– In the “Enter RAS1 Filters:” input area add 

ERROR (UNIT:ctcmw ANY ER ) (UNIT:kv4 ANY ER)

To remove the traces, restore the line to “error” in all cases and recycle the TEPS.

Note: Electrical Heart of a New Cruise Ship

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: