Sitworld: ITM Agent Health Survey


John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #26 – 16 November – Level 1.19000

Follow on twitter

Introduction

The ITM Agent Health Survey tool detects possible cases of agents that show as online but are non-responsive. In many cases an agent recycle will restore normal operation, and often up-level ITM maintenance will resolve the issue permanently. Once you know the scope of the issue you can take remediation steps and improve the operating environment.

The agents tested are those connected to a hub or remote TEMS, or those named in a specific list of agents. The resulting report has a summary followed by several sub-reports – see below.

The Agent Health Survey tool has low hub TEMS impact and is reasonably fast. In large scale internal tests the survey rate has been 5 agents/second or higher. In customer environments the rate was lower but still fast.

This was accomplished using some new techniques that allow many agents to be surveyed with a single SOAP request.

At level 1.05000 an option “um” was added to write a universal message to the hub TEMS, along with a sample situation to create events.

The zip file containing the program objects is health.1.20000. The History section toward the end has links to prior levels if needed.

Level 1.13000 supports ActiveState Perl 5.20 and also uses LWP::UserAgent instead of SOAP::Lite for improved performance.

Level 1.18000 adds validity checking on cmd and um definitions.

Level 1.20000 corrects a serious defect where not all user defined agents [Agent Builder and Universal Agent] were checked, or where they were checked inefficiently.

 

ITM Agent Health Survey Report

Here is an example of a test on an ITM test environment with 9,582 agents.

The first section is possible unhealthy agents – ones that did not respond to the mass request or to the single-agent retry request in stage 1 or stage 2.

If there were any unhealthy agents, summary reports are produced by TEMS, by TEMS version, by TEMA version and by Product.

If retries were needed to get a response, the next section lists the agents and the retry time that was needed. By default there is a try at 15 seconds and later a try at 50 seconds.

If a retry returned an unexpected first operation log entry, the last section lists those cases. This is likely not a serious problem but it is interesting.

[Sample ITM Agent Health Survey report]

ITM Agent Health Survey Installation

The agent health survey package includes one Perl program that uses CPAN modules. The program has been tested in several environments; Windows had the most intense testing, and it was also tested on AIX. Many Perl 5 levels and CPAN package levels will be usable. Here are the details of the testing environments.

From level 1.13000 the ActiveState Perl level used is 5.20. Earlier levels were tested on 5.16. If you make use of the blog CPAN library below, use the 5.20 version of that package.

  1. ActiveState Perl in the Windows environment which can be found here: http://www.activestate.com/activeperl/downloads

perl -v

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread (with 1 registered patch, see perl -V for more detail)

  2. Perl on AIX 5.3

# perl -v

This is perl, v5.8.2 built for aix-thread-multi

(with 3 registered patches, see perl -V for more detail)

CPAN is a collection of free to use packages. In your Perl environment, there may be some installed CPAN modules and agent health survey may need more. Here are the modules used.

Getopt::Long             in CPAN Getopt-Long 2.42

LWP::UserAgent           in CPAN libwww-perl 6.02

HTTP::Request::Common    in CPAN HTTP-Message 6.06

XML::TreePP              in CPAN XML-TreePP 0.43

You might discover the need for other CPAN modules as the programs are run for the first time. The programs will likely work at other CPAN module levels but this is what was most recently tested.

The Windows Activestate Perl environment uses the Perl Package Manager to acquire the needed CPAN modules. The Agent Survey technote has an appendix showing usage of that manager program with screen captures.

Please note: In some environments installing new CPAN packages is a major problem. Internet access may not be available or Perl may be a shared resource which you do not have the right to change. Changing such packages could negatively affect other programs.

To manage this case, please see the CPAN Library for Perl Projects, which provides a package that avoids changing the installed Perl libraries.

Package contents

The package supplies the itm_health_survey.pl program and a model health.ini file.

To install the Agent Health Survey package, unzip or untar the file contents into a convenient directory. On Linux/Unix you will need to use chmod/chown/chgrp to make the program file executable. The soap control is required [see later for discussion]. The userid and password may be supplied in the health.ini file. In this case the health.ini file looks like this:

soap <server_name>

user <user>

passwd <password>

The user and password credentials may instead be supplied from standard input. This increases security by ensuring that no userid or password is kept in any permanent disk file. In this case the health.ini file would look like this:

soap <server_name>

std

The std option can also be supplied on the command line as -std. In either case, a program must supply the userid and password on standard input in this form:

-user <userid> -passwd <password>

The program invocation would be something like this

mycreds | perl …
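
Here is a minimal sketch of a hypothetical mycreds helper written in Perl. The userid and password values are placeholders; in practice the script would retrieve them from whatever secure source you use.

#!/usr/bin/perl
# mycreds - write SOAP credentials to stdout for the -std option
use strict;
use warnings;

my $user   = 'sysadmin';   # placeholder userid
my $passwd = 'secret';     # placeholder password
print "-user $user -passwd $passwd\n";

The invocation would then look like this:

mycreds | perl itm_health_survey.pl -std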

ITM Agent Health Survey Configuration and Usage

The Agent Health Survey package has controls to match installation requirements but the defaults work in most cases. Some controls are in the command line options and some are in the health.ini file. Following is a full list of the controls.

The following table shows all options. All command line options except -h, -ini, and the three debug controls can be entered in the ini file. The command line takes precedence if both are present. In the following table, a blank means the option is not recognized in that context. All controls are lower case only.

command            ini file           default         notes
-log               log                ./health.log    Name of log file
-ini                                  ./health.ini    Name of ini file
-debuglevel                           90              Control message volume
-debug                                off             Turn on some debug points
-dpr                                  off             Dump internal data arrays
-h                                    <null>          Help messages
-v                 verbose            off             Messages on console also
-vt                traffic            off             Create traffic.txt [large]
-pc                pc                 <null>          Limit survey by agent types
-tems              tems               <null>          Limit survey by TEMSes
-agent             agent              <null>          Agents to survey
-agent_list        agent_list         <null>          Text file with agents to survey
-ignore_list       ignore_list        <null>          Text file with agents to ignore
-um                um                 off             Write universal message
n/a                cmd                <null>          Run command for unhealthy agent
n/a                title              <null>          Configure first line of report
-all               all                off             Produce sub-report of all agents
-agent_timeout     agent_timeout      50              TEMS to agent wait
-noretry           noretry            off             Skip the single agent retry logic
-retry_timeout     retry_timeout      15              Wait time during retry stage 1
-retry_timeout2    retry_timeout2     50              Wait time during retry stage 2
n/a                soap_timeout       180             Wait for SOAP response
-o                 o                  ./health.csv    Output report file
-workpath          workpath           <null>          Directory to store output files
n/a                soap               <required>      SOAP access information
n/a                soapurl            <null>          Recognized – use soap
-std               std                off             Userid/password in stdin
-user              user               <required>      Userid to access SOAP
-passwd            passwd             null            Password to access SOAP

Many of the command line entries and ini controls are self-explanatory. The following options can be set multiple times: -pc, -tems, and -soap. All time-based settings are in seconds.
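
For example, a single run could survey just the Windows and Linux OS Agents reporting to two remote TEMSes. The TEMS names below are placeholders; nt and lz are the standard product codes for the Windows and Linux OS Agents.

perl itm_health_survey.pl -tems REMOTE_TEMS1 -tems REMOTE_TEMS2 -pc nt -pc lz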

soap specifies how to access the SOAP process with the name or ip address of the server running the hub TEMS. See next section for a discussion.

soapurl specifies how to access the SOAP process including the protocol and port number and target.

soap_timeout controls how long the SOAP process will wait for a response. One of the agent failure modes is to not respond to real time data requests. The default is 180 seconds; it might need to be made longer in some complex environments. A value of 90 seconds resulted in a small number of failures [2 agents] in a test environment with 6000 agents.

-agent specifies specific agents to survey and can be set multiple times. -agent_list gives a filename which contains agents to survey. If both are present in command and/or ini file the effect is cumulative.  If -agent or -agent_list is used, you usually do NOT want to use -tems or -pc since those will eliminate some of the specified agents.

From version 1.08000, if the -agent_list file has an entry which begins with a circumflex ^ [shift 6], the entry is treated as a regular expression. The ^ character is the beginning of line anchor. If you specify ^abc then managed systems which begin with “abc” will be considered of interest. If you wanted Linux OS Agents whose names begin with abc you would use ^abc.*:LZ. That allows you to create a report on agents of interest based just on the name. An example agent_list file is shown below.
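
For illustration, a hypothetical agent_list file could mix exact managed system names and regular expressions; the names below are placeholders.

nmp180:LZ
Primary:ABCHOST01:NT
^abc.*:LZ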

Controls to include [like -pc and -tems] and exclude [like -ignore_list] will operate independently. It is best to minimize the number of controls and test thoroughly so you can avoid surprising results.

Command lines supplied are printed in the report; however, the -user and -passwd values are replaced by UUUUUUUU and PPPPPPPP.

ITM Agent Health Survey Package soap control

The soap control specifies how to access the SOAP process. For a simple ITM installation using default communication controls, specify the name or ip address of the server running the hub TEMS. If you know the primary hub TEMS a single soap control is least expensive.

If the ITM installation is configured with hot standby or FTO there are two hub TEMS. At any one time one TEMS will have the primary role and the other TEMS will have the backup role. If the TEMS maintenance level is ITM 622 or later, set two soap controls which specify the name or ip address of each hub TEMS server. The TEMS with the primary role will be determined dynamically.

Before ITM 622 you should determine ahead of time which TEMS is running as the primary and set the single soap control appropriately.

Connection processing follows the tacmd login logic. It will first try the https protocol on port 3661 and then the http protocol on port 1920. If the SOAP server is not present on that ITM process, a virtual index.xml file is retrieved, and the port that SOAP is actually using is extracted and used if it exists.
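
As an illustration only, the probe order can be sketched in a few lines of Perl using LWP::UserAgent. The host name is a placeholder and the ///cms/soap endpoint path is an assumption based on the usual ITM SOAP URL; the actual tool also handles the index.xml fallback, credentials, and retries, which are omitted here.

use strict;
use warnings;
use LWP::UserAgent;   # https probing also requires LWP::Protocol::https

my $host = 'hubtems.example.com';   # placeholder hub TEMS host name
my $ua = LWP::UserAgent->new(timeout => 30, ssl_opts => {verify_hostname => 0});

my $soap_base;
foreach my $base ("https://$host:3661", "http://$host:1920") {
   # a simple GET is used here only to see whether something answers on the port
   my $resp = $ua->get($base . "///cms/soap");   # assumed SOAP endpoint path
   if ($resp->is_success) { $soap_base = $base; last; }
}
die "no SOAP server answered on $host\n" unless defined $soap_base;
print "SOAP server reached at $soap_base///cms/soap\n";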

Various failure cases can occur.

  1. The target name or IP address may be incorrect.
  2. Communication outages can block access to the servers.
  3. The TEMS task may not be running and there is no SOAP process.
  4. The TEMS may be a remote TEMS which does not run the SOAP process.
  5. The SOAP process may use an alternate port and firewall rules block access.

The recovery actions for the various errors are fairly clear. If case (5) applies, consider running the survey package on a server which is not affected by firewall rules. Alternatively, always make sure that the hub TEMS is the first ITM process started; if it must be recycled, stop all other ITM processes first and restart them after the TEMS recycle. See this blog post which shows how to configure a stable SOAP port at the hub TEMS.

If the protocol is specified in the soap control only that protocol will be tried.

soap https://<servername>

When the port number is specified in the soap control, 3661 will force https protocol and 1920 will force http protocol.

soap <servername>:1920

The ITM environment can be configured to use alternate internal web server access ports using the HTTP and HTTPS protocol modifiers. For this case you can specify the ports to be used

soap https://<servername>:4661

or if both have been altered

soap https://<servername>:4661

soap http://<servername>:2920

The logic generally follows tacmd login processing. There are two differences: ipv6 is not supported, and ITM 6.1 style port following is not included. SOAP::Lite does not support ipv6 at present. ITM 6.1 logic could be added, but that configuration is relatively rare and was not available for testing.

ITM Agent Health Survey Events

From level 1.05000 the survey tool can be configured to write a universal message and thereby create events. Add -um to the command line or a um control to the health.ini file. The default severity is 5 and the category is KO4ALM90. You can choose your own severity and category code like this in the ini file. KO4ALM was chosen because in ITM 623 FP4 and ITM 630 FP2 a product-provided situation TEMS_Alert was delivered.

um 9 XXXAGT99

The default message text will look like this:

Possible unhealthy agent [nmp180:LZ] thrunode [REM_NMP183]

From version 1.06, you can specify the message text with substitutions. This is specified only in the ini file. Here is an example ini file entry

um 5 KO4ALM90 Possible unhealthy agent [${msg_agent}] thrunode [${msg_thrunode}]

These are the available substitutions

${msg_agent}

${msg_thrunode}

${msg_hostaddr}

${msg_hostinfo}

${msg_product}

${msg_version}

The data comes from the node status table. Add any text desired to guide the recovery process, such as a suggestion to recycle the agent.

There is an example exported situation TEMS_Alert_AHS.xml included in the zip. It can be loaded as follows:

./tacmd login…

./tacmd createsit -i TEMS_Alert_AHS.xml
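
Optionally, confirm the definition is in place with the standard tacmd viewsit command:

./tacmd viewsit -s TEMS_Alert_AHS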

After creating the situation associate it with a TEP navigation node such as *ENTERPRISE. Also use the EIF tab to specify an event destination. That way the events will flow into the normal ticketing and resolution process.

These are pure events and so will be generated each time the survey tool is run.

Even though the message/severity/category is written into the universal message console, the text will not be seen in the disk operations log. The easiest way to view the messages is a TEP workspace view with a Universal Message Console table display added; the TEP display only shows the universal messages at the hub TEMS. You can learn more in the technote Viewing the Universal Message Console (UMC) in ITM 6.x, which includes SQL to view the UMC at the hub TEMS, a remote TEMS, or any other ITM process, and also shows how to increase the size of the in-storage wrap-around table.

ITM Agent Health Survey Commands

From level 1.07000 the survey tool can be configured to run a command for each possible unhealthy agent. This is specified only in the ini file. Here is an example ini file entry

cmd echo Possible unhealthy agent [${msg_agent}] thrunode [${msg_thrunode}] >/tmp/test.log

The command should be appropriate for the platform where it will run. Each command runs one at a time.
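
For illustration, an equivalent hypothetical entry on Windows might append to a log file; the path is a placeholder.

cmd echo Possible unhealthy agent [${msg_agent}] thrunode [${msg_thrunode}] >>c:\temp\health_cmd.log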

These are the available substitutions

${msg_agent}

${msg_thrunode}

${msg_hostaddr}

${msg_hostinfo}

${msg_product}

${msg_version}

The data comes from the node status table.

ITM Agent Health Survey Report Controls

From version 1.08000 on, you can specify -all [or the ini all control] to get an initial report of all agents of interest and whether each one was responsive or not. At the same level you can set the ini title control; the words following title appear as the first line of the report. An example is shown below.
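
For illustration, a hypothetical health.ini title entry together with -all on the command line might look like this; the title text is a placeholder.

title Weekly Agent Health Check - Production Hub

perl itm_health_survey.pl -all -v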

ITM Agent Health Survey Install Validation Test

Start with a short run. The goals here are

  1. Ensure Perl is installed with the needed CPAN packages
  2. Validate SOAP communication controls
  3. Access and review of the hub TEMS tables
  4. Access and review of agent operations logs.
  5. Clear observed problems

Here is an example command

perl itm_health_survey.pl -v -tems <tems_name> -pc ux

The -v option writes all the log messages to the screen. The -tems option specifies a TEMS that agents report to. The -pc option specifies which agent types to survey. Later on you can specify multiple -tems and -pc options.

Here is a second example command where the externally supplied CPAN modules have been installed in the directory inc. In addition all the output files are written into the /tmp directory.

perl -Iinc itm_health_survey.pl -v -tems <tems_name> -pc ux -workpath /tmp

ITM Agent Health Survey Exit Code

From level 0.99000 onward, the survey program returns exit code 1 if there were potentially unhealthy agents. That can be used like this:

perl  itm_health_survey.pl -v -tems <tems_name> -pc ux -workpath /tmp || <command to email or to take recovery actions>

The || means to run the second command only if the first command had a non-zero exit code.
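
For example, a hypothetical follow-up that mails the report when unhealthy agents are found might look like this, assuming the report is written to /tmp/health.csv when -workpath /tmp is used; mailx and the address are placeholders for your own notification mechanism.

perl itm_health_survey.pl -v -tems <tems_name> -pc ux -workpath /tmp || mailx -s "Possible unhealthy ITM agents" itm-admin@example.com < /tmp/health.csv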

ITM Agent Health Survey Intensive Debug trace

When the itm_health_survey.pl program does not produce correct results or stops unexpectedly, you should gather additional documentation. The -debuglevel 300 option will generate an extensive log trace. The health.log will be much larger than normal, so the survey should be limited in scope.
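
For example, a debug run limited to a single agent could look like this; the agent name is a placeholder.

perl itm_health_survey.pl -v -debuglevel 300 -agent <agent_name>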

The -vt or traffic option dumps the http data to a traffic.txt file. This can be extremely large and should be used only on a limited basis. In one case a 10,000 agent survey generated a 2 gigabyte file.

ITM Survey Limitations

http6 and http6s protocols are not yet supported.

Summary

The Agent Health Survey tool was derived from Agent Survey. Agent Health Survey is designed for a single purpose and tuned for maximum speed. It also ends after a single cycle rather than cycling constantly.

Sitworld: Table of Contents

Feedback Wanted!!

Please report back experience and suggestions. If Agent Health Survey does not work well in your environment, repeat the test with “-debuglevel 300” added and send the health.log [compressed] for analysis.

History and Earlier versions

health.1.20000

Corrected handling of system generated MSLs for user defined agents [Agent Builder and Universal Agent].

Appendix 1:  Recovery Process

Minimal Time Investment Strategy

Recycle the identified agent(s) and then rerun the Agent Health Survey.  Experience shows that many agents will recover normal function. The ones that remain can be studied closely.

Overall Investigation Strategy

Study the health report. After the list of possible unhealthy agents there are four summary reports using these categories:

  1. TEMS
  2. Product type
  3. TEMS maintenance level
  4. TEMA agent support library level

There may be patterns where the issue can be addressed simply. For example, if the problem is concentrated on one or a few remote TEMSes, those may be overloaded or have some other issue. If the problems are concentrated on earlier TEMA levels, an upgrade may be in order.

One of the reports lists agents that were responsive but needed extra time. Those can point to overloaded agent systems or overloaded TEMSes. It could also mean agents where the operations log has been overlaid for some reason.

Agent Investigation Strategy

  1. Look carefully at the system the agent is running on. Is it running normally? Here are some recent examples of badly running systems:
    • Intel Linux system with 6600+ processes, many of them migration/1 kernel threads. This was clearly a system in deep trouble.
    • Windows system where the “performance counters” needed to be rebuilt. The Windows OS Agent was running at very high CPU and doing no effective work.
    • Linux system where the /proc system directory was unable to be read.
    • Mount point or disk full
  2. Create a PMR and start working with IBM Support
  3. Run pdcollect on the agent.
  4. Collect 2 cores a few minutes apart when the agent is deemed unresponsive. See the following section for details on how to collect a core by type of system.
  5. Dynamically or manually, set the following trace on the agent in case the agent goes into this state again. See the tracing notes in Appendix 2.
  6. Collect the trace logs with pdcollect and cores after the problem recurs.

Collecting a core or dump file

In every case you should involve a system administrator to validate and make needed changes in the procedure.

Windows

The process is fully documented in the technote State capture for ITM process in Windows.

Linux, Solaris and HP/UX

Set ulimit -c to unlimited. This usually means updating the /etc/security/limits.conf file and then logging out and back in as the userid the agent runs under.
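
For illustration, hypothetical limits.conf entries for an agent runtime userid named itmuser would look like this:

itmuser soft core unlimited
itmuser hard core unlimited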

After that is configured, the command is gcore -o <core_name> <pid>.

z/OS

Use the DUMP command.

DUMP COMM=(<title for the dump>)

AIX

1. Confirm that fullcore is enabled on the system by running:

lsattr -El sys0 -a fullcore,pre430core -H

Ensure fullcore has “true” and pre430core has “false”.

If fullcore isn’t enabled, then use this command from a root user to cause a full core file to be generated on the system:

chdev -l sys0 -a fullcore=true -a pre430core=false

2. Reproduce the problem.  Confirm it generates a new core file by locating it (usually in $CANDLEHOME/[architecture]/ms/bin) and confirming it has a new timestamp. If it occurs at random, then wait for the next occurrence.

3. If the issue is a loop or deadlock, you can take a dump by determining the process id of kdsmain

ps -ef | grep kdsmain

and then using this command like this:

gencore <pid> core.<pid>

4. Confirm it is a full core file.  You can use the “od -x” command to make sure you have a complete core:

od -x [coreFile] | more

0000000  0077 0000 0fee ddb1 0000 0000 0000 0a70

0000020  0000 0000 0000 1a60 0000 0000 0000 119b

0000040  0000 0053 0000 0000 0000 0000 0000 2bfb

In the second group of numbers, “0077”, the core file flags are “77”. The following is how each of the flag bits is interpreted:

  x'01' – full core

  x'02' – core was generated by AIX V4 or higher

  x'04' – core contains mstsave structures

  x'08' – core contains big data

  x'10' – core contains the u_block structure

  x'20' – core contains the user stack

  x'40' – core contains at least one module

  x'80' – core was truncated

5. After confirming it is a full core file, get snapcore output. This is needed to correctly debug the core on another system, since IBM support will not have the same libraries as the customer. Do the following:

snapcore -r <core file> <Full_Path_To_binary>

where <Full_Path_To_binary> is the fully qualified path to the binary. For example, for a kdsmain core at /usr/IBM/ITM3/aix523/ms/bin/kdsmain, the command would be:

snapcore -r core /usr/IBM/ITM3/aix523/ms/bin/kdsmain

The -r argument cleans the /tmp/snapcore directory if one exists, and then creates the new snapcore data in the file /tmp/snapcore/snapcore_<PID>.pax.Z.

Appendix 2: Tracing notes

You may be able to turn tracing on dynamically using these commands:

tacmd login

tacmd settrace -m <agentname> -p KBB_RAS1 -o 'error (unit:kra ST ERR OUTPUT)(UNIT:kraafira all)'

More likely, on an unhealthy agent you will need to set the tracing and then recycle the agent. The idea is that an agent that went into an unhealthy state may do so again sometime after a recycle; this prepares the environment for data capture.

Windows: in MTEMS

Right click on Agent name

Select Advanced

Select Edit Trace Parms

Place the following into the “Enter RAS1 Filters:” listbox

error (unit:kra ST ERR OUTPUT)(UNIT:kraafira all)

Linux/Unix non-instanced agents [like OS Agents]

Add the following to the xx.ini file in <installdir>/config

KBB_RAS1=error (unit:kra ST ERR OUTPUT)(UNIT:kraafira all)

Recycle the agent

Linux/Unix instanced agents

In this case there will be a file <installdir>/config/xx_<instance>.config. That file is only rewritten during an itmcmd config procedure. For this case add to the end of the .config file:

KBB_RAS1='error (unit:kra ST ERR OUTPUT)(UNIT:kraafira all)'

Photo Note: Fledgling barn swallows as the fog starts to lift.

 
