Sitworld: ITM Silver Blaze – Agent Responsiveness Checker

By John Alvord

IBM Corporation

jalvord@us.ibm.com

Please note: while still interesting, this project has been largely superseded by Sitworld: ITM Agent Health Survey.

Inspiration

I have a great job. People come to me with puzzles and I get paid to investigate. A recent customer had 400+ Solaris systems running Unix OS Agent at ITM 622 FP5 and earlier levels. By chance they identified a single instance of a Unix OS agent that was not running situations. They were naturally worried there could be other cases.

Introduction

 In ITM, there are occasionally agents that report online but are not running situations. When real time data is requested, the request times out. I call them non-responsive agents and have puzzled for years about how to detect them easily.

 If you suspect a non-responsive agent, you can attempt to view real time data and observe the timeout condition. That requires expensive manual work for each agent, and you can never be sure that things remain good afterward. Once you find a single such case you will worry every day and night. A single situation not firing can be costly.

 With this new inspiration, I remembered a famous Conan Doyle short story about Sherlock Holmes titled Silver Blaze. Sherlock resolved a mystery by noting that a watch dog did not bark in the night.

Silver Blaze Overview

 There are three components to the Silver Blaze scheme: a situation, a workflow policy and a Perl program. An example implementation is available as a zip file [Right-click/Save As…]. The goal of this example is to identify all non-responsive Linux OS Agents. The files can also be found here: https://github.com/jalvo2014/silverblaze

 A)    Timer situation: IBM_cycle_101

Formula: *IF *VALUE KLZ_System_Statistics.System_Name *NE xxx

Sampling Interval: 15 minutes

Run at Startup: No

Distribution: *LINUX_SYSTEM

The situation does not need to run at startup since it is used only by a workflow policy. The KLZ_System_Statistics attribute group is used because it has System_Name (the agent name) among its attributes.

B) A very simple Workflow Policy:  IBM_Policy_101

[Figure: workflow policy IBM_Policy_101]

The take action command

[Figure: take action command settings]

The action command is

touch   /tmp/&{WaitOnSituation1:KLZ_System_Statistics.System_Name}.touch

The Linux/Unix touch command creates a zero length file or updates the modify time on an existing file. See Appendix 1 for a Windows batch file, wintouch.bat, that accomplishes the same thing. The Windows example is included in the example files.
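The touch behavior the scheme relies on can be illustrated with a minimal Python sketch. This is not part of the Silver Blaze example files; the scratch directory and the back-dating step exist only for the demonstration, and the file name reuses a node name from the example log. It assumes a Linux/Unix file system, where a colon is legal in a file name.

```python
import os
import tempfile
from pathlib import Path


def touch(path: Path) -> None:
    """Create a zero-length file, or refresh the modify time if it exists."""
    path.touch(exist_ok=True)


# Demonstration in a scratch directory; "xxx180:LZ" stands in for the
# System_Name that the workflow policy substitutes into the command.
demo_dir = Path(tempfile.mkdtemp())
marker = demo_dir / "xxx180:LZ.touch"

touch(marker)                       # first result: the file is created
created_size = marker.stat().st_size

# Pretend the last result arrived an hour ago by back-dating the file.
an_hour_ago = marker.stat().st_mtime - 3600
os.utime(marker, (an_hour_ago, an_hour_ago))

touch(marker)                       # next result: modify time is refreshed
refreshed = marker.stat().st_mtime > an_hour_ago

print(f"size={created_size} refreshed={refreshed}")
```

The point of the design is that the file content never matters. Only the name (which agent) and the modify time (when it last reported) carry information.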

 The example workflow policy has the same distribution as the situation: *LINUX_SYSTEM. This means the policy is active on each TEMS where a Linux OS agent connects. The Take Action options force the command to run on the hub TEMS. The workflow policy correlation is “managed system”.

 When the policy is started [or auto_started], the situation is automatically started on each Linux OS Agent. The situation runs in results-only mode and does not create events. Every 15 minutes the agent sends a new result to the TEMS. The workflow processes the result and then runs a command on the hub TEMS.

 In this example the /tmp/ directory was used for the touch files. You can of course pick any target directory.

 After the situation sends results and the workflow policy runs, the /tmp directory fills with files having the names of Linux OS Agents which are active and processing situations.

C)    An example Perl program itm_unresp.pl  identifies problem OS Agents using this logic:

a.       Determine what agents are online using tacmd listsystems

b.      Collect the names of the touch files

c.       Print out names of online agents which do not have touch files

d.      Print out names of touch files which are not listed as online agents

e.       Print out the names of touch files which are late by some predetermined number of seconds.

This example Perl program is configured with user specified values at the beginning which tell how many seconds is considered late, user/password for tacmd, target directory for files, etc. The itm_unresp.pl has been tested on both Linux and Windows.
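The comparison logic in steps b through e can be sketched as follows. This is a Python illustration under stated assumptions, not the author's itm_unresp.pl: the function names are hypothetical, the list of online agents is passed in rather than captured from tacmd listsystems, and the node name "gone:LZ" is invented to show the orphaned-file case. It assumes a Linux/Unix file system, where colons are legal in file names.

```python
import os
import tempfile
import time
from typing import Dict, Iterable, List, Optional, Tuple


def collect_touch_files(directory: str, ext: str = "touch") -> Dict[str, float]:
    """Map agent name -> last-modified epoch seconds for each touch file."""
    suffix = "." + ext
    found: Dict[str, float] = {}
    for entry in os.listdir(directory):
        if entry.endswith(suffix):
            full = os.path.join(directory, entry)
            found[entry[: -len(suffix)]] = os.path.getmtime(full)
    return found


def classify(
    online: Iterable[str],
    touched: Dict[str, float],
    late_secs: int,
    now: Optional[float] = None,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (missing, orphaned, late) lists of agent names."""
    now = time.time() if now is None else now
    cutoff = now - late_secs
    online_set = set(online)
    touched_set = set(touched)
    missing = sorted(online_set - touched_set)    # online, no touch file
    orphaned = sorted(touched_set - online_set)   # touch file, not online
    late = sorted(n for n, mtime in touched.items() if mtime < cutoff)
    return missing, orphaned, late


# Demonstration with a scratch directory and fixed timestamps.
demo_dir = tempfile.mkdtemp()
now = 1364144563
for name, mtime in [("xxx180:LZ", now - 43), ("gone:LZ", now - 5)]:
    path = os.path.join(demo_dir, name + ".touch")
    open(path, "w").close()
    os.utime(path, (mtime, mtime))

online = ["xxx180:LZ", "XXX185.xxx.xxx.ibm.com:LZ"]
missing, orphaned, late = classify(
    online, collect_touch_files(demo_dir), late_secs=30, now=now
)
print("missing:", missing)
print("orphaned:", orphaned)
print("late:", late)
```

Treating both sides as sets keeps the three reports independent of each other, so one agent can appear in more than one list if that is what the files show.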

For Linux/Unix, the location of Perl is specified in the first line of itm_unresp.pl

#!/usr/bin/perl -w 

If that is different on the system where itm_unresp.pl will run, you will need to change that first line. On Windows, Perl is found through the PATH after installation and that line is ignored. The “-w” enables certain warnings.

Controls for itm_unresp.pl

These controls are in the beginning of the itm_unresp.pl source. Modify them to match your requirements.

Variable             Default     Explanation
$local_hub_server    xxx         server hub name or IP address where hub TEMS runs
$local_userid        userid      valid userid for tacmd login
$local_password      password    password for userid
$local_agent_code    lz          product code of agent [ux/nt/lz etc]
$local_dir           /tmp/       directory where touch files are stored [Windows: c:/temp/]
$local_ext           touch       file extension of touch files
$local_late_secs     1800        touch file lateness factor, in seconds
$fake_node           0           when 1, adds a fake nodeid to test logic

Example Results Log

In the test environment, the log looked like this:

Online agent check start

node xxx180:LZ modify[1364144520] late[1364144533]

node xxx182.xxx.xxx.ibm.com:LZ modify[1364144517] late[1364144533]

Online agent XXX185.xxx.xxx.ibm.com:LZ missing from touch files

Online agent check complete

In this test lateness was defined as 30 seconds and the situation sampling interval was set to 60 seconds. This setup deliberately forced lateness messages. The modify value is the epoch seconds when the file was last modified. The late value is the current epoch time minus the lateness seconds defined in itm_unresp.pl.
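The arithmetic behind the first log line can be worked through in a short Python sketch. The function name is hypothetical, the strict less-than comparison is an assumption about how lateness is decided, and the current time is inferred from the log (the late value 1364144533 plus the 30 second lateness setting), not taken from the program.

```python
from typing import Optional


def late_report(node: str, modify: int, now: int, late_secs: int) -> Optional[str]:
    """Return a log-style line if the touch file is older than the cutoff."""
    cutoff = now - late_secs          # the "late" value shown in the log
    if modify < cutoff:               # assumption: strictly older means late
        return f"node {node} modify[{modify}] late[{cutoff}]"
    return None


NOW = 1364144563                      # inferred: 1364144533 + 30

# Touch file last modified 43 seconds before NOW: older than the cutoff.
line = late_report("xxx180:LZ", 1364144520, NOW, 30)

# A hypothetical file modified 13 seconds before NOW: not late.
fresh = late_report("xxx180:LZ", 1364144550, NOW, 30)

print(line)
print(fresh)
```

With a 60 second sampling interval and a 30 second lateness setting, a touch file is older than the cutoff for roughly half of every cycle, which is why this test setup reliably produced lateness messages.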

 The missing touch file message was produced by an option to add a fake online node.

 There is a message type “node $node not in online capture” which means there is a touch file present but the agent is not currently online. I suspect that means the agent has gone offline and the touch file should be deleted. That logic is not yet implemented.

Alternative setups

 It might be inconvenient or impossible to run the itm_unresp.pl program on the hub TEMS. If so, pick any system with a Windows/Linux/Unix OS Agent. Change the workflow policy so the touch [or Windows wintouch.bat] command runs on that Agent. Then you can run the itm_unresp.pl summary program on that same system with the same results.

 Linux/Unix systems usually come with Perl already installed. If your target is a Windows system, then install Perl from www.activestate.com which has an excellent free version. The itm_unresp.pl program only uses built in or core facilities.

Outstanding Customer Results

 Using the Silver Blaze scheme, the customer determined after roughly an hour of effort that 167 agents were stalled. A study of 115 agent operations logs revealed evidence of a defect corrected at ITM 622 FP6 when the TEMA threading logic was reworked. An upgrade to ITM 623 FP2 was already underway and was thereby accelerated. Updating each Unix OS Agent was sufficient to resolve the issue for all ITM agents running on each system. In the meantime, stuck agents were recycled as needed and monitoring continued.

 The underlying ITM issues have been resolved over time, but not everyone runs the latest maintenance level. In addition, the problem can be environmental, like a full mount point or some competing process in a loop. [See Appendix 2 for five APAR fix examples.]

 Having a centralized facility to identify non-responsive agents will speed resolving such issues. Until the problem can be corrected, early identification and recycling will reduce the exposure time running agents in a non-responsive mode.

Summary

This scheme provides a way to view non-responsive agents reliably. It can also be used as a long term checker for these issues. After an initial scan and cleanup, the sampling interval should be changed to once a day or so.

Original Publish Date: 2013/3/28

Sitworld: Table of Contents

Future Goals

 Make log better looking.

 Right now the itm_unresp.pl program needs to be run manually and the resulting log then checked manually. Connect problem results to a monitored log to produce events.

 Handle multiple agent types with one tacmd listsystems call.

 Do you have any other ideas or edit suggestions? Please comment in the blog entry or send email to jalvord@us.ibm.com. If you find improvements to the scheme, please let me know so everyone can benefit.

Appendix 1: Windows and the touch command

Windows does not have the touch command. In addition, managed system names contain a colon [:], which is illegal in a filename in the Windows file system.

 Here is a small batch file, wintouch.bat, that does the same thing as Linux/Unix touch:

 ==================================================

@SetLocal EnableDelayedExpansion

@set FILE=%1

@type nul >>%FILE::=_%.touch & copy %FILE::=_%.touch +,, >nul 2>&1

@EndLocal

==================================================

 Unless you are seriously deep into Windows geek-land, this surely seems mysterious. Here is an explanation:

 The leading @ character suppresses the echoing of commands.

 SetLocal  EnableDelayedExpansion makes sure that environment variables are substituted line by line instead of all at once during the pre-execution phase.

 Set FILE=%1 takes the first bat file argument, which will normally be the managed system name, and sets it into an environment variable FILE.

 %FILE::=_% creates a string from the FILE environment variable where each colon [:] is translated into an underscore [_]. This avoids using the colon in the file name.

 type nul >>filename   – appends the null file to a given file, creating a zero length file if not present.

 The ampersand [&] means to run the first command and then run the second command.

 copy filename +,,  – this copies a file onto itself, thereby updating the modify time. It can be expensive if the file is large but this is a zero length file.

>nul 2>&1  – this suppresses any standard output and standard error output from copy.

 The file extension [here touch] must match the itm_unresp.pl program.

 In practical use, you would create a wintouch.bat file based on the one in the example zip file and save it on the Windows system in a known location. In the Workflow Policy take action command, set the fully qualified name of the wintouch.bat command file. The itm_unresp.pl command is aware of the changed form of the Agent name and will make the right tests when run on a Windows system.
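One way a checker could account for the changed name form is sketched below in Python. This is an assumption about the approach, not a description of what itm_unresp.pl does internally. The function name and the windows flag are hypothetical; the node names come from the example log.

```python
from typing import Dict, Iterable


def touch_file_name(node: str, ext: str = "touch", windows: bool = False) -> str:
    """File name expected for a managed system name.

    On Windows each colon becomes an underscore, mirroring %FILE::=_%
    in wintouch.bat; on Linux/Unix the name is used unchanged.
    """
    stem = node.replace(":", "_") if windows else node
    return f"{stem}.{ext}"


def expected_files(online: Iterable[str], windows: bool) -> Dict[str, str]:
    """Map expected touch file name -> managed system name."""
    return {touch_file_name(n, windows=windows): n for n in online}


online = ["xxx180:LZ", "xxx182.xxx.xxx.ibm.com:LZ"]
lookup = expected_files(online, windows=True)

for file_name, node in sorted(lookup.items()):
    print(f"{file_name} <- {node}")
```

The sketch maps from managed system name to file name and not the other way around. A host name may itself contain an underscore, so turning underscores in a file name back into colons would not always recover the original name.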

Appendix 2: Non-responsive agent APAR fix examples

 These are examples of APAR fixes which handled cases where an agent might end up non-responsive. The list is not complete but these are the ones I remember. These are rarely observed when an agent is running with up to date maintenance. There are also many environmental problems which can have the same result.

 First is a case where the Agent Support [TEMA] threading model needed work to avoid a deadlock. It could theoretically happen on any Linux/Unix environment but in practice was only seen on the Solaris Unix OS Agent. Corrected in ITM 622 FP6  and in ITM 623 FP1.

 DEPLOYMENT COMMAND SOMETIMES HANGS ON SOLARIS                

http://www-01.ibm.com/support/docview.wss?uid=swg1IV01053    

Second is a case where a slow cinfo caused misbehavior, fixed in ITM 622 FP4.

 WATCHDOG CONTINUES TO RESTART OS AGENT

http://www-01.ibm.com/support/docview.wss?uid=swg1IZ85924

Third is a case of an environmental problem that was not handled well, fixed in ITM 622 FP4.

 NETWORK INTERFACE WITH MULTIPLE ADDRESSES BREAKS CINFO -R RESULTS IN KCAWD TERMINATING AGENT DUE TO HOSTNAME RESOLUTION

http://www-01.ibm.com/support/docview.wss?uid=swg1IZ86997

Fourth is a case where watchdog stopped the wrong process, corrected on ITM 622 FP6.

 WATCHDOG CALLS TO CINFO SOMETIMES DO NOT TERMINATE.

http://www-01.ibm.com/support/docview.wss?uid=swg1IZ99020

Fifth is a case where a process type situation might take 30+ minutes to evaluate, corrected on ITM 622 FP6.

 ITM KUXAGENT HAS PERFORMANCE PROBLEM IN RESOLVING TTY NAMES FOR PROCESSES ON HP-UX

http://www-01.ibm.com/support/docview.wss?uid=swg1IV02284

These are rare cases and are often never seen by customers. 

 
