Sitworld: Eliminating Duplicate Agents

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #5 – 20 August 2020 – Level 0.61000

NOTE: This was an interesting idea… but I kept running into new special cases that prevented the ideas from working reliably. As a result consider this a failed attempt. At the present time no work is underway. The underlying issue continues to create severe practical monitoring problems. Most clients I talk to are aware and have manual procedures in place to resolve them as encountered.

Inspiration

Duplicate agent names in an ITM environment present a multi-dimensional horror. The condition arises when two different agents with the same name connect to a remote or hub TEMS. Here are the bad things that can happen:

1) Reduced monitoring because only one agent at a time can report results. As each agent connects, the earlier one is ignored and the new one has to be populated with the right situations. Effectively no agent gets to warn of issues reliably and thus monitoring is at best diminished and at worst almost eliminated. It can even lead to some historical data always being all zeroes.

2) Each time an agent is seen to change in some way [different IP address, different remote TEMS, different agent type, different agent version number, different listening port etc] the TEPS needs to recalculate the Topology, a representation of which ITM processes are connected to the remote and hub TEMSes. That can be very expensive, and duplicate agents cause the recalculation to happen frequently. TEPS performance gets very bad, visible as Navigator Updates Pending indicators that keep showing up.

3) The process of dropping one agent instance and starting up another is a workload drain on the remote TEMS [mostly] and the hub TEMS somewhat. This sometimes causes TEMS instability including crashes.

4) Duplicate agents create human confusion and wasted work. An alert is seen at an agent [one of a duplicate pair]. An investigation is launched at its duplicate pair and – no trouble found! The issue continues and human work is wasted.

5) Historical data becomes suspect. Historical data involves the agent name and in cases of duplicate agents you get double data that really apply to the pair. More human confusion.

There are lots of reasons to avoid duplicate agents. Until now repairing those cases has required intensive manual work. This new tool automates some of the work and reduces manual labor in correcting the condition. It doesn’t solve every problem but it speeds the process and exposes the data needed to begin manual corrections.

Parallel: Consider the second Thomas Crown Affair movie (1999), the story of an art robbery. The thief is notably dressed in a sharp suit and thin tie, with an attache case and a bowler hat. In the climactic scene hundreds of hired actors dressed identically are walking at random in the art museum halls and rooms. The thief walks away with the prize. I have often thought of this as an excellent parallel to having duplicate agents.

Background – How Duplicate Agents Arise

Here are some of the ways duplicate agents arise. This list is certainly incomplete because new methods are found from time to time. Agent names by default are formed from 1) a prefix [like the Primary seen in Windows OS Agents]; 2) the hostname; and 3) a suffix which indicates the agent type. Those are concatenated with colons [:] between the values, the result is truncated to at most 32 characters, and that is the agent name. The hostname portion can be overridden by setting the CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME environment variables, which changes the agent name. That is a common theme in recovery actions and won’t be specifically called out in the following examples.

1) The hostname may be so long that the composed agent name is more than 32 characters long. If there are multiple agents on the same system, the truncation may create duplicate agent names even though the agents are of different types.

2) The same type of agent may be installed in different install directories on the same system. In one memorable case, a Unix OS Agent was installed at /opt/IBM/ITM and a few years later a revised Unix OS Agent was installed on 345 systems at  /opt/IBM/ITM/bin. In another case, the customer really wanted two OS Agents on a Linux system and just installed them in two installation directories. In a third case, a separate team installed an agent and was unaware of the existing agent.

3) Two different systems may be created with the same hostname. Thus two agents will be on different systems with the same name.

4) During a remote deploy operation, the old agent is stopped. Sometimes that shutdown fails and the old agent lingers with the same name. This can be identified because the two agents are from the same system and with the same listening port. That is “impossible” for TCP communications so it means the old agent is mostly stopped but is still sending node status updates even though it isn’t doing anything else. The recovery here is to stop the agent manually. Then kill -9 the remaining agent. Then start the agent again.

5) Two agents could have the same CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME settings. Often that happens when a system cloning process fails to adjust the above settings. The worst case seen [so far] was 500+ Windows OS Agents with the same name.

6) The two agents could be in a cluster environment. That means two or more systems which run a single service like a database server… of course the database server is only active on one system at a time. In such a case a monitoring agent is sometimes set to run on all sides of the cluster. The hostname is often the same and that creates the duplicate agents condition. The correct recovery is to 1) start and stop the agent as the system being monitored is started/stopped, 2) use CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME to make sure the agent has the same name and 3) configure KDEB_INTERFACELIST=xx.xx.xx.xx where xx.xx.xx.xx is the Virtual IP Address for the cluster: a single IP address to access the service [like a database] regardless of which system in the cluster is active. In this way the TEMS sees a single system that occasionally stops and starts and NEVER changes name or IP address.

7) Some agents take on the role of managing agents. For example Tivoli Log Agent can run singly or it can run with multiple subnode agents. Those subnode agents are specified in a configuration list. It is all too possible for two subnode agents, connected to different managing agents, to have the same name. In that case you need to change the managing agent configuration list to ensure unique agent names.

8) The MQ agent has a unique way to specify a hostname. If that is not configured, the agent name will look like XXXXXX::MQ. That can create duplicate agents. The recovery here is to update the agent cfg and add “SET AGENT NAME(<hostname>)”. That fills in the blank name between the colons.

9) In ITM environments with z/OS agents, you may see some duplicate agents. There is no known way at present of resolving most of those issues.

10) Some large environments have little control of new agents being connected. You can eventually track down the issues but in the meantime you have a steady workload to recover.
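
The naming rule described at the start of this section, and the truncation problem in case 1, can be sketched in a few lines of Python. This is only an illustration and not how ITM itself builds names: the function name compose_agent_name is hypothetical, leaving out an empty prefix is an assumption, the MSS suffix is assumed for the example, and the 35 character hostname is made up.

```python
MAX_AGENT_NAME_LEN = 32  # limit stated in the text


def compose_agent_name(prefix, hostname, suffix):
    """Join the parts with colons and truncate to 32 characters.

    Assumption for this sketch: an empty prefix is simply left out.
    """
    parts = [hostname, suffix]
    if prefix:
        parts.insert(0, prefix)
    return ":".join(parts)[:MAX_AGENT_NAME_LEN]


# A long instance prefix plus hostname leaves no room for the suffix
# (the suffix "MSS" is assumed here for illustration).
sql_name = compose_agent_name("MSSQLSERVER", "ibm_my_kuldcpwsql06", "MSS")
print(sql_name, len(sql_name))

# Two different agent types on a hypothetical host with a 35 character
# hostname collapse to the same truncated agent name.
long_host = "h" * 35
lz_name = compose_agent_name("", long_host, "LZ")
ul_name = compose_agent_name("", long_host, "UL")
print(lz_name == ul_name)
```

Once the suffix is cut off, nothing in the name distinguishes the agent types, which is why a shorter CTIRA_HOSTNAME is the usual cure.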

A Semi-Automated Recovery Plan

Some of the above cases can only be recovered using manual configuration. The most labor intensive cases arise when there are many of them. It is not uncommon to see dozens to hundreds. Some are the simple case of an OS Agent and a Tivoli Log agent, same name on two different systems. Imagine if there were 120 of those pairs. You also see cases of 50 to 500 identical agents. Tackling the 500+ identical Windows OS Agents could take weeks and require access to each system to reconfigure the agents.

The goal of this project is to automate the most painful tasks. This requirement was a paradox for a long time until I recognized this: if you have a duplicate agent pair, they are almost worthless for all the above reasons, so little is lost by renaming whichever one happens to be connected. Agent configured hostnames can be changed using a command like this:

./tacmd setagentconnection -n <os_agent_name> -t <pc>
-e CTIRA_HOSTNAME=xxxxxxxxx CTIRA_SYSTEM_NAME=xxxxxxxx

Where

os_agent_name: the agent name of the OS agent on the system

pc: the product code [like NT for Windows OS Agent]

xxxxxxxx: the new hostname.

*note we always change CTIRA_HOSTNAME [ for TEMS awareness by agent name] and CTIRA_SYSTEM_NAME [for TEPS awareness] and make them the same to avoid confusion.

If there were only two duplicate agents, we could use the revised name as hostname-DUP1. When this completes – one of the two agents will have the old name and one will have a hostname with the characters -DUP1 appended. Thus there will be two different agent names and the duplicate agent name issue is resolved. More work is needed, of course.
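
As a rough illustration of the -DUPn idea, here is a small Python sketch that builds the rename command for one agent. It is not taken from dup2do.pl; the function name dup_rename_command and its parameters are hypothetical, and the agent name is borrowed from the TEMS Audit extract shown later in this post.

```python
def dup_rename_command(agent_name, product_code, hostname, n):
    """Return the tacmd line that renames one agent to hostname-DUPn.

    CTIRA_HOSTNAME and CTIRA_SYSTEM_NAME are always set to the same value,
    as recommended in the note above.
    """
    new_host = f"{hostname}-DUP{n}"
    return (
        f"./tacmd setagentconnection -n {agent_name} -t {product_code} "
        f"-e CTIRA_HOSTNAME={new_host} CTIRA_SYSTEM_NAME={new_host}"
    )


cmd = dup_rename_command("ibm_au_winatca5820:NT", "NT", "ibm_au_winatca5820", 1)
print(cmd)
```

Only the instance that is connected when the command runs receives the change, which is why one command is enough to split a pair.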

Any agent should be integrated into the ITM environment. Every agent is added to a system generated Managed System List like *NT_SYSTEM. The agent name may also have been added to custom Managed System Lists. Finally the agent name may have been added to the direct distribution of some situations. There are also uses of Situation Group direct distribution which is awaiting development [a test case is needed!]

To complete the work the user must

1) Rename the hostname-DUPn agent appropriately.
2) Add the new agent to any custom Managed System Lists where the old agent was used.
3) Add the new agent to the distribution of any situations where the old agent was used.
4) Find out why it happened and change processes so it doesn’t happen again.

The integration gets a little more lengthy if, say, you had 151 duplicate OS Agents. However the benefit is immediate: You get some monitoring going on, which is better than none. You can make all the subsequent changes remotely, for these cases anyway. There may be cases where manual configuration is needed.

You may also have some cleanup work to do. One customer with two sets of 100+ agents with same name found that the automation partially failed. Some of the attempted changes were rejected because the agent was *OFFLINE. However many changes were successful and the manual work was seriously reduced. You could also repeat the automation process after the first wave of cleanup.

Package Installation

The package is dup2do.0.61000 and contains the dup2do.pl Perl program and a dup2do.cmd for Windows usage.

If this will be run on a Linux/Unix system, Perl is almost always already installed. If this will be run on a Windows system, you will have to install Perl. My current choice is Strawberry Perl.

http://strawberryperl.com/

It is good quality without serious license restrictions. Of course if it will be used on a company system, you will likely want to contact your manager and check with legal.

DUP2DO Process

[dup2do stands for Duplication work to do]

This project uses files created by three other Sitworld projects.

Sitworld: TEMS Audit Process and Tool

The first step is to review the TEMS Audit report file  to see if duplicate agents have been identified. Here is a small extract of a good example:

TEMSREPORT082: Agent Flipping Report - Multiple Systems
Count,Agent,Hostaddrs, 
2,ibm_au_winatca5820:NT,ip.spipe:#9.13.192.26[REMOTE_AUULDPLITM020] ip.spipe:#10.114.95.68[REMOTE_AUBHDPLITM030],
2,ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.101.100[REMOTE_cgkibplitm010] ip.spipe:#10.132.187.63[REMOTE_cgkibplitm010],
2,ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.30[REMOTE_cgkibplitm020] ip.spipe:#10.132.101.103[REMOTE_cgkibplitm010],
This involves three agent names that are reporting from two different systems each. You see a count, the agent name, the protocol/system and remote TEMS involved.

Use a recent copy of TEMS Audit, level 2.25000 or later. Rerun it on your hub TEMS logs directory:

perl temsaud.pl -v -logpath /opt/IBM/ITM/logs  -dup

This will produce a file dedup.csv for dup2do. Here is an extract from the same source.

ibm_au_winatca5820:NT,ip.spipe:#9.114.95.68,
ibm_au_winatca5820:NT,ip.spipe:#9.13.192.26,
ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.101.100,
ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.187.63,
ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.103,
ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.30,
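
If you want to inspect dedup.csv before handing it to dup2do, a short Python sketch like the following groups the rows by agent name. The column layout (agent name, then protocol and address, then a trailing comma) is inferred from the extract above, and the function name group_duplicates is hypothetical.

```python
from collections import defaultdict

SAMPLE = """\
ibm_au_winatca5820:NT,ip.spipe:#9.114.95.68,
ibm_au_winatca5820:NT,ip.spipe:#9.13.192.26,
ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.101.100,
ibm_id_cgkdcplesb01a:LZ,ip.spipe:#9.132.187.63,
ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.103,
ibm_id_cgkdcqwcom02:NT,ip.spipe:#9.132.101.30,
"""


def group_duplicates(lines):
    """Map each agent name to the sorted list of addresses it reported from."""
    groups = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop the trailing comma, then split on the first comma only.
        agent, address = line.rstrip(",").split(",", 1)
        groups[agent].add(address)
    return {agent: sorted(addrs) for agent, addrs in groups.items()}


groups = group_duplicates(SAMPLE.splitlines())
for agent, addrs in groups.items():
    print(len(addrs), agent, " ".join(addrs))
```

The size of the largest group is a quick indicator of how long the rename process is likely to run.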

Copy that file to the directory where dup2do will be run. You can also leave it in place and copy the second and third needed files to that directory.

*NOTE* If there are relatively few duplicate agents [or none!], you can manually clean them up and skip the following more complex process. The process saves a lot of time when there are many duplicate agents but for a few agents manual configuration works fine.

The second file comes from this project

Sitworld: Situation Distribution Report

There is an initial data capture program which runs at the TEPS

Windows – sitinfo.cmd
Linux/Unix – sitinfo.sh [in sitinfo.tar]

Run that as instructed which will create a set of LST files.

Next run this command

perl sitinfo.pl -lst -onerow

The result will be in sitinfo.csv report file. Here is a small excerpt.

ibm_cpuutil_gntf_gsmabase,Fatal,9.13.192.26,ibm_au_winatca5820:NT,M|ibm_nt_infinity_prod;,*IF *VALUE NT_Processor.%_Processor_Time *GE 95 *AND *VALUE NT_Processor_Summary.High_Process_Name *NE mcshield *AND *VALUE NT_Processor.Processor *EQ '_Total',

ibm_dsp_gntc_win,Critical,9.13.192.26,ibm_au_winatca5820:NT,M|ibm_nt_infinity_prod;,*IF *VALUE NT_Logical_Disk.%_Used *GE 90 *AND *VALUE NT_Logical_Disk.Disk_Name *NE '_Total',

ibm_dsp_gntf_win,Fatal,9.13.192.26,ibm_au_winatca5820:NT,M|ibm_nt_infinity_prod;,*IF *VALUE NT_Logical_Disk.%_Used *GE 95 *AND *VALUE NT_Logical_Disk.Disk_Name *NE '_Total',

As you can see, it names the situation involved, the severity, the system IP address, the agent name, the distribution and the situation formula. The distribution field

M|ibm_nt_infinity_prod;

means it is distributed via a Managed System List [M] and gives the name.
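
To make the row layout concrete, here is a hedged Python sketch that splits one sitinfo.csv row into the fields just described. The field order is taken from the excerpt; the function name parse_sitinfo_row is hypothetical, and only the M distribution code is documented here, so any other code is passed through untouched.

```python
def parse_sitinfo_row(line):
    """Split one sitinfo.csv row into named fields.

    Only the first five commas are treated as separators, so commas
    inside the formula (the last field) would be preserved.
    """
    fields = line.strip().rstrip(",").split(",", 5)
    situation, severity, ip, agent, distribution, formula = fields
    # Each distribution entry looks like "M|name" and is closed by ";".
    entries = [tuple(e.split("|", 1)) for e in distribution.split(";") if e]
    return {
        "situation": situation,
        "severity": severity,
        "ip": ip,
        "agent": agent,
        "distribution": entries,
        "formula": formula,
    }


row = (
    "ibm_dsp_gntc_win,Critical,9.13.192.26,ibm_au_winatca5820:NT,"
    "M|ibm_nt_infinity_prod;,*IF *VALUE NT_Logical_Disk.%_Used *GE 90 "
    "*AND *VALUE NT_Logical_Disk.Disk_Name *NE '_Total',"
)
info = parse_sitinfo_row(row)
print(info["agent"], info["distribution"])
```

Collecting the distribution entries per agent name gives the list of Managed System Lists that a renamed agent will need to rejoin.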

Copy sitinfo.csv to the directory where dup2do will run. That might be in the logs directory.

The third file comes from the

Sitworld: Database Health Checker

That package comes with a shell file datasql.sh [in a datasql.tar container] or datasql.cmd. Follow the post instructions to run that using TEPS batch commands and it will create files QA1DNSAV.DB.TXT and QA1CNODL.DB.TXT. Those two files are optional inputs for DUP2DO: if they are not present, the dup2do_plus.csv file will not be created.

With those files ready, run the dup2do command

perl dup2do.pl

Options

-dupsleep nnnn     add nnnn seconds after each tacmd setagentconnection. This should usually be 660 seconds if there are cases of 3 or more systems where the duplicate agents are running. After a rename the original agent may not be online for a while. It does mean the process may run for a long time [even overnight], but the results are beneficial.

-dupall  hostname   for this hostname, rename all ITM agents on that system. The default is to rename just the OS Agent. Review dedup.csv manually to see if there are cases where both the OS Agent and other agents on a system are duplicates. In that case you can set the dupall option, which means all agents on that system will change hostname with a single command. You can set a dupall option for each such case. Otherwise you will have to do manual setagentconnection commands for the non-OS Agents on each such system.

The dup2do.pl command creates several output files.

dedup.sh – Linux/Unix shell command to run the needed tacmd commands for OS Agents. Following is an example from that command [a single line].

./tacmd setagentconnection -n ibm_au_winatca5820:NT -t NT -e CTIRA_HOSTNAME=ibm_au_winatca5820-DUP1 CTIRA_SYSTEM_NAME=ibm_au_winatca5820-DUP1

dedup.cmd – Windows shell command to run the needed tacmd commands
same as dedup.sh but for windows

dedup_nos.sh – Linux/Unix shell command to run the needed tacmd commands for other Agents.

dedup_nos.cmd – Windows command to run the needed tacmd commands for other Agents.

The above shell commands may contain comment lines about duplicate agents that require manual configuration. For example two TEPS might have the same name. Or an MQ agent has a duplicate name and needs a change to the cfg file to make the names unique.

The shell commands will run in waves with a 660 second delay between waves. That is because the duplicate agents not online need time to connect again. In extreme cases – like 150 duplicate names, it will still take a long time. The result is worth it.
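
To get a feel for how long a run might take, here is a back-of-the-envelope Python sketch. It rests on an assumption that is not confirmed by the tool itself: that each wave can rename only one instance of a given duplicate name, so a group of N identical names needs N-1 waves. The function name estimated_wait_seconds is hypothetical.

```python
def estimated_wait_seconds(largest_group, dupsleep=660):
    """Rough total sleep time, assuming one rename per name per wave."""
    waves = max(largest_group - 1, 0)
    return waves * dupsleep


secs = estimated_wait_seconds(150)
print(f"{secs} seconds, about {secs / 3600:.1f} hours")
```

Under that assumption the 150 duplicate case waits for more than a day in total, which matches the advice to let the process run unattended.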

There are also cases where duplicate agents are using indirect connections like with EPHEMERAL:Y or beyond a NATing firewall. In that case one can be converted but the process may need to be repeated.

dup2do.subnode.csv – Report on potential duplicate subnode agent names.

dup2do_correct.csv – Report on how to re-integrate the -DUPn agents into the ITM environment.
Example lines

MSL,ibm_nt_infinity_prod
add,ibm_au_winatca5820-DUP1:NT,

The new agent name has to be added to the above managed system list. The full report also shows the situations that need the agent name added to the distribution.

dup2do_edit.sh – Linux/Unix shell command to run the needed  commands

Example tacmd which will add the new agents to the named managed system list. In this case there were actually many new agents that needed adding.

./tacmd editsystemlist -e ibm_nt_infinity_prod -a ibm_au_aubhdpwcva001-DUP1:NT ibm_au_aubhdpwcva001-DUP2:NT …
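
The relationship between the dup2do_correct.csv lines and the editsystemlist command can be sketched in Python as follows. This is an illustration only: it handles just the two line types shown in the example lines [MSL and add], ignores everything else in the full report, and the function name editsystemlist_commands is hypothetical.

```python
SAMPLE_REPORT = """\
MSL,ibm_nt_infinity_prod
add,ibm_au_winatca5820-DUP1:NT,
"""


def editsystemlist_commands(lines):
    """Build one tacmd editsystemlist command per managed system list."""
    additions = {}
    current_msl = None
    for line in lines:
        fields = line.strip().split(",")
        if fields[0] == "MSL" and len(fields) > 1:
            current_msl = fields[1]
            additions.setdefault(current_msl, [])
        elif fields[0] == "add" and len(fields) > 1 and current_msl:
            additions[current_msl].append(fields[1])
    return [
        f"./tacmd editsystemlist -e {msl} -a " + " ".join(agents)
        for msl, agents in additions.items()
        if agents
    ]


commands = editsystemlist_commands(SAMPLE_REPORT.splitlines())
for command in commands:
    print(command)
```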

dup2do_edit.cmd – Windows shell command to run the needed commands
same as dup2do_edit.sh but for Windows

The last file dup2do_plus.csv knits together the dedup.csv data and the QA1DNSAV.DB.TXT and QA1CNODL.DB.TXT data. It supplies supporting data when validating what will be happening. The second column “dup” means it came from the dedup.csv. The “msn” means it came from the TEMS database snapshot.

Case 1:  simplest case

9.22.71.58,dup,ibm_s4450024:06,ip.spipe:#9.22.71.58,
9.22.71.58,dup,ibm_s4450024:NT,ip.spipe:#9.22.71.58,

9.22.71.62,dup,ibm_s4450024:06,ip.spipe:#9.22.71.62,
9.22.71.62,dup,ibm_s4450024:NT,ip.spipe:#9.22.71.62,
9.22.71.62,msn,ibm_s4450024:06,Y,06,03.43.00,ip.spipe:#9.22.71.62[11853]<NM>ibm_S4450024</NM>,
9.22.71.62,msn,ibm_s4450024:NT,Y,NT,06.30.07,ip.spipe:#9.22.71.62[7757]<NM>ibm_s4450024</NM>,

The hostname ibm_s4450024 was used on two different systems, each with an NT agent and a 06 agent. Both systems were called out as having duplicate agents. The TEMS database shows that only two agents are actually online at the moment, as one system stole the other’s identity. This is a case where you can add -dupall ibm_s4450024 to the DUP2DO invocation and both the NT and 06 agents will get the new names.

Case 2 – more complicated

9.30.34.38,dup,blinsts:ibm_smlsxls032:UD,ip.spipe:#9.30.34.38,
9.30.34.38,msn,ibm_smlsxlm032:KUL,Y,UL,06.22.02,ip.spipe:#9.30.34.38[15949]<NM>ibm_smlsxlm032</NM>,
9.30.34.38,msn,ibm_smlsxlm032:08,Y,08,03.20.00,ip.spipe:#9.30.34.38[7757]<NM>ibm_smlsxlm032</NM>,
9.30.34.38,msn,ibm_smlsxlm032:LZ,Y,LZ,06.30.07,ip.spipe:#9.30.34.38[11853]<NM>ibm_smlsxlm032</NM>,

9.30.34.39,dup,blinsts:ibm_smlsxls032:UD,ip.spipe:#9.30.34.39,
9.30.34.39,msn,blinsts:ibm_smlsxls032:UD,Y,UD,07.10.00,ip.spipe:#9.30.34.39[15949]<NM>ibm_smlsxls032</NM>,
9.30.34.39,msn,ibm_smlsxlm033:08,Y,08,03.20.00,ip.spipe:#9.30.34.39[7757]<NM>ibm_smlsxlm033</NM>,
9.30.34.39,msn,ibm_smlsxlm033:LZ,Y,LZ,06.30.07,ip.spipe:#9.30.34.39[11853]<NM>ibm_smlsxlm033</NM>,

Here you see a case where the UL/08/LZ agents have proper and different names. But the UD agent has the same hostname on both systems. At this moment, the UD agent on 9.30.34.38 is not seen at the TEMS because the UD agent on 9.30.34.39 with the same agent name has stolen the identity. In this case someone needs to manually fix up the UD agent on 9.30.34.38 so the two UD Agents will cooperate.

This may actually be a clustering case, where the UD agent(s) should be started and stopped with the DB2 systems, and the IP address forced to the Virtual IP address that DB2 uses whatever system it is running on.
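
The reading of both cases comes down to one check: a name that appears in a dup row for an IP address but has no msn row at that address is currently hidden by its twin. Here is a hedged Python sketch of that check. The column positions are inferred from the extracts above, any msn row is treated as "seen in the snapshot" without interpreting the other columns, and the function name hidden_duplicates is hypothetical.

```python
from collections import defaultdict

CASE_1 = """\
9.22.71.58,dup,ibm_s4450024:06,ip.spipe:#9.22.71.58,
9.22.71.58,dup,ibm_s4450024:NT,ip.spipe:#9.22.71.58,
9.22.71.62,dup,ibm_s4450024:06,ip.spipe:#9.22.71.62,
9.22.71.62,dup,ibm_s4450024:NT,ip.spipe:#9.22.71.62,
9.22.71.62,msn,ibm_s4450024:06,Y,06,03.43.00,ip.spipe:#9.22.71.62[11853]<NM>ibm_S4450024</NM>,
9.22.71.62,msn,ibm_s4450024:NT,Y,NT,06.30.07,ip.spipe:#9.22.71.62[7757]<NM>ibm_s4450024</NM>,
"""


def hidden_duplicates(lines):
    """Return {ip: [agent, ...]} for dup rows with no msn row at the same IP."""
    dup = defaultdict(set)
    msn = defaultdict(set)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue  # blank or malformed line
        ip, source, agent = fields[0], fields[1], fields[2]
        if source == "dup":
            dup[ip].add(agent)
        elif source == "msn":
            msn[ip].add(agent)
    hidden = {}
    for ip, agents in dup.items():
        missing = agents - msn[ip]
        if missing:
            hidden[ip] = sorted(missing)
    return hidden


hidden = hidden_duplicates(CASE_1.splitlines())
for ip, agents in hidden.items():
    print(ip, "is hidden for", ", ".join(agents))
```

For Case 1 this flags both agents on 9.22.71.58, the system whose identity was stolen.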

Manual Finishing Work

The initial dedup.sh [or dedup.cmd] run creates non-duplicated agents, but with strange names.

Decide on what your site agent names should be. Do another tacmd setagentconnection command to set the agent name as desired. Carefully track the old and new -DUPn names and the eventual site agent names. Run a tacmd cleanms command [or the TEP equivalent] to remove the -DUPn names which show as offline after that process.

Use the dedup_nos.sh or dedup_nos.cmd commands to clean up non OS Agents.

In all cases, review the shell comment lines and do manual corrections as needed.

Update needed MSLs and Situation Distributions using the dup2do_correct.csv report file or the dup2do_edit.sh [or dup2do_edit.cmd] commands. Track the original name, the -DUPn name and the selected site agent name to make the right choices.

After a hub TEMS recycle and running for a while, redo the TEMS Audit and look at the TEMSREPORT082 report section to see what issues if any remain. Usually at this point a manual effort is required. Here is an example:

3,MSSQLSERVER:ibm_my_kuldcpwsql06:,ip.spipe:#9.136.174.49[REMOTE_IBBFDPLITM020] ip.spipe:#9.136.174.30[REMOTE_IBBFDPLITM020] ip.spipe:#9.136.174.48[REMOTE_IBBFDPLITM020],
2,MSSQLSERVER:ibm_sg_sgdcwpwsql01,ip.spipe:#9.105.171.34[REMOTE_sinibplitm030] ip.spipe:#9.105.171.33[REMOTE_sinibplitm030],

The names are so long a truncation has occurred. In cases like these you need to examine each system and see what is happening. There could be clustering conflicts happening, but it could be subnode agents. Or maybe you should set the CTIRA_SUBSYSTEM_ID to a shorter value. That sets the first section of the agent name… which defaults to blank except for Windows OS Agent.

Note that z/OS agents are ignored in this process. Eliminating duplicate agent names is a future goal.

Summary

This tool and process will ease the effort of detecting and resolving duplicate agent name issues. This action will improve monitoring, reduce TEMS impact, reduce human confusion and help TEPS performance. The benefit is well worth the effort.

Sitworld: Table of Contents

History and Earlier versions

dup2do.0.61000
Handle case where only dedup.csv is present

dup2do.0.60000
On OS Agents, process even if missing in TNODESAV

dup2do.0.59000
Handle non-OS Instanced agents better

dup2do.0.58000
Handle OS agents and non-OS agents separately, track more error cases

dup2do.0.57000
Handle multiple levels better, warn multiple ephemerals

dup2do.0.56000
Handle more non-OS Agent cases, do not change HD/WPA agents

dup2do.0.55000
Don’t use setagentconnection on MQ type agents
more non-os agent logic improvements

dup2do.0.54000
handle some non-OS Agent cases

dup2do.0.53000
handle managing agents better
make better output names
handle Situation Group distributions

dup2do.0.52000
handle long hostnames

dup2do.0.51000
correct sleep logic

Photo Note: Orchids Galore on the Kitchen Counter
