Sitworld: FTO Configuration Audit


Version 0.81000 17 March 2017

John Alvord, IBM Corporation

jalvord@us.ibm.com


Inspiration

In several recent cases, the hub TEMS randomly became inoperative. After long study and diagnostic data collection, the conclusion was that the Fault Tolerant Option [FTO or Hot Standby or Mirror] configuration was incorrect. In one case several z/OS remote TEMSes were missing the CMS_FTO=YES control. In another, the distributed remote TEMS glb_site.txt file had one entry that pointed to another remote TEMS instead of the two hub TEMSes as required. These problems took several months to discover and test, so I decided this aspect was ripe for an audit tool. That way any customer can make sure their FTO setup is configured correctly.

Background

FTO works by having two hub TEMSes configured together. At any one time one hub TEMS takes the primary role [the first one to start] and the other hub TEMS takes the backup role. There is a TEMS-to-TEMS conversation and new user data is propagated from the hub TEMS in primary role to the hub TEMS in backup role. The backup hub TEMS actually accepts remote TEMS and Agent connections, but shortly afterward tells them to “find another TEMS” and disconnects. At the most recent levels it doesn’t run any situations.

The remote TEMS logic is simpler. First, if FTO is not being used [CMS_FTO=NO or not defined], then at startup the glb_site.txt entries list the hub TEMSes that might be available. Each one is tried in turn until a successful connection is made. From then on that is the only hub TEMS that will be connected to until the next remote TEMS startup.

Second, if FTO is being used [CMS_FTO=YES], the same initial logic is followed to find a working hub TEMS. The difference comes after a loss of hub TEMS connection: at that time the remote TEMS starts looking again for a working hub TEMS. In that way it will find the new hub TEMS in primary role after a switch-over.
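The two cases can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not TEMS code: the function names, the can_connect callback, and the sample addresses are all invented for illustration.

```python
from typing import Callable, Iterable, Optional


def find_hub(sites: Iterable[str], can_connect: Callable[[str], bool]) -> Optional[str]:
    """Try each glb_site.txt entry in order and return the first that connects."""
    for site in sites:
        if can_connect(site):
            return site
    return None


def hub_after_connection_loss(sites, can_connect, current, cms_fto):
    """Model what a remote TEMS does when the hub connection is lost."""
    if not cms_fto:
        # CMS_FTO=NO or undefined: stay tied to the hub found at startup.
        return current
    # CMS_FTO=YES: start looking again for a working hub TEMS.
    return find_hub(sites, can_connect)


# Hypothetical glb_site.txt entries for two hub TEMSes.
sites = ["ip.pipe:nmp180", "ip.pipe:nmp182"]
up = {"ip.pipe:nmp180"}
current = find_hub(sites, lambda s: s in up)

# Simulate a switch-over: the first hub goes away, the second takes over.
up.clear()
up.add("ip.pipe:nmp182")
print(hub_after_connection_loss(sites, lambda s: s in up, current, cms_fto=True))
print(hub_after_connection_loss(sites, lambda s: s in up, current, cms_fto=False))
```

With CMS_FTO=YES the model moves to the second hub. Without it, the model stays pointed at the hub that is no longer there, which is the kind of mismatch the audit looks for.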

If the FTO configuration is not identical across all hub and remote TEMSes, things won’t work. The big surprise is how badly things fail, including hub TEMS breaking.

The rest of this post presents a new tool which will perform all the needed checking and report on discrepancies. The cases where a manual check is needed are also documented. By using this tool you can validate that the configurations are correct and fix any issues before experiencing outages. Or, if you suspect this issue, you can rule it in or out quickly.

Preparing for the install

Perl is usually pre-installed on Linux/Unix systems. For Windows you may need to install it from www.activestate.com or any other source. The program uses only Perl core services and no CPAN modules are needed.

FTO Audit has been tested with:

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread

zLinux with Perl v5.8.7

This tool runs on the same system as a TEPS connected to the current hub TEMS.

A zip file is found at ftoaudit.0.81000. It contains one file, ftoaudit.pl.

Run Time Options

Options:

-h                           [optional] supply ITM installation directory if not default. You can also set an environment variable [Windows SET CANDLE_HOME=xxxxx or Linux/Unix export CANDLEHOME=xxxxx] before starting ftoaudit.pl.

-v                           show log messages during process

-debug                    run in debug mode

-debuglevel             default 99. If set to 300 log file is more detailed.

-work                     default C:\TEMP or /tmp – where to store report, log, and working files

-o                          default ftoaudit.csv – name of report file
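As an illustration of how the options combine, here is a small Python sketch that assembles a command line. It assumes that -h, -work and -o each take their value as the next argument, which is how the list above reads. The helper name and the example path /opt/IBM/ITM are placeholders of mine, not part of the tool.

```python
import shlex


def build_ftoaudit_command(itm_home=None, work_dir=None,
                           report="ftoaudit.csv", verbose=False):
    """Assemble an argument list for running ftoaudit.pl under Perl."""
    cmd = ["perl", "ftoaudit.pl"]
    if itm_home:
        # -h: ITM installation directory, needed only if not the default.
        cmd += ["-h", itm_home]
    if verbose:
        # -v: show log messages during processing.
        cmd.append("-v")
    if work_dir:
        # -work: where the report, log and working files are stored.
        cmd += ["-work", work_dir]
    # -o: name of the report file.
    cmd += ["-o", report]
    return cmd


# Example install path only; substitute your own directory.
print(shlex.join(build_ftoaudit_command(itm_home="/opt/IBM/ITM", verbose=True)))
```

The resulting list can be handed to subprocess.run on a system where Perl and ftoaudit.pl are available.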

Report Limitations

This logic will retrieve and cross-check the CMS_FTO environment variable value on every TEMS.

The glb_site.txt checking works only on Windows/Linux/Unix remote TEMS and only when there is an OS Agent active on the same system.

Any z/OS remote TEMS will need manual checking. The KDCSSITE member is equivalent to the glb_site.txt. KDSENV will contain the CMS_FTO setting, if present.

FTO Configuration Audit Report

Here is a sample report with interspersed comments.

FTO Configuration Audit Report – Version 0.80000

Primary Hub TEMS – HUB_NMP180

Backup Hub TEMS – HUB_NMP182

==> lists the detected primary and backup hub TEMSes. If this is wrong, maybe the TEPS is not connected to the FTO primary hub TEMS.

Impact,Advisory Code,Object,Advisory

100,CMSFTO1006E,HUB_NMP180,Hub TEMS running FTO some remote TEMS not using same glb_site.txt – see later report

===> See following for list of all advisory messages

Remote TEMS glb_site.txt report

remote_tems,product,osagent,glb_site.txt

REM_NMP183,LZ,nmp183:LZ,ip.pipe:nmp180x|ip.pipe:nmp182|,

REM_NMP184,LZ,nmp184:LZ,ip.pipe:nmp180|ip.pipe:nmp182|,

===> note how the NMP183 has an extra added “x” where I forced an error.
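If you want to post-process this report section yourself, the comparison is easy to sketch in Python. The snippet below parses the two sample rows shown above and groups the remote TEMSes by their glb_site.txt entries. It is my own illustration and not how ftoaudit.pl is implemented, and the function names are made up.

```python
REPORT_ROWS = """\
REM_NMP183,LZ,nmp183:LZ,ip.pipe:nmp180x|ip.pipe:nmp182|,
REM_NMP184,LZ,nmp184:LZ,ip.pipe:nmp180|ip.pipe:nmp182|,
"""


def parse_glb_rows(text):
    """Map each remote TEMS name to its tuple of glb_site.txt entries."""
    sites = {}
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or malformed rows
        # The fourth column holds the entries separated by "|".
        sites[fields[0]] = tuple(e for e in fields[3].split("|") if e)
    return sites


def group_by_entries(sites):
    """Group remote TEMS names by identical glb_site.txt contents."""
    groups = {}
    for tems, entries in sites.items():
        groups.setdefault(entries, []).append(tems)
    return groups


sites = parse_glb_rows(REPORT_ROWS)
groups = group_by_entries(sites)
if len(groups) > 1:
    print("glb_site.txt contents differ across remote TEMSes:")
    for entries, names in groups.items():
        print("  ", " ".join(entries), "<-", ", ".join(names))
```

This is a literal string comparison, so it also flags harmless differences, such as one file using resolvable names and another using ip addresses. Those still need a human review.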

Elapsed Time report hub TEMS 2.82865595817566

tems,var_elapsed,glb_elapsed,

REM_NMP183,2.79777908325195,2.81095504760742,

HUB_NMP180,3.29944014549255,2.84404110908508,

REM_NMP184,2.90863513946533,2.69522094726562,

HUB_NMP182,2.78772282600403,3.08135104179382,

===> The above report section is interesting and may detect cases of high latency between the hub TEMS and other TEMSes. The elapsed time is larger than you might expect because there is a Java startup cost in the KfwSQLClient utility that gets used.

===> The end of the report contains an explanation of the advisory messages.

Advisory Trace, Meaning and Recovery suggestions follow

Advisory code: CMSFTO1006E

Text: Hub TEMS running FTO some remote TEMS not using same glb_site.txt – see later report

Impact: 100

Meaning: In FTO configuration remote TEMSes need to have a

configuration that specifies the two hub TEMS. These two hub

TEMSes are defined during configuration and the result is stored

in the glb_site.txt file.

This files will normally be identical. If they are not identical

then the FTO logic will break.

A following report section will detail the contents of each

glb_site.txt which should be thoroughly reviewed. It is possible

for differences to be present, such as one that uses resolvable

names and others that use ip addresses and all is well. More

commonly one or more is just referencing an incorrect address…

most are OK and some are wrong. In this case FTO logic will

break and this can cause hub TEMS instability and crashes.

Errors in the DNS resolving system or /etc/hosts file could make

the results inconsitent even though it looks OK.

The data is available if there is an OS Agent running on

the same system as the remote TEMS. In that case, the remote

TEMS glb_site.txt should be reviewed manually.

Recovery plan: Review the glb_site.txt report and reconcile

any differences. That usually means re-configuring the remote

TEMS.

Advisory Messages

CMSFTO1001W – Hub TEMS running FTO but no Backup hub TEMS found

CMSFTO1002E – Hub TEMS running FTO but Backup hub TEMS [tems_nodeid] not running FTO

CMSFTO1003E – Hub TEMS running FTO but remote TEMS [tems_nodeid] not running FTO

CMSFTO1004W – Hub TEMS not running FTO but a Backup hub TEMS[tems_nodeid] was found

CMSFTO1005E – Hub TEMS not running FTO but remote TEMS [tems_nodeid] is running FTO

CMSFTO1006E – Hub TEMS running FTO some remote TEMS not using same glb_site.txt – see later report

CMSFTO1007E – TEMS running with KGLCB_FSYNC_ENABLED=0: risk of database file damage and TEMS outage

*note* This is unrelated to FTO but it is concerning on any Linux/Unix system.
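The first five advisories amount to a cross-check of CMS_FTO settings. The Python sketch below restates the rules as I read them from the message texts above. It is not taken from ftoaudit.pl, and the function name, the argument shapes, and the example TEMS names in the usage line are assumptions for illustration.

```python
def audit_cms_fto(hub_name, hub_fto, backup_name, backup_fto, remotes):
    """Return (advisory_code, object) pairs for CMS_FTO mismatches.

    hub_fto      True if the primary hub TEMS has CMS_FTO=YES
    backup_name  name of the backup hub TEMS, or None if none was found
    backup_fto   True if the backup hub TEMS has CMS_FTO=YES
    remotes      dict of remote TEMS name -> True if CMS_FTO=YES
    """
    advisories = []
    if hub_fto:
        if backup_name is None:
            advisories.append(("CMSFTO1001W", hub_name))
        elif not backup_fto:
            advisories.append(("CMSFTO1002E", backup_name))
        for name, fto in remotes.items():
            if not fto:
                advisories.append(("CMSFTO1003E", name))
    else:
        if backup_name is not None:
            advisories.append(("CMSFTO1004W", backup_name))
        for name, fto in remotes.items():
            if fto:
                advisories.append(("CMSFTO1005E", name))
    return advisories


# One remote TEMS is missing CMS_FTO=YES while the hubs are running FTO.
print(audit_cms_fto("HUB_NMP180", True, "HUB_NMP182", True,
                    {"REM_NMP183": True, "REM_NMP184": False}))
```

For z/OS remote TEMSes the CMS_FTO value has to come from a manual look at KDSENV, as noted under Report Limitations.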

In the report itself, if an advisory is produced, the end of the report includes the impact and a discussion and a recovery plan. If this is unclear you can always contact IBM Support.

Summary

Identify and correct FTO configuration issues. If you find any anomalies which are hard to correct, please contact the author.

Versions:

Here are recently published versions. In case there is a problem at one level you can always back up.

ftoaudit.0.81000
Add check for non-TEPS system

Sitworld: Table of Contents

Note: View from Nepenthe Restaurant, Big Sur California

 
