Sitworld: ITM and the 1997 Kasparov vs. Deep Blue Chess Match

PfeifferBeach

John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

I was called in for a case where a hub TEMS was showing signs of severe communications overload. The CPU load was not high but the external symptoms were very bad. There were 13 remote TEMSes that sometimes all went offline at the same time. The Portal client session showed frequent long delays.

Diagnosis Part 1

A 38-hour trace of the hub TEMS had been captured. One trace element traced the inner workings of the TEMS SQL processor. It records a line every time processing of a table is complete:

ERROR (UNIT:kdsstc1,Entry:"ProcessTable" ER ST)

The L3 engineer who called me in had noticed many rows of this type:

(51E74248.00A1-4:kdsstc1.c,2097,"ProcessTable")

Table Status = 74, Rowcount = 0, TableName = SP2OS,

Query Type = Insert, TablePath = UIRA.UNIXOS

The traces seemed to come in bursts of 300 to 450 every 3 minutes.
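The burst pattern can be checked with a short script. What follows is only a minimal sketch, not the TEMS Audit logic: it assumes the first hex field of each trace header is the time in epoch seconds, that the detail text follows the header (on the same line or the next ones), and that the real log uses plain double quotes. The function name `count_table_bursts` is made up for this illustration. Under that assumption the sample timestamp 51E74248 is 1374110280, an exact multiple of 180 seconds, which fits a burst landing on a 3 minute clock boundary.

```python
import re
from collections import Counter

# Assumed header layout: (hex_time.seq-thread:file,line,"function")
HEADER = re.compile(
    r'\((?P<ts>[0-9A-Fa-f]{8})\.[0-9A-Fa-f]+-[0-9A-Fa-f]+:'
    r'[\w.]+,\d+,"(?P<func>\w+)"\)'
)
TABLE = re.compile(r'TableName = (\w+)')


def count_table_bursts(lines, window_seconds=180):
    """Count ProcessTable entries per (table name, clock-aligned window start)."""
    counts = Counter()
    pending_ts = None  # time of a ProcessTable header still waiting for its detail line
    for line in lines:
        header = HEADER.search(line)
        if header:
            is_process_table = header.group("func") == "ProcessTable"
            pending_ts = int(header.group("ts"), 16) if is_process_table else None
        table = TABLE.search(line)
        if table and pending_ts is not None:
            # Round the time down to the start of its 3 minute window.
            window_start = pending_ts - pending_ts % window_seconds
            counts[(table.group(1), window_start)] += 1
            pending_ts = None
    return counts
```

Feeding the whole trace file through `count_table_bursts` and sorting the result by window start gives one count per table per 3 minute window, which is enough to see whether a table such as SP2OS dominates each burst.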

I extended TEMS Audit to process and summarize these rows [not published yet]. The following image shows the results.

audit_sp2os

Clearly the SP2OS table needed to be investigated and understood.

Diagnosis Part 2

The second diagnostic, which our communications L3 expert had identified, looked like this:

(51AE718A.0005-49:kdebpli.c,176,"KDEBP_Listen")

listen 16: PLE=1145E1D10, hMon=1C500107, bal=14, thr=56, pipes=28

The number following "listen" is the number of listening pipes. It is normal to see this grow initially to perhaps 2-5 and then stabilize. In this environment the number grew to the limit of 16. That was a true emergency, and incoming communications started to be dropped. It is a tribute to the ITM communications logic and TCP that the environment staggered on and did useful things, but the condition is very abnormal.
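A quick way to see how far the listen count climbed is to pull the number out of every such trace line. This is only a sketch under the assumption that the detail line always starts with the word "listen" followed by the count and a colon, as in the sample above. The helper names and the `limit` parameter are mine, with 16 taken from the limit mentioned in this case.

```python
import re

# Matches the detail text of a KDEBP_Listen entry, for example:
#   listen 16: PLE=1145E1D10, hMon=1C500107, bal=14, thr=56, pipes=28
LISTEN = re.compile(r'\blisten (\d+):')


def listen_counts(lines):
    """Return every listen pipe count found, in log order."""
    return [int(m.group(1)) for line in lines if (m := LISTEN.search(line))]


def peak_listen(lines, limit=16):
    """Return (highest count seen, True if that count reached the limit)."""
    peak = max(listen_counts(lines), default=0)
    return peak, peak >= limit
```

A healthy log would be expected to report a peak of perhaps 2 to 5. A peak equal to the limit is the condition described here.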

Searching for SP2OS

In that SQL diagnostic trace SP2OS is an attribute table. The ProcessTable results showed it was created by the Unix OS Agent.  From the dockux file which the TEPS uses I could see the attributes involved. Here is the start.

*OBJECT:    System, SP2System

*TABLE:     UNIXOS SP2OS B

*FILE:      UIRA.UNIXOS

Later you see the many attribute value definitions like this:

*ATTR:      Total_Real_Memory

*CAPTION:   Total Real Memory (KBytes)

*COLUMN:    TOTREALMEM

The System attribute group is well known and often used. I had never noticed the SP2System attribute group before, and it looked like the attribute values were identical to System. I cross-checked the catalog file kux.cat and the data values were identical, but there were some added attributes in the SP2OS table:

VROW

VROWNO

VSAMPLENO

VSYSTEMNAM

VTIMESTAMP

I looked through the history of the kux.sql file. That is the file that initializes objects connected with the Unix OS Agent. On 16 October 1996 a Candle Corporation engineer checked in the changes that introduced SP2OS with this comment:

# Added new situation UADVISOR_OMUNX_SP2OS and OMUNX.SP2OS rule for SP2 support.

Those comments were not present in the latest maintenance levels. Candle was acquired by IBM in 2004, so this check-in happened 8 years before that.

Customer Environment Analysis

I reviewed the TNODESAV table – roughly equivalent to the INODESTS table. This environment had over 20K agents and there were 1895 Unix OS Agents – mostly AIX systems.

Searching for SP2

After some searching, I found part of the answer. In the middle 1990s IBM conducted some computer chess exhibitions using its supercomputer SP2, which was based on AIX technology. That computer had a maximum of 8 CPUs. There were two matches against the World Chess Champion, Garry Kasparov, in 1996 and in 1997. Here is a reference: Deep Blue versus Garry Kasparov.

I have been unable to locate the Candle engineer involved. However the evidence shows that IBM Research and Candle worked to put together an early version of ITM for this SP2 supercomputer between the first and second matches. That was a true surprise to me since in Jan 1997 the first release was delivered for AS/400 systems only.

What does SP2OS DO!!!!!

The installed program objects are

TSITDESC – Situation Description

UADVISOR_OMUNX_SP2OS

SITDB Situation rules [SQL]

UADVISOR_OMUNX_SP2OS____________

OMUNX.SP2OS

These are installed on all hub and remote TEMS as part of Unix OS Agent support.

The situation has AUTOSTART=*SYN, which means it runs in synchronous mode. The reevaluation interval is 3 minutes. When this is active, it fires every three minutes by the clock: 10:00:00, 10:03:00, 10:06:00, etc.
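The effect of clock synchronization can be illustrated with a few lines of arithmetic. This is not the ITM scheduler, only a sketch of what "every three minutes by the clock" implies: the function `next_fire` is hypothetical and assumes times are handled as epoch seconds.

```python
def next_fire(now_epoch, interval_seconds=180):
    """Return the next clock-aligned evaluation time strictly after now_epoch."""
    return now_epoch - now_epoch % interval_seconds + interval_seconds
```

Whatever moment an agent starts at within a 3 minute window, `next_fire` returns the same boundary. Every agent running a synchronized situation therefore reports at the same instant, with no natural spreading of the load.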

Installing the SITDB situation rules means that the situations are effectively already started. There is no need for distribution. The actual running situation is OMUNX.SP2OS. That SQL rule includes

WHERE SYSTEM.PARMA("NODELIST","*ALL",4)

That means that each agent connecting to that remote TEMS will receive a start command for that object. Agents ignore start requests for unknown attributes. Thus OMUNX.SP2OS will start up on every Unix OS Agent in the ITM environment. It will run every 3 minutes, synchronized to the clock. Each run produces one row of result data of about 3K bytes.

The remote TEMS has logic that understands the incoming requests and forwards them onward to the hub TEMS. There is an open SQL request to minimize cost.

The hub TEMS receives the incoming data and collects it in a virtual table, which means an in-storage table. That table is a convenient place to view all the results which have been sent from all the Unix OS Agents. It is convenient, of course, but it is horribly wrong!!

Nil Usage of the SP2OS Virtual Table

There are no product-provided situations that use the table.

I examined a TEPS saveexport.sql [dump of TEPS database] and there were no product provided or customer queries associated with the table.

And why should there be? The attributes for System and SP2System are identical. Modern ITM improvements make it simple to gather summarized ITM data from all agents directly.

SP2OS Associated Problems

The biggest problem occurred as 1985 Unix OS Agents sent in a total of 1985 rows of data, all at the same clock time, every 3 minutes. That overwhelmed ITM communications and caused a listen pipe shortage. That in turn caused many other side effects, such as the remote TEMSes going offline en masse and bad response time for portal client sessions.

The ProcessTable trace showed that only 300-450 updates were processed even though 1985 agents were sending. That alone shows how badly communications were overloaded.
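The size of the problem can be worked out from the figures quoted above. The sketch below uses the 1985 agents and the 300-450 processed updates from this section, and reads "about 3K bytes" as 3 x 1024 bytes, which is my approximation. The variable and function names are illustrative only.

```python
AGENTS = 1985             # Unix OS Agents sending at each 3 minute boundary
ROW_BYTES = 3 * 1024      # "about 3K bytes" per row, approximated as 3 KiB

# Data arriving at the hub TEMS at the same instant, every 3 minutes.
burst_bytes = AGENTS * ROW_BYTES
burst_mib = burst_bytes / (1024 * 1024)


def processed_percent(processed, agents=AGENTS):
    """Percentage of the rows sent that the ProcessTable trace accounted for."""
    return round(100 * processed / agents, 1)


low, high = processed_percent(300), processed_percent(450)
print(f"burst of about {burst_mib:.1f} MiB; {low}% to {high}% of rows processed")
```

The volume itself is modest, roughly 6 MB per burst. The damage comes from all of it arriving at one instant, and from fewer than a quarter of the rows being processed in each cycle.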

Recovery Action

The following SQL deletes the related objects at the hub TEMS and all the remote TEMS.

DELETE FROM O4SRV.TSITDESC
WHERE SITNAME = 'UADVISOR_OMUNX_SP2OS';

DELETE FROM O4SRV.TSITDESC AT('REMOTE_1')
WHERE SITNAME = 'UADVISOR_OMUNX_SP2OS';

DELETE FROM O4SRV.SITDB
WHERE RULENAME = 'OMUNX.SP2OS';

DELETE FROM O4SRV.SITDB AT('REMOTE_1')
WHERE RULENAME = 'OMUNX.SP2OS';

DELETE FROM O4SRV.SITDB
WHERE RULENAME = 'UADVISOR_OMUNX_SP2OS____________';

DELETE FROM O4SRV.SITDB AT('REMOTE_1')
WHERE RULENAME = 'UADVISOR_OMUNX_SP2OS____________';

Each object needs one statement for the hub TEMS and one for each remote TEMS, where REMOTE_1 stands for the name of a remote TEMS. Appendix 1 has instructions for how to run the SQL.
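With 13 remote TEMSes, writing the statements by hand is tedious and error prone. The sketch below generates the full SQL file from a list of remote TEMS names. The table names, column names and values are copied from the statements above. The function name `build_delete_sql` and the example names REMOTE_1 and REMOTE_2 are placeholders to be replaced with the real remote TEMS names.

```python
# (table, key column, key value) for each object to remove, as listed above.
TARGETS = [
    ("O4SRV.TSITDESC", "SITNAME", "UADVISOR_OMUNX_SP2OS"),
    ("O4SRV.SITDB", "RULENAME", "OMUNX.SP2OS"),
    ("O4SRV.SITDB", "RULENAME", "UADVISOR_OMUNX_SP2OS____________"),
]


def build_delete_sql(remote_tems):
    """Build the DELETE statements for the hub TEMS and each named remote TEMS."""
    statements = []
    for table, column, value in TARGETS:
        # Hub TEMS copy first, with no AT() clause.
        statements.append(f"DELETE FROM {table}\nWHERE {column} = '{value}';")
        # Then one copy per remote TEMS, routed with AT('<name>').
        for name in remote_tems:
            statements.append(
                f"DELETE FROM {table} AT('{name}')\nWHERE {column} = '{value}';"
            )
    return "\n\n".join(statements) + "\n"


if __name__ == "__main__":
    # Placeholder names; substitute the real remote TEMS names.
    print(build_delete_sql(["REMOTE_1", "REMOTE_2"]))
```

Redirecting the output to a file such as delsp2os.sql gives the input for the procedure in Appendix 1.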

The hub and remote TEMSes were all restarted.

ITM communications configuration

In this large environment, the pipe listen limit was too small. The following control was added to the hub TEMS by updating the /tables//KBBENV file:

KDEP_SERVICETHREADS=32

Customer results

There was a TEPS issue resolved in the same time period. The performance is now much better and there are no further cases of remote TEMS going offline.

APARs

Three APARs were created from this activity:

IV47363

SITUATION UADVISOR_OMUNX_SP2OS MUST BE DISABLED BY DEFAULT.

Status: this is included in ITM 623 FP4/ITM 630 FP3. In the next major ITM release the objects will be removed entirely.

IV47465

UADVISOR SITUATIONS FOR ENTERPRISE VIRTUAL TABLES MUST BE DISABLED

Status: this relates to similar objects in the ITM Agent for Microsoft SQL Server and 3 other agents. That needs more work since there are a few product-provided workspace views that use them.

IV47060

COMMUNICATION FAILURE FROM PIPE LISTEN LIMIT (16)

Status: The default will be raised to 32 in a future release.

Summary

A good mystery puzzle resolved satisfactorily.

The issue was almost invisible until this very large environment was affected badly.

Sitworld: Table of Contents

Appendix 1

Model directions for Linux/Unix

1) Login to the system running the TEPS.

2) Transfer the SQL to /opt/IBM/ITM/tmp/delsp2os.sql

3) cd /opt/IBM/ITM/bin

4) ./itmcmd execute cq "KfwSQLClient /f /opt/IBM/ITM/tmp/delsp2os.sql"

Model directions for Windows

1) Login to the server running the TEPS.

2) Transfer the SQL to c:\temp\delsp2os.sql

3) cd c:\IBM\ITM\cnps

4) KfwSQLClient /f c:\temp\delsp2os.sql

This technique uses the existing TEPS->TEMS connection which is usually simplest.

Photo Note: Painting of Pfeiffer Beach in Big Sur by a dear friend.
