by John Alvord, IBM Corporation
On December 25, 2013, a sev1 PMR arrived for an unstable remote TEMS. The remote TEMS ran for three hours and then went haywire: agents could not connect and the hub TEMS decided the remote TEMS was offline. The condition repeated after a remote TEMS recycle. Nineteen remote TEMSes were working fine, but two were behaving in this strange way.
I used TEMS Audit to summarize the TEMS diagnostic logs. The customer had wisely taken a pdcollect for the two remote TEMSes and the hub TEMS and uploaded them with the PMR. Even without workload tracing you can learn things from the ERROR level trace messages.
The audit summary showed two significant issues:
There were two situations identified as “Filter object too big”:
xx1_errlog_gmq4_std QMERRLOG 33248 + 7272
xx2_errlog_gmq4_std QMERRLOG 33248 + 7272
This can result in a very high TEMS incoming workload… basically workload with no filtering [at all!!] at the agent. Here is an example of one of the formulas, with 21 tests:
*IF ( ( *VALUE Error_Log.Message_ID *EQ AMQ5008 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ6064 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ6166 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ6708 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ6709 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ6767 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ7017 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ7463 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ7465 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ7466 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ7469 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ7472 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ8149 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9505 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9506 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9507 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9526 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9534 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9543 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9544 ) *OR
( *VALUE Error_Log.Message_ID *EQ AMQ9556 ) )
How long a situation formula can be has been a source of confusion “forever”, because the Situation Editor capacity gauge appears to give the answer. However, that gauge reflects only the first of five situation limits. These situations exceeded the stage four limit, where the translated WHERE clause is too large to transmit to the agent.
Everyone developing situations should be aware of these limits, so we will look at them after covering the second issue.
The second identified issue involved Listen Threads. The TEMS Audit summary showed
Listen Pipe Report listen=12 balance=90 threads=1080 pipes=848
This listen=12 means that 12 “listen threads” were allocated to manage the communications workload. A well running TEMS will show 3 or 4. The default maximum is 16. This is a strong indicator that the TEMS is stressed by too much communications traffic. The situations above are a likely cause of that stress.
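As a rough illustration of checking that report line automatically, here is a minimal Python sketch. The function names are hypothetical, and the threshold of 4 is simply the "3 or 4" guideline above, not a product-defined value.

```python
import re

# Guideline from the text: a well running TEMS shows 3 or 4 listen threads.
TYPICAL_MAX_LISTEN = 4


def parse_listen_report(line: str) -> dict:
    """Extract the key=value counters from a Listen Pipe Report line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def looks_stressed(report: dict) -> bool:
    """True when more listen threads are in use than a healthy TEMS usually shows."""
    return report.get("listen", 0) > TYPICAL_MAX_LISTEN


report = parse_listen_report(
    "Listen Pipe Report listen=12 balance=90 threads=1080 pipes=848"
)
print(report, "stressed:", looks_stressed(report))
```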
ITM Situation Limits
ITM situation formulas are limited in size and complexity at four separate processing stages. There is also a limit on returned data, which makes a fifth stage.
Stage 1: Portal Client (TEP) Situation editor
The Portal Client or TEP has a graphical interface which is used to author situations. One aspect of a situation is the formula, which defines when the situation will be true. Here is a part of the formula tab
Near the bottom, there is a Situation Formula Capacity gauge. This relates to the formula definition storage area, which is exactly 1020 bytes in size. For example, take this formula:
*IF *VALUE Process.Command_U *EQ 'ma? *'
That is exactly 40 characters. 40/1020 is 3.9%, or 3% rounded downward. You can see the stored formula in the PDT tag of an exported situation using
tacmd viewsit -s sitname -e sitname.xml
The 1020 character limit is absolute. Follow best practice and avoid coming close to the limit. There was a recent example where a situation that was very close to the limit was exported as above and that situation could not be imported back using tacmd createsit -i sitname.xml. The exported situation PDT had expanded somewhat because spaces had been added before and after some parentheses which made the result 1024 characters long. This is not something to discover when you are using the exported situation for backup or moving to a new TEMS.
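The gauge arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 20 character safety margin is an arbitrary assumption chosen to leave room for whitespace added on export.

```python
PDT_LIMIT = 1020  # formula storage size stated in the text


def pdt_capacity_percent(formula: str) -> int:
    """Percentage of the formula storage used, rounded downward."""
    return len(formula) * 100 // PDT_LIMIT


def fits_with_margin(formula: str, margin: int = 20) -> bool:
    """Hypothetical check: keep `margin` characters free below the hard limit."""
    return len(formula) + margin <= PDT_LIMIT


formula = "*IF *VALUE Process.Command_U *EQ 'ma? *'"
print(len(formula), "characters,", pdt_capacity_percent(formula), "% of capacity")
```

A formula that passes `fits_with_margin` still needs checking against the later stages, since the 1020 character limit is only the first of the five.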
Stage 2: TEMS rule processing
When a situation is started, the formula is converted to SQL that the TEMS dataserver [SQL processor] will use to define the situation. For our case, here is what the SQL looks like
SELECT BCMD, CPUID, CPUPERCENT, CPUTIME, CPU, CHILDSTIME, CHILDUTIME, CMD, UCMD, CONTSWITCH, CONTSWIT64, EGID, EGRPN, EUID, EUSERN, ELAPTIME, ADDR, EVENT, EXECSTATE, FLAG, GRPN, HEAP, INVCONTSWT, INVCONTS64, MAJORFAULT, MAJORFAU64, MEMPERCENT, MINORFAULT, MINORFAU64, NICE, PPID, PRIORITY, COMMAND, UCOMMAND, PROCCOUNT, PGID, PID, READWRITE, READWRI64, GID, SCHEDCLASS, SESSIONID, SIZE, STACK, STARTTIME, SYSTEMTIM, ORIGINNODE, TTY, THREADCNT, TIME, TIMESTAMP, TOTCPUPERC, TOTALTIME, CHILDTIME, SYSTEMTYPE, USERTIME, UID, USERNAME, UUSERNAME, VSIZE, WAITCPUTIM, WAITLKTIME, ZONEID, ZONENAME FROM OMUNX.UNIXPS WHERE SYSTEM.PARMA("SITNAME", "IBM_man_wildcard", 16) AND SYSTEM.PARMA("NUM_VERSION", "2", 1) AND SYSTEM.PARMA("LSTDATE", "1100210164631000", 16) AND SYSTEM.PARMA("SITINFO", "TFWD=Y;OV=N;TDST=0;", 19) AND UNIXPS.UCMD LIKE N'ma? *' ;
When a situation fires, the agent captures all attributes, not just the ones used in the formula, so this SELECT starts with the list of all attributes. Next come several SYSTEM.PARMA clauses that convey information about the situation. Finally, the formula conditions appear. The resulting SQL text is 841 characters long. The SQL is stored in a 2998 character column, so this uses about 28%.
I have never seen this to be a limiting factor, but it could be an issue if an application had a very high number of attribute names.
Stage 3: Dataserver SQL Compile Process
When a situation is started and a TEMS is aware that an agent will require the situation, the situation rule or SQL is processed through the data server. The data server itself has limits. For example, there is a limit on the number of functions like *SCAN [find string within string] and *STR [compare string in fixed position]. If there are too many functions, the compile can fail, and an error status will be seen at the Portal Client. The exact limit is not known because the data server has optimization heuristics which can merge similar tests and thus require fewer actual functions.
Another recent case involved a long action command. At 444 bytes the action command worked fine. At 448 bytes the compiler complained about a too long literal and another compile failure occurred.
Stage 4: Situation Transmission to Agent
A soft limit exists when the situation constraints are transmitted to the agent. It is called soft because correct results are obtained but at a very high cost. From ITM 623 FP4 and ITM 630 FP2 this condition will result in a TEMS_Alert situation event. The goal is to warn the developer of this potential performance issue instead of experiencing the TEMS instability.
Here is the error message from the TEMS diagnostic log
Filter object too big (96800 + 21904),Table UNIXCPU Situation .
The first number is the filter plan, which has a limit of 32767 characters.
The second number is the filter pool or constants, which also has a limit of 32767 characters.
The number of characters does not directly compare to the size of the SQL. This is a C++ object which has been “flattened” so it can be sent via a remote procedure call.
This error is most often seen when there is a large number of WHERE conditions. If *OR and *AND are mixed in the formula, that can drastically increase the size of the SQL and thus the filter object.
In this case, the agent does the work without a filter object. All rows for the attribute group are returned to the TEMS each evaluation period and the TEMS does the filtering. For a pure situation, every potential event is sent to the TEMS. While this produces correct results, the increase in communications traffic and TEMS processing is dramatic and should always be avoided.
If you are developing a new situation and are worried about the issue, best practice is to review the TEMS diagnostic log for the message above and then reduce complexity of the situation if required.
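A minimal Python sketch of that log review might look like the following. The regular expression is inferred from the single sample message shown above, so treat the exact format as an assumption; the function name is hypothetical.

```python
import re

FILTER_LIMIT = 32767  # limit for both the filter plan and the filter pool

# Pattern inferred from the one sample message in the text; real logs may differ.
PATTERN = re.compile(
    r"Filter object too big \((\d+) \+ (\d+)\),\s*Table (\S+) Situation\s*(\S*)"
)


def find_oversized_filters(log_lines):
    """Yield (table, situation, plan, pool) for each 'too big' message found."""
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            plan, pool = int(match.group(1)), int(match.group(2))
            situation = match.group(4).rstrip(".")
            yield match.group(3), situation, plan, pool


sample = ["Filter object too big (96800 + 21904),Table UNIXCPU Situation ."]
for table, situation, plan, pool in find_oversized_filters(sample):
    print(table, situation or "(name not shown)",
          "plan over limit:", plan > FILTER_LIMIT,
          "pool over limit:", pool > FILTER_LIMIT)
```

In the sample message it is the filter plan (96800) that exceeds 32767, while the filter pool (21904) is within its limit.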
Here is an example of a situation that fits in the Situation Editor but just barely squeezes into the filter object limits. There are 20 elements in the *IN clause:
*IF *VALUE NT_Event_Log.Log_Name *EQ System *AND
*VALUE NT_Event_Log.Source *EQ Wins *AND
*VALUE NT_Event_Log.Event_ID_String *IN ('4177','4241','4273','4305','4178','4242','4274','4306','4179','4211','4243','4275','4307','4180','4212')
The PDT usage is about 29%. The filter object is about 31K, almost up to the 32767 byte limit. This is easy to explain: the *IN clause is a shortcut for a series of *VALUE *EQ clauses *OR’d together. Thus the PDT representation is smaller, but the other stages are just as long. If you add just one more test, the “Filter object too big” problem occurs.
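To make the *IN shortcut concrete, here is a small Python sketch that writes out the equivalent chain of *EQ tests and compares the two text lengths. It only illustrates the equivalence described above; it is not how the TEMS performs the translation, and both function names are hypothetical.

```python
def in_clause(attribute: str, values: list) -> str:
    """Compact form, as it would appear in the situation formula."""
    quoted = ",".join(f"'{value}'" for value in values)
    return f"*VALUE {attribute} *IN ({quoted})"


def expand_in_clause(attribute: str, values: list) -> str:
    """Equivalent long form: one *VALUE ... *EQ test per value, *OR'd together."""
    tests = [f"( *VALUE {attribute} *EQ {value} )" for value in values]
    return "( " + " *OR ".join(tests) + " )"


event_ids = ["4177", "4241", "4273"]
compact = in_clause("NT_Event_Log.Event_ID_String", event_ids)
expanded = expand_in_clause("NT_Event_Log.Event_ID_String", event_ids)
print(len(compact), "characters compact,", len(expanded), "characters expanded")
```

The compact form is what counts against the 1020 character limit, while the later stages see something closer to the expanded form.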
Stage 5: Situation Results Data Return Limit
When the situation evaluates at the agent, candidate rows are prepared for return. There is an undocumented hard limit: at most 16 megabytes of data will be returned. If 28 megabytes are prepared, only 16 megabytes will be sent to the TEMS.
If you check the Agent diagnostic log you will see a message like this
(50C7605E.0000-7:kraarpcm.cpp,519,"Sample") *ERROR: Sample data
exceeded RPC maximum transmission limit, 4915 of 37572 rows sent
Context P<185599208> A<250610650>
In this case only 4915 rows fit in 16 megabytes; the other 32,657 rows were left behind.
The same condition can exist for real time data requests from the Portal Client or other mechanisms.
The only solution here is to be aware of the potential issue and check for that message if you suspect results are missing.
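Checking for that message can be scripted. The following Python sketch scans agent log text for the truncation message and reports how many rows were dropped; the pattern is based only on the sample message above, and the function name is hypothetical.

```python
import re

# Pattern based on the sample message in the text; other releases may word it differently.
PATTERN = re.compile(
    r"exceeded RPC maximum transmission limit,\s*(\d+) of (\d+) rows sent"
)


def truncated_samples(log_text: str) -> list:
    """Return (rows_sent, rows_prepared, rows_dropped) for each truncation message."""
    # Messages can wrap across lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (int(sent), int(total), int(total) - int(sent))
        for sent, total in PATTERN.findall(flat)
    ]


sample = '''(50C7605E.0000-7:kraarpcm.cpp,519,"Sample") *ERROR: Sample data
exceeded RPC maximum transmission limit, 4915 of 37572 rows sent
Context P<185599208> A<250610650>'''

for sent, total, dropped in truncated_samples(sample):
    print(f"{sent} of {total} rows sent, {dropped} rows dropped")
```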
Situation Formula Overrides
From ITM 621, a situation developer can define Formula Overrides. There is a limit to how many overrides can be supplied. You will see this in two ways: the overrides do not work and there is an error message in the TEMS log like this:
(4C59FF0A.0000-6:ko4ovrd.cpp,168,"Override::Override") Error: Override <IBM_check_ovr__3A44FD0A864C4350>
sit <IBM_check_ovr> size <5062> exceeds limit 4000
(4C59FF0A.0001-6:ko4ovrmg.cpp,172,"OverrideManager::getTaskFromName") Fatal err in override <IBM_check_ovr__3A44FD0A864C4350> construction. Giving up.
The XML file that defines the overrides for a situation is created at the TEMS and the text must fit into a table column of 4000 characters. The actual size requirement for a single override varies tremendously depending on the names and values of the key columns and the names and values of the override formula. In one case the limit was 26 overrides. In another simpler case, it was 38 overrides.
If you must use many overrides, then create them, start the situation and review the TEMS diagnostic log for such error messages. Then reduce the number until they fit.
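That review can also be scripted. This Python sketch looks for the override size error and reports by how much each override definition exceeds the limit. The pattern is taken from the sample log lines above and is an assumption about the general format; the function name is hypothetical.

```python
import re

# Pattern taken from the sample log lines in the text.
PATTERN = re.compile(
    r"Override <([^>]+)> sit <([^>]+)> size <(\d+)> exceeds limit (\d+)"
)


def oversized_overrides(log_text: str) -> list:
    """Return (override, situation, size, limit, excess) for each size error."""
    # The message wraps across lines in the log, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [
        (override, sit, int(size), int(limit), int(size) - int(limit))
        for override, sit, size, limit in PATTERN.findall(flat)
    ]


sample = '''(4C59FF0A.0000-6:ko4ovrd.cpp,168,"Override::Override") Error: Override <IBM_check_ovr__3A44FD0A864C4350>
sit <IBM_check_ovr> size <5062> exceeds limit 4000'''

for override, sit, size, limit, excess in oversized_overrides(sample):
    print(f"{sit}: override text is {size} characters, {excess} over the {limit} limit")
```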
The customer converted the 2 existing situations into 3 situations with fewer tests and the remote TEMS workload dropped. The remote TEMS became stable.
Summary: Simplify and don’t push against the limits
Avoid creating long, complicated situations that push against ITM product limits. If you do run into the limits, then create two situations to do the work. There are virtually no limits to the number of situations running, so run simpler situations whenever you have a choice.
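As an illustration of splitting the work, this Python sketch divides the 21 message IDs from the problem formula into two shorter formulas. The function name is hypothetical and the cap of 11 tests per situation is an arbitrary assumption; the right number is whatever keeps the filter object under its limits in your TEMS diagnostic log.

```python
def split_into_formulas(attribute: str, values: list, max_tests: int) -> list:
    """Split one long *OR chain into formulas with at most max_tests tests each."""
    formulas = []
    for start in range(0, len(values), max_tests):
        chunk = values[start:start + max_tests]
        tests = " *OR ".join(f"( *VALUE {attribute} *EQ {value} )" for value in chunk)
        formulas.append(f"*IF ( {tests} )")
    return formulas


message_ids = [
    "AMQ5008", "AMQ6064", "AMQ6166", "AMQ6708", "AMQ6709", "AMQ6767", "AMQ7017",
    "AMQ7463", "AMQ7465", "AMQ7466", "AMQ7469", "AMQ7472", "AMQ8149", "AMQ9505",
    "AMQ9506", "AMQ9507", "AMQ9526", "AMQ9534", "AMQ9543", "AMQ9544", "AMQ9556",
]

# max_tests=11 is an arbitrary example value, not a documented threshold.
formulas = split_into_formulas("Error_Log.Message_ID", message_ids, max_tests=11)
for number, formula in enumerate(formulas, start=1):
    print(f"situation {number}: {formula.count('*EQ')} tests, {len(formula)} characters")
```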
Use TEMS Audit to identify problem cases before they cause instability.
Notes: Turtle on the Sand – Virgin Islands 1988