Sitworld: Event History #2 Duplicate DisplayItems At Same Second

holding_in_place

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 10 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation Hid Events

This was seen in the Event Audit History Advisory section:

50,EVENTAUDIT1013W,TEMS,Situations [4] lost [merged] events Multiple Events with same DisplayItem at same TEMS second – see EVENTREPORT004

And in that report section:

EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second

Situation,Type,Agent_Second,Results,Agent,Atomize,Atom,

ccp_errlog_xul1_aix,Pure,1180331084920000,4,z9x_z9xssodb01:KUL,ULLOGENT.ENTRYDESC,DE3B8540: PATH HAS FAILED,

There is a situation ccp_errlog_xul1_aix. It is Pure [looking at a log]. At the TEMS second 1180331084920000 [2018-03-31 08:49:20] 4 results were seen. The agent name was z9x_z9xssodb01:KUL – a Unix Log Agent. The displayItem was ULLOGENT.ENTRYDESC [the description] and the value that all four results had was “DE3B8540: PATH HAS FAILED, “

Only a single result will be reported, unless you have the TEMS configured for “One Row One Result“.  However the alert from three of the four detected conditions has been lost forever. Monitoring has been degraded since you are not getting all the information that is available. If the information is important, you should probably consider setting a different DisplayItem – if possible. Otherwise you can use the “One Row One Result” configuration.

It is also important to determine if the information is really useful. Probably it is because someone constructed the situation. However in the case it is not really useful, the situation should be stopped and deleted.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

ccp_errlog_xul1_aix,z9x_z9xssodb01:KUL,REMOTE_usrdrtm041ccpr2,1180331084920999,1180331084920000,Y,0,4,ULLOGENT.ENTRYDESC,DE3B8540: PATH HAS FAILED,2602,*IF ( ( *VALUE Log_Entries.Log_Path *EQ ‘/var/adm/ras/’ *AND *VALUE Log_Entries.Log_Name *EQ errlog *AND *VALUE Log_Entries.Type *EQ P *AND *VALUE Log_Entries.Class *EQ Hardware *AND *SCAN Log_Entries.Description *NE 4865FA9B *AND *SCAN Log_Entries.Description *NE 476B351D ) *OR ( *VALUE Log_Entries.Log_Path *EQ ‘/var/adm/ras/’ *AND *VALUE Log_Entries.Log_Name *EQ errlog *AND *VALUE Log_Entries.Type *EQ U *AND *SCAN Log_Entries.Description *EQ 4B6BA416 ) ) *UNTIL ( *TTL 7:00:00:00 ),

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=( ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘P’ AND ULLOGENT.ENTRYCLASS = ‘H’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4865FA9B’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘476B351D’) = 0 ) OR ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘U’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4B6BA416’) = 1 ) ),

Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there were 3 P lines and 4 index lines. More comments follow. Ignore the funny emoticons that some Browsers convert  equal [=] followed by  semicolon [;] into. If needed you can copy/paste the line into a line mode editor for study. Clearly the results were coming in very fast, but apparently they arrived in three separate bundles of 4 total results.

,,,,,,,0,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk7;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk7;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,P,*PREDICATE=( ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘P’ AND ULLOGENT.ENTRYCLASS = ‘H’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4865FA9B’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘476B351D’) = 0 ) OR ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘U’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4B6BA416’) = 1 ) ),

,,,,,,,1,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk9;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk9;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,P,*PREDICATE=( ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘P’ AND ULLOGENT.ENTRYCLASS = ‘H’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4865FA9B’) = 0 AND STRSCAN(ULLOGENT.ENTRYDESC, ‘476B351D’) = 0 ) OR ( ULLOGENT.LOGPATH = ‘/var/adm/ras/’ AND ULLOGENT.LOGNAME = ‘errlog’ AND ULLOGENT.ENTRYTYPE = ‘U’ AND STRSCAN(ULLOGENT.ENTRYDESC, ‘4B6BA416’) = 1 ) ),

,,,,,,,2,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk10;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk10;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

,,,,,,,3,ULLOGENT.ENTRYCLASS=H;ULLOGENT.ENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.ENTRYSRC=hdisk11;ULLOGENT.ENTRYSYS=z9xssodb01;ULLOGENT.ENTRYTIME=1180331084900000;ULLOGENT.ENTRYTYPE=P;ULLOGENT.FREQTHRESH=0;ULLOGENT.LOGNAME=errlog;ULLOGENT.LOGPATH=/var/adm/ras/;ULLOGENT.ORIGINNODE=z9x_z9xssodb01:KUL;ULLOGENT.PERIODTHRS=0;ULLOGENT.TIMESTAMP=1180331084920000;ULLOGENT.UENTRYDESC=DE3B8540: PATH HAS FAILED;ULLOGENT.UENTRYSRC=hdisk11;ULLOGENT.ULOGNAME=errlog;ULLOGENT.ULOGPATH=/var/adm/ras/,

What is the problem and How to fix it?

The agent logic produced 4 results tagged with the same agent time and only a single event was created. This *might* not be any problem if they were all identical – although you might want to see if the agent has controls to suppress duplicate results. Or if the data is boring and no one needs to fix it, adjust the formula or even stop and delete the situation. The biggest performance improvement you can ever achieve is NOT doing unneeded work.

In this particular case, DisplayItem was set to the first 128 characters of ULLOGENT.ENTRYDESC and in fact all four results had “DE3B8540: PATH HAS FAILED” and so all were merged into one event and 3 events were lost. Looking through all the results row attributes I see

0: ULLOGENT.ENTRYSRC=hdisk7

1: ULLOGENT.ENTRYSRC=hdisk9

2: ULLOGENT.ENTRYSRC=hdisk10

3: ULLOGENT.ENTRYSRC=hdisk11

For this particular case, if you made DisplayItem be ULLOGENT.ENTRYSRC you would get the actual four event data that was actually present but three hidden. It  passes a reasonableness test in that the number of cases looks limited. If a DisplayItem has something like a date or time or a callstack, it can cause a TEMS performance data because all cases of situation/node/displayitem status are kept in storage and the TSITSTSC table and it grows and grows forever [or at least until the TEMS is recycled]

This is just one case, so you should review others and talk it through with Subject Matter Experts on the contents of that log.  Your work isn’t done until you understand all the possible ramifications.

Summary

Tale #2 of using Event Audit History to understand and correct one type of Incorrect DisplayItem conditions and thus get more results.

Sitworld: Table of Contents

History and Earlier versions

There are no binary object associated with this project.

1.000000

initial release

Photo Note: Holding Things Together on Cruise Ship Build 2018

 

One thought on “Sitworld: Event History #2 Duplicate DisplayItems At Same Second

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: