John Alvord, IBM Corporation
jalvord@us.ibm.com
Draft #1 – 1 May 2018 – Level 1.00000
Inspiration
The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.
This was about diagnosing a situation that produces a lot of results and what that means and what to do. Then it became more complicated and interesting.
This was seen in the Summary Report
Total Result Bytes: 11698309 49.38 K/min Worry[9.88%]
49.38K bytes per minute incoming to a hub TEMS is not usually something to worry about. The standard worry point is 500K bytes per minute. Even so it is always interesting to review situations which are dominating and can be rethought. If case you ever wondered, the maximum incoming result rate ever seen was 93 megs/minute and 128 remote TEMS were crippled.
This was seen in the Report011: Event/Results Budget Situations Report by Result Bytes
EVENTREPORT011: Event/Results Budget Situations Report by Result Bytes
Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT
wlp_fretbsp_grzw_std,KRZTSOVEW,384,180,13,7.26%,0.06,14415,5535360,47.32%,13986,5370624,0,0,0,0,422,162048,0,0,13,1,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,
For this discussion, we look at column 8,9 and 10.
8 Results 14415
9 ResultBytes 5535360
10 Result% 47.32%
22 Nodes 1
These are estimated based on the Event History Status table. If the TEMS has been running a long time and the event history table has wrapped, there could well be unseen workload components. You need a TEMS Audit report to directly measure incoming result workload to be accurate.
This one situation wlp_fretbsp_grzw_std was observed running on a single agent and was producing almost half the incoming result workload. Again the impact isn’t high but it does seem unusual and needs further analysis. The formula excludes a lot of table types and then looks for a percent free to allocated between 5% and 10%. That is the sort of formula that may violate the general rule of good situations – Rare/Exceptional and Fixable. The fact there are so many positive results strongly suggest that it is neither Rare nor Exceptional and that no one is fixing the condition. More Reports that will supply more details.
This was seen in the Report018: Situations processed but not forwarded
EVENTREPORT018: Situations processed but not forwarded
Situation,Count,Nodes,
wlp_fretbsp_grzw_std,13,1,
This means that some events were sent to an event receiver, but not this one. That means that 47% of the TEMS workload provided monitoring for a condition that was not sent to an event receiver. That might explain why no one has fixed the condition. There was a very bad condition once where the hub TEMS was suffering and 80% of the incoming events created by one situation on one AIX Unix OS Agent running on a decommissioned AIX system that was supposed to be powered off. This isn’t always a wrong condition, of course, but anytime you have a non-forwarded situation event it needs to be reviewed and justified.
This was seen in the Report004: Situations with Multiple results at TEMS with same DisplayItem at same second
EVENTREPORT004: Situations with Multiple results at TEMS with same DisplayItem at same second
Situation,Type,TEMS_Second,Results,Agent,Atomize,Atom,
wlp_fretbsp_grzc_std,Sampled,1180429225812000,5,RZ:multip-multip-vaathmr406:RDB,,,
wlp_fretbsp_grzc_std,Sampled,1180429231916000,5,RZ:multip-multip-vaathmr406:RDB,,,
Here we see that Situation Events are being produced and each time there are 5 results. The Atomize and Atom are null which means that no DisplayItem has been set. This also means that events are being lost and never seen. Of course since the situation is not forwarded to an event receiver it may not matter. More information in the detail report later.
This was seen in the Report026: Situations showing high Open<->Close rate
EVENTREPORT026: Situations showing high Open<->Close rate
Situation,Reeval,Rate,Node_ct,PDT
wlp_fretbsp_grzw_std,180,3.37,1,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,
We see here that the situation has a re-evaluation rate of 180 seconds, 3 minutes. Every hour there are an average of 3.37 open->close or close->open transitions [per agent] and only one agent/node is involved.
Situations that show a rapid rate of opening and closing are suspect. It usually means they fail the basic test of Rare/Exceptional. How can a condition be rare if it shows a new open event multiple times an hour. This needs to be examined for reasonableness. The lack of DisplayItem may be causing this effect. When DisplayItems are configured, each result can separately control a situation event and that means there is less internal TEMS confusion. However since the Situation is not forwarded, this may be just internal friction, wasting resources to no benefit.
Deep dive Into the report details
Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.
EVENTREPORT999: Full report sorted by Node/Situation/Time
Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT
Situation – Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.
Node – Managed System Name or Agent Name
Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases
Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.
TEMS_Time – The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.
Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.
Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.
Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.
Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.
DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.
LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information
PDT – The Predicate or Situation Formula as it is stored.
The Descriptor line – before we see the results.
wlp_fretbsp_grzw_std,RZ:multip-multip-vaathmr406:RDB,REMOTE_va10plvtem021,1180429225812000,1180429225815145,Y,180,5,,,8103,*IF *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘TEMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDO’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘RBS’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘ROLLBACK’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘FNTMP’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘2013’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS1’ *AND *SCAN KRZ_RDB_TABLESPACENORMAL_USAGE.Tablespace_Name *NE ‘UNDOTBS2’ *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *GT 5.00 *AND *VALUE KRZ_RDB_TABLESPACENORMAL_USAGE.Percentage_Free_To_Allocated *LE 10.00 *AND *SIT wlp_tbspro_grzc_std *EQ *TRUE,
This is mostly what we have seen before. We do get to see the TEMS and Agent Time. We can also see that this was seen at line 8103 of the input.
Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.
,,,,,,,P,*PREDICATE=STRSCAN(KRZTSNLUE.TNAME, N’TEMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDO’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’RBS’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’ROLLBACK’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’FNTMP’) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’2013′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS1′) = 0 AND STRSCAN(KRZTSNLUE.TNAME, N’UNDOTBS2′) = 0 AND KRZTSNLUE.PCIFREE > 00500 AND KRZTSNLUE.PCIFREE <= 01000 AND KRZTSOVEW.STATUS <> N’READ’,
,
Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and 5 result lines.
,,,,,,,0,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=7168000;KRZTSNLUE.KBUSED=131072000;KRZTSNLUE.KBYTES=138240000;KRZTSNLUE.KLARGEST=7168000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=7000;KRZTSNLUE.MBUSED=128000;KRZTSNLUE.MBYTES=135000;KRZTSNLUE.MLARGEST=7000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9902;KRZTSNLUE.PCIFREE=519;KRZTSNLUE.PCIMAX=103;KRZTSNLUE.PCIUSED=9481;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201605_DAT;KRZTSNLUE.TSFGNUM=1,
,,,,,,,1,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=26214400;KRZTSNLUE.KBUSED=466227200;KRZTSNLUE.KBYTES=492441600;KRZTSNLUE.KLARGEST=26214400;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=25600;KRZTSNLUE.MBUSED=455300;KRZTSNLUE.MBYTES=480900;KRZTSNLUE.MLARGEST=25600;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9653;KRZTSNLUE.PCIFREE=532;KRZTSNLUE.PCIMAX=367;KRZTSNLUE.PCIUSED=9468;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201605_IDX;KRZTSNLUE.TSFGNUM=1,
,,,,,,,2,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=17408000;KRZTSNLUE.KBUSED=234496000;KRZTSNLUE.KBYTES=251904000;KRZTSNLUE.KLARGEST=17408000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=17000;KRZTSNLUE.MBUSED=229000;KRZTSNLUE.MBYTES=246000;KRZTSNLUE.MLARGEST=17000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9825;KRZTSNLUE.PCIFREE=691;KRZTSNLUE.PCIMAX=188;KRZTSNLUE.PCIUSED=9309;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201612_DAT;KRZTSNLUE.TSFGNUM=1,
,,,,,,,3,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=11264000;KRZTSNLUE.KBUSED=206848000;KRZTSNLUE.KBYTES=218112000;KRZTSNLUE.KLARGEST=11264000;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=11000;KRZTSNLUE.MBUSED=202000;KRZTSNLUE.MBYTES=213000;KRZTSNLUE.MLARGEST=11000;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9846;KRZTSNLUE.PCIFREE=516;KRZTSNLUE.PCIMAX=163;KRZTSNLUE.PCIUSED=9484;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201702_DAT;KRZTSNLUE.TSFGNUM=1,
,,,,,,,4,KRZTSNLUE.DBHOSTNAME=vaathmr406;KRZTSNLUE.DBINSTNAME=multip;KRZTSNLUE.DBUNIQNAME=multip;KRZTSNLUE.KBFREE=25395200;KRZTSNLUE.KBUSED=462131200;KRZTSNLUE.KBYTES=487526400;KRZTSNLUE.KLARGEST=25395200;KRZTSNLUE.KMAXSIZE=13421766400;KRZTSNLUE.MBFREE=24800;KRZTSNLUE.MBUSED=451300;KRZTSNLUE.MBYTES=476100;KRZTSNLUE.MLARGEST=24800;KRZTSNLUE.MMAXSIZE=13107194;KRZTSNLUE.ORIGINNODE=RZ:multip-multip-vaathmr406:RDB;KRZTSNLUE.PCIFMAX=9656;KRZTSNLUE.PCIFREE=521;KRZTSNLUE.PCIMAX=363;KRZTSNLUE.PCIUSED=9479;KRZTSNLUE.TIMESTAMP=1180429225646000;KRZTSNLUE.TNAME=CASOL_CR201705_IDX;KRZTSNLUE.TSFGNUM=1,
Reviewing the attributes, it appears that the differentiating attribute is KRZTSNLUE.TNAME which takes on five different values – the last is CASOL_CR201705_IDX.
What are the problems and How to fix them?
The first problem is that there are five results and they are being merged – so four are lost. That could be resolved by configuring DisplayItem to KRZTSNLUE.TNAME.
The second problem is that the Situation event goes open and close 3.17 times per hour. This means it is not Rare/Exceptional and so the formula needs to be reviewed for reasonableness. Most important who is going to fix the condition. If not one is going to fix the condition, the situation should be stopped and deleted. The best performance gain comes from not doing unnecessary work.
The third problem is that the Situation is not forwarded. This is only alerted when some situations are forwarded but not this. This is often seen when a situation is being developed [and tested on a single system] and then forgotten. There can also be good reasons for not-forwarding such as when used by a Workflow Policy or in an *UNTIL/*SIT control to block other situations. However it is most often an oversight and should be stopped and deleted.
Summary
Tale #12 of using Event Audit History is about reviewing a case where half the TEMS workload is coming from a single situation
Sitworld: Table of Contents
History and Earlier versions
There are no binary objects associated with this project.
1.000000
initial release
Photo Note: Open Air Pool – Cruise Ship Build 2015
One thought on “Sitworld: Event History #12 High Impact Situations And Much More”