John Alvord, IBM Corporation
jalvord@us.ibm.com
Draft #1 – 25 May 2018 – Level 1.00000
Inspiration
The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.
This was seen in the Summary Section
Total Result Bytes: 1023369249 989.47 K/min Worry[197.89%]
This environment is receiving almost one megabyte of results data per minute. Experience has shown that problems often occur if the result rate is over 500K per minute. That is the source of the “worry” percentage. Your mileage may vary based on the server running the workload. Even if the server can handle the workload it is never a good idea to perform useless work.
This was seen in the Report011: Event/Results Budget Situations Report by Result Bytes
EVENTREPORT011: Event/Results Budget Situations Report by Result Bytes
Situation,Table,Rowsize,Reeval,Event,Event%,Event/min,Results,ResultBytes,Result%,Miss,MissBytes,Dup,DupBytes,Null,NullBytes,SampConfirm,SampConfirmBytes,PureMerge,PureMergeBytes,transitions,nodes,PDT
deb_prccpu_xuxw_aix,UNIXPS,2784,60,978,13.46%,0.97,269576,750499584,73.34%,0,0,0,0,0,0,269576,750499584,0,0,978,39,*IF *VALUE Process.CPU_Pct *GE 1.00 *AND *VALUE Process.Process_Command_U *EQ ‘/opt/BESClient/bin/BESClient’ *AND *VALUE Process.CPU_Pct *LT 4.00,
So there is a situation deb_prccpu_xuxw_aix which runs every 60 seconds and checks for one process and alerts the CPU% is between 1% and 4%. It runs on 39 agents connected to this remote TEMS.
Remarkably, this one situation causes an estimated 73.34% of the total estimated workload. This is an estimate because the data does not include information about situations which started before the Event Status History data. The actual result data can be higher because of real time data requests.
Deep dive Into the report details
Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.
This will show only a single open event and then close event, but there were many listed in the full report.
EVENTREPORT999: Full report sorted by Node/Situation/Time
Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT
Situation – Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.
Node – Managed System Name or Agent Name
Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases
Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.
TEMS_Time – The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.
Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.
Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.
Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.
Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.
DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.
LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information
PDT – The Predicate or Situation Formula as it is stored.
The Descriptor line – before we see the results.
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410002104999,1180410001843000,Y,60,1,,,3850,*IF *VALUE Process.CPU_Pct *GE 1.00 *AND *VALUE Process.Process_Command_U *EQ ‘/opt/BESClient/bin/BESClient’ *AND *VALUE Process.CPU_Pct *LT 4.00,
Situation was deb_prccpu_xuxw_aix, agent was deb_gb02cap070debx7:KUX. thrunode was REMOTE_gbnhham080tmsxm. Agent_time was 1180410002104999 and TEMS_time was 1180410001843000, so the agent is running a few minutes ahead. It was an Open event [Y], sampling interval was 600, there was one result, there was no DisplayItem. The record came from line number 3850 in the input. The PDT is shown at the end.
Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.
,,,,,,,P,*PREDICATE=UNIXPS.CPUPERCENT >= 100 AND UNIXPS.UCOMMAND = N’/opt/BESClient/bin/BESClient’ AND UNIXPS.CPUPERCENT < 400,
Following the predicate is one or more result lines. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and one result line. Sometimes there are many more but not this time.
,,,,,,,0,UNIXPS.ADDR=b424b459;UNIXPS.BCMD=BESClient;UNIXPS.CHILDSTIME=0;UNIXPS.CHILDTIME=0;UNIXPS.CHILDUTIME=0;UNIXPS.CMD=/opt/BESClient/bin/BESClient;UNIXPS.COMMAND=/opt/BESClient/bin/BESClient;UNIXPS.CONTSWIT64=460694;UNIXPS.CONTSWITCH=460694;UNIXPS.CPU=120;UNIXPS.CPUID=-1;UNIXPS.CPUPERCENT=144;UNIXPS.CPUTIME=62;UNIXPS.EGID=0;UNIXPS.EGRPN=system;UNIXPS.ELAPTIME=000d04:25:13;UNIXPS.EUID=0;UNIXPS.EUSERN=root;UNIXPS.EVENT=*;UNIXPS.EXECSTATE=A;UNIXPS.FLAG= 40001;UNIXPS.GID=0;UNIXPS.GRPN=system;UNIXPS.HEAP=-1;UNIXPS.INVCONTS64=11490;UNIXPS.INVCONTSWT=11490;UNIXPS.MAJORFAU64=9;UNIXPS.MAJORFAULT=9;UNIXPS.MEMPERCENT=100;UNIXPS.MINORFAU64=65168;UNIXPS.MINORFAULT=65168;UNIXPS.NICE=22;UNIXPS.ORIGINNODE=deb_gb02cap070debx7:KUX;UNIXPS.PGID=8847390;UNIXPS.PID=11206814;UNIXPS.PPID=1;UNIXPS.PRIORITY=64;UNIXPS.PROCCOUNT=1;UNIXPS.PSU=10792;UNIXPS.RDS=10792;UNIXPS.READWRI64=610219238;UNIXPS.READWRITE=610219238;UNIXPS.RTS=5177;UNIXPS.SCHEDCLASS=N/A;UNIXPS.SESSIONID=8847390;UNIXPS.SIZE=63876;UNIXPS.STACK=-1;UNIXPS.STARTTIME=1
180409195521000;UNIXPS.SYSTEMTIM=000d00:00:11;UNIXPS.SYSTEMTYPE=AIX;UNIXPS.TEXT_SIZE=15207;UNIXPS.THREADCNT=4;UNIXPS.TIME=00001:02;UNIXPS.TIMESTAMP=1180410002034000;UNIXPS.TOTALTIME=000d00:01:02;UNIXPS.TOTCPUPERC=38;UNIXPS.TTY=-;UNIXPS.UCMD=/opt/BESClient/bin/BESClient;UNIXPS.UCOMMAND=/opt/BESClient/bin/BESClient;UNIXPS.UID=0;UNIXPS.UPROCFILT=;UNIXPS.USERNAME=root;UNIXPS.USERTIME=000d00:00:51;UNIXPS.UUSERNAME=root;UNIXPS.VSIZE=58372;UNIXPS.WAITCPUTIM=0;UNIXPS.WAITLKTIME=0;UNIXPS.WLM_NAME=Unclassified;UNIXPS.WPAR_NAME=Global;UNIXPS.ZONEID=-1;UNIXPS.ZONENAME=-1,
Here is where I extracted the value result. This is raw data and represents 1.44%.
==> UNIXPS.CPUPERCENT=144
I will skip repeating the full details. Next you see the results coming in false and then true. Each time it is true I record the value.
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410003604999,1180410003343000,N,60,0,
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410013304999,1180410013043000,Y,60,1
==> UNIXPS.CPUPERCENT=336
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410013504999,1180410013243000,N,60,0,
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410020704999,1180410020444000,Y,60,1
==> UNIXPS.CPUPERCENT=110
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410020804999,1180410020543000,N
deb_prccpu_xuxw_aix,deb_gb02cap070debx7:KUX,REMOTE_gbnhham080tmsxm,1180410043404999,1180410043143000,Y,60,1
==> UNIXPS.CPUPERCENT=121
What is the problem and How to fix it?
From this capture, 13.46% of the events and 73.34% of the result workload was from this situation. This from only 39 agents!
Doing that work constitutes a substantial investment. It fails the basic test of a good situation which is to be Rare, Exceptional, and Fixable. It is certainly not a rare condition, seems to be happening all the time. It is happening a lot and no one is “fixing” the condition.
So this situation is not rare, not exceptional and clearly no one is “fixing” it. Therefore the situation should be rethought and reworked until it is rare, exceptional and fixable. If that is not possible, the situation should be stopped and deleted to make room for other useful work at the agent(s) and the TEMS and the event receivers. If it could be stopped, the workload would drop substantially. Thus the situation should be reviewed and justified.
Summary
Tale #15 of using Event Audit History to understand and review a high overhead situation and thus potentially save resources.
Sitworld: Table of Contents
History and Earlier versions
There are no binary objects associated with this project.
1.000000
initial release
Photo Note: Back Deck – Big Sur 1999
One thought on “Sitworld: Event History #15 High Results Situation to No Purpose”