John Alvord, IBM Corporation
jalvord@us.ibm.com
Draft #1 – 27 April 2018 – Level 1.00000
Inspiration
The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.
One of the largest difficulties was understanding what happened when two [or more] results were merged into a single event. There are so many attribute values to compare that it can be tedious. This needed a new report section!
This was seen in the following Event History Audit report section:
EVENTREPORT007: Detailed Attribute differences on first two merged results
Situation,Node,Agent_Time,Reeval,Results,Atom,Atomize,Attribute_Differences
bnc_wasaftergc_gynp_dseiiwas,PWNESTBN:bnc_itmaxpapnestba:KYNS,1180410062604000,300,2,KYNGCCYC.SERVER_NAM,PWNESTBA,KYNGCCYC.AF_NO 1[31178] 2[31128],KYNGCCYC.BYTE_FREED 1[1793354] 2[1786754],KYNGCCYC.BYTE_USED 1[303798] 2[310398],KYNGCCYC.FINAL_REFS 1[826] 2[1165],KYNGCCYC.GC_NO 1[31179] 2[31129],KYNGCCYC.GC_TIME 1[1180410062522599] 2[1180410062451969],KYNGCCYC.HEAP_AVAIL 1[1793354] 2[1786754],KYNGCCYC.SOFT_REFS 1[4] 2[0],KYNGCCYC.TIME_COMP 1[441] 2[569],KYNGCCYC.TIME_MARK 1[172] 2[169],KYNGCCYC.WEAK_REFS 1[340] 2[412],,
This involved the bnc_wasaftergc_gynp_dseiiwas situation, which was delivered from agent PWNESTBN:bnc_itmaxpapnestba:KYNS with an Agent time of 1180410062604000. The sampling interval was 300 seconds [Sampled situation] and two results were merged. The DisplayItem was KYNGCCYC.SERVER_NAM and the Atomize value was PWNESTBA. Both results carried that same value, which explains why they were merged.
Attribute by Attribute comparison.
At the end of each report line is a comparison of every attribute whose value differs between the first and second result rows. If there were more than two results, the comparison still covers only the first two. The idea is to make the two results easier to compare. More comments follow the list.
KYNGCCYC.AF_NO 1[31178] 2[31128],
KYNGCCYC.BYTE_FREED 1[1793354] 2[1786754],
KYNGCCYC.BYTE_USED 1[303798] 2[310398],
KYNGCCYC.FINAL_REFS 1[826] 2[1165],
KYNGCCYC.GC_NO 1[31179] 2[31129],
KYNGCCYC.GC_TIME 1[1180410062522599] 2[1180410062451969],
KYNGCCYC.HEAP_AVAIL 1[1793354] 2[1786754],
KYNGCCYC.SOFT_REFS 1[4] 2[0],
KYNGCCYC.TIME_COMP 1[441] 2[569],
KYNGCCYC.TIME_MARK 1[172] 2[169],
KYNGCCYC.WEAK_REFS 1[340] 2[412],,
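For scripted follow-up, a report line can be split back into its fields. The following is a minimal Python sketch and not part of the Event History Audit tooling: the function name parse_report_line and the regular expression are assumptions inferred from the sample line above, and the sample is shortened to three of the eleven differences. It splits on commas, so it assumes no attribute value contains a comma.

```python
import re

# Assumed layout of one difference entry, inferred from the sample report line:
#   TABLE.COLUMN 1[first value] 2[second value]
DIFF_RE = re.compile(r"^(\S+) 1\[(.*)\] 2\[(.*)\]$")

# Column names taken from the EVENTREPORT007 header line.
FIXED_COLUMNS = ["Situation", "Node", "Agent_Time", "Reeval", "Results", "Atom", "Atomize"]


def parse_report_line(line):
    """Split one EVENTREPORT007 detail line into fixed columns and differences."""
    fields = line.split(",")
    record = dict(zip(FIXED_COLUMNS, fields[:len(FIXED_COLUMNS)]))
    differences = {}
    for entry in fields[len(FIXED_COLUMNS):]:
        match = DIFF_RE.match(entry.strip())
        if match:  # skips the empty trailing fields
            attribute, first, second = match.groups()
            differences[attribute] = (first, second)
    record["Attribute_Differences"] = differences
    return record


# Sample line from the report, shortened to three of the eleven differences.
SAMPLE_LINE = (
    "bnc_wasaftergc_gynp_dseiiwas,PWNESTBN:bnc_itmaxpapnestba:KYNS,"
    "1180410062604000,300,2,KYNGCCYC.SERVER_NAM,PWNESTBA,"
    "KYNGCCYC.AF_NO 1[31178] 2[31128],"
    "KYNGCCYC.GC_TIME 1[1180410062522599] 2[1180410062451969],"
    "KYNGCCYC.SOFT_REFS 1[4] 2[0],,"
)

record = parse_report_line(SAMPLE_LINE)
for attribute, (first, second) in record["Attribute_Differences"].items():
    print(f"{attribute}: first={first} second={second}")
```

Keeping the two values as a pair per attribute makes it easy to filter for the attributes of interest, such as the time stamps.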
Sometimes you can spot an attribute that would make a better DisplayItem, though not here.
The KYNGCCYC.GC_TIME values are really interesting. Selecting out the minute and second, the first is 25:22 and the second is 24:51, about 31 seconds earlier. Since the sampling interval is 300 seconds, these two result sets cannot be from the same agent, even though they have the same server name KYNGCCYC.SERVER_NAM of PWNESTBA. Next notice the agent name PWNESTBN:bnc_itmaxpapnestba:KYNS. The first section is often the hostname, and here it is PWNESTBN, just one character away from PWNESTBA.
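The arithmetic above can be checked with a short Python sketch. It assumes the 16-digit stamps follow the CYYMMDDHHMMSSmmm layout, with a leading 1 meaning the year is 20YY. The function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def parse_itm_timestamp(stamp):
    """Decode a 16-digit stamp, assuming the layout CYYMMDDHHMMSSmmm."""
    year = 1900 + 100 * int(stamp[0]) + int(stamp[1:3])
    return datetime(
        year,
        int(stamp[3:5]),    # month
        int(stamp[5:7]),    # day
        int(stamp[7:9]),    # hour
        int(stamp[9:11]),   # minute
        int(stamp[11:13]),  # second
        int(stamp[13:16]) * 1000,  # milliseconds to microseconds
    )


def count_mismatches(a, b):
    """Count positions where two equal-length names differ."""
    if len(a) != len(b):
        raise ValueError("names must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


first_gc = parse_itm_timestamp("1180410062522599")
second_gc = parse_itm_timestamp("1180410062451969")
gap_seconds = (first_gc - second_gc).total_seconds()
print(first_gc.strftime("%M:%S"), second_gc.strftime("%M:%S"), gap_seconds)

# First section of the agent name versus the Atomize value.
host_section = "PWNESTBN:bnc_itmaxpapnestba:KYNS".split(":")[0]
print(host_section, count_mismatches(host_section, "PWNESTBA"))
```

Under that layout assumption, the gap works out to a little over 30 seconds and the two names differ in a single position.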
What is the problem and How to fix it?
The problem is that there are two results and they are being merged – so one is lost.
From the analysis above, there are two agents which have been accidentally configured with the same name and they are conflicting with each other. They are sending results every 300 seconds. The results arrive in a large collection area identified only by agent name [and situation name and DisplayItem and time etc]. The TEMS dataserver [SQL processor] wakes up every 300 seconds and looks for results for that situation. It finds them [two in this case] and creates a potential situation event package that SITMON [situation monitor logic] then bundles together.
The solution is to review the environment, determine which agents are duplicated, and correct the faulty agent configuration. That way the agent names will be unique, and when this happens again two situation events will be created.
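The merge described above can be illustrated with a toy model in Python. This is a sketch only, not the TEMS or SITMON logic: the grouping key of agent name, situation name and DisplayItem value is a simplification of the description above, and the corrected agent name PWNESTBA:bnc_itmaxpapnestba:KYNS is hypothetical.

```python
from collections import defaultdict


def group_results(results):
    """Group result rows by (agent name, situation, DisplayItem value).

    Toy model: each group stands for one situation event, so rows that
    share a key end up merged into the same event.
    """
    groups = defaultdict(list)
    for row in results:
        key = (row["node"], row["situation"], row["atomize"])
        groups[key].append(row)
    return dict(groups)


SITUATION = "bnc_wasaftergc_gynp_dseiiwas"

# Two agents configured with the same name: both rows share one key.
duplicated = [
    {"node": "PWNESTBN:bnc_itmaxpapnestba:KYNS", "situation": SITUATION,
     "atomize": "PWNESTBA", "gc_no": "31179"},
    {"node": "PWNESTBN:bnc_itmaxpapnestba:KYNS", "situation": SITUATION,
     "atomize": "PWNESTBA", "gc_no": "31129"},
]

# After the fix: the second agent has its own (hypothetical) name.
corrected = [dict(row) for row in duplicated]
corrected[1]["node"] = "PWNESTBA:bnc_itmaxpapnestba:KYNS"

merged_groups = group_results(duplicated)
separate_groups = group_results(corrected)
print(len(merged_groups), "event with", len(duplicated), "merged results")
print(len(separate_groups), "events after the agent names are unique")
```

With the shared name both rows land under one key. Once the names differ, each row gets its own key, which corresponds to two separate situation events.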
Often the knowledge of a potential duplicate condition and the agent name is enough to lead the agent owners to the correct ones to fix.
Other times these can be detected with a TEPS Audit report. Agents like this often send inconsistent node statuses, such as a changing IP address. The TEPS is very sensitive and complains [produces error messages], and the TEPS Audit report will summarize such complaints. Other times the hub and remote TEMS need diagnostic tracing, and a TEMS Audit report will point the way.
Summary
Tale #11 of using Event History Audit is about reviewing a case where there is evidence of duplicate agent names.
Sitworld: Table of Contents
History and Earlier versions
There are no binary objects associated with this project.
1.000000
initial release
Photo Note: Radar Dome Lift – January 2016