John Alvord, IBM Corporation
Draft #1 – 4 April 2018 – Level 1.00000
The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.
A Situation That Created Too Many Events
Title: False ESX reboot alert
Description: … We are getting false alert for ESX up time monitoring, it is happening randomly for different servers . … I found Alert get triggered when system up time show values as 4294967295 . it is happening for all those false triggered alert…
The situation formula as seen from a tacmd viewsit was simple.
Formula : *IF *VALUE KVM_SERVER.System_up_time *LT 600
Sampling Interval : 0/0:5:0
Just after startup the system being monitored would have an up time below 600 seconds. That formula creates an situation event. A while later the up goes above 600 seconds and the event would close.
Event History Audit
The new tool was run on the remote TEMS database files which were sent with the problem report. If you were doing this yourself you would use the eventdat.pl script. The example follows that end user path.
perl eventaud.pl -lst -allresults
The -allresults means the last report section displays all the results in an easier to understand format. Usually it shows only the situations that triggered advisories.
That final report section is displayed in order by 1) node or agent name, 2) Thrunode [usually remote TEMS], 3) Situation Name, 4) DisplayItem, 5) by TEMS processing second. There is a massive amount of information present and I will show one small snippet of the report relating to the problem at hand.
Here are the lines of interest:
IBM_ESXReboot_W_Test,VM:XXXX232V-ibmesxcdc030:ESX,REMOTE_IBM010,1180327071516999,1180327071516000,Y,300,1,KVMSERVERG.SH,ibmesxcdc030.amer.ibm.corp,1891,*IF *VALUE KVM_SERVER.System_up_time *LT 600,
,,,,,,,P,*PREDICATE=KVMSERVERG.SUT < 600,
,,,,,,,0,KVMSERVERG.AVCPR=83;KVMSERVERG.BIOS_DATE=;KVMSERVERG.BN=3248547;KVMSERVERG.CEM=;KVMSERVERG.CLUSTER=CDC-Linux-Non-Prod-Cluster-01;KVMSERVERG.CP=0;KVMSERVERG.CS=connected;KVMSERVERG.DATACENTER=CSC Chicago Data Center (CSC);KVMSERVERG.DEMAND=0;KVMSERVERG.DM=datacenter-862;KVMSERVERG.DS=283436;KVMSERVERG.EU=0;KVMSERVERG.FQN=;KVMSERVERG.HE=0;KVMSERVERG.IDIS0=0;KVMSERVERG.IDIS1=0;KVMSERVERG.IDIS2=0;KVMSERVERG.IDIS3=0;KVMSERVERG.IDIS4=0;KVMSERVERG.IDIS5=0;KVMSERVERG.IDIS6=0;KVMSERVERG.IDIS7=0;KVMSERVERG.IDIS8=0;KVMSERVERG.IDIS9=0;KVMSERVERG.IP_ADDRESS=;KVMSERVERG.LATENCY=0;KVMSERVERG.MEM=;KVMSERVERG.MM=0;KVMSERVERG.NICS=8;KVMSERVERG.NODEID=;KVMSERVERG.NUMBER_VMS=22;KVMSERVERG.NVO=13;KVMSERVERG.OCU=7;KVMSERVERG.OMU=47;KVMSERVERG.ORIGINNODE=VM:FAIN232V-ibmesxcdc030:ESX;KVMSERVERG.OS=green;KVMSERVERG.PC=12;KVMSERVERG.PC0=0;KVMSERVERG.PEP=0;KVMSERVERG.PER=0;KVMSERVERG.PF=;KVMSERVERG.PM=196544;KVMSERVERG.PRODUCT=VMware ESXi;KVMSERVERG.PS=;KVMSERVERG.PU=0;KVMSERVERG.SAML=0;KVMSERVERG.SH=ibmesxcdc030.amer.ibm.corp;KVMSERVERG.SM=;KVMSERVERG.SN=;KVMSERVERG.SPML=0;KVMSERVERG.SUT=4294967295;KVMSERVERG.SV=;KVMSERVERG.TCM=31908;KVMSERVERG.TIMESTAMP=1180327090935000;KVMSERVERG.TVCM=0;KVMSERVERG.TVPS=0;KVMSERVERG.UCM=0;KVMSERVERG.UD=130917;KVMSERVERG.UUID=2001FF01-0000-0000-0000-00000000005F;KVMSERVERG.VE=Yes;KVMSERVERG.VERSION=5.5.0,
This is a little messy to view but it is simple compared to some. The first line is a header summary which starts off with the situation name, the agent name, the remote TEMS, agent time, TEMS Time, Status [Y=open], 300 [seconds sampling interval]. number of results, the DisplayItem value and the Situation Formula or PDT. In the full report section there is a header title line.
The second line is the predicate or formula summary. The names here all use the Attribute Group Table and Attribute Column. It is exactly parallel to the first line PDT:
The third line shows the attributes sent with the situation being open. In particular note the KVMSERVERG.SUT=4294967295; Using a Decimal to Hex web calculator that number is equivalent to X’FFFFFFFF’ and in signed arithmetic that is -1. Numeric attributes are usually kept in signed 4 byte integers.
In summary the situation fired because -1 < 600 is a true statement. The Agent also continued to send the identical information every 300 seconds – which is how situation processing works. This was done at 22 agents [connecting to this remote TEMS] and comprised 4.27% of the estimated Situation workload.
What is -1?
Much data gathered by the agent comes from calls using APIs to the monitored system. The negative values typically mean the API call could not return the data for some reason. Sometimes that is documented in the Agent Manual, sometimes not. It cannot really be an actual second number because 4294967295 seconds would be roughly 68 years.
Situation Formula reworked
The rework is rather easy.
*IF *VALUE KVM_SERVER.System_up_time *LT 600 *AND *VALUE KVM_SERVER.System_up_time *GE 0
will screen out the negative values.
It might be that -1 is a signal of a true error condition. That would probably need to be worked out with the agent support people or perhaps the vendor and API usage. In that case you could have a separate situation to track them down and fix them.
*IF *VALUE KVM_SERVER.System_up_time *LT 0
Tale #1 of using Event Audit History to diagnose a Situation mystery.
History and Earlier versions
There are no binary object associated with this project.
Photo Note: Between Deck Vent on Cruise Ship Build 2018