Sitworld: Event History #1 The Situation That Fired Oddly

Vent2

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 4 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

A Situation That Created Too Many Events

Title: False ESX reboot alert

Description: … We are getting false alert for ESX up time monitoring, it is happening randomly for different servers . … I found Alert get triggered when system up time show values as 4294967295 . it is happening for all those false triggered alert…

The situation formula as seen from a tacmd viewsit was simple.

Formula        : *IF *VALUE KVM_SERVER.System_up_time *LT 600

Sampling Interval   : 0/0:5:0

Just after startup the system being monitored would have an up time below 600 seconds. That formula creates an situation event. A while later the up goes above 600 seconds and the event would close.

Event History Audit

The new tool was run on the remote TEMS database files which were sent with the problem report. If you were doing this yourself you would use the eventdat.pl script. The example follows that end user path.

perl eventaud.pl -lst -allresults

The -allresults means the last report section displays all the results in an easier to understand format. Usually it shows only the situations that triggered advisories.

That final report section is displayed in order by 1) node or agent name, 2) Thrunode [usually remote TEMS], 3) Situation Name, 4) DisplayItem, 5) by TEMS processing second. There is a massive amount of information present and I will show one small snippet of the report relating to the problem at hand.

Here are the lines of interest:

IBM_ESXReboot_W_Test,VM:XXXX232V-ibmesxcdc030:ESX,REMOTE_IBM010,1180327071516999,1180327071516000,Y,300,1,KVMSERVERG.SH,ibmesxcdc030.amer.ibm.corp,1891,*IF *VALUE KVM_SERVER.System_up_time *LT 600,

,,,,,,,P,*PREDICATE=KVMSERVERG.SUT < 600,

,,,,,,,0,KVMSERVERG.AVCPR=83;KVMSERVERG.BIOS_DATE=;KVMSERVERG.BN=3248547;KVMSERVERG.CEM=;KVMSERVERG.CLUSTER=CDC-Linux-Non-Prod-Cluster-01;KVMSERVERG.CP=0;KVMSERVERG.CS=connected;KVMSERVERG.DATACENTER=CSC Chicago Data Center (CSC);KVMSERVERG.DEMAND=0;KVMSERVERG.DM=datacenter-862;KVMSERVERG.DS=283436;KVMSERVERG.EU=0;KVMSERVERG.FQN=;KVMSERVERG.HE=0;KVMSERVERG.IDIS0=0;KVMSERVERG.IDIS1=0;KVMSERVERG.IDIS2=0;KVMSERVERG.IDIS3=0;KVMSERVERG.IDIS4=0;KVMSERVERG.IDIS5=0;KVMSERVERG.IDIS6=0;KVMSERVERG.IDIS7=0;KVMSERVERG.IDIS8=0;KVMSERVERG.IDIS9=0;KVMSERVERG.IP_ADDRESS=;KVMSERVERG.LATENCY=0;KVMSERVERG.MEM=;KVMSERVERG.MM=0;KVMSERVERG.NICS=8;KVMSERVERG.NODEID=;KVMSERVERG.NUMBER_VMS=22;KVMSERVERG.NVO=13;KVMSERVERG.OCU=7;KVMSERVERG.OMU=47;KVMSERVERG.ORIGINNODE=VM:FAIN232V-ibmesxcdc030:ESX;KVMSERVERG.OS=green;KVMSERVERG.PC=12;KVMSERVERG.PC0=0;KVMSERVERG.PEP=0;KVMSERVERG.PER=0;KVMSERVERG.PF=;KVMSERVERG.PM=196544;KVMSERVERG.PRODUCT=VMware ESXi;KVMSERVERG.PS=;KVMSERVERG.PU=0;KVMSERVERG.SAML=0;KVMSERVERG.SH=ibmesxcdc030.amer.ibm.corp;KVMSERVERG.SM=;KVMSERVERG.SN=;KVMSERVERG.SPML=0;KVMSERVERG.SUT=4294967295;KVMSERVERG.SV=;KVMSERVERG.TCM=31908;KVMSERVERG.TIMESTAMP=1180327090935000;KVMSERVERG.TVCM=0;KVMSERVERG.TVPS=0;KVMSERVERG.UCM=0;KVMSERVERG.UD=130917;KVMSERVERG.UUID=2001FF01-0000-0000-0000-00000000005F;KVMSERVERG.VE=Yes;KVMSERVERG.VERSION=5.5.0,

This is a little messy to view but it is simple compared to some. The first line is a header summary which starts off with the situation name, the agent name, the remote TEMS, agent time, TEMS Time, Status [Y=open], 300 [seconds sampling interval]. number of results, the DisplayItem value and the Situation Formula or PDT. In the full report section there is a header title line.

The second line is the predicate or formula summary. The names here all use the Attribute Group Table and Attribute Column. It is exactly parallel to the first line PDT:

The third line shows the attributes sent with the situation being open. In particular note the KVMSERVERG.SUT=4294967295;   Using a Decimal to Hex web calculator that number is equivalent to X’FFFFFFFF’ and in signed arithmetic that is -1. Numeric attributes are usually kept in signed 4 byte integers.

In summary the situation fired because  -1 < 600   is a true statement. The Agent also continued to send the identical information every 300 seconds – which is how situation processing works. This was done at 22 agents [connecting to this remote TEMS] and comprised 4.27% of the estimated Situation workload.

What is -1?

Much data gathered by the agent comes from calls using APIs to the monitored system.  The negative values typically mean the API call could not return the data for some reason. Sometimes that is documented in the Agent Manual, sometimes not. It cannot really be an actual second number because 4294967295 seconds would be roughly 68 years.

Situation Formula reworked

The rework is rather easy.

*IF *VALUE KVM_SERVER.System_up_time *LT 600 *AND *VALUE KVM_SERVER.System_up_time *GE 0

will screen out the negative values.

It might be that -1 is a signal of a true error condition. That would probably need to be worked out with the agent support people or perhaps the vendor and API usage. In that case you could have a separate situation to track them down and fix them.

*IF *VALUE  KVM_SERVER.System_up_time *LT 0

Summary

Tale #1 of using Event Audit History to diagnose a Situation mystery.

Sitworld: Table of Contents

History and Earlier versions

There are no binary object associated with this project.

1.000000

initial release

Photo Note: Between Deck Vent on Cruise Ship Build 2018

 

One thought on “Sitworld: Event History #1 The Situation That Fired Oddly

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: