Sitworld: Event History #9 Two Open Or Close Events In A Row

NestingSwallows

John Alvord, IBM Corporation

jalvord@us.ibm.com

Draft #1 – 22 April 2018 – Level 1.00000

Follow on twitter

Inspiration

The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process.

This was seen in the Event Audit History Advisory section:

10,EVENTAUDIT1005W,TEMS,Situations [25] showing open->open transitions – see EVENTREPORT025

One situation showed a sequence of open event followed by an second open event. There is also a ,EVENTAUDIT1006W advisory showing a close followed by a close. As seen by the low impact of 10 this is unusual but often not very important.

And in that report section:

EVENTREPORT025: Situations showing Open->Open and Close->Close Statuses

Situation,Type,Count,Node_ct,Nodes,

ccp_cpu_rlzc_redhat,YY,1,1,zev_rtpprdles2:LZ,

There is a situation ccp_cpu_rlzc_redhat. It experienced a case of a Open->Open transition 1 time at one node zev_rtpprdles2:LZ.

Deep dive Into the report details

Scan or search ahead for Report 999. It is sorted by first node, then situation, then by Time at the TEMS. I will first describe what you see and the guidance from the column description line.

This will show only a single open event and then close event, but there were many listed in the full report.

EVENTREPORT999: Full report sorted by Node/Situation/Time

Situation,Node,Thrunode,Agent_Time,TEMS_Time,Deltastat,Reeval,Results,Atomize,DisplayItem,LineNumber,PDT

Situation – Situation Name, which can be different from the Full Name that you see in situation editor, like too long or other cases.

Node – Managed System Name or Agent Name

Thrunode – The managed system that knows how to communicate with the agent, the remote TEMS in simple cases

Agent_Time – The time as recorded at the Agent during TEMA processing. You will see cases where the same Agent time is seen in multiple TEMS seconds because the Agent can produce data faster than then TEMS can process it at times. Simple cases have a last three digits of 999. Other cases will have tie breakers of 000,001,…,998 when a lot of data is being generated. This the UTC [earlier GMT] time at the agent.

TEMS_Time – The time as recorded at the TEMS during processing. This the UTC [earlier GMT] time.

Deltastat – event status. You generally see Y for open and N for close. There are more not recorded here.

Reeval – Sampling interval [re-evaluation] in seconds and 0 means a pure event.

Results – How many results were seen. The simplest cases are 1 and you would see that if you used -allresults control. In this report you only get a warning when there are multiple results.

Atomize – The table/column specification of the value used for Atomize. It can be null meaning not used.

DisplayItem – The value of the atomize in this instance. Atomize is just the [up to] first 128 bytes of another string attribute.

LineNumber – A debugging helper that tells what line of the TSITSTSH data dump supplied this information

PDT  – The Predicate or Situation Formula as it is stored.

The Descriptor line – before we see the results.

ccp_cpu_rlzc_redhat,zev_rtpprdles2:LZ,REMOTE_usrdrtm011ccpr2,1180404222639999,1180404222640000,Y,600,1,,,728,*IF *VALUE Linux_CPU.CPU_ID *EQ Aggregate *AND *VALUE Linux_CPU.Busy_CPU *GE 95.00,

Situation was IBccp_cpu_rlzc_redhat, agent was zev_rtpprdles2:LZ. thrunode was ,REMOTE_usrdrtm011ccpr2. Agent_time was 1180404222639999 and TEMS_time was 1180404222640000, one second later. It was an Open event [Y], sampling interval was 500, there was one result and no DisplayItem or Atomize.. The record came from line number 728 in the input. The PDT is shown at the end.

Following the descriptor line is one or more P [Predicate/formula] lines as used as the Agent logic, followed by the results contributing to the TEMS logic.

,,,,,,,P,*PREDICATE=LNXCPU.CPUID = -1 AND LNXCPU.BUSYCPU >= 9500,

Following the predicate is one result line. These are all in the form of Attribute=value in the Table/Column=raw_data form. There is a leading count of the index of this result line. In this case there was one P line and one result line. Sometimes there are many more but not this time.

,,,,,,,0,LNXCPU.BUSYCPU=9999;LNXCPU.CPUID=-1;LNXCPU.IDLECPU=1;LNXCPU.ORIGINNODE=zev_rtpprdles2:LZ;LNXCPU.SYSCPU=1306;LNXCPU.TIMESTAMP=1180404222255000;LNXCPU.USRCPU=8693;LNXCPU.USRNCPU=0;LNXCPU.USRSYSCPU=665;LNXCPU.WAITCPU=0,

The goal was to check on BUSYCPU >= 9500 ant it was 9999. Notice here that the raw data is not formatted. The formula mentioned Linux_CPU.Busy_CPU *GE 95.00 but the agent itself deals with integers. The *PREDICATE is a test for >= 9500 and result was 9999 [95.00% and 99.99%]. Formatting takes place later on. That is how formatting for different languages is handled. For example in German the decimal separator is a comma [,]. This can be important when a attribute value is used in an action command run at the Agent – the raw data will be seen and formatting is the responsibility of the action command author.

Next another Open is seen – Descriptor line

ccp_cpu_rlzc_redhat,zev_rtpprdles2:LZ,REMOTE_usrdrtm011ccpr2,1180406073115999,1180406073115000,Y,600,1,,,2130,*IF *VALUE Linux_CPU.CPU_ID *EQ Aggregate *AND *VALUE Linux_CPU.Busy_CPU *GE 95.00,

As can be seen from the TEMS timestamp it was processed at 2018-4-6 07:31:15, some 33 hours later, The *Predicate and Results are shown next;

,,,,,,,P,*PREDICATE=LNXCPU.CPUID = -1 AND LNXCPU.BUSYCPU >= 9500,

,,,,,,,0,LNXCPU.BUSYCPU=9979;LNXCPU.CPUID=-1;LNXCPU.IDLECPU=21;LNXCPU.ORIGINNODE=zev_rtpprdles2:LZ;LNXCPU.SYSCPU=2236;LNXCPU.TIMESTAMP=1180406072115000;LNXCPU.USRCPU=7743;LNXCPU.USRNCPU=0;LNXCPU.USRSYSCPU=346;LNXCPU.WAITCPU=0,

What is the problem and How to fix it?

This often means that an agent was offline for some time because of some system problem, a network problem or maybe some planned outage for upgrades. The same thing would be true of N or Close events. The first time a situation is evaluated, if the condition is not true, a 0 result record is sent from the agent. Unlike with Open events, there are no confirm results sent so the TEMS impact is extremely low.

If these are seen in high volume they should be investigated for network issues or event agent mal-configuration issues. Otherwise it is just an interesting side note on processing logic.

Summary

Tale #9 of using Event Audit History to understand and review a open->open transition in the Event History.

Sitworld: Table of Contents

History and Earlier versions

There are no binary objects associated with this project.

1.000000

initial release

Photo Note: Barn Swallow Chicks Nesting in Big Sur Eves – 2009

 

Leave a comment