Sitworld: Event History Audit


Version 1.42000 –  21 December 2021

John Alvord, IBM Corporation

jalvord@us.ibm.com


Inspiration

This project had several inspirations.

The first came from a desire to estimate situation workload impact without any added diagnostic tracing, and more generally to discover how results are processed at the agent and how events are processed in the TEMS.

The second arose from a customer case involving a situation that did not fire [a guard dog that didn’t bark!]. After lengthy effort, the root cause was determined: the situation used a DisplayItem to make sure that all monitored database processes that were offline were reported. In this problem case, the DisplayItem attribute “database name” was blank or null. Since there were several such disabled database processes, only one of the blank ones generated an event and the rest were suppressed. The situation was rewritten to use a different DisplayItem that always supplied the database name, and a re-test showed events were created for all the database processes. Thus passed two months of pain. Diagnosing such cases faster, or ahead of time, would improve monitoring with minimal effort.

The third is a goal to calculate other metrics, such as the time delay between data being processed at the agent and at the TEMS. This can point out system performance issues and even system clock issues.

TL;DR

Use the supplied script eventdat.pl to capture the TSITSTSH table at the hub and each remote TEMS. Then use the eventaud.pl script to detect problems so they can be corrected. You will get more valid situation events. The eventaud.csv report contains explanations of the advisory messages and the report sections. Read the section here on the summary report; the many other report sections are documented at the end of the report itself.
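As an illustration of the two-step workflow [the paths shown are the defaults discussed later in this post, not requirements]:

cd /opt/IBM/ITM/tmp
perl eventdat.pl
perl eventaud.pl -lst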

DisplayItem

When situations are started, processing runs on the agents, which evaluate conditions defined by the situation formula. The formula is a filter that takes the available data and determines which results should be passed on to the TEMS. Some formulas run periodically; those are called sampled situations. Other formulas run as needed to filter incoming sequential log-type data; those produce pure events.

Sometimes a formula can pass back multiple results. Consider a check for a Unix mount point with free space lower than desired. Multiple disks can be lower than desired, and multiple result rows are returned. The TEMS uses a DisplayItem to create individual situation events. A DisplayItem is the first 128 [or fewer] bytes of another attribute. When the DisplayItem supplies unique values per result row, the TEMS can create individual events. If a DisplayItem is needed but not supplied, only one event is created, thereby hiding real events; the result chosen for that event is somewhat random. If the selected DisplayItem is present but does not supply unique values [clearly an agent issue – why supply a DisplayItem that does not uniquely identify a result row?] you can again miss events.
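To make the mechanism concrete, here is a minimal Perl sketch [an illustration only, not actual TEMS code] of how result rows keyed by the DisplayItem value collapse into events. The sample rows and the db_name attribute are hypothetical; the point is that blank or duplicate DisplayItem values leave only one surviving event:

use strict;
use warnings;

# hypothetical result rows from one sampling interval
my @results = (
    { db_name => "",      state => "offline" },
    { db_name => "",      state => "offline" },   # duplicate blank DisplayItem
    { db_name => "PAYDB", state => "offline" },
);

my %events;
foreach my $row (@results) {
    my $atomize = $row->{db_name};    # the DisplayItem [ATOMIZE] value
    $events{$atomize} = $row;         # a later row with the same key replaces the earlier one
}
# Three results arrive but only two events survive; one offline database is hidden.
printf "results=%d events=%d\n", scalar @results, scalar keys %events;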

That issue can also happen with pure situations. In addition, the TEMS has performance logic such that if the same pure event with the same DisplayItem arrives from the same agent in the same second, the second and subsequent results are merged. At the TEP display the merged events can be seen; however they are not transmitted to the event receiver. If a DisplayItem is not defined [or not available], you can again have missed events. Even if you have a DisplayItem, if a flurry arrives in the same second with the same DisplayItem, you again have missed events. Sometimes no one minds: after all, if you have one event or twenty in the same second – does anyone really care? There is a way to configure the TEMS to guarantee that every pure result is processed into a single event [see product documentation here]. I must point out that such a configuration can create a serious impact on TEMS and event receiver logic – is there any real purpose in generating scores of events per second for essentially duplicate issues? As an alternative, many such agent sources, like the Tivoli Log Agent, have agent-side controls to hide duplicate identical reports.
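A similar minimal sketch [again an illustration, not product code] of the same-second merge for pure events, where the grouping key includes the arrival second as well as the DisplayItem value:

use strict;
use warnings;

# hypothetical pure results: [arrival second, DisplayItem value, message text]
my %pure_events;
foreach my $r ( [ 1001, "logA", "disk error" ],
                [ 1001, "logA", "disk error again" ],   # merged with the prior result
                [ 1002, "logA", "disk error" ] ) {
    my ($second, $atomize, $text) = @$r;
    push @{ $pure_events{"$second/$atomize"} }, $text;
}
printf "results=3 events=%d\n", scalar keys %pure_events;   # prints events=2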

Result Workload Measurement

Measuring result workload also has value. One recent case involved a simple situation designed to measure Unix OS Agent CPU usage and alert when too high [more than 50%]. The Event Audit report showed that a staggering 96% of the incoming workload on that remote TEMS was coming from that situation. The logic to work backward from the Situation Status History table to the agent-side result creation and the TEMS handling of the incoming results was complex but extremely rewarding.

This can also be rewarding with a negative result. A recent case showed signs of overload, but a time-based estimate of incoming results showed nothing special around the time of the instability. The TEMS was processing peak levels of 3-4 results per second and most often cruised along at 0.5 to 1.0 results per second. As a result, a lot of customer work gathering data and doing restarts was avoided. The root cause was an external influence which blocked TCP communication.

Other reports can show cases where a situation fluctuates repeatedly between open and close. Often that suggests a useless situation, since it fails the basic goal of situations – to be rare, exceptional and fixable.

Getting the report

You can make this report yourself as described using an available package. You can also create a Case with IBM Support and upload one or more hub or remote TEMS pdcollects. Ask in the Case that the eventaud.csv report be sent back to you.

Package Installation

The package is eventaud.1.42000. It contains

1) Perl script eventdat.pl – to gather the needed data from the ITM environment

2) Perl script eventaud.pl – to process the data and produce an Event Audit report.

I suggest eventdat.pl be placed in an ITM installation tmp directory. For Windows you need to create the <installdir>\tmp directory. For Linux/Unix use the existing <installdir>/tmp directory. You can of course use any convenient directory. The examples that follow use these defaults:

Linux/Unix: /opt/IBM/ITM/tmp

Windows: c:\IBM\ITM\tmp

Linux and Unix almost always come with the Perl interpreter installed. For Windows you can install a no-cost Community version from http://www.activestate.com if needed. You can also extract the files and copy them to a system where Perl is installed. No CPAN [community donated source] packages are needed.

Parameters for running eventdat.pl

All parameters are optional if defaults are taken; an example invocation is shown after the parameter list.

-h home installation directory for TEPS. Default is

Linux/Unix: /opt/IBM/ITM

Windows: c:\IBM\ITM

This can also be supplied with an environment variable

Linux/Unix: export CANDLEHOME=/opt/IBM/ITM

Windows: set CANDLE_HOME=c:\IBM\ITM

-v produce progress messages, default off

-tems specify a TEMS to get event history – can supply multiple TEMSes by repeating -tems. Default is all online TEMSes except the FTO mirror hub TEMS.

-work  directory to store work and report files, default is current directory

-off include offline agents, default off

-aff needed in rare database cases where affinities are in a different structure

-redo used during testing by reusing result files
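For example, a typical collection run [an illustrative command using the defaults and parameters described above, with progress messages and a separate work directory]:

cd /opt/IBM/ITM/tmp
perl eventdat.pl -v -work /tmp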

The eventdat.pl script will create several files

HUB.TSITDESC.LST

HUB.TNAME.LST

and one or more situation event history files of this form tems_name.TSITSTSH.LST.

The tems_name is HUB for the hub TEMS and otherwise it is the TEMS nodeid.

The eventdat.pl script also creates two shell files: run_eventaud.cmd and run_eventaud.sh. They invoke the eventaud.pl script with the needed parameters. The “.cmd” file is for Windows and the “.sh” file is for Linux/Unix; the only difference is the line-end characters to make each environment happier.

Parameters for running eventaud.pl

All parameters are optional if defaults are taken.

-h  Help display

-v Verbose display – more messages

-nohdr skip report header – used in regression testing

-txt read input from TEMS2SQL processed files  QA1CSITF.DB.TXT, QA1DNAME.DB.TXT, QA1CSTSH.DB.TXT

-lst [with no following TEMS name] get data from HUB.QA1CSITF.LST, HUB.QA1DNAME.LST, HUB.QA1CSTSH.LST

-lst with following TEMS name, get data from HUB.QA1CSITF.LST, HUB.QA1DNAME.LST, <temsnodeid>.QA1CSTSH.LST

-allresults display more results data, especially when the default single result is present

-sum produce a summary file eventaud.txt

-time  [to be implemented]

-days limit look back to this many days before the most recent event, default 7 days; a value of 0 means use all data

-o name of output file, default is eventaud.csv

-odir directory for output files, default /tmp or c:\temp

-tsitstsh filename of Situation Status History extract file

-workpath  directory to store working files

-dv  specify detailed situation and value report [can use many]. See details below.

Running eventaud.pl is just a matter of putting the needed files into a directory and running it like this:

cd /tmp     [where HUB.QA1CSITF.LST, HUB.QA1DNAME.LST, HUB.QA1CSTSH.LST are present]

perl eventaud.pl -lst

If you had a remote TEMS named RTEMS1, the commands would be:

cd /tmp

perl eventaud.pl -lst -tsitstsh RTEMS1
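Other documented options can be combined on the same command. For example [an illustrative command, not a required form], to limit the look-back to three days and also produce the summary file:

perl eventaud.pl -lst -days 3 -sum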

The report file is eventaud.csv; it contains the advisories and report sections, with explanations at the end of the report.

Summary Report format

The eventaud.csv report is lengthy, with advisories and report sections. To avoid the burden of over-documentation that few will read, the report itself contains documentation at the end which describes each advisory and report section, including the meaning, portions of an example, and a suggested recovery action if required. This internal documentation is under revision as new diagnoses are made. For this blog post we will show and discuss a summary report section and what it all means. Most of the exception cases in the summary are explained in more detail in the report sections that follow.

EVENTREPORT000: Event/Result Summary Budget Report

Duration: 4476 Seconds

==> The time between the oldest event history entry and the most recent, in seconds. The default -days 7 translates to a maximum of 604800 seconds. This is especially important for testing environments where you may have some years-old events. The only records measured are those with Y [open event], N [close event] and X [problem] DELTASTAT column values. The S [Start Situation] record is used to limit the event calculation to the time since the situation was last started – usually when the TEMS started. Start Situation records are not included in the time calculation. They are usually followed by a flood of N [close event] records which are not very interesting.

Total Open/Close Events: 337 4.52/min

==> Count of Y [open] and N [close] records over the duration captured. The total count and rate per minute are displayed. These are usually far fewer than results, since most of the result rows are confirming existing open event conditions.
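As a check on the arithmetic [using the numbers above]: 4476 seconds is about 74.6 minutes, and 337 events divided by 74.6 minutes is roughly 4.52 events per minute.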

Total Results: 3507 47.01/min

==> These are result rows sent from the agents. Sampled situations send a “confirming” result every sampling interval. In fact, if such a confirming result is not observed for three sampling intervals, the TEMS will generate an automatic close of the event. This can be confusing, especially if the condition is supposed to be always true. It usually means there is some agent, TEMS or networking interference happening. Definitely not good; in well-running systems this never happens. If this is a hub TEMS, the results apply to the hub and all remote TEMSes together and thus do not reflect stress on the hub system only.

Total Non-Forwarded Results: 1444 19.36/min [41.17%]

==> This records how many of the result rows were for situations not configured to forward to an event receiver like Netcool/OMNIbus. It is only counted if some situations were configured to send data to an event receiver. These should be reviewed because if no event receiver is processing the conditions, the issues will never be fixed, and thus they are often a waste of time and resources. The last number is the percentage of non-forwarded results compared to total results. This impacts the agent and the remote TEMS but not the hub TEMS… the hub TEMS only sees the Y and N results.

Total Result Bytes: 3506892 45.91 K/min Worry[9.18%]

==> This is an estimate of the number of bytes sent from the agents, along with Kbytes per minute. The Worry% is how close this is to the 500 K/min level, which is where you should be worried. This is not an absolute number: the actual point where a remote TEMS starts to break down could be higher or lower depending on the system power, competing workload and network. Your own experience is the best guide. The largest number ever seen was 93 megabytes/min, and of course that remote TEMS was malfunctioning badly. Event Audit numbers are estimates only; you would need to use TEMS Audit and tracing to get more precise numbers.
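For example, using the numbers above: 3506892 bytes over roughly 74.6 minutes is about 45.91 Kbytes per minute, and 45.91 divided by the 500 K/min worry level is roughly 9.18%, the Worry percentage shown.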

Total Non-Forwarded Result Bytes: 744464 9.75/min [21.23%]

==> This shows the portion of arriving result data associated with non-forwarded situations. See above for more discussion.

Sampled Results Confirm: 3479 46.64/min

==> Many incoming result rows are used to confirm existing open situation events. This is often a high proportion of the total. If the remote TEMS is under stress, you can reduce the stress by increasing the sampling interval, by adding a new remote TEMS to divide the workload, or by using a more powerful system.

Sampled Results Confirm Bytes: 3421828 44.79 K/min, 97.57% of total results

==> These are the bytes required to confirm open situation events. The number at the end is the percentage of total incoming result bytes.

Missing DisplayItem: 26 0.35/min

==> This counts the number of cases where a situation does not have a DisplayItem configured and this has hidden situation events. That degrades monitoring by not creating the situation events needed to resolve problems.

Duplicate DisplayItem: 27 0.36/min

==> This counts the number of cases where a situation has a DisplayItem but multiple results were seen at the agent with identical DisplayItem values. This may degrade monitoring by not creating situation events to resolve problems. Usually this means you cannot use that DisplayItem reliably. It may be considered an agent issue, since the agent should not offer a DisplayItem that does not uniquely identify the result rows. The practical solution is to select another DisplayItem that does uniquely identify result rows. For example, in Unix select Process_ID rather than Process Command Name.

Null DisplayItem: 100 1133.57/min

==> This is much like the previous Duplicate case but it means Null values were returned. This has the same negative impact and the same considerations.

Pure Merged Results: 6 0.08/min

==> This reports on cases where multiple pure situation results with the same DisplayItem value were processed in a single second at the TEMS and were suppressed by a different mechanism. If this is important, a per-TEMS configuration can be made to enforce one-result-equals-one-event logic.

Open/Open transitions: 0

==> This rare condition is when an open event is followed by a second open event. It is not well understood.

Close/Close transitions: 0

==> This rare condition is when a close event is followed by a second close event. It is not well understood.

Delay Estimate opens[334] over_minimum [114] over_average [1.81 seconds]   

==> This is a stress indicator of how the agent, network and TEMS handle incoming results. It can report as low as 0.0 seconds. If the TEMS is heavily loaded, the number will tend to be larger. If you see a very large number, that often means one or more agents are running on a system with a time vastly different from the TEMS time. That can be normal or abnormal, and there is a report section devoted to which agents are showing cases where the time is much larger than the minimum time observed.
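As a rough illustration of how such a delay can be estimated, here is a Perl sketch [an assumption-laden example, not the script’s actual logic] that treats the TSITSTSH GBLTMSTMP and LCLTMSTMP columns as ITM CYYMMDDHHMMSSmmm timestamps for the TEMS write time and the agent local time respectively:

use strict;
use warnings;
use Time::Local qw(timegm);

# convert an ITM CYYMMDDHHMMSSmmm timestamp to epoch seconds [sketch only]
sub itm_to_epoch {
    my ($ts) = @_;                                          # e.g. 1211221083015000
    my ($yy,$mm,$dd,$hh,$mi,$ss) = unpack("x1 A2 A2 A2 A2 A2 A2", $ts);
    return timegm($ss, $mi, $hh, $dd, $mm - 1, 2000 + $yy);
}

my $gbltmstmp = "1211221083017000";     # hypothetical TEMS write time
my $lcltmstmp = "1211221083015000";     # hypothetical agent local time
my $delay = itm_to_epoch($gbltmstmp) - itm_to_epoch($lcltmstmp);
print "estimated delay: $delay seconds\n";   # prints 2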

Situation Detail Report

At level 1.42000 a situation detail value report was added. This is used after the initial eventaud.csv is created with the -allresults option. The goal is to report on cases where pure situations are returning large numbers of results and causing TEPS performance issues.

When a pure situation returns more than one result in a single second, the TEMS merges the results into a single event record. The TEPS has to handle all such results and that can cause TEPS performance issues when each merged result is passed to every TEP user. Often these are invisible results because the user has not associated that situation.

First, examine REPORT032 to see situations with a large number of results. If there are some with many results, look at the detailed REPORT999. The goal is to determine an attribute which will convey some meaning to the situation event consumer… whoever will need to correct the underlying issue. In that report, the values are conveyed using Table and Column names. For example, in one case a situation K13_IBM_Functional_Error was showing a high rate and the column name involved was ERRORDESCR – error description. The following command was used:

perl eventaud.pl -lst -allresults -dv K13_IBM_Functional_Error ERRORDESCR

The following report section was created

EVENTREPORT997: Detailed Situation DisplayValue Report
Situation,Count,Display,NodeCount,
,Node,Count,ValueCount,
,,DisplayValue,Count
K13_IBM_Functional_Error,1551,ERRORDESCR,0,
,IBM05:13,420,
,,Exception in IBM Box InventoryBalanceDetailed importMaterial,420,
,IBM04:13,418,
,,Exception in IBM Box InventoryBalanceDetailed importMaterial,418,
,IBM07:13,359,
,,Exception in IBM Box InventoryBalanceDetailed importMaterial,359,
,IBM06:13,354,
,,Exception in IBM Box InventoryBalanceDetailed importMaterial,354,

An Agent Builder agent was reporting about an inventory balance in some subsystem. That was likely an issue that needed addressing… but it also was wrecking TEPS performance since it was arriving many times each second.

Reporting problems

If things do not work as expected, please capture the input LST files in a zip or compressed tar file and send them to the author. I will endeavor to correct any issue promptly. Also please pass on any interesting ideas.

Summary

The information in the report will show cases where two or more TEMSes have differing information about particular agents. In the simplest cases that strongly suggests duplicate agents. You can expect to see other associated posts concerning different aspects of event history.

Sitworld: Table of Contents

History

This project is also maintained at github.com/jalvo2014/eventaud, which will often be more up to date [and less tested] compared to the point releases. You can also use this GitHub distribution to review history and propose changes via pull requests.

eventaud.1.42000
Add situation result detail REPORT997

eventaud.1.41000
Change syntax to accommodate Perl 5.16.3
Sort by name if count equal in report029

eventaud.1.40000
Include situation pdt in some reports

eventaud.1.39000
Add/correct some table row sizes

eventaud.1.38000
Add/correct some table row sizes

eventaud.1.37000
Add/correct some table row sizes

eventaud.1.36000
Add/correct some table row sizes

eventaud.1.35000
Add report033 on estimated TSITSTSC cache usage and constant on situations

eventaud.1.34000
Add/correct some table row sizes

eventaud.1.33000
Add/correct some table row sizes

eventaud.1.32000

Add/correct some table row sizes

Add predicate related attributes at start in full report

