John Alvord, IBM Corporation jalvord@us.ibm.com Draft #1 – 14 December 2020 – Level 1.00000 Introduction The ITM protocol modifier EPHEMERAL:Y is wonderful when it is used correctly. When used badly it can cause terrible issues, such as communications malfunctioning and ITM processes like TEMS failing and requiring a recycle to continue. If…
Category Archives: Uncategorized
Sitworld: Simple Network Testing
John Alvord, IBM Corporation jalvord@us.ibm.com Draft #1 – 6 August 2020 – Level 0.50000 Inspiration Benchmarking ITM communication links is a tough job. You can validate many aspects using the document ITM Communications Validation, but testing for throughput, capacity, and errors is tough. Happily a former ITM developer implemented a technique…
Sitworld: Eliminating Duplicate Agents –
John Alvord, IBM Corporation jalvord@us.ibm.com Draft #5 – 20 August 2020 – Level 0.61000 NOTE: This was an interesting idea… but I kept running into new special cases that prevented the ideas from working reliably. As a result consider this a failed attempt. At the present time no work is underway. The…
Sitworld: Summarization and Pruning Audit
Version 0.51000 23 March 2020 John Alvord, IBM Corporation jalvord@us.ibm.com Inspiration ITM has a marvelous facility to store historical data. This facility includes logic to discard old data and summarize data into longer time periods. It is quite easy to get more data recorded but balancing that with the limited database storage…
Sitworld: ITM Permanent Configuration Best Practices
John Alvord, IBM Corporation jalvord@us.ibm.com Draft #1 – 1/16/2020 – Level 1.00000 Inspiration I am often asked for advice on how to change an ITM process configuration. It could be something simple like adding a diagnostic trace definition. It could be more complex like adding EPHEMERAL:Y to a communication control. There are…
Sitworld: Table of Contents
Table of Contents
Sitworld: TEP Java Web Start – Basic Diagnostic Capture
TEP diagnostic capture
Sitworld: ITM Communications Validation
Checking for common ITM communication issues.
Sitworld: Introduction
During my normal work, I see many interesting puzzles on how to accomplish useful work in IBM Tivoli Monitoring [ITM]. Often these revolve around situations. Over time I will present some basic education on the subject but first there are some interesting cases that will benefit from some interactions. So for the moment, I will…
Sitworld: Scrubbing Out Windows Agent Malconfiguration Remotely
Sometimes there is a mistake made in ITM Windows Agent configuration. It is made with good intentions but the result is an ITM Agent which constantly loses connection to the TEMS it is configured to and then reconnects over and over.
Sitworld: AOA Critical Issue – TEMS to TEMS High Latency network connection
temsaud.crit: Early remote SQL failures [&syncdist_early]
Sitworld: AOA Critical Issue – TEMS Possible TCP Blockage
temsaud.crit: Possible TCP Blockage: Recv-Q[13,1290] Send-Q[10,33203]
Sitworld: TEMS Database File Damage
itm_ref_checker.crit: QA1CSTSH.DB:unexpected size difference at tems2sql.pl line 1066.
Sitworld: AOA Critical Issue – Port Scanning Testing
temsaud.crit: Definite Evidence of port scanning [$scantype] which can destabilize any ITM process including TEMS
Sitworld: AOA Critical Issue – High Virtual Hub Table Updates
datahealth.crit: Virtual Hub Table updates peak $peak_rate per second more then nominal $opt_peak_rate – per hour [$vtnode_tot_hr] – total agents $vtnode_tot_ct – See DATAREPORT020
Sitworld: AOA Critical Issue – High Incoming Workload
eventaud.crit: Estimated Incoming result rate $ppc_result_rate worried $ppc_worry_pc
Sitworld: AOA Critical Issue – Excess MS_Offline type Situations
MS_Offline dataserver evaluation rate $prate agents/sec dangerously high
Sitworld: AOA Critical Issue – Duplicate Agent Names
Duplicate Agent Name cases
Sitworld: Agent Diagnostic Log Communications Summary
I was working through a case where an Agent kept losing connection with a remote TEMS. Seeing the big picture was very tough because the raw data was scattered through many diagnostic log instances.
Sitworld: Adventures in Communications #1
This symptom was a stalled remote TEMS that hardly did any communication. Restarting the remote TEMS resolved the issue for a day or so but eventually it got stuck again.
Sitworld: ITM Port Usage and Managing Port Usage
I get asked questions about ITM port usage. This is release 2.0 of the original technote.
Sitworld: Event History #15 High Results Situation to No Purpose
Total Result Bytes: 1023369249 989.47 K/min Worry[197.89%]
This environment is receiving almost one megabyte of results data per minute. Experience has shown that problems often occur if the result rate is over 500K per minute.
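The Worry figure in the summary line appears to be the result rate expressed as a percentage of the 500K per minute level mentioned above. That relationship is inferred from the sample figures on this page and is not confirmed from the tool's source. A minimal Python sketch of the arithmetic, using a hypothetical helper name:

```python
# Assumption: Worry% = result rate divided by a 500 K-per-minute threshold,
# expressed as a percentage. Inferred from the sample report lines only.
WORRY_THRESHOLD_K_PER_MIN = 500.0


def worry_percent(rate_k_per_min, threshold=WORRY_THRESHOLD_K_PER_MIN):
    """Return the result rate as a percentage of the threshold (hypothetical helper)."""
    return round(rate_k_per_min / threshold * 100.0, 2)


# Rate taken from the summary line "989.47 K/min Worry[197.89%]".
print(worry_percent(989.47))
```

Under that assumption, any rate above 500 K/min shows a Worry value over 100%.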
Sitworld: Event History #14 Lodging Problems
This was seen in the Advisory Report
100,EVENTAUDIT1008E,TEMS,Situations [2] had lodge failures [2] – See report EVENTREPORT020
Sitworld: Event History #13 Delay Delay Delay
This was seen in the Summary Report
Delay Estimate opens[110] over_minimum [6] over_average [3.33 seconds]
Sitworld: Event History #12 High Impact Situations And Much More
This was seen in the Summary Report
Total Result Bytes: 11698309 49.38 K/min Worry[9.88%]
Sitworld: Event History #11 Detailed Attribute differences on first two merged results
This was seen in the Event Audit History Report Section
EVENTREPORT007: Detailed Attribute differences on first two merged results
Sitworld: Event History #10 lost events because DisplayItem missing or null Atoms
This was seen in the Event Audit History Advisory section:
25,EVENTAUDIT1010W,TEMS,Situations [1] lost events because DisplayItem missing or null Atoms – see EVENTREPORT001
Sitworld: Event History #9 Two Open Or Close Events In A Row
This was seen in the Event Audit History Advisory section:
10,EVENTAUDIT1005W,TEMS,Situations [25] showing open->open transitions – see EVENTREPORT025
Sitworld: Event History #8 Situation Events Opening And Closing Frequently
John Alvord, IBM Corporation jalvord@us.ibm.com Draft #1 – 21 April 2018 – Level 1.00000 Inspiration The Event History Audit project is complex to grasp as a whole. The following series of posts will track individual diagnostic efforts and how the new project aided in the process. This was seen in the Event Audit History Advisory section: …
Sitworld: Event History #7 Events Created But Not Forwarded
This was seen in the Event Audit History Advisory section:
40,EVENTAUDIT1004W,TEMS,Situations [145] showing event statuses but event not forwarded – see EVENTREPORT018
Sitworld: Event History #6 Lost events with Multiple Results with same DisplayItem at same TEMS second
This was seen in the Event Audit History Advisory section:
50,EVENTAUDIT1013W,TEMS,Situations [9] lost [merged] events Multiple Results with same DisplayItem at same TEMS second – see EVENTREPORT004
Sitworld: Event History #5 Multiple Results Same DisplayItem Same Second
This was seen in the Event Audit History Advisory section:
50,EVENTAUDIT1009W,TEMS,Situations [7] with multiple results at agent with same DisplayItem at same second – see EVENTREPORT005
Sitworld: Event History #4 Conflict Between DisplayItem and Attributes
This was seen in the Event Audit History Advisory section:
90,EVENTAUDIT1014E,TEMS,Situations [1] had DisplayItem configured which was not in results – See report EVENTREPORT024
Sitworld: Event History #3 Lost Events Because DisplayItem has Duplicate Atoms
This was seen in the Event Audit History Advisory section:
65,EVENTAUDIT1011W,TEMS,Situations [1] lost events because DisplayItem has duplicate atoms
Sitworld: Event History #2 Duplicate DisplayItems At Same Second
50,EVENTAUDIT1013W,TEMS,Situations [4] lost [merged] events Multiple Events with same DisplayItem at same TEMS second – see EVENTREPORT004
Sitworld: Event History #1 The Situation That Fired Oddly
We are getting false alert for ESX up time monitoring, it is happening randomly for different servers . … I found Alert get triggered when system up time show values as 4294967295 . it is happening for all those false triggered alert…
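The value 4294967295 quoted above is 2^32 - 1, the largest unsigned 32-bit integer, which is also what a signed -1 looks like when its four bytes are read as unsigned. A small Python sketch of that fact follows. Treating the value as a "no data" marker is an assumption for illustration, not something the excerpt confirms, and the function name is hypothetical:

```python
# 4294967295 == 0xFFFFFFFF == 2**32 - 1 (all 32 bits set).
UINT32_MAX = 2**32 - 1


def signed32_as_unsigned(value):
    """Reinterpret a signed 32-bit integer's bytes as an unsigned 32-bit integer."""
    return int.from_bytes(value.to_bytes(4, "little", signed=True), "little")


def looks_like_sentinel(uptime):
    """Hypothetical check: flag an uptime equal to the all-bits-set value."""
    return uptime == UINT32_MAX


print(UINT32_MAX)                # the value seen in the false alerts
print(signed32_as_unsigned(-1))  # a signed -1 read back as unsigned
```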
Sitworld: Event History Audit
First came from a desire to estimate situation workload impact without any extra tracing. Second arose from a customer case involving a situation that did not fire [a guard dog that didn’t bark!].
Sitworld: TEMS Audit Tracing Guide Appendix
Some of the TEMS Audit advisories and report sections require diagnostic tracing. We will use standard workload tracing for these examples.
Sitworld: ITM 6 Interface Guide Using KDEB_INTERFACELIST
The ITM 6 KDE component discovers the interfaces and creates a list of ip addresses that are suitable for connections. ITM 6 communications usually works very well using default processing. However in complex cases, you must control interface usage using two environment variables:
Sitworld: ITM Agent Historical Data Export Survey
The ITM Agent Historical Data Survey tool reports cases where the historical data export process has experienced a failure.
Sitworld: Discovering Historical Data Export Problems at Agent
Best practice is to collect the historical data at the TEMA or Agent and then export the data to the Warehouse Proxy Agent which then forwards the data to the data warehouse. With a large number of agents almost anything can go wrong and require fixing.
Sitworld: FTO Configuration Audit
On several recent cases, the hub TEMS randomly became inoperative. After long study and diagnostic data collection, the conclusions were that the Fault Tolerant Option [FTO or Hot Standby or Mirror] configuration was incorrect.
Sitworld: Portal Client [TEP] on Windows Using a Private Java Install
Once more a client called when Portal Client [TEP] no longer worked because a new Java RTE was installed by system administrators. This usually takes several days or a week+ to resolve.
Sitworld: TEMS Database Repair
The TEMS database tables are used to store user data such as situation descriptions and distribution definitions. When the files holding the data are damaged, the TEMS usually malfunctions. How are repairs made?
Sitworld: The Encyclopedia of ITM Tracing and Trace Related Controls
ITM tracing is at once the most tiresome of topics and sometimes the most important. This post will collect everything I have collected and discovered over the years. Expect to see many additions and corrections over time.
Sitworld: ITM2SQL Database Utility
I have long envied the TEPS facility called migrate-export, which takes the TEPS database and creates a file of SQL commands to recreate the database. Recently I found a way to accomplish this on a live ITM system using the KfwSQLClient utility.
Sitworld: Policing the Hatfields and the McCoys
John Alvord, IBM Corporation jalvord@us.ibm.com Draft #1 – 2 May 2016 – Level 0.5000 Inspiration One more time I had to explain to a customer that you could not have a situation formula that included more than a single multi-row attribute group. They had a worthy goal: they wanted to test for a missing…
Sitworld: Real Time Detection of Duplicate Agent Names
One more time I worked on a case where ITM misbehaved because some agents used duplicate names. This particular case involved “false alerts” where a situation event was observed – a missing process case on a Linux System. When investigated, the Linux System did have that process running and so it was a false positive alert.
Sitworld: TEPSI Interface Guide
When you use Portal Client sessions which connect to a Portal Server [TEPS] with the default interfaces, everything is simple. When you stray from the default settings things get more complex.
Sitworld: Portal Client Java Web Start JNLP File Cloner
In the next year or so, most ITM Enterprise customers will be converting to Java Web Start Portal Clients. The major browser vendors are abandoning java applets – see Appendix 1 below for three references.
Sitworld: Diagnostic Snapshot Utility
I get called into a lot of unusual diagnostic cases. ITM has a terrific ability to capture detailed diagnostics and that works great for issues that can be recreated. However, rarely occurring conditions, perhaps happening every month or so at random times, are a lot harder to capture.
Sitworld: tacmd logs summary
I recently worked a case where a suspicion arose whether some tacmd functions were causing problems at a hub TEMS. The information was available in a big collection of tacmd diagnostic trace files, like kuiras1_sysadmin_hextime-01.log. However there were thousands of them at the hub TEMS and a manual review was just too much to think about.
Sitworld: Restore Usability to ITCAM YN Custom Situations
Recently an ITCAM YN Agent [IBM Tivoli Composite Application Manager Agent for WebSphere Application] upgrade [to 7.1.0.3.8 or 7.2.0.0.5] caused existing custom situations to be unusable.
Sitworld: TEPS Audit
A recent case involved very slow Portal Client response time and general TEMS instability including failures. After some study the root cause turned out to be duplicate agent names.
Sitworld: Re-re-re-mem-ember Situation Status Cache Growth Analysis
Recently I had two cases where a remote TEMS process size grew and grew and performance was horrible. To speed up analysis of such cases the following project and tool were developed and now anyone can figure out one common case.
Sitworld: Attribute and Catalog Health Survey
Recently I worked with a customer that ran into a rarely seen ITM limit. This customer had added the 513th catalog file and TEMS failed during startup.
Sitworld: ITM Database Health Checker
In Spring 2014 I was tasked to repair a customer database that experienced a newly identified issue. I challenged myself to discover this issue from the TEMS Database files [EIB] alone.
Sitworld: Suppressing Situation Events By Time Schedule
How can situation events be suppressed during specific time schedules?
Sitworld: Alerting on Daylight Savings Time Truants
Twice a year installations need to change the clocks on many systems. Identifying which systems did not have their time changed is often of great interest.
Sitworld: Report on Daylight Savings Time Truants
Twice a year installations need to change the clocks on many systems. This document shows you how to generate a report showing current DST status.
Sitworld: ITM Situation Formula with Calculations
However ITM does not provide a direct way to perform calculations as above or compute with multiple attributes. Situations provide a here and now calculation based on existing attributes.
Sitworld: ITM Agent Census Scorecard
In some installations, the ITM agents are virtually never updated. New systems get new agent levels and old ones are not touched. This increases the cost of ownership since avoidable defects are experienced.
Sitworld: ITM Protocol Usage and Protocol Modifiers
In March 2010, I documented how and why to update the ITM communications string using Protocol Modifiers. I am re-publishing that here with new information to change the SOAP related configurations.
Sitworld: Agent Workload Audit
A large customer had achieved considerable stability and efficiency in the TEMS and TEPS.
They insisted the agents use fewer resources. In some cases agents were consuming 15-25% of a single core or more.
Sitworld: Situation Distribution Report
I am often asked how to discover what agents a situation is distributed to. I’ve been tinkering with this issue. I finally made a first draft.
Sitworld: CPAN Library for Perl Projects
The Sitworld blog posts include a number of Perl projects. Some of them use Comprehensive Perl Archive Network or CPAN packages.
Sitworld: ITM Virtual Table Termite Control Project
I recently worked with a customer who had an unstable TEMS environment. There were 3152 agents and 13 remote TEMS. The workload wasn’t high. The instability manifested as remote TEMS missing heartbeats.
Sitworld: ITM TEMS Health Survey
I was recently asked about a TEMS and SOAP oriented health check for a TEMS environment.
Sitworld: The Situation That Cried Wolf
A team that was using ITM to reduce data center problems came to me for advice. Situations had been written but situation alerts were being ignored by the system administrators.
Sitworld: Statistics After 50,000 Views
The Wonderful World of Situations total blog passed 50,000 views recently. Here are some statistics and pointers to items you might have missed and a list of the ones I consider most important.
Sitworld: *MIN and *MAX – the Little Column Functions That Couldn’t
I started work on the Situation Audit project and verified which column functions forced TEMS filtering. *MIN and *MAX did not create a situation event, even though rows were returned to the TEMS.
Sitworld: Do It Yourself TEMS Table Display
I often need to explain how to view or capture TEMS database tables. This explains ways of doing that.
Sitworld: A Situation By Any Other Name…
A customer was attempting to verify a new situation was working as expected. The operation logs were not showing what he expected.
Sitworld: Running TEMS without SITMON
A customer hub TEMS was crashing on startup. This is a painful moment because that means that monitoring is momentarily impossible.
Sitworld: ITM Situation Audit
The ITM Situation Audit tool performs a static analysis on ITM Situation definitions and creates a report showing which situations are probably filtered at the agent and which are probably filtered at the TEMS the agent reports to.
Sitworld: SOAP Flash Flood
I was working at a customer site with occasional SOAP issues and ITM instability. One Monday afternoon about 1pm the ITM environment went unstable. All eight remote TEMS went offline and thus all agents showed as offline.
Sitworld: Sample EIF Listener project
I was thinking that you could smooth out some ITM rough edges by writing an EIF [Event Integration Facility] listener. You could do extended processing in the listener.
Sitworld: Situation Limits
On December 25, 2013 a sev1 PMR arrived for an unstable remote TEMS. The remote TEMS ran for three hours and then went crazy.
Sitworld: Put Your Situations on a Diet Using Indexed Attributes
During the course of a long sev1 at a large customer site a situation was identified that was delivering 1000+ rows of data every evaluation cycle. That needed investigation and remediation.
Sitworld: Sampled Situations and Until Situations
In 2012 I worked for months on a case where a sampled situation with Until/Situation did not work reliably. In depth tracing showed exactly what was happening.
Sitworld: TEMS Audit Process and Tool
There have been cases every year where a TEMS was running with high enough CPU/Storage resource usage that the customer was concerned. In some cases, the TEMS experienced a steady storage growth and failure after some days.
Sitworld: DetectorRecycler for ITM Windows OS Agent
This is the Windows OS agent version of recovering from high CPU.
Sitworld: ITM and the 1997 Kasparov vs. Deep Blue Chess Match
I was called in for a case where a hub TEMS was showing signs of severe communications overload. The CPU load was not high but the external symptoms were very bad.
Sitworld: ITM Agent Health Survey
The ITM Agent Health Survey tool detects possible cases of agents that show online but are non-responsive.
Sitworld: Sampled Situation Blinking Like a Neon Light
A customer reported having a sampled situation opening and closing constantly. The underlying condition was true for long periods and so the situation event should have been constantly open.
Sitworld: Sampling Interval Limits Time Tests
A customer created a time test on a sampled situation and it was not behaving as expected. After resolving three preliminary issues, I found a hard limit between time tests and sampling interval.
Sitworld: TEMS Audit Advisory Messages
TEMS Audit is a program that summarizes a TEMS diagnostic log when traces have been set to record workload. To make usage easier an advisory section has been added.
Sitworld: Situations Caused Domain Name Server Overload
A customer reported an intense workload to their Domain Name Server [DNS] flowing from systems running Windows/Linux/Unix OS Agents. The workload was heavy and caused response time delays on normal business processes.
Sitworld: Configuring a Stable SOAP Port
A customer reported tacmd login failed. I had them stop all ITM Processes, start up the hub TEMS and then start up the other ITM processes. The customer reported that all was now well but they were very concerned about the interruption to normal operations.
Sitworld: Best Practice TEMS Database Backup and Recovery
I was working with a customer with a TEMS Database File problem. These cases are very rare but the results can be disruptive. The hub TEMS or some remote TEMS cannot start or are running without all situations and other objects.
Sitworld: Action Command Wars – A New Beginning
A customer hub TEMS was unstable, crashing a few hours after startup. The environment needed a fresh look.
Sitworld: Detecting and Recovering from High Agent CPU Usage
In a customer environment of 10,000+ Unix/Linux systems, the OS agents were observed running at high CPU occasionally.
Sitworld: An Efficient Design for Starting a Background Process
A customer wanted to start an agent related background task. The normal crontab tasks were not permitted.
Sitworld: Adding Environmental Data to Action Command Emails
A customer situation was created to detect a dangerously full paging space condition on an AIX system. The customer wanted more information than was in the paging space attributes.
Sitworld: Situation Managing Other Situations
A customer was trying to make a situation work only on some Linux Dell systems. His solution has five major and minor flaws.
Sitworld: Mixed Up Situations
A customer was experiencing a high CPU condition on a remote TEMS. … The second major issue was the result of a situation that used two different attribute groups. For that case I show an alternative solution which reduces the number of result bytes incoming by
Sitworld: Efficient Situation for Two Missing Processes
I was working through a remote TEMS performance problem using TEMS Audit [See extract below]. The #1 and #3 top situation impacters alerted if two different Linux processes were missing.
Sitworld: Getting a Good Nights Sleep
I was working with a customer that had a new processing environment of Windows servers being brought up. They had defined two new remote TEMS and there were 450+ Windows OS Agents reporting. The reported problem was that too few events were arriving at the operations center.
Sitworld: Rational Choices for Situation Sampling Intervals
John Alvord, IBM Corporation jalvord@us.ibm.com Inspiration I was working on a customer remote Tivoli Enterprise Monitoring Server [TEMS] performance problem. The remote TEMS was showing high CPU utilization and was crashing every couple of days. The TEMS Audit report showed that results data from Agents were flowing in at 8 megabytes a minute. That is roughly 10 times…
Sitworld: The Derivative Log Pattern
A user was crafting a situation to alert on a Windows Event Log. The ID attribute was known. He wanted to exclude cases where Description had a certain form. The situation formula was unable to make that exclusion.
Sitworld: Super Duper Situations
Recently a customer asked
How can I find out the real name for the situation _Z_KLOLOGPEVT17
which I see referenced in the TEMS operations log.
Sitworld: MS_Offline – Myth and Reality
An IBM Tivoli Monitoring [ITM] site was hard down and the customer believed that the Quarterly earnings report process was seriously at risk. During a 20 hour long day spent recovering files I discovered that once again a customer had abused some MS_Offline type situations.
Sitworld: Auditing TEMS for Improved Performance
People mostly ignore TEMS performance issues until the TEMS crashes or agents start going wildly online and offline….
Sitworld: ITM Silver Blaze – Agent Responsiveness Checker
I have a great job. People come to me with puzzles and I get paid to investigate. A recent customer had 400+ Solaris systems running Unix OS Agent at ITM 622 FP5 and earlier levels. By chance they identified a single instance of a Unix OS agent that was not running situations. They were naturally worried there could be other cases.
Sitworld: ITM TEMS Stress Tester Experiment
There are times when the TEMS comes under heavy stress. There are sometimes severe outcomes such as a crash, out of storage failures and loss of communication with other TEMSes and agents. Often there are few outward signs before the failure. These sorts of failures are very costly in lost time and support efforts.
Sitworld: ITM Example Perl/SOAP Data Retrieval
The following was rescued from an IBM Technote after a web area was shut down. Question: How should ITM data be retrieved using Perl/SOAP? Answer: SOAP – Best Practice and Model Perl Program. ITM data can be accessed via Perl/SOAP programs. In some recent cases, such programs severely stressed the hub TEMS with prolonged periods…