John Alvord, IBM Corporation
jalvord@us.ibm.com
Inspiration
A customer reported an intense workload on their Domain Name Server [DNS] coming from systems running Windows/Linux/Unix OS Agents. The load was heavy enough to cause response time delays on normal business processes.
Diagnosis
There were 3500+ situations and 1500+ had email action commands. Some 230 of the action commands were set up to send the email on every interval, not just the first instance. Every email command would require 2-4 DNS lookups.
The particular condition was the result of a single Windows system which had situations running that evaluated True. Several of those situations had action commands set to run on every interval. The situations had a DisplayItem configured, and 8-10 emails were sent each interval. The result was about 60 emails a minute at steady state, which was enough to disrupt DNS server performance.
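As a rough check on the numbers above, the DNS load implied by the email rate can be estimated with a little arithmetic. This is a minimal sketch in Python that uses only the figures quoted in this post (about 60 emails a minute, 2-4 DNS lookups per email); the function name is hypothetical and chosen for illustration.

```python
# Rough estimate of the DNS lookup rate caused by the email action commands.
# Inputs are the figures quoted in the post; the function name is hypothetical.

def dns_lookups_per_minute(emails_per_minute, lookups_per_email):
    """Return the DNS lookups per minute generated by email commands."""
    return emails_per_minute * lookups_per_email

EMAILS_PER_MINUTE = 60   # steady-state rate observed from the one Windows system
MIN_LOOKUPS = 2          # low end of lookups per email command
MAX_LOOKUPS = 4          # high end of lookups per email command

low = dns_lookups_per_minute(EMAILS_PER_MINUTE, MIN_LOOKUPS)
high = dns_lookups_per_minute(EMAILS_PER_MINUTE, MAX_LOOKUPS)
print(f"Estimated DNS lookups per minute: {low} to {high}")
```

By this estimate, a single system accounted for a sustained 120 to 240 DNS lookups a minute on top of normal traffic.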
Identification and recovery
The “every interval” situations were identified using the following technique. The general procedure for creating a TEP workspace viewing TEMS objects is documented here.
Review that first and slot in the changes.
The Custom SQL used for this case is:
SELECT
SITNAME,
AUTOSOPT,
CMD
FROM O4SRV.TSITDESC;
In this case we need to add a query filter.
The last check tests for the presence of “mail” in the Command attribute. The Command title doesn’t show in this capture, but it can be seen if you slide the scroll bar.
In the properties, select the “return all rows” option.
The end result is a list of all situations that have a Y in the second character of the AUTOSOPT attribute, which means the action command is performed every interval. The filter also limits the list to action commands that have “mail” in the command line.
The resulting table was exported to a csv file [right click in table and select export…]. Each situation was evaluated, and for most of them the action command setting was changed to “Don’t take action twice in a row”, the default.
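The same two checks can also be re-applied offline to an exported csv file, for example to confirm the list after changes have been made. The following is a minimal sketch in Python, not the method used at the customer site. It assumes the csv header uses the column names from the Custom SQL (SITNAME, AUTOSOPT, CMD); an actual export may use different column titles. The function name and the sample situations are hypothetical, and the “mail” test is made case-insensitive here as an assumption.

```python
import csv
import io

def every_interval_mail_situations(csv_file):
    """Return names of situations that run a mail command on every interval.

    Assumes columns named SITNAME, AUTOSOPT and CMD (an assumption; an
    actual TEP export may label the columns differently).
    """
    hits = []
    for row in csv.DictReader(csv_file):
        autosopt = row.get("AUTOSOPT") or ""
        cmd = row.get("CMD") or ""
        # Second character Y means: perform the action command every interval.
        runs_every_interval = len(autosopt) > 1 and autosopt[1] == "Y"
        # Case-insensitive match on "mail" is an assumption of this sketch.
        sends_mail = "mail" in cmd.lower()
        if runs_every_interval and sends_mail:
            hits.append(row.get("SITNAME", ""))
    return hits

# Hypothetical sample data standing in for an exported csv file.
sample = io.StringIO(
    "SITNAME,AUTOSOPT,CMD\n"
    'demo_disk_full,NYN,"sendmail.sh ops@example.com disk"\n'
    'demo_cpu_high,NNN,"sendmail.sh ops@example.com cpu"\n'
    'demo_log_scan,NYN,"restart_service.sh"\n'
)
result = every_interval_mail_situations(sample)
print(result)
```

With a real export, replace the sample with `open("export.csv", newline="")`, where the file name is a placeholder.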
Results
With this change the number of emails was sharply reduced and the immediate crisis was over.
For a long term solution I suggested they stop using emails to transmit events. Email has a lot more overhead than an event receiver like Netcool/Omnibus. Email cannot transmit the fact that a condition has been resolved. Email is not manageable.
Summary
This documented how a high volume of emails from situations caused a DNS server overload and how the problem cases were identified and resolved.
Note: Enterprising Spider after a Foggy Night