By John Alvord, IBM Corporation
jalvord@us.ibm.com
Inspiration
I was working through a remote TEMS performance problem using TEMS Audit [see extract below]. The #1 and #3 top situation impactors each alerted when two different Linux processes were both missing. If one or the other or both were present, that was OK. Otherwise a situation event would be generated. The remote TEMS was receiving 8 megabytes a minute in results sent from 974 agents. The total from those two situations was 4.2 megabytes received per minute. They were distributed to 605 Linux agents.
The following is taken from a TEMS Audit report for a 21-minute period. Some columns have been deleted for ease of presentation. We will be talking about situations 1/5 and 3/6. The 2/4 case is left for another post.
Rank | Situation | Rows | Result/Min | Fraction | Cumulative% |
1 | XXX_PRCMissing_NSLCD_C | 25543 | 1873153 | 23.62% | 23.62% |
2 | XXX_SYSLoadAvg15Min_C | 51507 | 1836008 | 23.15% | 46.77% |
3 | XXX_PRCMissing_Rpcbind_C | 21132 | 1549680 | 19.54% | 66.31% |
4 | XXX_SYSLoadAvg5Min_C | 14857 | 528568 | 6.66% | 72.97% |
5 | XXX_PRCMissing_NSCD_C | 4600 | 337333 | 4.25% | 77.22% |
6 | XXX_PRCMissing_Portmap_C | 4284 | 314160 | 3.96% | 81.19% |
Overview
The general approach is to review the situation definitions. Next the diagnostic log will be viewed for details coming from one agent. Lastly an alternative approach is suggested which will be more efficient.
One reason I wrote this up is to give a detailed example of a performance audit analysis and how it is resolved. There are a lot of details and this can give you a head start in doing your own analysis.
Business Rule For Condition
If certain Linux systems are not running either nslcd or nscd, then create an alert.
Here is how the issue was originally implemented:
XXX_PRCMissing_NSLCD_C
pdt: *IF *MISSING KLZ_Process.Process_Command_Name *EQ ( ‘nslcd’ )
XXX_PRCMissing_NSCD_C
pdt: *IF *MISSING KLZ_Process.Process_Command_Name *EQ ( ‘nscd’ )
The base situations are used in two functional situations:
XXX_PRCMising_nscd_nslcd_W
XXX_PRCMising_nscd_nslcd_C
Let’s look at one of the 605 Linux OS Agents for 60 seconds. There were 8 results received in one minute. I will ignore the 5 results from other situations:
3 – XXX_PRCMissing_Rpcbind_C
1 – PrimeShift
1 – XXX_CPUSMPBusy_90_W
We will review only the three result rows from XXX_PRCMissing_NSLCD_C. Evidently this is a Linux system which is running nscd and nslcd is missing.
(517EC904.0025-5F5:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540 tbl *.KLZPROC req XXX_PRCMissing_NSLCD_C <1343381874,3977249565> node <xxxgtps2d:LZ>
(517EC91E.000B-5BC:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540 tbl *.KLZPROC req XXX_PRCMissing_NSLCD_C <1341302277,3985638231> node <xxxgtps2d:LZ>
(517EC928.0016-5A8:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540 tbl *.KLZPROC req XXX_PRCMissing_NSLCD_C <1347595196,3979346861> node <xxxgtps2d:LZ>
The results arrived at 0, 26, and 36 seconds into the observed minute. Initially it seems curious that three results arrive when the “nslcd” process is not running. One would expect at most one.
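The arrival offsets can be checked from the log lines themselves. Here is a minimal sketch, assuming the leading eight hex digits of each trace line are a timestamp in seconds. The helper name arrival_offsets is my own for illustration and is not part of any TEMS tooling. The log lines are trimmed to their prefix.

```python
import re

# Prefixes of the three trace lines shown above (trimmed after the row size).
LOG_LINES = [
    '(517EC904.0025-5F5:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540',
    '(517EC91E.000B-5BC:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540',
    '(517EC928.0016-5A8:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 1 rows sz 1540',
]

# Assumption: the first 8 hex digits after "(" are a timestamp in seconds.
STAMP = re.compile(r"^\(([0-9A-Fa-f]{8})\.")


def arrival_offsets(lines):
    """Return each line's arrival time in seconds relative to the first line."""
    stamps = [int(STAMP.match(line).group(1), 16) for line in lines]
    first = stamps[0]
    return [s - first for s in stamps]


print(arrival_offsets(LOG_LINES))  # [0, 26, 36]
```

Each row is 1540 bytes, so this one agent contributes 3 * 1540 = 4620 bytes per minute for a condition that is not even an alert.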
The XXX_PRCMissing_NSLCD_C situation has no distribution. However two functional situations made use of that base situation. These other situations had distributions to MSLs that included this specific agent:
XXX_PRCMising_nscd_nslcd_W
XXX_PRCMising_nscd_nslcd_C
At the agent the two situations start up. If not otherwise started, the base situations will also be automatically started.
XXX_PRCMissing_NSLCD_C
XXX_PRCMissing_NSCD_C
The XXX_PRCMissing_NSCD_C situation never sends results on this agent because nscd is running, so its *MISSING test is never true. However each of the other three situations will send results.
XXX_PRCMissing_NSLCD_C
XXX_PRCMising_nscd_nslcd_W
XXX_PRCMising_nscd_nslcd_C
The situation XXX_PRCMissing_NSLCD_C sends results because that is what started situations do at the agent. A diagnostic trace of the remote TEMS would show the results are discarded because they are not associated with any task.
The two situations are of this form
pdt: *IF *SIT XXX_PRCMissing_NSCD_C *EQ *TRUE *AND
*SIT XXX_PRCMissing_NSLCD_C *EQ *TRUE
Results are sent to TEMS because this form of situation always requires TEMS evaluation.
The situation XXX_PRCMising_nscd_nslcd_W had the identical formula and was in fact an experiment which was not supposed to be running. It was stopped immediately.
That explains the 3 results being sent.
There were 605 Linux OS Agents. If they were all running the situations and also the nscd process, the number of rows in 21 minutes would be
3*605*21 = 38115 rows
Now some Linux OS Agents run nscd and some run nslcd. The sum of the two instance rows in the audit extract totals 30143. This is smaller because the situations are not distributed to all Linux OS Agents. Those result rows total 2.2 megabytes per minute, or 28% of the total.
This explains the results observed. The existing situations require 2.2 megabytes per minute to figure out if both processes are missing.
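The arithmetic can be reproduced directly from the audit extract. This is a small sketch, assuming the Result/Min column is bytes per minute. The variable names are mine.

```python
# Upper bound: every agent sends 3 result rows per minute for 21 minutes.
AGENTS = 605
MINUTES = 21
RESULTS_PER_AGENT_PER_MINUTE = 3
upper_bound_rows = RESULTS_PER_AGENT_PER_MINUTE * AGENTS * MINUTES

# Values copied from the TEMS Audit extract (ranks 1 and 5).
audit = {
    "XXX_PRCMissing_NSLCD_C": {"rows": 25543, "bytes_per_min": 1873153, "pct": 23.62},
    "XXX_PRCMissing_NSCD_C": {"rows": 4600, "bytes_per_min": 337333, "pct": 4.25},
}

observed_rows = sum(s["rows"] for s in audit.values())
bytes_per_min = sum(s["bytes_per_min"] for s in audit.values())
pct_of_total = sum(s["pct"] for s in audit.values())

print(upper_bound_rows)            # 38115
print(observed_rows)               # 30143
print(bytes_per_min)               # 2210486, about 2.2 megabytes per minute
print(round(pct_of_total, 2))      # 27.87, about 28% of the total
```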
Alternative Situation
First make the situation name self-explanatory, like this: XXX_NSxCD_MISSING_C
Make the formula be
Process Filter == ‘.*(nscd|nslcd).*’ AND MISSING(Command Line) == (‘*’)
Skip this if you are familiar with Regular expressions. I am a novice after 15 years of creating them. Regular expression detail…
.* Match any number of characters
( Match for alternative strings and place in capture buffer
nscd Match “nscd” and place in capture buffer
| Alternatively…
nslcd Match “nslcd” and place in capture buffer
) End of capture buffer
.* Match any number of characters
When a match is made, the capture buffer is placed into the Command Line attribute.
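To see the capture behavior outside the agent, here is a short sketch using Python's re module. It assumes Python's regular expression semantics are close enough to the agent's for this simple pattern, which I have not verified. The function name captured_name and the sample command lines are made up for illustration.

```python
import re

# Same pattern as the Process Filter above.
PATTERN = re.compile(r".*(nscd|nslcd).*")


def captured_name(command_line):
    """Return the text in the capture group, or None when nothing matches."""
    match = PATTERN.fullmatch(command_line)
    return match.group(1) if match else None


# Hypothetical command lines, for illustration only.
for cmd in ["/usr/sbin/nscd", "/usr/sbin/nslcd", "/usr/sbin/sshd -D"]:
    print(cmd, "->", captured_name(cmd))
```

The first two command lines yield nscd and nslcd respectively, and the third yields None because neither alternative appears in it.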
When the nscd or nslcd process is present a result row will be included. The MISSING value of asterisk [*] means match any string. When either process is present, MISSING will not be true and so no results are sent. When neither process is present, the MISSING will create an alert. There is no DisplayItem, but the name of the situation explains the issue.
When things are normal, the remote TEMS will receive no data at all. Thus we have reduced the impact of the situation from 2.2 megabytes per minute to zero on the average.
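The logic of the alternative situation can be modeled in a few lines. This is only a sketch of the behavior described above, not how the agent is implemented: filter the process list with the regular expression, then fire only when no row survives the filter. The function name situation_fires and the sample process lists are assumptions of mine.

```python
import re

FILTER = re.compile(r".*(nscd|nslcd).*")


def situation_fires(process_command_lines):
    """Model of MISSING: true only when no process row matches the filter."""
    matching_rows = [c for c in process_command_lines if FILTER.fullmatch(c)]
    return len(matching_rows) == 0


# Hypothetical process lists for four agents.
only_nscd = ["/sbin/init", "/usr/sbin/nscd"]
only_nslcd = ["/sbin/init", "/usr/sbin/nslcd"]
both = ["/sbin/init", "/usr/sbin/nscd", "/usr/sbin/nslcd"]
neither = ["/sbin/init", "/usr/sbin/sshd -D"]

for name, procs in [("only nscd", only_nscd), ("only nslcd", only_nslcd),
                    ("both", both), ("neither", neither)]:
    print(name, "->", "alert" if situation_fires(procs) else "no result sent")
```

Only the last case produces an alert, which matches the business rule: one or the other or both present is OK.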
If there is a need for both a Critical [_C] and a Warning [_W] for different end users, the situation can just be cloned. Since there is no result workload under normal conditions, it really doesn’t matter if two are running.
Warning: the expression ‘.*(nscd|nslcd).*’ might not be enough to distinguish the processes uniquely. You might need to add additional expressions to the string or another attribute like
Process Filter == ‘.*(nscd|nslcd).*’ AND
Process Parent ID == 1 AND
MISSING(Command Line) == (‘*’)
Do thorough testing before settling on a final test!!
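Here is one way to test a candidate expression offline before trying it on an agent. The sketch below uses Python's re module, and the sample command lines are invented. The TIGHT pattern is a hypothetical alternative of my own, written in Python syntax. I have not verified that the agent accepts it, and it is still only a heuristic.

```python
import re

LOOSE = re.compile(r".*(nscd|nslcd).*")
# Hypothetical tighter pattern: the name must start the command or follow "/",
# and must be followed by whitespace or the end of the command line.
TIGHT = re.compile(r"(^|/)(nscd|nslcd)(\s|$)")

# Invented command lines: two real daemons and two look-alikes.
samples = [
    "/usr/sbin/nscd",
    "/usr/sbin/nslcd",
    "vi /etc/nscd.conf",
    "grep nscd /var/log/messages",
]

for cmd in samples:
    loose_hit = LOOSE.fullmatch(cmd) is not None
    tight_hit = TIGHT.search(cmd) is not None
    print(f"{cmd!r}: loose={loose_hit} tight={tight_hit}")
```

The loose pattern matches all four lines, so an editor session on nscd.conf would hide a missing daemon. The tighter pattern matches only the first two.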
Parallel case – Business Rule
If certain Linux systems are not running either rpcbind or portmap, then create an alert.
This is exactly parallel to the first case. Create a situation
XXX_rpcbind_OR_portmap_MISSING_C
Use the formula
Process Filter == ‘.*(rpcbind|portmap).*’ AND MISSING(Command Line) == (‘*’)
It will have exactly the same effect for firing when both are missing.
Summary
The TEMS Audit report showed several situations generating a heavy load of result data.
A situation of the type “at least one process running” on Linux/Unix can be run far more efficiently, sending no results at all under normal conditions. This document shows you how.
Sitworld: Table of Contents
Note: My Two Burmese Kittens in a Bed over the HP Printer – NOT Missing.