Sitworld: The Situation That Cried Wolf


John Alvord, IBM Corporation

jalvord@us.ibm.com


Introduction

A team that was using ITM to reduce data center problems came to me for advice. In a large environment, Windows disks and Linux/Unix mount points were low on free space and critical applications were failing. Situations had been written but situation alerts were being ignored by the system administrators.

The blog title references the classic story The Boy Who Cried Wolf and imagines a situation that created so many false positives that people ignored the warnings.

Background

The client team had already determined that there were too many false positives. Many of the disks were large, and 10% free or even 5% free was no longer a critical condition. Because a percentage threshold did not actually predict imminent failure, there were many false positives: cases where situations would alert but there was no actual problem. That was exactly why the system administrators could not make effective use of the information.

The client team had devised a more thoughtful plan, which I reproduce here:

Category   Low limit    High limit   Alert threshold
A          >= 0         < 5 GB       120 MB
B          >= 5 GB      < 20 GB      500 MB
C          >= 20 GB     < 100 GB     2 GB
D          >= 100 GB    < 1 TB       10 GB
E          >= 1 TB      unlimited    25 GB

They had created situation formulas for Windows, Unix and Linux. All were pretty much the same. Here is the Linux version, laid out for clarity:

IF (
    ( *VALUE KLZ_Disk.FS_Type *IN ( 'ext','ext2','ext3','ext4' ) *AND
      *VALUE KLZ_Disk.Size *LT 5120 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 120
    ) *OR (
      *VALUE KLZ_Disk.FS_Type *IN ( 'ext','ext2','ext3','ext4' ) *AND
      *VALUE KLZ_Disk.Size *GE 102400 *AND
      *VALUE KLZ_Disk.Size *LT 1024000 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 10240
    ) *OR (
      *VALUE KLZ_Disk.FS_Type *IN ( 'ext','ext2','ext3','ext4' ) *AND
      *VALUE KLZ_Disk.Size *GE 1024000 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 25600
    ) *OR (
      *VALUE KLZ_Disk.FS_Type *IN ( 'ext','ext2','ext3','ext4' ) *AND
      *VALUE KLZ_Disk.Size *GE 20480 *AND
      *VALUE KLZ_Disk.Size *LT 102400 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 2048
    ) *OR (
      *VALUE KLZ_Disk.FS_Type *IN ( 'ext','ext2','ext3','ext4' ) *AND
      *VALUE KLZ_Disk.Size *GE 5120 *AND
      *VALUE KLZ_Disk.Size *LT 20480 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 500
    )
)

The general idea was excellent, but it needed fine tuning.
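To make the tiered logic easier to check offline, here is a rough Python sketch of the same decision. It assumes that Size and Disk_Free are reported in MB, which is what the numbers in the formula suggest. The names TIERS, alert_threshold_mb and should_alert are mine, for illustration only; nothing here is part of ITM.

```python
# Tier limits and alert thresholds in MB, copied from the situation formula.
# Each entry: (exclusive upper size limit, alert when free space <= threshold).
TIERS = [
    (5120, 120),            # A: under 5 GB
    (20480, 500),           # B: 5 GB up to 20 GB
    (102400, 2048),         # C: 20 GB up to 100 GB
    (1024000, 10240),       # D: 100 GB up to the formula's 1 TB limit
    (float("inf"), 25600),  # E: everything larger
]

EXT_TYPES = ("ext", "ext2", "ext3", "ext4")


def alert_threshold_mb(size_mb):
    """Return the free-space alert threshold (MB) for a disk of size_mb."""
    for upper_limit, threshold in TIERS:
        if size_mb < upper_limit:
            return threshold
    raise ValueError("size_mb is not a comparable number")


def should_alert(fs_type, size_mb, free_mb):
    """Mirror the formula: ext file system with free space at or below its tier threshold."""
    return fs_type in EXT_TYPES and free_mb <= alert_threshold_mb(size_mb)


# A 200 GB ext4 file system alerts once free space drops to 10 GB or less.
print(should_alert("ext4", 204800, 10240))
print(should_alert("ext4", 204800, 10241))
```

Running a handful of real disk sizes through a sketch like this is a cheap way to confirm that the five clauses cover every size exactly once.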

Fine Tuning Step 1

Any time you see *IN combined with *ANDs and *ORs, you should worry about "Filter object too big" cases. See Situation Limits. As the situation starts up, the compiled WHERE clause binary objects [plan and pool] might be too large to transmit to the agent. When that happens there is no agent filtering, and TEMS filtering can vastly increase the TEMS workload. The result is often a performance disaster. I tested a mock-up of the situation and saw this message in the TEMS diagnostic log:

(5345637B.0000-D:kpxreqds.cpp,1814,"buildThresholdsFilterObject") Filter object too big (46688 + 8766),Table KLZDISK Situation IBM_test_disk.

The logical alternative is to switch from the *IN test to the substring test. That means the file system type must have “ext” as the first three characters. That makes sense for this case. The *STR construction uses much less WHERE clause space and is also an Agent filtered column function so the entire situation causes no work at the TEMS unless there is a current issue.

IF (
    ( *STR KLZ_Disk.FS_Type *EQ 1,ext *AND
      *VALUE KLZ_Disk.Size *LT 5120 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 120
    ) *OR (
      *STR KLZ_Disk.FS_Type *EQ 1,ext *AND
      *VALUE KLZ_Disk.Size *GE 102400 *AND
      *VALUE KLZ_Disk.Size *LT 1024000 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 10240
    ) *OR (
      *STR KLZ_Disk.FS_Type *EQ 1,ext *AND
      *VALUE KLZ_Disk.Size *GE 1024000 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 25600
    ) *OR (
      *STR KLZ_Disk.FS_Type *EQ 1,ext *AND
      *VALUE KLZ_Disk.Size *GE 20480 *AND
      *VALUE KLZ_Disk.Size *LT 102400 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 2048
    ) *OR (
      *STR KLZ_Disk.FS_Type *EQ 1,ext *AND
      *VALUE KLZ_Disk.Size *GE 5120 *AND
      *VALUE KLZ_Disk.Size *LT 20480 *AND
      *VALUE KLZ_Disk.Disk_Free *LE 500
    )
)
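The substring test used above amounts to a prefix check on the file system type. The following Python sketch approximates that behavior so you can see which types match; str_eq_at is a made-up helper name, and the 1-based position is my reading of the "1,ext" operand.

```python
def str_eq_at(value, position, text):
    """Approximate '*STR attribute *EQ position,text' with a 1-based position."""
    start = position - 1
    return value[start:start + len(text)] == text


# The four ext types match; other file system types do not.
for fs_type in ["ext", "ext2", "ext3", "ext4", "xfs", "nfs"]:
    print(fs_type, str_eq_at(fs_type, 1, "ext"))
```

Note that a prefix test is slightly broader than the *IN list: any file system type that begins with "ext" will match, which is acceptable in this case.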

Fine Tuning Step 2

Before committing the situation to production some efficiency testing is needed. We must avoid false positives and that requires testing.

Set the situation up with a reasonable distribution. Do not yet set EIF so the production administrators will not be bothered. Pick a remote TEMS you know the agents will be reporting to. Set the needed tracing:

1) tacmd login …

2) tacmd settrace -m <temsname> -p KBB_RAS1 -o 'error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)'

3) start the situation

4) wait fifteen minutes.

5) tacmd settrace -m <temsname> -p KBB_RAS1 -r

  [The above assumes ITM 623 FP2 or later, or ITM 630 GA or later. If not, use the service console.]

The diagnostic trace messages of interest look like this:

(5021834F.004F-32:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 2 rows sz 648 tbl *.WTLOGCLDSK req xxx_dsp_3ntw_stl_lex_std <3614445445,1328547661> node <xxx_uss1uap056amp3b:NT>

That means that 2 result rows were received, size 648 bytes each. The attribute group table was WTLOGCLDSK (Windows Logical Disk). The situation name was xxx_dsp_3ntw_stl_lex_std and the node sending the results was xxx_uss1uap056amp3b:NT.
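If you have many of these trace lines, a small script can pull the fields out for you. This is only a sketch built around the one sample line shown above; the pattern and the dictionary keys (rows, size, table, situation, node) are my own labels, and other ITM levels may format the line differently.

```python
import re

# Pattern for the "Rcvd N rows" trace line; derived from a single sample line.
RCVD_PATTERN = re.compile(
    r"Rcvd (?P<rows>\d+) rows sz (?P<size>\d+) "
    r"tbl (?P<table>\S+) req (?P<situation>\S+) "
    r"<[^>]*> node <(?P<node>[^>]+)>"
)


def parse_rcvd(line):
    """Return the interesting fields of one trace line, or None if it does not match."""
    match = RCVD_PATTERN.search(line)
    if match is None:
        return None
    return {
        "rows": int(match.group("rows")),
        "size": int(match.group("size")),
        "table": match.group("table"),
        "situation": match.group("situation"),
        "node": match.group("node"),
    }


sample = (
    '(5021834F.004F-32:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 2 rows sz 648 '
    "tbl *.WTLOGCLDSK req xxx_dsp_3ntw_stl_lex_std "
    "<3614445445,1328547661> node <xxx_uss1uap056amp3b:NT>"
)
print(parse_rcvd(sample))
```

Filtering the parsed lines on the situation name and counting how often each row count occurs gives the kind of summary used in the efficiency testing.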

For an efficient situation you should see a lot of zero-row results and a few results with a few rows each. On the next cycle, you will see just the non-zero result rows resent. If you see many non-zero row results, the situation is not calibrated to report exceptional conditions [or conditions are really really bad!!]. If your results show samples with many result rows, the situation is not filtering much at the agent, which is an extreme form of inefficiency.

Efficiency testing results on Unix mount points

The client summarized the testing on Unix mount points like this:

Unix HDD Situation

(15-minute time frame for testing)

(377 total samples)

(Situation time interval 5 minutes)

(Situation distributed on 242 servers/agents)

Rows   Samples
7        6
6        4
5       10
4       16
3       50
2       88
1      112
0       91

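A well-designed situation running in a well-tuned environment will show mostly 0-row cases, indicating no problems. This analysis shows a problem with either the situation or the environment. Because only non-zero results are resent on later cycles, each of the 91 zero-row samples should correspond to a different system, so only 91 of 242, or 38%, of the systems have no problems.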
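The arithmetic behind that percentage can be checked with a few lines of Python. The row counts come from the client's summary; treating each zero-row sample as one distinct agent is an assumption that rests on zero-row results not being resent.

```python
# Rows per sample -> number of samples, from the client's summary.
samples_by_rows = {7: 6, 6: 4, 5: 10, 4: 16, 3: 50, 2: 88, 1: 112, 0: 91}
agents = 242

total_samples = sum(samples_by_rows.values())
clean_agents = samples_by_rows[0]          # assumption: one zero-row sample per agent
troubled_agents = agents - clean_agents

clean_pct = round(100 * clean_agents / agents)
troubled_pct = round(100 * troubled_agents / agents)

print(f"total samples: {total_samples}")
print(f"agents with no problems: {clean_agents} of {agents} ({clean_pct}%)")
print(f"agents reporting rows:   {troubled_agents} of {agents} ({troubled_pct}%)")
```

The total matches the 377 samples in the summary, and the two percentages come out at 38% and 62%.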

The issue might be a situation formula that needs to be tightened up, so instead of alerting on 10% free it should alert on 5% free. Remember that the system administrators running the servers have to go and FIX things. If 62% of the servers need to be fixed, that is a tremendous workload to lay on them from the start. If many are false positives they will rightly rebel. Perhaps you need to start with a formula that alerts on just 5% or 10% of the servers. When those are fixed you can change the formula controls until it makes sense.

The issue might be a condition where your formula is the correct one, but a lot of servers are in serious trouble. In that case dive in and start to fix things. That could include balancing workload and getting more storage space in this case.

At the end of the day, you want situations to alert on exceptional conditions to avoid outages. During implementation, you will need to experiment and discover what the correct formula is that resolves outages but creates a sustainable workload for the system administrators.
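One way to approach that experiment is to work out, from data you already have, how many servers each candidate threshold would alert on. The sketch below does this for a percent-used threshold. The sample numbers are made up, and the function names are hypothetical helpers rather than anything supplied by ITM.

```python
def fraction_alerting(worst_used_pct, threshold):
    """Fraction of servers whose fullest file system is at or above threshold."""
    alerting = sum(1 for used in worst_used_pct if used >= threshold)
    return alerting / len(worst_used_pct)


def most_sensitive_threshold(worst_used_pct, candidates, max_fraction):
    """Lowest candidate threshold that keeps the alerting fraction within max_fraction."""
    for threshold in sorted(candidates):
        if fraction_alerting(worst_used_pct, threshold) <= max_fraction:
            return threshold
    return None  # even the strictest candidate alerts on too many servers


# Made-up data: percent used on the fullest file system of ten servers.
worst_used_pct = [50, 70, 91, 92, 95, 96, 97, 98, 99, 100]

for threshold in (90, 96, 98):
    print(threshold, fraction_alerting(worst_used_pct, threshold))

print(most_sensitive_threshold(worst_used_pct, [90, 96, 98], max_fraction=0.35))
```

With this made-up data the 90% threshold alerts on most of the servers, while 98% keeps the share under the chosen limit, so 98 is the value returned. As servers get fixed, rerunning the check shows when a lower threshold becomes sustainable.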

Client Implementation Plan

The implementation plan [lightly edited] was:

1) Currently there are many occurrences of file systems over the threshold (90%) in our environment, and most system administrators are ignoring the alerts. In this case, we can temporarily tighten the threshold, as mentioned, to 98% (including 98, 99 and 100) for a week, and work on educating the sysadmins so that they stop ignoring the notification alerts. This will reduce the high number of occurrences, preparing the field before the full implementation of the new formula.

2) During this week, we can temporarily stop the warning notifications, to concentrate only on the critical situation alerts.

3) After the new formula is implemented, we can reactivate the warning alerts, but with a totally new threshold, for example 90%, so that they make more sense to the sysadmins.

4) If needed, we can add a second week, and relax the threshold a little, for example to 96% (including 96, 97, 98, 99 and 100).

[My summary:  implement the new alerts gradually.]

Results Two Months Later

Two months later I asked about final results and the client responded:

“It is very effective. Now all the disk situations triggered from ITM are useful for sysadmins and they do not ignore them anymore. They are also happy that false positives (not from ITM but from a business case point) does not arise now.”

Summary

This document presents a practical and successful implementation plan for a severe business problem.

Sitworld: Table of Contents

Photo Note: Humming Bird and Barn Swallow, Big Sur Early Summer 2013
