Sitworld: Sampled Situation Blinking Like a Neon Light


John Alvord, IBM Corporation

jalvord@us.ibm.com

Follow on Twitter

Inspiration

A customer reported having a sampled situation opening and closing constantly. The underlying condition was true for long periods and so the situation event should have been constantly open.

That sounded to me like a blinking neon light in a tavern.

Overview

The situation in question had this formula

*IF *VALUE File_Information.Path *IN (

'/opt/appdata/commons/PROD/usertracking',

'/opt/appdata/commons/PROD/usertracking/out',

'/opt/appdata/commons/PROD/saalt',

'/opt/appdata/NFSMQHA/mqm/wmb_file_services/PRD1/salsvc/internal',

'/opt/appdata/NFSMQHA/mqm/wmb_file_services/PRD2/salsvc/internal',

'/opt/appdata/NFSMQHA/mqm/wmb_file_services/PRD1/salsvc/internal/out',

'/opt/appdata/NFSMQHA/mqm/wmb_file_services/PRD2/salsvc/internal/out',

'/opt/appdata/NFSMQHA/mqm/wmb_file_services/PRD1/loggingservice/out',

'/opt/appdata/NFSMQHA/mqm/wmb_file_services/PRD2/loggingservice/out')

*AND *TIME File_Information.Last_Changed_Time *LT 'Local_Time.Timestamp - 24H'

*AND *VALUE File_Information.File *IN ('out', 'in', 'mqsitransit')

The sampling interval was 1 minute.

Using the TEMS Audit workload analysis diagnostic log I could see there were three major contributors to the remote TEMS workload. The misbehaving situation was the third contributor, at 22% of the total. From the diagnostic log I could see that just two agents were running this situation, and every minute the agents were sending a combined 83 rows of 4184 bytes each, roughly 350K bytes per minute. That rate is substantial: 70% of the nominal stability target of 500K bytes per minute.
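
As a quick check on those numbers, here is a back-of-the-envelope sketch in Python, using only the row count, row size, and nominal 500K per minute target quoted above:

# Back-of-envelope arithmetic for the blinking situation's data rate
rows_per_minute = 83        # combined rows from the two agents each minute
row_size = 4184             # bytes per File_Information (FILEINFO) row
target = 500000             # nominal stability target, bytes per minute

bytes_per_minute = rows_per_minute * row_size
print(bytes_per_minute)                            # 347272 -- roughly 350K bytes per minute
print(round(100.0 * bytes_per_minute / target))    # 69 -- about 70% of the target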

The situation was using *TIME, which means there was very little filtering at the agent. Essentially every single file in the named directories was being sent to the TEMS every minute. I have seen cases with thousands of files and thus thousands of rows each sampling interval. That condition always results in TEMS instability. I will be writing more about TEMS filtering in the future.

Here is a basic rule: simple situations are filtered at the agent; others are filtered at the TEMS, and that makes all the difference. TEMS Audit will show any specific case. I know sites that will not commit a new situation to production until a TEMS Audit report shows the situation has low impact.

Even so, this single situation's workload alone would not explain the unusual behavior. We needed to look at the other two major contributors.

Workload analysis

The #1 situation impacting workload [36% of total] had this formula

Formula:

*IF ( ( *VALUE File_Information.Path *EQ '/opt/appdata/IBML_Output1/PROD_PIT/ImportSource/IQ-PIT' *AND

*SCAN File_Information.File *EQ .eop *AND

*TIME File_Information.Last_Changed_Time *LT 'Local_Time.Timestamp - 20M' ) *OR

( *VALUE File_Information.Path *EQ '/opt/appdata/IBML_Output2/PROD_PIT/ImportSource/IQ-PIT' *AND

*SCAN File_Information.File *EQ .eop *AND

*TIME File_Information.Last_Changed_Time *LT 'Local_Time.Timestamp - 20M' ) *OR

( *VALUE File_Information.Path *EQ '/opt/appdata/IBML_Output3/PROD_PIT/ImportSource/IQ-PIT' *AND

*SCAN File_Information.File *EQ .eop *AND

*TIME File_Information.Last_Changed_Time *LT 'Local_Time.Timestamp - 20M' ) *OR

( *VALUE File_Information.Path *EQ '/opt/appdata/IBML_Output4/PROD_PIT/ImportSource/IQ-PIT' *AND

*SCAN File_Information.File *EQ .eop *AND

*TIME File_Information.Last_Changed_Time *LT 'Local_Time.Timestamp - 20M' )

)

The sampling interval was 15 minutes. The situation was looking in four separate directories for a specific file, and if the file was more than 20 minutes late a situation event should be created.

Any use of *TIME means no filtering is done at the agent. Here is an example of the trace data:

(51E6C99B.0008-718:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 4008 rows sz 4184 tbl *.FILEINFO req IBM_UX_IBML_0_IQPITEOPFILES …

(51E6CD1F.0004-5F8:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 4008 rows sz 4184 tbl *.FILEINFO req IBM_UX_IBML_0_IQPITEOPFILES …

The returned results were just under 16 megabytes. There is a hard-coded limit of 16 megabytes on results returned from an agent. That limit applies both to real-time data requests and to situation results filtered at the TEMS. The log entries show the formula was collecting a lot of unfiltered data [because of the *TIME tests] and was delivering just under 16 megabytes. Anything beyond 16 megabytes was not returned to the TEMS for analysis.
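
The trace numbers line up with that limit. A minimal sketch of the arithmetic, assuming the cap is 16*1024*1024 bytes:

# How many 4184-byte FILEINFO rows fit under a 16 megabyte result cap?
row_size = 4184
limit = 16 * 1024 * 1024    # 16,777,216 bytes (assumed binary megabytes)

print(limit // row_size)    # 4009 -- the most rows of this size that fit under the cap
print(4008 * row_size)      # 16769472 -- the 4008 rows seen in the trace, just under 16 megabytes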

This was the worst of both worlds: the TEMS was seriously impacted and some events might be ignored.

A last problem with the situation is that when a time test is used in a sampled situation, the evaluation (sampling) interval must be less than one half of the time test. See this blog post for the details. Here the time test was 20 minutes and the sampling interval was 15 minutes, so even if the other issues did not cause problems there would likely be false alerts.

The #2 situation impacting workload [28% of total] had this formula

Formula:

*IF *COUNT Process.Base_Command *EQ '3' *AND

*VALUE Process.Base_Command *EQ 'opmn'

The sampling interval was 30 seconds.

Any use of *COUNT means that all processes are sent to the TEMS with no filtering. Here is an example of unfiltered data being returned:

(51E6C677.0000-118C:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 80 rows sz  2728 tbl *.UNIXPS req IBM_UX_ORCL_0_OID2Availability …

(51E6C67A.0001-1088:kpxrpcrq.cpp,781,"IRA_NCS_Sample") Rcvd 81 rows sz  2728 tbl *.UNIXPS req IBM_UX_ORCL_0_OID2Availability …

The sampling interval of 30 seconds means that roughly 880K bytes per minute were arriving at the TEMS.
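
A rough sketch of that arithmetic, assuming (as the two distinct trace entries suggest) two agents each reporting about 80 rows per sample:

# Rough data rate for the #2 situation (two agents assumed, 30 second interval)
rows_per_sample = 80        # Process (UNIXPS) rows per sample, per agent
row_size = 2728             # bytes per row
samples_per_minute = 2      # 30 second sampling interval
agents = 2                  # assumption based on the two trace entries

print(rows_per_sample * row_size * samples_per_minute * agents)   # 872960 -- roughly 880K bytes per minute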

Blinking Neon Situation Events

Everyone should know that only the [usually remote] TEMS creates a situation event. Few people know that if no agent results are received for 3 sampling intervals of elapsed time, the TEMS will automatically generate a close event. When the sampling interval is low and the TEMS workload is high, the result is unexplained closures. See this post for general advice on choosing a sampling interval.

In this case, the heavy load of 16-megabyte incoming result chunks was enough to monopolize the Agent Proxy and TEMS Dataserver [SQL] for many minutes. After 3 minutes with no results, the Agent Proxy closed the situation event. When new results arrived – right after the monster 16-megabyte result chunks were processed – they would be evaluated and the situation event would reopen.
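
To make the blinking concrete, here is a minimal sketch of that close-after-three-intervals rule. This is not the actual TEMS code, and the arrival times are invented for illustration; it simply shows how a continuously true condition opens and closes repeatedly when result processing stalls behind the huge competing chunks.

# Sketch of the automatic close rule for a sampled situation (illustrative only)
SAMPLING_INTERVAL = 60                   # seconds -- the 1 minute situation
NO_DATA_CLOSE = 3 * SAMPLING_INTERVAL    # close after 3 intervals with no results

# Seconds at which this situation's results were actually processed at the TEMS;
# the long gaps represent time spent digesting the 16 megabyte result chunks.
processed_at = [0, 60, 120, 420, 480, 900]    # hypothetical timeline

event_open = False
last_seen = None
for t in processed_at:
    if event_open and t - last_seen > NO_DATA_CLOSE:
        print(last_seen + NO_DATA_CLOSE, "no results for 3 intervals -> event closed")
        event_open = False
    if not event_open:
        print(t, "results show condition true -> event opened")
        event_open = True
    last_seen = t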

ITM Science Lab

There are cases where a formula like #2 with *COUNT can do some filtering at the agent. This involves an undocumented ITM aspect called Indexed Attributes, which I wrote about here: Sitworld: Put Your Situations on a Diet Using Indexed Attributes. Even so, the uses are relatively limited. For example, any *VALUE test cannot use wildcards. I will write more about this in the future.

Customer results

The top situation was simply stopped. Most of the other top situations were changed to 15-minute sampling intervals. With those changes the situation event just stayed open, as expected, when the condition was true.

The customer began work on an Agent Builder agent in which agent-side logic would do the time tests and avoid the terrible impact on the remote TEMS. I expect a clever Linux/Unix command line could be used to periodically move the older files to a separate directory, with a second situation that only had to work with a few files already known to be old. However, an Agent Builder agent is clearly the best solution overall.
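
As a sketch of that interim idea (hypothetical, using one of the monitored paths from the formula above and an invented side directory), a small script run from cron could move files older than 24 hours aside so the situation only has to examine a handful of already-old files:

#!/usr/bin/env python3
# Move files older than 24 hours into a small side directory (illustrative sketch)
import os, shutil, time

WATCH_DIR = "/opt/appdata/commons/PROD/usertracking"        # one monitored directory
AGED_DIR = "/opt/appdata/commons/PROD/usertracking_aged"    # invented side directory
MAX_AGE = 24 * 60 * 60                                      # 24 hours in seconds

os.makedirs(AGED_DIR, exist_ok=True)
now = time.time()
for name in os.listdir(WATCH_DIR):
    path = os.path.join(WATCH_DIR, name)
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE:
        shutil.move(path, os.path.join(AGED_DIR, name))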

Controlling the Automatic Close

A TEMS environment variable controls how long to wait before automatically closing a sampled situation event. SECS is the number of seconds to wait before the automatic close:

IRA_MIN_NO_DATA_WAIT_TIME=SECS
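
For example, a purely illustrative setting of ten minutes would look like this:

IRA_MIN_NO_DATA_WAIT_TIME=600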

It is not a great solution: the setting applies to all situations, and the issue is almost always created by

  1. An unwise selection of sampling interval
  2. A remote TEMS that is severely overloaded

Both (1) and (2) should be corrected before resorting to this TEMS configuration control.

Summary

This explained a curious case of a situation event that opened and closed frequently. Weird and unexpected results often occur when an ITM component is overloaded. ITM is a real-time processing system and behaves best when running with a light load.

Sitworld: Table of Contents

Photo Note: Two Barn Swallow Fledglings in the Morning Fog

2013 was the fourth year we have had barn swallows nesting in our front deck eaves.

 
