Sitworld: Mixed Up Situations


By John Alvord, IBM Corporation

jalvord@us.ibm.com

Inspiration

A customer was experiencing a high CPU condition on a remote TEMS. High CPU was also seen on some agents. The first issue was an expensive test to determine if two processes were missing. I documented full details for a zero cost solution here.

The second major issue was the result of a situation that used two different attribute groups. For that case I show an alternative solution which reduces the number of result bytes incoming by 98%.

Mixed Attribute Situation formula

The problem situation was XXXXX_XX_SYSLoadAvg15Min_C, which produced 1.836 megabytes a minute of result data even though it only ran every 15 minutes.

The situation formula was not obviously inefficient. It certainly did not draw my attention until the TEMS Audit process showed it as a top impacter:

*IF ( ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 2 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 8.00 ) *OR

      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 3 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 12.00 ) *OR

      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 4 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 16.00 ) *OR

      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 5 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 20.00 ) *OR

      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 6 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 24.00 ) *OR

      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 7 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 28.00 ) *OR

      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *GE 8 *AND

        *VALUE KLZ_System_Statistics.System_Load_15min *GT 32.00 )

    ) 

That was formatted for easier viewing. The scheme is designed to be true if the 15 minute system load average is more than 4 times the number of online processors, with the processor count capped at 8. The formula seems reasonable. However, an agent can only run a situation with a single attribute group. To calculate this situation, TEMS creates hidden sub-situations which achieve the same goal. In this case TEMS created 22 situation rules in the SITDB table:

XXXXX_XX_SYSLoadAvg15Min_C______

XXXXX_XX_SYSLoadAvg15Min_C_____0

XXXXX_XX_SYSLoadAvg15Min_C_____1

XXXXX_XX_SYSLoadAvg15Min_C_____2

XXXXX_XX_SYSLoadAvg15Min_C_____3

XXXXX_XX_SYSLoadAvg15Min_C_____4

XXXXX_XX_SYSLoadAvg15Min_C_____5

XXXXX_XX_SYSLoadAvg15Min_C_____6

XXXXX_XX_SYSLoadAvg15Min_C_____7

XXXXX_XX_SYSLoadAvg15Min_C_____8

XXXXX_XX_SYSLoadAvg15Min_C_____9

XXXXX_XX_SYSLoadAvg15Min_C_____a

XXXXX_XX_SYSLoadAvg15Min_C_____b

XXXXX_XX_SYSLoadAvg15Min_C_____c

XXXXX_XX_SYSLoadAvg15Min_C_____d

XXXXX_XX_SYSLoadAvg15Min_C_____e

XXXXX_XX_SYSLoadAvg15Min_C_____f

XXXXX_XX_SYSLoadAvg15Min_C_____g

XXXXX_XX_SYSLoadAvg15Min_C_____h

XXXXX_XX_SYSLoadAvg15Min_C_____i

XXXXX_XX_SYSLoadAvg15Min_C_____j

XXXXX_XX_SYSLoadAvg15Min_C_____k

This was the largest number of sub-situations I have ever seen.

I won’t bore you with every sub-situation definition, but here are three selected examples found in the SITDB table. The SITDB table contains the SQL which represents the situation in the TEMS dataserver.

RULENAME: XXXXX_XX_SYSLoadAvg15Min_C______

PREDICATE: XXXXX_XX_SYSLoadAvg15Min_C_____0 OR XXXXX_XX_SYSLoadAvg15Min_C_____3 OR XXXXX_XX_SYSLoadAvg15Min_C_____6 OR XXXXX_XX_SYSLoadAvg15Min_C_____9 OR XXXXX_XX_SYSLoadAvg15Min_C_____c OR XXXXX_XX_SYSLoadAvg15Min_C_____f OR XXXXX_XX_SYSLoadAvg15Min_C_____

RULENAME: XXXXX_XX_SYSLoadAvg15Min_C_____i

PREDICATE: XXXXX_XX_SYSLoadAvg15Min_C_____j AND XXXXX_XX_SYSLoadAvg15Min_C_____k

RULENAME: XXXXX_XX_SYSLoadAvg15Min_C_____k

PREDICATE: SELECT BIOSREL, BIOSVER, BRAND, CONFCPU, HOSTNAME, MACSERIAL, MODEL, ONLNCPU, ORIGINNODE, TIMESTAMP, UUID FROM KLZ.LNXMACHIN WHERE SYSTEM.PARMA("SITNAME", "LZIOS_BP_SYSLoadAvg15Min_C", 26) AND SYSTEM.PARMA("NUM_VERSION", "8", 1) AND SYSTEM.PARMA("LSTDATE", "1130405010445000", 16) AND SYSTEM.PARMA("SITINFO", "TFWD=N;OV=N;", 12) AND LNXMACHIN.ONLNCPU = 7 ;

This situation and a parallel one for the 5 minute load accounted for about 40% of the incoming workload to the remote TEMS. Because the situations contain tests on other situations, they require TEMS evaluation, which is always very expensive. This situation alone could cause high CPU at best, and might cause a remote TEMS crash.

There was also a much higher workload at the agents where the situation was distributed. All 22 situations were evaluated at the sampling interval. Even if the issue did not exist, the situation tests required TEMS evaluation, and so many duplicate results had to be sent.

Alternative Situation(s) Example Solution

The following three situations use marker and data files to communicate information. Example situations are available in IBM_sysload15. The marker/data files are stored in the <install>/tmp directory. If more than one scheme is being used, the situation name would be made part of the marker file name.

IBM_processor_count

Since systems rarely change the number of online processors, the value is calculated at agent start up and then just once every 999 days. If your environment makes use of CPU hot plug technology then you could run it more frequently. In this example, the ongoing TEMS impact is zero since it is only evaluated every 999 days or during a TEMS connection.

The action commands are presented on multiple lines for ease of understanding, but will be one long line in the situation editor Action command:

Attribute Group: Linux Machine Information

Formula: (Number of Processors Online >= 0)

Sampling Interval: 999 days

rowsize: 764

Action Command:

cd $CANDLEHOME/tmp ;

echo  &{Linux_Machine_Information.Number_of_Processors_Online}  >ponline.txt

The purpose is to record the number of online processors into a known file.

IBM_sysload15_calc

The situation uses the Linux System Statistics attribute group. The formula is set to be always true. The action command is configured to run at every interval.

Attribute Group: Linux System Statistics

Formula: (System Load Last 15 Minutes >= 0.00)

Sampling Interval: 15 minutes

rowsize: 236

Action Command:

cd $CANDLEHOME/tmp;

(

echo "&{KLZ_System_Statistics.System_Load_15min} " ;

(cat ponline.txt 2>/dev/null || echo 1);

) |

awk '{load=$1;getline;cpu=$1;}

END{

if (load/100 > cpu*4.0)

system(sprintf("echo %.2f >sysload15.hi",load/100));

else system("rm -f sysload15.hi");

exit 0;}'

Here is an explanation of the action command. Remember that (…) creates a subshell environment, and a semicolon ends one command before the next begins. || means that the second command runs only if the first one had a non-zero exit status (that is, it failed). The | alone means the standard output is fed into the standard input of the next command.

cd $CANDLEHOME/tmp;

  ===> Make the $CANDLEHOME tmp directory the current directory.

(

  ===> Begin a first level subshell environment.

echo "&{KLZ_System_Statistics.System_Load_15min}" ;

  ===> Write the 15 minute system load to standard output.

  ===> echo -n is not used because not all platforms support it.

(cat ponline.txt 2>/dev/null || echo 1)

  ===> Within a new subshell, copy ponline.txt to standard output. Suppress error

  ===> messages with 2>/dev/null and, if no file exists, put 1 into standard output.

) |

  ===> Close the first subshell; its two lines of standard output are piped to the next command.

awk '{load=$1;getline;cpu=$1;}

  ===> Run the awk command and get one number from each of the two lines.

END{

  ===> After reading all input, perform the ending logic.

if (load/100 > cpu*4.0)

  ===> load is the 15 minute system load. It is divided by 100 for scaling, because the agent sends just an integer.

  ===> cpu is the number of online processors, from the ponline.txt file.

  ===> The test is the business rule: signal if the system load is more than 4.0 times the number of processors.

system(sprintf("echo %.2f >sysload15.hi",load/100));

  ===> The first system call is the true branch of the if statement. It creates a marker file.

else system("rm -f sysload15.hi");

  ===> The second system call is the false branch of the if statement. It erases the marker file.

exit 0;}'

  ===> Exit the awk process.
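To try the action command logic outside the agent, the pipeline can be wrapped in a small shell function. This is only a sketch under stated assumptions: the function name check_sysload15 and its two arguments are hypothetical, the work directory stands in for $CANDLEHOME/tmp, and the agent's &{…} substitution is replaced by an ordinary argument.

```shell
# Hypothetical test harness for the IBM_sysload15_calc action command.
# $1 = work directory (stands in for $CANDLEHOME/tmp)
# $2 = 15 minute load as the agent substitutes it (integer, load * 100)
check_sysload15() {
    (
        cd "$1" || exit 1
        (
            echo "$2"
            # Fall back to 1 processor when ponline.txt is missing
            (cat ponline.txt 2>/dev/null || echo 1)
        ) |
        awk '{load=$1; getline; cpu=$1}
        END {
            # Business rule: load above 4 times the online processors
            if (load/100 > cpu*4.0)
                system(sprintf("echo %.2f >sysload15.hi", load/100));
            else
                system("rm -f sysload15.hi");
            exit 0
        }'
    )
}
```

For example, with a ponline.txt containing 2 in the work directory, calling check_sysload15 with 950 should leave a sysload15.hi marker holding 9.50, and calling it again with 300 should remove the marker.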

IBM_sysload15_high

Check for the marker file. When it is present, a high 15 minute system load condition exists.

Attribute Group: Linux File Information

Formula: ( Path == '/opt/IBM/ITM/tmp/' AND File == 'sysload15.hi')

Sampling Interval: 5 minutes

Rowsize: 3580

Event Receivers and Helper situations

The first two situations are helper situations. After testing make sure they are not associated with any Portal Client Navigation nodes. The EIF tab should be set so the events are not transmitted to any event receiver like Omnibus.

Remote TEMS Performance Estimate for 600 agents

Situation 1 – ignored since it only runs once.

Situation 2 – always true, so a result is sent each time.

Every 15 minutes there will be 600 results at 236 bytes, which averages 9,440 bytes/minute.

Situation 3 – Assume 600 agents and 5% are showing a problem.

The 30 agents that have a problem will send

Count*rowsize*freq/hr = 30*3580*12 = 1,288,800 bytes/hour or 21,480 per minute.

In this scenario, there will be 30,920 result bytes per minute when combined with situation 2.

That is a reduction of about 98% from the observed rate on the remote TEMS.

You can adjust the situation intervals to speed recognition of the condition, with a parallel increase in the remote TEMS burden. Here is a formula you can use to estimate the bytes per hour burden on the remote TEMS; divide the result by 60 for bytes per minute.

(Agents*Agent_Sysload15%*60*3580)/Check_Interval + (Agents*236*60)/Calc_Interval

Agents = number of agents being looked at

Agent_Sysload15% = fraction of agents exceeding the sysload15 benchmark

Check_Interval = Sampling interval in minutes for condition check

Calc_Interval = Sampling interval in minutes for calculation situation
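The estimate can also be scripted. The following is a minimal sketch, not part of the original situations: the function name estimate_bytes_per_minute and its argument order are hypothetical, the row sizes 3580 and 236 are the ones quoted above, and each term is divided by its interval in minutes so the result is in bytes per minute.

```shell
# Hypothetical helper: estimate remote TEMS result bytes per minute.
# $1 = number of agents
# $2 = fraction of agents exceeding the threshold (for example 0.05)
# $3 = check interval in minutes (IBM_sysload15_high)
# $4 = calc interval in minutes (IBM_sysload15_calc)
estimate_bytes_per_minute() {
    awk -v agents="$1" -v frac="$2" -v check="$3" -v calc="$4" 'BEGIN {
        high_rowsize = 3580   # Linux File Information row size, from the text
        calc_rowsize = 236    # Linux System Statistics row size, from the text
        high  = agents * frac * high_rowsize / check
        stats = agents * calc_rowsize / calc
        printf "%.0f\n", high + stats
    }'
}
```

Running estimate_bytes_per_minute 600 0.05 5 15 evaluates the 600 agent scenario; changing the third or fourth argument shows how a shorter interval raises the burden.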

Note:

In some environments you may need to use nawk [new awk] instead of awk to get correct results. This was seen on a Solaris platform.
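One way to handle both cases is to pick the awk program once and reuse it. This is a sketch that assumes a POSIX shell where command -v is available and that nawk, when installed, is on the PATH; the variable name AWK is an arbitrary choice.

```shell
# Prefer nawk when it is installed, otherwise fall back to awk.
AWK=$(command -v nawk || command -v awk)

# Quick sanity check of the chosen program.
echo "3 4" | "$AWK" '{print $1 + $2}'
```

In the action command, the chosen program would then be used in place of the plain awk call.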

Summary

This shows how to implement a business rule with situations that send far fewer result bytes. In this particular example the reduction was 98% compared to a single situation.

Sitworld: Table of Contents

Photo note: Three mixed up cats enjoying the sun.

 
