By John Alvord, IBM Corporation
Inspiration
A customer was experiencing a high CPU condition on a remote TEMS. High CPU was also seen on some agents. The first issue was an expensive test to determine whether two processes were missing. I documented full details for a zero cost solution here.
The second major issue was the result of a situation that used two different attribute groups. For that case I show an alternative solution which reduces the number of result bytes incoming by 98%.
Mixed Attribute Situation formula
The problem situation was XXXXX_XX_SYSLoadAvg15Min_C and resulted in 1.836 megabytes a minute of result data even though it only ran every 15 minutes.
The situation formula was not obviously inefficient. It certainly did not draw my attention until the TEMS Audit process showed it as a top impacter:
*IF ( ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 2 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 8.00 ) *OR
( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 3 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 12.00 ) *OR
( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 4 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 16.00 ) *OR
( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 5 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 20.00 ) *OR
( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 6 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 24.00 ) *OR
( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 7 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 28.00 ) *OR
( *VALUE Linux_Machine_Information.Number_of_Processors_Online *GE 8 *AND
*VALUE KLZ_System_Statistics.System_Load_15min *GT 32.00 )
)
That was formatted for easier viewing. The scheme is designed to be true if the 15 minute system load average is more than 4 times the number of processors online on the system. The formula seems reasonable. However, an agent can only run a situation with a single attribute group. To calculate this situation, TEMS creates hidden sub-situations which achieve the same goal. In this case TEMS created 22 situation rules in the SITDB table:
XXXXX_XX_SYSLoadAvg15Min_C______
XXXXX_XX_SYSLoadAvg15Min_C_____0
XXXXX_XX_SYSLoadAvg15Min_C_____1
XXXXX_XX_SYSLoadAvg15Min_C_____2
XXXXX_XX_SYSLoadAvg15Min_C_____3
XXXXX_XX_SYSLoadAvg15Min_C_____4
XXXXX_XX_SYSLoadAvg15Min_C_____5
XXXXX_XX_SYSLoadAvg15Min_C_____6
XXXXX_XX_SYSLoadAvg15Min_C_____7
XXXXX_XX_SYSLoadAvg15Min_C_____8
XXXXX_XX_SYSLoadAvg15Min_C_____9
XXXXX_XX_SYSLoadAvg15Min_C_____a
XXXXX_XX_SYSLoadAvg15Min_C_____b
XXXXX_XX_SYSLoadAvg15Min_C_____c
XXXXX_XX_SYSLoadAvg15Min_C_____d
XXXXX_XX_SYSLoadAvg15Min_C_____e
XXXXX_XX_SYSLoadAvg15Min_C_____f
XXXXX_XX_SYSLoadAvg15Min_C_____g
XXXXX_XX_SYSLoadAvg15Min_C_____h
XXXXX_XX_SYSLoadAvg15Min_C_____i
XXXXX_XX_SYSLoadAvg15Min_C_____j
XXXXX_XX_SYSLoadAvg15Min_C_____k
This was the largest number of sub-situations I have ever seen.
I won't bore you with every sub-situation definition, but here are three selected examples found in the SITDB table. The SITDB table contains the SQL which represents the situation in the TEMS dataserver:
RULENAME: XXXXX_XX_SYSLoadAvg15Min_C______
PREDICATE: XXXXX_XX_SYSLoadAvg15Min_C_____0 OR XXXXX_XX_SYSLoadAvg15Min_C_____3 OR XXXXX_XX_SYSLoadAvg15Min_C_____6 OR XXXXX_XX_SYSLoadAvg15Min_C_____9 OR XXXXX_XX_SYSLoadAvg15Min_C_____c OR XXXXX_XX_SYSLoadAvg15Min_C_____f OR XXXXX_XX_SYSLoadAvg15Min_C_____
RULENAME: XXXXX_XX_SYSLoadAvg15Min_C_____i
PREDICATE: XXXXX_XX_SYSLoadAvg15Min_C_____j AND XXXXX_XX_SYSLoadAvg15Min_C_____k
RULENAME: XXXXX_XX_SYSLoadAvg15Min_C_____k
PREDICATE: SELECT BIOSREL, BIOSVER, BRAND, CONFCPU, HOSTNAME, MACSERIAL, MODEL, ONLNCPU, ORIGINNODE, TIMESTAMP, UUID FROM KLZ.LNXMACHIN WHERE SYSTEM.PARMA("SITNAME", "LZIOS_BP_SYSLoadAvg15Min_C", 26) AND SYSTEM.PARMA("NUM_VERSION", "8", 1) AND SYSTEM.PARMA("LSTDATE", "1130405010445000", 16) AND SYSTEM.PARMA("SITINFO", "TFWD=N;OV=N;", 12) AND LNXMACHIN.ONLNCPU = 7 ;
This situation and a parallel one for the 5 minute load accounted for about 40% of the incoming workload to the remote TEMS. Because the situations contain tests for other situations, they require TEMS evaluation, which is always very expensive. This situation alone could cause high CPU at best and a remote TEMS crash at worst.
There was also a much higher workload at the agents where the situation was distributed. All 22 situations were evaluated at the sampling interval. Even if the issue did not exist, the situation tests required TEMS evaluation, so many duplicate results had to be sent.
Alternative Situation(s) Example Solution
The following three situations use marker and data files to communicate information. Example situations are available in IBM_sysload15. The marker/data files are stored in the <install>/tmp directory. If more than one scheme is being used, the situation name would be made part of the marker file name.
IBM_processor_count
Since systems rarely change the number of online processors, the value is calculated at agent startup and then just once every 999 days. If your environment makes use of CPU hot plug technology, you could run it more frequently. In this example, the ongoing TEMS impact is effectively zero since the situation is only evaluated every 999 days or during a TEMS connection.
The action commands are presented on several lines for ease of understanding, but each will be one long line in the situation editor Action command field:
Attribute Group: Linux Machine Information
Formula: (Number of Processors Online >= 0)
Sampling Interval: 999 days
rowsize: 764
Action Command:
cd $CANDLEHOME/tmp ;
echo &{Linux_Machine_Information.Number_of_Processors_Online} >ponline.txt
The purpose is to record the number of online processors into a known file.
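As a quick sanity check outside the agent, the recorded value can be compared with what the operating system reports. The following is a minimal sketch, assuming `getconf _NPROCESSORS_ONLN` is available (it is on Linux with glibc). The function name `check_ponline` and the default path are illustrative assumptions, not part of the product.

```shell
# Compare the processor count recorded by IBM_processor_count with the
# count the OS reports. check_ponline is a hypothetical helper name.
# $1 = path to ponline.txt (assumed default: $CANDLEHOME/tmp/ponline.txt)
check_ponline() {
    file=${1:-${CANDLEHOME:-/opt/IBM/ITM}/tmp/ponline.txt}
    recorded=$(cat "$file" 2>/dev/null) || { echo "missing: $file"; return 2; }
    actual=$(getconf _NPROCESSORS_ONLN)
    # A non-numeric or empty recorded value is treated as a mismatch
    if [ "$recorded" -eq "$actual" ] 2>/dev/null; then
        echo "match: $actual"
    else
        echo "mismatch: recorded=$recorded actual=$actual"
        return 1
    fi
}
```

A mismatch would suggest the processor count changed since the agent started, in which case a shorter sampling interval for IBM_processor_count may be worth considering.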
IBM_sysload15_calc
The situation uses the Linux System Statistics attribute group. The formula is set to be always true. The action command is configured to run at every interval.
Attribute Group: Linux System Statistics
Formula: (System Load Last 15 Minutes >= 0.00)
Sampling Interval: 15 minutes
rowsize: 236
Action Command:
cd $CANDLEHOME/tmp;
(
echo "&{KLZ_System_Statistics.System_Load_15min}" ;
(cat ponline.txt 2>/dev/null || echo 1);
) |
awk '{load=$1;getline;cpu=$1;}
END{
if (load/100 > cpu*4.0)
system(sprintf("echo %.2f >sysload15.hi",load/100));
else system("rm -f sysload15.hi");
exit 0;}'
Here is an explanation of the action command. Remember that (…) creates a subshell environment, and a semicolon separates one command from the next. || means that the second command runs only if the first one had a non-zero exit status [that is, it failed]. The | alone means the standard output of one command is fed into the standard input of the next command.
-- cd $CANDLEHOME/tmp;
===> Make the $CANDLEHOME tmp directory the current directory.
-- (
===> Begin a first level subshell environment.
-- echo "&{KLZ_System_Statistics.System_Load_15min}" ;
===> Write the 15 minute system load to standard output.
===> echo -n is not used because not all platforms support it.
-- (cat ponline.txt 2>/dev/null || echo 1);
===> Within a new subshell, copy ponline.txt to standard output. Suppress error
===> messages with 2>/dev/null and, if no file exists, put 1 into standard output.
-- ) |
===> Close the first subshell. Its two lines of standard output are piped to awk.
-- awk '{load=$1;getline;cpu=$1;}
===> Run the awk command and get one number from each of the two lines.
-- END{
===> After reading all input, perform the ending logic.
-- if (load/100 > cpu*4.0)
===> The load is the 15 minute system load. The agent sends just an integer, so it is divided by 100 for scaling.
===> The cpu is the number of online processors, from the ponline.txt file.
===> The test is the business rule: signal if the system load is more than 4.0 times the number of processors.
-- system(sprintf("echo %.2f >sysload15.hi",load/100));
===> The first system() call is the true branch of the if statement. It creates the marker file.
-- else system("rm -f sysload15.hi");
===> The second system() call is the false branch of the if statement. It erases the marker file.
-- exit 0;}'
===> Exit the awk process.
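To try the logic outside the agent, the same pipeline can be wrapped in a shell function, with the &{...} substitution replaced by an argument. This is only a test harness sketch: the name `sysload15_calc`, its arguments, and the use of a scratch directory in place of $CANDLEHOME/tmp are assumptions for illustration. The load is passed the way the agent supplies it, as an integer scaled by 100.

```shell
# Hypothetical harness around the IBM_sysload15_calc action command.
# $1 = raw load (integer scaled by 100, as the agent substitutes it)
# $2 = working directory standing in for $CANDLEHOME/tmp
# The body runs in a subshell so the caller's current directory is unchanged.
sysload15_calc() (
    cd "$2" || exit 1
    (
        echo "$1" ;
        (cat ponline.txt 2>/dev/null || echo 1);
    ) |
    awk '{load=$1;getline;cpu=$1;}
    END{
        if (load/100 > cpu*4.0)
            system(sprintf("echo %.2f >sysload15.hi",load/100));
        else system("rm -f sysload15.hi");
        exit 0;}'
)
```

For example, with a ponline.txt containing 2, running `sysload15_calc 900 /some/dir` leaves a sysload15.hi file holding 9.00, while `sysload15_calc 700 /some/dir` removes the marker because 7.00 is not above 8.0. If ponline.txt is missing, the processor count defaults to 1.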
IBM_sysload15_high
Check for the marker file. When it is present, a high 15 minute system load condition exists.
Attribute Group: Linux File Information
Formula: ( Path == '/opt/IBM/ITM/tmp/' AND File == 'sysload15.hi')
Sampling Interval: 5 minutes
Rowsize: 3580
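When testing, it can help to look at the marker file by hand and see what this situation would find. A small sketch follows; `marker_status` is a hypothetical helper name, and the default directory assumes the /opt/IBM/ITM install path used in the formula above.

```shell
# Report the state the IBM_sysload15_high situation would observe.
# marker_status is a hypothetical helper; $1 defaults to the agent tmp directory.
marker_status() {
    dir=${1:-${CANDLEHOME:-/opt/IBM/ITM}/tmp}
    if [ -f "$dir/sysload15.hi" ]; then
        # The marker holds the load value written by the calc situation
        echo "high $(cat "$dir/sysload15.hi")"
    else
        echo "normal"
    fi
}
```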
Event Receivers and Helper situations
The first two situations are helper situations. After testing, make sure they are not associated with any Portal Client Navigation nodes. The EIF tab should be set so that their events are not transmitted to any event receiver such as Omnibus.
Remote TEMS Performance Estimate for 600 agents
Situation 1 – ignored since it only runs once.
Situation 2 – always true, so a result is sent at each interval.
Every 15 minutes there will be 600 results at 236 bytes each, which averages 9,440 bytes/minute.
Situation 3 – Assume 600 agents and 5% are showing a problem.
The 30 agents that have a problem will send
Count*rowsize*freq/hr = 30*3580*12 = 1,288,800 bytes/hour or 21,480 per minute.
In this scenario, there will be 30,920 result bytes per minute when combined with situation 2.
That is a 98.3% reduction from the observed rate of 1.836 megabytes a minute on the remote TEMS.
You can adjust the situation intervals to speed recognition of the condition, with a parallel increase in the remote TEMS burden. Here is a formula you can use to estimate the bytes per minute burden on the remote TEMS.
(Agents*Agent_Sysload15%*3580)/Check_Interval + (Agents*236)/Calc_Interval
Agents = number of agents being looked at
Agent_Sysload15% = fraction of agents exceeding the sysload15 benchmark
Check_Interval = Sampling interval in minutes for condition check
Calc_Interval = Sampling interval in minutes for calculation situation
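The estimate can be scripted so that different intervals are easy to compare. This is a minimal sketch using the row sizes quoted above (3580 and 236 bytes); the function name `tems_bytes_per_min` and its argument order are assumptions for illustration. Each term is result rows times row size divided by the interval in minutes, so the answer is in bytes per minute.

```shell
# Estimate remote TEMS result bytes per minute for the two recurring situations.
# Arguments (illustrative names):
#   $1 = number of agents
#   $2 = fraction of agents over the limit, for example 0.05
#   $3 = check situation interval in minutes
#   $4 = calc situation interval in minutes
tems_bytes_per_min() {
    awk -v agents="$1" -v frac="$2" -v check="$3" -v calc="$4" 'BEGIN {
        high  = agents * frac * 3580 / check   # IBM_sysload15_high result rows
        calcb = agents * 236 / calc            # IBM_sysload15_calc result rows
        printf "%.0f\n", high + calcb
    }'
}
```

For 600 agents with 5% over the limit, a 5 minute check and a 15 minute calculation, `tems_bytes_per_min 600 0.05 5 15` prints 30920.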
Note:
In some environments you may need to use nawk [new awk] in place of awk to get correct results. This was seen on a Solaris platform.
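One way to handle this in a script is to prefer nawk when it is installed and fall back to awk otherwise. The sketch below assumes a POSIX shell where `command -v` is available; the variable name AWK is just a convention chosen here.

```shell
# Use nawk when it is installed, otherwise fall back to awk.
AWK=$(command -v nawk || command -v awk)

# Same comparison as the action command: a raw load of 900 (9.00) on 2 processors
"$AWK" 'BEGIN { if (900/100 > 2*4.0) print "high"; else print "normal" }'
```

In an action command, the same selection would have to be folded into the single command line.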
Summary
This shows how to rewrite a mixed attribute group situation so that the same business rule is achieved with far fewer result bytes. In this particular example the reduction was 98% compared to the single original situation.
Photo note: Three mixed up cats enjoying the sun.