By John Alvord, IBM Corporation
jalvord@us.ibm.com
Inspiration
A customer was trying to make a situation work only on some Linux Dell systems. He tried the following formula:
*IF *VALUE Linux_Machine_Information.Hardware_Brand *EQ 'Dell Inc.' *AND
*MISSING Linux_Process.Proc_CMD_Line_U *EQ ('*xxx_sa_datamgrd', '*xxx_sa_eventmgrd', '*xxx_sa_snmpd', '*xxx_om_shrsvcd')
This seems reasonable, but it has five flaws, some major and some minor.
One: DisplayItem cannot be set. When multiple processes were missing only a single result was sent. When the Portal Client Situation Editor sees that the first test is against a single row attribute, DisplayItem is not offered. Incidentally you can use tacmd editSit to add in a DisplayItem and that works, but that is hardly reasonable for a production situation. See here for the workaround.
Two: The single result sent from the agent can change. When the condition clears later, it might report a different process. In that case the Omnibus process will not clear the original.
Three: The situation will cause substantial CPU time overhead at the Agent. That can be reduced by increasing the sampling interval at the cost of increased time before alert. In general any use of the asterisk in *VALUE type tests will result in considerable CPU utilization.
Four: There are two attribute groups, Linux_Machine_Information and Linux_Process. When the TEMS starts the situation, it is decomposed into three different situations because Agents only handle situations that reference a single attribute group. Effectively these are the generated situations:
sit1:
*IF *VALUE Linux_Machine_Information.Hardware_Brand *EQ 'Dell Inc.'
sit2:
*IF *MISSING Linux_Process.Proc_CMD_Line_U *EQ ('*xxx_sa_datamgrd','*xxx_sa_eventmgrd','*xxx_sa_snmpd','*xxx_om_shrsvcd')
sit3:
*IF *SIT sit1 *EQ True *AND *SIT sit2 *EQ True
When there are two attribute groups involved, the above translates into a set union, rather than the sort of mathematical test we normally imagine. So it just cannot work.
Five: The above is also a performance problem because sit3 can only be evaluated at the TEMS the agent reports to. That means the sub-situations need to send results every sampling interval.
The customer was being reasonable, but he didn’t know how ITM situation processing works in this case.
That reminds me of a great lyric from the singer Greg Brown:
“Life is not what you think it is. Life just is!”
Situations do not behave as you think they do. Situations behave according to the existing ITM programming logic. So with that in mind I developed a low impact, hands off solution using two situations. These examples use a Linux on zSeries test environment.
Alternative solution
The customer solution was superficially simpler but actually complex and wrong. This example solution is simpler than the customer case for ease of testing and presentation. You can take this example and mold it to what you need. Four example situations for Linux/Unix are available here.
Situation 1 – IBM_Linux_IBM_MSL to maintain an IBM Managed System List
To minimize overhead, the test should run only on Linux machines of Brand “IBM”. To accomplish that we first need a way to generate a Managed System List for just the IBM Linux systems. The formula is simple:
pdt: *IF *VALUE Linux_Machine_Information.Hardware_Brand *EQ 'IBM'
The sampling interval is 999 days since the hardware brand will not change while the system is running.
The situation is distributed to *ALL_Linux
The situation is evaluated once each time the Agent connects to a hub or remote TEMS.
The action command is where the magic happens. The action command is configured to run at the TEMS instead of the agent. For clarity of presentation the command is shown on multiple lines.
cd $CANDLEHOME/tmp;
mkdir -p &{Linux_Machine_Information.System_Name};
cd &{Linux_Machine_Information.System_Name};
(
$CANDLEHOME/bin/tacmd login -s server -u <userid> -p <password> >nul &&
$CANDLEHOME/bin/tacmd editsystemlist -l IBM_Linux_IBM_MSL -f -a &{Linux_Machine_Information.System_Name} >nul
);
cd ..; rm -rf &{Linux_Machine_Information.System_Name};
The logic with unique directories is required because if two tacmd logins are running at once they must be in separate directories. Here is a detailed explanation:
cd $CANDLEHOME/tmp;
==> make the ITM tmp directory the current directory
mkdir -p &{Linux_Machine_Information.System_Name};
==> create a temporary directory for the upcoming tacmd commands
cd &{Linux_Machine_Information.System_Name};
==> make the temporary directory the current directory
(
==> begin a subshell
$CANDLEHOME/bin/tacmd login -s server -u <userid> -p <password> >nul &&
==> log in to the hub TEMS. The && means that the next command runs only if the login succeeds
$CANDLEHOME/bin/tacmd editsystemlist -l IBM_Linux_IBM_MSL -f -a &{Linux_Machine_Information.System_Name} >nul
==> add the agent name to the IBM_Linux_IBM_MSL. -f forces the edit without a confirmation prompt
);
==> end of subshell
cd ..; rm -rf &{Linux_Machine_Information.System_Name};
==> remove temporary directory and contents
The >nul controls are used to swallow the command outputs. Otherwise any command standard output will leak into the TEMS operations log. In some cases you must add 2>&1 to swallow the standard error output.
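The redirections can be tried in isolation. The sketch below uses a hypothetical function named noisy as a stand-in for tacmd, since it writes one line to each output stream. It also uses /dev/null rather than nul: on Linux and Unix, >nul writes to an ordinary file named nul in the current directory (here the temporary directory, which is removed afterwards), while /dev/null discards the output outright.

```shell
#!/bin/sh
# noisy is a hypothetical stand-in for tacmd: one line to each stream.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Discard standard output only; standard error still gets through
# and is what the command substitution captures.
stdout_discarded=$(noisy 2>&1 >/dev/null)

# Discard both streams; nothing is left to leak into the operations log.
both_discarded=$(noisy >/dev/null 2>&1)

echo "stdout discarded: [$stdout_discarded]"
echo "both discarded:   [$both_discarded]"
```

The first line printed shows the standard error text still coming through, and the second shows empty brackets, which is the behavior wanted in the action command.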
NOTE: the IBM_Linux_IBM_MSL must be primed with at least one entry to avoid having to program a tacmd createsystemlist invocation. In recent ITM levels you can create a zero entry MSL to dodge this limitation.
There is overhead when the name already exists; however since the action command runs rarely the overall cost is low.
Situation 2 – IBM_Linux_missing_manual to watch for missing processes
This example situation tests for just one process which was not present in the test environment.
pdt: *IF *VALUE Process.Process_Filter *EQ '.*(manual).*' *AND *MISSING Linux_Process.Proc_CMD_Line *EQ ('manual')
The sampling interval is whatever you want.
DisplayItem is set to Proc_CMD_Line
The distribution is set to IBM_Linux_IBM_MSL
This situation runs with high efficiency since the Process_Filter scheme uses very little CPU compared to *VALUE tests with asterisks. If all processes are present, no results are sent to the TEMS. If there are processes missing, results are sent for each missing process at each sampling interval.
Performance Overview
Situation 1 only runs once per agent connection to a TEMS.
Situation 2 runs with high efficiency and low TEMS impact.
If a new Linux system comes online and it is not an IBM Linux system, situation 1 runs but the formula will be False and nothing happens.
If a new IBM Linux system comes online, situation 1 is true and the action command runs at the TEMS the agent connects to. Since it is new, the IBM_Linux_IBM_MSL is updated. Because of the MSL change, situation 2 starts on the new IBM Linux system afterwards.
This alternative solution is very low cost. The only situation running on all Linux systems only runs once during TEMS connection. The situation running on the IBM zLinux systems only sends results if there is a problem. Detection of a problem situation is low cost using the Process Filter attribute. Finally, running it requires no manual intervention.
Note: You might prefer to put the action command into a shell script on the TEMS to avoid specifying the userid and password in the situation action command.
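A minimal sketch of such a wrapper script follows. It is an illustration and not the author's tested script. The variable names ITM_HUB, ITM_USER and ITM_PASS are assumptions and would be set in a file readable only by the TEMS user. The TACMD override exists only so the function can be exercised without a real ITM installation. The sketch also swaps in mktemp under /tmp for the unique working directory instead of a directory named after the agent under $CANDLEHOME/tmp, and mktemp may not be available on every Unix platform.

```shell
#!/bin/sh
# Hypothetical wrapper: add one agent to a Managed System List.
# ITM_HUB, ITM_USER and ITM_PASS are assumed names, set outside the
# situation definition so the credentials do not appear in it.
# TACMD may be overridden for testing; it defaults to the ITM install path.

add_agent_to_msl() {
    msl="$1"
    agent="$2"
    tacmd_bin="${TACMD:-$CANDLEHOME/bin/tacmd}"

    # One unique working directory per invocation, so that two tacmd
    # logins running at once are in separate directories.
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/msl_update.XXXXXX") || return 1

    (
        cd "$workdir" || exit 1
        "$tacmd_bin" login -s "$ITM_HUB" -u "$ITM_USER" -p "$ITM_PASS" >/dev/null 2>&1 &&
        "$tacmd_bin" editsystemlist -l "$msl" -f -a "$agent" >/dev/null 2>&1
    )
    rc=$?

    # Remove the working directory and anything tacmd left in it.
    rm -rf "$workdir"
    return "$rc"
}

# Run only when called with an MSL name and an agent name.
if [ "$#" -eq 2 ]; then
    add_agent_to_msl "$1" "$2"
fi
```

With a script like this saved on the TEMS (the path is up to you), the situation action command would shrink to the script path followed by IBM_Linux_IBM_MSL and &{Linux_Machine_Information.System_Name}.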
Customer Application
Situation 1
pdt: *IF *VALUE Linux_Machine_Information.Hardware_Brand *EQ 'Dell Inc.'
The sampling interval is 999 days since the hardware brand will not change while the system is running.
The situation is distributed to *ALL_Linux
The situation is evaluated once each time the Agent connects to a hub or remote TEMS.
The action command is like the example but uses XXXX_Linux_Dell_MSL as the Managed System List.
Situation 2
pdt: *IF *VALUE Process.Process_Filter *EQ '.*(xxx_sa_datamgrd|xxx_sa_eventmgrd|xxx_sa_snmpd|xxx_om_shrsvcd).*' *AND
*MISSING Linux_Process.Proc_CMD_Line *EQ ('xxx_sa_datamgrd','xxx_sa_eventmgrd','xxx_sa_snmpd','xxx_om_shrsvcd')
The sampling interval is whatever you want.
DisplayItem is set to Proc_CMD_Line
The distribution is set to XXXX_Linux_Dell_MSL
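The Process_Filter regular expression and the *MISSING list must name the same processes, so when the list is long it can help to generate both from one set of names. The following is only a sketch: build_filter and build_missing_list are hypothetical helper names, and the output still has to be pasted into the situation formula by hand.

```shell
#!/bin/sh
# Hypothetical helpers that derive both halves of the formula
# from a single list of process names.

# Join the names with | and wrap them as .*(name1|name2).*
build_filter() {
    joined=$(printf '%s|' "$@")
    printf '.*(%s).*\n' "${joined%|}"
}

# Quote each name and wrap the list as ('name1','name2')
build_missing_list() {
    joined=$(printf "'%s'," "$@")
    printf '(%s)\n' "${joined%,}"
}

build_filter xxx_sa_datamgrd xxx_sa_eventmgrd xxx_sa_snmpd xxx_om_shrsvcd
build_missing_list xxx_sa_datamgrd xxx_sa_eventmgrd xxx_sa_snmpd xxx_om_shrsvcd
```

Generating both strings from the same arguments avoids the easy mistake of adding a process to one half of the formula and not the other.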
Unix Equivalents
The example situations include two for Unix OS Agents. The action commands are exactly parallel with different attributes and Managed System List.
Situation 1 – IBM_Unix_detect_manufacturer
Action command:
cd $CANDLEHOME/tmp; mkdir -p &{Machine_Information.System_Name}; cd &{Machine_Information.System_Name}; ($CANDLEHOME/bin/tacmd login -s server -u <userid> -p <password> >nul && $CANDLEHOME/bin/tacmd editsystemlist -l IBM_Unix_IBM_MSL -f -a &{Machine_Information.System_Name} >nul); cd ..; rm -rf &{Machine_Information.System_Name}
Situation 2 – IBM_Unix_missing_manual
Formula:
*IF *VALUE Process.ProcessFilter_U *EQ '.*(manual).*' *AND *MISSING Process.Command_U *EQ ('manual')
Distribution: IBM_Unix_IBM_MSL
DisplayItem: Command_U
Windows Challenge and a Workaround
If the TEMS runs on a Windows system, there are significant problems because the agent name always has one or more colons [:] and the Windows file system does not allow a colon in a file name.
The easiest solution is a workflow policy running on the Windows TEMS. The policy would wait for the detect situation event and then relay the needed command to a convenient Linux/Unix OS Agent, which would run the Unix or Linux flavor of the command line to update the Managed System List involved. That helper Agent would need the tacmd component installed, which is not the case by default.
The same technique could be used if the TEMS is a z/OS TEMS.
Windows Weird Science Experiment
Purely as a fun experiment, it might be possible to work around the colon-in-filename issue using a batch file something like this on the Windows system:
IBM_update_IBM_MSL.bat:
@SetLocal
@set Agent=%1
@set AgentDir=%Agent::=_%
@cd /d "%TEMP%"
@md %AgentDir%
@cd %AgentDir%
@c:\IBM\ITM\bin\tacmd login -s server -u <user> -p <password> >nul && c:\IBM\ITM\bin\tacmd editsystemlist -f -l IBM_Windows_IBM_MSL -a %Agent% >nul
@cd ..
@rmdir /S /Q %AgentDir%
@EndLocal
The third line “set AgentDir=%Agent::=_%” is where the magic happens. It uses the batch string-replacement syntax %variable:old=new%. After the variable name Agent comes a colon that starts the replacement, then the character to replace (a second colon), an equals sign, and the replacement (an underline character). The result is a new environment variable in which all colons [:] from the agent name are converted into underlines [_], thus creating a legal directory name. The replacement syntax works only on environment variables and not directly on a batch parameter such as %1, which is why the parameter is first copied into Agent. SetLocal and EndLocal keep both variables local to the batch file.
The situation action command would be something like
C:\IBM\ITM\bin\IBM_update_IBM_MSL.bat &{Linux_Machine_Information.System_Name}
I have not tested this scheme. If you make something like this work, please share the work so all can use.
Summary
This post has shown how to automatically maintain a specialized Managed System List [MSL]. That MSL can be used to run a situation just for that list of Agents. This will always be more efficient than the obvious shotgun approach.
This scheme could be used in other contexts. For example, a Linux/Unix crontab script could periodically consult a database about what situations should be running on what agents. Then the tacmd login/tacmd editsystemlist could be run to manipulate the MSLs which control the situations. That works nicely even if the agent happens to be not online at the scheduled time. When it eventually connects to the TEMS, the right thing will be done.
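As a rough illustration of that idea, the sketch below reads MSL and agent pairs from standard input and adds each agent to its list. The input format (one "MSL agent" pair per line, with # comments allowed) is an assumption standing in for whatever the database query would produce, sync_msls is a hypothetical name, and the function assumes a tacmd login has already succeeded in the current directory. The TACMD override is there only for testing.

```shell
#!/bin/sh
# Hypothetical cron job body: read "MSL agent" pairs on standard input
# and add each agent to the named Managed System List.
# Assumes tacmd login has already been run successfully.

sync_msls() {
    tacmd_bin="${TACMD:-$CANDLEHOME/bin/tacmd}"
    while read -r msl agent; do
        # Skip blank lines and comment lines.
        case "$msl" in
            ''|'#'*) continue ;;
        esac
        # Standard input is redirected so the command cannot consume
        # the remaining lines of the list being read by the loop.
        "$tacmd_bin" editsystemlist -l "$msl" -f -a "$agent" </dev/null >/dev/null 2>&1 ||
            echo "update failed: $msl $agent" >&2
    done
}
```

A crontab entry could then run the login, produce the pairs from the database, and pipe them into sync_msls. Failures are reported on standard error so that cron can mail them.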
Note: some Photo – Koko and Rascal a few months old.