John Alvord, IBM Corporation
Draft #8 – 5 January – Level 1.08000
Introduction
ITM tracing is at once the most tiresome of topics and sometimes the most important. This post will collect everything I have collected and discovered over the years. Expect to see many additions and corrections over time.
Chapter 1 is dedicated to controlling diagnostic log file sizes.
Chapter 2 is dedicated to describing operation log files – what they are and where they are found.
Chapter 3 defines the diagnostic string.
Chapter 4 defines communication tracing with an important warning about that.
Chapter 5 presents defining static traces.
Chapter 6 presents tacmd settrace.
Chapter 7 presents the Service Console method of setting and removing diagnostic traces.
Chapter 8 presents a z/OS only method of setting and removing diagnostic traces.
Chapter I – Control of diagnostic log file size and location.
ITM Diagnostic log file size and location control differ by platform [Linux/Unix, Windows, z/OS and i/5]. This diagnostic log information contains detailed process information. By default – when control set to ERROR – you see error messages and any information level messages. The error messages do not always mean an actual error condition. The entire goal is to help IBM Support to understand product problem issues. This chapter shows how to control the size and location.
All of the examples assume the default install location. In practical usage you will specify the directories actually chosen for the particular installation.
Linux/Unix
Following is the best practice from ITM 623 GA onward. Earlier best practice is described at the end.
- In the /opt/IBM/ITM/config directory create a file xx.environment which has the same attributes/owner/group as xx.ini. For example, lz.environment and lz.ini or ms.environment and ms.ini. If one already exists, just use it. Here are example commands to create a new such environment file.
- cd /opt/IBM/ITM/config
- ls -l ms.ini which showed -rw-rw-rw- 1 itmuser staff 2565 Aug 5 21:17 ms.ini
- touch ms.environment
- chmod 666 ms.environment
- chown itmuser ms.environment
- chgrp staff ms.environment
- These environment variables are applied just before the start process completes and will take effect without any reconfiguration. A recycle is needed of course. When testing is over they can be commented out or deleted or the whole file deleted.
- Other environment variables can be specified, but we are working here on just KBB_RAS1_LOG where characteristics like diagnostic log segment sizes are defined. If this is absent, ITM has some built in defaults.
- Following are the environment variable lines to enable KBB_RAS1_LOG. Further explanations follow. The lines are separate by blank lines here for clarify but blank lines are not required.
CTIRA_LOG_PATH=${CANDLEHOME}/logs
KBB_VARPREFIX=%
RHN=`hostname|cut -d. -f1`
PRODUCTCODE=ms
KBB_RAS1_LOG=%(CTIRA_LOG_PATH)/${RHN}_${PRODUCTCODE}_%(syspgm)_%(sysutcstart)-.log INVENTORY=%(CTIRA_LOG_PATH)/${RHM}_${PRODUCTCODE}_%(syspgm).inv COUNT=16 LIMIT=20 PRESERVE=1 MAXFILES=64
Explanations:
- CTIRA_LOG_PATH needs to be set to directory where the logs will be kept. One large customer uses /var/opt/IBM/ITM/logs
- KBB_VARPREFIX needs to be set to allow ITM basic services to recognize substitutions. If % is already in use, [such as in LDAP environment variables] you can use ! or %%.
- RHM This will calculate the hostname the same way normal configuration does. It avoids then need for a reconfigure in the process. By rights this would be RUNNINGHOSTNAME but one problem management system used complained about that usage. Since it is calculated dynamically the actual name doesn’t matter.
- .PRODUCTCODE this should be set to the product code, like lz for Linux or ms for TEMS.
- Following are the elements of the KBB_RAS1_LOG. Some are the environment variables specified about and are not commented on.
- COUNT number of diagnostic log segments, 16 maximum. I consider 3 a logical minimum.
- LIMIT size in megabytes of diagnostic log segments. I once had to use 200 megs and 16 could to capture 24 hours log.
- PRESERVE=1 make sure the first segment is preserved
- MAXFILES total number of files to be kept. This *must* be larger than COUNT
- %(syspgm) will be replaced by main program like klzagent or kdsmain
- %(sysutcstart) will be replaced with start epoch UTC time in seconds
- The maximum space used will be MAXFILES*LIMIT in megabytes.
Before ITM 623 GA, best practice was as follows:
- Create a file xx.override in /opt/IBM/ITM/config with the same attributes/owner/group as xx.ini.
- Put within that file the change you want to introduce. Compared to modern setups the values need to be specified within single quotes such asKBB_VARPREFIX= ‘%’
- Instead of calculating RUNNINGHOSTNAME and PRODUCT on the fly, instead figure out what they should be and add them to the KBB_RAS1_LOG definition.
- In the xx.ini file add the following line. /opt/IBM/ITM/config/overrideThis is known as a source include definition.
- For many agents this is sufficient. TEMS is a instanced agent and more work is needed.
- The first method is to do a ./itmcmd config -S -t <temsname> and accept all defaults.
- The second method is to add the source include line into the hostname_ms_temsname.config file at the end.
Windows
Windows diagnostic log segments have the same form as Linux/Unix.
Tracing is defined via the Managed Tivoli Enterprise Monitoring GUI.
- Right click on agent [such as TEMS or Windows OS Agent]
- Select Advanced
- Select Edit Trace Parms…
There are entry areas for
Maximum Log Size Per File [LIMIT]
Maximum Number of Log Files Per Session [COUNT]
Maximum Number of Log Files Total [MAXFILES]
As usual a recycle is needed to implement.
z/OS
The diagnostics log information is included in the RKLVLOG SYSOUT file.
When intensive tracing is configured this can grow to a large size. If that happens, the following command will close the current SYSOUT file and start another. In that way the current log can be captured to a disk file. That is often performed using SDSF after the switch to “print” to a disk file.
/f cmstask,TLVLOG SWITCH
The CLASS option can be used to specify the output class of the new sysout file.
/f cmstask,TLVLOG SWITCH CLASS=W
z/OS logs are not configurable with KBB_RAS1_LOG. Configuration variables are kept in the RKANDATU PDS member KxxENV.
i/5
The i/5 platform [previously AS/400] uses the following form for KBB_RAS1_LOG. This environment variable is placed in QAUTOTMP/KMSPARM file member KBBENV.
KBB_RAS1_LOG=
(QAUTOTMP/KA4AGENT01 QAUTOTMP/KA4AGENT02 QAUTOTMP/KA4AGENT03)
INVENTORY=QAUTOTMP/KA4RAS.INV
COUNT=3
LIMIT=5
PRESERVE=1
MAXFILES=20
KBB_RAS1_LOG is specific as one lone line although it is presented here on separate lines for clarity.
The meaning of the controls is identical to Linux Unix.
Chapter 2 Operations log
The operations log is a high level view of ITM operations. It bears a close relationship to the ITM Universal Message Console. Universal Messages are written into a wrap around in storage table and are also written to the disk log for later analysis. The interesting benefit here is that you can write situations against the Universal Message Console. Each ITM process has a universal message console and if you want more details see this document:
Viewing the Universal Message Console (UMC) in ITM 6.x
Linux/Unix ITM Operations Log
The TEMS operations log is contained in the /opt/IBM/ITM/logs directory and has the name format
<hostname>_ms_<epoch>.log
Where
<hostname> names the server the TEMS is running on
<epoch> is a decimal number corresponding to the POSIX epoch – the number of seconds since 1 January 1970 not counting leap seconds at the time the log started. Diagnostic logs use a hex representation of epoch in the logs.
ITM Agents on Linux/Unix are contained in the same /opt/IBM/ITM/logs directory and has the name
<agent_name>.LG0
Where
<agent_name> is the managed system name [or Agent Name]
During each startup, the current <agent_name>.LG0 is copied to <agent_name>.LG1.
The Warehouse Proxy Agent operations log is not kept as a disk file. Those logs are preserved in a data warehouse table.
Windows ITM Operations Log
The Windows TEMS operations log is in the C:\IBM\ITM\cms directory and has the name kdsmain.msg.
The Windows Agent operations log has the same name format as Linux/Unix but any colons [:] in the agent name are converted to underlines. That is required
because the Windows operating system does not permit colons in file names. The location depends on many factors including 32 versus 64 bit agent instance.
z/OS TEMS and Agent operations log
There is no separate log for z/OS operations log. Instead it is inter-mixed with diagnostic log lines in the RKLVLOG SYSOUT file.
i/5 Agent operations log
The operations log information is a message queue
QAUTOMON/KMSOMLOG
Chapter 3 ITM Diagnostic String Format and Explanation
In the next chapter we show how to implement diagnostic traces but first let’s talk about the strings that define the diagnostic trace.
Every diagnostic string should start with
error
That is the default and is enough to understand many issues. In the diagnostic log you will see informational and “error” class messages. Text not within quotes is case independent, you can code that as error or ERROR and it means the same exact thing.
When you need more information one or more series of trace controls are added. Here is one example
(unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
Each control extends how much will be traced. The unit: setting relates to the source name from which the binary was compiled. There is a kpxrpcrq.cpp source program. The name can be contracted – for example unix:kpx would mean all source units that start with “kpx”. The Entry restricts the tracing to one function name. The following strings represent the sort of messages desired. More about that in a minute.
There is a rarely used control called comp: for component. Here is an example:
(comp:kde,unit:kdeprxi,Entry:”KDEP_ReceiveXID” all er)
Some source units are tagged with a component. The above means component kde [remote procedure call functions] and the rest as usual. Incidentally this last could also be coded
(comp:kde,unit:kdeprxi,Entry:”KDEP_ReceiveXID” all er)
I just happen to view = as more pleasing in this context.
Error classes are the following:
All, Error, Flow, Time, State, Input, Output, Metrics, Detail, Any
Each can be contracted to the first two characters.
These are not documented completely in any product documentation. The exact usage depends entirely on how the developers used the setting in creating the product code. The following is very general and close to reality.
All means the maximum number of messages.
Any means none of the messages. (!!)
Error means an error condition was observed. That might or might not mean a product error! I often explain to people that diagnostic and operations logs are used to do diagnosis on customer problems. The logs themselves mean little without knowledge of the customer observed issue. There is a project TEMS Audit which suggests possible issues based on TEMS diagnostic logs.
To conclude this chapter here is a hub TEMS workload/heartbeat diagnostic string:
error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er) (unit:kshdhtp,Entry:”getHeaderValue” all) (unit:kshreq,Entry:”buildSQL” all) (unit:kfastpst,Entry:”KFA_PostEvent” all er) (unit:kdssqprs in metrics er) (unit:kdsstc1,Entry:”ProcessTable” all er) (unit:kraafira,Entry:”runAutomationCommand” all) (unit:kglhc1c all)(UNIT:kfaprpst ST ER) (comp:kde,unit:kdebp0r,Entry:”receive_vectors” all er) (comp:kde,unit:kdeprxi,Entry:”KDEP_ReceiveXID” all er)
Each section adds an additional source section and type of diagnostic message. They are all ORed together.
You can put spaces between the () sections or not depending on what you prefer.
Efficiency
In general tracing is very efficient. That said, one customer set KBB_RAS1=ALL on a remote TEMS and the remote TEMS was unstable. Usually IBM Support will specify a trace for a particular service. It also gets used to capture performance data.
In some appendixes near the end there are multiple examples of useful traces.
Chapter 4 Communications Tracing – and Why to Avoid!!
This deserves a separate chapter for two reasons.
1) Communications tracing is usually misused and often delays diagnostic progress for days or week or months.
2) The style of setting communications tracing is wildly different than the style just documented.
Never do Communication Tracing Unless Specifically Instructed
Many ITM issues present as communication issues. Remote TEMSes disconnect from hub TEMS, agents switch from one remote TEMS to another, agents go on and offline, new agents cannot connect to a TEMS, and the list could go on and on. These cases are almost never a communication issue!! Most the time these are workload or configuration issues. That may seem counterintuitive but it is an everyday fact.
There have been some tens of thousands of ITM PMRs a year since 2004. As a guess without research perhaps 120,000 from 2004 to 2017. Perhaps 5,000 have had some apparent communications issue. And yet I have been aware of maybe 50 actual communications problems. If you double that to include cases I was not aware of, that means 100 – maybe one a month. That means 98% of the time there are no underlying communications issues and that is probably an underestimate.
When you add communications tracing beyond the default, most of the diagnostic trace log slots are used for that purpose. You cannot see diagnostic log entries that might inform on what is happening. The diagnostic log segments wrap around and often the communications trace entries cause the log segments to wrap around in 10-20 seconds or lower. You have a much smaller window of time and that might not include the time of most interest.
In proper communications tracing you need tracing at each point of the ITM conversations – like the agent side and the TEMS side. You also usually need a TCP packet trace on each side, a substantial effort. Turning on communication tracing on one side rarely adds useful information and prevents recording of useful information. It is 100% wrong to turn on extended communication tracing unless it has been specifically requested by IBM Support L3 or Development or by an experienced support IBM engineer.
Turn added communication tracing on and you delay resolution 98% or more the time. Please take this into consideration before wasting your and IBM’s valuable time and delaying resolution of your problem. If another IBMer asks you to do that push back unless the request comes from an experienced source. If a colleague suggest that, resist that unless you can do all the needed work and get the go ahead from IBM Support experienced engineers.
Communications tracing Details
Tracing communications involves a large suite of interconnected functions. Because of this the default is to turn them on at different levels at once. There are 3 environment variables involved and one super environment variable that can turn them on or off together.
KDC_DEBUG: Low level TCP logic
KDE_DEBUG: Remote Procedure Call logic
KDH_DEBUG: Internal Web Server logic
KBS_DEBUG: the Logical combination of above three at once. This is not available in all circumstances. I think it was introduced in ITM 621 and some cases still don’t have.
Possible Settings
N – Normal. This default level gives enough information about status changes and errors to track what is happening.
Y – Yes. Adds more details to understand the sequence of logic.
I – Inhibit. No communication tracing at all.
V – Verbose. More details
S – State. State changes
T – Trace.
M – Maximum. Adds more. This can be useful if you have a mysterious “failure to resolve DNS name” condition. It will show you that name and then you can track it down in the configuration.
D – Detail. Adds dumps of internal control blocks and packets
A – All. Adds “private” functions which are normally suppressed as trivial and uninteresting.
In later chapters we will see how to change these settings statically or dynamically.
Chapter 5 Implementing Diagnostic Trace Controls – Statically
Modern environment files – Linux/Unix
This was introduced at ITM 623 GA – 2011/8/31.
To implement this tracing control, which tracking incoming agent results
error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
Using the technique from Chapter 1 create a file xx.environment. Add this line
KBB_RAS1=error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
Note that no quoting is needed around that string.
The one exception is the Warehouse Proxy Agent – HD – before ITM 630 FP7. In that case the ini file is needed.
If you are using communication tracings, this is where they are added, like
KBS_DEBUG=Y
Update ini files – Linux/Unix
At all levels you can add the line
KBB_RAS1=error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
to the xx.ini line. That is the only thing needed for normal agents. Each time the agent is started [or stopped] the .config file is recreated.
If you are using communication tracings, this is where they are added, like
KBS_DEBUG=Y
For instanced processes like TEMS you need to update the config file. Do that by updating the xx.ini file and reconfiguring the process. You can also update the config file for instanced agents. For example, on recent TEMS you can add this to ms.config or from the beginning hostname_ms_temsname.config
KBB_RAS1= ‘error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er) ‘
Note the single quotes. That is because this is logically a korn shell file inclusion. Make sure it follows any existing KB_RAS1 setting.
If you are using communication tracings, this is where they are added, like
KBS_DEBUG=’Y’
The advantages of updating xx.environment are one) no config needed, just a process recycle and two) such updates are automatically preserved based any ITM maintenance process.
Windows Static Trace Setting
Using MTEMs, right click on the process line, select Advanced, select Edit Trace Parms…
In the dialog box enter this in the Ras1 Filter box
error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
Note that there is no explicit KBB_RAS1 here. If you include it things won’t work.
In that dialog box you can alter KDC_DEBUG settings. For the other ones add using Advanced/Edit Variables…
i/5 Agents
Add the following line to QAUTOTMP/KMSPARM file member KBBENV and recycle the agent
KBB_RAS1=error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
If you are using communication tracings, this is where they are added, like
KBS_DEBUG=Y
z/OS
Add the following line to RKANDATU PDS member KxxENV, where xx represents the ITM process. For example KMSENV for TEMS or KD5ENV for the DB2Plex agent
KBB_RAS1=error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
If you are using communication tracings, this is where they are added, like
KBS_DEBUG=Y
Chapter 6 Implementing Diagnostic Trace Controls – settrace
From ITM 623 FP2 new tacmd functions were introduced
Tacmd settrace – set trace at a managed system
Tacmd listtrace – display current settrace settings
Benefits:
- No need for direct access to the system where the ITM process running
- The initial userid validation serves instead of needing a per system validation.
- Works transparently across all firewalls, KDE_Gateways and ephemeral connections.
Requirements: All links must be at ITM 623 FP2 or later.
With this you can alter six tracing properties
KBB_RAS1
KDC_DEBUG
KDE_DEBUG
KDH_DEBUG
KLX_DEBUG [z/OS only]
KBS_DEBUG
First do a tacmd login first to the hub TEMS. A typical command is
Linux/Unix
cd <installdir>/bin
./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o ‘error ‘
Windows
cd <installdir>\bin
tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o “error ”
The reset option changes the tracing property back to what is was before the most recent settrace
Linux/Unix: ./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -r
Windows: tacmd settrace -m <managed_system_name> -p KBB_RAS1 -r
Here is an example from earlier:
Linux/Unix
./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o ‘error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)’
Windows looks more complicated. The surrounding quotes are a double quote ” and any embedded double quotes must be tripled.
./tacmd settrace -m <managed_system_name> -p KBB_RAS1 -o “error (unit:kpxrpcrq,Entry:”””IRA_NCS_Sample””” state er) “
Here is a basic workload trace with heartbeat observation at hub TEMS
./tacmd settrace -m <hubtemsnodeid> -p KBB_RAS1 -o ‘error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)(unit:kshdhtp,Entry:”getHeaderValue” all) (unit:kshreq,Entry:”buildSQL” all)(unit:kfastpst,Entry:”KFA_PostEvent” all er)(unit:kdssqprs in metrics er)(unit:kdsstc1,Entry:”ProcessTable” all er)(unit:kraafira,Entry:”runAutomationCommand” all)(unit:kglhc1c all)(UNIT:kfaprpst ST ER)(comp:kde,unit:kdebp0r,Entry:”receive_vectors” all er)(comp:kde,unit:kdeprxi,Entry:”KDEP_ReceiveXID” all er) ‘
The tacmd settrace is tremendous improvement compared to what was done earlier. If you need continued tracing starting now a static trace is set and also a tacmd settrace to start immediately. That way tracing will continue after a process restart.
The major case where tacmd settrace is not useful is when a condition occurs so early there is no time to get it specified before the condition occurs. In that case a static is the only possible method.
Communication Trace controls change dynamically
Linux/Unix: ./tacmd settrace -p KBS_DEBUG -o ‘Y’
Windows: tacmd settrace -p KBS_DEBUG -o “Y”
Chapter 7 Implementing Diagnostic Trace Controls – earlier methods
The major earlier method is “Monitoring Service Console”. You start the process by a web browser session to the system where the ITM process of interest is running.
This will present a list of services. The one of interest is IBM Tivoli Monitoring Service Console. Click on that line. At that point you need to enter a userid/password which is validated using “native” services – appropriate to the system platform. For example if system is Windows, that must be a valid userid/password on that system.
When you arrive at the service console screen there is an area to compose or copy/paste in commands and then press Submit button. The commands are
ras1 – help function
ras1 set – turn on/off tracing
The format looks a lot like above. Here is an example
ras1 set error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” state er)
One thing that is different is how to turn such a trace off. “any” turns everything off and add er [error] to turn back on tracing.
ras1 set error (unit:kpxrpcrq,Entry:”IRA_NCS_Sample” any er)
ras1 list – list current tracing
You can also set KDC_DEBUG/KDE_DEBUG/KDH_DEBUG. It does NOT recognize KBS_DEBUG.
bss1 config KDC_DEBUG=Y
bss1 config KDC_DEBUG=N
Problems with the Service Console
First you need password and userid credentials on the target system. In large environments that may not be available.
Second, you need to get to communication access. If firewalls are involved that can be quite complicated. If there are several ITM processes, each one will have an internal web server. Only one gets to own the port 1920 [default]. The other web servers register with the port owner. When you click on the Monitoring Web Server, that might well refer to another listening port like 57246. Firewalls might be open to 1920 but not to that other and suddenly you cannot get to the service console.
Third, z/OS agents can connect via VTAM/SNA [not TCP] and so cannot get to that internal web server usually.
Fourth, turning traces on and off are crazy hard because of needing to track what was turned on and then using ANY to turn it back off later.
Chapter 8 z/OS dynamic trace changing.
Another way of dynamically changing traces exists only on z/OS ITM processes. At this writing it has been largely supplanted by tacmd settrace, but it could still be useful in a largely z/OS ITM environment. Following is an example taken from a technote published a few years ago.
1) Create a member in &rhilev.&rte.RKANCMDU called TRACEON and place these statements in it:
CTDS TRACE ADD FILTER ID=BL1 UNIT=KFAOT CLASS(ALL)
CTDS TRACE ADD FILTER ID=BL2 UNIT=KO4RULEX CLASS(ST,ER)
CTDS TRACE ADD FILTER ID=BL3 UNIT=KO4SITMA CLASS(ST,ER)
CTDS TRACE ADD FILTER ID=BL4 ENTRY:IBInterface::sendAsyncRequest (FL,ER)
CTDS TRACE ADD FILTER ID=BL5 UNIT=KO4LODGE CLASS(ST,ER)
2) Create a member in &rhilev.&rte.RKANCMDU called TRACEOFF and place these
statements in it:
CTDS TRACE REMOVE FILTER ID=BL1
CTDS TRACE REMOVE FILTER ID=BL2
CTDS TRACE REMOVE FILTER ID=BL3
CTDS TRACE REMOVE FILTER ID=BL4
CTDS TRACE REMOVE FILTER ID=BL5
3) Issue the following command from SDSF or your system
console:
/f <tems_task>,TRACEON
4) Recreate the problem and
then:
/f <tems_task>,TRACEOFF
Remember that you can use
/f cmstask,TLVLOG SWITCH
to close the current RKLVLOG sysout file and start a new one. The just closed one can be “printed” to a disk file in SDSF in preparation for sending to IBM support.
Summary
This concludes the discussion on how to defined, start and stop ITM dynamic tracing. ITM development worked hard to make the product traceable to enable effective support.
Sitworld: Table of Contents
History and Earlier versions
1.00000
Initial publication
1.01000
Chapter 2 on Operations Log added
1.02000
Correct Linux/Unix KBB_RAS1_LOG example
1.03000
Chapter 3 on Diagnostic Trace Strings
1.04000
Chapter 4 on Communications tracing
Chapter 5 on Static diagnostic trace settings
1.05000
Chapter 6 on modern dynamic trace changes
1.06000
Chapter 7 on older dynamic trace change methods
1.07
Chapter 8 on z/OS only tracing
Photo Note: Jealous Octopus Holding On to Jade Sculpture
John, This is great information. Is there a specific trace for tacmd login on TEMS ?
LikeLike