Version 1.00000 – 8 October 2018
John Alvord, IBM Corporation
jalvord@us.ibm.com
Inspiration
In August 2014, the Database Health Checker began running at IBM ECUREP as an Analysis On Arrival (AOA) task on each incoming hub and remote TEMS pdcollect. Since then, TEMS Audit and Event History Audit reports have been added. The reports are very useful because they identify known error conditions and thus speed ITM diagnosis of issues. Each of the tools can be run by any customer, but the AOA reports are not immediately visible. Any customer could ask for them, but because they are not visible, no one ever asks. At the same time the reports have become more complex and challenging to digest.
With a recent change, the process has been extended to create a short list of critical issues which will automatically be added to the S/F Case or PMR as a short email text. That creates visibility for critical issues. This document presents one specific critical issue – excess MS_Offline type situation usage.
Please note that the conditions identified may not be the issue the problem case was opened for. For example, one recent case was an unexpected FTO hub TEMS switch to backup. After close study, the major issues were mal-configured agents including duplicate name cases, Virtual Hub Table Update floods and several other items. There are also rare cases where a report will be produced concerning an obsolete TEMS that is still installed but not in active use. In that case the report could be ignored, although uninstalling the TEMS would be a good idea.
Getting more information
If you are viewing this document as a customer working with IBM Support, you are welcome to request copies of the Analysis On Arrival reports if they are available. Be sure to mention the unpack directory from the AOA Critical Issue report.
TEMS Audit – temsaud.csv [any hub or remote TEMS]
Database Health Checker – datahealth.csv [any hub TEMS]
Event History Audit – eventaud.csv [any hub or remote TEMS]
There are cases when no report is generated. Sometimes that means there were no advisories. TEMS Audit is not produced when the relevant log files cannot be identified. Database Health Checker is invoked but skips its report if the TEMS appears to be a remote TEMS. Event History Audit and Database Health Checker are not run if errors are detected in the table extract process.
Visit the links above to access the AOA programs if you want to run them on your own schedule.
MS_Offline conditions
MS_Offline dataserver evaluation rate $prate agents/sec dangerously high
MS_Offline SITMON evaluation rate $prate agents/sec dangerously high
MS_Offline type situations – $miss_reason are missing the Reason *NE FA test. See DATAREPORT017
MS_Offline type situations are high impact, and too many of them running too often in a large system can affect hub TEMS stability. See Sitworld: MS_Offline: Myth and Reality for a deep dive introduction to the operation of the offline process. The quick takeaways are:
1) The Offline detection process is rather leisurely, on the order of 10-20 minutes, so a low sampling interval like 1 minute or 30 seconds wastes resources to little advantage.
2) The Reason *NE FA test is critical to avoid checking offline status when the hub TEMS has literally no idea of current status, such as just after startup.
3) The dataserver evaluation rate is how many agents per second the TEMS dataserver (the SQL process) has to evaluate.
4) The SITMON evaluation rate is how often the Situation Monitor logic has to calculate what is happening. This is mostly driven by MS_Offline type situations with Persist>1 but it is also driven by offline agents. SITMON evaluation is 10-20 times more expensive than dataserver evaluation.
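The takeaways above can be turned into a rough pre-check of a situation list. The following is only a sketch under stated assumptions, not the logic the AOA reports use: it assumes each MS_Offline type situation examines every agent once per sampling interval, it treats 600 seconds as the smallest sensible interval (from the 10-20 minute detection time in takeaway 1), and it assumes the formula text contains the literal fragment Reason *NE FA. The class, function names and formula layout are all hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed shape of the Reason test inside a formula; the stored form may differ.
REASON_TEST = re.compile(r"Reason\s+\*NE\s+'?FA(?![A-Za-z0-9])", re.IGNORECASE)


@dataclass
class OfflineSituation:
    name: str
    formula: str           # situation formula text (hypothetical layout)
    interval_seconds: int  # sampling interval


def has_reason_test(sit: OfflineSituation) -> bool:
    """True when the formula contains a Reason *NE FA test."""
    return REASON_TEST.search(sit.formula) is not None


def evaluation_rate(agent_count: int, situations: list) -> float:
    """Rough agents/sec, assuming every situation checks every agent once per interval."""
    return sum(agent_count / s.interval_seconds for s in situations)


def review(agent_count: int, situations: list, min_interval: int = 600):
    """Return the estimated rate and a list of notes about risky situations."""
    notes = []
    for s in situations:
        if s.interval_seconds < min_interval:
            notes.append(f"{s.name}: sampling interval {s.interval_seconds}s is below {min_interval}s")
        if not has_reason_test(s):
            notes.append(f"{s.name}: missing Reason *NE FA test")
    return evaluation_rate(agent_count, situations), notes


if __name__ == "__main__":
    sits = [
        OfflineSituation(
            "MS_Offline",
            "*IF *VALUE ManagedSystem.Status *EQ '*OFFLINE' "
            "*AND *VALUE ManagedSystem.Reason *NE 'FA'",
            900,
        ),
        OfflineSituation(
            "Custom_Offline_Linux",  # hypothetical customer situation
            "*IF *VALUE ManagedSystem.Status *EQ '*OFFLINE'",
            60,
        ),
    ]
    rate, notes = review(6000, sits)
    print(f"estimated dataserver rate: {rate:.1f} agents/sec")
    for note in notes:
        print(note)
```

In this made-up example the one-minute situation contributes 100 of the roughly 107 agents/sec, which illustrates why a single short-interval situation can dominate the load. The sketch does not model the SITMON rate, which depends on Persist settings and on how many agents are actually offline.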
The recovery action plan is simple: stop all the MS_Offline type situations and set Run at Startup to off. Then use the product-provided MS_Offline situation, with the sampling interval set to 10 or 15 minutes. If multiple end users must be notified, do that work in the event receiver. In one recent small environment without an event receiver, the email alert was changed to a shell command which determined who needed to be emailed based on product or agent name. Never do that for large numbers of agents, because those emails can all start at once and destabilize the hub TEMS.
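The routing idea in the small-environment example can be sketched as follows. This is an illustration only, not the command that site used: it assumes the product code is the last colon-separated segment of the agent name, and the routing table, addresses and function names are all hypothetical. It only decides who should be notified; sending the mail is left out.

```python
import sys

# Hypothetical routing table: product code -> addresses to notify.
ROUTES = {
    "LZ": ["linux-team@example.com"],
    "NT": ["windows-team@example.com"],
    "UD": ["db2-team@example.com"],
}
DEFAULT_RECIPIENTS = ["itm-admins@example.com"]


def product_code(agent_name: str) -> str:
    """Assume the product code is the last colon-separated segment of the name."""
    return agent_name.rsplit(":", 1)[-1].upper()


def recipients(agent_name: str) -> list:
    """Look up who should hear about this agent going offline."""
    return ROUTES.get(product_code(agent_name), DEFAULT_RECIPIENTS)


if __name__ == "__main__":
    # Expected to be called with the offline agent name as the only argument.
    agent = sys.argv[1] if len(sys.argv) > 1 else "Primary:WINHOST1:NT"
    print(agent, "->", ", ".join(recipients(agent)))
```

As the text warns, a per-event command like this is only reasonable for a small number of agents; in a large environment the same lookup belongs in the event receiver.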
Summary
This document explains excess MS_Offline type situation activity and how to correct the issue.
History
1.00000
Initial release
Note: 2018 – Home Grown Meyer Lemons