John Alvord, IBM Corporation
jalvord@us.ibm.com
Introduction
The Wonderful World of Situations blog recently passed 50,000 total views. Here are some statistics, pointers to items you might have missed, and a list of the posts I consider most important.
The total views list is biased toward blog posts that have been available longest. The rate list is biased toward more recent blog posts.
My “Most Important” posts are the ones with the biggest potential to reduce costs for customers and IBM Support by eliminating problems or speeding up recovery. I have already seen a case where a blog post allowed a customer to recover without any added support. That post was “Running TEMS without SITMON”, and everybody benefited. I expect there are other such cases that I never hear about.
Top 10 by total views
Sorted by total views
Views | Days | Rate (views/day) | Title
3531 | 392 | 9.01 | MS_Offline – Myth and Reality
3376 | 250 | 13.50 | ITM Agent Health Survey
3369 | 412 | 8.18 | ITM Silver Blaze – Agent Responsiveness Checker
2777 | 420 | 6.61 | ITM TEMS Stress Tester Experiment
2459 | 351 | 7.01 | Mixed Up Situations
2016 | 317 | 6.36 | Detecting and Recovering from High Agent CPU Usage
1997 | 405 | 4.93 | Auditing TEMS for Improved Performance
1876 | 371 | 5.06 | Rational Choices for Situation Sampling Intervals
1871 | 306 | 6.11 | Best Practice TEMS Database Backup and Recovery
1700 | 263 | 6.46 | Sampling Interval and Time Tests
Top 10 By View Rate
Sorted by rate of views per day
Views | Days | Rate (views/day) | Title
375 | 16 | 23.44 | A Situation By Any Other Name…
1081 | 55 | 19.65 | ITM Situation Audit
516 | 37 | 13.95 | Running TEMS without SITMON
3376 | 250 | 13.50 | ITM Agent Health Survey
3531 | 392 | 9.01 | MS_Offline – Myth and Reality
1122 | 134 | 8.37 | Situation Limits
3369 | 412 | 8.18 | ITM Silver Blaze – Agent Responsiveness Checker
129 | 16 | 8.06 | Do It Yourself TEMS Table Display
2459 | 351 | 7.01 | Mixed Up Situations
2777 | 420 | 6.61 | ITM TEMS Stress Tester Experiment
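The Rate column in both lists is simply Views divided by Days, rounded to two places. As a small sketch, here is how the two orderings fall out of the same data. It uses four rows copied from the lists above; the names `posts`, `rate`, `by_views`, and `by_rate` are made up for this illustration.

```python
# (title, views, days) for four posts, copied from the lists above
posts = [
    ("MS_Offline – Myth and Reality", 3531, 392),
    ("ITM Agent Health Survey", 3376, 250),
    ("A Situation By Any Other Name…", 375, 16),
    ("ITM Situation Audit", 1081, 55),
]


def rate(views, days):
    """Average views per day, rounded the same way as the Rate column."""
    return round(views / days, 2)


# Same data, two orderings: total views favors older posts,
# views per day favors newer ones.
by_views = sorted(posts, key=lambda p: p[1], reverse=True)
by_rate = sorted(posts, key=lambda p: p[1] / p[2], reverse=True)

for title, views, days in by_rate:
    print(f"{views:5d} | {days:3d} | {rate(views, days):5.2f} | {title}")
```

The 16-day-old post comes first by rate but last by total views, which is the bias mentioned in the introduction.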
Top 5 By Importance [My Prejudiced View]
Best Practice TEMS Database Backup and Recovery
The most costly support cases arise when a customer does not have a proper backup. One memorable case came after a Storage Area Network device lost power and the most recent backup was over a year old. I talk every day to people who use TSM to make copies of the TEMS databases, and that is almost always insufficient. This post was written jointly by a top L3 engineer and me. If everyone followed it, the time to recover would drop substantially.
MS_Offline type situations are extremely weighty and cause problems “at a distance”. For example, a recent case with 9545 agents and 22 MS_Offline situations at a 5-minute sampling interval has spawned multiple IBM Support interactions. They all come back to this one issue. If Persist>1 is set, the problems are much worse. The blog photo here shows a California Condor [VERY LARGE VULTURE] lurking outside a window. Treat MS_Offline type situations as dangerous creatures and you will reduce your risk of injury and pain.
This has been available for 4+ years. It is a wonderful way to examine the dynamic impact of workload [situations, SOAP, real time data requests, etc.] on a TEMS. With that knowledge you can make changes to avoid problem conditions. I have one customer who runs this on every TEMS each weekend and, if “advisory messages” are present [noted via a non-zero exit code], sends the report to an analyst for review. The rate of emergency IBM Support meetings has dropped to near zero… at least for this area.
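The weekend routine that customer uses could be wired up roughly as follows. This is only a sketch under stated assumptions: the audit command line is whatever you already run (none is invented here), and `run_audit` and `forward_report` are hypothetical names for this illustration. The only behavior taken from the description above is that a non-zero exit code means advisory messages are present.

```python
import subprocess
import sys


def run_audit(command, forward_report):
    """Run an audit command and forward its output when advisories exist.

    command        -- argument list for the audit tool (supplied by the caller)
    forward_report -- callable that receives the report text, for example a
                      function that mails it to an analyst (hypothetical)
    Returns True if the report was forwarded, False otherwise.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    # Assumption from the text: non-zero exit code == advisory messages present.
    if result.returncode != 0:
        forward_report(result.stdout)
        return True
    return False
```

A scheduler such as cron would call `run_audit` once per TEMS each weekend, passing a `forward_report` that delivers the text to whoever reviews it.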
This tool provides a view of agents which are online but possibly non-responsive. In such cases real time data response is slow and partially missing, situations are not running, and historical data is not being recorded. These are things everyone should worry about. This exposes the guard dog that doesn’t bark.
This is the most recent project. It performs a static analysis on all distributed situations and produces a report of warning messages. It also reports which situations need TEMS filtering [instead of Agent filtering], which is a prime performance killer. Together with TEMS Audit you can really increase efficiency, reducing the cost of monitoring. It also gives early warning of situations with problems. Surprisingly, 50 of 51,000 situations studied actually had syntax errors, like VALUE instead of *VALUE. Anyway, I expect this to be an important tool over time.
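To give a feel for the VALUE versus *VALUE kind of error, here is a toy check for formula keywords that are missing their leading asterisk. It is not the Situation Audit logic, just a minimal sketch: the keyword list is a small illustrative subset, and the name `bare_keywords` is made up for this example.

```python
import re

# Illustrative subset of situation formula keywords (not a complete list).
KEYWORDS = ("IF", "VALUE", "AND", "OR", "EQ", "NE", "GT", "GE", "LT", "LE")

# A keyword counts as "bare" when it is not preceded by '*' and is not part
# of a longer name such as Table.Attribute.
_BARE = re.compile(r"(?<![*.\w])(" + "|".join(KEYWORDS) + r")(?![.\w])")
_QUOTED = re.compile(r"'[^']*'")


def bare_keywords(formula):
    """Return the keywords in a formula that lack their leading asterisk."""
    # Blank out quoted literals first so a value like 'OR' is not flagged.
    stripped = _QUOTED.sub("''", formula)
    return [m.group(1) for m in _BARE.finditer(stripped)]


print(bare_keywords("*IF *VALUE NT_Process.Name *EQ 'OR'"))
print(bare_keywords("*IF VALUE NT_Process.Name *EQ 'x'"))
```

The first formula yields an empty list and the second yields `['VALUE']`.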
Summary
This is a fifteen-month review of the blog posts.
Sitworld: Table of Contents
Photo Note: 400 pound Jade piece by Don Wobber