I bet we would all agree that the ultimate goal of IT monitoring is to prevent IT service outages. But admit it: even with all the monitors set up in System Center Operations Manager and alerts configured to notify admins when vital thresholds have been reached, we are still not efficient enough at solving (not to mention preventing) critical issues on time.
Dealing with alerts generated in SCOM comes with several complications. Too many alerts and no clear priority system are two examples, and they lead administrators into 'alert ignorance'.
In this article we are going to look at a different approach to identifying upcoming issues in your IT environment which will introduce clarity and guidance into the assorted jungle of alerts and capacity issues.
Built-in performance reports
When raw data is collected in SCOM, it is also automatically moved to the Operations Manager Data Warehouse and aggregated into Hourly and Daily tables.
Each counter has three values stored for every point in time: the minimum, the maximum and the average.
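To make the aggregation idea concrete, here is a minimal sketch of rolling raw samples up into hourly minimum, maximum and average values. It is only an illustration of the principle, not how the Data Warehouse does it internally; the function name `aggregate_hourly` and the sample values are made up for this example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_hourly(samples):
    """Group (timestamp, value) samples by hour and return min, max and avg per hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour: {"min": min(vals), "max": max(vals), "avg": mean(vals)}
        for hour, vals in sorted(buckets.items())
    }


# Hypothetical raw samples for one counter.
raw = [
    (datetime(2024, 1, 1, 10, 5), 40.0),
    (datetime(2024, 1, 1, 10, 35), 60.0),
    (datetime(2024, 1, 1, 11, 10), 55.0),
]
hourly = aggregate_hourly(raw)
```

The same grouping with a date instead of an hour as the key would give the daily values.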
Provided we know which managed entity and instance parameter are causing trouble, we can run a built-in performance report to see all three of these values in one chart over a specified period of time, as shown in the example below.
This report is a good starting point when trying to understand the situation you are in, for example when investigating an active alert. It provides a historical perspective, which is important when trying to evaluate how good or bad the current values are.
It does, however, share some of the most common issues of System Center reports: it is time-consuming to run and it is difficult to gain true value from it. To make it run faster we tend to use a shorter time frame in the filter, which gives us a less accurate basis for assessment. Not to mention that the main reason to run it usually arises when it is already too late: either a service has failed or an alert has already been raised...
So in the end we have a tool, but we are still left with our gut feeling when trying to understand whether current values are good or bad, and whether this is the issue we should focus on or whether something more urgent is going on somewhere else in our landscape.
Let us tackle these common report issues individually, starting with historical perspective.
The main reason for viewing historical data is to have a good basis for judging the current situation. By understanding what was previously 'normal' for a specific counter, we can calculate the deviation from normality. This deviation will be one of our guides when interpreting information and deciding how much trouble we are in right now (and whether we are in trouble at all).
In this chart we see the three standard measures (Min, Max and Average) as three separate lines. In addition to the standard values, we calculated the normal range for this counter, depicted in gray. To add a little precision to the visual information, we show the actual Delta value, marking exactly how far the current month is outside the regular range.
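For readers who want to experiment with the idea, here is a minimal sketch of a baseline range and a Delta value. The article does not prescribe a formula for the normal range, so this example simply assumes the mean of past values plus or minus `k` standard deviations. The function names and the numbers are hypothetical.

```python
from statistics import mean, stdev


def normal_range(history, k=1.0):
    """Return (low, high) as mean +/- k sample standard deviations of past values.

    This definition of 'normal' is an assumption made for the example.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def delta_from_range(current, low, high):
    """Signed distance of `current` from the range; 0.0 when it lies inside."""
    if current < low:
        return current - low
    if current > high:
        return current - high
    return 0.0


# Hypothetical monthly averages for one counter, then the current month.
history = [50.0, 52.0, 48.0, 51.0, 49.0]
low, high = normal_range(history)
delta = delta_from_range(40.0, low, high)  # negative: below the normal range
```

A percentile-based range would work just as well here; the point is only that the Delta is measured from the nearest edge of the range, not from the average.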
Baseline + Forecast = Clear Priority
As mentioned previously, it is sometimes hard to decide what to act on while being flooded with data coming from various managed entities. Using the baseline range is already a great help. But hold on. Remember last week's article about Forecasting System Center Operations Manager data? Now we have another perfect opportunity to incorporate the forecast data. It will help you decide on your priority list!
Let's compare two counters that are currently outside the boundaries of their baseline. The first is the one from the previous example, this time with a forecast added to it. We can see that, according to the forecast, this issue of C: disk consumption is going to escalate over time. A clear indication that something is chewing up your system drive.
The second managed entity is also below the normal range. Do note the forecast values, though: we have a clear indication of an upcoming recovery.
Now activity planning is simple:
- Fix drive C: so that the problem stops escalating.
- Make sure D: is recovering as planned.
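The reasoning behind this plan can be sketched in a few lines. The example below assumes a plain least-squares trend line as the forecast, which is a stand-in rather than the forecasting method from the earlier article. It compares how far a counter is outside its normal range now with how far the forecast puts it later. All names, ranges and free-space figures are hypothetical.

```python
def linear_forecast(values, steps):
    """Fit a least-squares line through values (x = 0, 1, 2, ...) and extend it."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + i) for i in range(steps)]


def distance_outside(value, low, high):
    """How far `value` lies outside [low, high]; 0.0 when inside."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def classify(values, low, high, steps=3):
    """Compare the latest distance from the range with the forecast distance."""
    now = distance_outside(values[-1], low, high)
    later = distance_outside(linear_forecast(values, steps)[-1], low, high)
    if now == 0.0 and later == 0.0:
        return "normal"
    if later > now:
        return "escalating"
    if later < now:
        return "recovering"
    return "unchanged"


# Hypothetical free space (%) per month, with an assumed normal range of 24-32 %.
c_free = [30.0, 28.0, 25.0, 21.0, 18.0]
d_free = [18.0, 19.0, 21.0, 22.0, 23.0]
c_status = classify(c_free, 24.0, 32.0)  # trend keeps falling
d_status = classify(d_free, 24.0, 32.0)  # trend climbs back into range
```

Both drives are below the range today, but only the first one is moving further away from it, which is what puts it at the top of the list.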
This shows just how much easier decision making becomes when our knowledge expands across time and our environment. But at this point we are still stuck with a massive number of managed entities, instances and counters...
So far we have created easy-to-understand charts and filled them with useful data. But we still have so many of them that it is hard to read through them all, not to mention making any clever decisions and choosing an action path. We need to focus on something: the biggest problems, and the problems that are within our personal area of responsibility.
Let's start by filtering: we cut the total list of monitored managed entities down to only the ones that have gone outside of normal behavior by more than x %. The list just got shorter! Now we can order them by Delta value (the current month's difference from the normal range), starting with the largest deviation from the norm.
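As a rough sketch, the filter and the ordering could look like this. The article leaves open what the x % is measured against, so the example assumes it is the Delta expressed as a percentage of the width of the normal range. The class name `CounterStatus`, the entities and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CounterStatus:
    entity: str
    current: float
    low: float   # lower edge of the normal range
    high: float  # upper edge of the normal range

    @property
    def delta(self):
        """Signed distance from the normal range; 0.0 when inside it."""
        if self.current < self.low:
            return self.current - self.low
        if self.current > self.high:
            return self.current - self.high
        return 0.0

    @property
    def deviation_pct(self):
        """Assumption: deviation as a percentage of the range width."""
        return abs(self.delta) / (self.high - self.low) * 100


def action_list(counters, threshold_pct):
    """Keep counters deviating by more than threshold_pct, largest Delta first."""
    flagged = [c for c in counters if c.deviation_pct > threshold_pct]
    return sorted(flagged, key=lambda c: abs(c.delta), reverse=True)


counters = [
    CounterStatus("srv01 C: free %", 18.0, 24.0, 32.0),
    CounterStatus("srv02 CPU %", 71.0, 40.0, 70.0),
    CounterStatus("srv03 memory %", 95.0, 50.0, 80.0),
    CounterStatus("srv04 D: free %", 28.0, 24.0, 32.0),
]
ranked = action_list(counters, threshold_pct=10.0)
```

When counters with very different units end up in the same list, sorting by `deviation_pct` instead of the raw Delta may give a fairer ranking.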
And here is your action list in order of importance:
But these are all kinds of unrelated managed entities, you might think... Not your area of responsibility?
This is exactly why all ITIL roads lead to IT Services.
The next smart thing to do is to group technical issues by the services they affect. And here is your final action list, all compressed into one chart, grouped by IT Service and ordered by number of objects with abnormal behavior:
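The grouping behind a chart like this could be sketched as follows. The example assumes a simple lookup from managed entity to a single IT Service, which is a simplification because a real entity may support several services. The function name, entity names and service names are hypothetical.

```python
from collections import Counter


def abnormal_by_service(abnormal_entities, service_of):
    """Count abnormal entities per IT Service, highest count first.

    Entities without a known service are skipped in this sketch.
    """
    counts = Counter(
        service_of[entity] for entity in abnormal_entities if entity in service_of
    )
    return counts.most_common()


# Hypothetical mapping from managed entity to the service it supports.
service_of = {
    "srv01": "Email",
    "srv03": "Email",
    "db02": "Email",
    "srv05": "Intranet",
    "srv07": "Intranet",
}
abnormal = ["srv01", "srv03", "srv05", "db02"]
per_service = abnormal_by_service(abnormal, service_of)
```

Each pair in `per_service` corresponds to one bar: the service name and the number of its objects that are currently behaving abnormally.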
Time to brew some coffee and click on the relevant bar to see the details. Well, you could do that in our demo, not on the screenshot above... Give us a shout if you want to see how this works!