Saturday, June 29, 2013

CCIE DC: Thresholds in UCS

Hi Guys

Here is another amazing example of how cool UCS is. Have you ever wondered what the threshold policies are in UCS?

Threshold policies and statistics collection are two features of UCS that, to be honest, a lot of other blade chassis solutions either don't have at all or make you pay for! In UCS they're totally free, and it rocks.


As usual I have to credit the following for helping me work it all out:
http://www.cisco.com/en/US/docs/unified_computing/ucs/sw/gui/config/guide/141/UCSM_GUI_Configuration_Guide_141_chapter47.html

Also here is another great blog on the topic:


http://benincosa.com/blog/?p=622


OK! Let's chat about it.

So first of all, you have two things you're worrying about here: thresholds and statistics.

Now statistics are collected all the time. You don't have to do anything to make sure they are collected; all you might want to change is the interval at which UCS collects them. Let me explain.

There are two params that UCS uses, the collection interval and the reporting interval, both of which you change under the ADMIN tab -> Stats Management -> Collection Policies.

You can configure different intervals on the collection policies for different UCS objects, but you cannot rename the policies or change what it is that they collect statistics for.






Collection policies are shown:






So the COLLECTION INTERVAL is how often UCS will sample the values, and the reporting interval controls how long UCS will actually hold onto this data so that you can see it in the GUI. You always want to set the collection (sampling) interval nice and low (the lowest is 30 seconds) so that you can more easily detect spikes etc.
The info from the last 5 reporting intervals is stored in UCS. In our example we set the reporting interval to 2 minutes, so 2 x 5 = 10 minutes, and that's roughly what we can see on our graph:




Our motherboard graph uses the default collection policy setting, which is set to 15 minutes, so there we can see the stats for 15 x 5 = 75 minutes.
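If you want to sanity check how much history you should expect on a graph, here is a minimal Python sketch of the "5 reporting intervals" rule described above. The function name and constant are just made up for illustration, they are not anything from UCS itself.

```python
# Number of reporting intervals UCS keeps, as described in this post.
REPORTING_INTERVALS_KEPT = 5


def visible_history_minutes(reporting_interval_minutes):
    """Return roughly how many minutes of stats the graph should show."""
    return reporting_interval_minutes * REPORTING_INTERVALS_KEPT


print(visible_history_minutes(2))   # 10 (our 2 minute example)
print(visible_history_minutes(15))  # 75 (the 15 minute motherboard example)
```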



Let's take an easy example like a NIC on a server:


A lot of devices will have a statistics tab, from which you can gather quite a bit of info. What we are looking at here is the AVG, Min, Max and Delta, and finally the value column which is the running total since the statistics have started being collected.


Let's look at a NIC adapter to get some statistics.



So the columns used are Min, Max, Average and Delta. The Average is obviously the average over that particular reporting period, the Maximum is the highest the value has been during the SINGLE reporting period, and the Minimum is the lowest value seen during this reporting period. Finally, the DELTA is probably the most important: this is the value that was LAST SEEN, so the DELTA is the value from the most recent statistics collection (one collection interval) within the reporting period.

Think of it this way: The reporting period is the whole lap in a Formula 1 Race, the Delta is just that particular Sector time ;).


Pay attention to the delta as we will be using this later!

OK, the next thing to talk about is threshold policies. A threshold policy lets us monitor values in UCS and raise alerts when they hit certain values. Imagine being able to get an alert raised if the temp of the motherboard exceeds a defined value, or if our uplinks start carrying a certain amount of traffic. AND IT'S ALL BUILT INTO UCS FOR FREE!





(Note: this picture taken from the Cisco Live breakout session "BRKCOM-2004")


So what you can do is create a threshold policy. What this says is: when a particular value is exceeded (the high water mark), please raise an alert, and when the value drops below a certain value (the low water mark), clear the error.
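To make the high water mark / low water mark idea concrete, here is a rough Python sketch of that raise-and-clear behaviour. This is only my illustration of the logic described above, not how UCS implements it internally, and the class and parameter names are hypothetical.

```python
class ThresholdAlarm:
    """Toy alarm: raise above the high water mark, clear below the low one."""

    def __init__(self, raise_at, clear_at):
        if clear_at > raise_at:
            raise ValueError("clear_at must not be higher than raise_at")
        self.raise_at = raise_at  # high water mark
        self.clear_at = clear_at  # low water mark
        self.active = False

    def update(self, value):
        """Feed in the latest sampled value and return the alarm state."""
        if not self.active and value > self.raise_at:
            self.active = True    # value exceeded the high water mark
        elif self.active and value < self.clear_at:
            self.active = False   # value dropped below the low water mark
        return self.active


alarm = ThresholdAlarm(raise_at=50, clear_at=40)
states = [alarm.update(v) for v in [30, 55, 45, 39, 42]]
print(states)  # [False, True, True, False, False]
```

Notice that 45 keeps the alarm raised even though it is below 50: the alarm only clears once the value drops under the low water mark.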

To configure a policy, go to the tab of the threshold you are trying to look at (in our case, that would be the LAN tab), and find the threshold policies section, there will be multiple under each tab.








From here you can create a new threshold policy or modify the default one if you wish, although under certain tabs you can only modify the default policy. Either option is fine: when you create a policy or modify an existing one, you will be able to "create threshold class".




This is where you choose what value you want to base your threshold on. In the server tab, for example, this would be temperature or power consumption. For us, we are going to stick with a threshold for vNICs to alert us when traffic goes over a certain value. Note you can have multiple classes per threshold policy if you wish to monitor multiple values.

(You can find vNIC threshold policies under LAN Tab - Policies, Threshold Policies)


OK, let's select from the drop down the appropriate value:
 

 Next we define our threshold definitions



 The Normal value should be 0.0, since in a non-configured state we "normally" expect the interface to show 0 bytes every 30 seconds. If we see more than 20 bytes above normal (which is 0) on the interface for bytes transmitted, we will raise a critical alarm. When the value goes back down below 10 + the normal value (which is 0 as we discussed), the alarm will clear. See below for an example (I chose low values in this example to make it easier to raise the alarms for you, obviously your values would differ):




Here we see the fault being cleared:






The NORMAL value is a bit confusing. When would this be something other than 0? Temperature would be a good example: you might want the "normal" temperature to be 30, and then you would say if the temp exceeds 50 degrees warn me, but clear the error when it goes back down to 40. In this case you would set the normal value to 30, the up value to 20 (since 30 + 20 = 50 degrees) and the down value to 10 (since 30 + 10 = 40).

I hope that makes sense!
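If it helps, here is the same normal / up / down arithmetic as a tiny Python sketch. The function name is hypothetical, it just adds the offsets to the normal value the way I described above.

```python
def absolute_thresholds(normal, up, down):
    """Return (raise_at, clear_at) given a normal value and the two offsets."""
    raise_at = normal + up    # alarm is raised above this value
    clear_at = normal + down  # alarm clears below this value
    return raise_at, clear_at


print(absolute_thresholds(30, 20, 10))  # (50, 40) for the temperature example
print(absolute_thresholds(0, 20, 10))   # (20, 10) for the vNIC example
```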



Now, probably one of the most common things you will want to do with UCS thresholds is know when an interface is exceeding a certain amount of bandwidth. This requires some math, because the values you will be given are not in a format where you can just say "Tell me when the interface starts going over 8 gigabits per second".

No, unfortunately it is not that easy. Our first problem is that the value is in bytes. The second issue is that it's a delta value, meaning it's just the most recent collection. As my brother put it when explaining it, think of it like the Delta is a lap time and you're trying to work out how fast they are going per sector (the opposite of what I said before, I know ;))

Anyway, let's use a simple example: we want to know when the link goes over 1 Megabit per second. But our delta is every 30 seconds, so that is our first issue: the value we will see for bytes is the value over 30 seconds, not 1 second.

Our next issue is that this value is in bytes, not bits. So right now we KNOW how many bytes were transferred over 30 seconds (since we have the delta); what we NEED is how many bits per second.
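Going in that direction (from a delta you see in the GUI back to a rate) is just a multiply and a divide. Here is a quick Python sketch; the function name is made up, and the 30 second default simply matches the collection interval used in this post.

```python
def delta_to_bits_per_second(delta_bytes, collection_interval_seconds=30):
    """Convert a byte delta over one collection interval to bits per second."""
    bits = delta_bytes * 8                     # bytes -> bits
    return bits / collection_interval_seconds  # per interval -> per second


# A sample delta of 3,750,000 bytes over 30 seconds:
print(delta_to_bits_per_second(3_750_000))  # 1000000.0, i.e. 1 Megabit per second
```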

So we need to work out what 1 Megabit per second is in bytes per 30 seconds (30 seconds because that is our collection interval; if your collection interval was different you would need to change your calculations accordingly).

1 Bit * 1000 = 1 Kilobit
1 Kilobit (1000 Bits) * 1000 = 1 Megabit

So 1 Megabit = 1000000 bits

Next, we divide this by 8 to convert 1 megabit per second into bytes per second:

125000

So 125000 bytes per second, so now we just need to times this by our delta interval, which is 30 seconds in our example since our collection interval is 30 seconds!

So 125000 x 30 = 3750000


So! Now we know 1 Megabit per second is equal to 3750000 bytes every 30 seconds, so we just times this by however much bandwidth we are looking to be alerted on. Let's say we wanted something huge like 8 gig: we would times 3750000 by 8000 (since there are 8000 megabits per second in 8 gig).

The result?:

30,000,000,000 bytes

Pretty huge number to be dealing with, but if you know the formula you will be OK!

Just remember: 1000000 bits per megabit, divide by 8, times by the collection interval in seconds, then times by however many megabits per second you're trying to measure (so for 1 gig, times 3750000 by 1000, for example).
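If you'd rather not do this by hand every time, the formula fits in a few lines of Python. This is just a sketch of the arithmetic above; the function name is hypothetical and it expects a whole number of megabits per second.

```python
def threshold_bytes(megabits_per_second, collection_interval_seconds=30):
    """Bytes per collection interval that correspond to a given bit rate."""
    bits_per_second = megabits_per_second * 1_000_000  # megabits -> bits
    bytes_per_second = bits_per_second // 8            # bits -> bytes
    return bytes_per_second * collection_interval_seconds


print(threshold_bytes(1))     # 3750000     (1 Megabit per second)
print(threshold_bytes(1000))  # 3750000000  (1 gig)
print(threshold_bytes(8000))  # 30000000000 (8 gig)
```

If your collection interval is not 30 seconds, pass it as the second argument, for example threshold_bytes(1000, 60).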

Wicked! Easy right?