ESX CPU Monitoring – a hefty task

As virtualisation becomes a larger and larger part of the industry, so does the need to manage and monitor your infrastructure properly. Working at one of the pioneers of dedicated virtual hosting in Australia has its benefits: the team here has been tuning and improving a number of elements over time, and some things are only learnt by experience in the industry.

As the VM fleet grows, so does the supporting infrastructure. We have had a few problems recently monitoring the underlying ESX CPU usage. Via the VMware SOAP API we have had pretty good visibility of guest CPU/MEM usage, and we have tied this in really well with Netsaint (I know, I know, netsaint != nagios). Since I came along I've kept an eye on the REAL CPU values provided by the API, and I always felt something was off: we were hitting 90-95% REAL CPU, but when we manually checked host usage via VIC (the VMware Infrastructure Client) the figures were much lower.
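
For context, the guest and host figures come over the SOAP API easily enough – something like the following sketch with the VI Perl Toolkit (illustrative only, not our production check; quickStats is the rolled-up counter set the API exposes):

#!/usr/bin/perl
# Sketch: pull the API's rolled-up CPU figure for each host.
# This is the same style of rollup we later found unreliable.
use strict;
use warnings;
use VMware::VIRuntime;

Opts::parse();
Opts::validate();
Util::connect();

my $hosts = Vim::find_entity_views(view_type => 'HostSystem');
for my $host (@$hosts) {
    printf "%s: %d MHz overall CPU\n",
        $host->name, $host->summary->quickStats->overallCpuUsage;
}

Util::disconnect();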

I took on the task of determining what was going wrong – and, more importantly, how to get around it. This would reduce false alerts and increase sleep for the on-call engineer.

The first step was confirming that the API figures were not to be trusted. I concluded this when I built a new ESX host for our infrastructure and, during provisioning, alerting paged out claiming we had hit 99% REAL CPU. I confirmed immediately that this wasn't the case.

Now – how to monitor the figure accurately? Initially I wanted to snmpwalk a MIB that contained the physical CPU usage, but sadly, after much searching, I couldn't find anything that fit. I had to turn to scripting a solution. I discovered a file in ESX 3.5 that gave me some ideas:
/proc/vmware/sched/idle
This provides a breakdown of idlesec vs usedsec for each core – but it wasn't what I was after.
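
In principle you could still derive a busy percentage from two samples of that file – busy ≈ Δused / (Δused + Δidle), summed across the cores. A rough sketch, with the column positions assumed rather than checked:

#!/bin/sh
# Hypothetical – $2 = usedsec and $3 = idlesec are guesses at the
# column layout; check your own /proc/vmware/sched/idle output.
cat /proc/vmware/sched/idle > /tmp/idle.1
sleep 2
cat /proc/vmware/sched/idle > /tmp/idle.2
# busy% = delta(used) / (delta(used) + delta(idle)) over all cores
awk 'NR==FNR { used[FNR]=$2; idle[FNR]=$3; next }
     { du += $2 - used[FNR]; di += $3 - idle[FNR] }
     END { if (du+di > 0) printf "%.1f%% busy\n", 100*du/(du+di) }' /tmp/idle.1 /tmp/idle.2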

The solution came after finding another, similar fix for Nagios monitoring (irk – see above, still using Nagios). Essentially the trick was to pull the figure from 'esxtop', as this would give the most realistic and up-to-date figure available. To make this even easier, and to fit in with our current monitoring requirements, we decided to make the value available via an snmpwalk. The solution comes together from three elements:

1) A script to rip esxtop apart (the awk field number will change depending on core count):

#!/bin/sh
# One 2-second esxtop sample: awk keeps CSV field 24 (overall physical
# CPU – shifts with core count), tail drops the header, sed strips quotes.
/usr/bin/esxtop -d 2 -n 1 | /bin/awk -F "," '{print $24}' | /usr/bin/tail -1 | /bin/sed -e 's;";;g'
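
Since that field number moves around, a quick way to find the right column on a given host is to number the header fields and grep for the physical CPU counter (the 'Physical Cpu' label here is an assumption – adjust the pattern to whatever your header actually says):

/usr/bin/esxtop -d 2 -n 1 | /usr/bin/head -1 | /usr/bin/tr ',' '\n' | /bin/grep -n "Physical Cpu"

The line number grep prints is the field number to feed awk.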

2) An SNMP-walkable script that passes the output back to the snmpwalk command – in our case this took the form of a Perl script. Here are the key pieces:

#!/usr/bin/perl
use strict;
use warnings;

my $miboid  = 'snmp-walk-numbers-here';
my $command = '/home/user/bin/cpu-usage.sh';
my $return  = "Error: Cannot execute";

# Run the esxtop wrapper; keep the default error string if it fails
if (my $msg = qx($command)) {
    $return = $msg;
}

# Needs to fit in a UDP packet
$return = substr($return, 0, 1300);

print "$miboid\n";
print "string\n";
print "$return";

3) snmpd configured to point to the script above, so the value can be pulled with a standard snmpwalk.
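
The three-line output above (OID, type, value) is the format Net-SNMP's pass directive expects, so the hookup is presumably a one-liner in snmpd.conf. A sketch – the OID is a placeholder for the real numbers, and the script path is illustrative:

# /etc/snmp/snmpd.conf – hand our custom OID subtree to the Perl wrapper
pass .1.3.6.1.4.1.XXXXX /usr/bin/perl /home/user/bin/snmp-cpu.pl

The monitoring box then fetches it with something like:

snmpwalk -v 1 -c public esx-host .1.3.6.1.4.1.XXXXX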

So far this has served us well. We have compared the graphs and found the esxtop method produces far fewer spikes, which is great for our needs. It is easy to set up, and it puts very little load on the host to provide the result.
