« On Nukes | Main | Dot E »

Monday, April 25, 2011

Enhanced check_esxi_hardware.py for Nagios and pnp4nagios


Having spent a bit of time implementing Trond Hasle Amunsen's wonderful check_openmanage plugin for Nagios to monitor the Dell Windows and Linux servers at work, I came to wondering if the same was possible for our VMware ESXi boxes. I was monitoring them with the check_esxi_hardware.py plugin, maintained by Claudio Kuenzler. That, unfortunately, didn't collect performance data and lacked the clever html links to Dell documentation found in check_openmanage.

So, I got to work, emulating some of check_openmanage's features.

The features I collect performance data for are those found on our ESXi boxes, Dell M600, R815, and R905 models.

M600

Power consumption

System board ambient temperature

R815

Power consumption

System board fan speeds

System board ambient temperature

System Internal Expansion Board 1 IO1 Planar Temp

System Internal Expansion Board 1 IO2 Planar Temp

Power supply voltages and currents

R905

Power consumption

System board fan speeds

System board ambient temperature

Power supply voltages and currents

I've also created a check_esxi_hardware.php template for pnp4nagios.

They're here in human-readable form:

check_esxi_hardware.py.html

check_esxi_hardware.php.html

Or download check_esxi_hardware.zip

check_esxi_hardware.py (not formatted as html)

check_esxi_hardware.php (not formatted as html)

Update, April 28th:

Now includes:

Indentation of the verbose output

Support for the HP Proliant BL460c, and, drum roll....

Proper parameter handling, which gracefully fails back to the original commandline format:

  usage: check_esxi_hardware.py https://hostname user password system [verbose]
  example: check_esxi_hardware.py https://my-shiny-new-vmware-server root fakepassword dell
or, using new style options:
  usage: check_esxi_hardware.py -H hostname -U username -P password [-V system -v -p -I XX]
  example: check_esxi_hardware.py -H my-shiny-new-vmware-server -U root -P fakepassword -V auto -I uk
or, verbosely:
  usage: check_esxi_hardware.py --host=hostname --user=username --pass=password [--vendor=system --verbose --perfdata --html=XX]

The hardware vendor string defaults to unknown, which is treated the same as ibm. intel has a slight quirk with BIOS identification. dell is similar to the previous cases, but also allows html links to product documentation and warranty information. hp have their own CIM return values to handle, so they are a special case. But the best of all is auto, which determines the vendor (if it can), from the Manufacturer information from CIM.

That's it for now, I consider it stable enough for production.

One improvement would be better handling of CIM numeric sensor names we haven't encountered yet. That should be possible with a bit of thoughtful regular expression wizardry, but I'm going to pass on that for the forseeable future.

Update, April 29th:

Rewritten perfdata code should now do something sensible on any vendor's hardware.

By peeking at the CIM UnitType attribute, I now correctly handle HP's Virtual Fan (or anyone else's) speed as a percentage, and can distinguish between power consumption (Watts) and current (Amps) automatically.

Mopping up of any quirky sensor name formatting can be done in check_esxi_hardware.php

Update, May 3rd:

Minor bug fixes, code reorganisation, and sorted performance data.

Performance data is now sorted by sensor name within sensor categories in the following order: Power, Voltage, Current, Temperature, Fan Speed, and (Virtual) Fan percentage.

A major side effect of these changes is that the sensor data previously created by check_esxi_hardware.py in /usr/local/pnp4nagios/var/perfdata is not compatible with my new code, and will have to be erased.

Update, May 4th:

More fixes:

Minor code changes and documentation improvements

Remove redundant mismatched ' character in performance data output

Output non-integral values for all sensors to fix problem seen with system board voltage sensors on an IBM server (thanks to Attilio Drei for the sample output)

Update, May 5th:

Added --no-power, --no-volts, --no-current, --no-temp, and --no-fan options to suppress performance data output by category

A few minor optimisations

Update, May 6th:

Added -t / --timeout parameter, ensuring it doesn't run on Windows (it works in Cygwin, though)

Made the new file:passwordfile option work for old-style command lines too

Update, May 7th:

On error, include the numeric sensor value in output

Example from this morning, aircon fail in one of our datacentres:

 

Things got rather hot, and the system fans all went into overdrive:

 

Power consumption on the few boxes and blade chassis' I looked at increased by 20 to 25 percent above normal.

Update, April 2nd, 2012

I've updated check_esxi_hardware.py to fix Dell warranty links (when you click on the displayed Tag No) to point to the new Dell Support site.


Posted by Phil at 10:51 PM
Edited on: Wednesday, April 01, 2015 10:49 PM
Categories: IT, Waffle