
Monday, April 29, 2013

Migrating Nagios Configuration from Nagmin to Check_MK's WATO


When I first set up a Nagios server, many years ago, in the days of Nagios 1.x, the best configuration tool I could find was Fred Reimers' Nagmin. That has since turned into abandonware, but there is a fork, NagminV, under development.

I'd patched Nagmin to support Nagios 2.x and 3.x, and added a few fields to its database, but it was still buggy and quirky.

So, after I'd installed Mathias Kettner's Check_MK for its livestatus broker module for use with PNP4Nagios, I started investigating Check_MK's broader features.

What soon caught my eye was Check_MK's use of rulesets based on host tags. The temptation of editing text files (python scripts in disguise) was too great for me, so I started converting my Nagios service checks into Check_MK format.

These are very rough notes; proceed with caution. Back up everything first!!!

First thing to do was get Check_MK's agent on all our hosts.

Then, to list all of them in /etc/check_mk/main.mk:

# don't generate host config yet
# comment this out when Nagmin is decommissioned
generate_hostconf = False
all_hosts = [
  'host1',
  'host2',
]

Then I added the obvious tags, win, linux, etc, and created config files for legacy checks in /etc/check_mk/conf.d, until eventually all service checks were defined in Check_MK and not in Nagmin.
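To picture how tags pick out hosts, here is a minimal sketch in plain Python. It assumes the classic all_hosts entry form 'hostname|tag1|tag2'; the helper hosts_with_tags and the sample hosts are hypothetical, and Check_MK does this matching itself when it evaluates its rules.

```python
# Hypothetical sample entries in the 'hostname|tag|tag' form.
all_hosts = ["host1|win", "host2|linux|web"]

def hosts_with_tags(entries, required):
    """Return hostnames whose tag list contains every tag in required."""
    selected = []
    for entry in entries:
        name, *tags = entry.split("|")
        if set(required) <= set(tags):
            selected.append(name)
    return selected
```

A check defined for the tag list ['linux'] would then apply to host2 only.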

After each check was migrated to Check_MK, I'd run

check_mk -U
nagios -v /etc/nagios/nagios.cfg

The first command generates a new Nagios config in /etc/nagios/check_mk.d/check_mk_objects.cfg

The second validates the resulting config. Here I'd find duplicate definitions, reminding me to delete them from Nagmin.
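Since these two commands get run after every migrated check, a small wrapper saves typing. This is only a sketch: the function name regenerate_and_validate and its injectable run parameter are my own invention, while the commands and path are the ones quoted above.

```python
import subprocess

# The two commands from the text, in the order they must run.
STEPS = [
    ["check_mk", "-U"],                          # regenerate check_mk_objects.cfg
    ["nagios", "-v", "/etc/nagios/nagios.cfg"],  # validate the combined config
]

def regenerate_and_validate(steps=STEPS, run=subprocess.call):
    """Run each step in order; return the first command that fails, or None."""
    for command in steps:
        if run(command) != 0:
            return command
    return None
```

A return value of None means both steps exited with status 0; anything else is the command to look at.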

So, after a while plodding away at this - in my case this meant over a year of coexistence - all that was left in Nagmin was hosts, host groups, contacts, contact groups, and timeperiods.

Time to abandon Nagmin and get serious.

Do not, under any circumstances, try to generate a Nagios config using Nagmin again. It will always fail!

Comment out the Command.cfg include in nagios.cfg

#cfg_file=/etc/nagios/Command.cfg

Try validating the Nagios config again, with nagios -v. If it fails, you've forgotten to define some commands as legacy checks.

Rinse, repeat until you've got it right.

Check the contents of /etc/nagios/Services.cfg - if it contains any service definitions, you've forgotten something.
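Rather than eyeballing the file, the leftover definitions can be counted. The following is a rough sketch, assuming the usual Nagios "define service {" object syntax; count_definitions is a hypothetical helper, not part of Nagios or Check_MK.

```python
import re

# Matches the start of a Nagios object definition such as "define service {".
DEFINE_RE = re.compile(r"^\s*define\s+(\w+)\s*\{?", re.IGNORECASE)

def count_definitions(config_text, object_type="service"):
    """Count 'define <object_type>' blocks, ignoring commented-out lines."""
    count = 0
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or stripped.startswith(";"):
            continue
        match = DEFINE_RE.match(stripped)
        if match and match.group(1).lower() == object_type:
            count += 1
    return count
```

Feeding it the text of /etc/nagios/Services.cfg should give 0 once every service has been moved across.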

Do the same with all the service-related .cfg files; if things fail, change your config to use Check_MK's default service templates.

Get your custom time periods into WATO, then comment out TimePeriods.cfg in nagios.cfg and revalidate. Note that check_mk has a hidden default timeperiod named 24X7, whereas Nagmin's is 24x7, and case matters.

And so on till you're left only with hosts, host groups, contacts, and contact groups from the original Nagmin.

So now was the time to take the leap and use Check_MK's WATO to configure nagios.

I created contacts and contact groups manually in WATO. There's no conflict with your existing Nagios config until you associate a contact group with hosts and/or services in WATO. That was the last thing I did.

Now comes the fun bit.

We need to import our hosts into WATO.

Run the attached hosts.py which will generate a list of hosts in the format hostname;alias;parents;ipaddress

python hosts.py >wato.csv

Our import script requires a file containing

wato folder;hostname;alias;parents;ipaddress;tags

where tags is a list of check_mk tags separated by |

So, we'll have to manually edit it. I didn't use WATO folders so just preceded each line with a ;

Tags are the fun bits. By now you may have been using them in your check_mk config. WATO comes with some predefined ones, which we need to list in our import file (wato.csv).

You'll need one each of the 'agent', 'criticality', and 'networking' tags, as well as your custom tags, which should also be entered into WATO.
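The manual edit amounts to prefixing a folder field and appending a tag field to each line. A minimal sketch, in which to_import_line is a hypothetical helper and the tag names in the usage note are only examples to be replaced with the ones defined in your own WATO:

```python
def to_import_line(hosts_line, tags, folder=""):
    """Turn 'hostname;alias;parents;ipaddress' into
    'folder;hostname;alias;parents;ipaddress;tag1|tag2'."""
    fields = hosts_line.strip().split(";")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), hosts_line))
    return ";".join([folder] + fields + ["|".join(tags)])
```

For example, to_import_line("web1;Web server;switch1;192.0.2.10", ["cmk-agent", "prod", "lan", "linux"]) produces a line starting with a bare ; because no WATO folder is used.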

Download and edit my_wato_import.py adding your tag definitions into tagz, run:

python my_wato_import.py wato.csv

and watch the output scroll by. If there are any errors reported, then you've forgotten to add a tag definition to tagz, or misspelt a tag in wato.csv. Rinse and repeat until all's well.
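Misspelt tags can also be caught before the import is run. This sketch is independent of my_wato_import.py: unknown_tags and the known_tags set are hypothetical stand-ins for whatever you have listed in tagz.

```python
def unknown_tags(import_lines, known_tags):
    """Return {hostname: [tags not in known_tags]} for lines in import format."""
    problems = {}
    for line in import_lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Import format: folder;hostname;alias;parents;ipaddress;tags
        hostname, tag_field = fields[1], fields[-1]
        bad = [t for t in tag_field.split("|") if t and t not in known_tags]
        if bad:
            problems[hostname] = bad
    return problems
```

An empty result means every tag in wato.csv has a matching definition.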

Set testing = False near the top of the script, and run again.

Comment out generate_hostconf = False in your main.mk, and run check_mk -U

Comment out Host.cfg in /etc/nagios/nagios.cfg

Validate your nagios configuration. It should be OK.

A look in WATO will show all your hosts, with appropriate tags.

Hidden away in /usr/share/doc/check_mk/treasures is a script wato_host_svc_groups.py

I ran this against my original Nagmin Hosts.cfg file, and its output was easily massaged into the form needed for WATO. I could have amended the script to produce output of the following form directly, but a few regular-expression search-and-replace passes got me there quickly enough.

host_groups = [
( 'group1', [], ['server1', 'server2']),
( 'group2', [], ['server3', 'server4']),
]

Place that code into /etc/check_mk/conf.d/wato/rules.mk and create the host groups (with descriptions) in WATO. Before applying changes in WATO, run check_mk -U
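For anyone who would rather generate the host_groups list than massage it by hand, here is a rough sketch. It assumes each host block in Hosts.cfg carries host_name and hostgroups directives, which may not match your file; host_groups_from_config is a hypothetical helper and is not the treasures script.

```python
import re
from collections import defaultdict

# Captures the body of each "define host { ... }" block (not hostgroup blocks).
BLOCK_RE = re.compile(r"define\s+host\s*\{(.*?)\}", re.DOTALL | re.IGNORECASE)

def host_groups_from_config(config_text):
    """Build [(group, [], [hosts...]), ...] from 'define host' blocks."""
    members = defaultdict(list)
    for body in BLOCK_RE.findall(config_text):
        directives = {}
        for line in body.splitlines():
            parts = line.split(None, 1)
            if len(parts) == 2 and not parts[0].startswith(("#", ";")):
                directives[parts[0]] = parts[1].strip()
        host = directives.get("host_name")
        if not host:
            continue
        for group in directives.get("hostgroups", "").split(","):
            if group.strip():
                members[group.strip()].append(host)
    return [(group, [], hosts) for group, hosts in sorted(members.items())]
```

Printing "host_groups = " followed by repr() of the result gives text in the same shape as the list above.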

Comment out the HostGroup.cfg include in nagios.cfg, and validate config once more.

#cfg_file=/etc/nagios/HostGroup.cfg

Now, if you've done everything properly, the Nagios config validation will succeed, and on a restart of Nagios your host groups will be there as before.

That just leaves Contacts, Contact Groups and notifications.

I'll leave that as an exercise for the reader. Hint: don't try any of the above until you've figured out how to apply contact groups to hosts and services.

You'll also need to adjust host and service check intervals and retries in WATO, otherwise everything gets polled every minute, which probably isn't what you want.



Posted by Phil at 9:59 PM
Edited on: Wednesday, April 01, 2015 9:24 PM
Categories: IT, Software

Friday, May 18, 2012

Disabling Forefront for Exchange 2010 when Installing Exchange Service Packs and Hotfix Rollups


Belatedly installing Microsoft Exchange 2010 Service Pack 2 Hotfix Rollup 2 this week, I once again was niggled by the need to manually disable Forefront for Exchange first.

Unknown to me, the Microsoft Exchange designers included the right hooks into the product to make this easy.

A quick web search led to an Exchange Team blog post from back in June 2010, entitled "Sample script to disable and enable Forefront service during patching".

Unfortunately, their sample script leaves a lot to be desired, and isn't general enough to be useful everywhere.

So I've tweaked it into a more sensible form, which you can download here.

It should be placed in <Exchange installation folder>\Scripts\Customization

On my Exchange 2010 systems, that's C:\Program Files\Microsoft\Exchange Server\V14\Scripts\Customization

Create the Customization directory if it does not already exist.

Fixes:

1: If Forefront for Exchange isn't installed, do nothing.

2: If one of the listed services is not present on the box the script is run on, treat it as successfully started/stopped and continue.

3: Wait until the service is successfully stopped/started, or for 3 minutes, whichever happens first (easily modified to suit your environment and experiences). Based on a code snippet by andreister on StackOverflow.
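The waiting logic in fix 3 is a plain poll-with-deadline loop. The sketch below shows the idea in Python rather than PowerShell and is not the code from the script; wait_for and its parameters are hypothetical, with the 180 second default standing in for the 3 minute limit.

```python
import time

def wait_for(condition, timeout=180.0, interval=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is true or timeout seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The condition would be something like "the service now reports the wanted state"; clock and sleep are parameters only so the loop can be exercised without real waiting.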

The end result is that the same copy of the script can be deployed to all Exchange 2010 servers in your organisation, and it just does the right thing.

Enjoy.

Postscript, May 30th

There's a bug in the original which rendered the script ineffective. The path to the FSCController executable, retrieved in line 62 of the script, is enclosed in double quotes. These need to be stripped off for the script to do the right thing.

Adding

    $imagePath = $imagePath -replace '"(.*)"', '$1'

after line 65 strips the offending " characters. Fixed, properly tested in an Exchange 2010 SP2 RU3 install.
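For readers more at home in Python, the same substitution looks like this. It is only an illustration of what the regular expression does; strip_enclosing_quotes is a hypothetical name and the path in the usage note is made up.

```python
import re

def strip_enclosing_quotes(image_path):
    """Mirror the -replace above: drop a pair of double quotes around the value."""
    return re.sub(r'"(.*)"', r"\1", image_path)
```

So strip_enclosing_quotes('"C:\\Example Dir\\FSCController.exe"') comes back without the surrounding quotes, and a value with no quotes is returned unchanged.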



Posted by Phil at 11:20 PM
Edited on: Wednesday, April 01, 2015 10:52 PM
Categories: IT

Monday, May 14, 2012

[Updated] Compiling an RPMForge-compatible Nagios 3.5.0 RPM


At some point I had the brilliant idea of replacing our hand-compiled build of Nagios with RPMForge's RPM version.

All was well with that, but RPMForge is still stuck at version 3.2.3, and Nagios 3.4.1 has just been released.

There's one patch from Icinga which we really want, by Icinga core developer Michael Friedrich (@dnsmichi), which fixes perfdata issues (a regression in nagios 3.3.1 which is still not fixed in 3.4.1):

re-allow perfdata with empty results being put on perfdata channel, disable via opt-in cfg option

Updated, Sept 5th, 2012:

Nagios users have reported memory leaks if embedded perl is compiled in, even when not used, so I've removed it from the configuration in the nagios.spec file.

In this build, I also re-implement execv (bug 346) using my fixed regexp. Double quotes and escape characters are no longer an issue. This is based on the changes to checks.c in Icinga to implement execvp.

I've also fixed a problem with pagination in pages generated by a hostname search in the Nagios 3.4.1 status.cgi.

Fixed too is a sorting issue on paginated pages (bug 381).

Also included are cvelasco's patches for a problem with scheduled downtimes (bug 338) and some memory leaks in 3.4.1 (bug 339).

So, to build a Nagios 3.4.1 RPMForge-layout compatible RPM on CentOS 5.x, I did the following:

1: Download nagios-3.2.3-3.rf.src.rpm and install (you'll need the --nomd5 switch in rpm).

2: Download nagios-3.4.1.tar.gz into /usr/src/redhat/SOURCES

3: Download my perfdata.patch into /usr/src/redhat/SOURCES

4: Download my execv-v2.patch into /usr/src/redhat/SOURCES

5: Download my status.patch into /usr/src/redhat/SOURCES (this will be in Nagios 3.4.2)

6: Download my status-paginate.patch into /usr/src/redhat/SOURCES

7: Download the downfix-6.patch into /usr/src/redhat/SOURCES

8: Download the leaks1-2.patch into /usr/src/redhat/SOURCES

9: Download my nagios.spec into /usr/src/redhat/SPECS

10: rpmbuild -bb /usr/src/redhat/SPECS/nagios.spec

The built RPMs will be left in /usr/src/redhat/RPMS/i386.

They'll happily install over the old RPMForge Nagios 3.2.3 RPMS.

Enjoy.

Postscript, July 18th

The perfdata patch has been checked in to Nagios by the developers. Let's hope the execv one follows.

Postscript, November 13th

The forthcoming Nagios 3.4.3 includes all but the execv patch, so a slight mod to the spec file to change the version to 3.4.3 and include only that patch (or not, as you choose) is all that's needed to get 3.4.3 built in RPMForge-compatible style.

Postscript, May 8th, 2013

Now updated for Nagios 3.5.0. I've left the above instructions intact for historical reasons.

Apart from my execv patch, there are two additional patches, from the Open Monitoring Distribution team. Were I to build a Nagios server today, I'd use OMD.

The build instructions now are:

1: Download nagios-3.2.3-3.rf.src.rpm and install (you'll need the --nomd5 switch in rpm).

2: Download nagios-3.5.0.tar.gz into /usr/src/redhat/SOURCES

3: Download my execv-v2.patch into /usr/src/redhat/SOURCES

4: Download 0006-fix_f5_reload_bug.dif into /usr/src/redhat/SOURCES

This fixes an annoying screen refresh bug in the Nagios web interface.

5: Download 0007-fix_downtime_struct.dif into /usr/src/redhat/SOURCES

This reverts a Nagios API change which was incompatible with check_mk, which manifested itself as crashes at midnight during Nagios log rotation, and maybe at other times too.

6: Download my nagios.spec into /usr/src/redhat/SPECS

7: rpmbuild -bb /usr/src/redhat/SPECS/nagios.spec

The built RPMs will be left in /usr/src/redhat/RPMS/i386.
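Before kicking off rpmbuild it can help to confirm that every download landed in the right place. A small sketch using the file names from the 3.5.0 steps above; missing_build_inputs is a hypothetical helper, not part of rpm.

```python
import os

# Files the 3.5.0 build steps place in SOURCES.
SOURCES = [
    "nagios-3.5.0.tar.gz",
    "execv-v2.patch",
    "0006-fix_f5_reload_bug.dif",
    "0007-fix_downtime_struct.dif",
]

def missing_build_inputs(topdir="/usr/src/redhat", sources=SOURCES, spec="nagios.spec"):
    """Return the paths of expected build inputs that do not exist yet."""
    wanted = [os.path.join(topdir, "SOURCES", name) for name in sources]
    wanted.append(os.path.join(topdir, "SPECS", spec))
    return [path for path in wanted if not os.path.isfile(path)]
```

An empty list means the spec file and all four sources are present.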

Enjoy.



Posted by Phil at 5:24 PM
Edited on: Wednesday, April 01, 2015 10:52 PM
Categories: IT

Saturday, March 10, 2012

One vCheck Plugin to Rule Them All


Alan Renouf has recently updated his fabulous vCheck PowerShell script to support a plugin architecture, with one (sometimes more) checks per plugin. You can disable the plugins by renaming them manually, but that quickly becomes a hassle.

Here's my solution, Select-Plugins.ps1, a GUI picklist from which enabling/disabling plugins is no longer such a chore.

It requires vCheck 6.10 or later.

It can be copied into your vCheck directory and invoked from there, or copied into your Plugins directory and renamed to be the last plugin to run. vCheck has already loaded its list of plugins before any are run, so using it as the first plugin would not have the results you expect.

To go with it, there's a "Report on Plugins" Plugin too, which I've sent to Alan. It's not of much interest unless you're disabling plugins with Select-Plugins.ps1, or want a list of the plugins used in each run of vCheck.

Enjoy!

Postscript:

Updated to use a conditional expression to create 'new' filename. Much more elegant.

Postscript 2, March 19, 2012:

Updated to detect the situation where both pluginname.ps1 and pluginname.ps1.disabled exist, and to warn user without deleting anything.


# Select-Plugins.ps1

# selectively enable / disable vCheck Plugins

# presents a list of plugins whose names match *.ps1 or *.ps1.disabled
# 
# disabled plugins will be renamed as appropriate to <pluginname>.ps1.disabled
# enabled plugins will be renamed as appropriate to <pluginname>.ps1

# To use, run from the vCheck directory
#     or, if you wish to be perverse, copy to the plugins directory and rename to 
#         "ZZ Select Plugins for Next Run.ps1" and run vCheck as normal.

# Great for testing plugins.  When done, untick it...

# If run as a plugin, it will affect the next vCheck run, not the current one,
#   as vCheck has already collected its list of plugins when it is invoked
#   so make it the very last plugin executed to avoid counter-intuitive behaviour

# based on code from Select-GraphicalFilteredObject.ps1 in
#  "Windows Powershell Cookbook" by Lee Holmes.
#  Copyright 2007 Lee Holmes.
#  Published by O'Reilly ISBN 978-0-596-528492
# and used under the 'free use' provisions specified on Preface page xxv

$Title = "Plugin Selection Plugin"
$Author = "Phil Randal"
$PluginVersion = 2.0
$Header =  "Plugin Selection"
$Comments = "Plugin Selection"
$Display = "None"
# Start of Settings
# End of Settings

$PluginPath = (Split-Path ((Get-Variable MyInvocation).Value).MyCommand.Path)
If ($PluginPath -notmatch 'plugins$') {
  $PluginPath += "\Plugins"
}

$plugins=get-childitem -Path $PluginPath |
  where {$_.name -match '.*\.ps1(?:\.disabled|)$'} |
  Sort Name |
  Select Name,
    @{Label="Plugin";expression={$_.Name -replace '(.*)\.ps1(?:\.disabled|)$', '$1'}},
    @{Label="Enabled";expression={$_.Name -notmatch '.*\.disabled$'}}

## Load the Windows Forms assembly
[void] [Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms")

## Create the main form
$form = New-Object Windows.Forms.Form
$form.Size = New-Object Drawing.Size @(600,600)

## Create the listbox to hold the items from the pipeline
$listbox = New-Object Windows.Forms.CheckedListBox
$listbox.CheckOnClick = $true
$listbox.Dock = "Fill"
$form.Text = "Select the plugins you wish to enable"

# create list box items from plugin list, tick as enabled where appropriate
ForEach ($plugin in $Plugins) {
  $i=$listBox.Items.Add($plugin.Plugin)
  $listbox.SetItemChecked($i, $Plugin.Enabled)
}

## Create the button panel to hold the OK and Cancel buttons
$buttonPanel = New-Object Windows.Forms.Panel
$buttonPanel.Size = New-Object Drawing.Size @(600,30)
$buttonPanel.Dock = "Bottom"

## Create the Cancel button, which will anchor to the bottom right
$cancelButton = New-Object Windows.Forms.Button
$cancelButton.Text = "Cancel"
$cancelButton.DialogResult = "Cancel"
$cancelButton.Top = $buttonPanel.Height - $cancelButton.Height - 5
$cancelButton.Left = $buttonPanel.Width - $cancelButton.Width - 10
$cancelButton.Anchor = "Right"

## Create the OK button, which will anchor to the left of Cancel
$okButton = New-Object Windows.Forms.Button
$okButton.Text = "Ok"
$okButton.DialogResult = "Ok"
$okButton.Top = $cancelButton.Top
$okButton.Left = $cancelButton.Left - $okButton.Width - 5
$okButton.Anchor = "Right"

## Add the buttons to the button panel
$buttonPanel.Controls.Add($okButton)
$buttonPanel.Controls.Add($cancelButton)

## Add the button panel and list box to the form, and also set
## the actions for the buttons
$form.Controls.Add($listBox)
$form.Controls.Add($buttonPanel)
$form.AcceptButton = $okButton
$form.CancelButton = $cancelButton
$form.Add_Shown( { $form.Activate() } )

## Show the form, and wait for the response
$result = $form.ShowDialog()

## If they pressed OK (or Enter,)
## enumerate list of plugins and rename those whose status has changed
if($result -eq "OK") {
  $i = 0
  ForEach ($plugin in $plugins) {
    $oldname = $plugin.Name
    $newname = $plugin.Plugin + $(If ($listbox.GetItemChecked($i)) {'.ps1'} else {'.ps1.disabled'})
    If ($newname -ne $oldname) {
      If (Test-Path ($PluginPath + "\" + $newname)) {
        Write-Host "Attempting to rename ""$oldname"" to ""$newname"", which already exists - please delete or rename the superfluous file and try again"
      } Else {
        Rename-Item ($PluginPath + "\" + $oldname) $newname
      }
    }
    $i++
  }
}
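The rename decision at the end of the script can be restated without the GUI. This is a sketch in Python of the same logic, not a replacement for the script; plan_renames and its arguments are hypothetical names.

```python
import re

# Same naming rule as the script: <pluginname>.ps1 or <pluginname>.ps1.disabled
PLUGIN_RE = re.compile(r"^(.*)\.ps1(?:\.disabled)?$")

def plan_renames(filenames, enabled):
    """Return (renames, conflicts) for a desired enabled-state mapping.

    filenames: names found in the Plugins directory.
    enabled:   {plugin base name: True/False} as ticked in the list.
    """
    existing = set(filenames)
    renames, conflicts = [], []
    for old in sorted(filenames):
        match = PLUGIN_RE.match(old)
        if not match:
            continue  # not a plugin file
        base = match.group(1)
        new = base + (".ps1" if enabled.get(base, True) else ".ps1.disabled")
        if new == old:
            continue  # state unchanged
        if new in existing:
            conflicts.append((old, new))  # warn, never overwrite
        else:
            renames.append((old, new))
    return renames, conflicts
```

Conflicts are the cases where both pluginname.ps1 and pluginname.ps1.disabled exist, which the script reports without touching either file.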



Posted by Phil at 4:32 PM
Edited on: Wednesday, April 01, 2015 10:51 PM
Categories: IT

Monday, April 25, 2011

Enhanced check_esxi_hardware.py for Nagios and pnp4nagios


Having spent a bit of time implementing Trond Hasle Amundsen's wonderful check_openmanage plugin for Nagios to monitor the Dell Windows and Linux servers at work, I came to wondering if the same was possible for our VMware ESXi boxes. I was monitoring them with the check_esxi_hardware.py plugin, maintained by Claudio Kuenzler. That, unfortunately, didn't collect performance data and lacked the clever html links to Dell documentation found in check_openmanage.

So, I got to work, emulating some of check_openmanage's features.

The features I collect performance data for are those found on our ESXi boxes, Dell M600, R815, and R905 models.

M600

Power consumption

System board ambient temperature

R815

Power consumption

System board fan speeds

System board ambient temperature

System Internal Expansion Board 1 IO1 Planar Temp

System Internal Expansion Board 1 IO2 Planar Temp

Power supply voltages and currents

R905

Power consumption

System board fan speeds

System board ambient temperature

Power supply voltages and currents

I've also created a check_esxi_hardware.php template for pnp4nagios.

They're here in human-readable form:

check_esxi_hardware.py.html

check_esxi_hardware.php.html

Or download check_esxi_hardware.zip

check_esxi_hardware.py (not formatted as html)

check_esxi_hardware.php (not formatted as html)

Update, April 28th:

Now includes:

Indentation of the verbose output

Support for the HP Proliant BL460c, and, drum roll....

Proper parameter handling, which gracefully falls back to the original command-line format:

  usage: check_esxi_hardware.py https://hostname user password system [verbose]
  example: check_esxi_hardware.py https://my-shiny-new-vmware-server root fakepassword dell
or, using new style options:
  usage: check_esxi_hardware.py -H hostname -U username -P password [-V system -v -p -I XX]
  example: check_esxi_hardware.py -H my-shiny-new-vmware-server -U root -P fakepassword -V auto -I uk
or, verbosely:
  usage: check_esxi_hardware.py --host=hostname --user=username --pass=password [--vendor=system --verbose --perfdata --html=XX]
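The fallback can be pictured as "if the first argument isn't an option, treat the line as old style". The sketch below uses argparse and is not the plugin's actual parsing code; parse_args, the Namespace field names and the default values are assumptions, while the option letters and long names are the ones in the usage lines above.

```python
import argparse

OLD_USAGE = "usage: check_esxi_hardware.py https://hostname user password system [verbose]"

def parse_args(argv):
    """Parse new-style options, falling back to the old positional form."""
    if argv and not argv[0].startswith("-"):
        # Old style: https://hostname user password system [verbose]
        if len(argv) < 4:
            raise SystemExit(OLD_USAGE)
        return argparse.Namespace(
            host=argv[0], user=argv[1], password=argv[2], vendor=argv[3],
            verbose=len(argv) > 4, perfdata=False, html=None)
    parser = argparse.ArgumentParser(prog="check_esxi_hardware.py")
    parser.add_argument("-H", "--host", required=True)
    parser.add_argument("-U", "--user", required=True)
    parser.add_argument("-P", "--pass", dest="password", required=True)
    parser.add_argument("-V", "--vendor", default="unknown")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-p", "--perfdata", action="store_true")
    parser.add_argument("-I", "--html", default=None)
    return parser.parse_args(argv)
```

Both parse_args(["https://my-shiny-new-vmware-server", "root", "fakepassword", "dell"]) and the -H/-U/-P form end up in the same Namespace shape, so the rest of a script need not care which style was used.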

The hardware vendor string defaults to unknown, which is treated the same as ibm. intel has a slight quirk with BIOS identification. dell is similar to the previous cases, but also allows html links to product documentation and warranty information. hp have their own CIM return values to handle, so they are a special case. But the best of all is auto, which determines the vendor (if it can), from the Manufacturer information from CIM.
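As an illustration of the auto setting, vendor detection boils down to matching the Manufacturer string against a few known names. The patterns below are guesses at typical manufacturer strings, and detect_vendor is a hypothetical function rather than the plugin's code.

```python
import re

# Assumed patterns; real CIM Manufacturer strings may differ.
VENDOR_PATTERNS = [
    ("dell", re.compile(r"\bdell\b", re.IGNORECASE)),
    ("hp", re.compile(r"\bhp\b|hewlett", re.IGNORECASE)),
    ("ibm", re.compile(r"\bibm\b", re.IGNORECASE)),
    ("intel", re.compile(r"\bintel\b", re.IGNORECASE)),
]

def detect_vendor(manufacturer):
    """Map a Manufacturer string to a vendor keyword; 'unknown' if no match."""
    text = manufacturer or ""
    for vendor, pattern in VENDOR_PATTERNS:
        if pattern.search(text):
            return vendor
    return "unknown"
```

Anything unrecognised falls through to unknown, which matches the default described above.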

That's it for now, I consider it stable enough for production.

One improvement would be better handling of CIM numeric sensor names we haven't encountered yet. That should be possible with a bit of thoughtful regular expression wizardry, but I'm going to pass on that for the foreseeable future.

Update, April 29th:

Rewritten perfdata code should now do something sensible on any vendor's hardware.

By peeking at the CIM UnitType attribute, I now correctly handle HP's Virtual Fan (or anyone else's) speed as a percentage, and can distinguish between power consumption (Watts) and current (Amps) automatically.

Mopping up of any quirky sensor name formatting can be done in check_esxi_hardware.php

Update, May 3rd:

Minor bug fixes, code reorganisation, and sorted performance data.

Performance data is now sorted by sensor name within sensor categories in the following order: Power, Voltage, Current, Temperature, Fan Speed, and (Virtual) Fan percentage.
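That ordering is a two-part sort key: category rank first, sensor name second. A minimal sketch, with made-up category labels and a hypothetical sort_perfdata helper (the plugin's own data structures are not shown here):

```python
# Category order as described above; the label strings are assumptions.
CATEGORY_ORDER = ["Power", "Voltage", "Current", "Temperature",
                  "Fan Speed", "Fan Percentage"]

def sort_perfdata(readings):
    """Sort (category, sensor_name, value) tuples by category order, then name."""
    rank = {name: i for i, name in enumerate(CATEGORY_ORDER)}
    return sorted(readings, key=lambda r: (rank[r[0]], r[1]))
```

Keeping the order stable from run to run matters for pnp4nagios, which is why changing it invalidated the previously collected data.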

A major side effect of these changes is that the sensor data previously created by check_esxi_hardware.py in /usr/local/pnp4nagios/var/perfdata is not compatible with my new code, and will have to be erased.

Update, May 4th:

More fixes:

Minor code changes and documentation improvements

Remove redundant mismatched ' character in performance data output

Output non-integral values for all sensors to fix problem seen with system board voltage sensors on an IBM server (thanks to Attilio Drei for the sample output)

Update, May 5th:

Added --no-power, --no-volts, --no-current, --no-temp, and --no-fan options to suppress performance data output by category

A few minor optimisations

Update, May 6th:

Added -t / --timeout parameter, ensuring it doesn't run on Windows (it works in Cygwin, though)

Made the new file:passwordfile option work for old-style command lines too

Update, May 7th:

On error, include the numeric sensor value in output

Example from this morning, aircon fail in one of our datacentres:

 

Things got rather hot, and the system fans all went into overdrive:

 

Power consumption on the few boxes and blade chassis I looked at increased by 20 to 25 percent above normal.

Update, April 2nd, 2012

I've updated check_esxi_hardware.py to fix Dell warranty links (when you click on the displayed Tag No) to point to the new Dell Support site.



Posted by Phil at 10:51 PM
Edited on: Wednesday, April 01, 2015 10:49 PM
Categories: IT, Waffle