Systems Monitoring

Overview

The main ISFS server at SCP, gully, is running the Open Monitoring Distribution (OMD). OMD is a software distribution which bundles NAGIOS with several NAGIOS extensions, including check_mk and NAGVIS.

SCP is using check_mk to generate and configure NAGIOS and all of the services which NAGIOS monitors. check_mk uses a simple shell script as the monitoring agent, and this script is deployed easily to the embedded DSMs.

The agent script is checked out in ~aster/isfs/projects/SCP/ISFF/omd/scripts. The install_check_mk script in that directory can be used to install the agent script to any and all of the DSM hosts.

For more background on this prototype installation, see JIRA issue ISFS-73.

Update: Rather than returning a single status check for all of the netcdf statistics, each variable now has its own check in NAGIOS. Besides allowing individual notifications, this means the measurements can be seen in the graphs. From the main check_mk overview page, enter variable names like Ifan or RH in the search box to see those variables. Hover over the graph icon to see that graph, or click on it to see all the graphs of that variable over different time periods.

The problem with this approach is that if a whole station goes down, it causes dozens of notifications. I'm not sure how best to scale the checks and notifications. The goal is to balance the ability to drill down to a single sensor problem against detecting higher-level problems like an interruption in the DSM or network.

To see the variables for one particular station, e.g., Ap10, search for '_Ap10'. Searching for 'Ap10' matches a host name exactly, so only the checks for that host are shown. All the variable checks belong to host 'localhost'.

OMD Web Server

Access to the OMD web pages is restricted by default. Use the default login for now: user omdadmin, password omd.

The OMD web server is on port 5000 on gully. So at the site, here is the URL for the check_mk main overview page:

http://gully:5000/scpmonitor/check_mk

For off-site access, there is an ssh tunnel from port 31080 on eol-rt-data to port 5000 on gully. The eol-rt-data port is only accessible from UCAR hosts or through VPN. Here is the off-site URL for the same page:

http://eol-rt-data.fl-ext.ucar.edu:31080/scpmonitor/check_mk

The rest of the URLs on this page use the eol-rt-data URL.
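Because the on-site and off-site URLs differ only in host and port, one can be derived from the other. The following is a minimal sketch of that mapping in Python; the function names are hypothetical and are not part of OMD or of the SCP setup.

```python
from urllib.parse import urlsplit, urlunsplit

# Host:port pairs taken from the two URLs above.
ONSITE_NETLOC = "gully:5000"
OFFSITE_NETLOC = "eol-rt-data.fl-ext.ucar.edu:31080"


def to_offsite(url):
    """Rewrite an on-site OMD URL to use the eol-rt-data tunnel.

    URLs which do not point at gully:5000 are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc != ONSITE_NETLOC:
        return url
    return urlunsplit(parts._replace(netloc=OFFSITE_NETLOC))


def to_onsite(url):
    """Rewrite an off-site (tunnel) OMD URL to point directly at gully."""
    parts = urlsplit(url)
    if parts.netloc != OFFSITE_NETLOC:
        return url
    return urlunsplit(parts._replace(netloc=ONSITE_NETLOC))


if __name__ == "__main__":
    print(to_offsite("http://gully:5000/scpmonitor/check_mk"))
```

Only the host and port change; the path under /scpmonitor and any query string are carried over as-is.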

Web View Examples

Main overview (check_mk)

OMD View Index Page

All usbdisk usage

NagVis SCP Map View

Following are other interesting examples to try out.

Using Search Graphs to plot USB disk usage

Browse to the check_mk main overview page, then on the left sidebar in the Views block, click on Search Graphs under Addons. Now on the right, look for the Service box. Enter a regular expression to match the usbdisk filesystem checks:

fs_.*usbdisk

Click the Search button, and the result should be a web page of usage and trend plots for all hosts with a USB disk. The graph search can be further restricted to just DSMs or to particular hosts with the Hostname and Hostgroup selection boxes.
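To get a feel for what the expression fs_.*usbdisk selects, here is a small Python sketch. The service names in the list are made up for illustration, and the sketch assumes the Service box matches the expression anywhere in the service name rather than requiring a full match.

```python
import re

# The expression entered in the Service box above.
SERVICE_PATTERN = re.compile(r"fs_.*usbdisk")

# Hypothetical service names, for illustration only; the real names
# depend on where each DSM mounts its USB disk.
EXAMPLE_SERVICES = [
    "fs_/media/usbdisk",
    "fs_/",
    "NTP Time",
    "fs_/var/tmp/usbdisk",
]


def matching_services(names, pattern=SERVICE_PATTERN):
    """Return the names in which the pattern matches anywhere (unanchored)."""
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    for name in matching_services(EXAMPLE_SERVICES):
        print(name)
```

The fs_ prefix limits the search to filesystem checks, and .* allows any mount path in front of usbdisk.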

A similar search can plot some NTP statistics for all DSM hosts, but those are not very interesting plots because they are all sub-millisecond measurements.

To go back to a web page at a later time, bookmark the link underneath all the plots, the one with the hover hint "URL to this page including sidebar".

Service and Host Checks

The check_mk inventory feature is used to automatically survey all the hosts on the SCP network and identify standard service checks on those hosts, such as disk space, mount options, networking, and NTP. Here are the hosts being monitored:

Server	localhost
DSM Hosts	A13 A15 A16 A17 A18 A19 A4 A7 Ah1 Ah2 Ah5 Ah6 Ap10 Ap12 Ap14 Ap8 Ap9 Aph3 Ars11 C20 M21 Mu22
Ping-only hosts	sodar netgear btap3 ap24 ec eis eos

NAGIOS pings all of these hosts to verify they are up and reachable. Additionally, the DSMs and localhost (gully) respond to service status requests through the check_mk_agent.

There is one custom service check running on gully. The 5-minute averaged netcdf files are checked by the iss_catalog.py script, and any Ifan values below 15 are flagged as a critical error, causing NAGIOS to send an email notification. In addition to the Ifan check, the script relies on python code which reads the netcdf files and detects unexpected gaps or lags in the sample times. The python code is checked out in ~scpmonitor/iss_python.
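The logic of this check can be sketched roughly as follows. This is not the actual iss_catalog.py code: the function names, the variable name Ifan_Ap10, and the gap tolerance are assumptions made for illustration. Only the threshold of 15, the 5-minute interval, and the use of a critical status come from the description above. The status codes 0 and 2 are the standard NAGIOS plugin codes for OK and CRITICAL.

```python
# Standard NAGIOS plugin status codes.
OK = 0
CRITICAL = 2

# Threshold from the text: Ifan values below 15 are critical.
IFAN_CRITICAL_BELOW = 15.0


def check_ifan(name, value, threshold=IFAN_CRITICAL_BELOW):
    """Return (status, message) for one Ifan value (hypothetical helper)."""
    if value < threshold:
        return CRITICAL, f"CRITICAL - {name} = {value:g} is below {threshold:g}"
    return OK, f"OK - {name} = {value:g}"


def find_gaps(times, interval=300.0, tolerance=1.0):
    """Return (start, end) pairs of consecutive sample times, in seconds,
    which are further apart than the expected interval plus a tolerance.

    The 300 s default matches the 5-minute averages; the 1 s tolerance
    is an assumption.
    """
    return [
        (start, end)
        for start, end in zip(times, times[1:])
        if end - start > interval + tolerance
    ]


if __name__ == "__main__":
    # Ifan_Ap10 is a hypothetical variable name.
    print(check_ifan("Ifan_Ap10", 12.0))
    print(find_gaps([0, 300, 600, 1500, 1800]))
```

A value exactly at the threshold counts as OK here, since the text only flags values below 15.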

Email Alerts

Right now NAGIOS is configured to email alerts to Gary. NAGIOS is very flexible in how it notifies of problems, so this can evolve in whatever way is useful for ISFS.
