Introduction

Purpose

The Saudi-c3 cluster (“the cluster”) is a high-performance computing cluster designed to run the WRF weather model. The system, including all hardware and software, was purchased and configured by NOAA and NCAR for PME. In March 2014, NCAR will turn over responsibility for maintaining the system to PME staff. This document describes the cluster and the procedures for maintaining it.

Audience

This document is intended for use by computer systems administrators. It assumes the reader has a high level of skill with computer hardware, the Linux operating system, IP networking and computer systems administration.

Scope

This document covers only the hardware, the operating system software, the software libraries installed outside the operating system's distribution channels, and other third-party software used for system operations and maintenance of the cluster. It describes the setup, configuration and customizations made by NCAR systems administrators. It does not cover general information about these tools and technologies beyond how they are installed and configured on the cluster; readers unfamiliar with a specific product or technology may find such general-purpose information on the vendor's website.

This document does not cover the installation, configuration or usage of the WRF weather model or other meteorological or scientific software.

Hardware

The cluster consists of the following computers:

  • 1 Interactive node – Dell R710, 2 Intel Xeon X5650 2.6GHz 6-core CPUs, 24 GB memory, Dell PERC 700 integrated RAID controller, 1.6 TB usable disk space configured as RAID 6
  • 20 Compute nodes – Dell R610, 2 Intel Xeon X5660 2.8GHz 6-core CPUs, 12 GB memory, 146 GB disk space configured as RAID 1
  • 1 Dell PowerVault MD1200 storage array, 16.3 TB disk space configured as RAID 6, connected to the interactive node via a Dell PERC 800 RAID controller
  • 1 Dell/APC PowerEdge 2700W UPS (powers the interactive node and PowerVault disk array only)
  • 1 Mellanox IS5030 36-port Infiniband switch for low-latency WRF MPI data traffic
  • 1 Dell PowerConnect 6224 24-port GbE switch for IP data traffic
  • 1 Dell PowerConnect 6224 24-port GbE switch for management/idrac/Lights-Out-Management/IPMI traffic

Replacements Under Warranty

All hardware is under warranty from Dell until May 1st, 2017. Contact Dell directly to receive replacements for failed hardware. You will need the serial number (also known as the Service Tag) of the failed piece of equipment. The serial numbers are as follows:

Type                  Hostname     Model              Service Tag / Serial Number
Interactive Node      int1         R710               DQLN4V1
Compute Node          node1        R610               4P5RNS1
Compute Node          node2        R610               4P5TNS1
Compute Node          node3        R610               4P2WNS1
Compute Node          node4        R610               4P2VNS1
Compute Node          node5        R610               4P4RNS1
Compute Node          node6        R610               4P4PNS1
Compute Node          node7        R610               5ZPNNS1
Compute Node          node8        R610               5ZNSNS1
Compute Node          node9        R610               5ZPVNS1
Compute Node          node10       R610               5ZPQNS1
Compute Node          node11       R610               5ZQSNS1
Compute Node          node12       R610               5ZPSNS1
Compute Node          node13       R610               5ZQQNS1
Compute Node          node14       R610               5ZQNNS1
Compute Node          node15       R610               5ZPTNS1
Compute Node          node16       R610               SZNTNS1
Compute Node          node17       R610               4P1WNS1
Compute Node          node18       R610               4P4TNS1
Compute Node          node19       R610               4P1VNS1
Compute Node          node20       R610               4P3PNS1
Storage Array         N/A          MD1200             4CMVPS1
UPS                   N/A          PowerEdge 2700W    30TR9S1
IP Data Switch        switch-data  PowerConnect 6224  HWGQ7V1
IP Management Switch  switch-mgmt  PowerConnect 6224  14YJ7V1
IB Switch             N/A          IS5030             2c90200472958

All serial numbers are also printed on the devices themselves.

Networks

Only the interactive node (saudi-c3-int1) is connected to the PME network. It uses IP address 10.20.4.115 (assigned by PME). All other equipment is accessible only by first logging on to saudi-c3-int1. The cluster uses 3 private networks:

IB (Infiniband) Network (192.168.1.0/24)

Performance of the WRF application depends on a low-latency network transport, so the cluster uses IB. The IB network runs over the Mellanox IB switch. Note that WRF MPI traffic uses only the ibverbs protocol, an alternative to the IP protocol for Infiniband networks.

We have configured IP-over-IB (IPoIB) networking to serve as a backup IP data network, but WRF will still operate normally with IPoIB disabled and all interfaces on the 192.168.1.0 subnet offline. When troubleshooting IB, therefore, use the `ib*` commands, such as `ibhosts` and `ibqueryerrors`, rather than IP-based tools. The compute nodes may be addressed on this network using the normal hostname suffixed with '-ib'. For example, node1's hostname on this network is 'node1-ib'.
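
For example, a quick IB health check from the interactive node might look like the following (a minimal sketch, assuming the standard Infiniband diagnostic tools are in the PATH):
    # Show the state of the local HCA port (should be Active/LinkUp)
    ibstat
    # List every host visible on the IB fabric; all 20 compute nodes should appear
    ibhosts
    # Report ports on the fabric that are accumulating errors
    ibqueryerrors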

Data Network (192.168.3.0/24)

This network serves as the primary connection between the interactive and compute nodes. It carries IP traffic such as SSH, NFS and NTP. MPI applications (such as WRF) use this network for the initial execution of the program on the compute nodes, so this network must be working for parallel WRF runs even though it doesn't carry the MPI data traffic. Additionally, filesystem data, including /usr/local, /d1 and /d2, is shared between the interactive node and compute nodes over this network using the NFS protocol. The hostnames of systems on this network have no special suffix.
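
As a quick sanity check (a minimal sketch; the exact mounts are defined in each node's /etc/fstab), you can confirm from the interactive node that a compute node sees the NFS shares over this network:
    # List the NFS filesystems currently mounted on node1
    ssh node1 'mount | grep nfs'
    # Confirm the shared filesystems are mounted and report free space
    ssh node1 df -h /usr/local /d1 /d2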

Management Network (192.168.4.0/24)

All the compute nodes include a Dell Remote Access Controller (drac or idrac for short). These devices allow you to view hardware logs and control system power and BIOS boot options over the network. The compute nodes' idrac network interfaces are connected to this network.

Note that although the interactive node has an idrac, one of its operating system interfaces, rather than the idrac interface, is connected to this network. This allows you to manage the compute nodes remotely from the interactive node using the IPMI protocol. The command ipmitool is installed for this purpose. The compute nodes may be addressed on this network using the normal hostname suffixed with '-mgmt'. For example, node1's hostname on this network is 'node1-mgmt'.

IDRAC/Remote Management Credentials

The username and password for the idracs have been left at the defaults (username: 'root', password: 'calvin') for simplicity. Since the management network is accessible only from the interactive node, not from the PME production network or the internet, this does not represent a security concern.

Systems Administration Tools

The following systems administration tools are installed:

  • pdsh – Allows you to run a command on all nodes in parallel.
    Example:
    pdsh -g compute uptime
    This would run the `uptime` command on all compute nodes. The groups (specified by the -g flag) are defined in /etc/dsh/group/
  • ipmitool – Allows you to do things like view the system event (hardware) log, set the boot device and reboot compute nodes remotely.
    Example:
    ipmitool -H node7-mgmt -U root -P calvin sel list
    This would display any hardware errors on node7, similar to what you might see on the LCD display on the node itself.
  • /opt/LSI-Tools/megacli-overview – This script (written at NCAR) uses the /opt/LSI-Tools/megacli command (provided by LSI, the manufacturer of the RAID controller) to display the status of the RAID arrays on the interactive node.
  • /opt/LSI-Tools/sas2ircu – Displays the status of the RAID array on the compute nodes. Example:
    ssh node1 /opt/LSI-Tools/sas2ircu 0 STATUS

Defining and updating nodes

The master list of all compute nodes is kept in /etc/nodes. This file is used by the script /usr/local/sbin/update_nodes.pl to generate various other configuration files and to ensure they consistently represent all nodes. Files managed using /etc/nodes and update_nodes.pl are:

  • /etc/hosts
  • /etc/dhcp/dhcpd.conf
  • /etc/ssh/ssh_known_hosts

The /etc/ssh/ssh_known_hosts file is actually managed by puppet. The puppet template that creates it (/etc/puppet/templates/etc/ssh/ssh_known_hosts.erb) is generated by /usr/local/sbin/keyscan.sh, which reads /etc/hosts. Keyscan.sh is not run automatically by update_nodes.pl because it takes a while to run. If you add a node, change any of a node's IP addresses, or change a node's ssh key (such as by reinstalling it), do the following on the interactive node (a consolidated command sketch follows the list):

  1. Run update_nodes.pl
  2. Run puppet agent --test to update the interactive node's /etc/hosts file.
  3. Delete all the ssh keys (but not the ruby code at the top) from /etc/puppet/templates/etc/ssh/ssh_known_hosts.erb.
  4. Run keyscan.sh >> /etc/puppet/templates/etc/ssh/ssh_known_hosts.erb to update puppet's ssh_known_hosts template.
  5. Run puppet on all systems to push the new ssh_known_hosts file. (pdsh -g all puppet agent --test)
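
Taken together, the sequence looks like this (a minimal sketch of the same five steps; step 3, removing the old keys from the template while keeping the ruby code at the top, is a manual edit):
    # 1. Regenerate /etc/hosts, dhcpd.conf and related files from /etc/nodes
    /usr/local/sbin/update_nodes.pl
    # 2. Update the interactive node's /etc/hosts via puppet
    puppet agent --test
    # 3. Manually edit /etc/puppet/templates/etc/ssh/ssh_known_hosts.erb to remove the old keys
    # 4. Append freshly scanned keys to the template
    /usr/local/sbin/keyscan.sh >> /etc/puppet/templates/etc/ssh/ssh_known_hosts.erb
    # 5. Push the new ssh_known_hosts file to every node
    pdsh -g all puppet agent --test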

Non-automated updates for node changes

In addition to the things automatically updated from /etc/nodes as described above, when changing, adding or removing nodes, you may also need to update the following:

  • /etc/dsh/group/*
  • Update the node's status in Torque/PBS using the qmgr command (example commands are sketched below).
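
The exact syntax depends on the Torque version installed; the commands below are only a hedged sketch (node21 is a placeholder name), and pbsnodes is the companion tool commonly used to take nodes in and out of service:
    # List all nodes and their current state
    pbsnodes -a
    # Take node1 offline for maintenance, then return it to service
    pbsnodes -o node1
    pbsnodes -c node1
    # Add or remove a node definition
    qmgr -c "create node node21"
    qmgr -c "delete node node21"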

Operating System

The cluster runs the Debian distribution of the Linux operating system. Debian is a free distribution and may be used without purchasing licenses.

Deployment: Fully Automated Installation (FAI)

The compute nodes were installed using FAI to ensure installations are identical and to make the installation process quick. New nodes can be added and existing nodes reinstalled using FAI. The interactive node is configured as an FAI server. To reinstall node1, for example, you would do the following:

  1. Set up FAI to allow node1 to PXEboot
    fai-chboot -IPFv node1
  2. Set node1 to boot from the network (PXE) on next boot
    ipmitool -H node1-mgmt -U root -P calvin chassis bootdev pxe
  3. Reboot the node
    ipmitool -H node1-mgmt -U root -P calvin chassis power soft

The node should reboot and reinstall via FAI. If problems occur, check the following (a quick command sketch follows the list):

  • Is the isc-dhcp-server service running?
  • Is the tftpd-hpa service running?
  • Are there errors in /var/log/messages related to dhcp, tftp or the mac address of the compute node?
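
A quick way to run these checks on the interactive node (a minimal sketch; depending on the syslog configuration, DHCP and TFTP messages may land in /var/log/syslog or /var/log/daemon.log instead):
    # Confirm the DHCP and TFTP services are running
    service isc-dhcp-server status
    service tftpd-hpa status
    # Look for recent DHCP/TFTP activity or errors for the node being installed
    grep -iE 'dhcp|tftp' /var/log/messages | tail -n 50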

Software Updates

The interactive node runs apt-mirror via the cron script /etc/cron.d/apt-mirror on a daily basis to download (but not install) the latest software updates from Debian. All nodes in the cluster are configured to access this Debian repository mirror via the aptitude config file /etc/apt/sources.list.d/saudi-c3.list. The latest software updates may be applied by running the command aptup as root. This command is a bash shell alias defined in /root/.bashrc and gives you the option to skip kernel updates. If kernel updates are applied, the system will need to be rebooted.

Applying software updates using aptitude is usually not disruptive to operations. However, PME should weigh the stability of the current system against potential security concerns about older software releases before applying updates. Updates should be tested on a single compute node before being applied to the entire cluster (see the sketch below).
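
For example, one hedged approach to testing on a single node first (node1 is an arbitrary choice; apt-get works equivalently if aptitude is not installed on the compute nodes):
    # Refresh the package lists and apply updates on node1 only
    ssh node1 'aptitude update && aptitude -y safe-upgrade'
    # After confirming node1 still runs jobs normally, roll out to the rest, e.g.:
    # pdsh -g compute 'aptitude update && aptitude -y safe-upgrade'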

Debian Upgrades

Debian 6.0 (codename “Squeeze”) is installed and was the current stable release of Debian at the time the cluster was configured. Debian 7.0 (codename “Wheezy”) is the current Debian release as of this writing. Debian will stop supplying security updates for Squeeze when the next release, “Jessie”, becomes the stable release.

PME should weigh the stability of the current system against potential security concerns that may arise with the existing release when considering upgrading. Upgrading will involve recompiling libraries in /usr/local, as well as the WRF model code and is likely to result in significant downtime and effort on the part of both systems administrators and modelers to complete these tasks.

Locally compiled software libraries

WRF and other NCAR software require some software that Debian does not provide, as well as newer versions of certain packages than Debian provides. Such libraries are compiled from source and kept in /usr/local per the Filesystem Hierarchy Standard (FHS). Sources are kept in /usr/local/src.

 

All software installs are scripted using custom makefiles created by NCAR. Note that these are in addition to the makefiles provided with the software source code. The NCAR makefiles call the software's own makefile and perform the other tasks necessary to install the software. For example, to reinstall netcdf, you would run the following commands:

 

  1. Find the netcdf sources
    cd /usr/local/src/netcdf
  2. Remove previous compilations
    make clean
  3. Compile netcdf
    make

 

That should be all you need to do. The NCAR makefile in /usr/local/src/netcdf includes the appropriate configure and make options to compile everything for the saudi-c3 cluster and install it in /usr/local/netcdf-4.1.2-pgi12.8.

Managing multiple versions of libraries

We have multiple versions of certain libraries, or multiple copies of the same version compiled with different compilers, installed in /usr/local. All software is therefore installed in a version- and compiler-specific path. /usr/local/netcdf-4.1.2-pgi12.8, for example, is netcdf version 4.1.2 compiled with pgi 12.8.

We provide a recommended (default) version of each package that will automatically be located by compilers and by the runtime linker using symlinks. For example, /usr/local/netcdf represents the recommended version of netcdf and is a symlink to /usr/local/netcdf-4.1.2-pgi12.8. Likewise, the shared library /usr/local/lib/libnetcdf.so is a symlink to /usr/local/netcdf-4.1.2-pgi12.8/lib/libnetcdf.so. Both the compiler and the runtime linker will find that file by default.

We use the script /usr/local/sbin/update-usr-local-links to manage these symlinks. This script uses the configuration file /usr/local/etc/packages to determine which software it manages and which version is the default.
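
To see which version a given symlink currently points at, or to rebuild the links after editing the configuration (netcdf is used here only as an example; check the update-usr-local-links script itself for any options it expects):
    # Show which versioned directory the default netcdf symlink resolves to
    readlink -f /usr/local/netcdf
    # List all top-level symlinks in /usr/local and their targets
    ls -l /usr/local
    # Regenerate the symlinks after editing /usr/local/etc/packages
    /usr/local/sbin/update-usr-local-links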

Distributing locally compiled software

Locally compiled software need only be compiled once on the interactive node. /usr/local is then shared via NFS to the compute nodes.

Configuration Management (Puppet)

The cluster uses the puppet configuration management tool to ensure software and other configurations are consistent across all nodes. All changes to compute nodes and most changes to the interactive node are done exclusively through puppet. This ensures the same changes are made on all nodes and that we can reinstall a node from bare metal or add new nodes and be sure they will be exactly the same.

The puppet server runs on the interactive node. Its configuration files are in /etc/puppet and it runs as the service 'puppetmaster'. The interactive node also runs the puppet agent via the service 'puppet', as do the compute nodes. This service automatically synchronizes the node with the configuration on the server every half hour.

You can run puppet manually with the following command:
            puppet agent --test

This command color-codes its output. Green and blue lines are informational and can safely be ignored. Yellow and red lines represent potential problems that the systems administrator should investigate.

User Management

For MPI programs like the WRF weather model to work correctly, user accounts must be created consistently across all compute nodes. To ensure this, create accounts using puppet. Users are configured in /etc/puppet/manifests/site.pp. Follow an existing local_user {} stanza as an example.

Monitoring (Opsview/Nagios)

We use Opsview to monitor the cluster for various failure conditions and send e-mail notifications when they occur. Opsview is free, open-source software based on the more popular monitoring software, Nagios. Opsview adds web-based configuration and automatic graphing to Nagios.

Accessing the web interface

You can access the opsview web interface at https://saudi-c3.rap.ucar.edu/opsview/.

You can log in with the same username and password you use to log in to the cluster.

What to look for

Opsview shows a hierarchical organization of equipment in the cluster with Red and Yellow indicating problems and Green indicating things that are operating normally. Summaries of each group are shown on the Host Group Summary (the first screen you see after logging on). Click on any red or yellow boxes to drill down to the failures.

The most important thing Opsview monitors for is the status of the RAID arrays. When disks fail in the RAID array, the cluster will continue to operate normally. However, multiple disk failures will result in data loss. Systems administrators at PME should therefore configure opsview to notify them by e-mail when a disk fails so it can be replaced promptly. Additionally, PME systems administrators should manually check that all RAID arrays are in the Optimal state (indicating no disk failures) on a regular basis. Logging on to the opsview web interface and looking for the red and yellow boxes indicating failures is a quick and easy way to do this. The opsview RAID checks are 'megacli' and 'megacli_bbu' on the interactive node and 'sas2ircu' on the compute nodes.
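
A manual spot check from the interactive node might look like the following (a sketch combining the tools described in the Systems Administration Tools section; the grep is only a convenience filter):
    # Summarize the RAID arrays on the interactive node (look for Optimal)
    /opt/LSI-Tools/megacli-overview
    # Check the RAID status on every compute node in parallel
    pdsh -g compute /opt/LSI-Tools/sas2ircu 0 STATUS | grep -i state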

Opsview also provides a quick status and historical graphing of important metrics, such as CPU, memory, disk and network traffic. This may help quickly identify the source of problems occurring on the cluster.

User Management

Although our instance of opsview uses unix PAM authentication and therefore shares the username and password with the operating system on the interactive node, not all OS users automatically have access to opsview. To give someone who has a login on the cluster access to opsview, create a contact (opsview's term for users) using the web interface with the same username as the person uses in the OS. Select 'unix' as the authentication realm.

To give a user who does not have or need an operating system login access to opsview, simply select the 'local' authentication realm when creating the contact and set the user's password in the fields provided.

Installation, configuration, upgrades and NCAR customizations

Opsview isn't included as part of Debian, but Opsera (the vendor) maintains a Debian-compatible software repository. The cluster maintains a local mirror of the Opsview Debian repository (see the Software Updates section above), so opsview can be installed, reinstalled and upgraded using the normal Debian aptitude process. However, we make the following customizations to opsview:

  • PAM-based logins
    To allow users to authenticate to opsview using the same username and password they use for the operating system, we added a PAM module. It involves the following files:
    • /etc/pam.d/opsview
    • /usr/local/opsview_web_local.yml
    • /opt/opsview/pam/perl/lib/perl5/Catalyst/Authentication/Credential/PAM.pm

If authentication doesn't work after an upgrade or reinstall, restore those files from backup and restart the opsview-web service.

NCAR also installs custom service checks in /usr/local/nagios/libexec. However, these are installed by puppet, so simply running puppet on all systems should install and configure these correctly.

Procedures

The cluster is configured to restart automatically after a power outage with no manual intervention. The interactive node and MD1200 disk array (but not the compute nodes) are also connected to an uninterruptible power supply, which should keep them online during short power outages. However, the manual shutdown and startup procedures are below:

Shut Down

  1. Shut down all compute nodes
    pdsh -g compute shutdown -h now
  2. Confirm all compute nodes are down
    for f in `seq 1 20` ; do ipmitool -H node${f}-mgmt -U root -P calvin chassis power status; done
  3. If any nodes are NOT off, they can be forced off with the command
    ipmitool -H node#-mgmt -U root -P calvin chassis power off
  4. Shutdown the interactive node
    shutdown -h now
  5. Wait for the interactive node to power off
  6. Push the power button on the PowerVault RAID array
  7. Turn off the UPS by holding down the power button until it powers itself off.
  8. Unplug the switches to power them off if necessary.

Startup

  1. Plug in the switches if they were unplugged.
  2. Push the power button on the UPS and wait for it to beep once, indicating it's powered on and receiving power from the grid rather than its own batteries.
  3. Push the power button on the PowerVault MD raid array and wait 1 minute for the disks to spin up.
  4. Start the interactive node
    This must be done manually by pushing the power button. The interactive node does have an idrac so it could be configured to allow remote startup, but it would need to be connected to a secure network accessible from the desired remote location.
  5. Start the compute nodes
    for f in `seq 1 20`; do
       ipmitool -H node${f}-mgmt -U root -P calvin chassis power on;
    done

Backups

The cluster does not have off-site backups. A RAID array failure would therefore result in the loss of all data. PME should consider backing up the cluster to protect against risks such as multi-disk failure, theft or natural disasters.

The cluster does keep a local disk-based backup of the operating system and /home partition using rsnapshot, a script that creates historical backups in a manner that uses disk space efficiently.

The large data partitions /d1 and /d2 are excluded from these backups due to their size.
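
To verify the rsnapshot configuration or preview what a backup run would do without executing it (a minimal sketch; 'daily' is an assumed interval name, so check /etc/rsnapshot.conf for the intervals actually configured):
    # Validate the rsnapshot configuration file
    rsnapshot configtest
    # Print the commands a 'daily' backup run would execute, without running them
    rsnapshot -t daily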

Security

As of this writing, the cluster does not run a host-based firewall. It is also partially exposed to the public Internet via NAT: PME's firewall allows SSH, HTTP, HTTPS and FTP. Because the cluster is exposed to the public Internet, PME should consider the potential consequences to both the cluster and other internal systems at PME if unauthorized access were to occur, and manage the system accordingly. Procedures to consider are:

  • Keep the software listening on the publicly allowed protocols up to date. This includes the Apache web server, pure-ftpd and the OpenSSH server.
  • Run a host-based firewall on the interactive node as a fail-safe in case the PME firewall were mistakenly configured to allow access on additional, unnecessary ports.
  • Require users logging on via SSH to use public/private key pairs rather than passwords (see the sketch below), and train all users on how to properly protect their private keys.
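
Key-only SSH logins can be enforced on the interactive node through /etc/ssh/sshd_config, for example (a minimal sketch; make sure every user has a working key pair before disabling passwords):
    # Add or change these lines in /etc/ssh/sshd_config:
    PasswordAuthentication no
    ChallengeResponseAuthentication no
    # Then restart the SSH service for the change to take effect
    service ssh restart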

 
