How to check and restart processing

Check DSM archive files (lsu)

On each DSM, the lsu command displays the last 10 archive files. On the flux system, the lsu command runs the DSM's lsu on each of the tower DSMs in succession. After 0Z, the rsync scripts copy and remove the previous day's files, so you will generally see only files for the current day. The modification time of the last file shown for each DSM should be the current time in UTC. If you run lsu a second time, the size of that last file should have grown.
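The two checks described above (recent modification time, growing size) can also be scripted. This is a minimal sketch, not part of lsu: the function names are hypothetical, and you must supply the path of the archive file yourself.

```shell
# Hypothetical helpers, not part of lsu. Pass the path of an archive file.

# Succeeds if the file was modified within the last N minutes (default 2).
recently_modified() {
    [ -n "$(find "$1" -mmin -"${2:-2}" 2>/dev/null)" ]
}

# Succeeds if the file grows during a wait of N seconds (default 5).
file_is_growing() {
    size_before=$(( $(wc -c < "$1") ))
    sleep "${2:-5}"
    size_after=$(( $(wc -c < "$1") ))
    [ "$size_after" -gt "$size_before" ]
}
```

For example, file_is_growing followed by the path of the last file listed for a DSM should succeed while that DSM is archiving.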

Cockpit

On flux, you can run the cockpit command from a terminal window, to monitor the real-time data. A popup window will appear. Select "Search" to connect to the server on the localhost, port 30005.

cockpit will then create a set of tabbed pages, each containing small time-series plots for each sampled variable of a DSM. Each second, a vertical pixel line is drawn from the minimum to the maximum value of the variable over that second. When the trace reaches the right-hand side of the plot, it is greyed out and a new trace is begun. If a sensor is not reporting, RIP will be displayed on the plot. The greyed-out traces provide a history of the data. The history can be cleared by selecting GlobalSetup -> Color -> Cleanup History.

cockpit can be configured to cycle through the tabbed pages via GlobalSetup -> AutoCycleTabs.

Check Services (sstat)

The post-processing of CABL data is done on the flux laptop at the BAO tower and on porter2 at EOL. The systemd service manager of RedHat Linux starts and monitors the services that run the various processing steps. On flux, the processes run under the aster userid; on porter2, they run under user maclean. systemd starts them automatically at bootup.

To check the status of the services, use the sstat command. It displays a tree of the various services, followed by an indication of "all services seem to be running", or it will list the missing services.

If a process of a service isn't running, look at the system log file, /var/log/isfs/isfs.log, to help track down the problem. Many of the scripts run by the services listed below also write to log files in $ISFF/projects/CABL/ISFF/logs.
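When the log is long, it helps to pull out only the recent lines that mention the service in question. The helper below is a hypothetical sketch, not an existing command; it defaults to the system log file named above.

```shell
# Hypothetical helper: show the last N lines (default 20) of a log file
# that mention a given service or process name (case-insensitive).
recent_log_lines() {
    pattern=$1
    logfile=${2:-/var/log/isfs/isfs.log}
    count=${3:-20}
    grep -i -- "$pattern" "$logfile" | tail -n "$count"
}

# Example use on flux:
#   recent_log_lines statsproc
# systemd's own view of a user service can be shown with:
#   systemctl --user status 'statsproc@qc_geo_notiltcor'
```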

On flux, the services are:

nc_server: the NetCDF server process that writes data received by statsproc and R to the NetCDF files
dsm_server@noqc_instrument: dsm_server process that receives and archives data from the DSMs on the tower.
statsproc@qc_geo_notiltcor: computes statistics from the 300m tower for the qc_geo_notiltcor dataset, i.e. the files in netcdf_geo_notiltcor
statsproc@noqc_instrument: computes statistics from the 300m tower for the noqc_instrument dataset, i.e. the files in netcdf_noqc_instrument
rsync_dsms: script that wakes up periodically and rsyncs files from the DSMs on the tower, then runs merge_nightly.sh to merge and reprocess the previous day's files.
R_derived: runs R every 5 minutes to create derived values in the files on netcdf_geo_notiltcor
ssh_tunnel: creates the ssh tunnel to NCAR

On porter2:

nc_server
cabl_flab_statsproc@qc_geo_notiltcor: computes statistics from the 300m tower for the qc_geo_notiltcor dataset, i.e. the files in netcdf_geo_notiltcor
cabl_flab_statsproc@noqc_instrument: computes statistics from the 300m tower for the noqc_instrument dataset, i.e. the files in netcdf_noqc_instrument
cabl_flab_statsproc2@qc_geo_notiltcor: computes statistics from the bao and ehs flux stations for the qc_geo_notiltcor dataset, i.e. the files in netcdf_geo_notiltcor
cabl_flab_statsproc2@noqc_instrument: computes statistics from the bao and ehs flux stations for the noqc_instrument dataset, i.e. the files in netcdf_noqc_instrument
rsync_flab: runs the rsync_loop_flab.sh script, which wakes up periodically and rsyncs files from flux, then runs merge_nightly_flab.sh to merge and reprocess the previous day's files.
R_derived
proc_restarter: runs every 10 seconds to check if a user has requested to restart any services

sstat will also show rsync_loop and statsproc@trh_test services on porter2. Those are running in support of the CentNet project.

Note that NetCDF files of 5-minute statistics are being created independently on flux and on porter2. The files on flux may not really be needed, in which case the statsproc and R_derived services on flux could be disabled. The files created by porter2 are used by Ncharts and are rsync'd periodically to ftp://ftp.eol.ucar.edu/pub/archive/isff/projects/cabl.

Restart real-time service (restart_service, restart_statsproc)

If you make a change to the XML or a calibration file, you will usually want to restart the real-time statsproc processes. You need to restart dsm_server on flux only if an XML change affects the archive of the raw data.

To restart the statsproc processes on flux or porter2, use the restart_statsproc command. On flux it does a systemctl --user restart of the two statsproc services.

On porter2 the processes are running under the maclean login, and only that user has permission to restart the services. As a work-around, restart_statsproc writes a string to the file $ISFF/projects/$PROJECT/ISFF/logs/restart_proc.txt. The proc_restarter service wakes up every 10 seconds, checks that file, and if it contains the string "statsproc", does a systemctl --user restart on the four statsproc services.
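The request-file hand-off described above can be sketched as follows. This is an illustration only, not the actual restart_statsproc or proc_restarter scripts: the function names and the default path are hypothetical, the restart commands are printed rather than executed, and clearing the file after handling a request is an assumption.

```shell
# Hypothetical sketch of the request-file hand-off; not the real scripts.
REQUEST_FILE=${REQUEST_FILE:-/tmp/restart_proc.txt}

# Requester side: any user who can write the file records a request.
request_restart() {
    echo "$1" >> "$REQUEST_FILE"
}

# Owner side: one polling pass, run by the user who owns the services.
# Prints the restart commands instead of running them.
handle_requests() {
    [ -s "$REQUEST_FILE" ] || return 0
    if grep -q statsproc "$REQUEST_FILE"; then
        for svc in \
            cabl_flab_statsproc@qc_geo_notiltcor \
            cabl_flab_statsproc@noqc_instrument \
            cabl_flab_statsproc2@qc_geo_notiltcor \
            cabl_flab_statsproc2@noqc_instrument
        do
            echo "systemctl --user restart $svc"
        done
    fi
    # Assumption: the request is cleared once handled.
    : > "$REQUEST_FILE"
}
```

A service in this style would call handle_requests in a loop with a 10 second sleep between passes.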

Or you can use the command restart_service to restart any service running for CABL. You will be prompted to choose a service, by number, and that service will be restarted in a similar way to restart_statsproc.

Reprocess statistics

To recalculate the statistics for the whole project, run this command on an EOL server (porter2, barolo, tikal), after setting your project to CABL:

statsproc -S qc_geo_notiltcor -B "2015 feb 18 00:00" -E "2015 jun 1 00:00"

If you want to recalculate the noqc_instrument dataset, set the -S option accordingly.

The value of the NC_SERVER environment variable should be "porter2" so that the data is sent to nc_server on porter2.

On EOL systems, the default value of the DATADIR environment variable should be "merge", in which case statsproc will process all files on /scr/isfs/projects/CABL/merge. If you want to process a different set of files, you can pass the list of files instead of the start and end time. For example, the 50m files:

cd /scr/isfs/projects/CABL/raw_data

statsproc -S qc_geo_notiltcor 50m*
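Before a whole-project run by start and end time, it can be worth confirming the two environment variables mentioned above. The function below is a hypothetical pre-flight check, not part of statsproc; when you pass an explicit list of files instead, only the NC_SERVER check is relevant.

```shell
# Hypothetical pre-flight check for a whole-project reprocessing run.
# Returns non-zero and reports on stderr if a variable is not as expected.
check_reprocess_env() {
    status=0
    if [ "${NC_SERVER:-}" != "porter2" ]; then
        echo "NC_SERVER is '${NC_SERVER:-}', expected 'porter2'" >&2
        status=1
    fi
    if [ "${DATADIR:-}" != "merge" ]; then
        echo "DATADIR is '${DATADIR:-}', expected 'merge'" >&2
        status=1
    fi
    return "$status"
}

# Example use:
#   check_reprocess_env && \
#     statsproc -S qc_geo_notiltcor -B "2015 feb 18 00:00" -E "2015 jun 1 00:00"
```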

Recalculate derivations

To have the R_derived service recalculate the derived variables for the entire project period the next time it runs, remove this file:

rm $ISFF/projects/CABL/ISFF/logs/R_derived_last.txt

Time keeping

The DSMs each have a GPS with a pulse-per-second signal. Using NTP reference clock software, each DSM is then a stratum 1 time server. NTP on the DSM uses the GPS reference clock to adjust the CPU system clock, and generally reports that the GPS reference clock has less than a 50 microsecond offset from the system clock.

To query the system clock on a tower DSM from flux, use the ntpq -p command, for example for the 50m DSM:

ntpq -p 50m
     remote        refid   st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.    10 l  20d   64     0   0.000    0.000   0.000
oGPS_NMEA(0)     .GPS.      0 l    5   16   377   0.000   -0.001   0.031

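The above shows the GPS reference clock is offset from the system CPU clock by -0.001 milliseconds. The "reach" value for GPS_NMEA should be 377. Reach is an octal display of an 8-bit register of the most recent polls, so 377 (all 1's) means the last eight polls all succeeded. The reach for the LOCAL clock is always 0.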
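The reach check can be scripted by reading the output of ntpq -p. The functions below are a hypothetical sketch: gps_reach assumes the column layout shown above (reach is the seventh whitespace-separated field), and reach_successes counts the 1 bits in the octal reach value.

```shell
# Hypothetical helpers for reading "ntpq -p" output on stdin.

# Print the reach value of the GPS_NMEA line; fail if no such line exists.
gps_reach() {
    awk '$1 ~ /GPS_NMEA/ { print $7; found = 1 } END { exit !found }'
}

# Print how many of the last eight polls succeeded for an octal reach value.
reach_successes() {
    case $1 in
        ''|*[!0-7]*) echo "not an octal value: $1" >&2; return 1 ;;
    esac
    n=$(( 0$1 ))        # leading 0 makes the shell read the value as octal
    count=0
    while [ "$n" -gt 0 ]; do
        count=$(( count + (n & 1) ))
        n=$(( n >> 1 ))
    done
    echo "$count"
}

# Example use from flux:
#   reach_successes "$(ntpq -p 50m | gps_reach)"
```

A healthy DSM should give 8; a lower count means some recent polls of the GPS reference clock failed.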

flux is configured to use all 6 DSMs as network time servers, using chrony, an NTP client. To display the current chrony status, use chronyc sources:

chronyc sources
210 Number of sources = 6
MS Name/IP address   Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 50m                     1   10   377    558  -298ns[+5000ns] +/-  742us
^+ 100m                    1   10   377    179   +52us[  +52us] +/-  662us
^+ 150m                    1   10   377    219 +2000ns[+2000ns] +/-  739us
^* 200m                    1   10   377    338 +9656ns[  +15us] +/-  671us
^+ 250m                    1   10   377    439 +5677ns[  +11us] +/-  732us
^+ 300m                    1   10   377     95   +11us[  +11us] +/-  794us

The above shows that the clock on flux agrees with each DSM to within 52 microseconds, as indicated by the "Last sample" value in brackets, which is the last measured offset between the system clock on flux and the source (in this case the DSM). In chronyc output, a positive offset means the local system clock is ahead of the source.
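The largest bracketed offset can be extracted from chronyc sources output with a short script. This is a hypothetical sketch that assumes the bracketed values are whole numbers followed by ns, us, ms or s, as in the sample above; it reports the largest magnitude in microseconds.

```shell
# Hypothetical helper: read "chronyc sources" output on stdin and print the
# largest bracketed offset magnitude, in microseconds.
max_chrony_offset_us() {
    awk '
        /^\^/ && match($0, /\[ *[-+]?[0-9]+(ns|us|ms|s)\]/) {
            s = substr($0, RSTART + 1, RLENGTH - 2)   # drop the brackets
            gsub(/[ +-]/, "", s)                      # keep digits and unit
            unit = s
            sub(/^[0-9]+/, "", unit)                  # unit is what follows the digits
            val = s + 0                               # numeric value of the leading digits
            if (unit == "ns") val /= 1000
            else if (unit == "ms") val *= 1000
            else if (unit == "s") val *= 1000000
            if (val > max) max = val
        }
        END { printf "%g\n", max + 0 }
    '
}

# Example use on flux:
#   chronyc sources | max_chrony_offset_us
```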

The second character, under "S", should be '*' (indicating chrony on flux is sync'd to this server) or '+' (a good server that is combined with the selected one). You may also see '-' (recently on 100m, for some reason), indicating that chrony considers the server acceptable but has excluded its time information from the combination, relative to the others.

The "Reach" values should again be 377. If not, the DSM is not on the network, or its NTP server is not responding or is not sync'd to its GPS.
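The state and reach checks can be combined into one pass over the chronyc sources output. The function below is a hypothetical sketch that assumes the column layout shown above (state in the second character of the first field, name in the second field, reach in the fifth).

```shell
# Hypothetical helper: read "chronyc sources" output on stdin, list every
# source whose state is not '*' or '+' or whose reach is not 377, and
# return non-zero if any such source was found.
check_chrony_sources() {
    awk '
        /^\^/ {
            state = substr($1, 2, 1)
            if ((state != "*" && state != "+") || $5 != "377") {
                printf "%s: state=%s reach=%s\n", $2, state, $5
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}

# Example use on flux:
#   chronyc sources | check_chrony_sources || echo "check the DSMs listed above"
```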

One can also ssh into bao or ehs and check their clocks with "ntpq -p".

Blog