Blog from March, 2015

TP01 processing

Vpile.off appears to go slightly negative at ehs during the day.  The mote code reports this as 65505 uV (apparently a signed value read as unsigned), which NIDAS .xml then clips to a nan value.  Thus, all the TP01 values (and Gsoil, which is derived from them) become nan during the day.  We need to fix this signed/unsigned error, but the data are fine.
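
The value 65505 is what a small negative number looks like when a signed 16-bit integer is read as unsigned. A minimal sketch of the reinterpretation, assuming the mote reports Vpile.off as a 16-bit two's-complement count in uV (the function name to_int16 is hypothetical and is not part of the mote code or NIDAS):

```shell
#!/bin/bash
# Reinterpret an unsigned 16-bit value (0..65535) as a signed
# two's-complement value (-32768..32767).
# ASSUMPTION: the raw Vpile.off count is 16 bits wide.
to_int16() {
    local raw=$1
    if [ "$raw" -ge 32768 ]; then
        echo $(( raw - 65536 ))
    else
        echo "$raw"
    fi
}

to_int16 65505    # prints -31
```

Under that assumption, 65505 uV corresponds to -31 uV, which is consistent with Vpile.off going slightly negative.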

 

Files for Feb 18 to March 9 have been copied to the HPSS, using the htar command.

 

[maclean@barolo ~]$ hsi ls -l /EOL/2015/cabl/surface/isfs/raw_data
Username: maclean UID: 2423 Acct: 46350000(P46350000) Copies: 1 Firewall: off [hsi.4.0.1.2p3 Thu Mar 28 14:37:05 MDT 2013]
/EOL/2015/cabl/surface/isfs/raw_data:
-r--r--r-- 1 maclean eoldmg 13441634304 Mar 11 17:39 20150218-20150309.tar
-r--r--r-- 1 maclean eoldmg 263456 Mar 11 17:39 20150218-20150309.tar.idx

 

On the /scr/isfs/projects/CABL/raw_data disks on flux and EOL servers, those files have been moved to an "archived" subdirectory.

 
ehs down Mar 11 am

The ehs flux station is down. The battery has not been charging since just after installation.

On Mar 11 10:00 MDT the battery voltage reached 11.5 V and the station died, probably because the charge controller shut off the load.

Mote sensor IDs

Below is an e-mail from Gordon, with output from bao.  I'll append a run on ehs soon...


Periodically the sensors report their serial numbers, which statsproc logs.

ssh flux
cd /var/log/isfs
fgrep SN= isfs.log-20150220
Here's what the log shows for bao today (after removing log message date/time).
At a quick glance they don't seem to correspond to what you noted in the logbook. Seems the EEPROM serial number and the label may disagree...
mote=17, sensorType=0x3b SN=4, typeName=1 field of Int16, often Wetness 
mote=17, sensorType=0x3b SN=4, typeName=1 field of Int16, often Wetness 
mote=17, sensorType=0x56 SN=8, typeName=Uplooking Pyranometer (Rsw.in)
mote=17, sensorType=0x5b SN=2, typeName=Downlooking Pyranometer (Rsw.out)
mote=17, sensorType=0x67 SN=7, typeName=Uplooking K&Z Pyrgeometer (Rlw.in)
mote=17, sensorType=0x6a SN=6, typeName=Downlooking K&Z Pyrgeometer (Rlw.out)
mote=17, sensorType=0x56 SN=8, typeName=Uplooking Pyranometer (Rsw.in)
mote=17, sensorType=0x5b SN=2, typeName=Downlooking Pyranometer (Rsw.out)
mote=17, sensorType=0x67 SN=7, typeName=Uplooking K&Z Pyrgeometer (Rlw.in)
mote=17, sensorType=0x6a SN=6, typeName=Downlooking K&Z Pyrgeometer (Rlw.out)
mote=2, sensorType=0x22 SN=1, typeName=Tsoil
mote=2, sensorType=0x22 SN=1, typeName=Tsoil
mote=2, sensorType=0x23 SN=2, typeName=Tsoil
mote=2, sensorType=0x28 SN=1, typeName=Qsoil
mote=2, sensorType=0x29 SN=6, typeName=Qsoil
mote=2, sensorType=0x23 SN=2, typeName=Tsoil
mote=2, sensorType=0x28 SN=1, typeName=Qsoil
mote=2, sensorType=0x29 SN=6, typeName=Qsoil
mote=1, sensorType=0x20 SN=15, typeName=Tsoil
mote=1, sensorType=0x20 SN=15, typeName=Tsoil
mote=1, sensorType=0x23 SN=18, typeName=Tsoil
mote=1, sensorType=0x26 SN=7, typeName=Gsoil
mote=1, sensorType=0x28 SN=9, typeName=Qsoil
mote=1, sensorType=0x2f SN=8, typeName=TP01
mote=1, sensorType=0x23 SN=18, typeName=Tsoil
mote=1, sensorType=0x26 SN=7, typeName=Gsoil
mote=1, sensorType=0x28 SN=9, typeName=Qsoil
mote=1, sensorType=0x2f SN=8, typeName=TP01
...and run on ehs....

mote=10, sensorType=0x20 SN=4, typeName=Tsoil

mote=10, sensorType=0x21 SN=3, typeName=Tsoil

mote=10, sensorType=0x2a SN=5, typeName=Qsoil

mote=10, sensorType=0x2b SN=4, typeName=Qsoil

mote=8, sensorType=0x39 SN=10, typeName=1 field of Int16, often Wetness

mote=8, sensorType=0x57 SN=4, typeName=Uplooking Pyranometer (Rsw.in)

mote=8, sensorType=0x59 SN=6, typeName=Downlooking Pyranometer (Rsw.out)

mote=8, sensorType=0x66 SN=3, typeName=Uplooking K&Z Pyrgeometer (Rlw.in)

mote=8, sensorType=0x6b SN=8, typeName=Downlooking K&Z Pyrgeometer (Rlw.out)

mote=4, sensorType=0x21 SN=12, typeName=Tsoil

mote=4, sensorType=0x22 SN=17, typeName=Tsoil

mote=4, sensorType=0x24 SN=5, typeName=Gsoil

mote=4, sensorType=0x2a SN=12, typeName=Qsoil

mote=4, sensorType=0x2f SN=4, typeName=TP01

(as before, no relation between these SN numbers and the numbers printed on the sensor cables (sad) )
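
Because the sensors report periodically, the same sensor shows up many times in the log. A small sketch for reducing such a listing to one line per sensor, assuming the log lines contain the "mote=..., sensorType=... SN=..." text in the form shown above (the function name unique_sensors is hypothetical):

```shell
#!/bin/bash
# Reduce statsproc "SN=" log lines (read from stdin) to one line per sensor.
# Any text before "mote=" (e.g. the log date/time) and the typeName are dropped.
unique_sensors() {
    grep 'SN=' \
        | sed -E 's/^.*(mote=[0-9]+, sensorType=0x[0-9a-fA-F]+ SN=[0-9]+).*$/\1/' \
        | sort -u
}

# Example usage on flux (log file name taken from the session above):
#   unique_sensors < /var/log/isfs/isfs.log-20150220
```

The result is sorted as text, so it can be compared directly against a list typed in from the logbook.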

 

bao EC150 restarted

Gordon had noticed that the EC150 had been dead since 5 Mar.  At about 1700-1715, visited the site.  The 1A fuse appears to have blown – replaced by a 3A fuse and all is happy.

Noticed that an EC150 sock had come off (was on the ground).  Didn't check the other sock.

 

ehs installed

From about 1330-1630 installed this station.  Of note:

  • wind observer "north" transducer to the NE
  • CSAT boom approximately to the NW (about like bao)
  • 20cm Tsoil&Qsoil are IDs 24 and 14, respectively
  • 10cm Tsoil&Qsoil are IDs 25 and 15, respectively
  • epoxy-coated picklefork is installed upside-down (coming in as Tsoil); brown-coated picklefork is normal (coming in as Tsoil_x)
  • had trouble getting Xbees to receive – mote RF power likely set too low; ended up putting "base mote" on rad stand, along with the other 3 (1 rad/2 soil) motes.
  • soil motes are ID4 (normal) & 10 (extended)
  • took a soil sample at about 1430 from 2–5cm (labeled "1 (0-3cm)"), but forgot to weigh it!  I'll try my paper towel trick tomorrow, first thing.
  • several mast issues (including some that will require a second visit).  Kurt will update.
  • Kurt reported that DC power from his AC supply was low – Vbatt has been going down slowly.  We'll keep an eye on it.

TODOs:

  • take photos
  • get GPS location
  • shoot boom angles

 

We've missed the fact that the EC150 sonic and IRGA on the bao flux station have not been reporting since Mar 5 18:04 MST.

This does not look like the previous problem of bad communications causing dropped or corrupted characters.  

cktty 8 does not show framing errors:

cktty 8
8: uart:ST16654 port:F1000118 irq:3 tx:14 rx:104935140 RTS|CTS|DTR|DSR|CD|RI

Powering the EC150 down and back up with "eio 8 0; eio 8 1" did not help.  Sending CRs via "rs 8" gets no response.

Shutting down data collection and then resetting port 8 to RS422 also did not help:

ddn
emode 8 422
dup 

Rebooting did not help.  Needs physical attention. Blown fuse on that port?

Check DSM archive files (lsu)

On each DSM, the lsu command displays the last 10 archive files.  On the flux system, the lsu command runs lsu on each of the tower DSMs in succession. After 0Z, the rsync scripts copy and remove the previous day's files, so you will generally see only files for the current day.  The modification time of the last file shown for each DSM should be the current time in UTC.  If lsu is run a second time, the size of that file should have grown.

Cockpit

On flux, you can run the cockpit command from a terminal window, to monitor the real-time data.  A popup window will appear.  Select "Search" to connect to the server on the localhost, port 30005.

cockpit will then create a set of tabbed pages, each containing small time-series plots for each sampled variable of a DSM. Each second a vertical pixel line is drawn from the minimum to the maximum value of the variable over that second. When the trace reaches the right-hand-side of the plot, the trace is greyed out, and a new trace begun.  If a sensor is not reporting, RIP will be displayed on the plot.  The greyed-out traces provide a history of the data. The history can be cleared by selecting GlobalSetup -> Color -> Cleanup History.

cockpit can be configured to cycle through the tabbed pages via GlobalSetup -> AutoCycleTabs.

Check Services (sstat)

The post processing of CABL data is done on the flux laptop at the BAO tower, and on porter2 at EOL. The systemd service manager of RedHat Linux is used to start and monitor the services which run the various processing steps.  On flux, the processes are running under the aster userid.  On porter2 they are running under user maclean.  They are started automatically by systemd at bootup.  

To check the status of the services, use the sstat command.  It displays a tree of the various services, followed by an indication of "all services seem to be running", or it will list the missing services. 

If a process of a service isn't running, look at the system log file, /var/log/isfs/isfs.log, to help track down the problem.  Many of the scripts run by the services listed below also write to log files in $ISFF/projects/CABL/ISFF/logs.

On flux, the services are:

  • nc_server: the NetCDF server process that writes data received by statsproc and R to the NetCDF files
  • dsm_server@noqc_instrument: dsm_server process that receives and archives data from the DSMs on the tower.
  • statsproc@qc_geo_notiltcor:  computes statistics from the 300m tower for the qc_geo_notiltcor dataset, i.e. the files in netcdf_geo_notiltcor
  • statsproc@noqc_instrument:  computes statistics from the 300m tower for the noqc_instrument dataset, i.e. the files in netcdf_noqc_instrument
  • rsync_dsms: script that wakes up periodically and rsync's files from the DSMs on the tower,  then does merge_nightly.sh to merge and reprocess the previous day's files.
  • R_derived: runs R every 5 minutes to create derived values in the files on netcdf_geo_notiltcor
  • ssh_tunnel:  creates the ssh tunnel to NCAR

On porter2:

  • nc_server
  • cabl_flab_statsproc@qc_geo_notiltcor:  computes statistics from the 300m tower for the qc_geo_notiltcor dataset, i.e. the files in netcdf_geo_notiltcor
  • cabl_flab_statsproc@noqc_instrument:  computes statistics from the 300m tower for the noqc_instrument dataset, i.e. the files in netcdf_noqc_instrument
  • cabl_flab_statsproc2@qc_geo_notiltcor:  computes statistics from the bao and ehs flux stations for the qc_geo_notiltcor dataset, i.e. the files in netcdf_geo_notiltcor
  • cabl_flab_statsproc2@noqc_instrument:  computes statistics from the bao and ehs flux stations for the noqc_instrument dataset, i.e. the files in netcdf_noqc_instrument
  • rsync_flab: runs rsync_loop_flab.sh script, which wakes up periodically and rsync's files from flux, then does merge_nightly_flab.sh to merge and reprocess the previous day's files.
  • R_derived
  • proc_restarter:  runs every 10 seconds to check if a user has requested to restart any services

sstat will also show rsync_loop and statsproc@trh_test services on porter2. Those are running in support of the CentNet project.

Note that NetCDF files of 5 minute statistics are being created independently on flux and on porter2.  The files on flux may not be really needed, in which case the statsproc and R_derived services on flux could be disabled. The files created by porter2 are used by Ncharts and are rsync'd periodically to ftp://ftp.eol.ucar.edu/pub/archive/isff/projects/cabl.

Restart real-time service (restart_service, restart_statsproc)

If you make a change to the XML or a calibration file, you will usually want to restart the real-time statsproc processes.  Only if an XML change affects the archive of the raw data do you need to restart dsm_server on flux.

To restart the statsproc processes on flux or porter2, use the restart_statsproc command.  On flux it does a systemctl --user restart of the two statsproc services.  

On porter2 the processes are running under the maclean login, and only that user has permission to restart the services. As a workaround, restart_statsproc writes a string to the file $ISFF/projects/$PROJECT/ISFF/logs/restart_proc.txt. The proc_restarter service wakes up every 10 seconds, checks that file, and if it contains the string "statsproc", does a systemctl --user restart on the four statsproc services.
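
The request-file mechanism can be sketched as follows. This is not the actual proc_restarter script: the function name, the dry-run echo in place of a real restart, and the truncation of the request file afterwards are all assumptions made for illustration. The service names are the four porter2 statsproc services listed earlier.

```shell
#!/bin/bash
# Sketch of one polling pass of a proc_restarter-like loop.
# ASSUMPTIONS: function name, dry-run "echo", and clearing the request
# file after handling it are illustrative, not the real implementation.
STATSPROC_SERVICES="cabl_flab_statsproc@qc_geo_notiltcor
cabl_flab_statsproc@noqc_instrument
cabl_flab_statsproc2@qc_geo_notiltcor
cabl_flab_statsproc2@noqc_instrument"

handle_restart_request() {
    local request_file=$1
    [ -f "$request_file" ] || return 0
    if grep -q 'statsproc' "$request_file"; then
        local svc
        for svc in $STATSPROC_SERVICES; do
            # Dry run: print the command instead of executing it.
            echo "systemctl --user restart $svc"
        done
        : > "$request_file"    # clear the request so it is handled once
    fi
}

# A service would call this every 10 seconds, e.g.:
#   while true; do handle_restart_request "$request_file"; sleep 10; done
```

Writing a request to a file lets any user ask for a restart, while the restart itself still runs under the login that owns the services.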

Or you can use the command restart_service to restart any service running for CABL.  You will be prompted to choose a service, by number, and that service will be restarted in a similar way to restart_statsproc.

Reprocess statistics

To recalculate the statistics for the whole project, run this command on an EOL server (porter2, barolo, tikal), after setting your project to CABL:

statsproc -S qc_geo_notiltcor -B "2015 feb 18 00:00" -E "2015 jun 1 00:00"

If you want to recalculate the noqc_instrument dataset, set the -S option accordingly.  

The value of the NC_SERVER environment variable should be "porter2" so that the data is sent to nc_server on porter2.

On EOL systems, the default value of the DATADIR environment variable should be "merge", in which case statsproc will process all files on /scr/isfs/projects/CABL/merge.  If you want to process a different set of files, you can pass the list of files instead of the start and end time.  For example, the 50m files:

cd /scr/isfs/projects/CABL/raw_data

statsproc -S qc_geo_notiltcor  50m*

Recalculate derivations

To have the R_derived service recalculate the derived variables for the entire project period the next time it runs, remove this file:

rm $ISFF/projects/CABL/ISFF/logs/R_derived_last.txt

Time keeping

The DSMs each have a GPS with a pulse-per-second signal. Using NTP reference clock software, each DSM is then a stratum 1 time server.  NTP on the DSM uses the GPS reference clock to adjust the CPU system clock, and generally reports that the GPS reference clock has less than a 50 microsecond offset from the system clock.

To query the system clock on a tower DSM from flux, use the ntpq -p command, for example 50m:

ntpq -p 50m
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l  20d   64    0    0.000    0.000   0.000
oGPS_NMEA(0)     .GPS.            0 l    5   16  377    0.000   -0.001   0.031

The above shows the GPS reference clock is offset from system CPU clock by -0.001 milliseconds. The "reach" value for GPS_NMEA should be 377 (octal value of all 1's). The reach for the LOCAL clock is always 0.
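
The reach value is an 8-bit shift register printed in octal, with one bit per recent poll, so 377 means all of the last 8 polls succeeded. A small sketch for decoding it (the function name reach_count is hypothetical):

```shell
#!/bin/bash
# Count how many of the last 8 polls succeeded, given the octal "reach"
# value printed by ntpq -p.
reach_count() {
    local value=$(( 0$1 ))    # a leading 0 makes the shell read it as octal
    local count=0
    while [ "$value" -gt 0 ]; do
        count=$(( count + (value & 1) ))
        value=$(( value >> 1 ))
    done
    echo "$count"
}

reach_count 377    # prints 8: all of the last 8 polls were answered
reach_count 0      # prints 0: no polls answered (as for LOCAL above)
```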

flux is configured to use all 6 DSMs as network time servers, using chrony, an NTP client. To display the current chrony status, use chronyc sources:

chronyc sources
210 Number of sources = 6
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 50m                           1   10   377    558   -298ns[+5000ns] +/-  742us
^+ 100m                          1   10   377    179    +52us[  +52us] +/-  662us
^+ 150m                          1   10   377    219  +2000ns[+2000ns] +/-  739us
^* 200m                          1   10   377    338  +9656ns[  +15us] +/-  671us
^+ 250m                          1   10   377    439  +5677ns[  +11us] +/-  732us
^+ 300m                          1   10   377     95    +11us[  +11us] +/-  794us

The above shows that the clock on flux agrees with each DSM to within 52 microseconds, as indicated by the "Last sample" value in brackets, which is the last measured offset between the system time on flux and the source (in this case the DSM). According to the chronyc documentation, a positive offset means the local system clock is ahead of the source.

The second character, under "S" should be '*' (indicating chrony on flux is sync'd to this server) or '+' (good server).  You may also see '-' (recently on 100m for some reason) indicating chrony does not have a high opinion of its time information, relative to the others.

The "reach" values should again be 377. If not, it means the DSM is not on the network, or its NTP server is not responding or sync'd to its GPS.
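
The two checks above (state character and reach) can be applied to every source at once. A minimal sketch, assuming the column layout of chronyc sources shown above (the function name check_chrony_sources is hypothetical):

```shell
#!/bin/bash
# Read "chronyc sources" output on stdin and print one line for each source
# whose state is neither '*' nor '+', or whose reach is not 377.
# Prints nothing when every source looks healthy.
check_chrony_sources() {
    awk '
        /^\^/ {                       # source lines start with "^"
            state = substr($1, 2, 1)  # second character of the MS column
            name  = $2
            reach = $5
            if ((state != "*" && state != "+") || reach != "377")
                print name ": state=" state " reach=" reach
        }
    '
}

# Example usage on flux:
#   chronyc sources | check_chrony_sources
```

Empty output means all DSMs are reachable and considered good time sources by chrony.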

One can also ssh into bao or ehs and check their clocks with "ntpq -p".

 

 

The USB flash drive at 250m failed on March 5.  I didn't notice it until Mar 6.  I tried powering it up and down (tusb 1 0; tusb 1 1), and doing a soft reboot but it is still not mountable:

mount /dev/sda1 -t ext3 /tmp/usbdisk 
mount: No medium found
fsck.ext3 /dev/sda1
e2fsck 1.35 (28-Feb-2004)
fsck.ext3: No medium found while trying to open /dev/sda1

 

Note: on the "flux" laptop there is now a "lsu" command (as on the DSMs) which does a ssh to each DSM and runs lsu there, so it is a quick way to check whether the internal archives are working.

250m has a flash drive, not a pocketec.  I had 2 fail during testing in the staging area, which is quite unusual. They were not mountable on any system. There seems to be something about the titan usb port or the interface panel that is frying the flash drives.

With the duplicate data stream over the network, we did not lose any data from 250m (which was a bit lucky, since there was a problem with that too, which I also need to log). 

As a workaround, 250m is now writing data to the compact flash drive (/media/cf). That has about 1.6 GByte available.  One day's data for the 250m system is about 117 MByte, so there is plenty of space.
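
As a rough check of the headroom, assuming 1.6 GByte is taken as 1600 MByte and that no files are removed in the meantime (the function name days_of_capacity is hypothetical):

```shell
#!/bin/bash
# Whole days of data that fit in the available space, ignoring any files
# removed by the nightly rsync.
days_of_capacity() {
    local available_mb=$1
    local mb_per_day=$2
    echo $(( available_mb / mb_per_day ))
}

days_of_capacity 1600 117    # prints 13
```

So the compact flash could hold nearly two weeks of 250m data even if the nightly copy-and-remove step stopped working.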

Writing to /media/cf is enabled via a "bind" mount in  $ISFF/projects/CABL/ISFF/scripts/dsm/other.sh

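# On 250m only: make the compact flash appear at the usual archive mount point.
if [ "$(hostname)" = "250m" ]; then
    # Skip if something is already mounted on /media/usbdisk.
    if ! mount | fgrep -q /media/usbdisk; then
        mount --bind /media/cf /media/usbdisk
    fi
fi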

May want to do this on the other systems too, if this is going to be a problem in the future...

 

Status Update Thursday

Thursday March 5th:  

Tower Status:

u / v / w / spd /dir: spd.nw.300m only has reported data between 11:30 and 12:30 on 3/6

ldiag: the 250 meter ldiag sensors have several correlated flags?

 T / Tc / tc / Tcase / Tirga: ok

RH: ok

Ifan: ok

P: ok

 

BAO Status: data not populated past 18:00 on 3/5

T / Tcase.in / Tcase.out: ok

RH: ok

Ifan: ok

irgadiag: ok

u / v / w / spd / dir: ok

U / V / Spd / Dir: ok

Pirga: ok

Wetness: ok

co2 / h2o: ok

Rsw.in / Rsw.out: ok

Rpile.in / Rpile.out: ok

Tsoil / dTsoil_dt / Qsoil / Gsoil / Vheat / Tau63 / Lambdasoil: ok

Vmote / Vbatt / Iload / Icharge / Tbatt: Odd oscillation with Vmote.soil2.bao voltage???

 

EHS Status:

...

The DSM on the surface station at the BAO has a questionable port 4 - intermittent EC150 readings

bao ec150 up

Kurt and I went to the bao site to fix the EC150.  Swapping cables, reseating the terminal block connectors, and swapping EC100 electronics boxes didn't help.  Changing from a titan to an emerald port did.  This sensor is now on port 8 (this required Kurt to change the jumper on this port).  Gordon made the config changes from Boulder.  A detail: we also added a screen to the (new) EC100 temperature inlet.

In the process of lowering and raising the mast, the mast height shifted.  We moved it back, but it now is 1-2cm higher than before.  Also, the boom angle may have changed.  I took two photos to try to document the new boom angle.  Using the previous GPS location and resolving what the photos see on the horizon using GoogleEarth, I estimate 54 degrees rel N for one of the 2D sonic paths and 142 degrees rel N into the CSAT3 boom.  (These <should> be 90 degrees apart.)
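
As a quick consistency check on those two estimates, the separation between the bearings can be computed directly (the function name bearing_separation is hypothetical):

```shell
#!/bin/bash
# Smallest separation, in whole degrees, between two compass bearings.
bearing_separation() {
    local diff=$(( ($1 - $2) % 360 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( diff + 360 ))
    fi
    if [ "$diff" -gt 180 ]; then
        diff=$(( 360 - diff ))
    fi
    echo "$diff"
}

bearing_separation 142 54    # prints 88
```

The two photo-based estimates are therefore 88 degrees apart, 2 degrees short of the expected 90.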

I estimate that the mast was down from 1030-1110, then we messed with the mast height until 1130.

Conditions were snow flurries, T ~ -4C, wind 4 m/s.  Not great for this type of work!

 

Status update Monday

Monday March 2nd:  

Tower Status:

u / v / w / spd /dir: as of 16:00 today (3/2), all the following are reporting ok in u, v, w, dir, and spd

"nw.50m","se.50m","nw.100m","se.100m","nw.150m","se.150m","nw.200m","se.200m","nw.250m","se.250m","se.300m"

ldiag: flags for ldiag.nw.50m until 3/2 at 14:00

irgadiag: ok 

 T / Tc / tc / Tcase / Tirga: T.50m is NA starting 3/1 @ 07:00

RH: RH.50m is NA

Ifan: Ifan.50m is at 20 mA (half the value of the others)

P: ok

 

BAO Status:

T / Tcase.in / Tcase.out: ok

RH: ok

Ifan: ok

u / v / w / spd / dir: NA

U / V / Spd / Dir: ok

P / Pirga: ok, Pirga has lots of NA's

Wetness: ok

co2 / h2o: NA, NA

Rsw.in / Rsw.out: ok

Rpile.in / Rpile.out: ok

Tsoil / dTsoil_dt / Qsoil / Gsoil / Vheat / Tau63 / Lambdasoil: ok

Vmote / Vbatt / Iload / Icharge / Tbatt: Odd oscillation with Vmote.soil2.bao voltage???

 

EHS Status:

...