Blog

external network outage

The network to the outside world was down from about 0800 to 1200 this morning.  The LAN was fine throughout, so we didn't lose any data.  Somehow the cradlepoint on top of rim died.  Power cycling it from rimup with "vio 7 0; vio 7 1" brought it back.  The "router_check" script on rimup should have brought it back up, but it apparently ran once at 0800 and not again.  Gordon is investigating why this didn't work.

None of our data were affected by this issue.  Hopefully it was late enough that it didn't impact the just-finished IOP operations.

Comment by Gordon: The above was due to a bug in the crontab entry that checks the internet connection:

*/20 * * * *   net_check.sh eth0 192.168.0 192.168.0.5 && router_check.sh 7 www.google.com

For some reason it appears that the ethernet interface on the router died this morning, such that it didn't respond to pings from the DSM. The above crontab entry does not power cycle the router if the DSM can't ping it. The idea is not to power cycle the router and modems if the problem is at our end.

Changed it to the following, which will do a net_check.sh and router_check.sh every 10 minutes:

*/10 * * * *   net_check.sh eth0 192.168.0 192.168.0.5; router_check.sh 7 www.google.com
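
The difference between the two entries comes down to the shell operators: with "&&" the second command runs only if the first exits with status 0, while with ";" it runs regardless. The sketch below models just that control flow in Python. It assumes net_check.sh exits non-zero when the DSM cannot ping the router (consistent with the comment above, but the scripts themselves are not shown here), and the function names are hypothetical.

```python
# Hypothetical model of the two crontab entries. The real scripts are
# net_check.sh and router_check.sh; their internals are not shown in this log.


def run_old_entry(net_check_ok: bool) -> list:
    """Old entry: net_check.sh ... && router_check.sh ...

    With '&&', the shell runs the second command only when the first
    one exits with status 0.
    """
    ran = ["net_check.sh"]
    if net_check_ok:
        ran.append("router_check.sh")
    return ran


def run_new_entry(net_check_ok: bool) -> list:
    """New entry: net_check.sh ... ; router_check.sh ...

    With ';', the second command runs whatever the first one returned.
    """
    return ["net_check.sh", "router_check.sh"]


if __name__ == "__main__":
    # Assumed failure mode: the router's ethernet interface is dead, so
    # net_check.sh fails (assumption: it exits non-zero in that case).
    print("old entry ran:", run_old_entry(net_check_ok=False))
    print("new entry ran:", run_new_entry(net_check_ok=False))
```

Under that assumption, the old entry never reaches router_check.sh when the router itself stops answering pings, which matches what happened this morning.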

P.nne down yesterday

10/17, twh

I am concerned that P.nne.flr went down yesterday from 9:45 until 12:10.

SPO: 10/17 PM

We're noticing that data from this Bluetooth mote is unique in having more jitter in its reporting times.  All of the other Bluetooth motes have an average sample interval of 1.00 s, with a min/max dt of about 0.9/1.1s.  P.nne is coming in with the same average (1.00), but min/max dt are 0.6/1.5s.  Does this indicate that its radio isn't performing as well?  Could it point to a problem that might also cause an outage?
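
For reference, the mean/min/max dt numbers quoted above are the kind of statistic sketched below. This is a minimal illustration, not the tool we actually use; the function name and the sample timestamps are made up to show how two series can share the same mean interval while one has a much wider spread.

```python
def dt_stats(timestamps):
    """Return (mean, min, max) of the intervals between consecutive samples."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    dts = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(dts) / len(dts), min(dts), max(dts)


if __name__ == "__main__":
    # Made-up arrival times in seconds, for illustration only.
    steady = [0.0, 1.0, 2.0, 3.0, 4.0]
    jittery = [0.0, 0.5, 2.0, 3.0, 4.0]
    print("steady :", dt_stats(steady))
    print("jittery:", dt_stats(jittery))
```

Both series average 1 s per sample, so the mean alone would not flag the jittery one; the min/max dt is what gives it away.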

Daily status, October 17

10/17/13

Summary:
Hardwired floor radiation and soil to the DSM yesterday.
IOP 3 last night.

T/RH: RH.40m.near and RH.35m.rim perhaps a bit high.
         Ifan suspicious at near.30m and T a bit high both day and night.
P: P.nne down 10/16 09:45 - 12:10
csat u,v: ok
csat ldiag: ok
csat w, tc: ok
fluxes: ok
kh2o: FLR variance high at night, but also tc'tc'
motes: FLR OK! after 14:00 10/16
           FAR soil mote stopped 10/16 13:45 - 15:40.
           NEAR ok
Wetness: ok
radiation: ok
Tsoil: Tsoil.2.5cm.flr & far (linear avg) do not fit profile
Gsoil: ok
Qsoil: ok
Cvsoil: ok
2D sonic: ok

flr network outages

Dave reports several outages in the crater during the IOP tonight.  The following script reports eantf at 54dB (good?), and says it has been up for 12 hours.  On the other hand, far has only been up 3 min and near only 2 hours, even though their signal levels are now 37 and 39 dB, respectively (okay values).

As expected, data_stats shows that the raw_data archive is fine at flr.

On a subsequent run of check_ap24, eanth@sodar came back, with a signal level of 49dB.

All of the above comments are consistent with Gordon's message early in the project.

[aster@flux ~]$ check_ap24.sh
local  remote      (         mac-addr)    uptime  ccq  txrate  rxrate  rxsig  snr #txpkts  #rxpkts  txretry  rxretry txB/sec rxB/sec
ap24   ap24-3@near (00:15:6D:20:01:90)   1:56:06  98% 54.0Mbs 48.0Mbs -61dBm 39dB 9851146 9320039       0%       0%  103070  222934
ap24   ap24-2@rim  (00:15:6D:10:1C:0D)  92:33:31  98% 48.0Mbs 54.0Mbs -60dBm 36dB 9528375 9851681       0%       0%    4445   12784
ap24   eanti@far   (00:20:F6:05:24:56)   0:03:14 100% 11.0Mbs  2.0Mbs -63dBm 37dB     719     723       0%       0%     223    1186
awk: cmd. line:95: (FILENAME=- FNR=51) fatal: division by zero attempted
ap24-2             (00:15:6D:10:29:B2)  92:33:34 100% 54.0Mbs 48.0Mbs -51dBm 44dB 9929362 9450666       0%       0%   12784    4445
ap24-2 eantf@flr   (00:20:F6:05:24:5A)  12:07:31 100% 11.0Mbs 11.0Mbs -43dBm 54dB  982727  898352       0%       0%   17755    8838
awk: cmd. line:93: (FILENAME=- FNR=39) fatal: division by zero attempted
ap24-3 ap24@base   (00:15:6D:10:3C:BC)   1:56:09  97% 54.0Mbs 54.0Mbs -63dBm 32dB 9449388 9721786       0%       0%  222842  103027
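
A listing like the one above can be scanned for recently restarted links with a few lines of Python. This is only a sketch: it is not part of check_ap24.sh, the function names are hypothetical, and the one-hour threshold is an arbitrary choice for illustration. Lines that do not look like link rows (the header, the awk errors) are skipped.

```python
import re

# Matches "(MAC)  H:MM:SS " as printed in the check_ap24.sh listing.
UPTIME_RE = re.compile(r"\(([0-9A-Fa-f:]{17})\)\s+(\d+):(\d{2}):(\d{2})\s")
# Matches the snr column ("39dB") but not the rxsig column ("-61dBm").
SNR_RE = re.compile(r"\s(\d+)dB\s")


def parse_link(line):
    """Parse one link row; return None for headers and error lines."""
    m = UPTIME_RE.search(line)
    if m is None:
        return None
    names = line[:m.start()].split()
    hours, minutes, seconds = int(m.group(2)), int(m.group(3)), int(m.group(4))
    snr = SNR_RE.search(line, m.end())
    return {
        "local": names[0] if names else "",
        "remote": names[1] if len(names) > 1 else "",
        "uptime_s": hours * 3600 + minutes * 60 + seconds,
        "snr_db": int(snr.group(1)) if snr else None,
    }


def recently_restarted(lines, max_uptime_s=3600):
    """Return the remotes whose uptime is below a (hypothetical) threshold."""
    found = []
    for line in lines:
        link = parse_link(line)
        if link is not None and link["uptime_s"] < max_uptime_s:
            found.append(link["remote"] or link["local"])
    return found
```

Run against the rows above with the default threshold, this would single out eanti@far, which had been up for just over three minutes.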

10/16/13 ~21:50Z

Changed the Power Mote ID on 'up' from ID10 to ID110. (rs18)

This was done because the spn1 mote on this station also has ID10, and giving the power mote a distinct ID should make things a bit clearer.

In prep for the swap from Xbee to serial operation, I've now set pp=0 on all motes (ids 1,2,17).  I didn't "reboot" the motes, in order to get the maximum amount of data before the swap.  When Sebastian and Eric swap the cables shortly, the motes obviously will reboot.

The plan is for them to disconnect the receive mote on port 2, then plug cables from each of the 3 motes to ports 2, 9, 10.  Gordon updated the config earlier this morning in anticipation of this change.

14:00 The swap is done.  All data coming in as expected!  Thanks Sebastian and Eric!

10/16/13 ~18:30Z

Flr, Near, Far: Changed all xbee status message rates from half-hourly (sx=360 at the 5 sec data rate) to 0, which disables them.

The reason is that running 'wisardMessageDecoder' on the flux computer over the raw files, looking for mote COMMENTS, was not showing any messages of this type coming in, which is abnormal.   The procedure itself does work: I tried it via operator command (xs) and got these mote responses at flr:

ID1:\0x01\0x02Xbee:  CH=15 ID=6 DL=40625DF1 SP=3E8 ST=7D0 SM=0 SO=0 NP=49 PL=4 U\0x03\0x04\r

ID2:\0x01\0x02Xbee:  CH=15 ID=6 DL=40625DF1 SP=3E8 ST=7D0 SM=0 SO=0 NP=49 PL=3 Q\0x03\0x04\r

ID17:\0x01\0x02Xbee:  CH=15 ID=6 DL=40625DF1 SP=3E8 ST=7D0 SM=0 SO=0 NP=49 PL=4 m\0x03\0x04\r

However, we have been having outages, especially at flr, and it is conceivable that the motes' attempts to grab status values from the xbee were causing problems.   The motes perform this task by switching the xbee radios into 'command-mode' and then issuing specific commands to the xbee to read these parameters.   Timing is important: if a radio were to remain in command mode even though the mote issued the 'go back to data mode' command, it would be unable to send any data messages.   Testing in Boulder in the past showed that the method was working, but perhaps in a noisy rf environment the interaction becomes more dicey.   Timing also relies upon the 'guard-time' needed for the xbee i/o, the mote rtcc interrupts, etc.   Maybe disabling the status messages will help eliminate radio outages.
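
For anyone wanting to pull the fields out of status replies like the three quoted above, here is a minimal parsing sketch. It is not the wisardMessageDecoder code. The function name is hypothetical, and treating the values as hexadecimal is an assumption based on entries such as SP=3E8 and DL=40625DF1. The control characters are handled as the literal text shown in this log.

```python
import re

# Two-letter parameter name, '=', then a run of hex digits.
PAIR_RE = re.compile(r"\b([A-Z]{2})=([0-9A-Fa-f]+)\b")


def parse_xbee_status(message):
    """Return the NAME=value pairs that follow 'Xbee:' as a dict of strings."""
    _, sep, body = message.partition("Xbee:")
    if not sep:
        raise ValueError("not an Xbee status message")
    return dict(PAIR_RE.findall(body))


if __name__ == "__main__":
    # Copied from the ID1 reply above, control characters left as logged.
    msg = (r"ID1:\0x01\0x02Xbee:  CH=15 ID=6 DL=40625DF1 SP=3E8 ST=7D0 "
           r"SM=0 SO=0 NP=49 PL=4 U\0x03\0x04\r")
    fields = parse_xbee_status(msg)
    print(fields)
    # Assumption: values are hexadecimal, so SP=3E8 would be 1000.
    print("SP as integer:", int(fields["SP"], 16))
```

The values are kept as strings on purpose, so that the hex-versus-decimal question can be settled per parameter rather than guessed at in the parser.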

I tweaked the sonic azimuths on the profile towers and Gordon reran covars.

Then I tweaked the sonic tilt angles for the period 9/29 00:00 to 10/15 00:00.

Finally, I reran the covars again this morning.

Daily status, October 16

10/16/13

Summary:
Fixed Rlw.out.far pyrgeometer once again.
Today will attempt to hardwire floor Zigbee motes (radiation and soil) to the DSM.

T/RH: RH.40m.near and RH.35m.rim perhaps a bit high.
P: ok
csat u,v: ok
csat ldiag: ok
csat w, tc: ok
kh2o: FLR variance high at night, but also tc'tc'
motes: FAR ok 
           NEAR ok
           FLR all off and on; outages generally last 1:00 - 1:05 hrs.
Wetness: ok
radiation: Rpile.out.far ok
Tsoil: Tsoil.2.5cm.flr & far (linear avg) do not fit profile
Gsoil: ok
Qsoil: ok
Cvsoil: ok
2D sonic: ok

10/16/13 ~1:55Z

logged into flr.   id1 and id2 reporting, not id17.   <cr> restarted id17 connection.

because flr 'hb' cron now runs every hour, set xr=7200 for 2hourly resets if no hb received.

rs 2 again and found id17 not coming in.   played around a bit and then id2 also wasn't coming in.  However, all 3 still responding to any command.
'reboot' command to all 3 motes.     that looks better: all 3 messages coming in at once instead of individually (or not).
let's see if any of this helps....
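
The hb/xr interaction noted above can be pictured as a simple timeout, sketched below. This is a guess at the logic, not the mote firmware: the function name is hypothetical, and it assumes xr is in seconds (7200 s = 2 hours, which matches the "2hourly" note).

```python
def should_reset(last_hb_time, now, xr_seconds=7200):
    """True once no heartbeat has been seen for at least xr_seconds.

    Assumption: xr is a timeout in seconds, so xr=7200 means a reset
    after two hours without a heartbeat.
    """
    return (now - last_hb_time) >= xr_seconds


if __name__ == "__main__":
    hb_period = 3600  # the flr 'hb' cron now runs every hour
    # One hour after the last heartbeat: no reset yet.
    print(should_reset(last_hb_time=0, now=hb_period))
    # Two hours without any heartbeat: reset.
    print(should_reset(last_hb_time=0, now=2 * hb_period))
```

If that picture is right, an hourly heartbeat with xr=7200 leaves room for one late or lost heartbeat before a mote resets itself.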

Service visit to NEAR

10/15/13, twh

Cleaned radiometers around 15:29.

P.S. I don't notice much of an effect on Rsw.in -- certainly no more than 1 W/m2.  This is good.

Service visit to FAR

10/15/13, twh

Replaced Rlw.out.far.  It works!  Cleaned all radiometers.  Left site at 15:15.

At John's suggestion, I just modified the crontab on FLR to send a mote heartbeat every hour.  It had been set for every 4 hours.  Hopefully, this will bring back connections faster if they break.

Daily status, October 15

10/15/13

Summary:
10/14
Replaced floor serializer; kh2o improved
Tried recycling power on Rlw.out.near - no joy
Wack-A-Mote on crater floor
Replaced battery at P.ssw1 and fixed charging at P.w

10/15
Replaced radiation shield at 20m.rim; fan wired backwards

T/RH: RH.40m.near and RH.35m.rim perhaps a bit high.
P: ok
csat u,v: ok
csat ldiag: ok
csat w, tc: ok
kh2o: FLR variance high at night, but also tc'tc'
motes: FAR ok 
           NEAR rad out 10/15 01:35-05:20
           FLR all off and on; now restarting on the hour
Wetness: ok
radiation: Rpile.out.far flat-lined after 10/13, about 16:45
Tsoil: Tsoil.2.5cm.flr & far (linear avg) do not fit profile
Gsoil: ok
Qsoil: ok
Cvsoil: ok
2D sonic: Dir.10m.flr at times appears to be off by 180 deg.

Pulled FAR Rlw.out

Done from about 10:09-10:12.  We'll try to fix it in the base.

P.S. In the base, Tom pulled this sensor apart.  I noticed that the negative (blue) side of the thermopile was electrically connected to the outer conductive ring of the sensor disk due to a misplaced solder blob.  I heated this connection with a soldering iron and removed this contact.  Now there is no connection (infinite resistance) between this contact and the ring.  We'll replace it at far.