Blog from May, 2017

Daily status

Yet again, this is abbreviated.  One reason is that our software tools are broken:

  • I can't get to nagios through ops-perdigao.dyndns.org.  I thought Gary's last message said it should work.  Presumably, I could VPN in to ustar and run nagios locally, but I haven't tried this (see the sketch after this list).  Running from within the ops center on the .100 subnet, serving from 192.168.1.10, works.
  • eddy's WiFi stopped working yesterday.  It connects, but ping drops 50–80% of packets on all 3 wifi networks.  cockpit can't connect.  Since this is the only machine (other than ustar on the balcony) that can run cockpit, we no longer have cockpit.  My workaround plan is to run an ethernet cable from the balcony down to the ops center main floor to connect eddy back to the net.
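
Rather than running a browser on ustar over the VPN, an ssh local port forward (once the VPN is up) would let the laptop's own browser reach the nagios server.  This is only a sketch: it assumes ustar can reach 192.168.1.10 from where it sits, and the /nagios URL path is a guess.

> ssh -L 8080:192.168.1.10:80 ustar    # with the VPN up; 8080 is an arbitrary local port
# then browse http://localhost:8080/nagios on the laptop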

Started the morning with a drive to v07, which I recalled didn't come up yesterday.  Found that v07b is fine, but the power front panel on v07b squeals when I connect v07t power through the AUX connector.  I suspect something is quite wrong with v07t.  (I also recall that it was in our plan to ground v07t, but I don't think we did it.)  Since Andy took the day off, I couldn't climb to check this out.  v07b is running fine and archiving to its USB stick.

Then got a call from José Carlos that Generg was working on the power on the sw ridge.  With their work, the replacement of 2 12V power supplies, and the resetting of one differential protection circuit, the entire ridge is now up.  (The Generg crew may have reset more differential protection circuits.)  The Generg crew at first thought that part of the problem was that we had connected the ground wire of our power supplies to neutral, so they cut the ground wire off.  However, when they started to demonstrate the problem to me, they couldn't replicate it (Murphy's Law).  Thus, I think some of the ground wires are now connected and others aren't.  At tnw03, I found that the +24VDC from the DC-DC converter wasn't on, so I replaced the converter; however, the new one didn't start immediately, yet eventually did while I was "poking around".  Not sure yet which action worked.

 

Afternoon jaunt

After the daily meeting, and a trip with Lou to grab stakes, I ran back up to rsw08 to reset its power.  Found that all INPUT power in the fuse box was 0.  Was mystified, then got back to the ops center to find that the entire network was dead except for 3 towers (rne06, tnw09?, and tnw12, the latter 2 presumably paired with rne06).  Suspected a widespread power failure, so went to upper Orange Grove to look at the breaker panel servicing tse13.  Found Lou there, with power on and nothing obviously tripped, so a journey up the mountain was needed...

Drove the path: rsw01/02/03/tse13/12/11/10/09.  Found one blown DC supply (rsw03), but didn't have a spare, so had to return later on.  The differential protection had tripped at all of the tall masts.  Resetting appeared to bring these back to life.

Next drove up to tnw11 and walked to rne07.  Found another blown DC supply, and still didn't have a spare so had to drive back to ops and up again.  That brought rne07 up.  Also had to hit the differential protection at tnw10.

Finally, drove back up to rsw (since I had seen none of the rsw masts come online).  Nothing had tripped on the main breaker panel (near rsw06) and all the fuses checked okay.  Nevertheless, I cycled all of the breakers and differential protection switches.  After that, checked rsw06.  The differential protection had tripped (for the tower, but the DLR lidar was okay), yet input power was still 0 on all phases.  Same with rsw05.  Conclude that service by the electrical contractor is needed.

Thoughts:

  1. This network is unserviceable.  Who would put a major network access node in a location you have to hike to, creating multiple points of failure?
  2. The power system is unserviceable.  At every one of the towers, you have to carry an extension ladder to access the fuse box, which needs a tumbler-type key.  (That includes rne07, which you have to hike up to.)
  3. Our DC supplies all should have differential protection circuits.  Fortunately, we have a lot of these supplies – at least the 12V versions.
  4. There needs to be a better system to distribute information in a timely fashion.  I resorted to creating a fake daily briefing message, because I knew it would be sent to all participants.  The "boss phone" is fine for receiving updates, but not for sending them out.

 

Morning jaunt

Found this morning that rsw03 and rsw06 were down (and all of the networking that rsw06 relies upon).  This happened about 2030 last night.  After 2 trips up the ridge, found that the GFI switch in each tower's fuse box had tripped, presumably due to thunderstorms last night.  Both towers came up upon resetting these GFIs.  Now that we're back in the ops center, I see that rsw08 is also down.  This will require a third trip up the ridge.

While at rsw06, we finally replaced the TRH at 20m.  As Steve had mentioned to me, the tape on the clamp nearest the TRH had slipped, tilting the boom to channel water to the TRH connector.  Some water came out when I opened this up.  I attempted to relevel the boom and plugged the end with putty.  On the way down, I also plugged the 10m TRH boom.

 

tnw02 initial install

About 13:30, we powered up tnw02.  Due to rain (and forgetting 2 screws), we installed only the NR01 and the soils, about 45 min later.  These are now reporting.  When the rain abates, we'll add the sonics and TRHs.  Soil core taken during installation, as usual.

 

tnw08 now on

About 10:15, we powered up tnw08.  Everything appears to be working.  TOWER #39!!

Daily status

Again, pretty much ignored data – glad Gary is on top of this!

Morning: Finished prepping, though it took some work.  Couldn't get one EC100 box to send data, though the connection seemed fine.  (I <think> I got data out of it briefly after talking to it through ECMon (on Scot Lehrer's Windows laptop), but it died again.)  Had one last EC100 that was noted to have questionable pressure, but got it working (mostly just by reseating its connectors).  Also found that our last NR01 wasn't reporting Rlw.out (the other 3 values were fine).  Upon inspection, found 4 mashed wires inside (black, white, blue, grey).  Spliced them back together and all is happy.

Afternoon: Installed sensors at tnw08 (3 RMY).  While we were doing that, INEGI installed solar panels at tnw02, tnw05, and tnw08.  Thus, we <should> have been able to start tnw08.  However, we didn't have a power cable built up to connect to their box at the time, and they have now left with the only key to their battery boxes.  Andy will look for a key (triangle slot) in Castelo Branco tomorrow.

Supposedly, Samortecnica replaced the stake at rne04 today.  I haven't been up to see it yet.

I'll update the todo list with the new tasks.

 

Cycled power on rsw02

rsw02 is the PC104 Titan DSM, which only samples a Setra analog pressure sensor.  It went down about 9 May 05:30 UTC, so I cycled power on it using 'pio aux' on rsw02x.  It's back up now and sampling pressure at 20 Hz.

...Actually, it only stayed up for a few minutes and then stopped again, so I cycled power a second time.  The problem appears to be related to Diamond A/D interrupts (same as in ISFS-149?), but I have not investigated any further.  rsw02 had seemed stable for a long time, so I wonder what changed.

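For the record, the power cycle itself is just toggling the aux output from rsw02x with pio, something like the lines below.  The exact arguments are from memory, so double-check them before trusting this:

> pio aux 0      # aux power off (kills rsw02)
> sleep 10
> pio aux 1      # aux power back on
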
> less /var/log/messages
...
May  7 10:05:42 localhost kernel: pc104_irq_watchdog bark! #4107243, quiet IRQs=  5
May  7 10:10:42 localhost kernel: pc104_irq_watchdog bark! #4109043, quiet IRQs=  5
May  7 10:15:42 localhost kernel: pc104_irq_watchdog bark! #4110844, quiet IRQs=  5
May  7 10:20:42 localhost kernel: pc104_irq_watchdog bark! #4112644, quiet IRQs=  5

> irqs
IRQ      Interrupt Type            Total Int  Int/sec   
------------------------------------------------------
5:       PC104 dmd_mmat:           50         10        
19:      SC ohci_hcd:usb1:         26         5.2       
42:      SC ost0:                  237        47.4      
126:     GPIO eth0:                67         13.4      
129:     GPIO GPIO_17-PC104:       20         4         
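
A quick way to check whether the watchdog barks continue after the power cycle is just to grep the log again:

> grep pc104_irq_watchdog /var/log/messages | tail -5
# new barks with "quiet IRQs=  5" mean the Diamond A/D interrupt is still going quiet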

 

Daily status

Another abbreviated status.  Andy should now be in Lisbon.  UPorto class tour today.  I'm still the "boss".

Quick look through cockpit shows no new problems:

  • TRH.20m.rsw06 still needs to be replaced
  • Rsw.in.rsw04 needs to be reprogrammed.  Andy brought a sensor to swap today.
  • sonic+li7500.v06 brought themselves back to life yesterday!
  • 4 ARL Li7500s should arrive in the next day or two
  • 3 tnw masts should be built today.  Now prepped in ops center.
  • 3 tse masts should be started tomorrow, built by end of the week.  Prepping now... tse05 done.

Daily status

Didn't do too much monitoring today, hence no real status update.  Didn't have a full day due to a) sleeping in b) lunch at the Alvaide Association c) dinner at the "kid" restaurant.  Nevertheless, did prep almost all of the remaining 2 tnw towers.  (For some reason, I could get the last mote to send a power message, but NIDAS isn't parsing it to create Vdsm.  If I change to ASCII output, I can see it in the mote message and I can see that every 5th Wisard message is longer (and contains a 0x49 = "I"), but mote_dump (data_dump) and prep display no Vdsm messages.  The .xml files and mote "eecfg" parameters look identical to a mote that is working.)
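
For anyone else poking at this, the comparison I've been making is roughly the following.  The dsm/sample ids and the file name are placeholders, and the option letters are from memory, so check them before trusting this:

# Raw mote messages in ASCII: the longer every-5th Wisard message (with the 0x49 = "I") is visible here.
> data_dump -A -i 20,0x8000 raw_archive_file.dat
# Processed samples: if NIDAS were parsing that message, a Vdsm sample would show up here, but none do.
> data_dump -p -i 20,0x8000 raw_archive_file.dat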

Prepping required digging into the ARL and UND seatainers to get more RMYoungs.  Turns out UND packed many of their boxes with 2 per box, so we had plenty – at least 8? spares now – though only 1 spare data cable.

Tomorrow (actually, later today), I'll see how many of the 3 tse towers I can prep before Andy arrives.  Jose Palma also is having his students take a tour tomorrow.

 

Daily status

Nice weather after the cold front/rain yesterday.  Semmer has left, Andy to arrive Mon, so no pickup for 2 days.  I'm the Boss for another 3 days...

Still 38 instrumented towers.

  • rne04 supposedly was retensioned yesterday or Thurs.  Will wander up there to check.  If good, can add RMY. Tensions are now higher, up to 350, but no change to anchors.  Thus, this tower is now LESS SAFE.  Wish they had told me when they were there (as I had asked).
  • tnw02/04/08 supposedly have concrete in place, ready for towers.  Will wander out there too to check today.  Checking prepping in ops.  tnw02 concrete work is done.
  • tse05/14/15 will be worked on next week.  Prepping in ops.

Issues

  • li7500.v07 wanders off with bad values, but 4-hour crontab restart has been bringing it back
  • li7500.v06 is similar, but lidiag is always set, thus no values are coming through statsproc.  Will need some action.
  • tp01.tse02/07/12 all not reporting – suspected plugged into the wrong mote port (only works on the right-most port with the new fuses).  tse12 was plugged into the wrong port.  tse07 came back after being reseated (may have had a drop of water inside the TP01 side connector).  tse02 was plugged in correctly; I reseated it, but it didn't look bad.  It might have been okay, but I panicked and replaced the PIC.  Now working, but has wrong coefficients (PIC is for TP01:200241, probe was TP15/200668).
  • trh.20m.rsw06 still bad.  Will need action.
  • During prepping, have noticed things missing:
    • not enough motes.  The remaining 2 will be used for the 2 soil installations
    • only one mote console cable! (3 have a broken pin).  Will repair with spare connector.
    • having to use an ethernet cable that isn't one of Ted's, since I gave one to CU (thinking I had extras).
    • Ted had given away the ubiquiti for tse15.  I'm reprogramming the next-to-last spare. I think this is now ready.

 

RSW06 TRH.60m restarted

14:20  RSW06 TRH.60m went down again due to high current.  I reset the max current to 150 mA.

 

11:45   TRH was giving bad data. I suspect the problem is the connection between the SHT and the PIC. A PIO command fixed it.

 

5/5/17 Status

Weather:  Rain during the night, starting to clear this morning.

Stations:
      tnw05t - 20m sonic down, rain?
      rsw07 - 10m sonic down, rain?
      rsw06 - 20m and 60m TRH down, PIO did not help
      tse04b - 10m TRH, PIO restarted TRH, changed Ifan current to 150
      v06 - 20m sonic down, rain?

 

Daily banging

li7500.v06: pio 0/1; now set to a crontab every 4 hours

li7500.v07: pio 0/1; now set to a crontab every 4 hours
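
For reference, a minimal sketch of what such a crontab entry looks like; the pio port number is a placeholder, and cron may need the full path to pio:

# Every 4 hours: toggle the licor's power port off, pause, then back on.
0 */4 * * * pio 1 0 && sleep 10 && pio 1 1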

TRH.10m.tse04: CUR -> 150

 

15:40   TRH.40m.tse09 showing bad data.  PIO got it going again.  Will monitor.