Okay, this is an attempt to document the issues of the past 2 days.

  1. The ehs USB stick died; I do not yet know when. fdisk reports no partition on it. I replaced it with a new USB stick about an hour ago and ehs is now working.
  2. Santiago rebuilt/started eol-rt-data yesterday.  Apparently, in the process the ssh configuration was modified to disallow (some) connections to porter2.  
  3. Yesterday, in an attempt to solve the eol-rt-data issue by myself, I tried restart_process of ssh_tunnel on flux. This killed ssh to flux from the outside. SRS rebooted flux last night at about 1700 and I rebooted it again at about 1200 today (2 separate trips to the tower). That restored the connection from flux to eol-rt-data, but traffic still failed to get all the way through to the eol machines because of the eol-rt-data configuration issues. I don't know whether the flux problem was a red herring that would have fixed itself once eol-rt-data was fixed.
  4. Even once eol-rt-data was fixed (by Ted restoring an old image of the virtual machine!), which restored the ssh tunnel to flux, connections to bao and ehs were still broken. We found that Gordon's check_udp_... only reported errors and failed to restart the udp_ tunnel process. Running this by hand finally got data flowing.

Lessons:

  1. Santiago/Ted were unaware that CABL even existed and didn't think to check with Gordon before proceeding with system work.  They did ask (vacationing) Gary, who didn't educate them.
  2. We still need to fix check_udp_... so that it restarts the udp_ tunnel process automatically instead of only reporting errors; I'm guessing that this is just a path issue.
  3. Ted thinks that virtual machine "snapshots" were the cause of the config errors and has decreed that they should be avoided in the future.
  4. Santiago wants to better document eol-rt-data and is thinking about splitting up some of its services onto other virtual machines. 
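
For lesson 2, here is a minimal sketch of the check-then-restart pattern we want from check_udp_..., written in Python purely for illustration. It is not Gordon's script, and every name in it (`ensure_running`, `pgrep_alive`, the `/usr/bin/pgrep` location, the example paths) is a hypothetical stand-in. The one point it tries to make is that every command is called by absolute path, which matters if the path guess is right and the check runs from an environment with a minimal PATH.

```python
import subprocess
from typing import Callable, Sequence


def pgrep_alive(pattern: str, pgrep: str = "/usr/bin/pgrep") -> bool:
    """Return True if some process command line matches `pattern`.

    The pgrep location is an assumption; it is given as an absolute path so
    the check does not depend on the caller's PATH.
    """
    result = subprocess.run([pgrep, "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0  # pgrep exits 0 only when something matched


def ensure_running(
    is_alive: Callable[[], bool],
    restart_cmd: Sequence[str],
    run=subprocess.run,
) -> str:
    """Restart a process if it is not alive.

    Returns "ok" if nothing needed doing, "restarted" if the restart command
    exited 0, and "restart_failed" otherwise, so the caller can still report
    the error.
    """
    if is_alive():
        return "ok"
    # restart_cmd[0] should be an absolute path (hypothetical in this sketch).
    result = run(list(restart_cmd), check=False)
    return "restarted" if result.returncode == 0 else "restart_failed"


# Hypothetical usage, with placeholder names rather than the real scripts:
#   status = ensure_running(
#       lambda: pgrep_alive("udp_tunnel_placeholder"),
#       ["/full/path/to/restart_placeholder", "udp_tunnel_placeholder"],
#   )
```

Returning a status rather than just printing keeps the existing error reporting possible while adding the restart step.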

Impact:

  1. BAO tower data should be fine since the DSMs kept running and flux was up to rsync their data. All of these data should be rsynced (and new statsproc files and wwwplots generated) soon.
  2. bao data also should be fine since its DSM was up and saving to local storage. It appears that porter2 was able to rsync its data from 0325 and will be able to run tonight. flux wasn't able to get these data and has a gap between 0325_195959 and 0326_202418.
  3. ehs data were archived on flux (via udp transmission) until the 0325_195959 time and are missing until 0326_202418, as with bao. The porter2 files are even worse, with no data from 0325, undoubtedly due to the bad USB stick. No more data will be filled in by the nightly merge. Call this 24 hours of data lost.
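
As a quick check on the "24 hours" figure, the gap between the two stamps above can be computed directly. This is only a sketch: it assumes the stamps read as MMDD_HHMMSS, and since the year is not recorded here the `year` argument is an arbitrary placeholder (it does not change the answer for late-March dates). `gap_length` is a made-up helper name.

```python
from datetime import datetime, timedelta


def gap_length(last_before: str, first_after: str, year: int = 2001) -> timedelta:
    """Time between two stamps, assuming the form MMDD_HHMMSS.

    `year` is an arbitrary placeholder because the stamps carry no year.
    """
    fmt = "%Y%m%d_%H%M%S"
    start = datetime.strptime(f"{year}{last_before}", fmt)
    end = datetime.strptime(f"{year}{first_after}", fmt)
    return end - start


gap = gap_length("0325_195959", "0326_202418")
print(gap)                          # 1 day, 0:24:19
print(gap.total_seconds() / 3600)   # a little over 24 hours
```

If the stamps are file start times rather than last/first sample times, the real hole could differ somewhat from this number.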

Whew!
