
The NetCDF files at EOL hadn't been updated since this morning.

data_stats sock:barolo showed no data.

nidas_udp_relay was running on eol-rt-data, though it had been restarted this morning at 09:50 MDT. The restart wasn't due to a reboot; eol-rt-data has been up for 50 days.

On eol-rt-data, "data_stats sock::30010" showed data coming in. The same check can be run from any system at EOL:

data_stats sock:eol-rt-data.fl-ext.ucar.edu:30010

These errors started showing up in /var/log/isfs/isfs.log, every 10 seconds:

Sep 25 09:41:10 barolo dsm_server[44405]: ERROR|SocketConnectionThread: IOException: inet:128.117.188.122:30010: connect: Connection refused
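
"Connection refused" typically means nothing was accepting connections on that port at the time. When data_stats isn't handy, a plain TCP connect is a rough way to tell whether the relay port is listening at all. This is only a sketch: port_accepts_connections is a made-up helper name, and it says nothing about whether valid samples are flowing.

```python
import socket

def port_accepts_connections(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timeout, DNS failure) on error
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example, using the host and port from the data_stats command above:
# port_accepts_connections("eol-rt-data.fl-ext.ucar.edu", 30010)
```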

Eventually the socket open succeeded, but then this warning appeared:

Sep 25 09:50:11 barolo dsm_server[44405]: WARNING|SampleInputStream: inet:128.117.188.122:30010: raw sample not of type char(0): #bad=1,filepos=0,id=(609,25461),type=28,len=779247971

As with reading disk data, the reader skips forward one byte and looks for a good sample.
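
The skip-one-byte recovery can be sketched roughly as below. This is not the dsm_server code. The 16-byte little-endian header layout (8-byte time tag, 4-byte length, 4-byte id with the type in the top 6 bits) and the MAX_LEN limit are assumptions made for illustration; only the "type must be char(0)" test comes from the warning itself.

```python
import struct

HEADER = struct.Struct("<qII")  # assumed layout: time tag, length, id
MAX_LEN = 32768                 # hypothetical sanity limit on sample length

def header_ok(length, tid):
    """A header is plausible if the type is char (0) and the length is sane."""
    sample_type = tid >> 26     # assumed: type lives in the top 6 bits of the id
    return sample_type == 0 and 0 < length <= MAX_LEN

def find_good_sample(buf, start=0):
    """Return the offset of the first plausible header, or None if none is found."""
    pos = start
    while pos + HEADER.size <= len(buf):
        _tt, length, tid = HEADER.unpack_from(buf, pos)
        if header_ok(length, tid):
            return pos
        pos += 1                # bad header: skip forward one byte and retry
    return None
```

A check this loose can lock onto a false header in the middle of garbage, so a real reader presumably applies more tests than shown here.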

It's not clear why the data was corrupt, or why the reader didn't recover. It would be worth looking at the logs on eol-rt-data.
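
One thing that may help with the corrupt data question: the bad header fields in the warning can be packed back into raw bytes to see what was actually on the wire. The sketch below assumes a 4-byte length followed by a 4-byte id, both little-endian, with the id holding a 6-bit type, a 10-bit DSM id and a 16-bit sensor id. That layout is an assumption, not something this log confirms, and header_bytes is a made-up helper name.

```python
import struct

def header_bytes(length, sample_type, dsm_id, sensor_id):
    """Repack the logged fields into the 8 bytes they were (presumably) read from."""
    tid = (sample_type << 26) | (dsm_id << 16) | sensor_id  # assumed id packing
    return struct.pack("<II", length, tid)                  # assumed little-endian

# Values from the warning: len=779247971, type=28, id=(609,25461)
print(header_bytes(779247971, 28, 609, 25461))  # b'car.ucar'
```

Under those assumptions the fields decode to printable ASCII that looks like part of a domain name. With filepos=0, that hints that dsm_server may have been parsing text at the very start of the stream as if it were a sample header, rather than seeing random corruption. This is a guess to check against the eol-rt-data logs, not a diagnosis.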

Did a kill -TERM of dsm_server on barolo, and ran check_vertex_procs.sh by hand, rather than waiting for crontab.

Updated crontab to check the procs every 15 minutes, rather than 30.