NetCDF files at EOL hadn't been updated since this morning.

Running "data_stats sock:barolo" showed no data.

nidas_udp_relay was running on eol-rt-data, though it had been restarted this morning at 09:50 MDT.  The restart wasn't due to a reboot; the system has been up for 50 days.

On eol-rt-data, "data_stats sock::30010" showed data coming in.  Alternatively, from any system at EOL, you can run:

data_stats sock:eol-rt-data.fl-ext.ucar.edu:30010

These errors started showing up in /var/log/isfs/isfs.log, every 10 seconds:

Sep 25 09:41:10 barolo dsm_server[44405]: ERROR|SocketConnectionThread: IOException: inet:128.117.188.122:30010: connect: Connection refused

Eventually the socket connection succeeded, but then this error appeared:

Sep 25 09:50:11 barolo dsm_server[44405]: WARNING|SampleInputStream: inet:128.117.188.122:30010: raw sample not of type char(0): #bad=1,filepos=0,id=(609,25461),type=28,len=779247971

As with reading disk data, the reader skips forward one byte and looks for a good sample.

Not sure why the data was corrupt, or why the reader didn't recover. It would be good to look at the logs on eol-rt-data.

Did a kill -TERM of dsm_server on barolo, and ran check_vertex_procs.sh by hand, rather than waiting for cron to run it.

Updated crontab to check the procs every 15 minutes, rather than 30.
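
The contents of check_vertex_procs.sh aren't shown here, so the following is only a minimal sketch of what a cron-driven process check of this kind might look like. The function names and the script path in the crontab comment are hypothetical; only the 15-minute interval comes from the note above.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; NOT the real check_vertex_procs.sh.

# Succeed if a process with exactly this name is currently running.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Print one status line per process name given.
# Returns non-zero if any of them is not running.
check_procs() {
    status=0
    for name in "$@"; do
        if is_running "$name"; then
            echo "$name: running"
        else
            echo "$name: NOT running"
            status=1
        fi
    done
    return $status
}

# Example crontab entry (path is an assumption) to run a check
# every 15 minutes:
#   */15 * * * * /path/to/check_vertex_procs.sh
```

A real script would presumably restart whatever is missing instead of just reporting it; for example, "check_procs dsm_server || echo restart needed" could be a starting point.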

2 Comments

  1. Part of this mess was me.  I tried to kill dsm_server on eddy while I was flailing away at getting /scr/isfs mounted.  (I thought it wouldn't be nice to unmount this disk while raw_data files were being written to it.)  I wanted to stop dsm_server, not restart it, so I didn't use the script and just used kill.  However, dsm_server restarted itself immediately.  I ended up doing a bunch of other stuff rather than unmount the disk.

    I guess I should mention that, before I worked on it this morning, /scr/isfs (where both VERTEX/netcdf and /raw_data are) had been offline with I/O errors (that I <may> have fixed) for at least a day.

     

  2. Yea, systemd restarts dsm_server if it exits.  Perhaps there's a way to configure it so that it won't restart it after a specific signal, such as TERM. Looks like the setting is "Restart=on-abnormal" instead of "on-failure".

    To stop it so that systemd won't restart it:

    systemctl --user stop dsm_server

    To start it again, substitute "start" for "stop".

    Since it's a user service, sudo isn't needed.
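
To check which restart policy is actually in effect before changing anything, something like the sketch below could be used. It assumes dsm_server is a systemd user unit, as the stop command in comment 2 implies; the helper function names are made up for illustration.

```shell
# Sketch only: assumes dsm_server is a systemd user unit.

# Print the unit's current restart policy, in the form Restart=<policy>.
show_restart_policy() {
    systemctl --user show "$1" -p Restart
}

# Print the unit file along with any drop-in overrides.
show_unit_file() {
    systemctl --user cat "$1"
}

# Usage, on the host running the service:
#   show_restart_policy dsm_server
#   show_unit_file dsm_server
```

If the policy needs changing, "systemctl --user edit dsm_server" opens a drop-in override where a different Restart= value can be set.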