After seeing Jacquie's blog post yesterday about no data I did some troubleshooting (see comment on Jacquie's post) but only succeeded in taking the DSM off the net entirely. I went out to Marshall to reboot it twice, each time bringing it up briefly but then it would go down again. Went back for a longer visit in the afternoon with Gary (thanks Gary!) to do more troubleshooting. 

The DSM always came up after reboot, but wouldn't stay up very long. When we were logged in over the console, we could see kernel error messages before it went down. In one case we saw that nidas was running correctly just after rebooting, but eventually stopped (couldn't find anything relevant in the logs) and had to be restarted with dup. Seems like this is the same problem as what Jacquie noted originally. We also sometimes saw error messages about I/O errors trying to access the usb stick when starting nidas or using the lsu command. 

We did some testing of the pio command, and it seems that using pio -v to see the current state sometimes turns power off to some or all of the ports. pio can turn off power to bank1 (powering the switch) and 28v (powering the ubiquiti), either of which would take the dsm off the net when you're logged in remotely. Until we can fix the pio -v behavior we unplugged the cable between the autoconfig board and the power panel, so using pio -v shouldn't be able to turn off power to bank1 or 28v. 

We also noticed that it was taking a very long time for the dsm to get the correct time after reboot (sometimes like half an hour). It turns out that pps0 wasn't giving any data, it was on pps1 instead. We tried modifying chrony.conf and disconnecting the pps, but neither seemed to do much (Gary, you can add stuff if you remember more of the details).

Eventually we decided that the DSM wasn't reliable enough to leave up because of all the kernel errors, so we replaced the pi with one from a spare DSM we brought. That worked, and Gary made some updates remotely:

I had to install xinetd on the DSM to get check_mk to work from snoopy, and then nagios noticed that the USB filesystem was missing.  So I decided to upgrade the whole DSM before rebooting it, and it turns out there were quite a few updates.  The USB came back and has not failed all night, but it's good that it's getting checked now.

I also modified the chrony.conf to add the iburst option to the tardis server line.  That makes chrony sync much faster to the network NTP server, and it makes sense to rely on that since the PPS is not working.

Finally, I discovered a bug with the data_stats and dashbaoard, where the data_stats will just keep counting up from 2010 if the system time was wrong when data_stats started.  The quick workaround was to restart the json_data_stats service after the system time was correct.

The DSM went down again at 8:20 this morning. Dan is at Marshall this morning and rebooted the DSM around 11:30, which brought it back on the net. I looked through the logs and it seems like this time the DSM was collecting data while it was off the net, so it may have just been an ethernet issue. DSM is back off the net now though, about 10 minutes after Dan rebooted it.

Since the DSM clearly still isn't reliable, I'm thinking the next step would be setting up a whole new dsm + SD card as tt when I'm in the office tomorrow...