On Friday afternoon, 2021-08-20, Steve, Chris, Jacquie, and I were at Marshall. The DSM and Ubiquiti were again rebooting intermittently. Jiggling connectors inside the battery box did nothing. Steve saw a little corrosion on the cable connection at the Victron, which supplied power to the DSM. Chris pointed out that the power cables have two conductor pairs for 12V, so it seemed extremely unlikely that the problem could be a poor connection in the power cable. Steve replaced the Victron, but the power interruptions continued after that.

We looked for a pattern in the timing of the reboots:

egrep -a -C 3 "Booting Linux" messages > booting-linux-messages.txt

The file is attached: booting-linux-messages.txt.  (This would be a great situation in which to have a battery-backed system clock on this DSM.)

Nagios shows ttstation down during the 16:00 hour at these times:

Host Down[08-20-2021 16:57:45] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:53:46] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:49:47] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:45:43] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:37:43] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:33:39] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:26:39] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:21:40] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:13:48] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%
Host Down[08-20-2021 16:06:40] HOST ALERT: ttstation;DOWN;HARD;1;CRITICAL - 128.117.81.51: rta nan, lost 100%

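According to the reboots file, reboots happened during that same hour at these times (the DSM log timestamps run six hours ahead of the Nagios timestamps, presumably UTC versus local time):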

22:52
22:47
22:41
22:39
a few reboots in between which never got time sync
22:23

No clear pattern. It's even difficult to see whether the Nagios ttstation failures line up with the reboots. Past incidents did seem to come at regular 10-12 minute intervals, but not this time.
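To make the comparison less of an eyeball exercise, here is a minimal Python sketch that computes the gaps between successive events and, for each reboot, finds the nearest Nagios alert. The timestamps are copied from the two lists above. The six-hour offset between the DSM log times and the Nagios times is my assumption (UTC versus local time), and the function names are made up for this example.

```python
from datetime import datetime, timedelta

# Nagios HOST ALERT times for ttstation, copied from the list above.
NAGIOS_DOWN = ["16:57:45", "16:53:46", "16:49:47", "16:45:43", "16:37:43",
               "16:33:39", "16:26:39", "16:21:40", "16:13:48", "16:06:40"]

# Reboot times from the reboots file (only those that got a time sync).
REBOOTS = ["22:52", "22:47", "22:41", "22:39", "22:23"]

# ASSUMPTION: DSM log times are 6 hours ahead of the Nagios times.
OFFSET = timedelta(hours=6)


def parse(t):
    """Parse HH:MM or HH:MM:SS; the date part is a dummy and is ignored."""
    fmt = "%H:%M:%S" if t.count(":") == 2 else "%H:%M"
    return datetime.strptime(t, fmt)


def intervals_minutes(times):
    """Gaps in minutes between successive events, in chronological order."""
    ts = sorted(parse(t) for t in times)
    return [round((b - a).total_seconds() / 60, 1) for a, b in zip(ts, ts[1:])]


def nearest_alert(reboot, alerts, offset=OFFSET):
    """Return (alert time, signed minutes from reboot to alert) for the closest alert."""
    local = parse(reboot) - offset
    best = min((parse(a) for a in alerts), key=lambda a: abs(a - local))
    return best.strftime("%H:%M:%S"), round((best - local).total_seconds() / 60, 1)


if __name__ == "__main__":
    print("Nagios gaps (min):", intervals_minutes(NAGIOS_DOWN))
    print("Reboot gaps (min):", intervals_minutes(REBOOTS))
    for r in sorted(REBOOTS):
        print(r, "->", nearest_alert(r, NAGIOS_DOWN))
```

Keep in mind that the reboot list is incomplete (some reboots never got a time sync), so a reboot with no nearby alert, or an alert with no nearby reboot, does not mean much on its own.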

The SPOL conjecture

Steve and Chris measured the batteries and looked for problems on the power interface card, and we scratched our heads for a while. Then came the real kicker: Steve realized that SPOL had been running while the reboots were happening, and now that SPOL had stopped, the DSM had not rebooted. Since we could find no indication of a problem in the hardware, the current running theory is that SPOL radiation was somehow interrupting power. That would also explain the unexpected outages in the iss4station Ubiquiti radio. Nothing else at that site has gone down except for the Ubiquiti, whose antenna points almost directly at SPOL.

Steve has emailed RSF to find out when SPOL has been running recently, so we can see if that coincides with the problem periods.

In the meantime, the DSM was turned around on the pole so that the metal backplate faces SPOL and can provide some shielding. Steve also grounded the DSM, as a matter of good practice.

There were no DSM reboots in the following 24 hours, and no iss4station outages, but SPOL has probably not been running.

Mitigation?

If the problem does turn out to be SPOL, and assuming the DSM can be shielded, I'm not sure how we can mitigate the interference with the Ubiquiti radios. The other radios have not had problems, so apparently it's enough to point the radios away from SPOL. Doing that here, however, would mean introducing another radio which can point at pedestal without pointing towards SPOL, then pointing iss4station and ttstation at that intermediate radio.



1 Comment

  1. Actually, I didn't have appropriate stuff to ground the DSM, so this task remains to be done.

    What I did do was wrap one of the guy wires around an old ground rod, but even that should be redone with a real ground rod clamp.

    The data I did get from RSF appears to confirm the interference theory!