This afternoon I went to Marshall to swap out the existing tower DSM for a replacement I had set up over the last week. The existing DSM has been dropping off the network intermittently for the past 10 or so days (see previous posts). It had stayed on the network the entire time since I rebooted it on Tuesday, but when I checked just before leaving Foothills (~4pm) it was down. So it's more reliable than it was earlier, but still not great.

The replacement DSM had new everything except the ethernet and USB patch cables and the POE injector. After powering it up I noticed there was no power to the serial board, then remembered we had to put in a 3A fuse for it to work, so I used the one from the other DSM, which fixed the power issue. I also noticed there was no power to the Ubiquiti, which turned out to be fixable by wiggling all the connections for the POE injector (sorry Steve). I also unplugged the ribbon cable from the serial board to the power board, as in the existing DSM, so pio can't turn off power to the Ubiquiti and the switch.

After making sure everything was powered I logged in, first with the console cable and then through the ethernet switch, to check data and networking, and found that the DSM was randomly rebooting or hanging. In the hour or so I spent troubleshooting, I don't think it stayed up for more than about 5 minutes between reboots. I also noticed that sometimes the dsm process wasn't running even though the DSM had just been rebooted, which sounds like the same problem I noticed in the old DSM last week that started this whole mess. The DSM seemed to work normally right after a reboot, then after a few minutes it would get much slower (commands like ifconfig and lsusb taking much longer than usual), and eventually the console would stop responding entirely until I rebooted the DSM. Often the ethernet interface seemed to go down first and I'd lose the ssh connection but could still reach the console through the console cable; then it would get slower and slower and finally hang. I looked through the logs a little but haven't found anything useful. I did get part of one kernel error message, but the rest was cut off and I couldn't scroll up in minicom to see it:

[  259.784447]  r7:00000001 r6:8147db18 r5:80fe6108 r4:bb8110f0
[  259.793432] [<80184d70>] (generic_handle_irq) from [<80675fb8>] (bcm2836_arm_irqchip_handle_ipi+0xa8/0xc8)
[  259.809290] [<80675f10>] (bcm2836_arm_irqchip_handle_ipi) from [<80184db4>] (generic_handle_irq+0x44/0x54)
[  259.825292]  r7:00000001 r6:00000000 r5:00000000 r4:80e90d10
[  259.834362] [<80184d70>] (generic_handle_irq) from [<80185514>] (__handle_domain_irq+0x6c/0xc4)
[  259.849308] [<801854a8>] (__handle_domain_irq) from [<801012c8>] (bcm2836_arm_irqchip_handle_irq+0x60/0x64)
[  259.865649]  r9:81564000 r8:00000001 r7:81565f64 r6:ffffffff r5:60000013 r4:801088c4
[  259.879499] [<80101268>] (bcm2836_arm_irqchip_handle_irq) from [<80100abc>] (__irq_svc+0x5c/0x7c)
[  259.894869] Exception stack(0x81565f30 to 0x81565f78)
[  259.903313] 5f20:                                     00000000 0004aea8 b6b682c4 80118e40
[  259.917758] 5f40: ffffe000 80f05058 80f050a0 00000008 00000001 81031700 80cfc46c 81565f8c
[  259.932221] 5f60: 81565f90 81565f80 801088c0 801088c4 60000013 ffffffff
[  259.942642] [<8010887c>] (arch_cpu_idle) from [<809fee60>] (default_idle_call+0x4c/0x118)
[  259.957068] [<809fee14>] (default_idle_call) from [<80156514>] (do_idle+0x118/0x168)
[  259.970946] [<801563fc>] (do_idle) from [<80156838>] (cpu_startup_entry+0x28/0x30)
[  259.984619]  r10:00000000 r9:410fd034 r8:0000406a r7:81049390 r6:10c0387d r5:00000003
[  259.998656]  r4:00000092 r3:60000093
[  260.005199] [<80156810>] (cpu_startup_entry) from [<8010eaec>] (secondary_start_kernel+0x164/0x170)
[  260.020696] [<8010e988>] (secondary_start_kernel) from [<001018ac>] (0x1018ac)
[  260.031885]  r5:00000055 r4:0155806a
[  260.038433] CPU2: stopping
[  260.043826] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D  C        5.10.52-v7+ #1441
[  260.057830] Hardware name: BCM2835
[  260.064012] Backtrace: 
[  260.068935] [<809eff28>] (dump_backtrace) from [<809f02b8>] (show_stack+0x20/0x24)
[  260.082188]  r7:ffffffff r6:00000000 r5:60000193 r4:80fe5e94
[  260.091192] [<809f0298>] (show_stack) from [<809f44c8>] (dump_stack+0xcc/0xf8)
[  260.102210] [<809f43fc>] (dump_stack) from [<8010e488>] (do_handle_IPI+0x30c/0x340)
[  260.115558]  r10:80cfc46c r9:81562000 r8:8147e000 r7:00000001 r6:35cc2000 r5:00000002
[  260.129144]  r4:81049380 r3:f00ffb56
[  260.135469] [<8010e17c>] (do_handle_IPI) from [<8010e4e4>] (ipi_handler+0x28/0x30)
[  260.148636]  r9:81562000 r8:8147e000 r7:00000001 r6:35cc2000 r5:00000015 r4:814c13c0
[  260.162013] [<8010e4bc>] (ipi_handler) from [<8018bd70>] (handle_percpu_devid_fasteoi_ipi+0x80/0x154)
[  260.177267] [<8018bcf0>] (handle_percpu_devid_fasteoi_ipi) from [<80184db4>] (generic_handle_irq+0x44/0x54)
[  260.193277]  r7:00000001 r6:8147db18 r5:80fe6108 r4:bb8110e0
[  260.202272] [<80184d70>] (generic_handle_irq) from [<80675fb8>] (bcm2836_arm_irqchip_handle_ipi+0xa8/0xc8)
[  260.218146] [<80675f10>] (bcm2836_arm_irqchip_handle_ipi) from [<80184db4>] (generic_handle_irq+0x44/0x54)
[  260.234163]  r7:00000001 r6:00000000 r5:00000000 r4:80e90d10
[  260.243229] [<80184d70>] (generic_handle_irq) from [<80185514>] (__handle_domain_irq+0x6c/0xc4)
[  260.258182] [<801854a8>] (__handle_domain_irq) from [<801012c8>] (bcm2836_arm_irqchip_handle_irq+0x60/0x64)
[  260.274535]  r9:81562000 r8:00000001 r7:81563f64 r6:ffffffff r5:60000013 r4:801088c4
[  260.282803] SMP: failed to stop secondary CPUs
[  260.288405] [<[[ 2n0.2960e0]p---[ -nd ternelipan c - not syncing  Fataleexception-in in+errupt ]-)-
[  260.308653] Exception stack(0x81563f30 to 0x81563f78)
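
Since the start of the trace scrolled off in minicom, it may help next time to save the console output to a file (minicom has a capture option, -C) and condense it afterwards. The following is a minimal Python sketch, not a tested tool: it assumes the console output is saved as plain text in the same format as the lines above, and the function name summarize and the set of "event" patterns are my own choices for illustration. It separates the call-chain frames from headline messages such as "CPU2: stopping" and skips the register dumps.

```python
import re

# Matches stack frames such as:
# [  260.135469] [<8010e17c>] (do_handle_IPI) from [<8010e4e4>] (ipi_handler+0x28/0x30)
FRAME_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\[<[0-9a-f]+>\]\s+\((?P<callee>[^)]+)\)"
    r"\s+from\s+\[<[0-9a-f]+>\]\s+\((?P<caller>[^)]+)\)"
)

# Headline messages worth surfacing (an assumed, incomplete list of patterns).
EVENT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<msg>(?:CPU\d+: stopping|CPU: \d+ PID|SMP:|Kernel panic).*)$"
)


def summarize(lines):
    """Split kernel console lines into (events, frames).

    events: list of (timestamp, message)
    frames: list of (timestamp, callee, caller)
    Register dumps and anything unrecognized (including garbled lines)
    are skipped.
    """
    events, frames = [], []
    for line in lines:
        line = line.rstrip("\n")
        m = FRAME_RE.match(line)
        if m:
            frames.append((float(m["ts"]), m["callee"], m["caller"]))
            continue
        m = EVENT_RE.match(line)
        if m:
            events.append((float(m["ts"]), m["msg"]))
    return events, frames
```

To use it, open the saved capture (for example a hypothetical console.log) and pass the file object to summarize(). Garbled lines like the one at 260.288405 won't match either pattern, so they still need to be read by eye.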

I eventually decided I wasn't getting anywhere with troubleshooting, so I swapped back to the existing DSM (which at least stays up for days rather than minutes). The original DSM is back in place and is still online as of writing this blog post. Presumably if it goes down again, a reboot will get it running again.

Since we've seen similar behavior in two DSMs at Marshall (and since the replacement DSM wasn't doing anything like this while I was setting it up), I'm wondering if it's something to do with power after all. Or maybe it's a problem that only occurs when the sensors, Ubiquiti, etc. are plugged in?