Logged in to see what's up. Steve fixed the udev rules so now the pwrmon is reporting. I noticed in the dsm logs that dsm statsproc from relampago was still trying to run, so I disabled it and turned off the service. Steve rsynced the data files to barolo.

Looked at logs to see if I could figure out why it's been off the net so much. Looks like it's rebooting frequently due to USB problems:

Mar  8 15:15:10 s17 kernel: [   42.553988] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 15:15:10 s17 kernel: [   42.555830] usb 1-1.5: Failed to suspend device, error -71
Mar  8 15:15:10 s17 kernel: [   42.562454] usb 1-1.5: USB disconnect, device number 118
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Mar  8 01:17:06 s17 kernel: [    0.000000] Linux version 4.9.35-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1014 SMP Fri Jun 30 14:47:43 BST 2017
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: div instructions available: patching division code
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Mar  8 01:17:06 s17 kernel: [    0.000000] OF: fdt:Machine model: Raspberry Pi 2 Model B Rev 1.1
Mar  8 01:17:06 s17 kernel: [    0.000000] cma: Reserved 8 MiB at 0x3a800000
Mar  8 01:17:06 s17 kernel: [    0.000000] Memory policy: Data cache writealloc
Mar  8 01:17:06 s17 kernel: [    0.000000] On node 0 totalpages: 241664
Mar  9 01:17:14 s17 kernel: [   16.952991] usb 1-1.5: Product: USB 2.0 Hub
Mar  9 01:17:14 s17 kernel: [   16.954917] hub 1-1.5:1.0: USB hub found
Mar  9 01:17:14 s17 kernel: [   16.955420] hub 1-1.5:1.0: 4 ports detected
Mar  9 01:17:14 s17 kernel: [   17.171205] hub 1-1.5:1.0: hub_ext_port_status failed (err = -71)
Mar  9 01:17:14 s17 kernel: [   17.172380] usb 1-1.5: Failed to suspend device, error -71
Mar  9 01:17:14 s17 kernel: [   17.226561] usb 1-1.5: USB disconnect, device number 28
Mar  9 01:17:14 s17 kernel: [   17.520553] usb 1-1.5: new full-speed USB device number 29 using dwc_otg
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
Mar  9 01:17:06 s17 kernel: [    0.000000] Linux version 4.9.35-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1014 SMP Fri Jun 30 14:47:43 BST 2017
Mar  9 01:17:06 s17 kernel: [    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
Mar  9 01:17:06 s17 kernel: [    0.000000] CPU: div instructions available: patching division code
Mar  9 01:17:06 s17 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Mar  9 01:17:06 s17 kernel: [    0.000000] OF: fdt:Machine model: Raspberry Pi 2 Model B Rev 1.1
Mar  9 01:17:06 s17 kernel: [    0.000000] cma: Reserved 8 MiB at 0x3a800000
Mar  9 01:17:06 s17 kernel: [    0.000000] Memory policy: Data cache writealloc
Mar  8 15:19:10 s17 kernel: [   42.911045] 
Mar  8 15:19:10 s17 kernel: [   42.911078] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 3, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911078] 
Mar  8 15:19:10 s17 kernel: [   42.911112] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 4, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911112] 
Mar  8 15:19:10 s17 kernel: [   42.911181] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 7, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911181] 
Mar  8 15:19:10 s17 kernel: [   42.911214] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 5, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911214] 
Mar  8 15:19:10 s17 kernel: [   42.911247] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 2, DMA Mode -- ChHltd set, but reason for hMar  8 01:17:06 s17 kernel: [    0.000000] Booting Linux on physical CPU 0xf00
Mar  8 01:17:06 s17 kernel: [    0.000000] Linux version 4.9.35-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1014 SMP Fri Jun 30 14:47:43 BST 2017
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: div instructions available: patching division code
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Mar  8 01:17:06 s17 kernel: [    0.000000] OF: fdt:Machine model: Raspberry Pi 2 Model B Rev 1.1

Lots of these reboots in the logs. Interestingly when the system reboots it seems to always come up with a time right around 01:17:05 of the current day, even if it means jumping back in time by minutes or hours.

There were some other usb messages in the logs that didn't seem to trigger a reboot, but were still notable:

Mar  8 01:17:10 s17 kernel: [   13.663002] usb 1-1.5: Product: USB 2.0 Hub
Mar  8 01:17:10 s17 kernel: [   13.664314] hub 1-1.5:1.0: USB hub found
Mar  8 01:17:10 s17 kernel: [   13.664799] hub 1-1.5:1.0: 4 ports detected
Mar  8 01:17:11 s17 kernel: [   13.980630] usb 1-1.5.1: new full-speed USB device number 35 using dwc_otg
Mar  8 01:17:11 s17 kernel: [   13.982863] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.983375] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.983959] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.984481] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.984980] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.984988] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.985569] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.986115] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.986665] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.987144] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.987727] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.988208] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.988215] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.988765] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.989280] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.989862] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.990342] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.990997] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.991627] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.991637] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.992182] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.992779] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.993295] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.993872] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.994386] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.994945] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.994952] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.995477] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.995515] usb 1-1.5-port1: unable to enumerate USB device
Mar  8 01:17:11 s17 kernel: [   13.996019] usb 1-1.5-port1: cannot disable (err = -71)

I saw this message for both port 1 and port 2 of usb 1-1.5.

Mar  8 01:17:40 s17 kernel: [   43.600792] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 0, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.600792] 
Mar  8 01:17:40 s17 kernel: [   43.600848] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 7, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.600848] 
Mar  8 01:17:40 s17 kernel: [   43.600907] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 1, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 01:17:40 s17 kernel: [   43.600907] 
Mar  8 01:17:40 s17 kernel: [   43.600983] hub 1-1:1.0: hub_ext_port_status failed (err = -71)
Mar  8 01:17:40 s17 kernel: [   43.601058] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 4, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601058] 
Mar  8 01:17:40 s17 kernel: [   43.601100] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 6, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601100] 
Mar  8 01:17:40 s17 kernel: [   43.601144] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 0, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601144] 
Mar  8 01:17:40 s17 kernel: [   43.601192] usb 1-1-port5: cannot reset (err = -71)
Mar  8 01:17:40 s17 kernel: [   43.601254] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 7, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601254] 
Mar  8 01:17:40 s17 kernel: [   43.601294] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 1, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601294] 
Mar  8 01:17:40 s17 kernel: [   43.601336] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 4, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601336] 
Mar  8 01:17:40 s17 kernel: [   43.601379] usb 1-1-port5: cannot reset (err = -71)


Do we think this is the fault of one bad USB device, like the usb stick or cell modem? Is usb 1-1 the external hub? If Kurt and Dan paid a visit to try and get s17 more reliably online, would it be better to swap out the whole DSM so we can troubleshoot this one back in Boulder, or are we fairly confident swapping out one component would fix it?

s17 is back down now, so I can't keep looking. Steve copied some of /var/log into /scr/tmp/oncley, but it didn't seem to get very far before the connection went down.

Also, Steve noted that ssh to isfs17.dyndns.org connects to s3 right now because the dyndns names haven't been updated, which is confusing.