Blog

21 Oct 2020 Teardown

Weather: Cool and foggy morning, followed by overcast conditions for the remainder of the day.

Tasking: 

S4 (San Marcos Foothills Preserve) teardown. Meeting ranger for access at 8:30 am. Went smoothly save for the following issues:

    • From speaking with a local who was strolling by, it sounds as though a retention basin for a new housing development is slated to be installed at the current site location. This will have to be addressed for SWEX 2021.
    • CSAT 3A stiffener brace weld broke, will need to be epoxied. Second sonic found with this issue.
    • Battery voltage was measured around 1.5 V. Cause unknown, but the solar-panel-to-station cable is suspect. Will be hi-pot testing all cables back in Boulder.
    • Flora surrounding station grew much taller than expected, and enveloped rad tripod. 

Plan for tomorrow

    • S14 (Sedgwick Reserve) teardown
    • Assist Bill and Mack in lowering their tower at Lake Cachuma
20 Oct 2020 Teardown

Weather: Cool and foggy morning, with overcast conditions for the remainder of the day.

Tasking: 

  • S1 (SBC Fire Station 18) teardown. Went smoothly save for the following issues:
    • A crack formed in the lower collar/pivot point of the tripod assembly. Collar will have to be replaced. 
    • TRH fan inop.
    • Unable to connect to DSM. Site was powered. May be a static IP issue, will investigate in Boulder.
    • Tripod leg pads had play; the ground at this site compacted over time. Future tripod installations on soft ground will have to be releveled during maintenance visits.
    • CSAT 3A stiffener brace weld broke, will need to be epoxied 

Plan for tomorrow

  • S4 (San Marcos Foothills Preserve) teardown. Meeting ranger for access at 8:30 am. 
19 Oct 2020 Teardown

Weather: Cool and foggy morning followed by hot, sunny, and calm conditions. 

Tasking: 

  • S17 (Cuyama Peak) teardown. Approx. 3.5 hour drive one way. Upon arrival, found the cables in ports 5 & 7 had been yanked hard enough to remove the pins from the connector. Suspected vandalism. DSM was locked up, potentially from this incident. Teardown went smoothly. 

Plan for tomorrow

  • S1 (SBC Fire Station 18) teardown.
18 Oct 2020 Teardown

Weather: Cool and foggy morning followed by hot, sunny, and breezy conditions. 

Tasking: 

  • S8 (Santa Ynez Peak) teardown. All sensors were operational upon arrival (including the TRH fan), and site was in excellent condition after seven months of no maintenance. No issues with removal.

Plan for tomorrow

  • S17 (Cuyama Peak). Will be a long day due to three hours of travel time one way to get to site.
17 Oct 2020 Teardown

Weather: Sunny, hot, and breezy. 

Tasking: 

  • Teardown of S10
  • Reorganization of POD and Base Trailer
s1 PTB is down

PTB210 at s1 was working up until 2020-04-12,15:23:03, except of course for periods when s1 was probably dead. There are a few messages after that, up until 2020-04-16,01:29:11, and then they stop. Since the swap to the s9 DSM, nothing has been received from the PTB.

The PTB was set for 7E1, so I've tried connecting to it on s9 with minicom in 7E1, but there is still no response. Maybe a blown fuse, or maybe the PTB itself is dead. The data messages that were received in 7E1 mode are recoverable with some code changes to NIDAS (see the sketch after the log excerpt below).

2020 04 12 15:22:59.8795  0.7426  1,  22      10 \xb1\xb1\xb1\xb2.\xb26\x8d\n
2020 04 12 15:23:00.5341  0.6546  1,  22      10 \xb1\xb1\xb1\xb2.\xb26\x8d\n
2020 04 12 15:23:01.3770  0.8429  1,  22      10 \xb1\xb1\xb1\xb2.\xb26\x8d\n
2020 04 12 15:23:02.0969  0.7199  1,  22      10 \xb1\xb1\xb1\xb2.\xb2\xb7\x8d\n
2020 04 12 15:23:02.7773  0.6803  1,  22      10 \xb1\xb1\xb1\xb2.\xb2\xb7\x8d\n
2020 04 12 15:23:03.6169  0.8396  1,  22      10 \xb1\xb1\xb1\xb2.\xb2\xb7\x8d\n
2020-04-29,10:42:57|INFO|opening: s1_20200415_144143.dat
2020 04 15 14:41:44.7493 2.567e+05  1,  22      10 \xb1\xb1\xb13.0\xb2\x8d\n
2020-04-29,10:42:57|INFO|opening: s1_20200415_144322.dat
2020-04-29,10:43:03|INFO|opening: s1_20200416_012833.dat
2020 04 16 01:29:10.0350 3.885e+04  1,  22      10 \xb1\xb109.5\xb2\x8d\n
2020 04 16 01:29:10.5671  0.5322  1,  22      10 \xb1\xb109.5\xb2\x8d\n
2020 04 16 01:29:11.2509  0.6837  1,  22      10 \xb1\xb109.53\x8d\n
2020-04-29,10:43:03|INFO|opening: s1_20200416_012957.dat
2020-04-29,10:43:03|INFO|opening: s1_20200416_013538.dat
2020-04-29,10:43:03|INFO|opening: s1_20200416_013838.dat
2020-04-29,10:43:03|INFO|opening: s1_20200416_120000.dat
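
For reference, a minimal sketch (plain Python, not the actual NIDAS change) of recovering the 7E1 payload shown above by masking off the high (parity) bit of each byte:

# Hypothetical illustration: the PTB data were captured with 7-bit, even-parity
# framing (7E1) but stored as raw 8-bit bytes, so clearing bit 7 of each byte
# recovers the underlying ASCII. Sample bytes are from the log excerpt above.
raw = b"\xb1\xb1\xb1\xb2.\xb26\x8d\n"

def strip_parity(data: bytes) -> str:
    """Clear the parity (high) bit of each byte and decode as 7-bit ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii", errors="replace")

print(repr(strip_parity(raw)))   # prints '1112.26\r\n'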
s1 disdrometer is broken

When Leila swapped the s9 DSM in at s1, I discovered the disdrometer messages were broken. The quick summary is that the eeprom got erased, which raises these questions; the details follow.

  • Can we take up this problem with Ott Hydromet?
  • Any clues as to what could be causing the Parsivel2 to lose its memory?
  • Is voltage or current supply borderline for reliable operation?

It looks like the first data file at setup for s1 is s1_20200325_120000.dat. The disdrometer was working as of 2020-03-25,17:49:13. I'm guessing the site was set up that day, and for whatever reason the disdrometer data messages start there, without the boot messages. Probably the time was not synchronized until then.

[isfs@barolo raw_data]$ data_stats -a -i 1,-1 s1_20200325_120000.dat 
2020-04-29,09:39:54|NOTICE|parsing: /h/eol/isfs/isfs/projects/SWEX/ISFS/config/swex.xml
Exception: EOFException: s1_20200325_120000.dat: open: EOF
sensor                           dsm sampid    nsamps |------- start -------|  |------ end -----|    rate        minMaxDT(sec) minMaxLen
s1:/dev/ttyDSM1                    1      8       371 2020 03 25 17:49:13.254  03 25 23:59:10.090    0.02      57.007   60.021 4142 4142
s1:/dev/gps_pty0                   1     10     44573 2020 03 25 17:48:16.892  03 25 23:59:59.024    2.00      -2.133    1.253   59  147
s1:/var/log/chrony/tracking.log    1     18      2654 2020 03 25 17:48:16.958  03 25 23:59:59.913    0.12       0.000 2578.788  100  100
s1:/dev/ttyDSM2                    1     20     22436 2010 02 01 00:00:57.185  03 25 23:59:59.275    0.00-1946506.446    1.342   42   44
s1:/dev/ttyDSM3                    1     22         0 ***********************  ******************     nan         nan      nan  nan  nan
s1:/dev/ttyDSM4                    1     40    446122 2010 02 01 00:00:57.159  03 25 23:59:59.974    0.00-1946507.325    0.493   60   60
s1:/dev/ttyPWRMONV                 1     60    445799 2010 02 01 00:00:56.308  03 25 23:59:59.422    0.00-1946506.527    1.172    3   90
s1:/dev/ttyDSM5                    1    100    223047 2020 03 25 17:48:17.032  03 25 23:59:59.976   10.00      -2.889    0.428   32   32
s1:/dev/ttyDSM7                    1 0x8000      4467 2020 03 25 17:48:20.498  03 25 23:59:57.411    0.20       1.616    5.689   26   60

The messages appeared to be fine, including reporting the serial number 450620:

2020 03 25 17:49:13.2549       0  1,   8    4142 450620;0000.000;00;20000;024;27025;00000;0;000;...

It booted up again 2020-03-26,01:35:11, for reasons unknown. It reported one good message, then started rebooting and reporting "POWERSUPPLY TEST FAILED !!!".

2020-04-29,09:13:52|INFO|opening: s1_20200326_013511.dat
2020 03 26 01:35:11.6336   121.6  1,   8       3 \r\n
2020 03 26 01:35:11.6346 0.001042  1,   8      22 BOOTLOADER PARSIVEL\r\n
2020 03 26 01:35:12.9390   1.304  1,   8       3 \r\n
2020 03 26 01:35:12.9400 0.001042  1,   8      21 *** PARSIVEL 2 ***\r\n
2020 03 26 01:35:12.9518 0.01182  1,   8      20 OTT HYDROMET GMBH\r\n
2020 03 26 01:35:12.9631 0.01129  1,   8      22 COPYRIGHT (C) 2019 \r\n
2020 03 26 01:35:12.9755 0.01238  1,   8      18 VERSION: 2.11.6\r\n
2020 03 26 01:35:12.9843 0.008857  1,   8      17 BUILD: 2112151\r\n
2020 03 26 01:35:13.0045 0.02015  1,   8       3 \r\n
2020 03 26 01:35:13.3090  0.3045  1,   9      40          0          0      20000         15          0          0          3          0          0          0 
2020 03 26 01:35:13.3090       0  1,   8    4142 450620;0000.000;00;.....;\r\n
2020 03 26 01:35:28.0125    14.7  1,   8       4 \x80\r\n
2020 03 26 01:35:29.6423    1.63  1,   8      22 BOOTLOADER PARSIVEL\r\n
2020 03 26 01:35:30.9507   1.308  1,   8       3 \r\n
2020 03 26 01:35:30.9517 0.001042  1,   8      21 *** PARSIVEL 2 ***\r\n
2020 03 26 01:35:30.9635 0.01181  1,   8      20 OTT HYDROMET GMBH\r\n
2020 03 26 01:35:30.9748 0.01131  1,   8      22 COPYRIGHT (C) 2019 \r\n
2020 03 26 01:35:30.9858 0.01094  1,   8      18 VERSION: 2.11.6\r\n
2020 03 26 01:35:30.9961 0.01028  1,   8      17 BUILD: 2112151\r\n
2020 03 26 01:35:31.0044 0.008336  1,   8       3 \r\n
2020 03 26 01:35:35.0523   4.048  1,   8      31 \xf8POWERSUPPLY TEST FAILED !!!\r\n

Eventually it starts repeating the messages "ERROR: No Valid Serial Number found !!!" and "ERROR: No Valid Hardware info found !!!".

2020 03 26 01:38:57.7429 0.001042  1,   8      22 BOOTLOADER PARSIVEL\r\n
2020 03 26 01:38:59.0512   1.308  1,   8       3 \r\n
2020 03 26 01:38:59.0522 0.001042  1,   8      21 *** PARSIVEL 2 ***\r\n
2020 03 26 01:38:59.0641 0.01184  1,   8      20 OTT HYDROMET GMBH\r\n
2020 03 26 01:38:59.0740 0.009899  1,   8      22 COPYRIGHT (C) 2019 \r\n
2020 03 26 01:38:59.0879 0.01393  1,   8      40 VERSION: \xf8POWERSUPPLY TEST FAILED !!!\r\n
2020-04-29,09:13:52|INFO|opening: s1_20200326_120000.dat
2020 03 26 19:49:41.7501 6.544e+04  1,   8      42 ERROR: No Valid Serial Number found !!!\r\n
2020 03 26 19:49:41.7597 0.009589  1,   8      43 ERROR: No Valid Hardware info found !!!\r\r\n
2020 03 26 19:49:58.6922   16.93  1,   8      22 BOOTLOADER PARSIVEL\r\n
2020 03 26 19:50:00.0337   1.341  1,   8      42 ERROR: No Valid Serial Number found !!!\r\n
2020 03 26 19:50:00.0577 0.02407  1,   8      43 ERROR: No Valid Hardware info found !!!\r\r\n

It keeps reporting the "No Valid Hardware info" messages until 2020-03-27,01:50, then some noise, then nothing until 20:13, when it starts reporting the default messages with the serial number of XXXXXXXX:

2020 03 27 01:50:38.3070 0.001042  1,   8      22 BOOTLOADER PARSIVEL\r\n
2020 03 27 01:50:39.0419  0.7349  1,   8      42 \x00\xe0\x00\x00\x00\x00\x00\xff\x00\x00\xfe\x00POWERSUPPLY TEST FAILED !!!\r\n
2020 03 27 01:50:42.5119    3.47  1,   8      33 \x00\x00\x00POWERSUPPLY TEST FAILED !!!\r\n
2020 03 27 01:50:44.8753   2.363  1,   8      44 \x00\x00\x00\x00\x00\x00\x00\x0e\x00\xff\x00\xfe\x00\x00POWERSUPPLY TEST FAILED !!!\r\n
2020 03 27 01:50:47.5648    2.69  1,   8      63 \x00\x00\xff\x02\x00\xfe\x00\xff\x00\x00\x00\x00\x9d(\x80\x00\xff\x00\xff\x00\xfd\xff\x00\x00\xfe\x00\xff\x00\x00\xff\xff\x00\x00POWERSUPPLY TEST FAILED !!!\r\n
2020 03 27 01:50:51.8081   4.243  1,   8      49 \x00\x00\x00\x00\x00@\x14\x00\x00\x00\xfe\x00\x00\x00\x0e\x00\x00\xff\x00POWERSUPPLY TEST FAILED !!!\r\n
2020-04-29,09:13:56|INFO|opening: s1_20200327_120000.dat
2020 03 27 20:13:06.5367       0  1,   8      71 XXXXXXXX;0000.000;0000.00;00;-9.999;20000;0000.00;025;27028;00000;0;\r\n
2020 03 27 20:14:06.5447       0  1,   8      71 XXXXXXXX;0000.000;0000.00;00;-9.999;19320;0000.00;025;27018;00000;0;\r\n

There are still some reboot messages and more error messages later on, so it's not as though the disdrometer has stabilized and is merely missing its eeprom. Either way it's in a broken state, and I don't think this is the only unit to have had this kind of problem.

For the moment, I have modified the NIDAS config to parse the messages but skip the serial number field. However, that is not a fix since the whole configuration of the data messages has been lost, and we don't know if losing the hardware info and any other eeprom settings makes the data useless.
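
As a rough illustration of that workaround (not the actual swex.xml change), the parse can treat the leading serial-number field as opaque text so that both the 450620 records and the post-erase XXXXXXXX records get through; the field layout below is an assumption based on the excerpts above, not the configured OTT telegram:

# Hypothetical sketch: tolerant parsing of the semicolon-delimited Parsivel2
# telegram, skipping over the serial-number field so that both "450620;..."
# and the post-erase "XXXXXXXX;..." records are accepted. The meaning assigned
# to field 1 is an assumption for illustration only.
from typing import Optional

def parse_parsivel(line: str) -> Optional[dict]:
    fields = line.strip().split(";")
    if len(fields) < 3:
        return None               # boot banners, error messages, noise
    serial = fields[0]            # kept as text; may be "XXXXXXXX" after the wipe
    try:
        intensity = float(fields[1])   # assumed to be the first numeric field
    except ValueError:
        return None
    return {"serial": serial, "intensity": intensity, "fields": fields}

print(parse_parsivel("450620;0000.000;00;20000;024;27025;00000;0;000"))
print(parse_parsivel("XXXXXXXX;0000.000;0000.00;00;-9.999;20000;0000.00;025;27028;00000;0;"))
print(parse_parsivel("BOOTLOADER PARSIVEL"))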


A week ago, Leila and Charles visited Site 1 and found the DSM inop. For simplicity's sake, this past Sunday (4/26) I had them replace the S1 DSM with the unused S9 DSM. Nothing was swapped between the two, so S9 has its original cell modem and SD card. 

S9 Comms

m59
eth0- 192.168.1.209 (DSM address)
eth1- 166.255.144.36 (cell modem address)


--

Dan

High winds

Just a quick note that s8 has routinely seen winds >20 m/s, with the highest 5-min average of 27 m/s on 17 Mar.  

s15 is the next windiest, up to 17 m/s, though I suspect that s17 would have been high as well if it were being recorded.

The pattern is not surprising, though the magnitudes are rather high for what ISFS normally sees.


It seems that the s8 CSAT sometimes misses these events. Is this the high-wind-speed error that Sean saw several years ago?


Quickie status - 20 Apr

Still in idling mode after being pulled from set-up last month for COVID.

Sites deployed:

s1 - Barometer not reporting (configuration?).  Station only comes up when batteries are fully charged (had worked ok initially after setup).  Sometimes files are not opened.  Most of the time the time stamp is bad.

P.S. Site was visited by Leila and Charles on 18 Apr.  They replaced the batteries and Victron, which brought power up to the DSM and sensors, but the DSM still hasn't come up.  This suggests either a fault or a wrong setting (mode 4 instead of mode 3?) of the old Victron.  Dan thinks the next step is a DSM change, but the DSM was swaged closed so the PIs couldn't get into the box.  A DSM issue is odd, though, since it was working last week.

s3 - TRH died in rain on 6 Apr.  Otherwise ok

s4 - ok

s8 - EC150 never installed.  GPS not receiving most messages as of 14 Mar(!).  (At that time, about 13 Mar 23:30 nsat drops from 10 to 7, then further drops to 0 about 14 Mar 02:30.)  Otherwise ok

s10 - ok

s14 - Mote data never worked properly; last data on 3 Apr.  Cable from DSM to mote was found to have water in it during the site visit by the PIs on 11 Apr, but they didn't have a spare.  Barometer highly intermittent (but Pirga ok).  Otherwise, ok.

s15  - ok

s17 - Site pretty much never worked.  Just a few hours of data early on.  Last data 26 Mar.  From log snapshot taken when station was last up, seemed to be a DSM USB issue.

Status - Mar 20

s1: not reporting.  Last message (18 Mar) missing P, Vbatt was okay.

s3: all working

s4: all working

s8: ec150 not installed

s10: all working

s14: P, TRH, mote all down (mote has a lot of 0x00 characters before message)

s15: Qsoil needed power cycle

s17: not reporting, suspect DSM usb issue.  Last message (13 Mar) had bad TRH fan, RH questionable, missing Ott, missing TP01 (might just have been timing, since prior message was okay), Vbatt was okay


Wind directions

So... we want to offer a dataset to the PIs in geo coordinates.  Speaking with Kurt, I learned he is confident that the tripods at each site were oriented with a compass to make the csat point out from the mast at an angle of 315 deg (NW), to within about 2 degrees.  I have thus entered Vazimuth = 315 - 180 - 90 = 45 into the cal files for s1, s3, s4, s8, s10, s14, s15, and s17.

Dan told me that the orientation of the Gill 2D could be any multiple of 90 degrees from the csat orientation.  By creating a scatterplot of each site's csat vs Gill directions, I verified this and entered the appropriate multiple + 45 into the cal files as well (see the sketch below).  Running statsproc with noqc_geo now produces dir=Dir, so I think we're close enough for an unsupported project.
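
A minimal sketch of that check, assuming matched time series of wind direction from the csat (already rotated to geo with Vazimuth = 45) and the raw Gill 2D; it picks the multiple of 90 degrees that, added to the 45-degree base offset, best aligns the Gill with the csat in a circular-difference sense:

import numpy as np

def best_gill_offset(csat_dir_deg, gill_raw_dir_deg):
    """Pick the Gill azimuth offset (45 + k*90 deg, k = 0..3) that minimizes
    the mean absolute circular difference from the csat-derived direction.
    Inputs are assumed to be matching arrays of direction in degrees."""
    csat = np.asarray(csat_dir_deg, dtype=float)
    gill = np.asarray(gill_raw_dir_deg, dtype=float)
    best = None
    for k in range(4):
        offset = 45.0 + 90.0 * k
        diff = (gill + offset - csat + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
        score = np.nanmean(np.abs(diff))
        if best is None or score < best[1]:
            best = (offset, score)
    return best   # (offset in degrees, mean abs circular difference)

This is just the scatterplot comparison in code form; actually measuring the boom angles at teardown would still be the better check.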

If the teardown crew has nothing better to do, it would be nice to actually measure these angles...

Status 19 Mar

I guess we can't leave this blog up in perpetuity without some explanation of what has happened in the last week!

Due to the world-wide Covid-19 coronavirus pandemic, all staff were recalled from the field.  On 3/12, s13, which had been partially assembled but had never transmitted data, was removed, and the field crew started securing the base and Pod.  On 3/13, Dan left, and Kurt and Clayton serviced TRHs at s8 and s10.  On 3/14, Kurt and Clayton left the site as well.

This left s1, s3, s4, s8, s10, s14, s15, and s17 installed.  The EC150 was never installed at s8.   The barometer at s1 seems to be flaky.  s17 connects very intermittently, presumably due to a USB issue in the DSM that is rebooting it frequently –  the last data came through 13 Mar.

We will continue to let these run, perhaps with a bit of servicing by UCSB, until we are next cleared for travel.  At that point, we will send out a tear-down crew to pull everything and wait for SWEX2021...

s17 temporarily up

Logged in to see what's up. Steve fixed the udev rules, so now the pwrmon is reporting. I noticed in the DSM logs that statsproc from RELAMPAGO was still trying to run, so I disabled it and turned off the service. Steve rsynced the data files to barolo.

Looked at logs to see if I could figure out why it's been off the net so much. Looks like it's rebooting frequently due to USB problems:

Mar  8 15:15:10 s17 kernel: [   42.553988] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 15:15:10 s17 kernel: [   42.555830] usb 1-1.5: Failed to suspend device, error -71
Mar  8 15:15:10 s17 kernel: [   42.562454] usb 1-1.5: USB disconnect, device number 118
[... long run of NUL bytes (^@) where the log was truncated by the unclean reboot ...]
Mar  8 01:17:06 s17 kernel: [    0.000000] Linux version 4.9.35-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1014 SMP Fri Jun 30 14:47:43 BST 2017
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: div instructions available: patching division code
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Mar  8 01:17:06 s17 kernel: [    0.000000] OF: fdt:Machine model: Raspberry Pi 2 Model B Rev 1.1
Mar  8 01:17:06 s17 kernel: [    0.000000] cma: Reserved 8 MiB at 0x3a800000
Mar  8 01:17:06 s17 kernel: [    0.000000] Memory policy: Data cache writealloc
Mar  8 01:17:06 s17 kernel: [    0.000000] On node 0 totalpages: 241664
Mar  9 01:17:14 s17 kernel: [   16.952991] usb 1-1.5: Product: USB 2.0 Hub
Mar  9 01:17:14 s17 kernel: [   16.954917] hub 1-1.5:1.0: USB hub found
Mar  9 01:17:14 s17 kernel: [   16.955420] hub 1-1.5:1.0: 4 ports detected
Mar  9 01:17:14 s17 kernel: [   17.171205] hub 1-1.5:1.0: hub_ext_port_status failed (err = -71)
Mar  9 01:17:14 s17 kernel: [   17.172380] usb 1-1.5: Failed to suspend device, error -71
Mar  9 01:17:14 s17 kernel: [   17.226561] usb 1-1.5: USB disconnect, device number 28
Mar  9 01:17:14 s17 kernel: [   17.520553] usb 1-1.5: new full-speed USB device number 29 using dwc_otg
[... long run of NUL bytes (^@) where the log was truncated by the unclean reboot ...]
Mar  9 01:17:06 s17 kernel: [    0.000000] Linux version 4.9.35-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1014 SMP Fri Jun 30 14:47:43 BST 2017
Mar  9 01:17:06 s17 kernel: [    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
Mar  9 01:17:06 s17 kernel: [    0.000000] CPU: div instructions available: patching division code
Mar  9 01:17:06 s17 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Mar  9 01:17:06 s17 kernel: [    0.000000] OF: fdt:Machine model: Raspberry Pi 2 Model B Rev 1.1
Mar  9 01:17:06 s17 kernel: [    0.000000] cma: Reserved 8 MiB at 0x3a800000
Mar  9 01:17:06 s17 kernel: [    0.000000] Memory policy: Data cache writealloc
Mar  8 15:19:10 s17 kernel: [   42.911045] 
Mar  8 15:19:10 s17 kernel: [   42.911078] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 3, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911078] 
Mar  8 15:19:10 s17 kernel: [   42.911112] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 4, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911112] 
Mar  8 15:19:10 s17 kernel: [   42.911181] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 7, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911181] 
Mar  8 15:19:10 s17 kernel: [   42.911214] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 5, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 15:19:10 s17 kernel: [   42.911214] 
Mar  8 15:19:10 s17 kernel: [   42.911247] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 2, DMA Mode -- ChHltd set, but reason for hMar  8 01:17:06 s17 kernel: [    0.000000] Booting Linux on physical CPU 0xf00
Mar  8 01:17:06 s17 kernel: [    0.000000] Linux version 4.9.35-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1014 SMP Fri Jun 30 14:47:43 BST 2017
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: div instructions available: patching division code
Mar  8 01:17:06 s17 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Mar  8 01:17:06 s17 kernel: [    0.000000] OF: fdt:Machine model: Raspberry Pi 2 Model B Rev 1.1

Lots of these reboots in the logs. Interestingly, when the system reboots it seems to always come up with a time right around 01:17:05 of the current day, even if that means jumping back in time by minutes or hours.
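
For tallying how often this happens, a rough sketch (the log path is a placeholder, e.g. a local copy of the s17 /var/log/messages) that counts kernel boot banners and the runs of NUL bytes left behind by the unclean shutdowns:

# Hypothetical sketch: count reboots in a saved copy of the s17 syslog by
# looking for the kernel version banner, and count the long runs of NUL
# bytes that the unclean shutdowns leave in the file.
import re

def count_reboots(path="messages.s17"):          # path is a placeholder
    data = open(path, "rb").read()
    boots = data.count(b"] Linux version ")      # one banner per kernel boot
    nul_runs = len(re.findall(rb"\x00{16,}", data))
    return boots, nul_runs

if __name__ == "__main__":
    boots, nul_runs = count_reboots()
    print(f"{boots} boot banners, {nul_runs} NUL gaps in the log")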

There were some other usb messages in the logs that didn't seem to trigger a reboot, but were still notable:

Mar  8 01:17:10 s17 kernel: [   13.663002] usb 1-1.5: Product: USB 2.0 Hub
Mar  8 01:17:10 s17 kernel: [   13.664314] hub 1-1.5:1.0: USB hub found
Mar  8 01:17:10 s17 kernel: [   13.664799] hub 1-1.5:1.0: 4 ports detected
Mar  8 01:17:11 s17 kernel: [   13.980630] usb 1-1.5.1: new full-speed USB device number 35 using dwc_otg
Mar  8 01:17:11 s17 kernel: [   13.982863] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.983375] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.983959] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.984481] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.984980] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.984988] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.985569] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.986115] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.986665] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.987144] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.987727] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.988208] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.988215] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.988765] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.989280] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.989862] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.990342] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.990997] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.991627] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.991637] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.992182] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.992779] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.993295] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.993872] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.994386] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.994945] usb 1-1.5-port1: cannot reset (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.994952] usb 1-1.5-port1: Cannot enable. Maybe the USB cable is bad?
Mar  8 01:17:11 s17 kernel: [   13.995477] usb 1-1.5-port1: cannot disable (err = -71)
Mar  8 01:17:11 s17 kernel: [   13.995515] usb 1-1.5-port1: unable to enumerate USB device
Mar  8 01:17:11 s17 kernel: [   13.996019] usb 1-1.5-port1: cannot disable (err = -71)

I saw this message for both port 1 and port 2 of usb 1-1.5.

Mar  8 01:17:40 s17 kernel: [   43.600792] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 0, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.600792] 
Mar  8 01:17:40 s17 kernel: [   43.600848] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 7, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.600848] 
Mar  8 01:17:40 s17 kernel: [   43.600907] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 1, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000021
Mar  8 01:17:40 s17 kernel: [   43.600907] 
Mar  8 01:17:40 s17 kernel: [   43.600983] hub 1-1:1.0: hub_ext_port_status failed (err = -71)
Mar  8 01:17:40 s17 kernel: [   43.601058] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 4, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601058] 
Mar  8 01:17:40 s17 kernel: [   43.601100] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 6, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601100] 
Mar  8 01:17:40 s17 kernel: [   43.601144] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 0, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601144] 
Mar  8 01:17:40 s17 kernel: [   43.601192] usb 1-1-port5: cannot reset (err = -71)
Mar  8 01:17:40 s17 kernel: [   43.601254] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 7, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601254] 
Mar  8 01:17:40 s17 kernel: [   43.601294] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 1, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601294] 
Mar  8 01:17:40 s17 kernel: [   43.601336] ERROR::handle_hc_chhltd_intr_dma:2215: handle_hc_chhltd_intr_dma: Channel 4, DMA Mode -- ChHltd set, but reason for halting is unknown, hcint 0x00000002, intsts 0x06000001
Mar  8 01:17:40 s17 kernel: [   43.601336] 
Mar  8 01:17:40 s17 kernel: [   43.601379] usb 1-1-port5: cannot reset (err = -71)


Do we think this is the fault of one bad USB device, like the USB stick or cell modem? Is usb 1-1 the external hub? If Kurt and Dan paid a visit to try to get s17 more reliably online, would it be better to swap out the whole DSM so we can troubleshoot this one back in Boulder, or are we fairly confident that swapping out one component would fix it?

s17 is back down now, so I can't keep looking. Steve copied some of /var/log into /scr/tmp/oncley, but it didn't seem to get very far before the connection went down.

Also, Steve noted that ssh to isfs17.dyndns.org connects to s3 right now because the dyndns names haven't been updated, which is confusing.


Quickie status - Mar 11

s3 now up.  All sensors okay.