11:45   Low lost communication with its Emerald boards around 11:00.

Rebooted low and everything is back up.

Information for Gordon:

root@low root# irqs
Counting interrupts over 5 seconds ...

IRQ      Interrupt Type            Total Int  Int/sec  
------------------------------------------------------
24:      GPIO-l eth0:              2          0.4      
36:      SC serial:                15         3        
37:      SC serial:                102        20.4     
42:      SC ost0:                  508        101.6    
114:     GPIO isp116x-hcd:usb1:    290        58       
115:     GPIO serial:              226        45.2     
116:     GPIO serial:              103        20.6     

end of dmesg

i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
handle_IRQ_event called 4 times for IRQ 3
handle_IRQ_event called 4 times for IRQ 3
root@low root#
root@low root#

Added by Gordon, Jun 26:

/var/log/isfs/kernel has those dmesg messages, with timetags:

Jun 23 16:49:50 low kernel: i2c i2c-0: i2c_pxa: timeout waiting for bus free
Jun 23 16:49:53 low last message repeated 5 times
Jun 25 09:27:58 low kernel: handle_IRQ_event called 4 times for IRQ 3
Jun 25 17:46:14 low kernel: handle_IRQ_event called 4 times for IRQ 3

Those were the only messages before the reboot, and they occurred at least 23 hours earlier, so the problem is not due to a kernel oops or any other atypical event that the kernel could detect. It is just the good ol' situation where there seems to be a very small chance that a PC104 interrupt is missed and not retriggered, even though the PC104 interrupt line is high, so the interrupt handler is never called again.

I believe restarting the dsm process with a ddn/dup, which closes and re-opens the serial ports, will bring it back too.

I just updated the xml on the low DSM so that every sensor has a timeout. The dsm process should then close and reopen each port after detecting the timeout, which should also help to recover from this situation more quickly.
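
As a rough illustration of the per-sensor timeout idea, here is a minimal Python sketch. It is not the dsm implementation: the SensorPort class, its method names, and the 10 second timeout are all hypothetical, and the reopen step only counts calls where a real port would be closed and opened again. The clock is passed in so the logic can be exercised without waiting.

```python
import time


class SensorPort:
    """Hypothetical stand-in for one serial sensor port (not the dsm code)."""

    def __init__(self, name, timeout_s, clock=time.monotonic):
        self.name = name
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_data = clock()
        self.reopens = 0

    def on_data(self):
        """Record that a sample arrived from the sensor."""
        self.last_data = self.clock()

    def check(self):
        """Reopen the port if no data arrived within timeout_s.

        Returns True if a reopen was triggered.
        """
        if self.clock() - self.last_data < self.timeout_s:
            return False
        self.reopen()
        return True

    def reopen(self):
        # A real port would be closed and opened again here.
        self.reopens += 1
        self.last_data = self.clock()  # restart the timeout window


# Drive the logic with a fake clock instead of real time.
now = [0.0]
port = SensorPort("example", timeout_s=10, clock=lambda: now[0])
now[0] = 5.0
print(port.check())  # False: only 5 s without data
now[0] = 12.0
print(port.check())  # True: 12 s without data, port is reopened
```

Restarting the timeout window after a reopen keeps a silent sensor from being reopened on every check.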

Seems that I need to install a PC104 interrupt watchdog module. There is some indication this has happened on the aircraft, also quite infrequently. A test is being set up out at RAF.
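
The general watchdog idea can be sketched in user space as a stall detector on a cumulative interrupt count. This is only an illustration in Python, not the planned kernel module: the InterruptWatchdog class and the max_idle_samples parameter are hypothetical, and how the count is obtained and what recovery is attempted are left out.

```python
class InterruptWatchdog:
    """Hypothetical stall detector for a cumulative interrupt counter."""

    def __init__(self, max_idle_samples=3):
        self.max_idle_samples = max_idle_samples
        self.last_count = None
        self.idle_samples = 0

    def sample(self, count):
        """Feed the latest cumulative count.

        Returns True once the count has stayed unchanged for
        max_idle_samples consecutive samples.
        """
        if self.last_count is not None and count == self.last_count:
            self.idle_samples += 1
        else:
            self.idle_samples = 0
        self.last_count = count
        return self.idle_samples >= self.max_idle_samples


wd = InterruptWatchdog(max_idle_samples=2)
for count in (1376, 2752, 2752, 2752):
    print(count, wd.sample(count))  # only the last sample reports a stall
```

Requiring several unchanged samples in a row avoids flagging a sensor that is merely quiet for one sampling period.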

When the PC104 interrupts are being handled, the irqs listing looks like so, showing 275 interrupts/sec from the Emerald cards:

root@low root# irqs
Counting interrupts over 5 seconds ...

IRQ      Interrupt Type            Total Int  Int/sec
------------------------------------------------------
3:       ISA serial:               1376       275.2
24:      GPIO-l eth0:              62         12.4
25:      GPIO-l GPIO1-PC104:       1376       275.2
36:      SC serial:                15         3
37:      SC serial:                101        20.2
42:      SC ost0:                  509        101.8
114:     GPIO isp116x-hcd:usb1:    90         18
115:     GPIO serial:              228        45.6
116:     GPIO serial:              102        20.4
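
The difference between the two irqs listings can be checked mechanically. The Python sketch below parses the listing format shown above and reports whether IRQ 3 (ISA serial) and IRQ 25 (GPIO1-PC104) appear with a nonzero count. The function names are hypothetical, and it assumes, based only on these two listings, that a stuck PC104 interrupt shows up as those rows being absent.

```python
import re

# Matches data rows such as "3:       ISA serial:               1376       275.2".
ROW = re.compile(r"^\s*(\d+):\s+(.+?):\s+(\d+)\s+([\d.]+)\s*$")


def parse_irqs(text):
    """Return {irq: (interrupt_type, total, per_second)} from an irqs listing."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m.group(1))] = (m.group(2), int(m.group(3)), float(m.group(4)))
    return rows


def pc104_interrupts_seen(text, irqs=(3, 25)):
    """True only if every listed IRQ appears with a nonzero count."""
    rows = parse_irqs(text)
    return all(irq in rows and rows[irq][1] > 0 for irq in irqs)


# Excerpts from the two listings in this entry.
HEALTHY = """\
3:       ISA serial:               1376       275.2
24:      GPIO-l eth0:              62         12.4
25:      GPIO-l GPIO1-PC104:       1376       275.2
"""
STUCK = """\
24:      GPIO-l eth0:              2          0.4
114:     GPIO isp116x-hcd:usb1:    290        58
"""

print(pc104_interrupts_seen(HEALTHY))  # True
print(pc104_interrupts_seen(STUCK))    # False
```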