First, check the contents of the .error file for your job. That is where mpirun writes its error messages.
Probably the most common cause of code crashing mysteriously is running out of memory. Remember that at most 512 MB is available on the compute nodes in coprocessor mode. In virtual-node mode only 256 MB is available.
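As a rough illustration of the limits above, the sketch below estimates whether a set of arrays fits in memory for each mode. The function names and the example array sizes are hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the estimate ignores memory used by the executable itself, the kernel, and MPI buffers, so treat the result as an optimistic upper bound.

```python
# Memory limits quoted above, in bytes (assuming 1 MB = 1024 * 1024 bytes).
NODE_MEMORY_BYTES = {
    "coprocessor": 512 * 1024 * 1024,
    "virtual-node": 256 * 1024 * 1024,
}


def array_bytes(n_elements, bytes_per_element=8):
    """Size of one array; 8 bytes per element assumes double precision."""
    return n_elements * bytes_per_element


def fits_in_node_memory(total_bytes, mode):
    """True if total_bytes is below the limit for the given mode.

    This ignores everything except the arrays themselves, so a True
    result does not guarantee that the job will run.
    """
    return total_bytes < NODE_MEMORY_BYTES[mode]


# Hypothetical example: three double-precision arrays of 256^3 elements each.
need = 3 * array_bytes(256 ** 3)
for mode in ("coprocessor", "virtual-node"):
    print(mode, fits_in_node_memory(need, mode))
```

In this example the arrays need 384 MB, which is under the coprocessor limit but over the virtual-node limit.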
Problems linking with libraries
When compiling and linking code for the compute nodes, make sure you are linking with libraries under /contrib/bgl. The libraries under /contrib/fe_tools are for use with code running on the login nodes.
Reading from STDIN
To have your code read a file on STDIN, pass the filename with the -i option to cqsub:
cqsub -i inputfile -n 32 -t ...
Core dumps with non-MPI (serial) code
Even if your code does not make any MPI calls, you still need to call MPI_Init at the beginning and MPI_Finalize at the end of your program. Otherwise your code may crash at seemingly random points, such as print statements.
Compiling/Linking Fortran and C/C++
-qextname flag with Fortran code. For example, if the
flush symbol will not resolve properly, compile with
Opening a Fortran SCRATCH file
Trying to open a Fortran scratch type file on Frost will result in an error code of 1525-014 or 1525-114.
You can get around this by setting the TMPDIR environment variable, which controls where scratch files are created:
cqsub -n 1 -t 10 -e TMPDIR=/ptmp/voran scratchtest
Job killed with signal 9 or 11
If you get output like this in your .error file, it usually means your job exceeded the walltime you specified to cqsub:
<Oct 01 13:44:30.918274> FE_MPI (WARN) :
<Oct 01 13:44:30.918582> FE_MPI (WARN) : !------------------------------------------------!
<Oct 01 13:44:30.918609> FE_MPI (WARN) : ! MPIRUN is now taking all the necessary actions !
<Oct 01 13:44:30.918635> FE_MPI (WARN) : ! to terminate the job and to free the resources !
<Oct 01 13:44:30.918659> FE_MPI (WARN) : ! occupied by this job. This may take a while... !
<Oct 01 13:44:30.918683> FE_MPI (WARN) : !------------------------------------------------!
<Oct 01 13:44:30.918752> FE_MPI (WARN) :
<Oct 01 13:44:33.230580> BE_MPI (WARN) : Received a message from Front-End
<Oct 01 13:44:33.230733> BE_MPI (WARN) : Execution of the current command interrupted
<Oct 01 13:44:49.365013> BE_MPI (ERROR): The error message in the job record is as follows:
<Oct 01 13:44:49.365077> BE_MPI (ERROR): "killed with signal 9"
<Oct 01 13:46:28.686193> FE_MPI (ERROR): Failure list:
<Oct 01 13:46:28.686291> FE_MPI (ERROR): - 1. Execution interrupted by signal (failure #61)
Check the output of qhist -l, comparing the Runtime and Walltime columns to verify that your job indeed exceeded the walltime.
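The comparison described above can be sketched as a small helper. This assumes both values are written as HH:MM:SS, which is an assumption about the qhist output format and not something confirmed here; the function names and the sample times are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def exceeded_walltime(runtime, walltime):
    """True if the runtime reached or passed the requested walltime."""
    return to_seconds(runtime) >= to_seconds(walltime)


# Hypothetical values read by eye from the Runtime and Walltime columns.
print(exceeded_walltime("00:10:07", "00:10:00"))
print(exceeded_walltime("00:04:12", "00:10:00"))
```

If the first argument is at or past the second, the signal 9 in the .error file is most likely the scheduler ending the job at its time limit.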
Partition is deallocating
If you get output like the following in your .error file, the partition most likely failed to boot. Send an email to <frost-help AT ucar DOT edu>.
<May 09 21:07:19.801756> BE_MPI (ERROR): Booting aborted - partition is in DEALLOCATING ('D') state
<May 09 21:07:19.801896> BE_MPI (ERROR): Partition has not reached the READY ('I') state
<May 09 21:07:19.943004> FE_MPI (ERROR): Back-end failed while preparing partition with return code 35.
<May 09 21:07:25.516119> FE_MPI (ERROR): Failure list:
<May 09 21:07:25.516217> FE_MPI (ERROR): - 1. Failed to boot the partition (failure #35)
Block was deallocated
The following error messages most likely indicate a hardware error. Send an email to <frost-help AT ucar DOT edu>.
<Jun 20 04:14:52.453111> BE_MPI (ERROR): The error message in the job record is as follows:
<Jun 20 04:14:52.453277> BE_MPI (ERROR): "Job deleted because block was deallocated."
<Jun 20 04:14:52.599736> BE_MPI (ERROR): The error message in the job record is as follows:
<Jun 20 04:14:52.599819> BE_MPI (ERROR): "Job deleted because block was deallocated."
The following error means you submitted an invalid executable. This can happen if the executable was built for a different CPU architecture, or if you submitted a script without specifying script mode (-m script).
<Feb 26 14:38:48.366311> BE_MPI (ERROR): The error message in the job record is as follows:
<Feb 26 14:38:48.366392> BE_MPI (ERROR): "Load failed on 172.30.0.31: Magic value in ELF header of executable file is invalid"
<Feb 26 14:38:48.508106> FE_MPI (ERROR): Job execution failed (error code - 50)
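The error messages quoted in the sections above can be matched mechanically. The sketch below is a hypothetical helper, not a tool installed on Frost: it searches the text of an .error file for substrings taken from those excerpts and returns the corresponding advice. The signature list is illustrative and not exhaustive.

```python
# Substrings copied from the log excerpts in this document, paired with the
# advice given alongside them. Extend the list as you meet new messages.
KNOWN_ERRORS = [
    ("killed with signal 9",
     "job probably exceeded its walltime; compare Runtime and Walltime in qhist -l"),
    ("partition is in DEALLOCATING",
     "partition most likely failed to boot; email frost-help"),
    ("block was deallocated",
     "most likely a hardware error; email frost-help"),
    ("Magic value in ELF header",
     "invalid executable; check the target architecture or use -m script"),
]


def diagnose(error_text):
    """Return the hints whose signature appears in the .error file text."""
    return [hint for signature, hint in KNOWN_ERRORS if signature in error_text]


# One line taken from the ELF header excerpt above.
sample = ('<Feb 26 14:38:48.366392> BE_MPI (ERROR): "Load failed on 172.30.0.31: '
          'Magic value in ELF header of executable file is invalid"')
for hint in diagnose(sample):
    print(hint)
```

To use it on a real job, read the whole .error file into a string and pass that string to diagnose.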
Compile/link error: relocation truncated to fit: R_PPC_REL24 against symbol ...
Add -Wl,-relax to your compile/link line.
Job is not running (and it should be)
Running jobs interactively
The -I option to cqsub is not available on Blue Gene systems. Instead, try the debug queue for short jobs.
If all else fails...
Send an email to <frost-help AT ucar DOT edu>.