Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html

Table of Contents

Troubleshooting

MPI job hangs across nodes

As reported in jedi-tools issue #234, MPI jobs that span multiple non-EFA-enabled compute nodes (e.g., m6i.xlarge) hang during MPI calls. A simple "Hello world" reproducer is found here. Therefore, when enabling the option to run Skylab/EWOK jobs on different partitions during an experiment, the MPI fabrics environment variable needs to be modified.

For Intel:

Code Block
I_MPI_FABRICS=shm srun --verbose -n 2 ./mpi_hello_world.x

For GNU: a similar solution may be needed, and the same hanging behavior is observed with mpirun; a possible OpenMPI equivalent is sketched below.
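For OpenMPI built with the GNU compilers, the analogous workaround would likely be to restrict the byte-transfer layers to shared memory and TCP. The variable and values below are an assumption and have not been verified against the original issue:

Code Block
OMPI_MCA_btl=self,vader,tcp srun --verbose -n 2 ./mpi_hello_world.x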

Solution: We need to configure the Slurm TaskProlog option for non-EFA-enabled partitions so that it sets this environment variable according to the compiler/MPI library that is loaded.
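A minimal sketch of such a task prolog is shown below. Slurm runs the TaskProlog script at task launch and adds any "export NAME=value" lines it prints to stdout to the task environment. The script path, partition names, and module-detection logic are illustrative assumptions, not the actual JCSDA configuration:

Code Block
#!/bin/bash
# Hypothetical task prolog, referenced by TaskProlog= in slurm.conf (path and names are assumptions).
# Any line printed as "export NAME=value" is added to the task's environment by Slurm.
case "${SLURM_JOB_PARTITION}" in
  m6i*)  # assumed names of the non-EFA-enabled partitions
    if [[ "${LOADEDMODULES:-}" == *intel* ]]; then
      # Intel MPI: restrict the fabric selection, as in the workaround above
      echo "export I_MPI_FABRICS=shm"
    else
      # GNU/OpenMPI: assumed equivalent restriction (see the note above)
      echo "export OMPI_MCA_btl=self,vader,tcp"
    fi
    ;;
esac

On AWS ParallelCluster, this script would also need to be deployed to the compute nodes and referenced from the cluster's custom Slurm settings; the exact deployment mechanism is not covered here.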

Procedures

Setting up AWS ParallelCluster

Prerequisites: 

  • Access to JCSDA's AWS accounts (more information can be found here)

Infrastructure's instructions for setting up an AWS ParallelCluster for Skylab are found in the README.
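As a rough outline of the workflow (the cluster name, configuration file, and key pair below are placeholders; the README remains the authoritative procedure), a cluster is created and accessed with the ParallelCluster v3 CLI roughly as follows:

Code Block
# Create the cluster from a YAML configuration file (placeholder names)
pcluster create-cluster --cluster-name skylab-cluster --cluster-configuration cluster-config.yaml

# Check creation status until the cluster reports CREATE_COMPLETE
pcluster describe-cluster --cluster-name skylab-cluster

# Log in to the head node
pcluster ssh --cluster-name skylab-cluster -i ~/.ssh/my-key.pem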