
Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html


Troubleshooting

MPI job hangs across nodes

As reported in jedi-tools issue #234, MPI jobs that span multiple non-EFA-enabled compute nodes (e.g. m6i.xlarge) hang during MPI calls. A simple "Hello world" reproducer is found here. Therefore, when enabling the option to run Skylab/EWOK jobs on different partitions during an experiment, the MPI fabrics environment needs to be modified.

For Intel:

I_MPI_FABRICS=shm srun --verbose -n 2 ./mpi_hello_world.x
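
The same variable can be exported inside a batch script when reproducing the hang across two nodes. This is only a sketch; the node and task counts are arbitrary and not taken from the issue:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
# Mirror the one-line workaround above for a non-EFA partition.
export I_MPI_FABRICS=shm
srun --verbose -n 2 ./mpi_hello_world.x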

For GNU: a similar workaround may be needed; the hang behavior is the same when launching with "mpirun".
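
If the GNU build uses OpenMPI (an assumption; the issue does not name the MPI library), an analogous workaround would be to restrict the byte-transfer layers to shared memory and TCP, for example:

# Hypothetical OpenMPI analog of the Intel workaround; not verified in jedi-tools #234.
mpirun --mca btl self,vader,tcp -np 2 ./mpi_hello_world.x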

Solution: We need to configure the Slurm TaskProlog option for non-EFA-enabled partitions so that this environment variable is set according to the compiler/MPI library that is loaded.
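
As a minimal sketch (the partition names, script path, and the I_MPI_ROOT check are placeholders/assumptions, not taken from the issue), a task prolog can print "export NAME=value" lines, which Slurm applies to each task's environment:

#!/bin/bash
# Hypothetical task prolog, e.g. /opt/slurm/etc/task_prolog.sh
# Only adjust the fabrics on non-EFA partitions (placeholder names).
case "${SLURM_JOB_PARTITION}" in
    serial|bigmem)
        # Assumes the loaded Intel MPI environment defines I_MPI_ROOT.
        if [ -n "${I_MPI_ROOT:-}" ]; then
            # Mirror the workaround above; slurmd applies this "export"
            # line to the task's environment.
            echo "export I_MPI_FABRICS=shm"
        fi
        ;;
esac

The script would then be registered in slurm.conf:

TaskProlog=/opt/slurm/etc/task_prolog.sh

Depending on the ParallelCluster version, this could be wired in through a custom slurm.conf snippet or a post-install (OnNodeConfigured) script; the exact mechanism should follow the infrastructure README.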

Procedures

Setting up AWS ParallelCluster

Prerequisites: 

  • Access to JCSDA's AWS accounts (more information can be found here)

Infrastructure's instructions for setting up an AWS ParallelCluster for Skylab are found in the README.
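
For orientation only (the README is authoritative; the cluster name and configuration file below are placeholders), a ParallelCluster 3 cluster is created with the pcluster CLI:

# Install the CLI and create/inspect the cluster from a prepared config.
python3 -m pip install --upgrade aws-parallelcluster
pcluster create-cluster --cluster-name skylab-cluster --cluster-configuration cluster-config.yaml
pcluster describe-cluster --cluster-name skylab-cluster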
