
Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html


Troubleshooting

MPI job hangs across nodes

As reported in jedi-tools issue #234, MPI jobs that span multiple non-EFA-enabled compute nodes (e.g. m6i.xlarge) hang during MPI calls. A simple "Hello world" reproducer is found here. Therefore, when enabling the option to run Skylab/EWOK jobs on different partitions during an experiment, the MPI fabrics environment needs to be modified.

For Intel:

I_MPI_FABRICS=shm srun --verbose -n 2 ./mpi_hello_world.x
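
The same variable can be exported inside a batch script when reproducing the hang across two nodes. This is only a sketch; the node and task counts are arbitrary and not taken from the issue:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
# Mirror the one-line workaround above for a non-EFA partition.
export I_MPI_FABRICS=shm
srun --verbose -n 2 ./mpi_hello_world.x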

For GNU: a similar workaround may be needed; the hang behavior is the same when launching with "mpirun".
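
If the GNU build uses OpenMPI (an assumption; the issue does not name the MPI library), an analogous workaround would be to restrict the byte-transfer layers to shared memory and TCP, for example:

# Hypothetical OpenMPI analog of the Intel workaround; not verified in jedi-tools #234.
mpirun --mca btl self,vader,tcp -np 2 ./mpi_hello_world.x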

Solution: We need to configure the Slurm TaskProlog option for non-EFA-enabled partitions so that this environment variable is set according to the compiler/MPI library that is loaded.
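
As a minimal sketch (the partition names, script path, and the I_MPI_ROOT check are placeholders/assumptions, not taken from the issue), a task prolog can print "export NAME=value" lines, which Slurm applies to each task's environment:

#!/bin/bash
# Hypothetical task prolog, e.g. /opt/slurm/etc/task_prolog.sh
# Only adjust the fabrics on non-EFA partitions (placeholder names).
case "${SLURM_JOB_PARTITION}" in
    serial|bigmem)
        # Assumes the loaded Intel MPI environment defines I_MPI_ROOT.
        if [ -n "${I_MPI_ROOT:-}" ]; then
            # Mirror the workaround above; slurmd applies this "export"
            # line to the task's environment.
            echo "export I_MPI_FABRICS=shm"
        fi
        ;;
esac

The script would then be registered in slurm.conf:

TaskProlog=/opt/slurm/etc/task_prolog.sh

Depending on the ParallelCluster version, this could be wired in through a custom slurm.conf snippet or a post-install (OnNodeConfigured) script; the exact mechanism should follow the infrastructure README.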

Procedures

Setting up AWS ParallelCluster

Prerequisites: 

  • Access to JCSDA's AWS accounts (more information can be found here)

Infrastructure's instructions for setting up an AWS ParallelCluster for Skylab are found in the README.
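
For orientation only (the README is authoritative; the cluster name and configuration file below are placeholders), a ParallelCluster 3 cluster is created with the pcluster CLI:

# Install the CLI and create/inspect the cluster from a prepared config.
python3 -m pip install --upgrade aws-parallelcluster
pcluster create-cluster --cluster-name skylab-cluster --cluster-configuration cluster-config.yaml
pcluster describe-cluster --cluster-name skylab-cluster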
