Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html
Troubleshooting
MPI job hangs across nodes
From jedi-tools issue #234, MPI jobs that span multiple nodes hang during MPI calls when run on non-EFA-enabled compute nodes (e.g. m6i.xlarge). A simple "Hello world" example is found here. Therefore, when enabling the option to run Skylab/EWOK jobs on different partitions during an experiment, the MPI fabrics environment needs to be modified.
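The linked example is not reproduced here, but a minimal reproduction follows this pattern (the module name is an assumption for an Intel toolchain on this cluster; adjust to match your environment):

```shell
# Hypothetical module name -- load whatever provides the Intel compilers/MPI
module load intel-oneapi-mpi
mpiicc -o mpi_hello_world.x mpi_hello_world.c

# On a non-EFA partition (e.g. m6i.xlarge nodes), a two-node run hangs
# inside MPI calls unless the fabric is restricted (see below):
srun -N 2 -n 2 ./mpi_hello_world.x
```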
For Intel:
```shell
I_MPI_FABRICS=shm srun --verbose -n 2 ./mpi_hello_world.x
```
For GNU: a similar workaround is likely needed, and the behavior is the same when launching with "mpirun".
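For a GNU toolchain using Open MPI, the analogous workaround is to restrict the byte-transfer layers to shared memory and TCP so the (absent) EFA/libfabric path is never selected. A sketch, assuming Open MPI 4.x (where the shared-memory BTL is named `vader`; it is renamed `sm` in 5.x):

```shell
# Restrict Open MPI transports to loopback, shared memory, and TCP
export OMPI_MCA_btl=self,vader,tcp
srun -n 2 ./mpi_hello_world.x
# equivalently, when launching with mpirun:
# mpirun --mca btl self,vader,tcp -np 2 ./mpi_hello_world.x
```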
Solution: Configure the Slurm TaskProlog option for non-EFA-enabled partitions to set this environment variable according to the compiler/MPI library loaded.
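Slurm evaluates any line of the form `export NAME=value` printed to stdout by the TaskProlog script and sets that variable in the task's environment. A minimal sketch of such a script follows; the partition names are assumptions and should be replaced with this cluster's actual non-EFA partitions:

```shell
#!/bin/bash
# Hypothetical TaskProlog sketch: Slurm injects "export NAME=value" lines
# printed to stdout into the spawned task's environment.

set_fabrics_for_partition() {
  # Assumed non-EFA partition names -- replace with your cluster's partitions.
  local non_efa=" m6i serial "
  case "${non_efa}" in
    *" $1 "*)
      # Intel MPI: fall back to shared memory only (no EFA/OFI provider)
      echo "export I_MPI_FABRICS=shm"
      ;;
  esac
}

set_fabrics_for_partition "${SLURM_JOB_PARTITION}"
```

Extending the case body per loaded MPI library (e.g. emitting `OMPI_MCA_btl` for Open MPI instead) keeps the compiler/MPI-specific logic in one place.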
Procedures
Setting up AWS ParallelCluster
Prerequisites:
- Access to JCSDA's AWS accounts (more information can be found here)
Infrastructure's instructions for setting up an AWS ParallelCluster for Skylab are found in the README.