Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html

Table of Contents

Procedures

Setting up AWS ParallelCluster

Prerequisites: 

  • Access to JCSDA's AWS accounts (more information can be found here)

Infrastructure's instructions for setting up an AWS ParallelCluster for Skylab are found in the README.

Troubleshooting

MPI job hangs across nodes

According to jedi-tools issue #234, MPI jobs that span multiple nodes hang during MPI calls when they run on compute nodes without EFA enabled (e.g. m6i-xlarge). A simple "Hello world" example is found here. Therefore, when Skylab/EWOK jobs are allowed to run on different partitions during an experiment, the MPI fabrics environment needs to be modified for the non-EFA partitions.

...

Solution: Configure the Slurm task prolog (the TaskProlog option in slurm.conf) so that, for jobs on non-EFA-enabled partitions, it sets this environment variable according to the compiler/MPI library loaded.
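
As a rough illustration of that solution, the sketch below shows one way a task prolog script could be written. It relies on the documented Slurm behavior that any line a TaskProlog program prints in the form "export NAME=value" is added to the task's environment. Everything else is an assumption made for the example and is not taken from this page or from issue #234: the partition name in NON_EFA_PARTITIONS, the module-name prefixes, the variable names and values in MPI_SETTINGS, and the use of SLURM_JOB_PARTITION and LOADEDMODULES to detect the partition and the loaded MPI library. Replace them with the settings that apply to the cluster.

```python
#!/usr/bin/env python3
"""Sketch of a Slurm TaskProlog script for non-EFA partitions (assumptions marked)."""
import os

# ASSUMPTION: names of the partitions whose compute nodes lack EFA.
NON_EFA_PARTITIONS = {"m6i-xlarge"}

# ASSUMPTION: module-name prefix -> (variable, value) to export.
# These names and values are placeholders, not the project's confirmed settings.
MPI_SETTINGS = {
    "intel-oneapi-mpi": ("I_MPI_OFI_PROVIDER", "tcp"),
    "openmpi": ("FI_PROVIDER", "tcp"),
}


def prolog_lines(partition, loaded_modules):
    """Return the 'export NAME=value' lines the task prolog should print."""
    if partition not in NON_EFA_PARTITIONS:
        return []  # EFA-enabled partition: leave the environment untouched
    lines = []
    # loaded_modules is a colon-separated list such as "gcc/11:openmpi/4.1"
    for module in loaded_modules.split(":"):
        for prefix, (name, value) in MPI_SETTINGS.items():
            line = f"export {name}={value}"
            if module.startswith(prefix) and line not in lines:
                lines.append(line)
    return lines


if __name__ == "__main__":
    # ASSUMPTION: both variables are visible in the task prolog's environment.
    partition = os.environ.get("SLURM_JOB_PARTITION", "")
    modules = os.environ.get("LOADEDMODULES", "")
    for line in prolog_lines(partition, modules):
        print(line)  # Slurm applies each printed "export NAME=value" to the task
```

Because TaskProlog is a cluster-wide setting rather than a per-partition one, the script checks the partition itself and prints nothing for EFA-enabled partitions, which leaves their environment unchanged.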
