Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html
Procedures
Setting up AWS ParallelCluster
Prerequisites:
- Access to JCSDA's AWS accounts (more information can be found here)
Infrastructure's instructions for setting up an AWS ParallelCluster for Skylab are found in the README.
Troubleshooting
MPI job hangs across nodes
As reported in jedi-tools issue #234, MPI jobs that span multiple nodes hang during MPI calls when they run on compute nodes without EFA enabled (e.g. m6i-xlarge). A simple "Hello world" example is found here. Therefore, when Skylab/EWOK jobs are allowed to run on different partitions during an experiment, the MPI fabrics environment needs to be modified.
...
Solution: We need to configure the SLURM task_prolog option for non-EFA enabled partitions to set this environment variable according to the compiler/MPI library loaded.
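The solution above can be sketched as a small task prolog script. This is an illustration under stated assumptions, not the configuration used on the JCSDA clusters. It relies on the Slurm behavior that lines of the form `export NAME=value` printed by a task prolog are added to the task's environment. The partition name, the module-name markers, and the variable names and values are all hypothetical placeholders, since the actual environment variable is not spelled out here. The sketch also assumes that `SLURM_JOB_PARTITION` and `LOADEDMODULES` are visible in the prolog's environment.

```python
#!/usr/bin/env python3
"""Sketch of a Slurm task prolog for non-EFA partitions (placeholders only)."""
import os

# Hypothetical partition names; replace with the cluster's non-EFA partitions.
NON_EFA_PARTITIONS = {"example-non-efa"}

# Hypothetical mapping: substring of the loaded module list -> (variable, value).
# The real variable and value depend on the compiler/MPI library in use.
FABRIC_SETTINGS = {
    "example-mpi-a": ("EXAMPLE_FABRIC_VAR_A", "example_value_a"),
    "example-mpi-b": ("EXAMPLE_FABRIC_VAR_B", "example_value_b"),
}


def prolog_lines(env):
    """Return the 'export NAME=value' lines the prolog should print."""
    # Only act on partitions that lack EFA (assumed to be listed above).
    if env.get("SLURM_JOB_PARTITION", "") not in NON_EFA_PARTITIONS:
        return []
    # LOADEDMODULES is the colon-separated list kept by Environment Modules/Lmod.
    loaded = env.get("LOADEDMODULES", "")
    return [
        f"export {name}={value}"
        for marker, (name, value) in FABRIC_SETTINGS.items()
        if marker in loaded
    ]


if __name__ == "__main__":
    # Slurm reads these lines from the prolog's standard output.
    for line in prolog_lines(os.environ):
        print(line)
```

Keeping the decision in a pure function (`prolog_lines`) makes it possible to check the partition and module logic without submitting a job.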