
Landing page

...

: https://noaa.parallel.works

  • Requires a NOAA RDHPCS account and your case-sensitive NOAA ID and password

NOAA AIM for managing ParallelWorks projects

...

: https://aim.rdhpcs.noaa.gov

  • PI only
  • Requires a NOAA RDHPCS account

Instructions for setting up clusters (one-time setup and every-time steps)

Generic instructions for JCSDA and EPIC

AWS (not available to us at the moment): https://github.com/JCSDA/spack-stack/blob/develop/configs/sites/noaa-aws/README.md

...

Gcloud: https://github.com/JCSDA/spack-stack/blob/develop/configs/sites/noaa-gcloud/README.md

Additional instructions for JCSDA

https://github.com/JCSDA-internal/jedi-tools/blob/develop/ParallelWorks/README.md

JCSDA ParallelWorks

...

Covers logging in, user setup, how to use, ...

...

Quick Start

Instructions from Dom's Google Doc. Last modified by F. Hebert, July 30, 2024.

  1. Log in to https://noaa.parallel.works/ with your case-sensitive NOAA ID and password
  2. Warning! The large green on/off buttons next to the storage and compute resources turn the shared resources on or off immediately, with little or no confirmation!
  3. Upload your public SSH key: click on your name on the top right → Account → Authentication → Add SSH Key; this key is then available for all clusters
  4. Log in: ssh [-i private_key_if_not_default] User.Name@IP
  5. Gcloud IP: 34.172.131.70 (after logging in, you’re on gclusternoaav2usc1c2dv2-8)
  6. After logging in, check that X forwarding works (requires connecting with ssh -X/-Y or with ForwardX11 enabled in your SSH config): xclock
  7. Set up necessary modules
    Code Block
    module purge
    module unuse /opt/cray/craype/default/modulefiles
    module unuse /opt/cray/modulefiles
    module use /contrib/spack-stack/modulefiles
    module load cmake/3.27.2
    module load ecflow/5.8.4

  8. Run a few basic sanity checks and one-offs
    Code Block
    ecflow_ui
    git lfs install --skip-repo
    git config --global credential.helper store
    git config --global user.name "Your Name"
    git config --global user.email "your.email@domain.com"
    # Create your .aws/{config,credentials} as per jedi-docs

  9. Set up Skylab root directory and script
    Code Block
    cd /lustre
    mkdir -p skylab_user.name
    cd /lustre/skylab_user.name
    git clone https://github.com/jcsda-internal/jedi-tools
    ln -sf jedi-tools/buildscripts/setup.sh .
    
    # Edit setup script:
    JEDI_ROOT=/lustre/skylab_user.name
    HOST=pw-gcloud
    COMPILER=intel
    # Further down (in section `Load JEDI modules`) update the FMS version
    module unload fms/release-jcsda
    module load fms/202304
    
    # Sourcing setup.sh will create your venv if it doesn’t exist
    source setup.sh

  10. Build and run ctests
    Code Block
    # Build everything - change branch names as needed in the script
    ./jedi-tools/buildscripts/build_skylab.sh 2>&1 | tee build_skylab.log
    
    # Run ctest on login node if so desired
    cd build 
    ctest 2>&1 | tee log.ctest

  11. Run your experiments
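The SSH setup in steps 3-6 can be collected into a host alias in ~/.ssh/config (a sketch, not part of the official instructions; the alias name pw-gcloud, the key path, and the user name are placeholders, and the IP may change when the cluster is recreated):

```
# ~/.ssh/config -- hypothetical alias for the Gcloud cluster
Host pw-gcloud
    HostName 34.172.131.70        # Gcloud IP from step 5; may change
    User First.Last               # your case-sensitive NOAA ID
    IdentityFile ~/.ssh/id_rsa    # private key matching the public key uploaded in step 3
    ForwardX11 yes                # X forwarding for the xclock check in step 6
```

With this in place, `ssh pw-gcloud` replaces the full ssh command from step 4.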

Sync R2D2 data stores and EWOK_STATIC_DATA

Check the contents of the following scripts and run them via copy & paste, using your Orion user account.

Code Block
# Run as Dom (sudo su, then su - Dom.Heinzeller)
cat /contrib/jedi/rsync-ewok-static-data-from-orion.sh

# Run as root (sudo su):
cat /contrib/jedi/rsync-r2d2-4denvar-msu-from-orion.sh
cat /contrib/jedi/rsync-r2d2-archive-msu-from-orion.sh
cat /contrib/jedi/rsync-r2d2-gfsensemble-msu-from-orion.sh
cat /contrib/jedi/rsync-r2d2-mpasensemble-msu-from-orion.sh

Notes

  • The /lustre filesystem (where the JEDI/Skylab code lives and where the experiments run) is somewhat fragile on Gcloud. I found that reducing the number of parallel jobs on the head node and in the Slurm queue makes it work better. I made the change in R2D2, so no changes are needed on the user side. Don’t run too many experiments at once, or you will end up with errors like “transport endpoint shutdown” (rerunning such failed jobs helps).
  • One problem on ParallelWorks is checking out Git LFS code. Even with Git LFS enabled and everything set up (.gitconfig etc.), checkouts always hit the bandwidth rate limit. What works best is to check out the code locally (while on a fast network) or on an HPC, run cmake, and then rsync it across to /lustre/skylab_user.name/jedi-bundle/ using the SSH key you stored in ParallelWorks at the beginning.
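The local-checkout workaround can be sketched as follows (assumptions: the jedi-bundle repository URL, the destination path, and that your ParallelWorks SSH key is your default key; adjust branch names and paths to your setup):

```
# On a fast network (laptop or HPC), clone and fetch the LFS objects there:
git clone https://github.com/jcsda-internal/jedi-bundle
cd jedi-bundle
git lfs pull                       # LFS download happens off the rate-limited path

# Run cmake once so the bundle clones its component repositories:
mkdir -p build && cd build && cmake .. && cd ..

# Copy the populated checkout to the cluster, authenticating with the
# SSH key you uploaded to ParallelWorks:
rsync -avz ./ User.Name@34.172.131.70:/lustre/skylab_user.name/jedi-bundle/
```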