Landing page
...
- Requires NOAA RDHPCS account, case-sensitive NOAA ID and password
NOAA AIM for managing ParallelWorks projects
...
- PI only
- Requires NOAA RDHPCS account
Instructions for setting up clusters (every-time and one-time-only steps)
Generic instructions for JCSDA and EPIC
AWS (not available to us at the moment): https://github.com/JCSDA/spack-stack/blob/develop/configs/sites/noaa-aws/README.md
...
Gcloud: https://github.com/JCSDA/spack-stack/blob/develop/configs/sites/noaa-gcloud/README.md
Additional instructions for JCSDA
https://github.com/JCSDA-internal/jedi-tools/blob/develop/ParallelWorks/README.md
JCSDA ParallelWorks
...
Covers logging in, user setup, how to use, ...
...
Quick Start
Instructions from Dom's google doc. Last modified by F. Hebert July 30, 2024.
- Log into https://noaa.parallel.works/ with case-sensitive NOAA ID and password
- Warning! The large green on/off buttons next to the storage and compute resources turn these shared resources on or off with little or no warning!
- Upload your public SSH key: click on your name on the top right → Account → Authentication → Add SSH Key; this key is then available for all clusters
- Log in: ssh [-i private_key_if_not_default] User.Name@IP
- IP gcloud: 34.172.131.70 (after logging in, you’re on gclusternoaav2usc1c2dv2-8)
- After logging in, check if X forwarding works:
xclock
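Optionally, an entry in ~/.ssh/config saves retyping the IP and enables X forwarding automatically. This is a minimal sketch only: the alias name, username, and key path are placeholders; the IP is the Gcloud address above.
Code Block
# ~/.ssh/config (illustrative; alias, user, and key path are placeholders)
Host pw-gcloud
    HostName 34.172.131.70
    User First.Last
    IdentityFile ~/.ssh/id_ed25519_pw
    ForwardX11 yes
# then log in with: ssh pw-gcloud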
- Set up necessary modules
Code Block
module purge
module unuse /opt/cray/craype/default/modulefiles
module unuse /opt/cray/modulefiles
module use /contrib/spack-stack/modulefiles
module load cmake/3.27.2
module load ecflow/5.8.4
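Optionally, verify that the modules resolved before moving on (quick sanity check, nothing ParallelWorks-specific):
Code Block
# Optional: confirm the tools above are on PATH and at the expected versions
module list
which cmake && cmake --version
which ecflow_ui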
- Run a few basic sanity checks and one-offs
Code Block
ecflow_ui
git lfs install --skip-repo
git config --global credential.helper store
git config --global user.name "Your Name"
git config --global user.email "your.email@domain.com"
# Create your .aws/{config,credentials} as per jedi-docs
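For reference, ~/.aws/config and ~/.aws/credentials use the standard AWS CLI file layout sketched below. The profile name, region, and keys are placeholders only; take the real values from jedi-docs.
Code Block
# ~/.aws/config (layout illustration only; real values per jedi-docs)
[default]
region = us-east-1

# ~/.aws/credentials (placeholders; never commit real keys)
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx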
- Set up Skylab root directory and script
Code Block
cd /lustre
mkdir -p skylab_user.name
cd /lustre/skylab_user.name
git clone https://github.com/jcsda-internal/jedi-tools
ln -sf jedi-tools/buildscripts/setup.sh .
# Edit the setup script:
#   JEDI_ROOT=/lustre/skylab_user.name
#   HOST=pw-gcloud
#   COMPILER=intel
# Further down (in section `Load JEDI modules`), update the FMS version:
#   module unload fms/release-jcsda
#   module load fms/202304
# Sourcing setup.sh will create your venv if it doesn't exist
source setup.sh
- Build and run ctests
Code Block
# Build everything - change branch names as needed in the script
./jedi-tools/buildscripts/build_skylab.sh 2>&1 | tee build_skylab.log
# Run ctest on login node if so desired
cd build
ctest 2>&1 | tee log.ctest
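Standard ctest options can be used to narrow or parallelize the run; the test name pattern below is only an example:
Code Block
ctest -R ufo -j 4 2>&1 | tee log.ctest-ufo   # only tests matching "ufo", four at a time
ctest --rerun-failed --output-on-failure     # rerun failures with full output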
- Run your experiments
Sync R2D2 data stores and EWOK_STATIC_DATA
Check the contents of the following scripts and run them via copy & paste, using your Orion user.
Code Block
# Run as Dom (sudo su + su - Dom.Heinzeller)
cat /contrib/jedi/rsync-ewok-static-data-from-orion.sh
# Run as root (sudo su):
cat /contrib/jedi/rsync-r2d2-4denvar-msu-from-orion.sh
cat /contrib/jedi/rsync-r2d2-archive-msu-from-orion.sh
cat /contrib/jedi/rsync-r2d2-gfsensemble-msu-from-orion.sh
cat /contrib/jedi/rsync-r2d2-mpasensemble-msu-from-orion.sh
Notes
- The /lustre filesystem (where the JEDI/Skylab code lives and where the experiments run) is somewhat fragile on Gcloud. I found that it works better when the number of parallel jobs on the head node and in the Slurm queue is reduced; I made that change in R2D2, so no changes are needed on the user side. Don’t run too many experiments at once or you will end up with errors like “transport endpoint shutdown” (rerunning such failed jobs helps).
- One problem on ParallelWorks is checking out git lfs code. Even with git lfs enabled and everything set up (.gitconfig etc.), it always hits the bandwidth rate limit. What works best is to check out the code locally (on a fast network) or on an HPC, run cmake there, and then rsync it across to /lustre/skylab_user.name/jedi-bundle/ using the SSH key you stored in ParallelWorks at the beginning (see the rsync sketch below).
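As a sketch of that workaround (source path, username, and key file are placeholders; the IP is the Gcloud address from the Quick Start), run something like this from the machine that holds the prepared checkout:
Code Block
# Push a locally prepared jedi-bundle to the ParallelWorks /lustre filesystem
rsync -avz -e "ssh -i ~/.ssh/id_ed25519_pw" \
    /path/to/jedi-bundle/ \
    First.Last@34.172.131.70:/lustre/skylab_user.name/jedi-bundle/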