If you need immediate JEDI infrastructure support, please post in the #jedi-infra-support Slack channel. As we build our knowledge base, we will do our best to document past issues and common questions in this wiki. Before reaching out for help, please search for your issue here, in the Slack channel, and in the JEDI Documentation.
Troubleshooting Tips
- Make sure your code is up to date.
- Try deleting your old venv and starting with fresh installs of solo, r2d2, ewok, and simobs.
- Rebuild jedi-bundle using the scripts available in jedi-tools' build_skylab.sh.
- Make sure your environment is set up correctly. Protip: use jedi-tools' setup.sh. We keep the HPC setup scripts up to date with the most recent release of spack-stack.
- Did you restart the ecflow server?
Previous Issues
Many many ctests failing after new build
A large number of ctests failing after a fresh JEDI build indicates that something is seriously wrong with your build. Before panicking, first check whether git lfs is working properly. You might simply have failed to download the files needed for the ctests. Check your ~/.gitconfig for the section:
[filter "lfs"]
    clean = git-lfs clean -- %f
    smudge = git-lfs smudge -- %f
    process = git-lfs filter-process
    required = true
Some HPC systems require you to load the git-lfs module. This is the case for Hercules and Discover. More notes on git lfs can be found in the JEDI docs on git-lfs on HPCs and in the JEDI docs on git lfs for developers.
module show git-lfs
module load git-lfs
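The ~/.gitconfig check described above can be sketched in Python. This is only an illustrative helper, not part of jedi-tools: the function name `has_lfs_filter` and the set of required keys are assumptions taken from the section shown above, and the sketch ignores settings pulled in through git's include mechanism.

```python
import re
from pathlib import Path

# Keys expected inside the [filter "lfs"] section, as listed above.
REQUIRED_KEYS = {"clean", "smudge", "process", "required"}


def has_lfs_filter(gitconfig_text: str) -> bool:
    """Return True if the text has a [filter "lfs"] section with all expected keys."""
    in_lfs_section = False
    found = set()
    for raw_line in gitconfig_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("["):
            # A new section starts; note whether it is the LFS filter section.
            in_lfs_section = (
                re.fullmatch(r'\[\s*filter\s+"lfs"\s*\]', line) is not None
            )
            continue
        if in_lfs_section and "=" in line:
            found.add(line.split("=", 1)[0].strip().lower())
    return REQUIRED_KEYS <= found


if __name__ == "__main__":
    config_path = Path.home() / ".gitconfig"
    text = config_path.read_text() if config_path.exists() else ""
    print("git-lfs filter configured:", has_lfs_filter(text))
```

If this reports False, the filter section is probably missing from ~/.gitconfig, which is consistent with ctest data files not having been downloaded.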
FATAL: At least one pe in pelist is not used by any tile in the mosaic
The following error was received during the variational task in an EWOK experiment:
FATAL from PE 44: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic
Cause: According to the ALGO team, this usually implies a mismatch between the number of MPI tasks used in the run and the fv3 layout, i.e. how the fv3 tiles are split up in the geometry section of the yaml.
Solution: Check the layout and MPI configuration. In this case, the requirement was layout1 x layout2 x 6 = nodes x tasks per node, where 6 is the number of cubed-sphere tiles.
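The arithmetic in the solution above can be checked before submitting a job. The following is a minimal sketch; the function name `check_fv3_layout` and its parameters are hypothetical and not part of EWOK or JEDI.

```python
def check_fv3_layout(layout_x, layout_y, nodes, tasks_per_node, ntiles=6):
    """Compare the MPI tasks implied by the fv3 layout with the tasks requested.

    Returns (matches, required_tasks, available_tasks).
    """
    # Each of the ntiles tiles is split into layout_x * layout_y subdomains,
    # and each subdomain needs one MPI task.
    required = layout_x * layout_y * ntiles
    available = nodes * tasks_per_node
    return required == available, required, available


if __name__ == "__main__":
    ok, required, available = check_fv3_layout(4, 4, nodes=2, tasks_per_node=40)
    print(f"match={ok}, layout needs {required} tasks, job provides {available}")
```

In the example call, a 4 x 4 layout needs 96 tasks while 2 nodes with 40 tasks each provide only 80, so the check fails.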
Plots and figures not showing up on experiments.jcsda.org
The user was able to see in the log that media files were being copied to the r2d2-experiments-jcsda-noaa-aws-us-east-1 bucket, but they were not being populated on the website.
Cause: The username for the experiment was blank. Publishing media uses a username harvested from the GitHub credentials. The actual value of the username does not matter, but it must not be blank.
Solution: Update the ~/.gitconfig on the machine where you are running to include a section like the following. Note that R2D2 extracts the <your_email> portion before the "@" symbol and uses it as the name, so you can enter a fake email to customize the name listed on the website.
[user]
    name = <github_user_name>
    email = <your_email>@gmail.com
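To illustrate how the displayed name relates to the configured email, here is a small sketch of the behavior described above. It is not R2D2's actual code; the function name `display_name_from_email` and the example address are made up for illustration.

```python
def display_name_from_email(email: str) -> str:
    """Return the part of the email before the first "@" symbol."""
    # partition() returns the text before "@", the "@" itself, and the rest.
    local_part, _, _ = email.partition("@")
    return local_part


if __name__ == "__main__":
    # A fake address is enough to control the name shown on the website.
    print(display_name_from_email("skylab_tester@gmail.com"))
```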
Updating CMakeLists.txt to use TAG
Instead of building with the "BRANCH" keyword, which typically points to develop, you can specify a GitHub commit hash using the "TAG" keyword. See the following example.
ecbuild_bundle( PROJECT oops GIT "https://github.com/jcsda-internal/oops.git" TAG <git commit hash> )
PR's CI test is stuck in the queue
Inside the pull request, the CI test shows the message:
Queued — Waiting to run this check …
Cause: The job exited as soon as the container was invoked, without emitting any useful logs. Frustratingly, GitHub doesn't have a mechanism to set a job timeout, so if the runner dies without updating the check run, the status stays at waiting forever (and GitHub seems fine with this status even if it's hostile to users and developers). You shouldn't worry about leaving hanging check runs: our runner backends do have useful timeouts, and if they get disconnected from GitHub they will clean up their resources even if they can't report back.
Solution: Retrigger CI
r2d2.error.RegistrationNotFound.RegistrationNotFound
The following error was given by R2D2:
Traceback (most recent call last):
  File "/work2/noaa/jcsda/smaticka/data_repos/feedback_files/c3762d_8dayAprMay_24HforeC_eval/r2d2_experiment_fetch.py", line 9, in <module>
    for search_result in R2D2Data.search(item='feedback', experiment=experiment):
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_data.py", line 711, in search
    r2d2_data.validate_search_kwargs(kwargs)
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_item.py", line 258, in validate_search_kwargs
    R2D2Item.process_kwargs(kwargs)
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_item.py", line 377, in process_kwargs
    R2D2Index.process_index_item_kwarg(kwargs, item)
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_index.py", line 171, in process_index_item_kwarg
    raise err.RegistrationNotFound(item, kwargs[item])
r2d2.error.RegistrationNotFound.RegistrationNotFound: c3762d is not registered in experiment yet! You must manually register this Name using R2D2Index.register() method.
Cause: the experiment "c3762d" was not found because it had been deleted by the R2D2 scrubber based on its "lifetime".
Solution: the user will need to rerun the original experiment and update the expid. If a longer lifetime is required, then see R2D2's tutorial document for updating lifetime.
sbatch: error: Invalid account or account/partition
The following message was received when submitting a skylab experiment on Orion:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
Cause: the user did not have access to the correct groups in order to run experiments.
Solution: email the POC for the HPC to grant access to the jcsda groups.
skylab.jcsda.org or experiments.jcsda.org is not responding
Occasionally skylab.jcsda.org might not respond. If that is the case, the easiest solution is to reboot the machine it runs on via the AWS Console. You can ask a member of the infrastructure team to do this for you. This procedure is documented in the Web Apps page.
Updating R2D2 file lifetime
If you cannot find previously ingested files in R2D2, chances are they have been scrubbed. Scrubbing is based on the value set for the experiment's "lifetime". "lifetime" is typically set to "default", which is 2 weeks. You can update this using a simple R2D2 script:
from r2d2 import R2D2Index
R2D2Index.update(item='experiment', name='b00b7f', key='lifetime', value='science')
The current "lifetime" values are "debug" (14 days), "science" (180 days, about 6 months), "publication" (1825 days, 5 years), and "release" (kept indefinitely). More information can be found in R2D2's tutorial document.
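To estimate when an experiment becomes eligible for scrubbing, the lifetime values listed above can be turned into a small lookup. This is a sketch for planning purposes only, not R2D2's scrubber logic; `LIFETIME_DAYS` and `scrub_eligible_on` are hypothetical names, and the day counts are copied from the list above.

```python
from datetime import date, timedelta

# Lifetime values as documented above; None means the data is kept indefinitely.
LIFETIME_DAYS = {
    "debug": 14,
    "science": 180,
    "publication": 1825,
    "release": None,
}


def scrub_eligible_on(created: date, lifetime: str):
    """Return the date after which data may be scrubbed, or None if never."""
    days = LIFETIME_DAYS[lifetime]  # raises KeyError for unknown lifetimes
    if days is None:
        return None
    return created + timedelta(days=days)


if __name__ == "__main__":
    created = date(2024, 4, 10)
    for name in LIFETIME_DAYS:
        print(name, scrub_eligible_on(created, name))
```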
Experiment scrubber
There is a scrubber in place that cleans up old experiments based on R2D2's "lifetime" key. Some common questions about this scrubber are addressed below.
Missing experiments and plots at https://experiments.jcsda.org/: If more than 2 weeks have passed since the experiment ran, it is likely that "lifetime" was set to the default and R2D2's cleanup worked as expected. If you need experiments to stay around longer, see the section "Updating R2D2 file lifetime" above.
Older experiments are showing up: The scrubber is working hard but has a big backlog to get to. If you see older experiments then it could be a sign that the scrubber is still weeding through these.
R2D2 install protobuf error
If you are using spack-stack 1.7.0, you might get the following error message when installing R2D2:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cylc-flow 8.2.3 requires protobuf<4.22.0,>=4.21.2, but you have protobuf 3.20.1 which is incompatible.
Successfully installed protobuf-3.20.1 r2d2-2.3.0
This message is safe to ignore for now as it says "Successfully installed protobuf-3.20.1 r2d2-2.3.0".
Resolved: R2D2 PR
MySQL exceptions and errors
Errors such as:
raise get_mysql_exception(
mysql.connector.errors.ProgrammingError: 1146 (42S02): Table 'r2d2.item' doesn't exist
indicate that your local R2D2 MySQL database is out of date. In order to update your local database, follow the instructions in R2D2's tutorial document.