If you need immediate JEDI infrastructure support, please post in the #jedi-infra-support Slack channel. As we build our knowledge base, we will do our best to document past issues and common questions in this wiki. Before reaching out for help, please search for your issue here, in the Slack channel, and in the JEDI Documentation.
...
Troubleshooting Tips
- Make sure your code is up to date.
- Try deleting your old venv and starting with fresh installs of solo, r2d2, ewok, and simobs.
- Rebuild jedi-bundle using the build_skylab.sh script in jedi-tools.
- Make sure your environment is set up correctly. Protip: use jedi-tools' setup.sh. We keep the HPC setup scripts up to date with the most recent release of spack-stack.
- Did you restart the ecFlow server?
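The "fresh venv" tip can be scripted. The following is a minimal Python sketch using the standard-library `venv` module. It assumes you run it with the Python interpreter from your loaded spack-stack environment and that solo, r2d2, ewok, and simobs are local checkouts installed in editable mode (`pip install -e`); the function name `recreate_venv` and the checkout paths are hypothetical, so follow the JEDI Documentation if its install steps differ.

```python
import subprocess
import venv
from pathlib import Path


def recreate_venv(venv_dir, package_dirs=(), with_pip=True):
    """Delete venv_dir if it exists and build a fresh virtual environment.

    package_dirs are local checkouts (hypothetical paths) that are
    installed into the new environment in editable mode.
    """
    if package_dirs and not with_pip:
        raise ValueError("installing packages requires with_pip=True")

    venv_dir = Path(venv_dir)
    # clear=True empties an existing venv_dir before creating the environment,
    # so nothing from the old install survives.
    venv.EnvBuilder(clear=True, with_pip=with_pip).create(venv_dir)

    # On Linux HPC systems the environment's interpreter is bin/python.
    python = venv_dir / "bin" / "python"
    for pkg in package_dirs:
        # Assumption: each package is installed with `pip install -e <checkout>`.
        subprocess.run([str(python), "-m", "pip", "install", "-e", str(pkg)],
                       check=True)
    return python
```

For example, `recreate_venv("venv", ["jedi-bundle/solo", "jedi-bundle/r2d2", "jedi-bundle/ewok", "jedi-bundle/simobs"])` rebuilds `./venv` and reinstalls the four packages (the paths are placeholders). The sketch does not activate the environment in your shell, so run `source venv/bin/activate` afterwards.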
Previous Issues
Updating CMakeLists.txt to use TAG
Instead of building from a branch with the "BRANCH" keyword (typically pointing to develop), you can pin a repository to a specific GitHub commit hash with the "TAG" keyword, as in the following example.
```cmake
ecbuild_bundle( PROJECT oops GIT "https://github.com/jcsda-internal/oops.git" TAG <git commit hash> )
```
PR's CI test is stuck in the queue
Inside the pull request, the CI test shows the message:
```text
Queued — Waiting to run this check …
```
Cause: the job exited as soon as the container was invoked, without emitting any useful logs. GitHub has no mechanism for setting a job timeout, so if the runner dies without updating the check run, the check stays in the waiting state indefinitely (and GitHub accepts this state, however unhelpful it is for users and developers). You do not need to worry about leaving hanging check runs: our runner backends have their own timeouts, and if they lose their connection to GitHub they clean up their resources even when they cannot report back.
Solution: retrigger the CI.
r2d2.error.RegistrationNotFound.RegistrationNotFound
R2D2 raised the following error:
```text
Traceback (most recent call last):
  File "/work2/noaa/jcsda/smaticka/data_repos/feedback_files/c3762d_8dayAprMay_24HforeC_eval/r2d2_experiment_fetch.py", line 9, in <module>
    for search_result in R2D2Data.search(item='feedback', experiment=experiment):
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_data.py", line 711, in search
    r2d2_data.validate_search_kwargs(kwargs)
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_item.py", line 258, in validate_search_kwargs
    R2D2Item.process_kwargs(kwargs)
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_item.py", line 377, in process_kwargs
    R2D2Index.process_index_item_kwarg(kwargs, item)
  File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_index.py", line 171, in process_index_item_kwarg
    raise err.RegistrationNotFound(item, kwargs[item])
r2d2.error.RegistrationNotFound.RegistrationNotFound:
c3762d is not registered in experiment yet!
You must manually register this Name using R2D2Index.register() method.
```
Cause: the experiment "c3762d" was not found because the R2D2 scrubber had already deleted it once its "lifetime" expired.
Solution: rerun the original experiment and update the expid accordingly. If you need the data to persist longer, see the R2D2 tutorial for how to update an experiment's lifetime.
sbatch: error: Invalid account or account/partition
The following message was received when submitting a Skylab experiment on Orion:
```text
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
```
Cause: the user was not a member of the groups required to run experiments.
Solution: email the HPC's point of contact (POC) and ask to be granted access to the jcsda groups.
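Before emailing the POC, it can help to confirm which Slurm accounts you are actually associated with. The following Python sketch assumes Slurm's `sacctmgr` command is available to regular users on the Orion login node (it usually is on Slurm systems) and simply lists your account/partition associations. The helper names `parse_associations` and `slurm_associations` are hypothetical, and the sketch does not cover which account a Skylab experiment is configured to charge.

```python
import getpass
import subprocess


def parse_associations(text):
    """Turn `sacctmgr -n -P` output ("account|partition" per line) into tuples."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        account, _, partition = line.partition("|")
        # An empty partition typically means the association is not limited
        # to a single partition.
        pairs.append((account, partition))
    return pairs


def slurm_associations(user=None):
    """Ask Slurm which (account, partition) pairs `user` may submit with."""
    user = user or getpass.getuser()
    # -n: no header line; -P: '|'-separated fields with no trailing '|'.
    result = subprocess.run(
        ["sacctmgr", "-n", "-P", "show", "association",
         f"user={user}", "format=Account,Partition"],
        capture_output=True, text=True, check=True,
    )
    return parse_associations(result.stdout)
```

Calling `slurm_associations()` on the login node returns a list of `(account, partition)` tuples. If the account your experiment submits with is not in that list, you will keep seeing the error above until the POC grants access.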
skylab.jcsda.org or experiments.jcsda.org is not responding
...