If you need immediate JEDI infrastructure support, please send a message to the #jedi-infra-support Slack channel. As we build our knowledge base, we will do our best to document past issues and common questions in this wiki. Before reaching out for help, please search for your issue here, in the Slack channel, and in the JEDI Documentation.

Troubleshooting Tips

  • Make sure your code is up to date.
  • Try deleting your old venv and starting with fresh installs of solo, r2d2, ewok, and simobs.
  • Rebuild jedi-bundle using jedi-tools' build_skylab.sh script.
  • Make sure your environment is set up correctly. Protip: use jedi-tools' setup.sh. We keep the HPC setup scripts up to date with the most recent release of spack-stack.
  • Did you restart the ecflow server?

Previous Issues

Updating CMakeLists.txt to use TAG

Instead of building with the "BRANCH" keyword, which typically points to develop, you can pin a repository to a specific git commit hash using the "TAG" keyword. See the following example.

ecbuild_bundle( PROJECT oops GIT "https://github.com/jcsda-internal/oops.git" TAG <git commit hash> )
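
When several repositories need to be pinned at once, it can help to generate these lines instead of editing them by hand. The following is a minimal sketch, not part of jedi-tools: the helper name bundle_line is hypothetical, the hash is a placeholder, and the check assumes full 40-character SHA-1 hashes as shown on GitHub.

```python
import re

# Full 40-character SHA-1 commit hash. Abbreviated hashes are rejected here
# on purpose so the pin stays unambiguous (a choice made for this sketch).
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def bundle_line(project: str, git_url: str, commit_hash: str) -> str:
    """Return an ecbuild_bundle() line that pins `project` to `commit_hash`."""
    if not FULL_SHA1.fullmatch(commit_hash):
        raise ValueError(f"not a full git commit hash: {commit_hash!r}")
    return f'ecbuild_bundle( PROJECT {project} GIT "{git_url}" TAG {commit_hash} )'


# Placeholder hash for illustration only; it does not refer to a real commit.
example_hash = "0123456789abcdef0123456789abcdef01234567"
print(bundle_line("oops", "https://github.com/jcsda-internal/oops.git", example_hash))
```

Passing a branch name such as develop raises a ValueError, which catches the case where a branch was pasted in where a hash was intended.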

PR's CI test is stuck in the queue

Inside the pull request, the CI test shows the message:

Queued — Waiting to run this check …

Cause: the job exited as soon as the container was invoked, without emitting any useful logs. Frustratingly, GitHub doesn't have a mechanism to set a job timeout, so if the runner dies without updating the check run, the status stays at waiting forever (GitHub seems fine with this status even though it's hostile to users and developers). You shouldn't worry about leaving hanging check runs: our runner backends do have useful timeouts, and if they get disconnected from GitHub they will clean up their resources even if they can't report back.

Solution: retrigger CI.

r2d2.error.RegistrationNotFound.RegistrationNotFound

The following error was given by R2D2:

Traceback (most recent call last):
 File "/work2/noaa/jcsda/smaticka/data_repos/feedback_files/c3762d_8dayAprMay_24HforeC_eval/r2d2_experiment_fetch.py", line 9, in <module>
  for search_result in R2D2Data.search(item='feedback', experiment=experiment):
 File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_data.py", line 711, in search
  r2d2_data.validate_search_kwargs(kwargs)
 File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_item.py", line 258, in validate_search_kwargs
  R2D2Item.process_kwargs(kwargs)
 File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_item.py", line 377, in process_kwargs
  R2D2Index.process_index_item_kwarg(kwargs, item)
 File "/work2/noaa/jcsda/smaticka/jedi_ioda_10apr_gnu/jedi-bundle/r2d2/src/r2d2/r2d2_index.py", line 171, in process_index_item_kwarg
  raise err.RegistrationNotFound(item, kwargs[item])
r2d2.error.RegistrationNotFound.RegistrationNotFound: 
c3762d is not registered in experiment yet!
You must manually register this Name using R2D2Index.register() method.

Cause: the experiment "c3762d" was not found because it had been deleted by the R2D2 scrubber based on its "lifetime".

Solution: rerun the original experiment and update the expid. If a longer lifetime is required, see R2D2's tutorial document for updating lifetime.

sbatch: error: Invalid account or account/partition

The following message was received when submitting a skylab experiment on Orion: 

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Cause: the user did not belong to the groups required to run experiments.

Solution: email the HPC's point of contact (POC) and ask to be granted access to the jcsda groups.

skylab.jcsda.org or experiments.jcsda.org is not responding

Occasionally skylab.jcsda.org might not respond. If that is the case, the easiest solution is to reboot the machine it runs on via the AWS Console. You can ask a member of the infrastructure team to do this for you. The procedure is documented in the Web Apps page.

Updating R2D2 file lifetime

If you cannot find previously ingested files in R2D2, chances are they have been scrubbed. Scrubbing is based on the value set for the experiment's "lifetime", which is typically set to "default" (2 weeks). You can update this using a simple R2D2 script:

from r2d2 import R2D2Index
R2D2Index.update(item='experiment', name='b00b7f', key='lifetime', value='science')

The current "lifetime" values are "debug" (14 days), "science" (180 days, about 6 months), "publication" (1825 days, 5 years), and "release" (kept indefinitely). More information can be found in R2D2's tutorial document.
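
To estimate when files might be scrubbed under each setting, the retention periods above can be turned into dates. This is only a sketch: it assumes the clock starts on the ingest date, which this page does not confirm, and the names LIFETIME_DAYS and scrub_date are hypothetical and not part of R2D2.

```python
from datetime import date, timedelta
from typing import Optional

# Retention periods in days, as listed above; None means "kept indefinitely".
LIFETIME_DAYS = {
    "debug": 14,
    "science": 180,
    "publication": 1825,
    "release": None,
}


def scrub_date(ingested: date, lifetime: str) -> Optional[date]:
    """Return the date on which files become eligible for scrubbing, or None if never."""
    days = LIFETIME_DAYS[lifetime]
    if days is None:
        return None
    return ingested + timedelta(days=days)


ingested = date(2024, 4, 10)  # hypothetical ingest date
for name in LIFETIME_DAYS:
    print(name, scrub_date(ingested, name))
```

An unknown lifetime name raises a KeyError rather than silently falling back to a default.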

Experiment scrubber

There is a scrubber in place that cleans up old experiments based on R2D2's "lifetime" key. Some common questions about this scrubber are addressed below.

Missing experiments and plots at https://experiments.jcsda.org/: If it has been over 2 weeks since the experiment ran, it is likely that "lifetime" was set to the default and R2D2's cleanup worked as expected. If you need experiments to stay around longer, see the section "Updating R2D2 file lifetime" above.

Older experiments are showing up: The scrubber is working hard but has a big backlog to get through. If you see older experiments, it could be a sign that the scrubber is still working through them.

R2D2 install protobuf error

If you are using spack-stack 1.7.0, you might get the following error message when installing R2D2:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cylc-flow 8.2.3 requires protobuf<4.22.0,>=4.21.2, but you have protobuf 3.20.1 which is incompatible.
Successfully installed protobuf-3.20.1 r2d2-2.3.0

This message is safe to ignore for now, since pip goes on to report "Successfully installed protobuf-3.20.1 r2d2-2.3.0".

Resolved: R2D2 PR
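
To see why pip flags the conflict, compare the installed protobuf version against the range cylc-flow requests. This sketch handles only plain dotted numeric versions, while pip itself follows the full PEP 440 rules. The helper names parse_version and satisfies are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a plain dotted version such as '3.20.1' into (3, 20, 1)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed: str, minimum: str, below: str) -> bool:
    """True if minimum <= installed < below, comparing numeric components."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


# Values taken from the pip message above: protobuf<4.22.0,>=4.21.2
print(satisfies("3.20.1", minimum="4.21.2", below="4.22.0"))  # prints False
```

The installed 3.20.1 is below the 4.21.2 minimum, so the check fails and pip reports the incompatibility even though the install itself succeeded.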

MySQL exceptions and errors

Errors such as: 

raise get_mysql_exception(
mysql.connector.errors.ProgrammingError: 1146 (42S02): Table 'r2d2.item' doesn't exist

indicate that your local R2D2 MySQL database is out of date. To update your local database, follow the instructions in R2D2's tutorial document.