
Documentation: https://github.com/JCSDA-internal/ewok/blob/develop/README.md


EWOK Developer Section

EWOK currently uses ECMWF's ecFlow software system to implement workflows. This section describes how EWOK and ecFlow interact to run a Skylab experiment. Instructions for setting up your environment can be found in the JEDI Documentation.

...

As part of EWOK's setup, you will notice two variables that pertain to ecFlow and are needed to run an experiment: EWOK_WORKDIR and EWOK_FLOWDIR. EWOK_WORKDIR is where all experiment files generated by the workflow are saved, such as feedback files, background files, observations, and forecasts. EWOK_FLOWDIR contains the configuration files and the runtime files that get executed. Tip: to test small on-the-fly changes, after kicking off an experiment you can edit the runtime files in EWOK_FLOWDIR and then restart the task. The runtime file in the EWOK repository will not be updated, but this method is useful if you need to force something to work or want to troubleshoot without touching the repo.
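Before starting an experiment it can help to confirm that both variables are set and point to existing directories. The following is a minimal sketch; check_ewok_dirs is a hypothetical helper name and is not part of EWOK.

```shell
# Hypothetical helper (not part of EWOK): report the two ecFlow-related
# variables and whether the directories they point to exist.
check_ewok_dirs() {
    rc=0
    for name in EWOK_WORKDIR EWOK_FLOWDIR; do
        # Read the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value (directory does not exist)"
            rc=1
        else
            echo "$name=$value"
        fi
    done
    # Non-zero return means at least one variable needs attention.
    return "$rc"
}
```

Calling check_ewok_dirs prints one line per variable and returns non-zero if either one is missing.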

FAQ

How to remove an experiment from ecflow?

You can run:

Code Block
ecflow_client --delete=force yes /<exp_id>

To clean up all tasks run:

Code Block
ecflow_client --delete=_all_ force

Note: a full cleanup also requires removing ${EWOK_FLOWDIR}/<exp_id>, ${EWOK_WORKDIR}/<exp_id>, <local experiments dir>/<exp_id>, and <local experiments dir>/<model>/<type>/<exp_id>. An error “ClientInvoker: Connection error” indicates that you need to add the port argument, where the port number is the value reported by "ecflow_start.sh".

Code Block
ecflow_client --port=<int> --delete=_all_ force
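To review what a full cleanup would touch before deleting anything, a dry-run listing can be useful. This is only a sketch: list_experiment_dirs is a hypothetical helper, it covers only the two locations under EWOK_FLOWDIR and EWOK_WORKDIR, and the local experiments directories still have to be checked by hand.

```shell
# Hypothetical helper (not part of EWOK): print the experiment directories
# that exist under EWOK_FLOWDIR and EWOK_WORKDIR. It only prints paths;
# it never deletes anything.
list_experiment_dirs() {
    exp_id=${1:-}
    if [ -z "$exp_id" ]; then
        echo "usage: list_experiment_dirs <exp_id>" >&2
        return 2
    fi
    for base in "${EWOK_FLOWDIR:-}" "${EWOK_WORKDIR:-}"; do
        # Skip unset variables so a path like "/<exp_id>" is never built.
        [ -n "$base" ] || continue
        if [ -d "$base/$exp_id" ]; then
            echo "$base/$exp_id"
        fi
    done
}
```

Once the printed paths look right, they can be removed manually.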

Where are the logs?

While the experiment is running, right-click on the task in the ecflow UI, click output, and pick the file to view (the path to that file is also shown). In some cases (possibly variational experiments, among others) the stdout/stderr logs can be found in ${EWOK_WORKDIR}/<exp_id>/<date>/. After the experiment has completed, the finishExperiment task will have cleaned up many of these logs. In most cases the yamls, jobs and logs for the latest cycle can still be found in ${EWOK_FLOWDIR}/<exp_id>. To prevent this cleanup, suspend the finishExperiment task via the GUI or the command line after starting the experiment.
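When browsing these directories from a terminal, listing the most recently modified files is a quick way to find the logs of the task that just ran. The sketch below uses only standard find, ls and head; recent_files is a hypothetical helper name, and no particular log file naming is assumed.

```shell
# Hypothetical helper: list the most recently modified files under a
# directory, newest first. Intended for directories of moderate size.
recent_files() {
    dir=${1:-}
    count=${2:-10}
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # ls -t sorts by modification time, newest first.
    find "$dir" -type f -exec ls -t {} + | head -n "$count"
}

# Example (replace <exp_id> with your experiment ID):
#   recent_files "${EWOK_FLOWDIR}/<exp_id>" 20
```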

How to run the task with e.g. OOPS_DEBUG on?

In the ecflow UI, right-click on the task and select edit. This brings up a window where you can tick the pre-process box before editing, then edit the script to set the environment variable or change anything else. You can then submit the edited script (top right).

How to check whether the ecflow_server is running?

Run the command:

Code Block
ps -ux | grep ecflow

Note: the server may be running on a different node, e.g. orion-login-4.
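As an alternative sketch, pgrep avoids matching the grep process itself, and ecflow_client --ping asks the server directly whether it is reachable. ecflow_server_status is a hypothetical helper name; the port argument is the value reported by ecflow_start.sh.

```shell
# Hypothetical helper: check for a local ecflow_server process and,
# if a port is given, ping the server through ecflow_client.
ecflow_server_status() {
    port=${1:-}
    # pgrep -f matches against the full command line of each process.
    if pgrep -u "$(id -un)" -f ecflow_server >/dev/null 2>&1; then
        echo "process: ecflow_server found on $(uname -n)"
    else
        echo "process: no ecflow_server on $(uname -n)"
    fi
    # Optional second check: the server may run on another node, so a
    # missing local process does not necessarily mean it is down.
    if [ -n "$port" ]; then
        if ecflow_client --port="$port" --ping >/dev/null 2>&1; then
            echo "ping: server answered on port $port"
        else
            echo "ping: no answer on port $port"
        fi
    fi
}

# Example:
#   ecflow_server_status <int>
```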

What is the best way to find out which experiment yaml was used for a particular experiment?

Follow the instructions available in the JEDI Documentation.

How to run 2 experiments with different versions of ewok on the same machine at the same time?

Once an experiment is running in ecflow, it should no longer depend on your ewok repo. So in principle you can then change branch in ewok and start another experiment. That works if you use your own ewok, not the default one that’s installed on Orion, for example. You should also be able to change ${JEDI_SRC} and ${JEDI_BUILD} to point to another set of executables/yamls between submitting experiments.

Debugging new experiments: an experiment failed on some task and the experiment yaml file needs to be updated. Do I need to create a new experiment, or is there a way to restart the failed task?

You can edit the yaml files in ${EWOK_FLOWDIR}/<exp_id> (and/or rebuild the executable in ${JEDI_BUILD} if you are debugging). Then, in the ecflow UI, right-click on the task and choose rerun. You can also select edit in the right-click menu, which brings up the script for that task; tick the pre-process box (upper right) and you can then edit the script before submitting it. Once you are done debugging, don’t forget to copy the change back to GitHub. For any other change, the pre-processing was done by ewok when create_experiment was executed, so you have to create a new experiment.

When do I need to run pip install -e in ewok? Do changes in suites and/or tasks require that?

If you use "pip install -e", you only need to run it once; ewok will then always use the current version of the code (so you can edit, change branch, etc.).

How to rerun all failed tasks in the family?

Right-click on the family and choose “Requeue aborted”.
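The same can be done from the command line with ecflow_client --requeue=abort, which re-queues only the aborted tasks below a node. A minimal sketch follows; requeue_aborted is a hypothetical wrapper, and the family path (for example /<exp_id>/an) is an assumption to be checked against the node tree shown in the UI.

```shell
# Hypothetical wrapper: re-queue only the aborted tasks below a node.
requeue_aborted() {
    node=${1:-}
    case "$node" in
        /*) ecflow_client --requeue=abort "$node" ;;
        *)  echo "node path must start with '/': $node" >&2
            return 2 ;;
    esac
}

# Example:
#   requeue_aborted /<exp_id>/an
```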

ecflow shows that a task is running, but in fact it is not. How to resolve it?

If you can check the logfile and are sure the task completed correctly, right-click and set complete in the UI. If you know it failed, or cannot tell, right-click, set aborted, and then rerun.
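From the command line, ecflow_client --force sets a node to a given state. The sketch below restricts it to the two states discussed here; force_task_state is a hypothetical wrapper, and the task path must be taken from the node tree of your experiment.

```shell
# Hypothetical wrapper: force a task to "complete" or "aborted".
force_task_state() {
    state=${1:-}
    node=${2:-}
    case "$state" in
        complete|aborted) ;;
        *) echo "state must be 'complete' or 'aborted'" >&2
           return 2 ;;
    esac
    case "$node" in
        /*) ;;
        *) echo "node path must start with '/': $node" >&2
           return 2 ;;
    esac
    ecflow_client --force="$state" "$node"
}

# Example: mark a task as aborted, then rerun it from the UI.
#   force_task_state aborted /<exp_id>/<family>/<task>
```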

How to limit the number of tasks submitted at a time for a particular family?

Code Block
ecflow_client --alter add limit maxtasks 200 /YOUR_EXPERIMENT_ID
ecflow_client --alter add inlimit maxtasks /YOUR_EXPERIMENT_ID/an
# NOTE: Use delete instead of add to remove the limit.

How to make sure the logs in ${EWOK_FLOWDIR} aren’t removed after the experiment is finished?

Suspend the finishExperiment task before the experiment is finished.

How to make sure the logs in ${EWOK_WORKDIR} aren’t removed after the experiment is finished?

Suspend the endCycle task before the cycle is finished. During the endCycle task, the cleanup automatically removes cycle directories that are from the previous two cycles or older.
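Suspending a task can also be done from the command line with ecflow_client --suspend (and undone with --resume). This is a sketch only: suspend_task is a hypothetical wrapper, and the full node path of finishExperiment or endCycle inside the suite is not given here, so copy it from the ecflow UI.

```shell
# Hypothetical wrapper: suspend a node so that it will not be submitted.
suspend_task() {
    node=${1:-}
    case "$node" in
        /*) ecflow_client --suspend="$node" ;;
        *)  echo "node path must start with '/': $node" >&2
            return 2 ;;
    esac
}

# Example (the path below is a placeholder, not the actual suite layout):
#   suspend_task /<exp_id>/<path to>/finishExperiment
# To release the task later:
#   ecflow_client --resume=/<exp_id>/<path to>/finishExperiment
```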