Experiments and Workflows Orchestration Kit

GitHub repository: https://github.com/JCSDA-internal/ewok

Documentation: https://github.com/JCSDA-internal/ewok/blob/develop/README.md

EWOK Developer Section

EWOK currently uses ECMWF's ecFlow software system to implement workflows. This section describes how EWOK and ecFlow interact to run a Skylab experiment. Instructions for setting up your environment can be found in the JEDI Documentation.

EWOK contains 3 important components to the workflow:

  1. Suites - reference tasks and task dependencies. The scripts used for suites are located at {JEDI_SRC}/ewok/src/ewok/suites/.

    A suite will typically build and set up JEDI, define a cycle and loop, and control tasks and triggers for the experiment. All of our suites are written in Python. Within a suite file, you can identify the tasks it calls by looking for the "suite.addTask()" function. A standard call passes a task and its configuration. Any additional inputs to a task are used as triggers. An example of this is:

    rawfile = suite.addTask(ewok.fetchObservations, obsconf)
    iodafiles = suite.addTask(ewok.convertObservations, obsconf, rawfile=rawfile)

    Here "rawfile" is an output of the "fetchObservations" task. Because "rawfile" is passed as an input to the "convertObservations" task, "convertObservations" will not run until the "fetchObservations" task completes. This dependency is what we refer to as a "trigger".

  2. Tasks - select the runtime script and set up the configuration needed at runtime. The task files are kept at {JEDI_SRC}/ewok/src/ewok/tasks. These are all Python scripts. When adding a new task file, you must update ewok/src/ewok/__init__.py. Inside the task file, you can set up variables to be used at runtime. Note that if your runtime script is a bash script you should use "self.RUNTIME_ENV", and if it is a Python script you should use "self.RUNTIME_YAML". The following line tells the workflow which runtime file to execute:

    self.command = os.path.join(os.environ.get("JEDI_SRC"),
                                "ewok/src/runtime/getObservationsRun.py")

  3. Runtime - scripts executed during the experiment. Runtime files are located at {JEDI_SRC}/ewok/src/runtime. These are primarily written in bash or Python and are executed once all of their triggers are satisfied.
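
The trigger behaviour described under Suites can be illustrated with a small stand-alone sketch. This is not the ewok implementation: "SketchSuite", its "runnable" method, and the use of plain strings as tasks are hypothetical stand-ins chosen only to show how passing one task's output to another creates a dependency.

```python
# Hypothetical stand-in for an ewok suite; all names here are illustrative only.
class SketchSuite:
    def __init__(self):
        self.tasks = []      # task names in the order they were added
        self.triggers = {}   # task name -> names of tasks it must wait for

    def addTask(self, name, config, **inputs):
        # Each keyword input is the output handle of an earlier task,
        # so each one becomes a trigger for the task being added.
        self.tasks.append(name)
        self.triggers[name] = [handle["producer"] for handle in inputs.values()]
        return {"producer": name}

    def runnable(self, completed):
        # A task may start once every task it is triggered by has completed.
        return [task for task in self.tasks
                if task not in completed
                and all(dep in completed for dep in self.triggers[task])]


suite = SketchSuite()
obsconf = {"obs": "example"}  # placeholder configuration
rawfile = suite.addTask("fetchObservations", obsconf)
iodafiles = suite.addTask("convertObservations", obsconf, rawfile=rawfile)

print(suite.runnable(set()))                  # before anything has completed
print(suite.runnable({"fetchObservations"}))  # after the fetch task completed
```

In this sketch "convertObservations" only becomes runnable after "fetchObservations" is marked complete, which mirrors the role of the "rawfile" argument in the suite example.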

ecFlow UI 

The ecFlow UI is user friendly, and you can follow the instructions in the JEDI Documentation to get started. After creating an experiment you can use the UI to suspend and rerun tasks, view log files and configuration, and much more. To execute a command on a task, first click on the task name in the UI. The top bar will then show which task you have selected as "your_host → exp_id → task_name". Right-click on the task to open a menu of options and select what you want to run or view. Some helpful hotkeys inside the UI are:

  • Suspend: command S (or ctrl S)
  • Rerun: command U (or ctrl U)
  • Execute: command E (or ctrl E)

The UI will also color the boxes to the left of a task to show the status of that task. The UI will update every 60 seconds. If you want to see the most recent status of the tasks, then click the green refresh arrow at the top left of the screen. The colors of the status boxes mean:

  • Red: aborted
  • Green: active
  • Yellow: completed
  • Blue: queued
  • Cyan: submitted
  • Orange: suspended
  • Grey: unknown status

ecFlow Directories

As part of EWOK's setup, you will notice two variables that pertain to ecFlow and are needed to run an experiment: EWOK_WORKDIR and EWOK_FLOWDIR. EWOK_WORKDIR is where all experiment files generated by the workflow are saved, such as feedback files, background files, observations, and forecasts. EWOK_FLOWDIR contains the configuration files and the runtime files that get executed. Tip: to test small on-the-fly changes, after kicking off an experiment you can edit the runtime files in EWOK_FLOWDIR and then restart the task. The runtime file in the EWOK repository will not be updated, but this method is useful if you need to force something to work or want to troubleshoot without touching the repo.

FAQ

How to remove an experiment from ecflow?

You can run:

ecflow_client --delete=force yes /<exp_id>

To remove all suites from the server, run:

ecflow_client --delete=_all_ force

Note: for a full cleanup you also need to remove ${EWOK_FLOWDIR}/<exp_id>, ${EWOK_WORKDIR}/<exp_id>, <local experiments dir>/<exp_id>, and <local experiments dir>/<model>/<type>/<exp_id>. An error “ClientInvoker: Connection error” indicates that you need to add the port argument, where the port number is the value reported by "ecflow_start.sh":

ecflow_client --port=<int> --delete=_all_ force
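
As a rough sketch of the full cleanup described in the note, the helper below only builds the list of directories so they can be reviewed before anything is deleted. The function name "cleanup_paths" and the example values (experiment id, local experiments directory, model, and type) are hypothetical placeholders; only EWOK_FLOWDIR and EWOK_WORKDIR come from the text.

```python
import os
from pathlib import Path


def cleanup_paths(exp_id, local_exp_dir, model, exp_type, env=os.environ):
    """Return the directories listed in the note for a full cleanup.

    Nothing is deleted here; the caller decides what to do with the list.
    """
    flowdir = Path(env["EWOK_FLOWDIR"])
    workdir = Path(env["EWOK_WORKDIR"])
    local = Path(local_exp_dir)
    return [
        flowdir / exp_id,
        workdir / exp_id,
        local / exp_id,
        local / model / exp_type / exp_id,
    ]


# Dry run with placeholder values (all of them are assumptions).
example_env = {"EWOK_FLOWDIR": "/scratch/flow", "EWOK_WORKDIR": "/scratch/work"}
for path in cleanup_paths("abc123", "/home/user/experiments", "model", "type",
                          env=example_env):
    print(path)
```

Printing the paths first makes it easy to check them by hand before removing them, for example with shutil.rmtree.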

Where are the logs?

While the experiment is running, right-click on the task in the ecflow UI, click output, and pick the file you want to see (the path to that file is also shown). In some cases (variational experiments? others?) the stdout/stderr logs can be found under ${EWOK_WORKDIR}/<exp_id>/<date>/. After the experiment has completed, the finishExperiment task will have cleaned up many of these logs. In most cases the yamls, jobs, and logs for the latest cycle can still be found in ${EWOK_FLOWDIR}/<exp_id>. To prevent this cleanup of the ewok directory, suspend the finishExperiment task via the GUI or the command line after starting the experiment.

How to run the task with e.g. OOPS_DEBUG on?

In the ecflow UI, go to the task, right-click, and select edit. This brings up a window where you can tick the pre-process box before editing, then edit the script to set the environment variable or change anything else you need. You can then submit the edited script (top right).

How to check whether the ecflow_server is running?

Run the command:

ps -ux | grep ecflow

Note: the server may run on a different node, e.g. orion-login-4.
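
The same check can be scripted. The sketch below is only an illustration: the function names are hypothetical, and it filters for "ecflow_server" rather than "ecflow" so that the grep process itself and the UI are not reported. Like the command above, it only sees processes on the node where it runs.

```python
import subprocess


def list_my_processes():
    # Same listing as the "ps -ux" command above, captured as text.
    result = subprocess.run(["ps", "-ux"], capture_output=True, text=True, check=True)
    return result.stdout


def ecflow_processes(ps_output):
    """Return the ps lines that mention ecflow_server, skipping the header line."""
    lines = ps_output.splitlines()[1:]
    return [line for line in lines if "ecflow_server" in line]
```

Calling ecflow_processes(list_my_processes()) returns an empty list when no server owned by the current user is running on this node.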

What is the best way to find out which experiment yaml was used for a particular experiment?

Follow the instructions available in the JEDI Documentation.

How to run 2 experiments with different versions of ewok on the same machine at the same time?

Once an experiment is running in ecflow it should no longer depend on your ewok repo, so in principle you can then change branch in ewok and start another experiment. This works if you use your own ewok, not the default one that’s installed on Orion, for example. You should also be able to change ${JEDI_SRC} and ${JEDI_BUILD} to point to another set of executables/yamls between submitting experiments.

Debugging new experiments: an experiment failed on some task and the experiment yaml file needs to be updated. Do I need to create a new experiment, or is there a way to restart the failed task?

You can edit the yaml files in ${EWOK_FLOWDIR}/<exp_id> (and/or rebuild the executable in ${JEDI_BUILD} if you are debugging). Then, in the ecflow UI, right-click on the task and choose rerun. You can also select edit in the right-click menu, which brings up the script for that task; tick the pre-process box (upper right) and you can then edit the script before submitting it. Once you are done debugging, don’t forget to copy the change back to GitHub. For anything else, the pre-processing was done by ewok when create_experiment was executed, so you have to create a new experiment.

When do I need to run pip install -e in ewok? Do changes in suites and/or tasks require that?

If you use "pip install -e" you only need to run it once. The installation will always use the current version of the source, so you can edit files, change branch, etc.

How to rerun all failed tasks in the family?

Right-click on the family and choose “Requeue aborted”.

ecflow shows that a task is running, but in fact it is not. How to resolve it?

If you can check the log file and are sure the task completed correctly, right-click and set it to complete in the UI. If you know it failed, or cannot tell, right-click, set it to aborted, and then rerun it.

How to limit the number of tasks submitted at a time for a particular family?

ecflow_client --alter add limit maxtasks 200 /YOUR_EXPERIMENT_ID
ecflow_client --alter add inlimit maxtasks /YOUR_EXPERIMENT_ID/an
# NOTE: Use delete instead of add to remove the limit.
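
If the limit needs to be applied to several experiments, the two commands above can be assembled programmatically. This is a minimal sketch: the function name "limit_commands" and its parameters are hypothetical, and it only builds the argument lists shown above without running them.

```python
import shlex


def limit_commands(exp_id, family, limit_name="maxtasks", max_tasks=200):
    """Build the two ecflow_client calls shown above as argument lists."""
    suite_path = f"/{exp_id}"
    return [
        # Define the limit on the experiment (suite) node.
        ["ecflow_client", "--alter", "add", "limit", limit_name,
         str(max_tasks), suite_path],
        # Make the family consume tokens from that limit.
        ["ecflow_client", "--alter", "add", "inlimit", limit_name,
         f"{suite_path}/{family}"],
    ]


for command in limit_commands("YOUR_EXPERIMENT_ID", "an"):
    print(shlex.join(command))
```

Each list can be passed to subprocess.run once the printed commands have been checked.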

How to make sure the logs in ${EWOK_FLOWDIR} aren’t removed after the experiment is finished?

Suspend the finishExperiment task before the experiment is finished.

How to make sure the logs in ${EWOK_WORKDIR} aren’t removed after the experiment is finished?

Suspend the endCycle task before the cycle is finished. During the endCycle task, our cleanup automatically removes cycle directories that are from the previous two cycles or older.
