Yannick opened the meeting by announcing today's topic, the Near Real Time (NRT) H(x) flow, and then turned the meeting over to Mark O, who presented the subject.

Mark O gave a mix of slides (below) and live demos. Please see the slides for details. Here is a link to the slides on Google Drive. The copy below matches (at this time) what is on the Google Drive, and is included in case of issues with the link.


Here is an overview of the presentation.

jedi-rapids is the JEDI system that manages workflows. The workflow itself is run by third-party tools; jedi-rapids handles the configuration and execution of a specified workflow and the posting of results from that workflow. jedi-rapids currently uses the Cylc workflow engine, and support for other workflow engines (e.g., ecFlow, Airflow) is planned.

jedi-rapids has the notion of an application, which is a generic description of how to accomplish a specific task (e.g., hofx3d). An application is broken down into phases that can be specified and configured by the user to describe the task. Applications are placed into a graph representing the flow (the flow graph) of a higher-level task (e.g., NRT H(x)). The flow graph is then used to configure the third-party workflow engine (Cylc), which then executes the higher-level task.
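To make the application/phase/flow-graph terminology concrete, here is a minimal sketch (illustrative Python only, not actual jedi-rapids code; the class, method, and phase names are assumptions) of an application built from ordered phases and a flow graph rendered as a Cylc-style dependency string:

    # Illustrative sketch only -- not actual jedi-rapids code.
    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class Application:
        name: str                                         # e.g., "hofx3d"
        phases: Dict[str, Callable[[], None]] = field(default_factory=dict)

        def run(self) -> None:
            # Phases run in the order they were registered (configure, execute, ...).
            for phase_name, phase in self.phases.items():
                print(f"[{self.name}] phase: {phase_name}")
                phase()

    @dataclass
    class FlowGraph:
        # Each (upstream, downstream) pair means upstream must finish first.
        edges: List[Tuple[str, str]] = field(default_factory=list)

        def to_cylc_graph(self) -> str:
            # Cylc expresses task dependencies as "upstream => downstream" lines.
            return "\n".join(f"{a} => {b}" for a, b in self.edges)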

The first flow example in this system is the NRT H(x) flow. This flow takes current observations and GFS backgrounds, runs the FV3-GFS hofx3d application (from fv3-bundle), collects the results, generates plots, and posts the plots on the JCSDA website (see Products→NRT Observation Monitoring). The website is updated six times per day. Clicking on one of the plots reveals details about that particular instrument and its associated observation data.
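As a rough picture of what such a flow graph looks like when laid out end to end (the application names below are placeholders, not the real jedi-rapids names):

    # Placeholder step names; the real flow graph is defined in jedi-rapids.
    steps = ["fetch_obs_and_backgrounds", "hofx3d", "collect_results",
             "make_plots", "publish_to_website"]
    print(" => ".join(steps))
    # fetch_obs_and_backgrounds => hofx3d => collect_results => make_plots => publish_to_website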

Discussion started up at this point about what is meant by "real-time". This system is being called "Near Real Time" to emphasize that there is a delay relative to what would normally be expected as "real-time". The NRT system runs roughly 48 hours behind, but it does demonstrate how jedi-rapids can be used to configure and run a scientifically interesting flow. The delay is primarily due to the availability of the observations and backgrounds needed to run the FV3-GFS hofx3d application.

Mark O then showed details (see the slides) of how the flow graph for the NRT H(x) flow is constructed and executed. Mark pointed out that the jedi-rapids applications that constitute the flow graph can be run manually from the command line, producing the same results as having the third-party workflow engine run them automatically. Discussion ensued about how unavailable data (obs, backgrounds) is handled (question from Chris H) and about further details of the jedi-rapids applications (question from Arun Chawla). Mark emphasized that it is straightforward to write new jedi-rapids applications and encouraged users to do so; he offered to provide further training to anyone who is interested.

jedi-rapids is currently told when to expect data to show up and can retry several times if the data are not there. This was done to handle the short-term manner in which data are made available to the NRT H(x) flow. This is not the same as a fully operational system (where data would be much more readily and reliably available), so the long-term plan is to have jedi-rapids trigger the flow when data show up (as opposed to waiting for expected arrival times).
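As a rough illustration of this "expected arrival time plus retries" behavior (a minimal sketch; the path, retry count, and wait interval are made-up assumptions, not jedi-rapids settings):

    # Sketch of "wait for expected data, retry a few times" behavior.
    import os
    import time

    def wait_for_data(path: str, retries: int = 5, wait_seconds: int = 600) -> bool:
        """Return True once the expected file exists, False if all retries fail."""
        for attempt in range(1, retries + 1):
            if os.path.exists(path):
                return True
            if attempt < retries:
                print(f"attempt {attempt}/{retries}: {path} not there yet, waiting...")
                time.sleep(wait_seconds)
        return False

    if not wait_for_data("/path/to/expected/obs_file.nc"):
        raise RuntimeError("observations never arrived; downstream steps cannot run")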

The jedi-rapids hofx3d application dynamically configures its YAML right before execution. This is done to take into account the resources available at run time: observations, model backgrounds, MPI tasks, etc. The YAML configuration is one of the programmable phases of a jedi-rapids application. A jedi-rapids application has a number of programmable phases available that can be pieced together into a wide variety of applications (run 4D H(x), run 3DVar, create diagnostic plots, etc.). In the NRT H(x) flow, the hofx3d jedi-rapids application has three phases: configure the YAML, run the FV3-GFS hofx3d application, and archive the result files (in preparation for diagnostics plotting, which is a downstream jedi-rapids application).
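As a sketch of the "configure YAML at run time" idea (the keys, paths, and helper function below are simplified assumptions, not the actual JEDI or jedi-rapids hofx3d schema):

    # Build a YAML config at run time from whatever resources are available.
    # Keys and values are simplified placeholders, not the real hofx3d schema.
    import glob
    import yaml  # PyYAML

    def build_hofx3d_config(obs_dir: str, background_file: str, mpi_tasks: int) -> str:
        available_obs = sorted(glob.glob(f"{obs_dir}/*.nc"))  # only configure obs that actually exist
        config = {
            "background": background_file,
            "observations": [{"obs file": f} for f in available_obs],
            "mpi tasks": mpi_tasks,
        }
        out_path = "hofx3d_config.yaml"
        with open(out_path, "w") as f:
            yaml.safe_dump(config, f, default_flow_style=False)
        return out_path

    config_file = build_hofx3d_config("/path/to/obs", "/path/to/gfs_background.nc", mpi_tasks=36)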

Mark O demoed troubleshooting. First, Cylc's text-based monitoring GUI can be used to see the state of the steps in the Cylc flow that jedi-rapids set up and ran. In the demo one could see each Cylc step, with some highlighted in red for "failed" and blue for "waiting". jedi-rapids keeps extensive log messages, and these can be queried for details on why a particular Cylc step failed. In this demo, the failed step turned out to be caused by truncated tar files containing inputs. Mark pointed out that jedi-rapids can simply wait until good files (matching the expected checksum) show up and will then pick up running the flow again. Emily commented that the failures are likely due to the filesystem maintenance that has been occurring on Hera this week.
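As an illustration of the "wait until a good file shows up" check (a sketch assuming a known SHA-256 checksum is available for comparison; jedi-rapids may use a different mechanism):

    # Sketch: a truncated tar file fails the checksum comparison, so the step
    # would be retried later instead of running with bad input.
    import hashlib

    def sha256sum(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def is_good_file(path: str, expected_checksum: str) -> bool:
        try:
            return sha256sum(path) == expected_checksum
        except OSError:
            return False  # missing or still being transferred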

Mark O then gave us a look at the JCSDA/jedi-rapids GitHub repository. It has a tagged version (v0.1) of jedi-rapids, which is the version being used in the NRT H(x) flow. The README.md file gives extensive documentation showing how to set up and run jedi-rapids. The documentation includes a tutorial for setting up and running an example flow, including a video (screen capture) of the jedi-rapids commands being typed for this example flow. The README.md file also notes that jedi-rapids works on S4, Discover, Cheyenne, and AWS. Mark mentioned that he is now working to add Orion to that list.

Mark O finished up by mentioning some of the handy jedi-rapids commands:

  • jedi-rapids list
    • Show available obs, backgrounds, jedi-rapids applications, etc.
  • jedi-rapids sync
    • Sync up local copies of obs, backgrounds
  • jedi-rapids build
    • Build using different branches (develop, feature/xxx, etc.) of components used in the flow
    • Multiple builds can co-exist and there is a means for selecting a specific build
    • Very handy for testing new JEDI features

Mark also mentioned that every jedi-rapids command has a help argument that gives usage details. jedi-rapids can also utilize HPC batch submission systems, such as SLURM. He demoed submitting build tests on S4 using the sbatch command; in this case, jedi-rapids created the run script for sbatch and then submitted the job.
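As a sketch of that general pattern (the script contents, resource settings, and file names below are placeholders; only the sbatch submission itself is standard SLURM usage):

    # Generate a SLURM batch script and submit it with sbatch.
    import subprocess
    from pathlib import Path

    script = Path("run_build_test.sh")
    script.write_text(
        "#!/bin/bash\n"
        "#SBATCH --job-name=jedi_build_test\n"
        "#SBATCH --ntasks=4\n"
        "#SBATCH --time=01:00:00\n"
        "#SBATCH --output=build_test_%j.log\n"
        "\n"
        "echo 'build and test commands would go here'\n"
    )
    result = subprocess.run(["sbatch", str(script)], capture_output=True, text=True, check=True)
    print(result.stdout)  # e.g., "Submitted batch job 123456"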

Chris H expressed interest in jedi-rapids and would like to ask more questions about it, including how to handle the many failure modes in a cycling DA workflow and where the boundary lies between jedi-rapids and third-party workflow engines such as Cylc (Mark confirmed that this boundary is fluid/fuzzy). He suggested setting up a follow-up meeting for this. Mark O will work on providing follow-up sessions for those who are interested.

Yannick closed the meeting at this point.

