The Intercycle Layer and Workflow Layer are the highest layers in the pyHWRF system: their purpose is to connect the Scripting Layer to the supercomputer's batch system.  Although these are two separate layers conceptually, they may be implemented as a single layer in practice.  It is also possible to run without these top two layers by obtaining interactive compute node shells and running the Scripting Layer manually.

Purposes of these Layers

To understand the purpose of, and division between, these two layers, one must first understand what a cycle is.  In weather forecasting, a model is initialized at its analysis time with the best known state of the atmosphere and ocean.  It then produces a forecast made up of atmospheric and oceanic fields at times after the analysis time.  A few hours later, the model is initialized again with the best known state at another analysis time, producing another forecast.  Each of these analysis-forecast steps is referred to as a cycle.  The Intercycle Layer handles the sometimes very complex dependencies between cycles, while the Workflow Layer handles the dependencies between the jobs within one cycle.
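
As a concrete illustration of cycling, the short Python sketch below enumerates the analysis times of a run that cycles every six hours.  The start date, interval, and cycle count are made-up values for illustration, not anything prescribed by HWRF.

  from datetime import datetime, timedelta

  # Hypothetical example: list the analysis times of an eight-cycle run
  # that cycles every six hours, starting at 2014-08-01 00:00 UTC.
  first_analysis = datetime(2014, 8, 1, 0)
  cycle_interval = timedelta(hours=6)

  for i in range(8):
      analysis_time = first_analysis + i * cycle_interval
      print(analysis_time.strftime("cycle with analysis time %Y%m%d%H"))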

Duties of the Intercycle Layer

The Intercycle Layer keeps track of dependencies between forecast cycles and ensures that files remain available until they are no longer needed.  When running a large test, it is critical for the Intercycle Layer to clean up earlier forecast cycles before moving on to too many later ones; otherwise, disk space fills up quickly.  When running a single cycle, such as for an individual case study or for developing and debugging new HWRF capabilities, the Intercycle Layer is an unnecessary complexity.  However, when running a large test with hundreds of forecast cycles from many storms, it is critical.
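
A minimal sketch of the disk-scrubbing part of that duty is shown below.  The directory layout, naming convention, and "keep the last few cycles" policy are assumptions for illustration only, not the actual pyHWRF implementation.

  import shutil
  from pathlib import Path

  # Assumed layout: one work directory per cycle, named by analysis time
  # (YYYYMMDDHH), so a lexicographic sort is also chronological.
  WORKDIR = Path("/scratch/hwrf")   # hypothetical path
  KEEP_CYCLES = 3                   # how many recent cycles to keep on disk

  def scrub_old_cycles():
      cycles = sorted(p for p in WORKDIR.iterdir() if p.is_dir())
      for old_cycle in cycles[:-KEEP_CYCLES]:
          shutil.rmtree(old_cycle)  # free disk space before later cycles run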

Duties of the Workflow Layer

Due to its large resource requirements, the HWRF is generally run on a supercomputer, or at least a cluster computer.  It is, for the most part, a distributed-memory parallel program, so it is made up of batch jobs, each of which requires a certain number of compute nodes and runs MPI ranks across its assigned nodes.  Those batch jobs each run part of the HWRF workflow, resulting in inter-job dependencies.  For example, the job that runs the forecast model cannot start until the jobs that generate its initial and boundary conditions have completed.  The inter-job dependencies within the present HWRF system are fairly simple: certain jobs cannot start until others have finished, and the total number of jobs per storm is fairly small.  Once a job's dependencies are met, the Workflow Layer must know how to submit the job to a batch system so it runs on compute nodes, and it must pass control to the appropriate part of the Scripting Layer to execute the part of the HWRF that should run in that job.
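
The sketch below illustrates that kind of simple dependency tracking and submission.  The job names, the dependency table, and the use of a PBS-style qsub command are illustrative assumptions, not the actual HWRF job list or batch interface.

  import subprocess

  # Hypothetical job list: each job names the jobs it must wait for.
  DEPENDENCIES = {
      "init": [],                   # generates initial conditions
      "bdy": [],                    # generates boundary conditions
      "forecast": ["init", "bdy"],  # cannot start until both are complete
      "post": ["forecast"],         # post-processes the forecast output
  }

  def submit_ready_jobs(completed, submitted):
      """Submit every unsubmitted job whose dependencies are all complete."""
      for job, deps in DEPENDENCIES.items():
          if job in submitted or job in completed:
              continue
          if all(dep in completed for dep in deps):
              # Hand the job to the batch system; each job's entry point is
              # assumed to be a script that calls into the Scripting Layer.
              subprocess.run(["qsub", "jobs/JHWRF_" + job.upper()], check=True)
              submitted.add(job)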

2014 Operational HWRF Jobs

The 2014 Operational HWRF workflow and dependencies are described here:

  • FIXME: insert link to workflow diagram PDF here

Examples of HWRF Intercycle and Workflow Layers

NCEP Central Operations ecFlow-Driven HWRF

NCEP uses the ecFlow automation system, developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) and driven through a Python suite-definition API, to run its entire modeling suite, including the HWRF.  It represents a combined Intercycle and Workflow Layer, along with some human interaction on the part of the National Hurricane Center and (when things fail) NCEP Central Operations and the NCEP Environmental Modeling Center.  The ecFlow HWRF has simple dependency capabilities: it knows to avoid submitting certain jobs before certain other ones are complete.  The HWRF system is designed so that this simple level of dependency tracking is sufficient.
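
A minimal sketch of that level of dependency tracking, using the ecFlow Python suite-definition API, might look like the following.  The suite and task names are made up for illustration and do not correspond to the actual operational suite.

  import ecflow

  # Sketch: the forecast task may not be submitted until the launch task
  # has completed.  Names are hypothetical, not the operational job names.
  defs = ecflow.Defs()
  suite = defs.add_suite("hwrf_example")
  suite.add_task("launch")
  forecast = suite.add_task("forecast")
  forecast.add_trigger("launch == complete")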

There are some unusual aspects to this workflow, however.  The description of the storms to simulate is provided manually, in real time, by National Hurricane Center forecasters examining model and observational data.  That description is picked up automatically by the ecFlow HWRF and used to trigger simulations of up to five storms.  Since the simulation runs on the same supercomputer as the parent GFS model, no complex data transfer jobs are needed.  Also, human operators monitor the workflow 24/7, with the ability to immediately contact the model developers.  This level of manual supervision allows somewhat more fault recovery than the other workflow systems, at the cost of sleep deprivation and headaches when something does fail.

To use this Intercycle and Workflow Layer, one must ensure the Scripting Layer is the NCEP Jobs/Scripts/USH Scripting Layer.

NCEP Environmental Modeling Center HHS and kick_scripts

This pair of layers is designed to look and act exactly like the ecFlow-driven HWRF described above, but without having to use ecFlow.  It is a wrapper around the NCEP Jobs/Scripts/USH Scripting Layer, just like the ecFlow-driven HWRF, and it results in the same set of batch jobs being submitted.  All that differs is how they are submitted and monitored.  This is what EMC has been using since 2004 for its Workflow Layer, and since 2008 for its Intercycle Layer.

The kick_scripts are a simple set of ksh scripts that make up a Workflow Layer with one critical limitation: each batch job is submitted by the batch job it depends on.  That makes it impossible for a job to have multiple dependencies, and it also rules out more complex dependencies, such as time-based or file-based ones.  This is not a problem for the other automation systems described here.  It caused no problems in the 2013 HWRF, which had no such dependencies, but it can cause problems for the 2014 HWRF unless one is very careful about how the kick_scripts are written, since that system does have multiple dependencies for some jobs.  Generally, this requires adding wait loops within the scripts and carefully choosing wallclock limits.  It is for that reason that the Environmental Modeling Center plans to move to Rocoto, as explained in the next section.  However, the kick_scripts are being used in the present development of the pyHWRF.
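
For illustration only, here is the flavor of such a wait loop, written in Python rather than the ksh actually used by the kick_scripts: the job polls for a flag file produced by the job it must wait for, and gives up before its wallclock limit is reached.  The file name and time limits are hypothetical.

  import os
  import time

  FLAG_FILE = "/scratch/hwrf/com/other_job.done"  # hypothetical flag file
  MAX_WAIT_SECONDS = 3600   # must fit within the job's wallclock limit
  POLL_INTERVAL = 30        # seconds between checks

  waited = 0
  while not os.path.exists(FLAG_FILE):
      if waited >= MAX_WAIT_SECONDS:
          raise RuntimeError("timed out waiting for " + FLAG_FILE)
      time.sleep(POLL_INTERVAL)
      waited += POLL_INTERVAL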

Sitting on top of the kick_scripts is an Intercycle Layer known as the HHS.  It is designed to have no knowledge of what lies within the Workflow Layer or below it, apart from some minimal knowledge of how to start a cycle, how to clean it up, and how to check certain aspects of the output and log files for correct execution.  It is independent of the version of HWRF used: the present HHS should run anything from the 2008 HWRF through some of the 2014 experimental configurations, and it will soon be updated for the pyHWRF.  It has successfully been used to run tens of thousands of HWRF simulations over the past six years.  However, EMC plans to retire it in favor of Rocoto, described in the next section.

Rocoto-Driven HWRF

Rocoto, formerly known as the NOAA Workflow Manager, was originally developed by Chris Harrop at NOAA ESRL and is now a public open-source project.  The Developmental Testbed Center (DTC) has used it for years to run its HWRF scripts.  One of its advantages is the ability to automatically resubmit failed jobs; it can also track complex dependencies and generate large numbers of jobs automatically from templates, such as to support ensemble forecasts.  Unfortunately, the ksh-based 2013 NCEP EMC/NCO scripts cannot be run under this system due to flaws in the design of those scripts.  However, the Python rewrite will fix this problem, and EMC hopes to move its parallels to Rocoto in 2014; the NCO parallels will continue using ecFlow.
