Yannick opened the meeting by announcing that today's topic would be our strategy for tiered testing.

Maryam presented the following slides:

At the center of Maryam's presentation was a table (slide 3) showing a proposal for five different test tiers (please see the slide for details). For each proposed tier, the table shows the expected test times, where the test data is stored, what triggers the testing, what platforms the tests run on, and some descriptive comments. The proposed test tiers are named '1', '2', '3', 'Comprehensive', and 'Testbed application'. Tier 1 corresponds to what we are currently doing in our CI testing (unit tests that run every time a commit is pushed to an open PR).

The remainder of Maryam's presentation included examples (from the fv3-jedi and saber repos) of how the user specifies which tier of testing is to be built and run (slide 4), estimated costs of running tier 1 and the comprehensive tier (slide 5), and a list of prompts for discussion (slides 2 and 6). The prompts included topics such as how to organize the tiers, when (every commit, nightly, weekend) and where (AWS, HPC, etc.) to run the tiers, criteria for declaring a test as passing, how to specify which tiers are to be run (CMake variables vs environment variables), and where the test data is stored (AWS, git-lfs, UCAR DASH, etc.).

A lot of great discussion came up both during and after Maryam's presentation.

Yannick commented that we shouldn't have too many tiers, so that managing and maintaining the tests and test files doesn't get out of hand. He suggested eliminating tier 2 by moving some of it into tier 1 and the rest into tier 3. Yannick also noted that the testbed application tier will include running more than one DA cycle, adding that we could use a test tier that is like a cycling system except at low resolution so that it runs quickly.

Sergey mentioned that he liked the idea of collapsing some of the tiers, and offered two use cases that were not on the list. First, we need to do daily regression testing using the develop branches on HPC systems so that we can detect if we have broken develop builds. Second, we need to do regression testing on the releases since some developers may want to build upon a release (instead of develop).

Mark M added that a factor in defining tiers is single node vs a cluster, and that might be a reason to keep tier 2 (single node) and tier 3 (cluster). Mark M added that low resolution cycling could fit into the testbed application tier, plus the comprehensive tier could be used for checking develop builds on HPC systems (for example, SOCA on Hera, MPAS on Cheyenne, and FV3 on Hera and Discover).

Travis added that his perspective is that there are two sets of tiers: 2 tiers on AWS, and 2 other tiers on HPCs. The AWS tiers are "unit testing" which is like tier 1 and "daily cron for bigger tests". The HPC tiers are "daily develop build check" and "weekly run of big tests" such as high res cycling.

Chris S noted that the difference between tier 2 and tier 3 is essentially the "size" of the test, and asked if there are other aspects that distinguish the two. Mark M responded that tier 3 is geared more toward testing beyond the unit level, for example checking runtime performance and testing tasks that contain more steps than unit-level tasks but fewer steps than entire DA flows (the testbed tier).

Eric asked if we are triggering testing with GitHub commits. The answer is yes for the tier 1 CI testing, which gets triggered for every commit in all active PRs.

Ben J commented that we need a way to update releases (and trigger testing) with patches that may not be desirable to merge back into develop. Ben noted that these kinds of patches can occur once a release goes into production, and they (CRTM) tend to keep releases around for ~2 years. Mark M noted that we are running into issues like this with UCAR DASH, where we can only easily access the "latest" version of stored files.

Andrew L asked where scientific tests fit in. The answer is the "testbed" tier.

Ben J suggested that we could target laptops with low res testing, and there was general agreement with this comment.

Yonggang asked how often we should run MPAS-level tests, and whether these tests should be manually triggered. It was suggested to use cron (as opposed to manual triggering) to run the tests, perhaps on daily or weekly schedules.

Eric asked what is the best way to store test results (for later analysis and diagnostics). This ties into how the testbed is used, and perhaps R2D2 is the means for storing test results of this nature.

Steve S asked if we do the same tiered testing for model interfaces, noting that he sees tiers 1, 2 and 3 for JEDI core repos, and the "comprehensive" tier for model interfaces. Yannick responded that we want at least tier 1 for all the repos including the model interfaces, but noted that tier 1 tests need to be small and fast.

During the presentation, Maryam noted that both fv3-jedi and saber repos have CMake configuration for specifying which test tiers are to be built and run, but these two repos use different methods (e.g., CMake variables vs environment variables). She mentioned that we want to end up with a consistent (across repos), well documented mechanism for specifying test tiers, and proposed moving forward with using CMake variables for that purpose. Maryam added that model testing has some differences compared to testing the JEDI core repos, so this needs to be considered. Yannick added his agreement with a single consistent mechanism for test tier specification.
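To make the contrast concrete, here is a minimal sketch of how each mechanism might look in a repo's CMake configuration. This is a hypothetical illustration, not the actual fv3-jedi or saber code; the variable name `TEST_TIER` follows the convention discussed below, and the test and executable names are placeholders.

```cmake
# Hypothetical illustration only -- not the actual fv3-jedi or saber configuration.

# Approach A: a CMake cache variable, set at configure time
# (e.g. cmake -DTEST_TIER=2 ...).
set( TEST_TIER 1 CACHE STRING "Highest test tier to build and run" )

# Approach B: an environment variable, read at configure time
# (e.g. export TEST_TIER=2 before running cmake).
if( DEFINED ENV{TEST_TIER} )
  set( TEST_TIER $ENV{TEST_TIER} )
endif()

# With either approach, tests can then be registered conditionally:
if( TEST_TIER GREATER_EQUAL 2 )
  add_test( NAME mymodel_tier2_test
            COMMAND mymodel_test.x testinput/tier2.yaml )
endif()
```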

There was lively discussion on the tier specification topic, which is summarized here:

  • There were mixed opinions on whether CMake variables or environment variables are better, but more people seemed to be leaning toward CMake variables
  • It was suggested to always build all tests and use the specification mechanism to control which tests are run. There were mixed opinions about this. The main advantage of always building all tests is that it significantly simplifies the CMake configuration, while the main disadvantage is the extra time spent building tests that are never run.
    • Build times can be reduced by cd'ing to particular subdirectories in your build area before running make and ctest.
  • The test tier specification discussion tended to converge on the following variables and their meanings (see the sketch after this list):
    • `TEST_TIER=<level>` - global, default setting for all repos in the bundle
    • `<repo>_TEST_TIER=<level>` - override for individual repo
  • For the SKIP_LARGE_TESTS control in fv3-jedi, using the memory requirement to define a "large test" worked better than using the number of required MPI tasks.
  • Placing valgrind into a test tier for automatic testing can be tricky
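
As a concrete illustration of the proposed `TEST_TIER` / `<repo>_TEST_TIER` convention referenced in the list above, here is a minimal, hypothetical CMake sketch. The repo name `myrepo` and the test names are placeholders, and this is a sketch for discussion rather than an agreed implementation.

```cmake
# Bundle level: global default tier for all repos in the bundle.
set( TEST_TIER 1 CACHE STRING "Default test tier for all repos in the bundle" )

# Repo level (a hypothetical repo "myrepo"): honor a per-repo override if given.
if( DEFINED MYREPO_TEST_TIER )
  set( myrepo_tier ${MYREPO_TEST_TIER} )
else()
  set( myrepo_tier ${TEST_TIER} )
endif()

# Register tests conditionally on the effective tier.
if( myrepo_tier GREATER_EQUAL 3 )
  add_test( NAME myrepo_tier3_test
            COMMAND myrepo_test.x testinput/tier3.yaml )
endif()
```

With this convention, a user could for example configure the bundle with something like `-DTEST_TIER=1 -DMYREPO_TEST_TIER=3` to keep only fast tests everywhere except in the one repo they are actively working on.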

Yannick noted that the ecbuild developers (ECMWF) were discussing how to make build/test loops run faster. We need to check in with them and learn about what they are doing.

Travis asked about using jedi-build. Maryam responded that this is currently being used for the Code Pipeline. (Code Pipeline is used to test repos that are outside the dependencies of the repo you are testing, but are impacted by changes in the repo you are testing. For example, oops testing does not check the model repos, but oops changes impact the model repos.)

Steve S described nightly testing practices at the Met Office. This consists of using a shared GitHub account to clone all the repos being tested and submitting the tests to Met Office clusters and HPC systems. They are not using AWS yet. Steve S added that, for the testbed tier, they would need EWOK to be able to run with ROSE.

Sergey asked if each of the tasks in a cycling system (testbed tier) could be run in parallel. Yannick responded that in many places in the flow there are dependencies from one task to the next, so it would be difficult to parallelize more than what EWOK already does; in fact, managing the dependencies between all the tasks in a workflow is the purpose of EWOK.

Here is a snapshot of the chat room comments:
