Yannick opened by raising problems that multiple users have reported over the last one to two weeks. In particular, users are reporting failures in 3DVar and HofX applications that had previously passed. These problems seem to have started around April 15 or 16, so we are reviewing the pull requests merged around that time to try to determine the cause.

One issue is reproducibility of results. Users were seeing different results when running the same application again. This may be due, at least in part, to different random seeds in the bump initialization. It can be addressed by setting default_seed: 1 in the bump section of the yaml file, as Dan did with this commit to fv3-jedi two days ago. But this does not solve all the problems: users are still seeing test failures, and still seeing different results when the number of compute cores/MPI tasks is varied.
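To illustrate why a fixed seed restores run-to-run reproducibility, here is a minimal Python sketch. It is not bump code: `init_perturbations` and its arguments are hypothetical names standing in for any initialization step that draws random numbers.

```python
import random


def init_perturbations(n, seed=None):
    """Hypothetical stand-in for an initialization that draws random numbers.

    With seed=None the generator is seeded from system entropy, so repeated
    calls can differ. With a fixed integer seed the sequence is the same
    on every call.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Two "runs" with the same fixed seed produce identical draws.
run_a = init_perturbations(5, seed=1)
run_b = init_perturbations(5, seed=1)
print("fixed seed, runs identical:", run_a == run_b)

# Two unseeded "runs" will almost certainly differ.
run_c = init_perturbations(5)
run_d = init_perturbations(5)
print("no seed, runs identical:", run_c == run_d)
```

As noted above, a fixed seed only addresses repeat runs of the same configuration. It does not explain the differences seen when the number of MPI tasks changes.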

Possible sources of the problem include recent changes to bump (Yannick suspects the adjoint interpolation in particular) or recent changes to the pre-QC filters. Subsequent discussion throughout the meeting suggested that the pre-QC filters may be the more likely culprit, so we should look there.

Yannick created a new branch in oops called bugfix/revert-bump that reverts to the version of bump that was in use on April 15. This should help us determine whether the problem is in bump or somewhere else. There is also a new feature/qcflags branch in ufo that passes the QC flags and obs error to and from the filters, which may address the problem.

BJ noted that all DA and H(x) tests now use the pre-QC filters and suspects the problem might be there.  Marek also expressed similar concerns.

Yannick then emphasized that we need more tests! This problem went unnoticed until after the code was merged into develop. If we had a more comprehensive suite of tests, we would likely have found it before the merge. We need to be able to run tests with all models easily, and we should put more effort into this in the coming weeks. We made tentative plans for a code sprint, possibly in August, to focus on adding more tests. Participation by all JEDI developers is encouraged!

EMC

Dan confirmed that he obtained the same results on 6 and 24 processors using the feature/qcflags branch of ufo, providing further evidence that the qcflags are the source of the problem. Marek also reported later that if he omitted the QC, the problems went away.

He has also been working on implementing the ROTNG (question) radiation scheme and attending an AI workshop that is being held at NOAA.  And, he continues to work on improving the GEOS environment to enable cycling experiments.  He's now working on a change of resolution to enable inner loops to run with lower resolution.

Hamideh also has been attending the AI conference and continues to work on aircraft DA.  She has been seeing some of the same issues discussed above.

Ben Johnson reported on the need to add more crtm interfaces to ufo - aerosols in particular are missing.  He had a request for the group to identify the top-priority changes that need to be made in CRTM.  Yannick mentioned that they wrote a to-do list as part of the ufo code sprint last week.  Some crtm-related items include the need to get some of the output of crtm (e.g. Jacobians) into the QC filtering and the need to provide some missing coefficient files.  Ben is traveling for the next two weeks but those concerned agreed to meet afterwards to discuss further what needs to be done.  

Ben also mentioned that Jim has restructured the do loops in crtm to improve efficiency.

Stylianos announced a new member of the JEDI team, Jong Kim. He also had a request for the JEDI team: please provide sufficient notice when new software is introduced. He has in mind in particular the recent introduction of nccmp, a useful tool for comparing netcdf files that we are considering using for some JEDI tests. Stylianos says it often takes time to get new software installed on HPC systems such as Theia.
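For context on what a file-comparison test checks, here is a minimal Python sketch of a tolerance-based comparison of named variables. It is only an illustration under stated assumptions: it operates on in-memory dictionaries of floats rather than netcdf files, it does not call nccmp or any netcdf library, and `compare_fields` is a hypothetical name.

```python
import math


def compare_fields(reference, candidate, rel_tol=0.0, abs_tol=0.0):
    """Compare two dicts mapping variable name -> list of floats.

    Returns a list of (variable, index, reference_value, candidate_value)
    tuples, one per mismatch. A variable missing from either side, or a
    length mismatch, is reported with index None. With both tolerances at
    zero the comparison is exact.
    """
    mismatches = []
    for name in sorted(set(reference) | set(candidate)):
        if name not in reference or name not in candidate:
            mismatches.append((name, None, None, None))
            continue
        ref, new = reference[name], candidate[name]
        if len(ref) != len(new):
            mismatches.append((name, None, len(ref), len(new)))
            continue
        for i, (r, n) in enumerate(zip(ref, new)):
            if not math.isclose(r, n, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((name, i, r, n))
    return mismatches


# Made-up values purely for illustration.
reference = {"air_temperature": [280.0, 281.5]}
candidate = {"air_temperature": [280.0, 281.5000001]}

print(compare_fields(reference, candidate))                # exact comparison
print(compare_fields(reference, candidate, rel_tol=1e-6))  # with a tolerance
```

An exact comparison is the stricter choice for reproducibility tests, while a small tolerance is more forgiving across compilers and platforms.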

Cory is working on adding a 2m (question) temp obs operator into ufo and is debugging the GSI converter scripts.  He suspects that the nc-writer script is reordering the obs.

Work is also continuing at EMC on adding ozone to ufo.

UKMO

Steve mentioned that UKMO uses the parallel rose-stem tool to manage the JEDI ctest suite and asked whether anyone else is running ctest in parallel. He is seeing the need to redirect output into different directories. Yannick mentioned that, as far as he knows, everyone is running the ctests in parallel. However, we are actively working to improve the JEDI testing. Maryam is working on improving the testing framework and implementing continuous integration. Meanwhile, a new software engineer will join JEDI soon and will be responsible for developing parallel workflows using software tools such as Cylc (or something equivalent). Yannick emphasized that the UKMO's input into this process will be requested and valued, given their experience with rose-stem and other tools. We still plan to use ctest, but we would like to automatically launch parallel test suites that run with different compilers and MPI libraries. The parties concerned tentatively agreed to meet sometime in May.

Boulder

Mark brought up a problem that some people might encounter when cloning large repositories that use git LFS, such as ioda, crtm, and fv3-jedi. Sometimes these git clones fail with a "smudge error...rate limit exceeded". This is a bug in the git-lfs software; the fix was merged on GitHub on Dec 26, 2018 and made it into the release version in January. So, as described on the JEDI GitHub Team page, the solution is to install a very recent version of git-lfs: version 2.7.2 seems to solve the issue.
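A quick way to check whether the installed git-lfs is at least the version mentioned above is sketched below. This is a minimal example, assuming `git lfs version` prints a string that begins with `git-lfs/X.Y.Z`; the function names are hypothetical, and the 2.7.2 threshold is simply the version reported here as working.

```python
import re
import subprocess

MIN_VERSION = (2, 7, 2)  # version reported in these notes as resolving the issue


def parse_git_lfs_version(text):
    """Extract (major, minor, patch) from text such as 'git-lfs/2.7.2 (GitHub; ...)'."""
    match = re.search(r"git-lfs/(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized git-lfs version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_recent_enough(text, minimum=MIN_VERSION):
    """Return True if the version found in text is at least `minimum`."""
    return parse_git_lfs_version(text) >= minimum


def installed_git_lfs_version_string():
    """Return the output of `git lfs version`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            ["git", "lfs", "version"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    text = installed_git_lfs_version_string()
    if text is None:
        print("git-lfs not found")
    elif is_recent_enough(text):
        print(f"OK: {text}")
    else:
        print(f"Older than 2.7.2, consider upgrading: {text}")
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2.10.0" would sort before "2.7.2".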

BJ noticed that the MPAS H(x) application has been giving different results lately and asked if there have been any recent changes to CRTM that may be responsible for this.  Yannick mentioned that there were some changes about 2 weeks ago.

Xin has a pull request now in OOPS that refactors the ObsAux class to support the bias correction capability he added to ufo.  It can now read in the bias from GSI.


