Yannick opened the meeting by acknowledging some issues that JEDI users (including the MPAS and LFRic teams) have been reporting, which appear to be related to MPI implementations and to excessive memory usage, the latter possibly signifying a memory leak.

As a specific example, JJ then presented the slides below, which summarize performance issues they are having on Cheyenne. These issues were first noticed with MPAS-bundle and then confirmed with FV3-bundle.

JJ's slides illustrate timings for a 3D-Var run with MPAS on Cheyenne and with FV3 on both Cheyenne and Discover. The wall-clock time is dominated by GetValuesTL+AD, as expected for this application, so the code spends much of its time doing interpolations.

The problem illustrated in these slides is that single-node performance on Cheyenne is currently a bottleneck. In particular, single-node performance is much worse than multi-node performance for the same total number of cores, and this is seen for both MPAS and FV3. For FV3, the wall time using 36 cores on one node is nearly 100x greater than the wall time using 36 cores spread across 6 nodes (6 cores per node). This difference is not seen on Discover, where the 1-node and 6-node timings are comparable to each other and to the 6-node timings on Cheyenne.

JJ also described results obtained using the 1D and 2D implementations of the apply_opsob function in BUMP. He demonstrated that the 2D implementation substantially improved the single-node performance on Cheyenne, though the wall time was still longer than on Discover.

Hailing mentioned that she has also been seeing issues on Cheyenne that may be related. She was running mpas-bundle with different configurations of nodes and cores per node and seeing large variations in wall time; some configurations even led to code crashes.

There was speculation among the group about what might be causing these performance issues. Though the application is not memory-limited, the bottleneck may still be memory-related; for example, the cores on a single node may be competing for last-level cache (LLC) or for disk bandwidth.

Another, perhaps related, explanation may be the MPI implementation. All the applications described on the slides use OpenMPI, which is known not to be the optimal MPI implementation on either Cheyenne or Discover (or any HPC platform, for that matter). This speculation is further supported by a warning message that JJ has noticed, namely that the libpsm_infinipath library is not found on Cheyenne. This library is part of Intel's Performance Scaled Messaging (PSM) software, which is designed to sit between the high-level MPI library and the hardware to improve efficiency. Xin confirmed that he did not attempt to optimize the OpenMPI configuration when he developed the JEDI software modules for Cheyenne.

The group then identified a promising path forward: use the MPT MPI library on Cheyenne instead of OpenMPI. MPT is the MPI implementation recommended for high performance on Cheyenne. Mark offered to help with this.

Chris H agreed that this is a good approach. He also cautioned against making detailed comparisons between Cheyenne and Discover because of the many hardware differences between the two machines.

Mark then reported some progress on getting JEDI to run on the Amazon cloud. He has set up an Amazon Machine Image (AMI) that includes both jedi-gnu and jedi-intel modules, which can be used to build and run JEDI with the GNU and Intel compilers, respectively. All ufo-bundle tests pass with the jedi-gnu module, but he is seeing a large number (33) of test failures with the jedi-intel module, all of them L95 tests. He suspected that this might have to do with a warning he is seeing about the iomp5 library.

Rahul mentioned that this library is part of the Intel OpenMP implementation and suggested that the C++ wrapper for MPI might not be set correctly. He mentioned a startup script (called something like compile_vars.sh) that needs to be run to set up the Intel compiler/MPI environment. Yannick added that one difference between L95 and other applications is that it is written entirely in C++, so the C++ MPI wrapper may indeed be a good thing to check.
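One quick way to sanity-check the C++ MPI environment on the AMI is to compile and run a minimal MPI program with the C++ wrapper in question and verify which compiler and libraries it pulls in. The sketch below is only illustrative and is not part of any JEDI bundle; the wrapper name (mpicxx) and the assumption that it should resolve to the Intel toolchain are hypothetical details of this setup.

    // mpi_check.cpp: minimal MPI program (illustrative only, not JEDI code).
    // Build with the C++ MPI wrapper under test, e.g.:  mpicxx mpi_check.cpp -o mpi_check
    // Run with, e.g.:  mpirun -np 4 ./mpi_check
    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                  // initialize the MPI runtime
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's rank
        MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of ranks
        std::cout << "rank " << rank << " of " << size << std::endl;
        MPI_Finalize();                          // shut down the MPI runtime
        return 0;
    }

If this builds and runs cleanly with the Intel wrapper, inspecting the executable with ldd (and checking the wrapper with mpicxx -show, which Intel MPI and other MPICH-based wrappers support) would show whether libiomp5 and the MPI libraries are being resolved from the intended Intel installation.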

Mark also announced that there will be a change in the way access to JEDI repositories on GitHub is managed. Currently, by default, all members of the JCSDA organization on GitHub have read access to all JCSDA repositories. We plan to disable this default access at the organization level and instead grant access to each repository individually by means of GitHub teams. For example, all members of the JEDI team will be given read access (at least) to core JEDI repos like oops, ufo, and ioda, but some repos may be restricted. The reason for this change is to allow JCSDA to host proprietary repositories, such as ROPP and RTTOV, that require licenses for access.

This change should be transparent to users: you should have the same access as before. However, if you notice any change of access to JEDI repos over the next few days as the new policy is implemented (for example, no longer being able to push to a repo), let Mark know.

Yannick then mentioned that, in addition to the MPI issues reported by JJ and Hailing, others have been reporting memory issues. He asked if anyone wanted to say anything further about this, but no further information was forthcoming. JEDI users are asked to be on the lookout for possible memory leaks; we need to gather more information in order to reproduce and fix these issues.

Yannick then announced that there was not enough time left in the meeting for a comprehensive round-table report from everyone. Instead, he asked people to volunteer any other progress or issues that they would like to bring up. Since there were no responses, the meeting adjourned.

