Agenda

This is the kickoff meeting for this group. We will mostly discuss goals, expectations, and norms moving forward; with NWSC3 commissioning less than a year away, we will also touch on NWSC3 preparedness. We encourage comments and collaboration. While modifying the content directly is not forbidden, it will be easier to manage communication if you use comments and we, the support members, update the main content based on them. But if it is more convenient to edit the content directly, please do so.

Time            Topic                                                          Speaker
9:00-9:05 AM    Welcome                                                        Sidd
9:05-9:25 AM    NWSC3 update                                                   Irfan
9:25-10:00 AM   About NHUG, goals, communications, current priorities, etc.    Sidd and all

Minutes

Sidd gave a short welcome message, covering mostly logistics, and then handed over to HPCD director Irfan Elahi. Irfan gave a brief presentation on the current NWSC3 status. The recordings are available below. In brief, here are the questions and answers we captured:

Jim Edwards: Will GPU direct storage be available? Jim clarified in a later email that he was asking about the possibility of using GPU nodes for asynchronous processing (compression) and I/O offloaded from CESM tasks, with the compute tasks running on CPUs only: would such mixed jobs running on both CPUs and GPUs be supported, along with GPU direct storage, to help speed up CESM? Our response: yes, we are open to such innovative usage.

Ethan Gutman: How will cloud bursts interact with GLADE and other storage? Irfan: You need the data in order to run in the cloud, and there are a few different options. One is to have a common data center environment on the edge of the cloud that is readily accessible. Another is to get persistent storage in the cloud where you pre-stage your data before you run; that is costly, because you have to pay continuously for that persistent storage. The third option is to upload your data once your instance is ready, run your job, and download your data back when you are done, but that has its own challenges, because you have to pay for moving data to and from the cloud. All of these options are being looked at over the next few months and couple of years; we need to find a happy medium that works for us and is cost effective, but there are costs involved in all of them. Irfan then had to leave for another meeting and offered to come back at the next meeting to answer additional questions.
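As a minimal sketch of the last option Irfan described (upload data, run, download results), assuming an S3-style object store; the bucket name, file paths, and use of boto3 here are illustrative assumptions, not an NCAR-provided workflow:

```python
# Hypothetical stage-in / compute / stage-out pattern for a cloud-burst job.
# Bucket and file names are placeholders; the real mechanism is still being evaluated.
import boto3

s3 = boto3.client("s3")
bucket = "my-cloud-burst-bucket"  # hypothetical bucket

# Stage input data up to the cloud before the job runs.
s3.upload_file("input.nc", bucket, "runs/job001/input.nc")

# ... run the job on the cloud instance here ...

# Stage results back, then delete cloud copies to stop paying for storage.
s3.download_file(bucket, "runs/job001/output.nc", "output.nc")
s3.delete_object(Bucket=bucket, Key="runs/job001/input.nc")
```

Note that every transfer in this pattern incurs data-movement cost, which is the tradeoff Irfan highlighted against keeping persistent storage in the cloud.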

John Clyne asked whether the Lustre file system is going to be mounted on Casper. Our answer: yes, it will be.

Sidd presented a small introductory slide deck. The last slide was presented by Brian from the Applications team, which he and Rory lead. Brian explained the goals of this team, such as getting the user environment and applications ready for the new systems, and solicited feedback during this meeting or afterwards by email to nhug@ucar.edu, to him directly (vanderwb@ucar.edu), or to Rory (rory@ucar.edu). We opened the floor for questions at this point.

Ethan Gutman: I've had issues getting code to compile on Cray systems because they wrap different compilers in all sorts of ways. Can we get access to some kind of small test system with the Cray programming environment built on it? Brian responded that we are exploring various options, e.g. getting test systems delivered ahead of time, including remote access to an HPE system to test and try our codes.

Kartik Iyer: What will be the process for selecting the codes used to test the performance of the new machine? Sidd explained that, as part of the selection and procurement process, we create a benchmark suite comprising the major codes that have been running on our systems or that we expect to run. This suite, the NCAR Benchmark Suite, is available here. The performance numbers obtained from the different parts of the suite, along with their associated weights, determine the capacity or size of the machine relative to our current machine (Cheyenne). Each vendor contractually commits to deliver the projected performance equivalent based on this number. A follow-up question asked about early benchmarking of user code in order to apply for an allocation. Sidd clarified that for ASD (Advanced Scientific Discovery) projects you will have to project performance based on your runs on an existing system, e.g. Cheyenne. For projects after ASD, such as regular CHAP, NSC, WNA, or divisional projects, you will have the opportunity to benchmark on the new system itself, as it will be available for production runs by then. A related question asked about the approximate timeline for ASD projects; Sidd said he thinks it will be around September this year.
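As a rough illustration of how weighted benchmark numbers can be combined into a single sizing figure (the actual weighting scheme used in the procurement is not described in these minutes; the code names, weights, and speedups below are hypothetical):

```python
# Hypothetical sketch: combining per-benchmark speedups relative to Cheyenne
# with weights to get one projected capacity multiplier for a proposed system.
# Benchmark names, weights, and speedup values are made up for illustration.

# speedup = (time on Cheyenne) / (time on proposed system) for each code
speedups = {"cesm": 2.8, "wrf": 3.1, "mpas": 2.5}   # hypothetical
weights  = {"cesm": 0.5, "wrf": 0.3, "mpas": 0.2}   # hypothetical, sum to 1

# Weighted arithmetic mean of the speedups (one common choice;
# a weighted geometric mean is another).
projected_multiplier = sum(weights[c] * speedups[c] for c in speedups)

print(f"Projected capacity relative to Cheyenne: {projected_multiplier:.2f}x")
```

The vendor's contractual commitment is then expressed against this kind of aggregate number rather than against any single benchmark.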

Dave Gill asked about training available before the system arrives. We assured him that there will be prior training delivered by CSG and by HPE / NVIDIA, etc.

Jeremy Sauer asked about GPU charging. This is something we have been discussing and we will have a mechanism shortly, but the GPU resource will be a separate pool outside the allocation pie for some time, until we have enough GPU usage.

Supreeth Suresh: Can jobs share GPU and CPU nodes? Yes, the system is set up for that, and if there is demand we will ensure such usage is supported.

A question came up about GLADE and Campaign Storage availability. We are working to make both GLADE and Campaign Storage accessible from the new machine.

Jeremy Sauer: We will only have 320 or so GPUs; are there any plans to increase that number, e.g. to double it over the NWSC3 lifetime? Our response: we are looking at increasing the pool, but we need to see usage pick up first. At this point we do not know whether it will be doubled, but we certainly plan for an increase.

Dave Gill: How did we select this group? Are university users represented well enough? Our response: based on NSC, CHAP, and WNA PIs with at least 5M core-hours allocated over the last 2 years, we created a master list that includes both NCAR and university people in about the same proportion. We sent invitations to all of these people and included whoever responded positively. Currently there are slightly more NCAR users than university users. (This wiki has a members page that lists all the names and affiliations.)

Recordings

  1. Video recording from Zoom
  2. Automated audio transcripts (not too accurate!)
  3. Audio only