Taking Advantage of Simultaneous Multi-Threading (SMT) on Bluevista and Blueice

SCD Documentation: The gory details

Background:

With the current crunch on computing resources at NCAR, it is important for users to maximize the efficiency of both the release and development versions of CCSM on bluevista and blueice. At this time, not all CCSM users may be taking advantage of the SMT capabilities of the IBM platforms, which can offer roughly a 20-30% efficiency increase with minimal changes to the CCSM scripts and no changes to the code. Note that with SMT, the model runs with double the number of MPI tasks per node, which can be used either to increase model throughput or to decrease model cost significantly in most production runs.

Definitions

  • ptile = number of MPI tasks per node
  • SMT threads/node = ptiles x OMP_THREADS
  • # nodes = MPI tasks / ptiles
  • Example (bluevista values): ptile = 4 and 4 threads/task, so 4 tasks x 4 threads = 16 SMT threads = 1 node
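The two definitions above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, the (OMP_THREADS, ptiles) pairs are copied from the SMT Optimal Configurations table on this page, and rounding the node count up for a partially filled node is an assumption, not something the page states.

```python
def smt_threads_per_node(ptile, omp_threads):
    """SMT threads/node = ptiles x OMP_THREADS."""
    return ptile * omp_threads


def nodes_needed(mpi_tasks, ptile):
    """# nodes = MPI tasks / ptiles.

    Assumption: a partially filled node still counts as a whole node,
    so the division is rounded up.
    """
    return -(-mpi_tasks // ptile)


# (OMP_THREADS, ptiles) pairs as listed in the SMT Optimal Configurations table.
configs = {
    "bluevista": (4, 4),
    "blueice": (4, 8),
    "CAM-MPI only mode": (1, 32),
}

for machine, (omp_threads, ptile) in configs.items():
    print(machine, smt_threads_per_node(ptile, omp_threads))
```

For instance, `nodes_needed(32, 4)` gives the 8 nodes used by 32 MPI tasks at ptile = 4.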

    SMT Optimal Configurations

    machine                 OMP_THREADS   ptiles   SMT threads/node
    bluevista (See below)   4             4        16
    blueice                 4             8        32
    CAM-MPI only mode       1             32       32

  • Note:
    We have introduced new machine support for "bluevista16", beginning with the CCSM tag ccsm_3_1_beta39. Users should simply use the machine name "bluevista16" instead of "bluevista", and the
    generated scripts will automatically take advantage of SMT to maximize throughput. Users working with pre-ccsm3_1_beta39 tags should follow the directions below for the release-based
    modifications.
  • Option 1) pre ccsm3_1_beta39 tags. Increasing throughput:
    Double the number of MPI-tasks and use the same resources as before. You can expect a 30-40% increase in performance for the same cost.
    OR
  • Option 2) Leave the number of MPI-tasks the same. You can expect a 20-30% decrease in throughput with a corresponding 50% decrease in cost.
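The resource side of the two options can be illustrated with a short Python sketch. It assumes, as the Background states, that SMT doubles the number of MPI tasks per node. The function name and the baseline of 16 MPI tasks at ptile = 2 are hypothetical, chosen only so the numbers divide evenly. The throughput and cost percentages quoted above are measured figures and are not derived here.

```python
def smt_options(base_mpi_tasks, base_ptile):
    """Compare node counts for a non-SMT baseline and the two SMT options.

    Assumption: SMT doubles the MPI tasks per node (ptile), and the
    task counts divide evenly by ptile.
    """
    smt_ptile = 2 * base_ptile
    baseline = {"mpi_tasks": base_mpi_tasks, "ptile": base_ptile,
                "nodes": base_mpi_tasks // base_ptile}
    # Option 1: double the MPI tasks, keep the same number of nodes.
    option1 = {"mpi_tasks": 2 * base_mpi_tasks, "ptile": smt_ptile,
               "nodes": (2 * base_mpi_tasks) // smt_ptile}
    # Option 2: keep the MPI tasks, which then fit on half the nodes.
    option2 = {"mpi_tasks": base_mpi_tasks, "ptile": smt_ptile,
               "nodes": base_mpi_tasks // smt_ptile}
    return baseline, option1, option2


baseline, option1, option2 = smt_options(base_mpi_tasks=16, base_ptile=2)
print("baseline:", baseline)
print("option 1:", option1)
print("option 2:", option2)
```

With this hypothetical baseline, Option 1 ends up at 32 MPI tasks on the same 8 nodes, while Option 2 keeps 16 MPI tasks on 4 nodes.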

    Example 1. Stand-alone CAM

  • Resolution: T85
  • nlat = 128 (Note that 128 threads is optimal)
  • nlon = 256
  • 4 tasks/node x 4 threads/task = 16 threads/node; 16 threads/node x 8 nodes = 128 threads
  • 32 total tasks (#bsub -n 32)
  • 64 PEs (processor equivalents) with SMT = 128 threads
    Configure in CAM run script:
  • #bsub -n 32 # number of MPI tasks
  • #bsub -R "span[ptile=4]" # max tasks per node
  • setenv OMP_NUM_THREADS 4
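As a cross-check of Example 1, the following Python sketch recomputes the node and thread counts from the three settings and prints the corresponding run-script lines. The helper name `cam_run_settings` is hypothetical and is not part of the CAM or CCSM scripts. The line formats simply mirror the ones shown above.

```python
def cam_run_settings(mpi_tasks, ptile, omp_threads):
    """Return the three run-script lines for a given task/thread layout."""
    return [
        f"#bsub -n {mpi_tasks}",
        f'#bsub -R "span[ptile={ptile}]"',
        f"setenv OMP_NUM_THREADS {omp_threads}",
    ]


# Values from Example 1 (T85, nlat = 128).
mpi_tasks, ptile, omp_threads = 32, 4, 4
nlat = 128

nodes = mpi_tasks // ptile               # 32 tasks at 4 per node
total_threads = mpi_tasks * omp_threads  # every task runs 4 threads

print("nodes:", nodes)
print("total threads:", total_threads, "(nlat =", nlat, ")")
for line in cam_run_settings(mpi_tasks, ptile, omp_threads):
    print(line)
```

The total thread count matches nlat = 128, which is the configuration the example calls optimal.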