CAM/CLM (GPTL-based) Timers

The CAM timing library interface is simple and flexible, but it has limitations: some are inherent in the design, and some are simply a function of the current implementation.

Basic Interface

[0] To use the library interface routines in a routine or a module, add

 use perf_mod

[1] To enable the instrumentation,

 call t_initf('NlFilename', LogPrint=.true., mpicom=mpicom, MasterTask=.true.)

where

  • 'NlFilename' is the file containing the namelist prof_inparm
  • LogPrint is an (optional) logical indicating whether the timing library parameters should be output to standard output
  • mpicom is the MPI communicator for a subset of processes reading prof_inparm from 'NlFilename'.
  • MasterTask is a logical indicating whether process 0 (for this communicator) will read and broadcast the namelist or whether all processes read the namelist individually.
    If 'NlFilename' does not exist, or if prof_inparm is not found in 'NlFilename', then the default values will be used. (The job will not abort.) MasterTask and mpicom are optional parameters. If both are present, then process 0 (for this communicator) will read the prof_inparm namelist and broadcast the values to the rest of the processes associated with this communicator. Otherwise, each process reads the namelist directly. (Note that the value of MasterTask is ignored; only its presence matters.)
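
For illustration, a minimal 'NlFilename' might contain a prof_inparm namelist such as the following (the values shown are only an example; the parameters are described under Namelist Arguments below):

 &prof_inparm
   profile_disable     = .false.
   profile_barrier     = .true.
   profile_single_file = .true.
   profile_depth_limit = 4
 /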
[2] To define an 'eventX', surround the relevant code with calls of the form:

 call t_startf('eventX')
 ...
 call t_stopf('eventX')

[3] To print out the performance data, call

 call t_prf('PrFilename', mpicom)

Depending on the namelist parameters, this either puts all of the timer data into one file (named 'PrFilename') or generates a separate file for each MPI process (named 'PrFilename.pid'). The information is the same; there is currently no reduction or sampling. Note that subsets of processes (defined by the mpicom argument) can call t_prf separately. In this case, each subset should use a unique filename.
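
When subsets of processes write separate profiles, each subset can construct its own filename. A minimal sketch, assuming each process knows a subset index subset_id and a subset communicator subset_comm (both hypothetical names):

 character(len=64) :: pfilename
 write(pfilename,'(a,i0)') 'timing_subset', subset_id   ! unique name per subset
 call t_prf(trim(pfilename), subset_comm)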
Performance data are recorded per event for each process, and for each thread within a process. Event data include the number of occurrences, total time (inclusive), maximum time, minimum time, and total performance data collection overhead (estimated). Events are output in order of their occurrence, with indentation indicating the nesting of events.
In comparison with the POP timers, events do not need to be defined during initialization, nor do they have to be the same within all processes or threads (an advantage). Event statistics are also not generated across processes (a disadvantage; this is not currently an option). Note that writing out data for each process is not an option for the POP timers, though the data are collected internally.

Other Commands

[a] clean up and shut down the instrumentation:

 call t_finalizef()

[b] add an instrumented barrier (enabled only when specified in the namelist):

 call t_barrierf(event='sync_eventX', mpicom=mpicom)

Both arguments are optional. If the event string is omitted, then an event is not generated for the barrier. If mpicom is omitted, then MPI_COMM_WORLD is used. If executed within a threaded region, the command is ignored.
[c] define the level of detail represented by subsequent events:

 call t_adj_detail(detail_adjustment)

If executed within a threaded region, the command is ignored.
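
For example, a high-frequency inner loop can be assigned a deeper detail level so that it is profiled only when profile_detail_limit (described below) is raised above its default. A minimal sketch (the event name and loop bounds are hypothetical):

 call t_adj_detail(+1)            ! subsequent events are one detail level deeper
 do c = begchunk, endchunk
    call t_startf('chunk_work')   ! profiled only when profile_detail_limit >= 1
    ...
    call t_stopf('chunk_work')
 end do
 call t_adj_detail(-1)            ! restore the previous detail level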
[d] disable event profiling for a section of code, then re-enable it:

 call t_disablef()
 ...
 call t_enablef()

If executed within a threaded region, these commands are ignored.
[e] query wallclock, user, and system times:

 call t_stampf(wall, usr, sys)

This command does nothing on the Cray XT system (because of the Catamount OS).
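
For example, elapsed times for a section of code can be computed from the difference of two stamps. A minimal sketch, assuming the arguments are double precision reals of kind r8 (check perf_mod for the exact kinds):

 real(r8) :: wall0, usr0, sys0, wall1, usr1, sys1
 call t_stampf(wall0, usr0, sys0)
 ...
 call t_stampf(wall1, usr1, sys1)
 write(*,*) 'section wallclock time (s): ', wall1 - wall0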

Namelist Arguments

profile_disable
  • logical indicating whether perf_mod routines should be disabled for the duration of the run. The default is .false.

profile_barrier
  • logical indicating whether calls to t_barrierf are enabled. The default is .false.

profile_single_file
  • logical indicating whether the performance timer output should be written to a single file (per component communicator) or to a separate file for each process. The default is .true.

profile_depth_limit
  • integer indicating the maximum number of levels of timer nesting. When the nesting exceeds this maximum, further event profiling is disabled. This controls the detail and size of the profile output. It also (usually) controls the overhead of the profiling, in that higher-frequency, short events typically occur deeper in the event nesting. The default is 99999.

profile_detail_limit
  • integer indicating the maximum detail level to profile. The command t_adj_detail allows the user to define the level of "detail" at a given point in the source code. This namelist parameter then specifies which levels of detail will be profiled, similar to the control on the nesting depth. In CAM, this is used to disable profiling during the initialization routines and within the loops over chunks (which occur at a higher frequency when chunk sizes are small). The default is 0.

profile_timer
  • integer indicating which timer to use (as defined in gptl.inc). This does nothing yet, but will provide runtime control of the timer used for profiling if/when we move to Jim Rosinski's latest version of the GPTL timing library. The default is

 #ifdef UNICOSMP
    integer, parameter :: def_perf_timer = GPTLrtc       ! default
 #else
    integer, parameter :: def_perf_timer = GPTLmpiwtime  ! default
 #endif
  • Note 1: these entry points are available only from Fortran. C routines could call the GPTL library directly, but this would not be identical to calling the perf_mod routines. Providing equivalent C entry points might be one generalization that we would want to consider.
  • Note 2: The nesting indentation is approximate. In the original Rosinski timing library, events that did not occur the first time through a timestep were relegated to the end of the list. Jim Edwards modified this to put the events in their correct location. However, if an event occurs in multiple places (within multiple other events), it is listed as occurring in only one location.

Future extensions

  • events defined by both their character string name and their nesting within other events.
  • statistical summary over all processes (might be too difficult, especially if we add "call-site" profiling).