Benchmarks for the PIO/PNETCDF restart code used by the CAM physics package and the HOMME dycore. These numbers should be similar to what PIO/PNETCDF can achieve for CAM history output once that code is finished.
Methodology
Set up CAM/HOMME for aqua-planet simulations (see Running CAM-HOMME).
Set restart_option = 'end' in the drv_in namelist.
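For reference, a minimal drv_in fragment might look like the sketch below; the seq_timemgr_inparm group name is an assumption based on the CESM driver namelist, and only the restart_option setting itself comes from this page.

```
&seq_timemgr_inparm
  restart_option = 'end'   ! write restart files only at the end of the run
/
```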
Code changes
Add instrumentation to the PIO calls by adding -DTIMING to the USER_CPPDEFS line in the Makefile.
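A sketch of the Makefile change, assuming USER_CPPDEFS is a make variable that can simply be extended (the exact form of the line will differ between builds):

```
# enable timing instrumentation around the PIO calls
USER_CPPDEFS += -DTIMING
```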
Output
- ne120nv4 case (1/4 degree average grid spacing at the equator), 26 levels.
- restart file: 12,370 MB
- Runs on 128 processors (~1.5 GB per processor)
- BG/P, 512 nodes, VN mode (512 MB per core): wrote PIO restart files, but ran out of memory on the surface restart files.
- BG/P, 512 nodes, SMP mode (2 GB per processor): ?
- BG/P, 1024 nodes, VN mode: ?
Results
- homme_cam3_6_19 branch
- NCPUS: number of cores (MPI tasks)
- io_cpus: PIO num_iotasks
- stripe: number of Lustre OSTs the file is striped across
- All times in seconds
- MB/s is computed from the pio_write_nf() time and does not include the re-arranger or other CAM and PIO overhead.
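- Example: with the 12,370 MB restart file, a pio_write_nf time of 151.5 s corresponds to 12,370 MB / 151.5 s ≈ 82 MB/s (first Blackrose NETCDF row below).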
SNL Blackrose (intel/openmpi/infiniband linux cluster, Lustre filesystem)
NETCDF

| NCPUS/io_cpus/stripe | cam_write_restart | pio_write_nf | MB/s |
|---|---|---|---|
| 128/128/64 | 170.4 | 151.5 | 82 | 
| 128/128/16 | 149.5 | 128.2 | 96 | 
| 128/128/4 | 183.9 | 168.5 | 75 | 
| 128/128/1 | 333.5 | 317.7 | 40 | 
| 128/32/32 | 149.8 | 143.9 | |
| 128/32/8 | 144.2 | 138.5 | |
In the NETCDF case, the difference between cam_write_restart and pio_write_nf is mostly due to the data re-arranger.
Parallel NETCDF

| NCPUS/io_cpus/stripe | cam_write_restart | pio_write_nf | MB/s |
|---|---|---|---|
| 128/128/128 | 663.4 | 385.4 | 32 | 
| 128/128/64 | 485.8 | 121.0 | 102 | 
| 128/128/32 | 844.5 | 601.8 | 21 | 
| 128/32/32 |  146.7  |  98.5  | 126 | 
| 128/8/16 |  222.5  |  174.0  | 71 | 
| 128/8/8 | 156.3 | 99.9 | 124 | 
In the PNETCDF case, the calls to "pio_put_var_0d_int" sometimes take a significant amount of time (~200 s).
ORNL Jaguar Cray XT4 with Lustre
ANL BG/P with GPFS
Note: for comparison, the standalone HOMME dycore on BG/P can write restart files using MPI-I/O directly. On 8192 cores, writing a 22.8 GB restart file:
- MPI collective with a derived type: 7.2 s (3.2 GB/s)
- Asynchronous, non-overlapping MPI_File_write_at(): 8.5 MB/s (ouch!).
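For context, a minimal sketch of the collective-write-with-derived-type pattern is shown below. This is not the HOMME implementation; the file name, array size, and decomposition are illustrative, and the example just writes a 1-D block-distributed array using MPI_Type_create_subarray and MPI_File_write_all.

```
program collective_restart_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, fh, filetype, nlocal
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp
  real(kind=8), allocatable :: local_data(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! each task owns a contiguous slice of the global restart array
  nlocal = 1000
  allocate(local_data(nlocal))
  local_data = real(rank, kind=8)

  ! derived type describing where this task's slice lives in the file
  call MPI_Type_create_subarray(1, (/ nprocs*nlocal /), (/ nlocal /), &
                                (/ rank*nlocal /), MPI_ORDER_FORTRAN,  &
                                MPI_REAL8, filetype, ierr)
  call MPI_Type_commit(filetype, ierr)

  call MPI_File_open(MPI_COMM_WORLD, 'restart.bin', &
                     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

  disp = 0
  call MPI_File_set_view(fh, disp, MPI_REAL8, filetype, 'native', &
                         MPI_INFO_NULL, ierr)

  ! collective write: all tasks participate, so the MPI-IO layer can
  ! aggregate the per-task pieces into large contiguous requests
  call MPI_File_write_all(fh, local_data, nlocal, MPI_REAL8, status, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Type_free(filetype, ierr)
  deallocate(local_data)
  call MPI_Finalize(ierr)
end program collective_restart_sketch
```

Because every task participates in the collective call, the MPI-IO layer can combine the per-task pieces into a small number of large writes, which is the usual reason the collective path so dramatically outperforms many independent MPI_File_write_at() calls.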
NETCDF

| NCPUS/io_cpus | cam_write_restart | pio_write_nf | MB/s |
|---|---|---|---|
| 2048/2048 | 207.8 | 201.6 | 61 | 
| 2048/128 | 137.7 | 136.1 | 91 | 
Parallel NETCDF

| NCPUS/io_cpus | cam_write_restart | pio_write_nf | MB/s |
|---|---|---|---|
| 8192/2048 | 86.0 | 16.9 | 732 | 
| 2048/2048 | 71.1 | 19.4 | 638 | 
| 2048/512 | 37.8 | 20.8 | 595 | 
| 2048/128 | 41.9 | 32.7 | |