PIOVDC is an extension to NCAR's Parallel IO (PIO) software, used by NCAR and various other organizations to write massive data sets in an optimized, parallel manner. The extension incorporates both extra logic and a subset of the VDF library from the VAPOR visualization package, allowing data written through PIO to be compressed on the fly into a VAPOR Data Collection (VDC). Scientists writing massive data sets thus have the option of either writing less data at lower than full resolution, or writing it all into files with progressive compression access that can be visualized with the VAPOR package. Compared to the offline tools provided by VAPOR, PIOVDC moves data from a user program's memory straight into the VDF format, with no manual conversion post-process needed.
PIOVDC currently comprises two projects that must be fetched, compiled, and installed separately.
The prerequisite software necessary to build and install PIOVDC is:
The first project is the PIO library, which must be retrieved with svn:
svn co https://parallelio.googlecode.com/svn/trunk_tags/pio1_5_7/
After retrieving the PIO source, the software must be configured using your installed PNetCDF location. PIO uses configure options to enable/disable the optional VDC components. Execute the commands:
cd pio1_5_7/pio
./configure --enable-pnetcdf=yes PNETCDF_PATH=/path/to/installed/pnetcdf --enable-netcdf=no --enable-compression=yes
After configuration, the software is ready to be built by running GNU make.
WARNING: If configuration completes but MPICC has not been detected (you see the string "{MPICC=}" in the list of Output Variables generated by configure), rerun configure, manually setting the environment variable to a known MPI C compiler:
./configure MPICC=mpicc --enable-pnetcdf=yes PNETCDF_PATH=/path/to/installed/pnetcdf --enable-netcdf=no --enable-compression=yes
Without the MPICC compiler, some files cannot be built by the makefiles.
Running GNU make will generate the necessary PIO files, libpio.a and pio.mod:
make
For now, you are finished with PIO.
Next, the source for building the VDF library is needed.
The VDF library subset can be downloaded from the PIO repository.
svn co https://parallelio.googlecode.com/svn/libpiovdc/trunk/ .
This will download a folder containing all the necessary parts to build the VDF libraries. To start, run autoreconf to generate the platform-specific files, then run the configure script:
./configure --with-pnetcdf=/path/to/pnetcdf
WARNING: If the expat header is installed on the system but not detected, it is in a non-standard path. Use the --with-expat option to tell the configure script where expat is installed.
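As a sketch, the bootstrap and configure steps above can be run together; the --install flag, which copies in any missing autotools auxiliary files, is an assumption about this source tree, as is the PNetCDF path:

```shell
# Generate the platform-specific configure machinery, then configure the build
autoreconf --install
./configure --with-pnetcdf=/path/to/pnetcdf
```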
After configuration is complete, you may build by running GNU make. This will generate two static libraries, libpiovdc.a and libpiocommon.a, in the vdf and common directories respectively; both are necessary for running PIOVDC.
Once all of your source code has been compiled into the three static libraries (libpio.a, libpiovdc.a, and libpiocommon.a) and the pio.mod module file, you can link user code against them. The PIO library is Fortran while the VDF libraries are C++, which makes linking a delicate operation; the process depends on which compiler suite you are using, and possibly on the compiler version. For example, to link a test_lib.F90 user program with the Intel compiler suite:
Build your user program into an object file, using ifort through the MPI wrapper script:
mpif90 -c test_lib.F90
Link your user program to the libraries (assuming the static libraries and pio.mod are in the current directory):
mpif90 test_lib.o -o TestLib -cxxlib -L. -L/path/to/pnetcdf/lib -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat
For the GNU compiler suite, build your user program into an object file:
mpif90 -c test_lib.F90 -ffree-line-length-none
Link your user program to the libraries, using either gfortran or g++ through the MPI wrappers:
GFORTRAN
mpif90 test_lib.o -o TestLib -L. -L/path/to/pnetcdf/lib -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat -lstdc++
G++
mpiCC test_lib.o -o TestLib -L. -L/path/to/pnetcdf/lib -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat -lgfortran
WARNING: Depending on your installation, the default MPI library might not contain the appropriate methods. If you get errors about undefined MPI symbols in either the Fortran libpio or the C++ libpiovdc, the MPI compiler script is not providing the symbols for that language, and those MPI functions must be linked in separately. For the OpenMPI implementation, the specific libraries needed are libmpi_cxx and libmpi_f77 when linking with gfortran and g++, respectively.
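For instance, the link lines above can be extended with the missing OpenMPI language bindings; the -lmpi_cxx and -lmpi_f77 library names are OpenMPI-specific assumptions and may differ in other MPI installations:

```shell
# Linking with the Fortran wrapper: add the C++ MPI bindings explicitly
mpif90 test_lib.o -o TestLib -L. -L/path/to/pnetcdf/lib \
    -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat -lstdc++ -lmpi_cxx

# Linking with the C++ wrapper: add the Fortran MPI bindings explicitly
mpiCC test_lib.o -o TestLib -L. -L/path/to/pnetcdf/lib \
    -lpio -lpiovdc -lpiocommon -lpnetcdf -lexpat -lgfortran -lmpi_f77
```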
PIOVDC functions as a well-integrated extension to PIO: all that is required to use it is overloading a few normal PIO API calls and, if you do not plan on using uncompressed data, omitting a few unnecessary PIO steps. If you are familiar with PIO, using PIOVDC will take almost no additional effort. For those unfamiliar with PIO, I will explain the basic workflow. Complete PIO documentation is available here.
This workflow assumes that your program is running in an MPI environment, with multiple MPI tasks.
prepared user data - PIOVDC only works with in-memory data, so the program must already have loaded the data into memory. PIO provides facilities for reading data into memory, but the supported file formats (PNetCDF, NetCDF+HDF5, MPI-IO, and raw binary) are not guaranteed to match the user program's data, and PIOVDC is not coded to use these facilities to get the data into memory as part of its regular operation.
available space - PIO uses special data rearrangement to ensure good IO performance. As a result, the memory requirements of a program using PIOVDC can be 2-3x the size of the data set you intend to write. Please ensure beforehand that all of the MPI tasks together have enough memory for your data set, as performance can be slow and unreliable when not enough memory is available.
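As a rough worked example of the 2-3x figure (the 4-byte single-precision element size is an assumption; double-precision data doubles these numbers):

```shell
# Rough aggregate-memory estimate for a 1024^3 single-precision (4-byte) grid
bytes=$((1024 * 1024 * 1024 * 4))      # raw data set size in bytes
gib=$((bytes / 1024 / 1024 / 1024))    # 4 GiB of raw data
echo "data set: ${gib} GiB"
echo "plan for roughly $((gib * 2))-$((gib * 3)) GiB of aggregate task memory"
```

This prints a 4 GiB data set size and an 8-12 GiB aggregate-memory budget across all MPI tasks.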
WARNING: PIO is highly optimized IO software, and as a result is very dependent on the underlying performance of the machine it runs on. Depending on the architecture and the way the machine is set up, you may also see non-linear scaling. For example, on the Janus supercomputing cluster a 1024^3 sample data set can be written using 64 computational and 64 IO tasks, but it takes 4x that number of tasks (256 computational, 256 IO) to safely write a 2048^3 data set.
An example program can be found here. This is the program used to test PIOVDC compression with data held in memory. A quick summary of the general usage of PIOVDC and PIO follows: