Here are some notes to help seed our discussion for this week's meetings.
Here's how I see the data organization in IODA:
The Record is there primarily to keep MPI distribution from breaking up groupings that need to stay intact. A particular sounding, or all channels of an instrument, are candidates for a Record.
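To make the role of the Record concrete, here is a minimal sketch that assumes whole records are dealt out to MPI ranks round-robin. The names `distributeByRecord`, `recIds` and `nRanks` are hypothetical and not part of IODA; the only point is that every location sharing a record ID lands on the same rank.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns, for each location, the rank that owns it.
// Whole records are assigned round-robin, so locations with the same
// record ID (e.g. one sounding) are never split across ranks.
std::vector<int> distributeByRecord(const std::vector<std::string>& recIds,
                                    int nRanks) {
    assert(nRanks > 0);
    std::map<std::string, int> rankOfRecord;  // record ID -> owning rank
    std::vector<int> rankOfLoc;
    rankOfLoc.reserve(recIds.size());
    int nextRank = 0;
    for (const std::string& id : recIds) {
        auto it = rankOfRecord.find(id);
        if (it == rankOfRecord.end()) {
            // First time this record is seen: give it the next rank in turn.
            it = rankOfRecord.emplace(id, nextRank).first;
            nextRank = (nextRank + 1) % nRanks;
        }
        rankOfLoc.push_back(it->second);
    }
    return rankOfLoc;
}
```

A real distribution would probably also balance the number of locations per rank; round-robin is used here only because it is the simplest scheme that keeps records intact.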
Some design goals for IODA that provide context for this discussion are:
As a first stab at the interface for IODA, I've organized it into three "views", one for each of the Record, Meta Data and Obs Data as shown above.
The Record interface appears as a list that contains record IDs such as Station ID or Flight ID.
Rec ID 1 |
Rec ID 2 |
Rec ID 3 |
... |
The Meta Data interface appears as a 2D table
| | Lat | Lon | Time | Level | ... |
|---|---|---|---|---|---|
| Loc 1 | | | | | |
| Loc 2 | | | | | |
| Loc 3 | | | | | |
| ... | | | | | |
The Obs Data interface appears as a 3D table
| | T | Q | U | V | ... |
|---|---|---|---|---|---|
| ObsValue | | | | | |
| ObsError | | | | | |
| ObsQc | | | | | |
| O minus A | | | | | |
| O minus B | | | | | |
| ... | | | | | |
(not showing depth dimension - goes into the page)
Create three structures corresponding to the three interface views (Record, Meta Data, Obs Data).
Associations
Bookkeeping quantities
Missing values
Last week we decided to take a look at Boost MultiIndex as a possible first-stage implementation. This is a header-only library that lets one define an arbitrary C++ struct and then attach indexes to it, so that multiple items can be stored and accessed quickly.
Two possible alternatives to MultiIndex are Redis and SQLite. Both can run as in-memory databases (Redis by design, SQLite through its in-memory mode) and could bring us closer to the SQL-like access that we want, but they incur more overhead than MultiIndex.
Both Redis and SQLite have C++ interfaces, and they both work with their own file formats. We would have to translate netCDF and ODB2 files into Redis or SQLite files, but both packages provide the file I/O routines (to their file formats). In addition to a new file format, Redis also requires a server process to be running, with the C++ interface used by a client (IODA) to access the database. The Redis download from the Redis website is the server code, and you need to go to a third-party source for the client code. There is one on GitHub called bredis which uses Boost.Asio (asynchronous I/O, which is header-only) to communicate with the server process.
The in-memory aspect of both Redis and SQLite, I believe, is that the data is read into memory from the file and all access then operates on the memory image. This increases performance, but you are then limited by how large a memory image the operating system lets you create during execution.
In my opinion, Redis seems too messy. You have to compile code from two sources (server from Redis website, client from another source), and you have to have server processes running alongside the JEDI process.
SQLite seems to have promise as a longer-term solution since it would give us the SQL command interface that we would like to have.
MultiIndex seems promising for an initial implementation. For code development, it is much simpler than Redis or SQLite and we should be able to have a solution in place much faster as a result.
The top shows how observation data are stored. Call this piece ObsData.
The bottom shows how meta data are stored. Call this piece MetaData.
The right side (blue) shows the interface to other pieces of JEDI.
The pages in ObsData are similar to the page in MetaData. These have locations along their x-axes, and variables along their y-axes. The MetaData page doesn't fit into the ObsData scheme since it contains a different set of variables (y-axis).
The handling of missing values needs to be done carefully. We could use the QC page to enter a code that represents "missing", but then what if an actual QC code from an obs file collides with the "missing" code? Perhaps we should use a separate page for marking missing values. The obs vector operations will need access to the "missing" marks to do their methods correctly.
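The separate-page idea could look something like the following sketch. It assumes a hypothetical `ObsVector` struct that carries a boolean mask alongside its values; none of these names come from IODA. It only illustrates how an obs vector operation (here a dot product) could consult the "missing" marks without touching the QC codes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical obs vector: values plus a separate "missing" page,
// kept independent of the QC page so no QC code can collide with it.
struct ObsVector {
    std::vector<double> values;  // one entry per location
    std::vector<bool> missing;   // true where the value is absent
};

// Dot product that skips any location marked missing in either operand.
double dotProduct(const ObsVector& a, const ObsVector& b) {
    assert(a.values.size() == b.values.size());
    assert(a.missing.size() == a.values.size());
    assert(b.missing.size() == b.values.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.values.size(); ++i) {
        if (a.missing[i] || b.missing[i]) continue;
        sum += a.values[i] * b.values[i];
    }
    return sum;
}
```

The cost of this choice is one extra page of storage per group; the benefit is that the QC page can hold whatever codes the obs file supplies.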
The basic building block would look like the following struct example. The storage is a collection of 1D arrays (vectors) that hold values for all locations for one variable.

    #include <memory>
    #include <string>

    // T is the element type of the data array (not the temperature variable).
    template <typename T>
    struct Variable {
        // index keys
        std::string ObsGroup;  // ObsValue, ObsError, HofX, QC, etc.
        std::string VarName;   // T, Q, U, V, etc.

        // data
        std::unique_ptr<T[]> VarData;  // 1D array, nlocs long
    };
The above example is for the ObsData section. It provides two keys so that the data can be accessed by observation group (ObsValue, ObsError, QC, HofX, etc.) or by variable name (T, Q, U, V, etc.) or by both. The MetaData section would only need the variable name key.
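The three access patterns described above (by group, by variable name, or by both) can be sketched with the standard library alone. This is a stand-in for illustration, not Boost MultiIndex; the names `ObsTable`, `store`, `lookup`, `varsInGroup` and `groupsWithVar` are hypothetical, and `double` is assumed as the element type.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Key is (ObsGroup, VarName); value is the 1D array over locations.
using ObsKey = std::pair<std::string, std::string>;
using ObsTable = std::map<ObsKey, std::vector<double>>;

// Store one variable; every array is expected to be nlocs long.
void store(ObsTable& t, const std::string& group, const std::string& var,
           std::vector<double> data, std::size_t nlocs) {
    assert(data.size() == nlocs);
    t[ObsKey(group, var)] = std::move(data);
}

// Access by both keys: returns nullptr if the entry does not exist.
const std::vector<double>* lookup(const ObsTable& t, const std::string& group,
                                  const std::string& var) {
    auto it = t.find(ObsKey(group, var));
    return it == t.end() ? nullptr : &it->second;
}

// Access by group only. Keys sort by group first, so the entries of one
// group are contiguous and can be reached with lower_bound.
std::vector<std::string> varsInGroup(const ObsTable& t,
                                     const std::string& group) {
    std::vector<std::string> names;
    for (auto it = t.lower_bound(ObsKey(group, ""));
         it != t.end() && it->first.first == group; ++it) {
        names.push_back(it->first.second);
    }
    return names;
}

// Access by variable name only. This needs a full scan here, which is
// the kind of lookup a second index would make fast.
std::vector<std::string> groupsWithVar(const ObsTable& t,
                                       const std::string& var) {
    std::vector<std::string> groups;
    for (const auto& entry : t) {
        if (entry.first.second == var) groups.push_back(entry.first.first);
    }
    return groups;
}
```

With a single ordered map, lookup by group is cheap but lookup by variable name is linear in the number of entries. Attaching a separate index per key is what would remove that asymmetry.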
Perhaps we could use classes where the data members hold the keys and data vector. Then we could form the MetaData table from a base class that has the VarName key and VarData vector, and form the ObsData table from a derived class that adds on the ObsGroup key.
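A rough sketch of that class layout follows. The class names `MetaVariable` and `ObsVariable` are hypothetical, and `double` is assumed for the data purely to keep the example short (real meta data such as Time would likely need other types).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Base entry, sufficient for the MetaData table: VarName key plus data.
class MetaVariable {
 public:
    MetaVariable(std::string varName, std::vector<double> varData)
        : varName_(std::move(varName)), varData_(std::move(varData)) {}
    virtual ~MetaVariable() = default;

    const std::string& varName() const { return varName_; }
    const std::vector<double>& varData() const { return varData_; }
    std::size_t nlocs() const { return varData_.size(); }

 private:
    std::string varName_;          // Lat, Lon, T, Q, etc.
    std::vector<double> varData_;  // 1D array, nlocs long
};

// Derived entry for the ObsData table: adds the ObsGroup key.
class ObsVariable : public MetaVariable {
 public:
    ObsVariable(std::string obsGroup, std::string varName,
                std::vector<double> varData)
        : MetaVariable(std::move(varName), std::move(varData)),
          obsGroup_(std::move(obsGroup)) {
        assert(!obsGroup_.empty());  // an obs entry must belong to a group
    }

    const std::string& obsGroup() const { return obsGroup_; }

 private:
    std::string obsGroup_;  // ObsValue, ObsError, HofX, QC, etc.
};
```

Code that only needs the variable name and data can then take a `const MetaVariable&` and work on entries from either table.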
Schematically, the memory for ObsData would look like: