HDF5 support
- theory parsers use HDF5 for large data (e.g. MD)
- written as an HDF5 "sibling" file to the archive .msg, but placed among the raw files
- written via the ServerContext
- if a numpy quantity has the HDF5Reference type, its values are written to HDF5
- if a numpy quantity has the corresponding annotation, its values are also written to HDF5
- for both the annotation and the HDF5Reference type, the archive browser opens H5Web
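A minimal sketch of how such a reference could be taken apart client-side; the `/uploads/<id>/raw/<file>#<dataset>` shape is an assumption for illustration, not necessarily the exact HDF5Reference serialization:

```python
# Sketch: split an HDF5 reference string into file path and dataset path.
# The '#' fragment convention is an assumption for illustration.

def split_hdf5_reference(ref: str) -> tuple[str, str]:
    """'/uploads/u1/raw/data.h5#/md/positions'
    -> ('/uploads/u1/raw/data.h5', '/md/positions')"""
    file_part, _, dataset_part = ref.partition("#")
    # default to the HDF5 root group if no fragment is given
    return file_part, dataset_part or "/"

file_path, dataset_path = split_hdf5_reference(
    "/uploads/u1/raw/data.h5#/md/positions")
```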
problems
- we create additional raw files even though the data should be part of the archive
- once published, h5grove cannot work with the HDF5 raw files anymore
- entries with these HDF5 files cannot be reprocessed, because published raw files are immutable
- HDF5 references are not transparently resolved in the archive API
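The last problem can be illustrated with a small traversal sketch: walk an archive tree and swap reference strings for loaded values. `load_dataset` is a hypothetical hook (a real implementation would read via h5py/h5grove); a stub stands in here so the traversal is self-contained:

```python
# Sketch: transparent HDF5-reference resolution over an archive-like dict.
# is_ref is a deliberately loose heuristic for this illustration.

def resolve_refs(node, load_dataset,
                 is_ref=lambda v: isinstance(v, str) and "#" in v):
    if isinstance(node, dict):
        return {k: resolve_refs(v, load_dataset, is_ref) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_refs(v, load_dataset, is_ref) for v in node]
    if is_ref(node):
        # in NOMAD this would open the HDF5 file and read the dataset
        return load_dataset(node)
    return node

archive = {"results": {"positions": "raw/data.h5#/md/positions", "n": 3}}
resolved = resolve_refs(archive, lambda ref: [1.0, 2.0, 3.0])
```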
solution
- move the archive HDF5 files into the archive folder
- publish the archive in a way that the HDF5 stays usable (h5grove)
- publish the raw files in a way that the HDF5 stays usable (h5grove)
- bigger idea: replace the archive with HDF5 ...; even bigger idea: replace files with S3 object storage via HSDS
- fully replace
- hybrid
- whole entry, whole upload
- always have both + transparent NOMAD API
unpublished upload: raw/** (incl. raw/<entry_id>.hdf5), archive/<entry_id>.msg
published upload: raw-...-.zip, raw/**/*.hdf5, archive-...-.msg
- archive/<entry_id>.hdf5
- archive-...-.hdf5
- / ...
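The first solution bullet can be sketched as a reference rewrite from the raw folder into the archive folder; the path shapes follow the unpublished layout above and the helper name is hypothetical:

```python
# Sketch: rewrite an entry's HDF5 reference from raw/<entry_id>.hdf5
# to archive/<entry_id>.hdf5, keeping the dataset fragment intact.

def relocate_reference(ref: str, entry_id: str) -> str:
    old = f"raw/{entry_id}.hdf5"
    new = f"archive/{entry_id}.hdf5"
    file_part, sep, dataset = ref.partition("#")
    if file_part.endswith(old):
        file_part = file_part[: -len(old)] + new
    return file_part + sep + dataset
```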
architecture stack (top to bottom):
  h5web   | nomad GUI
  h5grove | HSDS | archive API
  hdf5    | msg  | parquet
  s3, fs  | fs   | fs
tasks:
- suite of benchmarks
- generic benchmark
- impl for .msg/archive
- impl for .hdf5
- impl for .parquet
- investigate whether h5grove could read HDF5 from within a .zip file
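The benchmark tasks could share one generic harness that times the same read workload against pluggable backends; the `.msg`/`.hdf5`/`.parquet` implementations would register where the dict-backed stub sits in this sketch:

```python
# Sketch: generic benchmark harness. Each backend is just a callable
# key -> value; real implementations would wrap msgpack, h5py, pyarrow.

import time

def benchmark(backends: dict, workload: list) -> dict:
    results = {}
    for name, read in backends.items():
        start = time.perf_counter()
        values = [read(key) for key in workload]
        results[name] = {
            "seconds": time.perf_counter() - start,
            "values": values,
        }
    return results

data = {f"k{i}": i for i in range(100)}
results = benchmark(
    {"dict-stub": data.__getitem__},
    [f"k{i}" for i in range(100)],
)
```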
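On the zip question: h5py (which h5grove builds on) can read from seekable file-like objects, so it largely reduces to whether a zip member is seekable. A quick stdlib check, assuming uncompressed (ZIP_STORED) members, where seeking is cheap; compressed members must re-decompress on backward seeks:

```python
# Sketch: ZipExtFile has supported seek()/seekable() since Python 3.7,
# which is what a file-like HDF5 reader would need.

import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    # fake payload starting with the HDF5 signature
    zf.writestr("entry.hdf5", b"\x89HDF\r\n\x1a\n" + b"0" * 64)

with zipfile.ZipFile(buf) as zf:
    member = zf.open("entry.hdf5")
    seekable = member.seekable()  # True for a seekable underlying file
    member.seek(4)
    head = member.read(4)         # bytes 4..7 of the signature
```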