Its calculation is performed by the Nomad Lab base layer, possibly in multiple steps.

This allows each parser to be as small as possible (given that we want to parse "everything"), while still allowing code that works across parsers.

For example, the extraction of averages and statistics from an MD run, and possibly of a short trajectory sample (if that is something that will be stored in the repository), should not be redone in every parser, but performed just once on the normalized full data.

Changing what exactly is extracted can then be done in one place, and some things (like the BSSE correction) can only be done after multiple calculations have been parsed.
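As a sketch of this idea (the field name and data layout here are hypothetical, not the actual normalized schema), the shared statistics extraction could look like:

```python
import statistics

def md_statistics(frames):
    """Compute summary statistics once, on the normalized data, instead of
    re-implementing the extraction in every parser.

    `frames` is a hypothetical normalized layout: a list of dicts with a
    'potential_energy' entry per MD frame."""
    energies = [frame["potential_energy"] for frame in frames]
    return {
        "mean_potential_energy": statistics.mean(energies),
        "std_potential_energy": statistics.stdev(energies) if len(energies) > 1 else 0.0,
        "n_frames": len(energies),
    }
```

Because this runs on normalized data, it works unchanged regardless of which parser produced the frames.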
# Storage
What is on labenv-nomad (and labenv2-nomad):

## /raw_data -> /nomad/nomadlab/raw_data
Shared file system; it should have at least 10 TB for now (we are seeing excellent compression, down to 20% of the original size).

### /raw_data/data
It is populated with zip archives following the proposed BagIt standard:

https://tools.ietf.org/html/draft-kunze-bagit-13
The name of each stored archive is built as R + a checksum of the files, their modification dates, and, recursively, all contained directories, so it uniquely represents the data in the bag.

The archives are placed in a directory named after the first 3 letters of the archive name, to avoid having too many files in the same directory.
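A minimal sketch of such a naming scheme (the hash function and the exact fields folded into the checksum are assumptions; the real scheme may differ):

```python
import hashlib
from pathlib import Path

def bag_checksum(root: Path) -> str:
    # Fold relative path, modification date and content of every file,
    # recursively and in a deterministic (sorted) order, into one digest.
    # sha224 is an assumption; the actual hash used is not specified here.
    h = hashlib.sha224()
    for path in sorted(root.rglob("*")):
        h.update(str(path.relative_to(root)).encode())
        if path.is_file():
            h.update(str(int(path.stat().st_mtime)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

def archive_path(root: Path) -> str:
    # R<checksum> uniquely names the bag; the first 3 letters of that
    # name pick the shard directory.
    name = "R" + bag_checksum(root)
    return f"{name[:3]}/{name}.zip"
```

Because the name is a function of the contents, re-uploading identical data produces the same archive name.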
In /raw_data/data files are stable: the same name will always mean the same file.

BagIt bags can be verified, so corruption can be detected.
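For illustration, verification boils down to recomputing the payload checksums listed in the bag's manifest. This sketch works on an unpacked bag (the stored bags are zip archives) and assumes a sha256 manifest in the line format of the BagIt draft:

```python
import hashlib
from pathlib import Path

def verify_bag(bag_dir: Path, manifest: str = "manifest-sha256.txt"):
    """Return the payload paths whose checksum no longer matches the
    manifest; an empty list means the bag is intact."""
    corrupted = []
    for line in (bag_dir / manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            corrupted.append(rel_path)
    return corrupted
```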
For more info see the [raw data description](raw-data-description).

### /raw_data/metadata
Will contain the information needed to link back to the repository, citations, ...

### Replication
To replicate the whole data set, only /raw_data/data and /raw_data/metadata need to be copied.

In particular, parsing needs only /raw_data/data, which can be replicated in any way (rsync, ...).

All other data can be regenerated.

## /parsed -> /nomad/nomadlab/parsed
Currently shared storage; it might be generated on demand in the future, or archived (many small files).

Contains normalized JSON files, one per calculation, organized by parser id.

The name of the normalized file is P(checksum of "nmd://archive/path/to/main/file").
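The naming can be sketched as follows (the concrete hash function is an assumption; the same scheme, with prefix C, is what later identifies calculations inside the normalized HDF5 files):

```python
import hashlib

def calc_id(prefix: str, main_file_uri: str) -> str:
    # P<checksum> names the per-calculation JSON file; C<checksum>
    # identifies a calculation inside a normalized HDF5 file.
    # sha224 is an assumption; the actual hash is not specified here.
    digest = hashlib.sha224(main_file_uri.encode("utf-8")).hexdigest()
    return prefix + digest

parsed_name = calc_id("P", "nmd://archive/path/to/main/file") + ".json"
```

Deriving the name from the nmd:// URI of the main file means the same calculation always maps to the same file, no matter when or where it is parsed.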
## /normalized -> /nomad/nomadlab/normalized
Shared storage; it should be fast, to enable quick analysis.

Will contain normalized data in HDF5 format, named N<name of original raw data>.

A single file might contain many calculations, which avoids having too many files.

Within a file, each calculation is identified by C(checksum of "nmd://archive/path/to/main/file").

## /scratch
Local storage (currently 2 TB per VM).

### /scratch/work-local/<UUID>
Local storage used by one of the single-calculation workers, where files are decompressed.