# Nomad Lab base layer #

# Parsing Infrastructure

An overview of the parsing infrastructure. This describes what we want to have, not yet what we already have.

Every element can be used semi-independently, and the elements are connected through queues when working in a full-scale deployment. The queues (Kafka, Flink) exchange only relatively little information and have been demonstrated to handle very high throughput. Thus we can scale each of these pieces independently (by adding extra workers).
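
A minimal sketch of the kind of small message such a queue would carry, using the kafka-python client (the client choice, the topic name `tree-parse-events`, and the message fields are all illustrative assumptions):

```python
import json

from kafka import KafkaProducer  # pip install kafka-python (assumed client)

# The queue messages stay small: only ids, uris and parser names are
# exchanged, never the raw data itself.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

# Hypothetical message: "this main file should be parsed with this parser".
producer.send("tree-parse-events", {
    "uri": "nomad://R_example_archive/some/dir/OUTCAR",
    "parser": "vasp",
})
producer.flush()
```
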
We can also experiment with various pipelines.

## Raw Data Preparation

* runs on a separate machine; needs access to both the nomad repository data and database, and to the laboratory raw data storage
* goal: extract the public data of the repository (and only that), plus some identifiers to link back to the repository; prepare archives so that they are not too large (a manual file describes the split of huge datasets like AFLOW); use a unique identifier for each archive (built from 'R' + the base64 encoding of the first 168 bits of the SHA of the uncompressed tar, as sketched after this list) and store the archives in the nomad laboratory raw data storage
* mapping db: enables quick mapping between the internal id and the external repository id
* also gives a REST API to access single files
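
A minimal sketch of the archive id scheme described above; the text only says "sha", so the use of SHA-512 and url-safe base64 here are assumptions:

```python
import base64
import hashlib

def archive_gid(uncompressed_tar_path: str, prefix: str = "R") -> str:
    """Build an archive id: prefix + base64 of the first 168 bits (21 bytes)
    of the SHA of the uncompressed tar (SHA-512 assumed)."""
    sha = hashlib.sha512()
    with open(uncompressed_tar_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    # 168 bits = 21 bytes, which encode to exactly 28 base64 characters
    return prefix + base64.urlsafe_b64encode(sha.digest()[:21]).decode("ascii")
```
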
## Tree Parser

* takes a nomad uri referring to one archive or a subdirectory of it
* scans those files (reading only the first bytes)
* identifies the mime type, and thereby the parser for each file, and publishes this to a queue (see the sketch below)
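
A sketch of such a scan using only the standard library; the signature table and the parser names are hypothetical placeholders (a real deployment could instead use the tika library mentioned further below):

```python
import os

# Hypothetical mapping from byte patterns in the first bytes of a file to a
# parser name; a real tree parser would use a proper mime type detector.
SIGNATURES = [
    (b"vasp", "vasp"),
    (b"Invoking FHI-aims", "fhi-aims"),
]

def scan_tree(root: str, sniff_bytes: int = 2048):
    """Walk a directory tree, read only the first bytes of each file, and
    yield (path, parser) pairs; these small records go onto the queue."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                head = f.read(sniff_bytes)
            for signature, parser in SIGNATURES:
                if signature in head:
                    yield path, parser
                    break
```
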
## Single Run Parser Manager

* takes a path or a uri referring to a main file, and the parser needed for it
* prepares the files for the parser (if needed, by uncompressing the required files to working storage)
* starts the low level parser with the main file
* defines a calculation gid and stores its relationship with the uri in the relationship DB (through publishing to a queue); see the sketch after this list
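
The following sketch shows the flow of the manager under stated assumptions: the gid scheme mirrors the archive ids above, and `extract`, `start_parser` and `publish` are stand-ins for working-storage preparation, the low level parser invocation and the queue client:

```python
import base64
import hashlib

def calculation_gid(uri: str, prefix: str = "C") -> str:
    # Assumption: calculation gids are derived like the archive gids above,
    # here from the nomad uri of the main file.
    sha = hashlib.sha512(uri.encode("utf-8"))
    return prefix + base64.urlsafe_b64encode(sha.digest()[:21]).decode("ascii")

def handle_main_file(uri: str, parser: str, extract, start_parser, publish):
    """Prepare the files, record the gid/uri relationship, start the parser."""
    main_file = extract(uri)              # uncompress to working storage if needed
    gid = calculation_gid(uri)
    publish("calculation-relationships", {"gid": gid, "uri": uri})
    start_parser(parser, main_file, gid)  # hand off to the low level parser
```
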
## Low Level Parser

* takes a main file and emits a stream of events about the quantities it parsed (see the sketch below)
* quantities are already described with meta info, but are possibly still code specific
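
One way to realize such an event stream in Python is a generator that yields one event per parsed quantity; the event fields and the code-specific meta info name below are illustrative assumptions:

```python
import re
from typing import Iterator

def parse_main_file(path: str) -> Iterator[dict]:
    """Emit one event per quantity found in the main file; the meta name
    'x_example_total_energy' is a made-up, still code-specific name."""
    yield {"event": "openSection", "section": "section_run"}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            match = re.search(r"total energy\s*=\s*([-+.\deE]+)", line)
            if match:
                yield {
                    "event": "addValue",
                    "metaName": "x_example_total_energy",
                    "value": float(match.group(1)),
                }
    yield {"event": "closeSection", "section": "section_run"}
```
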
## Normalizer

* takes the stream of parsing events
* completes the code-specific calculations (see the sketch below)
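
A normalizer can then be a filter over that event stream; the unit conversion below is just one example of such a completion step, and the meta names are again assumptions:

```python
from typing import Iterable, Iterator

EV_TO_JOULE = 1.602176634e-19  # exact, by the SI definition of the electronvolt

def normalize(events: Iterable[dict]) -> Iterator[dict]:
    """Pass parsing events through, rewriting code-specific quantities into
    normalized ones (hypothetical meta names; eV assumed as the input unit)."""
    for event in events:
        if event.get("metaName") == "x_example_total_energy":
            yield {
                "event": "addValue",
                "metaName": "energy_total",  # normalized, code-independent name
                "value": event["value"] * EV_TO_JOULE,
            }
        else:
            yield event
```
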
## Output Writer

* writes out the normalized data either to JSON or to NetCDF (HDF5); a minimal JSON backend is sketched below
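
A sketch of a JSON backend for the writer; an HDF5 backend would consume the same event stream (e.g. via h5py), and the flat file layout here is an illustrative choice:

```python
import json
from typing import Iterable

def write_json(events: Iterable[dict], out_path: str) -> None:
    """Collect the normalized addValue events of one calculation into a
    dict and write it out as JSON."""
    values = {}
    for event in events:
        if event.get("event") == "addValue":
            values.setdefault(event["metaName"], []).append(event["value"])
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(values, f, indent=2)
```
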
## Pipeline Manager/Webserver

* sets up/connects the queues needed to make the various pieces work together (roughly as in the (old) sketch below)
* gives a REST interface (sketched after this list) to
    * get single values
    * get groups of values (given metadata and calculation ids)
    * get metadata info
    * the queues: trigger reparses, ...
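
A sketch of that REST interface with Flask; the framework choice, the routes, and the `lookup`/`enqueue_reparse` helpers are all assumptions for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-ins for the real storage and queue clients.
def lookup(gid: str, meta_name: str):
    raise NotImplementedError("read the value from the JSON/HDF5 store")

def enqueue_reparse(gid: str):
    raise NotImplementedError("publish a reparse request to the queue")

@app.route("/calculations/<gid>/values/<meta_name>")
def get_single_value(gid, meta_name):
    # get single values
    return jsonify({"gid": gid, "metaName": meta_name,
                    "value": lookup(gid, meta_name)})

@app.route("/values/<meta_name>", methods=["POST"])
def get_group_of_values(meta_name):
    # get a group of values, given a list of calculation ids in the body
    gids = request.get_json()["gids"]
    return jsonify({gid: lookup(gid, meta_name) for gid in gids})

@app.route("/calculations/<gid>/reparse", methods=["POST"])
def trigger_reparse(gid):
    # queues: trigger reparses
    enqueue_reparse(gid)
    return jsonify({"status": "queued"})
```
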
## Testers

A list of integration tests that should be written:

* test each single low level parser
    * speed, ratio of data extracted, various self consistency checks
* test the Single Run Parser Manager
    * check that, given a uri, the needed files are extracted
* test the Normalizer
    * see if some normalized values in fixed examples really get added (a pytest sketch follows this list)
* test the tree parser
* test the normalized files
* test scaling

* parser debugger infrastructure
    * special output to debug parsers: text -> extracted quantities, side by side
    * help to keep track of failed tests; rerun the parser on failed or random samples
    * re-run on all
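
As an example of the Normalizer test, a pytest sketch under the assumptions of the earlier snippets (`normalize` is the generator sketched in the Normalizer section, imported from a hypothetical module `normalizer`):

```python
# The module name `normalizer` is hypothetical; it stands for wherever the
# normalize() generator from the Normalizer section lives.
from normalizer import normalize

def test_normalizer_adds_normalized_energy():
    # A fixed example: one code-specific event in; the normalized value
    # must really get added to the output stream.
    fixed_events = [
        {"event": "addValue", "metaName": "x_example_total_energy", "value": 1.0},
    ]
    out = list(normalize(fixed_events))
    names = [e["metaName"] for e in out if e.get("event") == "addValue"]
    assert "energy_total" in names
```
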
# Older schema (still quite OK)

![Nomad Lab base layer schema](NomadLab-small.jpg)

After a parallel scan (identifying the file types using the [tika library](https://tika.apache.org/)), the files that are found to contain calculations are parsed, producing standard HDF5 files.

This first parsing tries to extract all the data, but performs little normalization (mainly unit conversions).
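
For the file type identification step, the tika python bindings could be used; treating a Tika server as available is an assumption, and the worker pool is only a minimal illustration of the parallel scan:

```python
from multiprocessing import Pool

from tika import detector  # pip install tika; talks to an Apache Tika server

def detect_mime(path: str) -> str:
    """Identify the mime type of a single file via the tika library."""
    return detector.from_file(path)

def parallel_scan(paths, workers: int = 8):
    """Mime-type all given files with a pool of workers."""
    with Pool(workers) as pool:
        return dict(zip(paths, pool.map(detect_mime, paths)))
```
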
Then the trigger for new data starts further transformations that might add extra normalized data, or remove unneeded properties.

For that reason the system is a reactive system: it slowly converges toward a fully described calculation, with properties that might be calculated collectively.

The calculation of some properties might automatically trigger an insertion into the database.

Flink is used to query the properties stored in the HDF5 and JSON files.
# SQL database #

An SQL database (Postgres) is used to store some often-used properties.

Every value stored in the database has nomad meta data corresponding to it.

Its calculation is performed by the Nomad Lab base layer, possibly in multiple steps, and it is then added to the database (see the sketch below).
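
A minimal sketch of such a value table; the schema is an assumption, and sqlite3 is used only to keep the example self-contained (the text names Postgres):

```python
import sqlite3

# Illustrative schema: every stored value is keyed by the calculation gid
# and by the name of the nomad meta data that describes it.
conn = sqlite3.connect(":memory:")  # the real system would use Postgres
conn.execute("""
    CREATE TABLE calculation_values (
        calculation_gid TEXT NOT NULL,
        meta_name       TEXT NOT NULL,
        value           REAL,
        PRIMARY KEY (calculation_gid, meta_name)
    )
""")
conn.execute(
    "INSERT INTO calculation_values VALUES (?, ?, ?)",
    ("C_example_gid", "energy_total", -1.23e-18),
)
print(conn.execute("SELECT * FROM calculation_values").fetchall())
```
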
This allows each parser to be as small as possible (while keeping the goal that we want to parse "everything"), and it still lets us write code that works across parsers.

For example, the extraction of averages and statistics from an MD run, and possibly of a short trajectory sample (if that is something that will be stored in the repository), should not be redone in each parser, but done just once on the normalized full data.

Changing what exactly is extracted can then be done in one place, and some things (like the BSSE correction) can only be done after having parsed multiple calculations.
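
As a sketch of such a shared, parser-independent step, computing MD statistics once from normalized data; the field names and the sampling stride are illustrative assumptions:

```python
from statistics import fmean, stdev

def md_statistics(energies, sample_stride: int = 100):
    """Compute averages/statistics and a short trajectory sample once, from
    normalized per-frame energies, instead of in every parser."""
    return {
        "energy_mean": fmean(energies),
        "energy_stdev": stdev(energies) if len(energies) > 1 else 0.0,
        "trajectory_sample": list(energies[::sample_stride]),
    }
```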