|
|
# Parsing #
|
|
|
|
|
|
The goal of parsing is to extract the interesting pieces of information from an output and prepare them for further processing.
|
|
|
|
|
|
It is necessarily dependent on the source output (i.e. on the code used for the calculation).
|
|
|
Writing a parser is a bit tedious and needs specialized knowledge of the code (depending on how well it is documented), its units, and the ability to map its output to the common model we use.
|
|
|
|
|
|
Being so code dependent means that any change to the parser infrastructure is multiplied by 50 (or however many parsers we have), and has to be done keeping in mind the unique way in which each code outputs the different quantities.
|
|
|
|
|
|
Finally, parsing performance is very important, as parsing has to go through basically all the data we have.
|
|
|
|
|
|
Thus it should be clear that:
|
|
|
|
|
|
1) we do not want to make changes that touch all the parsers at once; we might want to update a single parser from time to time, but we definitely want to avoid the factor 50 (or more) when making changes
|
|
|
2) parsers should be fast
|
|
|
3) when we need to parse something we should try to be able to reuse our parsers (again to avoid the factor 50, or ending up with access to just a small subset of the data)
|
|
|
|
|
|
# Standardized data
|
|
|
|
|
|
A common parser usage looks for a few quantities and then does some computation with them.
|
|
|
The idea of the conversion layer is to prepare basically all the possible quantities, performing all conversions, so that they are available in a good standardized form for further processing.
|
|
|
|
|
|
Here we already have an important issue: how do you get the data out of this standardized form?
|
|
|
In a way this will probably entail some form of parsing, but it should be much faster than the original parsing and conversion for it to make sense.
|
|
|
|
|
|
Here we choose hdf5 as a good compromise: it can store large amounts of data, and has indexes, so that skipping to a given piece of data is quick.
|
|
|
It has two main drawbacks:
|
|
|
|
|
|
1) it is relatively slow to create/write a single array, so storing many small arrays (maybe with only a single value) is a bit expensive
|
|
|
2) it needs the hdf/netcdf library to be read (which can be used basically from any language, on any machine)
|
|
|
|
|
|
So we have json to complement it: it is human readable, it can be parsed even by hand, and it plays well with web browsers.
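
As a rough illustration, the following sketch (using the h5py and standard json Python libraries; the file layout and quantity names are invented for the example) shows how parsed quantities could be written to such a standardized form, with larger arrays going to hdf5 and a small human readable summary going to json:

    # Sketch only: the group/dataset layout and quantity names below are
    # illustrative, not the actual layout of the standardized format.
    import json
    import h5py
    import numpy as np

    parsed = {
        "atom_positions": np.random.rand(8, 3),  # a larger array -> hdf5
        "energy_total": -123.45,                 # a single scalar -> json
    }

    with h5py.File("calculation.h5", "w") as f:
        run = f.create_group("section_run")
        run.create_dataset("atom_positions", data=parsed["atom_positions"])

    with open("calculation.json", "w") as f:
        json.dump({"energy_total": parsed["energy_total"]}, f, indent=2)
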
|
|
|
|
|
|
# Direct parsing
|
|
|
|
|
|
If the data was already parsed to this standardized form, then further analysis is faster, but if you start from the raw data it adds an extra step.
|
|
|
It is too early to see all the applications, but direct parsing (for example if one wants some data *now*) is so commonly used that it will probably still be relevant for some time.
|
|
|
|
|
|
Ideally we want to be able to use our parsers also for this kind of application.
|
|
|
The major hurdle in this is the time lost in parsing and writing out data that is not relevant for the current application (think for example parsing the wavefunctions).
|
|
|
If one parses only the quantities needed there is still a bit of overhead, but it is small (as the parsing of the output format is optimized to be fast) and well offset by the flexibility that one gains; in some cases even this overhead can be removed.
|
|
|
|
|
|
What we do is have a way for the parser to tell us what it can extract, and a way to tell the parser to prepare a specialized parser just for the pieces we are interested in.
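
A hypothetical sketch of this exchange (the class and method names are invented, not an existing API): the parser advertises what it can extract, and the caller asks for a specialized parser restricted to the quantities it actually needs.

    class ExampleParser:
        # metadata names this parser knows how to extract (illustrative)
        supported_metadata = ["energy_total", "atom_positions", "wavefunctions"]

        def extractable_metadata(self):
            """Tell the caller what this parser can produce."""
            return list(self.supported_metadata)

        def specialize(self, wanted):
            """Return a parse function restricted to the wanted quantities."""
            selected = [m for m in self.supported_metadata if m in wanted]

            def parse(main_file):
                # ...visit only the parts of the output needed for `selected`...
                return {m: None for m in selected}

            return parse

    parser = ExampleParser()
    parse_energy = parser.specialize(["energy_total"])  # skips the wavefunctions
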
|
|
|
|
|
|
# Metadata
|
|
|
|
|
|
To define what is extracted, and to model it from the conceptual point of view, we use metadata.
|
|
|
Our metadata model is pretty flexible, can accommodate extensions and is described in [NomadMetaInfo]().
|
|
|
|
|
|
This is central to making the whole system interoperable.
|
|
|
|
|
|
Large changes of the metadata model will require changes to the parsers, but description changes and additions will not need any real changes.
|
|
|
In any case all the parsers always specify the metadata they use, so changes can be detected and either handled automatically or reported as warnings.
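
For concreteness, a single metadata entry might look roughly as follows (written here as a python dictionary; the field names are only indicative, the authoritative definition is in NomadMetaInfo):

    # Illustrative sketch of one metadata (metainfo) entry; the field names
    # are an assumption for this example, see NomadMetaInfo for the real ones.
    example_metainfo = {
        "name": "energy_total",
        "description": "Total energy of the configuration (in standardized units).",
        "dtypeStr": "f",   # floating point scalar
        "shape": [],
        "superNames": ["section_single_configuration_calculation"],
    }
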
|
|
|
|
|
|
# Reducing the factor 50
|
|
|
|
|
|
To reduce the work that needs to be repeated, a classical approach is to write a library and reusable components.
|
|
|
|
|
|
Unfortunately there is little in common between a parser for Xml and one for free text.
|
|
|
Also, some want to write in python (as it is a nice language, even though it can easily become ~10 times slower than C/java/scala), while others (like octopus) already have a C/yacc based parser for their input that they want to reuse.
|
|
|
|
|
|
Still, some things can be abstracted (for example metadata handling, and writing to the common format), and for a class of parsers one can have a common infrastructure.
|
|
|
|
|
|
The ideal approach is to have declarative parsers that specify the structure and position of the things to extract and their mapping. The actual parser is then "built" by reading this specification.
|
|
|
This allows a lot of flexibility in optimizing the parsers and evolving them (one can improve the parser generation independently from the actual description of the things to parse).
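
A minimal sketch of the idea in python (the quantity names and regular expressions are invented): the parser is described as data, and the actual parser is built from that description, so the generator can be optimized without touching the specification.

    import re

    # Declarative specification: metadata name -> regular expression
    # (names and patterns are invented for the example).
    SPEC = {
        "energy_total": r"Total energy\s*:\s*(?P<value>[-+0-9.eEdD]+)",
        "number_of_atoms": r"Number of atoms\s*:\s*(?P<value>\d+)",
    }

    def build_parser(spec):
        """Build an actual parser from the declarative specification."""
        compiled = {name: re.compile(pattern) for name, pattern in spec.items()}

        def parse(lines):
            results = {}
            for line in lines:
                for name, regex in compiled.items():
                    match = regex.search(line)
                    if match:
                        results[name] = match.group("value")
            return results

        return parse

    parse = build_parser(SPEC)  # the generator can evolve independently of SPEC
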
|
|
|
|
|
|
Such an approach is what I tried to use for fhi-aims (it is not perfectly declarative, but close), and it will hopefully allow its evolution and improvement.
|
|
|
The code written should also be reusable to parse other codes that have a "free form" output like fhi-aims.
|
|
|
|
|
|
Xml parsing also lends itself to the declarative approach, by having a mapping of the entities to our abstract model.
|
|
|
|
|
|
Indeed one can then even write a parser generator that calls functions with the parsed values without ever writing them out, removing the overhead of passing through a standardized format.
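
A sketch of such a mapping with the python standard library (the element paths and metadata names are invented); the handler callback is what allows the values to be used directly without writing them out first:

    import xml.etree.ElementTree as ET

    # Declarative mapping: path into the xml tree -> metadata name
    # (paths and names are invented for the example).
    MAPPING = {
        "groundstate/energy/@total": "energy_total",      # an attribute
        "structure/crystal/basevect": "lattice_vector",   # repeated elements
    }

    def parse_xml(path, handler):
        root = ET.parse(path).getroot()
        for xml_path, meta_name in MAPPING.items():
            if "/@" in xml_path:
                element_path, attribute = xml_path.split("/@")
                for element in root.findall(element_path):
                    handler(meta_name, element.get(attribute))
            else:
                for element in root.findall(xml_path):
                    handler(meta_name, element.text)

    # e.g. parse_xml("run.xml", lambda name, value: print(name, value))
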
|
|
|
|
|
|
# Conversion layer
|
|
|
|
|
|
The parsers are an important component of the conversion layer, but are not the conversion layer.
|
|
|
Indeed (again to avoid the factor 50) the parsers should do the minimal conversion possible to bring the data to a well defined model; further conversions should be done by common code, so that they need to be implemented only once.
|
|
|
|
|
|
The driver of the conversion layer is written in scala and takes care of identifying the files up to the point at which it knows which parser to use; it then takes the parsing output and normalizes it.
|
|
|
|
|
|
# Practical parsing #
|
|
|
|
|
|
### Process Api ###
|
|
|
Different codes store the data in different ways: some, like exciting, use an xml format that is easy to parse, but most codes use a more free-form format.
|
|
|
In general a different parser is used for each code.
|
|
|
|
|
|
All parsers should conform to the following interface:
|
|
|
They should accept either an input from stdin in json format of the form:
|
|
|
|
|
|
    {
        "version": "nomadparsein.json 1.0",
        "tmpDir": "/tmp/",
        "metainfoToKeep": [],
        "metainfoToSkip": [],
        "mainFile": "path/to/main/file",
        "mainFileOffset": null,
        "mainFileLength": null
    }
|
|
|
|
|
|
or command line arguments defining
|
|
|
- tmpDir: a path to a directory that can be used to create temporary files,
|
|
|
- metainfoToSkip: regular expression matching the metainfo to skip,
|
|
|
- metainfoToKeep: regular expressions matching the metainfo to keep (takes precedence over metainfoToSkip),
|
|
|
- mainFile: the main file to parse.
|
|
|
|
|
|
metainfoToSkip and metainfoToKeep allow (but do not require) the parser to optimize the parsing so that only the interesting properties are parsed.
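
For example, a parser could combine the two lists along these lines (a sketch, assuming the regular expressions are matched against metainfo names):

    import re

    def should_parse(meta_name, metainfo_to_keep, metainfo_to_skip):
        """metainfoToKeep has precedence over metainfoToSkip."""
        if any(re.search(pattern, meta_name) for pattern in metainfo_to_keep):
            return True
        if any(re.search(pattern, meta_name) for pattern in metainfo_to_skip):
            return False
        return True

    should_parse("section_eigenvalues", [], ["eigenvalues|wavefunction"])  # False
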
|
|
|
|
|
|
The parser should then reply with a json list containing dictionaries that give:
|
|
|
- nomaddata or nomadinfo: all the data extracted for a calculation run, either inline or as a path to a file containing it (the file should be in the given tmpDir to avoid the risk of leaving it around)
|
|
|
- files: a list of the files connected with the calculation run described
|
|
|
- mainFileSha: the sha (checksum) of the part of the main file representing the calculation
|
|
|
- collectiveSha: sha of all the files connected to the calculation
|
|
|
|
|
|
The parser can be implemented in any language as long as it conforms to this interface.
|
|
|
All the parsers currently used are implemented in python, and there is already an infrastructure to simplify the implementation of such parsers in python, as detailed in the [NomadParsing wiki](http://nomad-dev.rz-berlin.mpg.de/wiki/NomadParsing).
|
|
|
|
|
|
Example output:
|
|
|
|
|
|
    [
        "nomadparseout.json 1.0",
        { "defaultMetaInfo": "path/to/file or metainfo list" },
        {
            "mainFileSha": "",
            "collectiveSha": "",
            "nomaddata": pathToFileOr [...],
            "nomadinfo": pathToFileOr [...],
            "files": [...],
            "warnings": [...]
        }
    ]
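
Putting the two together, a minimal parser executable conforming to this interface could look roughly like the following sketch (the actual parsing, the sha computation and all error handling are omitted):

    #!/usr/bin/env python
    import json
    import sys

    def main():
        request = json.load(sys.stdin)   # the nomadparsein.json document
        main_file = request["mainFile"]

        data = {}  # ...run the real parser on main_file here...

        reply = [
            "nomadparseout.json 1.0",
            {"defaultMetaInfo": "path/to/file or metainfo list"},
            {
                "mainFileSha": "",
                "collectiveSha": "",
                "nomaddata": data,
                "nomadinfo": {},
                "files": [main_file],
                "warnings": [],
            },
        ]
        json.dump(reply, sys.stdout)

    if __name__ == "__main__":
        main()
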
|
|
|
|
|
|
|
|
|
### Python Api ###
|
|
|
|
|
|
### Free form parser ###
|
|
|
|
|
|
It is possible to define nested sections and to use regular expressions to extract values within them.
|
|
|
This way a relatively declarative definition of the parser can be given.
|
|
|
This allows one to compile different parsers and reach good performance.
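
A sketch of how such nested sections could be declared (the structure is the point here; the section names, keys and patterns are invented):

    scf_iteration = {
        "name": "section_scf_iteration",
        "start_re": r"Begin self-consistency iteration",
        "quantities": {"energy_total_scf_iteration": r"Total energy\s*:\s*(\S+)"},
        "sub_sections": [],
    }

    single_configuration = {
        "name": "section_single_configuration",
        "start_re": r"Begin self-consistency loop",
        "quantities": {},
        "sub_sections": [scf_iteration],  # nesting mirrors the output structure
    }
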
|
|
|
|
|
|
The fhi-aims parser written in python takes this approach, but might need to be updated to reflect the changes made during the practical implementation.
|
|
|
|
|
|
### Xml Parsing ###
|
|
|
|
|
|
If an xml schema is available it can be used to associate properties to simple transformations of some entries, but probably
|
|
|
|
|
|
# Current Status #
|
|
|
|
|
|
The FHI-aims parser was written with this approach in mind, but has to be fixed for the changes that were done when implementing it in practice.
|
|
\ No newline at end of file |