# Practical parsing #
### Process API ###

Different codes store their data in different ways: some, like exciting, use an XML format that is easy to parse, but most codes use a more free-form format.

In general a different parser is used for each code.
All parsers should conform to the following interface:

They should accept either an input from stdin in JSON format:

    {
        "version": "nomadparsein.json 1.0",
        "tmpDir": "/tmp/",
        "metainfoToKeep": [],
        "metainfoToSkip": [],
        "mainFile": "path/to/main/file",
        "mainFileOffset": null,
        "mainFileLength": null
    }
or command line arguments defining:

- tmpDir: a path to a directory that can be used to create temporary files,
- metainfoToSkip: regular expressions matching the metainfo to skip,
- metainfoToKeep: regular expressions matching the metainfo to keep (takes precedence over metainfoToSkip),
- mainFile: the main file to parse.

metainfoToSkip and metainfoToKeep allow (but do not require) the parser to optimize the parsing so that only the interesting properties are parsed.
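As an illustration, reading and validating the stdin form of this control object could look like the following sketch (the function name `read_parse_request` and the defaulting of optional keys are assumptions for the example, not part of the specification):

```python
import json
import sys

def read_parse_request(stdin=sys.stdin):
    """Read the nomadparsein.json control object from stdin.

    Returns a dict with tmpDir, metainfoToKeep, metainfoToSkip,
    mainFile, mainFileOffset and mainFileLength (sketch; assumes the
    JSON input form of the interface described above).
    """
    request = json.load(stdin)
    expected = "nomadparsein.json 1.0"
    if request.get("version") != expected:
        raise ValueError("unsupported version: %r" % request.get("version"))
    # fill in defaults for the optional keys
    request.setdefault("metainfoToKeep", [])
    request.setdefault("metainfoToSkip", [])
    request.setdefault("mainFileOffset", None)
    request.setdefault("mainFileLength", None)
    return request
```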
The parser should then reply with a JSON list containing dictionaries that give:

- nomaddata or nomadinfo: either inline or as a path to a file (the file should be in the given tmpDir to avoid the risk of leaving it around) with all the data extracted for a calculation run,
- files: a list of the files connected with the calculation run described,
- mainFileSha: the sha (checksum) of the part of the main file representing the calculation,
- collectiveSha: the sha of all the files connected to the calculation.
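Since mainFileOffset and mainFileLength delimit the part of the main file representing the calculation, the mainFileSha can be computed over just that slice. A minimal sketch (the function name is illustrative, and the interface does not fix which sha variant is used; sha512 here is an assumption):

```python
import hashlib

def sha_of_file_part(path, offset=None, length=None, chunk=1 << 20):
    """Return the hex sha512 of a slice of a file.

    offset/length of None mean start of file / up to the end, matching
    the null values of mainFileOffset and mainFileLength (sketch).
    """
    h = hashlib.sha512()
    with open(path, "rb") as f:
        if offset:
            f.seek(offset)
        remaining = length
        while remaining is None or remaining > 0:
            toRead = chunk if remaining is None else min(chunk, remaining)
            data = f.read(toRead)
            if not data:
                break
            h.update(data)
            if remaining is not None:
                remaining -= len(data)
    return h.hexdigest()
```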
The parser can be implemented in any language as long as it conforms to this interface.

All the parsers currently used are implemented in python, and there is already an infrastructure to simplify the implementation of such parsers in python, as detailed in the [NomadParsing wiki](http://nomad-dev.rz-berlin.mpg.de/wiki/NomadParsing).
Output:

    [
        "nomadparseout.json 1.0",
        { "defaultMetaInfo": "path/to/file or metainfo list" },
        {
            "mainFileSha": "",
            "collectiveSha": "",
            "nomaddata": pathToFileOr [...],
            "nomadinfo": pathToFileOr [...],
            "files": [...],
            "warnings": [...]
        }
    ]
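Assembling and writing that reply is straightforward; a sketch (the function name is hypothetical, and the defaultMetaInfo entry is omitted for brevity):

```python
import json
import sys

def write_parse_reply(mainFileSha, collectiveSha, nomaddata, nomadinfo,
                      files, warnings=(), outF=sys.stdout):
    """Emit a nomadparseout.json reply of the shape shown above
    (sketch; defaultMetaInfo handling is left out)."""
    reply = [
        "nomadparseout.json 1.0",
        {
            "mainFileSha": mainFileSha,
            "collectiveSha": collectiveSha,
            "nomaddata": nomaddata,
            "nomadinfo": nomadinfo,
            "files": list(files),
            "warnings": list(warnings),
        },
    ]
    json.dump(reply, outF)
    outF.write("\n")
```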
### Event based parser ###
A parser is expected to be, at the lowest level, a push parser that calls a series of callbacks (or, if you prefer, emits a series of parsing events).

The callbacks available to the parser are defined in eu.nomad_lab.parsing.ParserBackendExternal, and the corresponding parsing events are those defined in eu.nomad_lab.parsing.ParseEvents.

In python this is in parser_backend.py in python-common.

The idea is that you connect the backend in such a way that it writes to stdout (and take care of writing any error messages to stderr):
    backend = JsonParseEventsWriterBackend(metaInfoEnv, sys.stdout)
then you start with

    backend.startParsingSession(fileUri, filePath, { "name": "<parser_name>", "version": "<parser_version>" })
and probably you open some sections

    runId = backend.openSection("section_run")        # place for a whole run
    methodId = backend.openSection("section_method")  # description of the method used
    sysDescId = backend.openSection("section_system_description")  # description of the system
    spId = backend.openSection("section_single_point_evaluation")
then you add values

    backend.addValue("program_name", "bbb")
    backend.addValue("program_version", "xx")

They can be added, in any order, to any open section (run, single point, ...).
you might open further sections

    scfId = backend.openSection("section_scf_iteration")
and later close it with

    backend.closeSection("section_scf_iteration", scfId)
Currently you have to pass the id, but we might decide to make it superfluous when there is only one section of that type open (the most common case).
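To see the resulting event stream without the real backend, a small stand-in that just records the calls can be useful. This is purely illustrative: it mimics only the subset of the backend interface used above, not the actual JsonParseEventsWriterBackend.

```python
import itertools

class RecordingBackend(object):
    """Minimal stand-in for the parse-events backend: supports
    openSection/addValue/closeSection and records the events instead
    of writing real nomad output (illustrative sketch only)."""

    def __init__(self):
        self._gIndex = itertools.count()  # fresh id per opened section
        self.events = []

    def openSection(self, metaName):
        gIndex = next(self._gIndex)
        self.events.append(["openSection", metaName, gIndex])
        return gIndex

    def addValue(self, metaName, value):
        self.events.append(["addValue", metaName, value])

    def closeSection(self, metaName, gIndex):
        self.events.append(["closeSection", metaName, gIndex])

# usage mirroring the snippets above
backend = RecordingBackend()
runId = backend.openSection("section_run")
backend.addValue("program_name", "bbb")
scfId = backend.openSection("section_scf_iteration")
backend.closeSection("section_scf_iteration", scfId)
backend.closeSection("section_run", runId)
```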
### Python API ###

It is possible to define nested sections using regular expressions. This allows a relatively declarative definition of the parser.
It also allows different parsers to be compiled, reaching good performance.

simple_parser.py takes this approach; the fhi-aims parser, written in python, uses it.
The idea is to have *matchers* that consist of a regular expression, possibly some sub-*matchers* that match subsequent lines, and finally an optional closing regular expression.

These give a way to organize a group of regular expressions, specifying the sequence in which they are expected to match.

By default the actual parsing tries to match forward, or falls back to an upper level; this copes reasonably well with different verbosity settings, optional values, or extra debugging output.

There are flags to allow repetitions, or to match lines in any order.

Once a match is found, the name of the group is used to get the corresponding metadata, and the value in the group is assumed to correspond to it.
Limitations: algorithmic parsers (if you read x parse y, otherwise ...) have to be handled ad hoc.

Since the matchers are "compiled" to a finite state machine, the matching can be optimized, and debugging parsers can be generated.
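The matcher idea can be illustrated with a toy sketch using plain regular expressions. This is not the simple_parser.py API, just the concept: each matcher has a start regex with named groups and optional sub-matchers for later lines, and a matcher that never fires is skipped (the toy version of falling back to an upper level). The example output lines are hypothetical.

```python
import re

class Matcher(object):
    """Toy matcher: a start regex with named groups and optional
    sub-matchers tried on later lines (concept sketch only)."""

    def __init__(self, startReStr, subMatchers=()):
        self.startRe = re.compile(startReStr)
        self.subMatchers = list(subMatchers)

    def flatten(self):
        """Yield this matcher and all sub-matchers in document order."""
        yield self
        for sub in self.subMatchers:
            for m in sub.flatten():
                yield m

def parse_lines(lines, root):
    """Scan the lines once, advancing through the matchers in order;
    the named groups of every regex that fires are collected into
    one results dict."""
    results = {}
    matchers = list(root.flatten())
    mi = 0
    for line in lines:
        for j in range(mi, len(matchers)):
            hit = matchers[j].startRe.match(line)
            if hit:
                results.update(hit.groupdict())
                mi = j  # later lines match this or later matchers
                break
    return results

# hypothetical example loosely modeled on a code's text output
root = Matcher(r"\s*Program:\s*(?P<program_name>\S+)", subMatchers=[
    Matcher(r"\s*Version:\s*(?P<program_version>\S+)"),
    Matcher(r"\s*Total energy:\s*(?P<energy_total>[-+0-9.eEdD]+)"),
])
out = parse_lines([
    "Program: fhi-aims",
    "Version: 081213",
    "some chatter that matches nothing",
    "Total energy: -2045.1",
], root)
```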
### Xml Parsing ###