How to write a parser
Try to use the python-common simple_parser.py infrastructure. With it you just need to copy what is in the sample-parser repository and adapt it to your code. SimpleParser provides an infrastructure for parsing free-format outputs.
SimpleMatcher
The central object in a parser using this infrastructure is the SimpleMatcher. A SimpleMatcher matches some lines of a file and extracts values from them.
The simplest kind of matcher looks like this:
SimpleMatcher(r"\s*\|\s*Kinetic energy\s*:\s*(?P<electronic_kinetic_energy_scf__hartree>[-+0-9.eEdD]+) *Ha\s*(?P<fhi_aims_electronic_kinetic_energy_scf_eV>[-+0-9.eEdD]+) *eV")
This matcher uses a single regular expression (see the Python regular expressions documentation) to match a line. Online tools can be used to quickly verify regular expressions and to see what they match.
Note the following things:
- we use a raw string (starting with r"); in it \ does not need to be escaped, otherwise we would have to double every backslash
- we use \s as general space (it matches tab, space,...)
- we extract two expressions using named groups (?P<nameOfGroup>expressionToExtract)
- group names can have the units of the value they match given after two underscores (__hartree means that the value is in hartree).
- it is worthwhile to import SimpleMatcher as SM to keep the code concise; from now on this is assumed to be the case (see the import below)
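Typically the import at the top of the parser then reads (simple_parser.py lives under python-common/python/nomadcore, as listed near the end of this document):
from nomadcore.simple_parser import SimpleMatcher as SM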
The two extracted expressions are automatically assigned to the corresponding meta infos, namely electronic_kinetic_energy_scf and fhi_aims_electronic_kinetic_energy_scf_eV. If the value is a scalar, the correct type is taken from the meta information definition, the string is converted accordingly, and the value is passed on to the backend. electronic_kinetic_energy_scf, like all code-independent values, uses SI units; as the matched value is declared to be in hartree (with __hartree) it is automatically converted before being stored.
A matcher can also begin a group of related expressions, themselves defined through other matchers, for example in
SM(name = 'ProgramHeader',
   startReStr = r"\s*Invoking FHI-aims \.\.\.\s*",
   subMatchers = [
       SM(r" *Version *(?P<program_version>[0-9a-zA-Z_.]*)"),
       SM(r" *Compiled on *(?P<fhi_aims_program_compilation_date>[0-9/]+) at (?P<fhi_aims_program_compilation_time>[0-9:]+) *on host *(?P<program_compilation_host>[-a-zA-Z0-9._]+)"),
       SM(name = "nParallelTasks",
          startReStr = r"\s*Using\s*(?P<fhi_aims_number_of_tasks>[0-9]+)\s*parallel tasks\.",
          sections = ["fhi_aims_section_parallel_tasks"],
          subMatchers = [
              SM(name = 'parallelTasksAssignement',
                 startReStr = r"\s*Task\s*(?P<fhi_aims_parallel_task_nr>[0-9]+)\s*on host\s*(?P<fhi_aims_parallel_task_host>[-a-zA-Z0-9._]+)\s*reporting\.",
                 sections = ["fhi_aims_section_parallel_task_assignement"])
          ]),
       SM(name = 'controlInParser',
          startReStr = r"\s*Parsing control\.in *(?:\.\.\.|\(first pass over file, find array dimensions only\)\.)")
   ])
you can see several things:
- you can give a name to the matcher for debugging purposes
- you can nest SimpleMatchers by passing a list of them in the subMatchers argument. If the startReStr regular expression matches then parsing continues in the subMatchers.
- subMatchers are by default executed sequentially and are optional, so after "Invoking FHI-aims ..." the parser might skip ahead directly to "Compiled on ...". From there the parser cannot "go back" and match "Version ...", but it can match nParallelTasks, as the subMatchers are sequential.
- After having matched the "Compiled on ..." part the parser might skip ahead to the nParallelTasks matcher or to controlInParser, but not to parallelTasksAssignement: the startReStr of a matcher has to match before its subMatchers are considered.
- the nParallelTasks matcher opens the section fhi_aims_section_parallel_tasks. A section can be seen as a context or a dictionary that starts before the match and is closed when the matcher and all its subMatchers are exited. Thus information extracted by nParallelTasks (fhi_aims_number_of_tasks) or its subMatchers (for example fhi_aims_parallel_task_nr) whose meta information says that they are in fhi_aims_section_parallel_tasks (both in this case) gets added to this newly created section (again, think dictionary).
To recapitulate: a parser knows its current position, i.e. which (sub)matcher just matched, and from it knows which matchers can come next. By default this means going inside the current matcher to one of its direct subMatchers, then going forward at the same level and finally going forward in the superMatchers until the root one is reached.
This matching strategy copes well with print-level-dependent output (matchers are optional, so if they are absent the parser automatically skips ahead) and with groups of lines that are not explicitly closed (matching something after the group implicitly closes it), but it does not cover all use cases. For this reason a SimpleMatcher has various arguments to tweak the matching strategy.
- subFlags can be set to SubFlags.Unordered instead of the default Sequenced. This makes the parser consider the direct subMatchers in any order, which is for example helpful for input files.
- if repeats is true this matcher is expected to repeat (for example a matcher starting an scf iteration; see the sketch after this list)
- if endReStr is provided, the matcher will finish when this regular expression is encountered. Notice that this will not finish a SimpleMatcher with repeats=True, but will only tell when a new repetition should start. If you want to completely stop a repeating SimpleMatcher when a certain regexp is encountered, you should put it inside another SimpleMatcher with a certain endReStr and forward=True.
- a matcher with weak = True is not attempted unless it is the expected next possible match, i.e. it does not "steal" the position of the parser. This is useful for patterns that are not distinctive and could otherwise steal the position incorrectly. E.g. the weak flag can be given to a parent matcher if it shares a regular expression with some of its subMatchers; this way the subMatchers have priority over the parent.
- if required is true that matcher is expected to match if the enclosing matcher does. This allows errors in parsing to be detected early. Currently not enforced.
- if floating is true then this matcher does not steal the position of the parser: it is matched, but the position of the parser then reverts to its old place. This is useful for example for low-level debugging/error messages that can appear at any time but are not supposed to "disturb" the flow of the other output. They are valid from the point they are read until the enclosing matcher is exited.
- forwardMatch does not consume the input but forwards it to the adHoc callback and the subMatchers. This is useful for adHoc parsers, or to have a group that can start with x|y and then have subMatchers for x and y in it. If you use forwardMatch you have to make sure that something matches or consumes the value, otherwise, in conjunction with repeats, it might loop forever.
- adHoc lets you define a function that takes over parsing. This function is called with a single argument, the parser. parser.fIn is the PushbackLineFile object from which you can readline() or pushbackLine(lineNotMatched). parser.lastMatch is a dictionary with all groups from startReStr after type and unit conversion, which can be used in simple adHoc functions (e.g. for the definition of custom units, see below). parser.backend is the backend where you can emit the things you parse. With it you have full control over the parsing. If you return None the parser continues normally, but you can also return 2*targetMatcher.index (+ 1 if you are at the end of the matcher) to continue from any other matcher, or -1 to end the parsing.
- fixedStartValues can be supplied to set default/fallback values in case optional (by the ? quantifier) groups in startReStr are not matched. It is a dictionary of (metaInfoName: value); unit and type conversion are not applied. It can also be used to emit additional, fixed values not present in startReStr.
- fixedEndValues same as fixedStartValues, but applied to endReStr
- The onClose argument is a dictionary linking sections into callback functions that are called when a section opened by this SimpleMatcher closes. These callbacks are very similar to the global onClose functions that can be defined on the superContext object. They have the same call syntax: onClose(backend, gIndex, section), but are specific to this SimpleMatcher and are called before the global counterparts. You can use this e.g. to save the gIndex of section_single_configuration_calculation for "frame_sequence_local_frames_ref" when a "section_single_configuration_calculation" is closed within a SimpleMatcher that parses MD or optimization.
- The onOpen argument is very similar to onClose, just called when a section in this SimpleMatcher is opened instead of closed.
- With startReAction you can provide a callback function that will get called when the startReStr is matched. The callback signature is startReAction(backend, groups). This callback can directly access any matched groups from the regex with argument groups. You can use this to e.g. format a line with datetime information into a standard form and push it directly to a corresponding metainfo, or do other more advanced tasks that are not possible e.g. with fixedStartValues.
- Often the parsing is complex enough that you want to create an object to keep track of it and to cache various flags, open files, etc. that manage the parsing being performed. You can store this object as the superContext of the parser.
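To illustrate a few of these arguments together, here is a minimal sketch of a repeating matcher for scf iterations (the regular expressions and the output lines they assume are invented; energy_total_scf_iteration and section_scf_iteration are common meta infos):
scfIteration = SM(
    name = 'scfIteration',
    # hypothetical output line: "Begin self-consistency iteration   3"
    startReStr = r"\s*Begin self-consistency iteration\s+[0-9]+",
    sections = ["section_scf_iteration"],
    repeats = True,
    subMatchers = [
        SM(r"\s*Total energy\s*:\s*(?P<energy_total_scf_iteration__hartree>[-+0-9.eEdD]+)\s*Ha")
    ],
    # with repeats = True this ends the current repetition rather than the whole matcher
    endReStr = r"\s*End self-consistency iteration\s*")
Such a matcher would typically be nested as a subMatcher of the matcher that opens section_single_configuration_calculation.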
Nomad Meta Info Primer
The meta information has a central role in the parsing. It represents the conceptual model of the extracted data, every piece of information extracted by the parsers is described with a meta information.
Nomad Meta Info is described in detail in NomadMetaInfo; here we give a quick introduction to the main ideas.
A piece of data is described with a meta information with kindStr = "type_document_content" (the default, so it can be left out). It must have:
- a dtypeStr string that describes the type: floats, integers, booleans, strings, arrays of bytes and JSON dictionaries are supported.
- a shape list that gives the dimensions of multidimensional arrays (it can contain strings to represent symbolic sizes) and is the empty list for scalars.
That is, at the lowest level all data can only be a (multidimensional) array of simple types.
These values can be grouped together through sections (metadata with kindStr = "type_section"). A section can be seen as a dictionary that groups together some values.
When parsing you can open a section (create a new dictionary) and add values to it. For example section_scf_iteration groups together all values of an scf iteration. The next scf iteration will open again the section (create a new dictionary) and add all the values of the new iteration to it.
Metadata of the same type can be grouped together using meta info with kindStr = "type_abstract_document_content". This kind of metadata has no values associated directly with it (and thus is not directly relevant for parsing). For example all energies inherit from the abstract type energy.
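As a rough sketch, an entry in a .nomadmetainfo.json file describing a scalar value could look like the following (the name and description are invented; check existing entries in nomad-meta-info for the exact conventions):
{
  "name": "mycode_some_scf_energy_eV",
  "description": "A code specific energy printed during the scf cycle, in eV.",
  "superNames": ["section_scf_iteration"],
  "kindStr": "type_document_content",
  "dtypeStr": "f",
  "shape": []
}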
The meta info is defined in the nomad-meta-info repository, but you can also explore it with the interactive visualization or browse it.
Parsing and meta info
The simple parser infrastructure is closely connected to the meta info. Normally you will write a code-specific meta info file in the nomad-meta-info repository that imports the common meta info as a dependency. In it you define all code-specific meta infos (all their names should start with the code name as prefix).
As said, all named groups should correspond to a meta info (with kindStr = "type_document_content"), and the declared type is used to convert the string value extracted by the regular expression. This works for all scalar values. Sections can be used to organize the values; you give the sections to open with the sections argument. If an unknown section or group name is used, the parser will complain and also print a sample meta info definition that you should add to the meta info file.
The code-specific values do not need to be specified in detail, but you should try to connect them to the common meta infos they are related to. For example, assigning "settings_XC_functional" to a code-specific setting allows the analysis part to answer queries like "are all parameters that influence the XC functional equal in these two calculations?" or "show me the differences in settings classified by type". This kind of query is needed when building data sets that one wants to use for interpolation. Currently there are no methods to answer these queries, and one simply has to build the data sets by hand (often with scripts) to be sure that the calculations are really consistent.
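In the same sketch format as above, a code-specific setting could declare this connection through its superNames (the concrete name is again invented, and the exact superNames should be checked against common.nomadmetainfo.json):
{
  "name": "mycode_xc_flag",
  "description": "The xc flag as given in the input of mycode.",
  "superNames": ["settings_XC_functional", "section_method"],
  "kindStr": "type_document_content",
  "dtypeStr": "C",
  "shape": []
}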
What to parse
You should try to parse all input parameters and all lines of the output.
On the one hand, as explained in the previous section, this is required to fulfill the NOMAD idea of reusing calculations (and to build a tool useful for many kinds of analysis). On the other hand, and just as important, it is the only way to detect when a parser fails and needs improvement, and thus to keep the parser up to date with respect to the data that is in the repository.
Clearly one should first prioritize the extraction of the most important/relevant parts; automatic tools will then help find the missing parts in a systematic way (see the issues "write analyzer of parts ignored by a parser" and "write parser tester help").
Technical excursus: finite automata
The matching done by the generated parser is efficient because at each point it collects all possible alternatives and generates a single regular expression. This can be evaluated efficiently without backtracking using finite automata, which put the possible matches in the state so that they can always advance. For example, to match "ab|ac" a finite automaton with the following states is generated (a rough code illustration follows the table):
| NoMatch | MatchA | MatchAB | MatchAC |
|---|---|---|---|
| a -> MatchA | b -> MatchAB | finished | finished |
| _ -> NoMatch | c -> MatchAC | | |
| | _ -> NoMatch | | |
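As a very rough illustration of the idea (plain Python re, not the actual simple_parser implementation): the alternatives are combined into one regular expression with named groups, and the group that matched tells the parser where it is:
import re

# two alternatives from the example above, combined into a single pattern
combined = re.compile(r"(?P<version> *Version +\S+)|(?P<tasks>\s*Using\s+[0-9]+\s+parallel tasks\.)")
m = combined.match("  Using 4 parallel tasks.")
if m:
    print(m.lastgroup)   # -> "tasks": the alternative that matched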
This kind of automaton can be created to match or search patterns. The re2 library is guaranteed to do this, and there is an open issue about using it; still, Python's own implementation is not too bad.
Thus you are encouraged to use the SimpleParser not only because it frees you from some tedious work, makes parsers more uniform, and allows optimizations, but also because it will likely be more efficient than what you would do by hand.
Unit conversion
The code-independent meta infos should use SI units; codes typically do not. As shown in the examples, matchers can automatically convert the value matched by a group if the unit of the value read is added to the meta info name with two underscores. A group name g__hartree means that the value read is in hartree and should be stored in the meta info named g. Group names cannot contain any special characters, so to encode more complex units you have to rewrite the expression using just * and ^: for example m/s becomes m*s^-1; then you remove the ^ and replace * and - with _, so finally m/s becomes m_s_1.
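For example, a velocity printed in m/s would be matched with a group name that carries the encoded unit as a suffix (the metainfo name and output line here are invented for illustration):
# m/s -> m*s^-1 -> suffix __m_s_1; the declared unit is used for the conversion to SI
SM(r"\s*drift velocity\s*:\s*(?P<mycode_drift_velocity__m_s_1>[-+0-9.eEdD]+)\s*m/s")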
You can find the units that are defined in python-common/python/nomadcore/unit_conversions/units.txt and some constants in python-common/python/nomadcore/unit_conversions/constants.txt
You might extend these lists if you find that important units are missing. Unit names should never use _; use camel case.
Sometimes a code uses a unit that is not fixed but is defined in the input or some other part of the output. In this case you can use
nomadcore.unit_conversion.unit_conversion.register_userdefined_quantity(quantity, units, value=1)
from python-common. With it you can (for example in the adHoc callback of a SimpleMatcher) define the unit usrMyCodeLength (user-defined units should always start with "usr" and use just letters) to be angstrom with
register_userdefined_quantity("usrMyCodeLength", "angstrom")
this call needs to be done before any use of that unit. The unit can then be used just like all others: add __usrMyCodeLength to the group name.
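If the conversion factor itself has to be read from the output, a minimal sketch of an adHoc callback could look like this (the output line it assumes is invented):
def define_length_unit(parser):
    # hypothetical line forwarded by forwardMatch: "length unit:  0.529177 angstrom"
    line = parser.fIn.readline()
    scale = float(line.split()[-2])
    register_userdefined_quantity("usrMyCodeLength", "angstrom", scale)

# used e.g. as SM(r"\s*length unit:", adHoc = define_length_unit, forwardMatch = True)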
Caching
Values parsed can be cached by the active backend. What exactly is cached, forwarded or ignored can be controlled with nomadcore.active_backend.CachingLevel:
- CachingLevel.Forward: for values, forwards them immediately to the super backend that writes them out, without caching them; for sections, emits section open and close.
- CachingLevel.Cache: for a concrete value, caches the value in its section and does not write it out; for a section, means that the section, when closed, is stored in its super section. Open and close are not forwarded (this will give problems if values in the section are forwarded).
- CachingLevel.ForwardAndCache: forwards and caches the value or section.
- CachingLevel.Ignore: completely ignores the value or the section opening and closing.
These values can be set for a single value/section with the dictionary cachingLevelForMetaName, which maps each meta info name to its caching level. The default for concrete values is ForwardAndCache and can be changed with defaultDataCachingLevel; for sections the default is Forward and can be changed with defaultSectionCachingLevel.
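For example (CachingLevel imported from the module mentioned above; the meta info names are the ones from the FHI-aims example earlier):
cachingLevelForMetaName = {
    # keep the value cached in its section so it is available in onClose triggers
    "fhi_aims_number_of_tasks": CachingLevel.ForwardAndCache,
    # do not emit this purely informational value at all
    "fhi_aims_parallel_task_host": CachingLevel.Ignore,
}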
Logging
By default the parsers will stream the parsed information into the standard output stream (sys.stdout). The receiving end will expect the stdout to contain only this information, and nothing else. For this reason you shouldn't print anything to sys.stdout directly. To print debug messages you should use the logging package that will redirect the messages to the correct output. Here is a quick example on how to use the logging package:
import logging
import nomadcore.ActivateLogging # This will activate the "nomad" logging environment
logger = logging.getLogger("nomad")
logger.warning("This is a warning message.")
logger.error("This is an error message.")
Triggers
When a section is closed a function can be called with the backend, gIndex and the section object (that might have cached values). This is useful to perform transformations on the data parsed before emitting it.
The simplest way to achieve this is to define methods named onClose_ followed by the section name in the object that you pass as superContext. For example
def onClose_section_scf_iteration(self, backend, gIndex, section):
    logging.getLogger("nomadcore.parsing").info("YYYY bla gIndex %d %s", gIndex, section.simpleValues)
defines a trigger called every time an scf iteration section is closed.
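Putting this together, a superContext is usually a small class; a minimal sketch could look like the following (how it is passed to the parser's main function is shown in the sample-parser repository; number_of_scf_iterations is a common meta info and addValue is the backend call for emitting a scalar):
class MyCodeParserContext(object):
    """Keeps state across the parsing run and hosts the onClose_ triggers."""

    def __init__(self):
        self.scfIterations = 0

    def onClose_section_scf_iteration(self, backend, gIndex, section):
        # cached values of the closed section are available via section.simpleValues
        self.scfIterations += 1

    def onClose_section_single_configuration_calculation(self, backend, gIndex, section):
        backend.addValue("number_of_scf_iterations", self.scfIterations)
        self.scfIterations = 0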
Controlling what is Parsed
By default the SimpleMatchers will go through the entire file and parse everything that is declared in the 'parsing tree', i.e. the hierarchy of nested SimpleMatchers. Sometimes it is however necessary to parse only a part of the file, for example if the parser is later extended with additional metainfos and all the calculations have to be reparsed. The testing process can also be sped up by parsing only the areas that are relevant to the test.
To allow this, the parser can be given a list of metainfos that should be parsed, the rest being ignored. Currently this list has to be given as a command line argument using the specialization dictionary:
python SimpleFhiParser.py --specialize Au2.relax_light.out <<EOF
{"type":"nomad_parser_specialization_1_0","metaInfoToKeep":"program_version"]}
EOF
By default all the subMatchers of the matcher that contains the wanted metainfo will also be parsed. This allows any cached values to be used. To further optimize the process you can declare the dependencies between metainfos explicitly by using the 'dependencies' attribute of a SimpleMatcher, stating the dependencies between metainfos so that only the minimum amount is parsed.
Debugging
Currently the main way to debug is to use the --verbose flag and look at the (detailed) output, add logging statements, or place a nomadcore.utils.goInteractive(locals()) call in strategic places.
When running the Scala side you can set the environment variable NOMAD_CODE_ROOT to the nomad-lab-base directory to use the Python from there instead of the one stored in the Scala resources. Instead of doing
sbt
tool/run parse --main-file-path <absolute path to output file> --json
one can build the assembly once with
sbt tool/assembly
and then run
export NOMAD_CODE_ROOT=<nomad-lab-base>
java -Djava.library.path=<nomad-lab-base/hdf5/hdf5-support/lib> -jar <jar file of tool/assembly> parse --main-file-path <absolute path to output file> --json
or similar to run the updated Python.
There are several issues related to improving the testing/debugging situation, in particular:
- write analyzer of parts ignored by a parser
- write name consistency test for parser
- write parser debugging help
- write parser tester help
Feel free to grab one of them and work on it.
NOMAD Archive
The parsers should populate the NOMAD archive with files that contain the data extracted by the parsers. These files are stored in the /parsed directory.
The real test of a parser is how it behaves on all the data. This means monitoring the behavior of a parser on the whole archive and then debugging the problems that arise.
The basics of the NOMAD Archive and how to quickly run a parser on a specific file are described in the NOMAD Archive for parser developers wiki.
Practical hacking
Adding a new parser to the project:
- Create a new repository for the parser under nomad-lab group in gitlab
- Assign a test runner for the new repository and make sure that deploy keys are set up for the runner.
- Add the new repository as a submodule to nomad-lab-base repository
- Add the parser definitions to the Scala environment in the file build.sbt
- Make a metainfo file for the parser in the nomad-meta-info repository. Add this filename also to all.nomadmetainfo.json, and add the metainfo file name to the Scala environment in KnownMetaInfoEnvs.scala
Quick recapitulation
To sum up you should be able to write the parsers by:
- defining regular expressions
- defining metadata, possibly code specific, but really try to use the code independent parts as much as possible, or at least connect the code specific parts to the code independent parts through superNames.
- cache values you parsed, and write triggers that use the cached values for more complex transformations
Git
- There is a separate repository for each code, this will allow one to use branches in that repository to track different versions of the code (if the code is so different that writing a single parser is cumbersome)
- If there is no repository for the code you want to work on, you can ask to have one created for you with a code skeleton already in place.
Shell commands
Basically the first time you do
# set up a python environment
virtualenv -p python3 labEnv3
# activate it
. labEnv3/bin/activate
# get the whole nomad lab code
git clone --recursive git@gitlab.mpcdf.mpg.de:nomad-lab/nomad-lab-base.git
# make sure that you use the master branch of the modules you will update
# you can do that on a per module basis (cd <submodule>; git checkout master)
# here we do it globally
git submodule foreach git checkout master
# install the python modules required
cat nomad-lab-base/python-common/requirements.txt | xargs -n1 pip install --upgrade
and then every time
# activate the python environment if needed
. labEnv3/bin/activate
test a parser
# go in the parser directory
cd nomad-lab-base/parsers/fhi-aims/parser/parser-fhi-aims
python SimpleFhiParser.py ../../test/examples/Au2.final_scaled_zora.out
cd -
if you need to update one module
# go inside the repo
cd nomad-lab-base
# update to the latest version (you might need to commit or stash your changes before)
git pull --rebase
git submodule foreach git pull --rebase
work on your parser, i.e. mainly on these modules/paths
parsers/<yourCode>
parser/parser-<yourCode>/parser_<yourCode>.py
...
nomad-meta-info
meta_info/nomad_meta_info/<yourCode>.nomadmetainfo.json
meta_info/nomad_meta_info/common.nomadmetainfo.json
and maybe check (or fix the infrastructure) in
python-common
python/nomadcore/simple_parser.py
python/nomadcore/parser_backend.py
python/nomadcore/caching_parser.py
when you are finished check in that module:
git commit
git pull --rebase
git push
(remember you must first be in the master branch, see git status)
if you want to release a new version of the parser (use the major.minor.patch format, for example 1.2.4)
cd parsers/<myParser>
git tag 1.2.4
git push --tags
happy coding...